automated-trading-systems

US_MacroRelease_HPT_Analysis

MARKET EFFICIENCY AND FIRST-MOVER ADVANTAGE AFTER U.S. MACROECONOMIC RELEASES AT EUREX AND CME

Table of Contents

  1. Executive Summary
  2. Directory Overview
  3. Global Data Structures & Architectural Definitions
  4. 1: GIT_Distribution_difference_CME.ipynb - Statistical Comparison on CME
  5. 2: GIT_Distribution_difference_Eurex.ipynb - Statistical Comparison on EUREX
  6. 3: GIT_pickle_analysis.ipynb - Raw Event Memory & Microsecond Latency Visualization
  7. 4: GIT_Final_plot_creator.ipynb - Unified Visual Aggregations & Cross-Exchange Comparisons
  8. 5: GIT_Price_change_analysis_EUREX_volume.ipynb - Eurex Latency Notional Analytics
  9. 6: GIT_PriceFormation_PnL_Analytics.ipynb - PnL Correlations & Predictive Modeling
  10. Prerequisites and Environment Setup
  11. Conclusion and Execution Workflows

Executive Summary

The directory contains six interconnected Jupyter notebooks dedicated to analyzing ultra-low-latency High-Frequency Trading (HFT) reactions to major U.S. macroeconomic announcements (such as NFP, FOMC, and ISM releases).

Covering both the Chicago Mercantile Exchange (CME) and Eurex, this suite of notebooks processes cached .pickle memory files and raw .csv trade histories to conduct statistical regime testing, generate aggregated visual analyses, study cumulative Profit and Loss (P&L) dynamics, track notional traded volume at nanosecond precision, and perform predictive modeling with Random Forest models.

This extensive documentation provides a line-by-line understanding, mathematical breakdown, and architectural visualization of each notebook, helping researchers, quantitative analysts, and quantitative developers understand how the data structures are grouped, how the nonparametric algorithms are applied, and what metrics the graphical visuals aim to highlight.


Directory Overview

The repository consists of analytical Jupyter notebooks (.ipynb), processed tabular data files (.csv), custom caching structures (.pkl), and a dedicated Event_pickle_files subdirectory containing raw, granular event-driven memory dumps.

📓 Jupyter Notebooks

📊 Data Files (Root Directory)

📁 Event_pickle_files/ Directory

This critical subdirectory acts as the raw persistence layer, housing .pickle files that serialize large Python objects (EventReactions mappings) representing ultra-high-frequency trades around macroeconomic events, timestamped at nanosecond resolution.

📄 Overleaf_files/ Directory

This subdirectory contains the LaTeX source files and associated assets for generating the academic paper summarizing the research findings.


Global Data Structures & Architectural Definitions

Before detailing individual notebooks, it is essential to understand how trades are mapped in memory. Across multiple notebooks, the code repeatedly defines @dataclass objects to handle the large volume of tick-level order book records imported from pickle artifacts.

Eurex Trade Dataclass

The standard Eurex Trade incorporates highly specific European market structures:

CME Trade Dataclass

The standard Chicago Mercantile Exchange (CME) equivalent incorporates specific mappings for US structures:

Both environments wrap their trades in an encompassing EventReactions dataclass, which holds the timestamp of the macroeconomic news, a standard label tag (such as “NFP” or “FOMC”), and arrays of preactions (trades before the event) and reactions (trades after the event).
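The wrapper described above can be sketched roughly as follows. This is an illustrative assumption, not the repository's actual definition: the `Trade` fields and the `latencies_ns` helper are hypothetical, and only the `EventReactions` fields named in the text (timestamp, label, preactions, reactions) are grounded in the source.

```python
from dataclasses import dataclass, field

@dataclass
class Trade:
    # Hypothetical minimal trade record; real dataclasses carry many more
    # exchange-specific fields (assumed names).
    timestamp_ns: int   # exchange timestamp at nanosecond resolution
    price: float
    notional: float

@dataclass
class EventReactions:
    # Fields below follow the description in the text.
    event_time_ns: int                                       # timestamp of the news release
    label: str                                               # e.g. "NFP" or "FOMC"
    preactions: list[Trade] = field(default_factory=list)    # trades before the event
    reactions: list[Trade] = field(default_factory=list)     # trades after the event

    def latencies_ns(self) -> list[int]:
        """Nanosecond delay of each post-event trade relative to the release (assumed helper)."""
        return [t.timestamp_ns - self.event_time_ns for t in self.reactions]
```

With this shape, a notebook can unpickle a mapping of events and immediately derive per-trade latencies for plotting or binning.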


1: GIT_Distribution_difference_CME.ipynb - Statistical Comparison on CME

Overview

This notebook statistically tests whether the distribution of trades executed during the first $100\,ms$ after a macro release on CME differs fundamentally from the quiet period preceding it, and whether it differs significantly across asset classes (Equity vs. Fixed Income).

Components and Logic

  1. Non-parametric Statistical Suite: The notebook avoids standard-normal assumptions, importing the scipy.stats module for tests that are robust to heavy tails:
    • Kolmogorov-Smirnov (KS) Test (stats.ks_2samp): Checks if two independent empirical samples are governed by the exact same continuous distribution. It serves as the primary gauge for systemic regime changes (stochastic disruptions).
    • Two-Sided Mann-Whitney U Test (stats.mannwhitneyu): Tests whether the two populations of trade values are identically distributed or stochastically shifted.
    • One-Sided Mann-Whitney U Test: Tests explicit directionality (whether CME reaction volumes are strictly greater than pre-action volumes).
    • Permutation t-Test: A nonparametric method for testing if two groups differ significantly, without assuming a normal distribution. It works by shuffling data labels thousands of times to create a null distribution, calculating a t-statistic for each, and determining the p-value by comparing the original observed statistic to this distribution.
    • Bootstrap Mean Comparison: A non-parametric resampling technique used to determine if the difference between two group means is statistically significant without assuming a normal distribution. By repeatedly sampling with replacement from original data, it builds a distribution of mean differences to estimate confidence intervals and standard errors.
    • Wilcoxon Signed-Rank Test (stats.wilcoxon): Measures if the calculated Markout profits post-event are statistically greater than $0$ (signifying profitable execution against toxic order flow).
  2. Computational Validation Suite: Implements randomized bootstrapping (bootstrap_mean_comparison) over $5000$ resamples of the post-event and pre-event samples to estimate whether the observed mean difference exceeds what resampling noise alone would produce, plus a permutation t-test running the same number of iterations as an independent verification.

  3. Comparison Execution Modes: The final loop in the notebook reads the preprocessed CSV tables (CME_processed_individual_data.csv) and runs the statistical gauntlet above across cross-comparisons:
    • CME vs Eurex cross-exchange structural differences across the 30-100 millisecond window.
    • CME Equities vs Non-Equities (Fixed Income Treasury Bonds), measuring whether fast participants target specific asset classes first (comparing ranges of $0\,\mu s$ to $200\,\mu s$, $200\,\mu s$ to $30\,ms$, and $30\,ms$ to $100\,ms$).

All test statistics and generated p-values are continuously appended to a table named Distributional_difference.csv.
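A minimal sketch of the statistical gauntlet described above, assuming two 1-D samples of trade values. The function name `conduct_all_tests` appears in the source; its internals here are assumptions, and the Wilcoxon step is omitted since it applies to Markout PnL rather than pre/post comparisons:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def conduct_all_tests(pre, post, n_resamples=5000):
    """Run the nonparametric suite on pre- and post-event samples (sketch)."""
    pre, post = np.asarray(pre, float), np.asarray(post, float)
    results = {
        "ks": stats.ks_2samp(pre, post),                          # same distribution?
        "mwu_two_sided": stats.mannwhitneyu(post, pre),           # shifted at all?
        "mwu_greater": stats.mannwhitneyu(post, pre, alternative="greater"),  # post > pre?
    }
    # Permutation t-test: shuffle the pooled labels to build a null t-distribution.
    obs_t, _ = stats.ttest_ind(post, pre, equal_var=False)
    pooled = np.concatenate([pre, post])
    null_t = np.empty(n_resamples)
    for i in range(n_resamples):
        perm = rng.permutation(pooled)
        null_t[i], _ = stats.ttest_ind(perm[:len(post)], perm[len(post):], equal_var=False)
    results["perm_p"] = float(np.mean(np.abs(null_t) >= abs(obs_t)))
    # Bootstrap mean comparison: resample each group with replacement,
    # building a distribution of mean differences.
    diffs = np.array([rng.choice(post, len(post)).mean() - rng.choice(pre, len(pre)).mean()
                      for _ in range(n_resamples)])
    results["boot_ci"] = np.percentile(diffs, [2.5, 97.5])        # 95% CI of mean diff
    return results
```

If the bootstrap confidence interval excludes zero and the permutation p-value is small, the regime shift is unlikely to be resampling noise.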


2: GIT_Distribution_difference_Eurex.ipynb - Statistical Comparison on EUREX

Overview

Functionally identical in mathematical premise to the CME variant, this notebook applies the same statistical framework to European data, using .pickle caches that map the fractional reaction periods.

Components and Logic

  1. Granular Time Window Shifting: While the CME notebook focuses on Equities vs. Bonds, the Eurex logic dives into sequential time shifts using explicit files representing precise windows:
    • economic_event_reactions_100ms.pickle (the baseline window).
    • economic_event_reactions_next_1.pickle vs economic_event_reactions_prev_1.pickle (analyzing exactly $\pm 1$ day).
    • .pickle objects whose filenames end in increments of $0.0006944444…$, i.e. $1/1440$: $\pm 1$ minute expressed as a fraction of a day, matching Pandas' fractional-day offsets.
  2. Bulk Evaluation Pipeline (bulk_compute): The script parses the nested events for independent US macro releases such as ISM and NFP, accumulates per-trade PnLs (MarkoutPnl_10s) and Notional values across thousands of reactions, and pipes them through the conduct_all_tests() engine defined identically to the CME script.
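The accumulation step can be sketched as below. `bulk_compute` and the `MarkoutPnl_10s` / `Notional` attribute names come from the text; the loop structure, label filter, and mapping layout are assumptions. The `1/1440` constant is the minute-as-day fraction that appears in the pickle filenames:

```python
import pickle

# 1 minute expressed as a fraction of a day = 0.0006944444..., the suffix on the pickles.
MINUTE_AS_DAY_FRACTION = 1 / (24 * 60)

def bulk_compute(pickle_path, labels=("ISM", "NFP")):
    """Accumulate per-trade Markout PnL and notional across all reactions
    whose event label matches one of the selected macro releases (sketch)."""
    with open(pickle_path, "rb") as fh:
        event_reactions = pickle.load(fh)      # assumed: mapping of event keys -> EventReactions
    pnls, notionals = [], []
    for event in event_reactions.values():
        if not any(tag in event.label for tag in labels):
            continue                           # skip non-targeted announcements
        for trade in event.reactions:
            pnls.append(trade.MarkoutPnl_10s)
            notionals.append(trade.Notional)
    return pnls, notionals
```

The resulting flat arrays are what get piped into the statistical engine shared with the CME notebook.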

3: GIT_pickle_analysis.ipynb - Raw Event Memory & Microsecond Latency Visualization

Overview

The core exploratory visualizer for the raw memory dumps, this notebook deserializes the 100ms.pickle dictionaries and generates dense, layered scatter plots mapping how many nanoseconds it took the matching engines to record trades following external news publications.

Components and Logic

  1. Announcement Color Mapping: The setup relies on manual Matplotlib Line2D and PathEffects parameters assigning a fixed color to each macroeconomic announcement type:
    • ISM MANUFACTURING = Blue
    • ISM SERVICES = Light Blue
    • NFP (Non-Farm Payroll) = Green
    • FOMC = Orange
  2. CME Array Visualization (plot_cme):
    • The plot is constructed on a dual $1 \times 2$ vertical grid mapping $t_{gateway} \longrightarrow$ event versus event $\longrightarrow t_{gateway}$.
    • The Y-axis uses a logarithmic (semilogy) scale, stretching from the $100\,ms$ window boundary down to single-digit nanoseconds.
    • Grey alpha=0.25 shaded bands demarcate latency-reducing engine upgrades implemented historically at CME (visible as regime changes around mid-2021 and 2024).
  3. Eurex Array Visualization (plot_Eurex):
    • Generates mirroring twin-charts specifically for Eurex gateways.
    • Uses bisect_left to locate year boundaries (2020 through 2025) along the X-axis.
    • Captures extreme boundary events by appending a stylized arrow pointing to the single fastest tick recorded in the set (down to $\approx 3$ nanoseconds).
  4. Product PnL Splines:
    • Includes a cumulative trailing Markout plot of cumulative Markout PnL versus latency ($\mu s$), iterating dynamically over every product subset within the .pickle framework.
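A compact sketch of the scatter construction described above. The color mapping follows the text; the function name, synthetic data, and axis details are assumptions:

```python
import matplotlib
matplotlib.use("Agg")             # headless backend so the sketch runs in scripts
import matplotlib.pyplot as plt
import numpy as np

# Colors per announcement type, as listed in the text.
EVENT_COLORS = {"ISM MANUFACTURING": "blue", "ISM SERVICES": "lightblue",
                "NFP": "green", "FOMC": "orange"}

def plot_latency_scatter(dates, latencies_ns, labels, upgrades=()):
    """Scatter trade latencies over time on a log Y-axis (hypothetical helper)."""
    fig, ax = plt.subplots()
    for d, lat, lab in zip(dates, latencies_ns, labels):
        ax.scatter(d, lat, s=8, color=EVENT_COLORS[lab])
    ax.set_yscale("log")          # semilogy: milliseconds down to single-digit ns
    for start, end in upgrades:   # grey bands marking historical engine upgrades
        ax.axvspan(start, end, color="grey", alpha=0.25)
    # Annotate the single fastest tick in the set with an arrow.
    i = int(np.argmin(latencies_ns))
    ax.annotate("fastest tick", xy=(dates[i], latencies_ns[i]),
                xytext=(dates[i], latencies_ns[i] * 100),
                arrowprops=dict(arrowstyle="->"))
    ax.set_ylabel("latency (ns)")
    return fig, ax
```

The log Y-axis is what lets a single panel span five orders of magnitude of latency without flattening the nanosecond tail.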

4: GIT_Final_plot_creator.ipynb - Unified Visual Aggregations & Cross-Exchange Comparisons

Overview

This notebook acts as the centralized report generator, combining trades from both Eurex and CME simultaneously and processing them through synchronized binning functions to build comprehensive publication-ready multi-axis plots.

Components and Logic

  1. Precision Time Binning (bin_trades_CME / bin_trades_EUREX): Order execution counts and financial calculations are forced into tight localized segments:
    • Uniform Microsecond Bins: Trades are grouped into spans of exactly $10\,\mu s$ ($10{,}000$ nanoseconds), running out to $100$ milliseconds.
    • Logarithmic Time Bins: Decade-spaced nanosecond boundaries [0, 1000, 10000, 100000, 1000000, 10000000, 100000000] (i.e. up to $100\,ms$) mapping the raw processing-delay curve.
  2. Core Plot Architecture (count_reactions):
    • Dynamically charts the summation of reactions plotted against Latency intervals (0 to 100 ms) using deep gray plots punctuated with black markers.
    • Displays Subplot (A) specific to CME reaction counts, directly merged with Subplot (B) handling Eurex counts.
    • A log-scale Y-axis captures the massive early burst of trades, which reaches the 10-15k tick range within fractions of a millisecond.
  3. Cumulative Flow Plots (plot_cumulative_PnL):
    • Generates four cohesive charts mapping total monetary Notional velocity and cumulative $10s$ Markout PnL, computed as sequential sums via np.cumsum(array).
    • Scales totals for legibility, using (K) modifiers for PnL ($10^3$) and (M) modifiers for Notional amounts ($10^6$).
  4. Trade Posture Positioning Layout:
    • Identifies the explicit order ranking (position 1st, 2nd, …, 50th) of executions triggered immediately upon event release for all assets traded at Eurex and CME, yielding a distribution of P&L and Notional at each position.
    • Uses ax.fill_between to shade the 5th-95th percentile bounds around the mean, tracking how toxic the initial positions are compared with delayed trailing orders on both exchanges.
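The dual binning scheme above can be sketched with np.histogram. The bin boundaries come from the text; the helper name is an assumption:

```python
import numpy as np

# 10 µs (10,000 ns) uniform bins out to 100 ms -> 10,000 bins (boundaries from the text).
UNIFORM_BINS_NS = np.arange(0, 100_000_001, 10_000)
# Decade-spaced logarithmic boundaries, also in nanoseconds, up to 100 ms.
LOG_BINS_NS = [0, 1_000, 10_000, 100_000, 1_000_000, 10_000_000, 100_000_000]

def bin_trades(latencies_ns, bins):
    """Count trades per latency bin (hypothetical helper name)."""
    counts, edges = np.histogram(latencies_ns, bins=bins)
    return counts, edges
```

The same latency array can be pushed through both bin sets: the uniform bins feed the reaction-count subplots, while the logarithmic bins feed the delay-funnel views.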

5: GIT_Price_change_analysis_EUREX_volume.ipynb - Eurex Latency Notional Analytics

Overview

Focusing strictly on order book saturation and bandwidth-constraint visualizations, this script dissects cumulative European volume behavior across ultra-low-latency checkpoints using a specially optimized cache (EUREX_ISM_reactions_volume_aggregated.pkl).

Components and Logic

  1. Volume Caching Ecosystem: Grouping multi-gigabyte files linearly takes immense time, so the script operates off a serialized cache that pre-groups high-liquidity versus low-liquidity assets, mapping PriceChangeList, NotionalList, and MarkoutList. If the cache is absent, the script initializes an empty structure to be populated on the next full pass.

  2. Dashed Latency Demarcations (plot_Notional_PnL_evolution): The unique value of this notebook is mapping the normalized cumulative Notional trajectory using a stepwise line (where='post').

    • A critical dashed boundary sits at exactly $907$ nanoseconds, highlighting physical co-location advantages: the fiber-optic propagation limit for direct on-server interactions.
    • A second major boundary sits at $37$ milliseconds, marking the well-established cross-Atlantic cable-transmission hurdle: the point at which US macro algorithms hit European order books.
    • The axes use a symlog (symmetric log) configuration, allowing negative (pre-event) milliseconds to traverse zero into positive post-event bounds while keeping the $X$-axis legible across the zero boundary.
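A minimal sketch of this axis setup. The two dashed boundaries and the stepwise/symlog choices come from the text; the function name, linthresh value, and synthetic series are assumptions:

```python
import matplotlib
matplotlib.use("Agg")                 # headless backend for scripted runs
import matplotlib.pyplot as plt
import numpy as np

COLOCATION_NS = 907                   # co-location fiber limit (from the text)
TRANSATLANTIC_NS = 37_000_000         # 37 ms cross-Atlantic hurdle (from the text)

def plot_notional_evolution(t_ns, cum_notional):
    """Stepwise cumulative notional on a symlog time axis (hypothetical helper)."""
    fig, ax = plt.subplots()
    ax.step(t_ns, cum_notional, where="post")    # value holds until the next trade
    ax.set_xscale("symlog", linthresh=1_000)     # linear through zero, log beyond ±1 µs
    for x in (COLOCATION_NS, TRANSATLANTIC_NS):  # the two dashed demarcations
        ax.axvline(x, linestyle="--", color="k")
    ax.set_xlabel("time since release (ns)")
    ax.set_ylabel("normalized cumulative notional")
    return fig, ax
```

The symlog scale is what lets pre-event (negative) time, the zero crossing, and the nanosecond-to-millisecond post-event tail coexist on one readable axis.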

6: GIT_PriceFormation_PnL_Analytics.ipynb - PnL Correlations & Predictive Modeling

Overview

A machine-learning and statistical-analysis framework operating entirely on pre-compiled tabular outputs. This notebook quantifies the relationships mapping price impacts from atomic microsecond shifts through to macro effects, and forecasts subsequent momentum shifts.

Components and Logic

  1. Sequential Pearson, Spearman, & Kendall Mapping: Using data subsets restricted to high-momentum changes (absolute basis-point change locked to abs() >= 0.25), this script uses core Pandas corr() functions to output tables assessing:
    • Does a disruption in $0 - 200\,\mu s$ predict an identical directional move in $200\,\mu s - 30\,ms$?
    • Applies the tests redundantly with pearson (linear), spearman (monotonic ranks), and kendall (tau rank) correlation matrices.
    • Employs distribution testing across intervals via the nonparametric kruskal test (independent groups) and repeated-measures testing via friedmanchisquare.
  2. Skew Normal Probabilities (ss.skewnorm): Overlays fitted theoretical skew-normal probability density functions (PDFs) on the frequency histograms drawn via Seaborn (sns.histplot) with multi-hue shaded KDE plots, explicitly verifying the non-normal distributions inherent in toxic HFT event trading.

  3. Predictive Analytics Ecosystem (Machine Learning): The climax of the script introduces a supervised-learning pipeline built with sklearn, predicting the resulting basis-point outcome for the $200\,\mu s - 100\,ms$ window purely from the absolute initial shock inside $0 - 200\,\mu s$.
    • Data Split: Uses train_test_split to divide datasets 60-20-20 (training, validation, testing).
    • Random Forest Framework (RandomForestRegressor): Initializes an ensemble of 20 regression trees (n_estimators=20) with Out-Of-Bag error tracking (oob_score=True) to check validity against overfitting.
    • Ordinary Least Squares (OLS) Linear Framework: Employs a custom linreg_summary() function dissecting standard errors, R², F-statistics, and $p(B_1)$ bounds, identifying algorithmic reliability against noise.
    • Finally, renders a dual multi-axis chart over the test set, overlaying the green Random Forest predictive fit against the black linear fit on the $x/y$ scatter plane.
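The split and forest configuration above can be sketched as follows, on synthetic data. Only n_estimators=20, oob_score=True, and the 60-20-20 split come from the text; the feature/target construction (a 0.8 slope with small noise) is an illustrative assumption:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)
X = np.abs(rng.normal(size=(500, 1)))                 # |initial shock| in 0-200 µs (bp, synthetic)
y = 0.8 * X[:, 0] + rng.normal(scale=0.1, size=500)   # later move in 200 µs - 100 ms (synthetic)

# 60-20-20: first carve off 40%, then split that evenly into validation and test.
X_train, X_rest, y_train, y_rest = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)

model = RandomForestRegressor(n_estimators=20, oob_score=True, random_state=0)
model.fit(X_train, y_train)

# OOB R² gives an overfitting check "for free": each tree is scored on the
# samples it never saw during bagging.
print(f"OOB R^2: {model.oob_score_:.3f}, validation R^2: {model.score(X_val, y_val):.3f}")
```

Comparing the OOB score against the held-out validation score is the continuous overfitting check the notebook relies on; a large gap between the two would signal an unreliable fit.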

Prerequisites and Environment Setup

To recreate and analyze the outputs generated across this Jupyter ecosystem, an identical environment must exist locally:

Language Requirements

Library Requirements

Core packages must be configured locally:
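A minimal setup sketch, assuming a standard pip environment. The package list is inferred from the imports referenced throughout this document (numpy/pandas, scipy.stats, matplotlib, seaborn, sklearn, and Jupyter itself); the source does not pin versions:

```shell
# Packages inferred from the imports referenced in this documentation (unpinned).
pip install numpy pandas scipy matplotlib seaborn scikit-learn jupyter
```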

Expected Directory Data Layers

Due to hardcoded paths across all $6$ files, a core directory structure referencing an explicit folder named /Event_pickle_files directly inside the primary domain must exist hosting all iterations of raw memory arrays (e.g. ism_reactions_CME.pickle, economic_event_reactions_100ms.pickle).

Furthermore, structured outputs such as CME_processed_individual_data.csv and EUREX_ISM_reactions_volume_aggregated.pkl must sit adjacent to the .ipynb documents, since the notebooks control paths explicitly; keeping this layout preserves the dependency tree and avoids pathing exceptions.


Conclusion and Execution Workflows

Together, these notebooks offer a detailed framework for dissecting how physical latency and topological advantages map to financial outcomes, relating event reaction times to order book saturation.

Analysts aiming to adopt this structure specifically should:

  1. Guarantee data synchronization, starting cleanly from the processed CSVs in notebook (6) to verify the standard correlations.
  2. Run the deep structural .pickle validations in notebooks (3) and (5), targeting the high-level visualizations that check system normalization and hardware delays against the latency benchmarks (the 907-nanosecond and 37 ms curves).
  3. Validate the systematic regime disruptions across notebooks (1) and (2), relying on the exhaustive non-parametric algorithms to determine stochastic-shift confidence.
  4. Render the definitive multi-layered publication plots in notebook (4), finalizing multi-exchange aggregation and cumulative evaluations merging the US arrays with the Eurex variants.