Resources

Curated datasets, tutorials, tools, and educational materials for machine learning in finance research.

Wiki

Key concepts and methodologies used in our research

Systematic Literature Reviews Explained

A systematic literature review (SLR) is a form of secondary research that uses a clearly defined, replicable methodology to identify, select, critically appraise, and synthesize all relevant primary studies addressing a specific research question. Unlike narrative reviews, which are valuable for expert commentary but are vulnerable to selection bias and difficult to reproduce, SLRs follow a predefined protocol that makes every methodological decision explicit and auditable.

Five core principles

Principle	Description
Transparency	every step is documented for scrutiny
Reproducibility	another researcher can replicate the process
Comprehensiveness	aims to identify all relevant studies
Explicit methodology	criteria and strategies specified a priori
Bias minimization	systematic procedures reduce subjective selection

Systematic vs. narrative reviews

Dimension	Narrative	Systematic
Protocol	typically absent	defined a priori
Search	selective, implicit	comprehensive, explicit
Study selection	reviewer discretion	predefined criteria
Quality appraisal	rarely conducted	standardized
Reproducibility	low	high

Finance-specific challenges

Challenge	Description
Working papers	SSRN/NBER serve as primary channels; versioning complicates deduplication
Conference-dominant fields	ML research circulates mainly through NeurIPS, ICML, and arXiv rather than journals
Proprietary data	reliance on WRDS, Bloomberg, Refinitiv creates reproducibility barriers
Rapid evolution	reviews risk obsolescence; computational pipelines enable living reviews
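
The deduplication problem in the working-papers row can be sketched in a few lines. This is a minimal illustration, assuming each record carries a title, a source, and a year; the records, the field names, and the helper functions `normalize_title` and `deduplicate` are hypothetical and not part of any pipeline described here.

```python
import re
import unicodedata

def normalize_title(title: str) -> str:
    """Lowercase, strip accents and punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFKD", title)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    text = re.sub(r"[^a-z0-9 ]+", " ", text.lower())
    return " ".join(text.split())

def deduplicate(records):
    """Keep one record per normalized title, preferring the latest year."""
    best = {}
    for rec in records:
        key = normalize_title(rec["title"])
        if key not in best or rec["year"] > best[key]["year"]:
            best[key] = rec
    return list(best.values())

# Hypothetical records: the same paper posted to two outlets, plus one other.
records = [
    {"title": "Text-Based Risk Exposures", "source": "SSRN", "year": 2021},
    {"title": "Text-based risk exposures.", "source": "NBER", "year": 2022},
    {"title": "A Different Paper", "source": "arXiv", "year": 2023},
]

unique = deduplicate(records)
for rec in unique:
    print(rec["year"], rec["source"], rec["title"])
```

Exact matching on a normalized title only catches versions whose titles differ in case or punctuation. Papers retitled between versions would need fuzzy matching or author-based checks.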

Tags: Methodology, Research, Evidence Synthesis

Narrative Risk in Empirical Finance

Narrative risk is the price-relevance of textual signals derived from corporate disclosures, news, and analyst language. The literature has progressed from dictionary-based sentiment measures through firm-level text-based exposures to LLM-extracted disclosure factors, with each generation surfacing new firm-level and aggregate risk exposures.

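This article surveys 23 papers across five themes, drawn from a systematic literature review dated 2026-05-02 of empirical finance research using textual data. The per-theme counts below sum to more than 23, presumably because a paper can be assigned to more than one theme.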

The five themes

Theme	Papers
Narrative Asset Pricing & News-Based Factors	12
LLM-Based Disclosure & Document Extraction	7
Text-Based Firm-Level Exposures	7
Risk-Disclosure Information Content	4
Sentiment & Dictionary Baselines	4
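
For readers new to the dictionary baselines listed above, the following sketch shows the basic mechanics of a word-count tone measure. The two word lists are tiny, hypothetical stand-ins chosen for illustration; an actual study would use an established finance-specific dictionary, and `net_tone` is an assumed helper name rather than a function from any surveyed paper.

```python
import re

# Hypothetical toy word lists, for illustration only.
NEGATIVE = {"loss", "decline", "litigation", "impairment", "adverse"}
POSITIVE = {"growth", "improve", "gain", "strong"}

def net_tone(text: str) -> float:
    """Return (positive count - negative count) / total tokens."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    pos = sum(t in POSITIVE for t in tokens)
    neg = sum(t in NEGATIVE for t in tokens)
    return (pos - neg) / len(tokens)

sample = ("Strong revenue growth was offset by an impairment loss "
          "and pending litigation.")
score = net_tone(sample)
print(f"net tone: {score:.4f}")
```

The sample sentence has 12 tokens, 2 positive and 3 negative, so its net tone is slightly negative. Measures of this kind ignore negation and context, which is one motivation for the later generations of methods in the table.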

Tags: Narrative Risk, Asset Pricing, NLP

The Factor Zoo: Replication Crisis in Empirical Asset Pricing

The "factor zoo" refers to the proliferation of hundreds of statistically significant equity return predictors documented in empirical asset pricing research. When Harvey, Liu, and Zhu (2016) surveyed the literature, they catalogued over 300 factor candidates, and subsequent work has pushed this count higher. With so many factors tested against the same historical data, many apparent discoveries may be false positives driven by data mining rather than genuine economic mechanisms. This concern is now commonly framed as a replication crisis in empirical finance.

Key replication studies

Study	Scope	Headline finding
Harvey, Liu & Zhu (2016)	316 factors	Most fail when multiple-testing corrections are applied
McLean & Pontiff (2016)	97 predictors	Returns decay substantially post-publication
Hou, Xue & Zhang (2020)	452 anomalies	Roughly 18% survive with |t| > 2.78
Chen & Zimmermann (2022)	161 characteristics	98% reproduce with |t| > 1.96

Sources of the zoo

Cause	Description
Multiple testing	Hundreds of factors tested on one 50-year CRSP sample
Publication bias	Journals prefer novel, significant, contrarian findings
Data mining	Flexible model specifications chosen post-hoc to fit the data
Short samples	Noisy estimates in regime-dependent financial markets
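
The multiple-testing row can be made concrete with a small simulation. The sketch below is a toy under stated assumptions, not a replication of any study cited above: it draws 300 pure-noise "factors" over 600 months (both numbers chosen for illustration), so every apparent discovery is a false positive by construction. It then compares the conventional 1.96 cutoff with a Bonferroni-adjusted cutoff based on a normal approximation.

```python
from statistics import NormalDist

import numpy as np

rng = np.random.default_rng(0)
n_factors, n_months = 300, 600  # assumed sizes, for illustration only

# Pure-noise monthly returns: no factor carries a true premium.
returns = rng.normal(0.0, 0.05, size=(n_months, n_factors))

# t-statistic of each factor's mean return against zero.
std_err = returns.std(axis=0, ddof=1) / np.sqrt(n_months)
t_stats = returns.mean(axis=0) / std_err

# Conventional single-test cutoff at the 5% level.
naive_hits = int(np.sum(np.abs(t_stats) > 1.96))

# Bonferroni: split the 5% error budget across all tests (two-sided).
alpha = 0.05
bonferroni_cut = NormalDist().inv_cdf(1 - alpha / (2 * n_factors))
bonferroni_hits = int(np.sum(np.abs(t_stats) > bonferroni_cut))

print(f"|t| > 1.96: {naive_hits} of {n_factors} noise factors look significant")
print(f"|t| > {bonferroni_cut:.2f} (Bonferroni): {bonferroni_hits}")
```

Bonferroni is the simplest correction and is conservative. It is used here only because it needs no extra dependencies. It should not be read as the specific adjustment used in the studies listed above.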

Machine learning implications

Tension	Why it matters
High-dimensional feature spaces	ML amplifies multiple-testing problems when features are unscreened
Regularization	Lasso, ridge, and elastic net can prune spurious signals if tuned honestly
Honest evaluation	Walk-forward splits and out-of-sample holdouts are essential
Economic priors	Feature engineering grounded in theory reduces zoo-size inflation
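
The regularization and honest-evaluation rows can be combined in one short sketch. It assumes a simulated dataset in which only 3 of 50 features carry signal, a closed-form ridge fit, and an expanding-window walk-forward split. The function names `ridge_fit` and `walk_forward_splits`, the penalty `alpha=10.0`, and all sizes are illustrative assumptions rather than recommended settings.

```python
import numpy as np

def ridge_fit(X, y, alpha):
    """Closed-form ridge coefficients (no intercept; data assumed centered)."""
    k = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(k), X.T @ y)

def walk_forward_splits(n_obs, n_folds, min_train):
    """Yield (train_idx, test_idx); training data always precedes test data."""
    test_size = (n_obs - min_train) // n_folds
    for i in range(n_folds):
        train_end = min_train + i * test_size
        yield np.arange(0, train_end), np.arange(train_end, train_end + test_size)

# Simulated data: only the first 3 of 50 features matter.
rng = np.random.default_rng(1)
n_obs, n_features = 240, 50
X = rng.normal(size=(n_obs, n_features))
beta = np.zeros(n_features)
beta[:3] = 0.5
y = X @ beta + rng.normal(size=n_obs)

oos_errors = []
for train_idx, test_idx in walk_forward_splits(n_obs, n_folds=4, min_train=120):
    coef = ridge_fit(X[train_idx], y[train_idx], alpha=10.0)
    pred = X[test_idx] @ coef
    oos_errors.append(float(np.mean((y[test_idx] - pred) ** 2)))

print("out-of-sample MSE per fold:", [round(e, 3) for e in oos_errors])
```

The key property is that no test observation ever precedes its training window. To tune the penalty honestly, it would have to be selected inside each training window, never on the test folds.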

Tags: Empirical Asset Pricing, Replication, Multiple Testing

Open Science in Finance

Open science is a set of practices that make research transparent, verifiable, and cumulative. In finance, open science is complicated by proprietary data from providers like CRSP, Compustat, WRDS, and Bloomberg, by commercial research environments, and by a historical culture of closed codebases. Despite these barriers, a growing movement emphasizes pre-registration, open code, open data where possible, transparent pre-processing pipelines, and honest documentation of modelling choices as the foundation for credible empirical finance.

Core open-science practices

Practice	Description
Pre-registration	Commit hypotheses and analysis plans before seeing the data
Open code	Publish scripts, notebooks, and environment specifications
Open data	Share data where licensing allows; otherwise share processing code
Reproducible pipelines	Containerized, version-pinned, deterministic workflows
Transparent reporting	Report all specifications attempted, not just the winning one

Finance-specific barriers

Barrier	Description
Proprietary data licences	CRSP, Compustat, WRDS, Bloomberg restrict redistribution
Survivorship-biased archives	Historical databases often exclude delisted or merged firms
Point-in-time accuracy	Fundamentals revisions and restatements complicate replication
Industry partnerships	NDAs and commercial confidentiality limit full disclosure

Workable middle ground

Approach	What it delivers
Code with synthetic data	Reproducibility without violating data licences
Stable permalinks to snapshots	Researchers can verify against the exact data vintage
Pre-registration plus post-hoc disclosure	Documents both plan and deviations honestly
Public holdout sets	Uncontaminated evaluation data for shared benchmarks
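
The "code with synthetic data" approach can be sketched as follows. This minimal example assumes a one-factor return model with made-up parameters. The function name `make_synthetic_panel` and every numeric default are hypothetical, chosen only so the pipeline has something realistic in shape to run on without touching licensed data.

```python
import numpy as np

def make_synthetic_panel(n_firms=100, n_months=120, seed=42):
    """Simulate a months x firms return panel from a one-factor model.

    All parameter values are illustrative assumptions, not calibrated
    to any real dataset.
    """
    rng = np.random.default_rng(seed)
    market = rng.normal(0.006, 0.045, size=n_months)        # market factor
    betas = rng.uniform(0.5, 1.5, size=n_firms)             # firm exposures
    noise = rng.normal(0.0, 0.08, size=(n_months, n_firms)) # idiosyncratic part
    returns = market[:, None] * betas[None, :] + noise
    return returns, betas

# A fixed seed makes the synthetic panel identical on every run.
returns_a, betas_a = make_synthetic_panel()
returns_b, betas_b = make_synthetic_panel()
print("panel shape:", returns_a.shape)
print("identical across runs:", np.array_equal(returns_a, returns_b))
```

Shipping a generator like this alongside the analysis code lets others execute the full pipeline end to end. Results on the licensed data still have to be verified by someone with access.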

Tags: Reproducibility, Transparency, Research Methodology

Datasets

Open datasets for financial ML research

Tools & Libraries

Software tools for ML in finance research and development

PyTorch / TensorFlow

Deep learning frameworks for building and training neural networks.

Deep Learning

scikit-learn

Machine learning library for Python with classical algorithms and utilities.

ML

XGBoost / LightGBM

Gradient boosting libraries for high-performance ensemble learning.

Ensemble

Zipline / Backtrader

Backtesting frameworks for testing trading strategies on historical data.

Backtesting

cvxpy

Convex optimization library for portfolio optimization and risk management.

Optimization

PyPortfolioOpt

Portfolio optimization library with mean-variance, risk parity, and more.

Portfolio

Stable-Baselines3

Reliable implementations of reinforcement learning algorithms in PyTorch.

RL

Riskfolio-Lib

Library for portfolio optimization, risk analysis, and visualization.

Risk

OpenAlex API

Open catalog of scholarly works for literature review and research discovery.

Research

Academic Resources

Journals, conferences, and academic outlets

Related Projects

AI4Finance Foundation

QuantLib

mlfinlab

Digital-AI-Finance
