Applied-Machine-Learning-in-Empirical-Finance

Information

Property	Value
Language	HTML
Stars	0
Forks	0
Watchers	0
Open Issues	0
License	Other
Created	2025-12-16
Last Updated	2026-05-31
Last Push	2026-05-31
Contributors	2
Default Branch	main
Visibility	private

Notebooks

This repository contains 11 notebook(s):

Notebook	Language	Type
6_AMLEF_AuthorCollaboration	PYTHON	jupyter
2_AMLEF_BM25Retrieval	PYTHON	jupyter
4_AMLEF_BibCoupling	PYTHON	jupyter
5_AMLEF_CoCitationNetwork	PYTHON	jupyter
3_AMLEF_JaccardNearDuplicates	PYTHON	jupyter
9_AMLEF_KeywordCooccurrence	PYTHON	jupyter
8_AMLEF_LDATopics	PYTHON	jupyter
10_AMLEF_PRISMAFlowDiagram	PYTHON	jupyter
7_AMLEF_SPECTER2_UMAP	PYTHON	jupyter
1_AMLEF_TFIDFRanking	PYTHON	jupyter
ravenpack_fetch	PYTHON	jupyter

Datasets

This repository includes 38 dataset(s):

Dataset	Format	Size
data		0.0 KB
publications.json	.json	158.89 KB
research_questions.json	.json	31.83 KB
resources.json	.json	21.3 KB
tag_vocabulary.json	.json	3.47 KB
team.json	.json	3.84 KB
data		0.0 KB
editorial_boards		0.0 KB
clusters.json	.json	229.5 KB
control_cohort_assignments.json	.json	0.1 KB
corpus_authors.csv	.csv	453.11 KB
corpus_journals.csv	.csv	28.08 KB
corpus_pull_manifest.json	.json	72.51 KB
editor_authors.csv	.csv	87.79 KB
editor_match_audit.json	.json	37.1 KB
editor_metrics.csv	.csv	21.77 KB
editor_metrics_summary.json	.json	0.62 KB
editor_pre_post.csv	.csv	8.09 KB
editor_pre_post_summary.json	.json	0.25 KB
tenure_coverage_report.json	.json	21.45 KB
themes_manifest.json	.json	215.52 KB
data		0.0 KB
author_degrees.csv	.csv	3.96 KB
bm25_top10.csv	.csv	4.17 KB
coupling_topk.csv	.csv	4.63 KB
cocitation_communities.csv	.csv	2.63 KB
near_duplicate_pairs.csv	.csv	0.05 KB
keyword_edges.csv	.csv	13.07 KB
topic_topwords.csv	.csv	3.25 KB
prisma_flow.csv	.csv	0.27 KB
umap_coords.csv	.csv	9.58 KB
tfidf_top10.csv	.csv	4.29 KB
corpus.json	.json	272.86 KB
prisma_counts.json	.json	0.21 KB
data		0.0 KB
scimago		0.0 KB
abstracts.sqlite	.sqlite	21316.0 KB

Reproducibility

This repository includes reproducibility tools:

Python requirements.txt
Conda environment.yml
Makefile for automation

Status

Issues: Enabled
Wiki: Enabled
Pages: Enabled

README

Applied Machine Learning in Empirical Finance

A collaborative PhD research project between the University of Twente and Quoniam Asset Management, advancing the application of machine learning methods in portfolio optimization and risk management.

Project Overview

Start Date: December 2025
Duration: 3-4 years
Funding: Industry-funded by Quoniam Asset Management
License: MIT

Research Focus

Primary Themes:

ML for Portfolio Optimization
Risk Management & Forecasting

ML Methods:

Deep Learning (neural networks, transformers, LSTMs)
Reinforcement Learning
Ensemble Methods (random forests, gradient boosting)
Probabilistic ML (Bayesian methods, uncertainty quantification)
Statistical Learning Models

Asset Classes:

Equities, Fixed Income, Multi-Asset, Derivatives

Team

Name	Role	Affiliation
Joerg Osterrieder	Primary Supervisor & Industry Liaison	University of Twente
Xiaohong Huang	Co-Supervisor	University of Twente
Axel Gross-Klussmann	Industry Supervisor	Quoniam Asset Management
Dennis Hoffmann	PhD Student	Quoniam / University of Twente

Repository Structure

Applied-Machine-Learning-in-Empirical-Finance/
├── .claude/                        # Shared Claude Code config
│   ├── CLAUDE.md                   # Project conventions (both users)
│   ├── commands/                   # Custom slash commands
│   └── hooks/                     # Git & CI hooks
├── shared/                         # Shared resources
│   ├── claim_checker/             # Claim-citation verification (Gemini/Perplexity LLM screening, parity gate, Zotero export)
│   ├── claude_hooks/              # Deployment shims for Claude Code hooks (memory-index.mjs MEMORY.md auto-index)
│   ├── dashboard_kit/             # Reusable JS-free dashboard kit (theme/charts/build/audit/guards/verify); powers /diagnostic-dashboard
│   ├── parity_check/              # Standalone tex/pdf/html parity CLI (free, no API keys)
│   ├── reference_checker/         # BibTeX/TeX consistency + CrossRef/OpenAlex API verification
│   ├── data_sources.md            # External data source documentation
│   ├── research_proposal.tex      # PhD research proposal
│   └── templates/                 # Reusable templates
│       ├── beamer/                # Beamer presentation template
│       └── project/               # Scaffold for new papers/projects
├── literature_review/              # Literature-review umbrella
│   ├── api_info/                  # External-API registration + rate-limit docs (flat <provider>.md)
│   ├── notes/                     # General research notes (adversarial review, paper methodology draft)
│   ├── bibliographic_review/      # Bibliographic-review pipeline (Kessler coupling + Leiden clustering, Donthu-2021 audit)
│   │   ├── scripts/               # 8-step orchestration: 01_corpus_pull → 08_generate_dashboard
│   │   ├── configs/               # bib_review.yaml (search, networks, clustering, render)
│   │   ├── outputs/               # themes_manifest.{json,md}, clusters.json, dashboard.html, sentinels
│   │   ├── paper/                 # bib_review.tex (body) + standalone wrapper + references.bib + figures
│   │   ├── notes/                 # Donthu-2021 9-item checklist + dashboard story
│   │   ├── tests/                 # 63-test pytest suite (corpus, networks, clustering, manifest, render, dashboard)
│   │   └── README.md              # Pipeline diagram + run modes + outputs table + troubleshooting
│   └── systematic_literature_review/  # v2.3 SLR pipeline
│       ├── scripts/                   # Subfolders: steps/, lib/, tools/, _archive/, migrations/, tests/ (run_pipeline.py + post_flight_audit.py at root)
│       ├── scripts/tests/             # Pytest suite (259 tests)
│       ├── scripts/migrations/        # Config migrators (v2.1→v2.2→v2.3, seed_id v2)
│       ├── configs/                   # YAML search configurations
│       ├── runs/                      # Pipeline output per review (runs/old/ for archived pre-refactor runs)
│       ├── diagnostics/               # External-benchmark signal evaluation (Sezer 2020)
│       ├── notes/                     # Architecture spec, enhancement RFCs
│       ├── README.md                  # Usage guide
│       └── DEVELOPMENT.md             # Developer docs
├── qam_projects/                   # Quoniam industry projects (confidential)
│   └── strategy_specific_models/  # ML models for investment strategies
├── quantlets/                      # QuantLet platform planning + AMLEF submission guide
├── docs/                           # GitHub Pages website
│   ├── index.html                 # Landing page
│   ├── team.html                  # Team members & bios
│   ├── publications.html          # Publication browser
│   ├── research.html              # Research overview & gaps
│   ├── resources.html             # Tools & resources
│   ├── news.html                  # News & updates
│   ├── what-is-an-slr.html        # Wiki: SLR methodology
│   ├── factor-zoo.html            # Wiki: factor zoo & replication crisis
│   ├── cross-sectional-return-prediction.html  # Wiki: cross-sectional ML predictability
│   ├── signal-to-weights.html     # Wiki: portfolio construction
│   ├── open-science-in-finance.html  # Wiki: reproducibility & open science
│   ├── narrative-risk.html        # Wiki: narrative risk in empirical finance
│   ├── css/, js/, data/, assets/  # Website resources
│   └── scripts/                   # Python data collection scripts
│       ├── fetch_team_info.py     # Fetch ORCID IDs from OpenAlex
│       ├── fetch_openalex.py      # Fetch publications
│       ├── analyze_research_gaps.py # Identify research gaps
│       ├── verify_publications.py # Verify publication authors
│       ├── check_links.py         # Validate website links
│       ├── download_team_photos.py # Download team member photos
│       └── fetch_logos.py         # Fetch partner logos
├── environment.yml                 # Conda environment spec
├── README.md
├── CONTRIBUTING.md
└── LICENSE

Getting Started

View the Website

Visit: https://digital-ai-finance.github.io/Applied-Machine-Learning-in-Empirical-Finance/

Update Data

To refresh publication and team data from OpenAlex:

# Set up environment
conda env create -f environment.yml
conda activate applied-ml-finance

# Fetch team information
python docs/scripts/fetch_team_info.py

# Fetch publications
python docs/scripts/fetch_openalex.py

# Analyze research gaps
python docs/scripts/analyze_research_gaps.py
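
For orientation, the kind of OpenAlex request such a refresh relies on can be sketched as follows. This is a minimal illustration, not the code in docs/scripts/: the function names, the placeholder ORCID, and the choice to query by `author.orcid` are assumptions. The endpoint, the `per-page` and `cursor` parameters, and the response fields used are part of the public OpenAlex works API.

```python
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_works_url(orcid: str, per_page: int = 50, cursor: str = "*") -> str:
    """Build a cursor-paged OpenAlex works query for one author (hypothetical helper)."""
    params = {
        "filter": f"author.orcid:{orcid}",  # restrict to works by this ORCID
        "per-page": per_page,               # page size
        "cursor": cursor,                   # "*" starts cursor paging
    }
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


def summarize_works(payload: dict) -> list[dict]:
    """Reduce an OpenAlex works response to the fields a publication list needs."""
    return [
        {
            "id": work.get("id"),
            "title": work.get("display_name"),
            "year": work.get("publication_year"),
            "citations": work.get("cited_by_count", 0),
        }
        for work in payload.get("results", [])
    ]


# Placeholder ORCID, not a real team member's identifier.
url = build_works_url("0000-0000-0000-0000")

# Canned payload standing in for a real HTTP response.
sample = {"results": [{"id": "W1", "display_name": "Example work",
                       "publication_year": 2024, "cited_by_count": 3}]}
rows = summarize_works(sample)
```

Keeping URL construction and response parsing as separate pure functions lets both be tested without network access.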

Verify References

Check BibTeX entries against CrossRef/OpenAlex and verify TeX citation consistency (free, no API keys required):

# Full check: citation consistency + API verification
python -m shared.reference_checker paper_1/paper.tex

# Consistency only (no API calls)
python -m shared.reference_checker paper_1/paper.tex --consistency-only

# Bib-only: verify entries against APIs
python -m shared.reference_checker paper_1/references.bib

See shared/reference_checker/README.md for all options.
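
The consistency half of this check compares the keys cited in the TeX source with the keys defined in the .bib file. The sketch below shows that idea with two regular expressions. It is an illustration under simplifying assumptions (no nested braces in cite commands, one entry header per `@type{key,`), and the function names and report keys are hypothetical rather than the reference_checker API.

```python
import re

# \cite, \citep, \citet*, ... with optional [..] arguments, then {key1, key2}
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# @article{key,  /  @book{key,
BIB_ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")


def cited_keys(tex: str) -> set[str]:
    """Collect every key that appears inside a \\cite-style command."""
    keys: set[str] = set()
    for group in CITE_RE.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def bib_keys(bib: str) -> set[str]:
    """Collect entry keys, skipping non-entry blocks such as @string."""
    skip = {"comment", "string", "preamble"}
    return {key for kind, key in BIB_ENTRY_RE.findall(bib) if kind.lower() not in skip}


def consistency_report(tex: str, bib: str) -> dict[str, list[str]]:
    """Keys cited but undefined, and keys defined but never cited."""
    cited, defined = cited_keys(tex), bib_keys(bib)
    return {
        "missing_in_bib": sorted(cited - defined),
        "uncited_in_tex": sorted(defined - cited),
    }


# Dummy keys for illustration only.
tex = r"Claim one \citep{alpha2020, beta2021}. Claim two \cite[p.~3]{delta2022}."
bib = "@article{alpha2020,\n  title={A}\n}\n@book{gamma2019,\n  title={G}\n}\n"
report = consistency_report(tex, bib)
```

A check of this kind needs no network access, which is consistent with a consistency-only mode that makes no API calls.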

Verify Claims (LLM screening)

Screen each (sentence, \cite{key}) pair against the cited paper's abstract using Gemini 2.5 Pro (default) or Perplexity Sonar. This requires GEMINI_API_KEY in .env, plus PERPLEXITY_API_KEY for the fallback used by the default Gemini backend. Each wiki-article run costs roughly $0.03–0.38.

# Single article -> Excel report + Zotero push
python -m shared.claim_checker docs/assets/wiki/factor-zoo/factor-zoo.tex \
    -o out/factor-zoo.xlsx

See shared/claim_checker/README.md for backend selection, cost reference, and the API-key URL table. The standalone python -m shared.parity_check <tex> CLI exposes the tex/pdf/html parity gate without invoking the paid pipeline (see shared/parity_check/README.md).
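
To make the unit of screening concrete, the following sketch extracts (sentence, key) pairs from a TeX fragment. It is a deliberately naive illustration: the sentence splitter is a single regular expression that will mis-split on abbreviations, the function name is hypothetical, and the real claim_checker may segment text differently.

```python
import re

# \cite, \citep, \citet*, ... with optional [..] arguments, then {key1,key2}
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Split after ., ! or ? when the next token starts with a capital or a TeX command.
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z\\])")


def claim_pairs(tex: str) -> list[tuple[str, str]]:
    """Return one (sentence, cite_key) pair per key cited in each sentence."""
    pairs: list[tuple[str, str]] = []
    for sentence in SENTENCE_SPLIT_RE.split(tex.strip()):
        normalized = " ".join(sentence.split())  # collapse line breaks and runs of spaces
        for group in CITE_RE.findall(sentence):
            for key in group.split(","):
                key = key.strip()
                if key:
                    pairs.append((normalized, key))
    return pairs


# Dummy keys for illustration only.
text = (
    r"Many factors fail to replicate \citep{alpha2020,beta2021}. "
    r"This sentence has no citation. "
    r"Costs matter \cite{gamma2019}."
)
pairs = claim_pairs(text)
```

A sentence citing two keys yields two pairs, so each cited paper can be judged separately against the same claim. Sentences without citations produce nothing to screen.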

SLR Pipeline Flow

Queries and seeds are peer entry sources (step 01); only seeds drive citation snowballing (step 02). All sources then merge at the 02a dedup union, so both query breadth and seed citation-network depth are preserved. This flow mirrors the per-run dashboard flow diagram (step 10).

STEP 01: query_openalex    (entry sources, all on the same level)

┌──────────────────────┐   ┌──────────────────────┐   ┌──────────────────────┐
│ Queries              │   │ Seeds                │   │ Practitioner         │
│ query_groups →       │   │ curated anchor       │   │ discovery            │
│ OpenAlex hits        │   │ papers               │   │ (grey-lit/industry)  │
└───────────┬──────────┘   └───────────┬──────────┘   └───────────┬──────────┘
            │                          │                          │
            │                          ▼                          │
            │              ┌──────────────────────┐               │
            │              │ STEP 02: snowball    │               │
            │              │ forward + backward   │               │
            │              │ FROM SEEDS ONLY      │               │
            │              │ depth 1..N           │               │
            │              └───────────┬──────────┘               │
            │                          │ seed neighbourhood       │
            ▼                          ▼                          ▼
┌────────────────────────────────────────────────────────────────────────────┐
│ STEP 02a: dedup        UNION of all sources  (01 ∪ 02)                     │
│ 3-pass: exact-ID · working-paper→journal · cross-journal collapse          │
└──────────────────────────────────────┬─────────────────────────────────────┘
                                       │ 02_deduped_corpus.json
                                       ▼
┌────────────────────────────────────────────────────────────────────────────┐
│ STEP 02b: prefilter + enrich    (6-stage cascade, short-circuiting)        │
│ 0 relevance → 1 year → 2 Gate-B → 3 abstract-enrich (4 APIs)               │
│             → 4 abstract-presence → 5 signal-enrich                        │
└──────────────────────────────────────┬─────────────────────────────────────┘
                                       ▼
┌────────────────────────────────────────────────────────────────────────────┐
│ STEP 03 score (relevance + quality, P10)  →  03c signal analytics          │
│   →  STEP 04 per-theme selection  (6 themes, target_corpus 55)             │
└──────────────────────────────────────┬─────────────────────────────────────┘
                                       ▼
┌────────────────────────────────────────────────────────────────────────────┐
│ STEP 05-10  bibtex · zotero · latex · obsidian · PRISMA viz · dashboard    │
└────────────────────────────────────────────────────────────────────────────┘
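
The seed-only snowball (step 02) and the union at step 02a can be sketched on a toy citation graph. This is an illustration under stated assumptions: the record shape (`{"id": ...}`), the function names, and the `sources` provenance field are hypothetical. Only the exact-ID pass of the three-pass dedup is shown; the working-paper→journal and cross-journal passes are not.

```python
from collections import deque


def snowball(seeds, cites, max_depth=1):
    """Breadth-first snowball from seeds: backward (references) and forward (citing papers)."""
    # Reverse index: paper -> set of papers that cite it (needed for forward snowballing).
    cited_by = {}
    for src, refs in cites.items():
        for ref in refs:
            cited_by.setdefault(ref, set()).add(src)

    seen = set(seeds)
    found = set()
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        paper, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the configured depth
        neighbours = set(cites.get(paper, ())) | cited_by.get(paper, set())
        for n in neighbours:
            if n not in seen:
                seen.add(n)
                found.add(n)
                queue.append((n, depth + 1))
    return found


def dedup_union(*sources):
    """Exact-ID union: keep the first record per id and record every source that supplied it."""
    merged = {}
    for name, records in sources:
        for rec in records:
            entry = merged.setdefault(rec["id"], {**rec, "sources": []})
            if name not in entry["sources"]:
                entry["sources"].append(name)
    return list(merged.values())


# Toy graph: paper -> papers it cites.
cites = {"S1": ["A", "B"], "C": ["S1"], "D": ["A"], "Q1": ["Z"]}
depth1 = snowball(["S1"], cites, max_depth=1)
depth2 = snowball(["S1"], cites, max_depth=2)

corpus = dedup_union(
    ("queries", [{"id": "A"}, {"id": "Q1"}]),
    ("seeds", [{"id": "S1"}]),
    ("snowball", [{"id": p} for p in sorted(depth1)]),
)
by_id = {rec["id"]: rec for rec in corpus}
```

Paper A arrives from both the query hits and the seed neighbourhood but appears once in the merged corpus, with both sources recorded. Q1 is never expanded because it is not a seed.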

Post-Flight Audit (V1, V3, V5, V9)

After running the SLR pipeline, run the post-flight audit script to populate the verification banner artifacts:

conda activate applied-ml-finance
python literature_review/systematic_literature_review/scripts/post_flight_audit.py \
  --run-dir literature_review/systematic_literature_review/runs/<run-id>

This writes:

{run-dir}/selection_audit.md: V1 (borderline-case rationale) + V3 (leak-path reproduction)
{run-dir}/pytest_summary.txt: V5 (full SLR pytest suite) + V9 (partial-signal aggregate test)

Use --skip-pytest to write only the audit markdown (faster iteration).

Run Bibliographic Review

The bibliographic review maps the field's intellectual structure via bibliographic coupling (Kessler 1963) and co-citation networks. It shares the SLR's lexical perimeter (same query groups + tilt-keyword AND filter), so structural divergence is interpretable as citation behaviour rather than retrieval scope. The methodology is audited against the Donthu et al. (2021) nine-item bibliometric checklist.

conda activate applied-ml-finance
cd literature_review/bibliographic_review
python scripts/run_bib_review.py --config configs/bib_review.yaml --non-interactive

Outputs (selected):

outputs/themes_manifest.json: Pydantic-validated, canonical machine output
outputs/dashboard.html: offline single-file dashboard with Jaccard heatmap and cluster cards
paper/build/bib_review_standalone.pdf: LaTeX paper (>= 8 pages)

See literature_review/bibliographic_review/README.md for environment setup, pipeline diagram, run modes (single-step, dry-run, smoke), output catalogue, cluster-labelling workflow, reference-checker hook, and troubleshooting.
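
The two network measures named above have simple definitions: two papers are bibliographically coupled when they share references, and two references are co-cited when the same paper cites both. The sketch below computes both on toy data. The function names are hypothetical, and the Jaccard normalisation is included as one common choice, not as a statement of how the pipeline or its dashboard heatmap computes similarity.

```python
from itertools import combinations


def coupling_strengths(references):
    """Map each coupled paper pair to (shared reference count, Jaccard similarity)."""
    out = {}
    for a, b in combinations(sorted(references), 2):
        shared = references[a] & references[b]
        if shared:
            union = references[a] | references[b]
            out[(a, b)] = (len(shared), len(shared) / len(union))
    return out


def cocitation_counts(references):
    """Map each reference pair to the number of papers that cite both."""
    counts = {}
    for refs in references.values():
        for x, y in combinations(sorted(refs), 2):
            counts[(x, y)] = counts.get((x, y), 0) + 1
    return counts


# Toy data: citing paper -> set of cited references.
references = {
    "P1": {"R1", "R2", "R3"},
    "P2": {"R2", "R3", "R4"},
    "P3": {"R5"},
}
coupling = coupling_strengths(references)
cocitation = cocitation_counts(references)
```

Here P1 and P2 share two of four distinct references, giving a coupling strength of 2 and a Jaccard similarity of 0.5. R2 and R3 are co-cited twice. Coupling links the citing papers, while co-citation links the cited ones, which is why the two networks can reveal different structure over the same corpus.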

Verify Source Correctness (unified)

For end-to-end source verification, the check-sources Claude Code skill orchestrates reference_checker first (free, fail-fast) and only escalates to the paid claim_checker when the cite/key graph resolves. Inside Claude Code:

check sources in docs/assets/wiki/factor-zoo/factor-zoo.tex

See .claude/skills/source-checker/SKILL.md and the Source-Checker wiki page.

Project Management

We use GitHub's built-in tools for tracking progress and collaboration:

Issues — Task tracking with direct links to code, commits, and branches
Projects (Kanban Board) — Overview of PhD progress and milestones
Wiki — Comprehensive project documentation: setup guides, pipeline docs, conventions, and architecture

Research Questions

Key open questions driving this research:

Portfolio Optimization

How can deep reinforcement learning handle multi-asset portfolios with realistic constraints?
What is the optimal way to incorporate transaction costs into ML-based portfolio optimization?
Can transformer architectures capture cross-asset dependencies effectively?

Risk Management

How can we develop uncertainty-aware deep learning models for VaR/ES estimation?
What architectures work best for real-time risk monitoring with streaming data?
How should ML risk models be validated to meet regulatory requirements?

Methodology

What pre-training strategies work for financial time series foundation models?
How can causal inference be integrated with ML predictions for portfolio decisions?
What transfer learning approaches work across financial markets?

Partners

University of Twente

BMS Financial Engineering - Academic research partner providing theoretical foundations and research methodology.

Quoniam Asset Management

Frankfurt-based quantitative asset manager providing industry context, practical use cases, and data access.

Research Networks

COST Action Fintech and AI in Finance
MSCA Digital Finance

Publications

Team publications are automatically fetched from OpenAlex and displayed on the website. The data includes:

160+ publications from team members
80+ ML+Finance relevant papers
1,000+ total citations

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

Academic: University of Twente, BMS Financial Engineering
Industry: Quoniam Asset Management, Frankfurt

Acknowledgments

OpenAlex for open academic publication data
Digital-AI-Finance organization for hosting
COST Action CA19130 Fintech and AI in Finance
MSCA Digital Finance network

(c) Joerg Osterrieder 2025-2026