Applied-Machine-Learning-in-Empirical-Finance
Information
| Property | Value |
|---|---|
| Language | HTML |
| Stars | 0 |
| Forks | 0 |
| Watchers | 0 |
| Open Issues | 0 |
| License | Other |
| Created | 2025-12-16 |
| Last Updated | 2026-05-31 |
| Last Push | 2026-05-31 |
| Contributors | 2 |
| Default Branch | main |
| Visibility | private |
Notebooks
This repository contains 11 notebooks:
| Notebook | Language | Type |
|---|---|---|
| 6_AMLEF_AuthorCollaboration | PYTHON | jupyter |
| 2_AMLEF_BM25Retrieval | PYTHON | jupyter |
| 4_AMLEF_BibCoupling | PYTHON | jupyter |
| 5_AMLEF_CoCitationNetwork | PYTHON | jupyter |
| 3_AMLEF_JaccardNearDuplicates | PYTHON | jupyter |
| 9_AMLEF_KeywordCooccurrence | PYTHON | jupyter |
| 8_AMLEF_LDATopics | PYTHON | jupyter |
| 10_AMLEF_PRISMAFlowDiagram | PYTHON | jupyter |
| 7_AMLEF_SPECTER2_UMAP | PYTHON | jupyter |
| 1_AMLEF_TFIDFRanking | PYTHON | jupyter |
| ravenpack_fetch | PYTHON | jupyter |
Datasets
This repository includes 38 datasets:
| Dataset | Format | Size |
|---|---|---|
| data | | 0.0 KB |
| publications.json | .json | 158.89 KB |
| research_questions.json | .json | 31.83 KB |
| resources.json | .json | 21.3 KB |
| tag_vocabulary.json | .json | 3.47 KB |
| team.json | .json | 3.84 KB |
| data | | 0.0 KB |
| data | | 0.0 KB |
| editorial_boards | | 0.0 KB |
| clusters.json | .json | 229.5 KB |
| control_cohort_assignments.json | .json | 0.1 KB |
| corpus_authors.csv | .csv | 453.11 KB |
| corpus_journals.csv | .csv | 28.08 KB |
| corpus_pull_manifest.json | .json | 72.51 KB |
| editor_authors.csv | .csv | 87.79 KB |
| editor_match_audit.json | .json | 37.1 KB |
| editor_metrics.csv | .csv | 21.77 KB |
| editor_metrics_summary.json | .json | 0.62 KB |
| editor_pre_post.csv | .csv | 8.09 KB |
| editor_pre_post_summary.json | .json | 0.25 KB |
| tenure_coverage_report.json | .json | 21.45 KB |
| themes_manifest.json | .json | 215.52 KB |
| data | | 0.0 KB |
| author_degrees.csv | .csv | 3.96 KB |
| bm25_top10.csv | .csv | 4.17 KB |
| coupling_topk.csv | .csv | 4.63 KB |
| cocitation_communities.csv | .csv | 2.63 KB |
| near_duplicate_pairs.csv | .csv | 0.05 KB |
| keyword_edges.csv | .csv | 13.07 KB |
| topic_topwords.csv | .csv | 3.25 KB |
| prisma_flow.csv | .csv | 0.27 KB |
| umap_coords.csv | .csv | 9.58 KB |
| tfidf_top10.csv | .csv | 4.29 KB |
| corpus.json | .json | 272.86 KB |
| prisma_counts.json | .json | 0.21 KB |
| data | | 0.0 KB |
| scimago | | 0.0 KB |
| abstracts.sqlite | .sqlite | 21316.0 KB |
Reproducibility
This repository includes reproducibility tools:
- Python requirements.txt
- Conda environment.yml
- Makefile for automation
Status
- Issues: Enabled
- Wiki: Enabled
- Pages: Enabled
README
Applied Machine Learning in Empirical Finance
A collaborative PhD research project between the University of Twente and Quoniam Asset Management, advancing the application of machine learning methods in portfolio optimization and risk management.
Project Overview
- Start Date: December 2025
- Duration: 3-4 years
- Funding: Industry-funded by Quoniam Asset Management
- License: MIT
Research Focus
Primary Themes:
- ML for Portfolio Optimization
- Risk Management & Forecasting
ML Methods:
- Deep Learning (neural networks, transformers, LSTMs)
- Reinforcement Learning
- Ensemble Methods (random forests, gradient boosting)
- Probabilistic ML (Bayesian methods, uncertainty quantification)
- Statistical Learning Models
Asset Classes:
- Equities, Fixed Income, Multi-Asset, Derivatives
Team
| Name | Role | Affiliation |
|---|---|---|
| Joerg Osterrieder | Primary Supervisor & Industry Liaison | University of Twente |
| Xiaohong Huang | Co-Supervisor | University of Twente |
| Axel Gross-Klussmann | Industry Supervisor | Quoniam Asset Management |
| Dennis Hoffmann | PhD Student | Quoniam / University of Twente |
Repository Structure
Applied-Machine-Learning-in-Empirical-Finance/
├── .claude/ # Shared Claude Code config
│ ├── CLAUDE.md # Project conventions (both users)
│ ├── commands/ # Custom slash commands
│ └── hooks/ # Git & CI hooks
├── shared/ # Shared resources
│ ├── claim_checker/ # Claim-citation verification (Gemini/Perplexity LLM screening, parity gate, Zotero export)
│ ├── claude_hooks/ # Deployment shims for Claude Code hooks (memory-index.mjs MEMORY.md auto-index)
│ ├── dashboard_kit/ # Reusable JS-free dashboard kit (theme/charts/build/audit/guards/verify); powers /diagnostic-dashboard
│ ├── parity_check/ # Standalone tex/pdf/html parity CLI (free, no API keys)
│ ├── reference_checker/ # BibTeX/TeX consistency + CrossRef/OpenAlex API verification
│ ├── data_sources.md # External data source documentation
│ ├── research_proposal.tex # PhD research proposal
│ └── templates/ # Reusable templates
│ ├── beamer/ # Beamer presentation template
│ └── project/ # Scaffold for new papers/projects
├── literature_review/ # Literature-review umbrella
│ ├── api_info/ # External-API registration + rate-limit docs (flat <provider>.md)
│ ├── notes/ # General research notes (adversarial review, paper methodology draft)
│ ├── bibliographic_review/ # Bibliographic-review pipeline (Kessler coupling + Leiden clustering, Donthu-2021 audit)
│ │ ├── scripts/ # 8-step orchestration: 01_corpus_pull → 08_generate_dashboard
│ │ ├── configs/ # bib_review.yaml (search, networks, clustering, render)
│ │ ├── outputs/ # themes_manifest.{json,md}, clusters.json, dashboard.html, sentinels
│ │ ├── paper/ # bib_review.tex (body) + standalone wrapper + references.bib + figures
│ │ ├── notes/ # Donthu-2021 9-item checklist + dashboard story
│ │ ├── tests/ # 63-test pytest suite (corpus, networks, clustering, manifest, render, dashboard)
│ │ └── README.md # Pipeline diagram + run modes + outputs table + troubleshooting
│ └── systematic_literature_review/ # v2.3 SLR pipeline
│ ├── scripts/ # Subfolders: steps/, lib/, tools/, _archive/, migrations/, tests/ (run_pipeline.py + post_flight_audit.py at root)
│ ├── scripts/tests/ # Pytest suite (259 tests)
│ ├── scripts/migrations/ # Config migrators (v2.1→v2.2→v2.3, seed_id v2)
│ ├── configs/ # YAML search configurations
│ ├── runs/ # Pipeline output per review (runs/old/ for archived pre-refactor runs)
│ ├── diagnostics/ # External-benchmark signal evaluation (Sezer 2020)
│ ├── notes/ # Architecture spec, enhancement RFCs
│ ├── README.md # Usage guide
│ └── DEVELOPMENT.md # Developer docs
├── qam_projects/ # Quoniam industry projects (confidential)
│ └── strategy_specific_models/ # ML models for investment strategies
├── quantlets/ # QuantLet platform planning + AMLEF submission guide
├── docs/ # GitHub Pages website
│ ├── index.html # Landing page
│ ├── team.html # Team members & bios
│ ├── publications.html # Publication browser
│ ├── research.html # Research overview & gaps
│ ├── resources.html # Tools & resources
│ ├── news.html # News & updates
│ ├── what-is-an-slr.html # Wiki: SLR methodology
│ ├── factor-zoo.html # Wiki: factor zoo & replication crisis
│ ├── cross-sectional-return-prediction.html # Wiki: cross-sectional ML predictability
│ ├── signal-to-weights.html # Wiki: portfolio construction
│ ├── open-science-in-finance.html # Wiki: reproducibility & open science
│ ├── narrative-risk.html # Wiki: narrative risk in empirical finance
│ ├── css/, js/, data/, assets/ # Website resources
│ └── scripts/ # Python data collection scripts
│ ├── fetch_team_info.py # Fetch ORCID IDs from OpenAlex
│ ├── fetch_openalex.py # Fetch publications
│ ├── analyze_research_gaps.py # Identify research gaps
│ ├── verify_publications.py # Verify publication authors
│ ├── check_links.py # Validate website links
│ ├── download_team_photos.py # Download team member photos
│ └── fetch_logos.py # Fetch partner logos
├── environment.yml # Conda environment spec
├── README.md
├── CONTRIBUTING.md
└── LICENSE
Getting Started
View the Website
Visit: https://digital-ai-finance.github.io/Applied-Machine-Learning-in-Empirical-Finance/
Update Data
To refresh publication and team data from OpenAlex:
# Set up environment
conda env create -f environment.yml
conda activate applied-ml-finance
# Fetch team information
python docs/scripts/fetch_team_info.py
# Fetch publications
python docs/scripts/fetch_openalex.py
# Analyze research gaps
python docs/scripts/analyze_research_gaps.py
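The fetch scripts themselves are not reproduced in this README. As a rough illustration of the kind of request such a refresh involves, the sketch below builds a paginated OpenAlex works query for a single author ORCID. The helper name `build_works_url`, the placeholder ORCID and the contact address are hypothetical; the endpoint and the `filter`, `per-page`, `cursor` and `mailto` parameters belong to the public OpenAlex API, but whether `fetch_openalex.py` issues its requests this way is an assumption.

```python
from typing import Optional
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"


def build_works_url(
    orcid: str,
    cursor: str = "*",
    per_page: int = 200,
    mailto: Optional[str] = None,
) -> str:
    """Build a cursor-paginated OpenAlex works query for one author ORCID.

    Hypothetical helper for illustration; not taken from the repository.
    """
    params = {
        # Filter works by the ORCID of any listed author.
        "filter": f"authorships.author.orcid:{orcid}",
        # 200 is the largest page size OpenAlex accepts.
        "per-page": per_page,
        # "*" requests the first page; later pages use the returned next_cursor.
        "cursor": cursor,
    }
    if mailto:
        # Optional contact address for OpenAlex's polite pool.
        params["mailto"] = mailto
    return f"{OPENALEX_WORKS}?{urlencode(params)}"


if __name__ == "__main__":
    # Placeholder ORCID and address, not real team data.
    print(build_works_url("0000-0000-0000-0000", mailto="you@example.org"))
```

The function only constructs the URL, so it can be inspected or logged before any network call is made.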
Verify References
Check BibTeX entries against CrossRef/OpenAlex and TeX citation consistency (free, no API keys required):
# Full check: citation consistency + API verification
python -m shared.reference_checker paper_1/paper.tex
# Consistency only (no API calls)
python -m shared.reference_checker paper_1/paper.tex --consistency-only
# Bib-only: verify entries against APIs
python -m shared.reference_checker paper_1/references.bib
See shared/reference_checker/README.md for all options.
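To make the idea of a citation-consistency check concrete, here is a minimal sketch that compares the keys used in `\cite`-style commands with the keys defined in a BibTeX file. It is not the `shared.reference_checker` implementation: the function names, the regular expressions and the report layout are all assumptions, and the sketch ignores commented-out TeX lines and `\input` files.

```python
import re

# Matches \cite, \citep, \citet*, ... with optional [..] arguments (assumed pattern).
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Matches the entry type and key at the start of a BibTeX entry (assumed pattern).
BIB_ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
NON_ENTRY_TYPES = {"comment", "string", "preamble"}


def cited_keys(tex: str) -> set:
    """Collect every key that appears inside a \\cite-style command."""
    keys = set()
    for group in CITE_RE.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def bib_keys(bib: str) -> set:
    """Collect every entry key defined in a BibTeX source string."""
    return {
        key
        for kind, key in BIB_ENTRY_RE.findall(bib)
        if kind.lower() not in NON_ENTRY_TYPES
    }


def consistency_report(tex: str, bib: str) -> dict:
    """Report cited-but-undefined and defined-but-uncited keys."""
    cited, defined = cited_keys(tex), bib_keys(bib)
    return {
        "missing": sorted(cited - defined),
        "unused": sorted(defined - cited),
    }


if __name__ == "__main__":
    tex = r"See \citep[e.g.][p.~3]{keyA, keyB} and \cite{keyC}."
    bib = "@article{keyA,\n  title={T}\n}\n@book{keyD,\n  title={U}\n}\n"
    print(consistency_report(tex, bib))
```

A check of this kind needs no network access, which is why it can run before any API verification.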
Verify Claims (LLM screening)
Screen each (sentence, \cite{key}) pair against the cited paper's abstract using Gemini 2.5 Pro (default) or Perplexity Sonar. Screening requires GEMINI_API_KEY (and PERPLEXITY_API_KEY for the default Gemini fallback) in .env and costs roughly $0.03–0.38 per wiki-article run.
# Single article -> Excel report + Zotero push
python -m shared.claim_checker docs/assets/wiki/factor-zoo/factor-zoo.tex \
-o out/factor-zoo.xlsx
See shared/claim_checker/README.md for backend selection, cost reference, and the API-key URL table. The standalone python -m shared.parity_check <tex> CLI exposes the tex/pdf/html parity gate without invoking the paid pipeline (see shared/parity_check/README.md).
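The (sentence, key) pairs that get screened can be pictured with a small sketch. The code below splits a TeX string into sentences with a deliberately naive rule and emits one pair per cited key. It is an illustration only: `claim_pairs`, both regular expressions and the placeholder keys are assumptions, and the actual `shared.claim_checker` may segment text quite differently.

```python
import re

# \cite-style commands with optional [..] arguments (assumed pattern).
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Naive sentence boundary: end punctuation, whitespace, then a capital or a TeX command.
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z\\])")


def claim_pairs(tex: str) -> list:
    """Return (sentence, cite_key) tuples, one per key cited in each sentence."""
    pairs = []
    for sentence in SENTENCE_SPLIT_RE.split(tex.strip()):
        # Collapse internal whitespace and line breaks.
        clean = " ".join(sentence.split())
        for group in CITE_RE.findall(sentence):
            for key in group.split(","):
                key = key.strip()
                if key:
                    pairs.append((clean, key))
    return pairs


if __name__ == "__main__":
    # Placeholder text and keys, not taken from any wiki article.
    sample = (
        r"The first claim has two sources \citep{keyA, keyB}. "
        r"This sentence cites nothing. "
        r"The last claim has one source \cite{keyC}."
    )
    for sentence, key in claim_pairs(sample):
        print(key, "<-", sentence)
```

Sentences without a citation yield no pairs, so only cited claims would reach the paid screening step.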
SLR Pipeline Flow
Queries and seeds are peer entry sources (step 01); only seeds drive citation snowballing (step 02). All sources then merge at the 02a dedup union, so query breadth and seed citation-network depth are both preserved. The diagram below mirrors the per-run dashboard flow diagram (step 10).
STEP 01: query_openalex (entry sources, all on the same level)
┌──────────────────────┐ ┌──────────────────────┐ ┌──────────────────────┐
│ Queries │ │ Seeds │ │ Practitioner │
│ query_groups → │ │ curated anchor │ │ discovery │
│ OpenAlex hits │ │ papers │ │ (grey-lit/industry) │
└───────────┬──────────┘ └───────────┬──────────┘ └───────────┬──────────┘
│ │ │
│ ▼ │
│ ┌──────────────────────┐ │
│ │ STEP 02: snowball │ │
│ │ forward + backward │ │
│ │ FROM SEEDS ONLY │ │
│ │ depth 1..N │ │
│ └───────────┬──────────┘ │
│ │ seed neighbourhood │
▼ ▼ ▼
┌────────────────────────────────────────────────────────────────────────────┐
│ STEP 02a: dedup UNION of all sources (01 ∪ 02) │
│ 3-pass: exact-ID · working-paper→journal · cross-journal collapse │
└──────────────────────────────────────┬─────────────────────────────────────┘
│ 02_deduped_corpus.json
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ STEP 02b: prefilter + enrich (6-stage cascade, short-circuiting) │
│ 0 relevance → 1 year → 2 Gate-B → 3 abstract-enrich (4 APIs) │
│ → 4 abstract-presence → 5 signal-enrich │
└──────────────────────────────────────┬─────────────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ STEP 03 score (relevance + quality, P10) → 03c signal analytics │
│ → STEP 04 per-theme selection (6 themes, target_corpus 55) │
└──────────────────────────────────────┬─────────────────────────────────────┘
▼
┌────────────────────────────────────────────────────────────────────────────┐
│ STEP 05-10 bibtex · zotero · latex · obsidian · PRISMA viz · dashboard │
└────────────────────────────────────────────────────────────────────────────┘
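The entry logic at the top of the diagram (snowball from seeds only, then take the union of all sources) can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the pipeline's code: the function names, the dictionary-based citation graph and the work IDs are hypothetical, and only the exact-ID pass of the three-pass dedup is shown.

```python
from collections import deque


def snowball(seeds, cites, cited_by, max_depth=1):
    """Breadth-first forward + backward expansion, starting from seeds only.

    cites[w]    -> works that w references (backward snowballing)
    cited_by[w] -> works that reference w  (forward snowballing)
    Returns the set of newly discovered works, excluding the seeds.
    """
    seen = set(seeds)
    found = set()
    frontier = deque((seed, 0) for seed in seeds)
    while frontier:
        work, depth = frontier.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the configured depth
        neighbours = set(cites.get(work, ())) | set(cited_by.get(work, ()))
        for neighbour in neighbours:
            if neighbour not in seen:
                seen.add(neighbour)
                found.add(neighbour)
                frontier.append((neighbour, depth + 1))
    return found


def dedup_union(*sources):
    """Exact-ID union of several sources, keeping first-seen order."""
    merged, seen = [], set()
    for source in sources:
        for work_id in source:
            if work_id not in seen:
                seen.add(work_id)
                merged.append(work_id)
    return merged


if __name__ == "__main__":
    # Toy graph with made-up IDs.
    cites = {"S1": ["A", "B"], "A": ["C"]}
    cited_by = {"S1": ["D"], "Q1": ["Z"]}
    query_hits, seeds = ["Q1", "A"], ["S1"]
    neighbourhood = sorted(snowball(seeds, cites, cited_by, max_depth=1))
    print(dedup_union(query_hits, seeds, neighbourhood))
```

In the toy graph, `Z` is never reached because `Q1` is a query hit rather than a seed, which is the asymmetry the diagram describes.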
Post-Flight Audit (V1, V3, V5, V9)
After running the SLR pipeline, run the post-flight audit script to populate the verification banner artifacts:
conda activate applied-ml-finance
python literature_review/systematic_literature_review/scripts/post_flight_audit.py \
--run-dir literature_review/systematic_literature_review/runs/<run-id>
This writes:
- {run-dir}/selection_audit.md — V1 (borderline-case rationale) + V3 (leak-path reproduction).
- {run-dir}/pytest_summary.txt — V5 (full SLR pytest suite) + V9 (partial-signal aggregate test).
Use --skip-pytest to write only the audit markdown (faster iteration).
Run Bibliographic Review
The bibliographic review maps the field's intellectual structure via bibliographic coupling (Kessler 1963) and co-citation networks. It shares the SLR's lexical perimeter (same query groups + tilt-keyword AND filter), so structural divergence is interpretable as citation behaviour rather than retrieval scope. The methodology is audited against the Donthu et al. (2021) nine-item bibliometric checklist.
conda activate applied-ml-finance
cd literature_review/bibliographic_review
python scripts/run_bib_review.py --config configs/bib_review.yaml --non-interactive
Outputs (selected): outputs/themes_manifest.json (Pydantic-validated, canonical machine output), outputs/dashboard.html (offline single-file dashboard, Jaccard heatmap + cluster cards), paper/build/bib_review_standalone.pdf (>= 8-page LaTeX paper). See literature_review/bibliographic_review/README.md for environment setup, pipeline diagram, run modes (single-step, dry-run, smoke), output catalogue, cluster-labelling workflow, reference-checker hook, and troubleshooting.
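For readers new to the two network types, the sketch below computes both from the same input: bibliographic coupling links two corpus papers by the number of references they share, while co-citation links two references by the number of corpus papers citing both. The function names and the toy data are hypothetical, and the pipeline's own edge weighting, thresholds and Leiden clustering are not reproduced here.

```python
from collections import Counter
from itertools import combinations


def coupling_strength(refs):
    """Bibliographic coupling: shared-reference count for each pair of papers.

    refs maps a paper ID to the set of reference IDs it cites.
    """
    strength = Counter()
    for a, b in combinations(sorted(refs), 2):
        shared = len(refs[a] & refs[b])
        if shared:
            strength[(a, b)] = shared
    return strength


def cocitation_counts(refs):
    """Co-citation: how many papers cite both members of each reference pair."""
    counts = Counter()
    for cited in refs.values():
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


if __name__ == "__main__":
    # Toy corpus with made-up IDs.
    refs = {
        "P1": {"R1", "R2", "R3"},
        "P2": {"R2", "R3"},
        "P3": {"R4"},
    }
    print(coupling_strength(refs))   # P1 and P2 share two references
    print(cocitation_counts(refs))   # R2 and R3 are cited together twice
```

Coupling is fixed once a paper is published, whereas co-citation counts keep changing as new papers cite older ones; that difference is one reason to inspect both networks.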
Verify Source Correctness (unified)
For end-to-end source verification, the check-sources Claude Code skill orchestrates reference_checker first (free, fail-fast) and only escalates to the paid claim_checker when the cite/key graph resolves. Inside Claude Code:
check sources in docs/assets/wiki/factor-zoo/factor-zoo.tex
See .claude/skills/source-checker/SKILL.md and the Source-Checker wiki page.
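The fail-fast ordering described above (free check first, paid check only if it passes) amounts to a simple gate. The sketch below shows that control flow with the two checkers injected as plain callables. It is a minimal illustration rather than the skill's implementation: `check_sources`, `SourceCheckResult` and the stub checkers are hypothetical names, and how the real tools report success is not specified here.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class SourceCheckResult:
    reference_ok: bool
    claims_checked: bool
    claim_report: Optional[str] = None


def check_sources(
    tex_path: str,
    reference_check: Callable[[str], bool],
    claim_check: Callable[[str], str],
) -> SourceCheckResult:
    """Run the free reference check first; escalate only if it passes."""
    if not reference_check(tex_path):
        # Fail fast: the paid claim check is never invoked.
        return SourceCheckResult(reference_ok=False, claims_checked=False)
    report = claim_check(tex_path)
    return SourceCheckResult(
        reference_ok=True, claims_checked=True, claim_report=report
    )


if __name__ == "__main__":
    # Stub checkers standing in for the real tools.
    def always_fail(path: str) -> bool:
        return False

    def fake_claim_check(path: str) -> str:
        return f"screened {path}"

    print(check_sources("article.tex", always_fail, fake_claim_check))
```

Passing the checkers in as arguments keeps the gate testable without API keys or network access.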
Project Management
We use GitHub's built-in tools for tracking progress and collaboration:
- Issues — Task tracking with direct links to code, commits, and branches
- Projects (Kanban Board) — Overview of PhD progress and milestones
- Wiki — Comprehensive project documentation: setup guides, pipeline docs, conventions, and architecture
Research Questions
Key open questions driving this research:
Portfolio Optimization
- How can deep reinforcement learning handle multi-asset portfolios with realistic constraints?
- What is the optimal way to incorporate transaction costs into ML-based portfolio optimization?
- Can transformer architectures capture cross-asset dependencies effectively?
Risk Management
- How can we develop uncertainty-aware deep learning models for VaR/ES estimation?
- What architectures work best for real-time risk monitoring with streaming data?
- How should ML risk models be validated to meet regulatory requirements?
Methodology
- What pre-training strategies work for financial time series foundation models?
- How can causal inference be integrated with ML predictions for portfolio decisions?
- What transfer learning approaches work across financial markets?
Partners
University of Twente
BMS Financial Engineering - Academic research partner providing theoretical foundations and research methodology.
Quoniam Asset Management
Frankfurt-based quantitative asset manager providing industry context, practical use cases, and data access.
Research Networks
- COST Action Fintech and AI in Finance
- MSCA Digital Finance
Publications
Team publications are automatically fetched from OpenAlex and displayed on the website. The data includes:
- 160+ publications from team members
- 80+ ML+Finance relevant papers
- 1,000+ total citations
License
This project is licensed under the MIT License - see the LICENSE file for details.
Contact
- Academic: University of Twente, BMS Financial Engineering
- Industry: Quoniam Asset Management, Frankfurt
Acknowledgments
- OpenAlex for open academic publication data
- Digital-AI-Finance organization for hosting
- COST Action CA19130 Fintech and AI in Finance
- MSCA Digital Finance network
(c) Joerg Osterrieder 2025-2026