Applied-Machine-Learning-in-Empirical-Finance

Information

| Property | Value |
| Language | HTML |
| Stars | 0 |
| Forks | 0 |
| Watchers | 0 |
| Open Issues | 0 |
| License | Other |
| Created | 2025-12-16 |
| Last Updated | 2026-05-31 |
| Last Push | 2026-05-31 |
| Contributors | 2 |
| Default Branch | main |
| Visibility | private |

Notebooks

This repository contains 11 notebook(s):

| Notebook | Language | Type |

| 6_AMLEF_AuthorCollaboration | PYTHON | jupyter |

| 2_AMLEF_BM25Retrieval | PYTHON | jupyter |

| 4_AMLEF_BibCoupling | PYTHON | jupyter |

| 5_AMLEF_CoCitationNetwork | PYTHON | jupyter |

| 3_AMLEF_JaccardNearDuplicates | PYTHON | jupyter |

| 9_AMLEF_KeywordCooccurrence | PYTHON | jupyter |

| 8_AMLEF_LDATopics | PYTHON | jupyter |

| 10_AMLEF_PRISMAFlowDiagram | PYTHON | jupyter |

| 7_AMLEF_SPECTER2_UMAP | PYTHON | jupyter |

| 1_AMLEF_TFIDFRanking | PYTHON | jupyter |

| ravenpack_fetch | PYTHON | jupyter |

Datasets

This repository includes 38 dataset(s):

| Dataset | Format | Size |

| data | | 0.0 KB |

| publications.json | .json | 158.89 KB |

| research_questions.json | .json | 31.83 KB |

| resources.json | .json | 21.3 KB |

| tag_vocabulary.json | .json | 3.47 KB |

| team.json | .json | 3.84 KB |

| data | | 0.0 KB |

| data | | 0.0 KB |

| editorial_boards | | 0.0 KB |

| clusters.json | .json | 229.5 KB |

| control_cohort_assignments.json | .json | 0.1 KB |

| corpus_authors.csv | .csv | 453.11 KB |

| corpus_journals.csv | .csv | 28.08 KB |

| corpus_pull_manifest.json | .json | 72.51 KB |

| editor_authors.csv | .csv | 87.79 KB |

| editor_match_audit.json | .json | 37.1 KB |

| editor_metrics.csv | .csv | 21.77 KB |

| editor_metrics_summary.json | .json | 0.62 KB |

| editor_pre_post.csv | .csv | 8.09 KB |

| editor_pre_post_summary.json | .json | 0.25 KB |

| tenure_coverage_report.json | .json | 21.45 KB |

| themes_manifest.json | .json | 215.52 KB |

| data | | 0.0 KB |

| author_degrees.csv | .csv | 3.96 KB |

| bm25_top10.csv | .csv | 4.17 KB |

| coupling_topk.csv | .csv | 4.63 KB |

| cocitation_communities.csv | .csv | 2.63 KB |

| near_duplicate_pairs.csv | .csv | 0.05 KB |

| keyword_edges.csv | .csv | 13.07 KB |

| topic_topwords.csv | .csv | 3.25 KB |

| prisma_flow.csv | .csv | 0.27 KB |

| umap_coords.csv | .csv | 9.58 KB |

| tfidf_top10.csv | .csv | 4.29 KB |

| corpus.json | .json | 272.86 KB |

| prisma_counts.json | .json | 0.21 KB |

| data | | 0.0 KB |

| scimago | | 0.0 KB |

| abstracts.sqlite | .sqlite | 21316.0 KB |

Reproducibility

This repository includes reproducibility tools:

  • Python requirements.txt

  • Conda environment.yml

  • Makefile for automation

Status

  • Issues: Enabled
  • Wiki: Enabled
  • Pages: Enabled

README

Applied Machine Learning in Empirical Finance

A collaborative PhD research project between University of Twente and Quoniam Asset Management, advancing the application of machine learning methods in portfolio optimization and risk management.

Project Overview

  • Start Date: December 2025
  • Duration: 3-4 years
  • Funding: Industry-funded by Quoniam Asset Management
  • License: MIT

Research Focus

Primary Themes:

  • ML for Portfolio Optimization
  • Risk Management & Forecasting

ML Methods:

  • Deep Learning (neural networks, transformers, LSTMs)
  • Reinforcement Learning
  • Ensemble Methods (random forests, gradient boosting)
  • Probabilistic ML (Bayesian methods, uncertainty quantification)
  • Statistical Learning Models

Asset Classes:

  • Equities, Fixed Income, Multi-Asset, Derivatives

Team

| Name | Role | Affiliation |
| Joerg Osterrieder | Primary Supervisor & Industry Liaison | University of Twente |
| Xiaohong Huang | Co-Supervisor | University of Twente |
| Axel Gross-Klussmann | Industry Supervisor | Quoniam Asset Management |
| Dennis Hoffmann | PhD Student | Quoniam / University of Twente |

Repository Structure

Applied-Machine-Learning-in-Empirical-Finance/
├── .claude/                        # Shared Claude Code config
│   ├── CLAUDE.md                   # Project conventions (both users)
│   ├── commands/                   # Custom slash commands
│   └── hooks/                     # Git & CI hooks
├── shared/                         # Shared resources
│   ├── claim_checker/             # Claim-citation verification (Gemini/Perplexity LLM screening, parity gate, Zotero export)
│   ├── claude_hooks/              # Deployment shims for Claude Code hooks (memory-index.mjs MEMORY.md auto-index)
│   ├── dashboard_kit/             # Reusable JS-free dashboard kit (theme/charts/build/audit/guards/verify); powers /diagnostic-dashboard
│   ├── parity_check/              # Standalone tex/pdf/html parity CLI (free, no API keys)
│   ├── reference_checker/         # BibTeX/TeX consistency + CrossRef/OpenAlex API verification
│   ├── data_sources.md            # External data source documentation
│   ├── research_proposal.tex      # PhD research proposal
│   └── templates/                 # Reusable templates
│       ├── beamer/                # Beamer presentation template
│       └── project/               # Scaffold for new papers/projects
├── literature_review/              # Literature-review umbrella
│   ├── api_info/                  # External-API registration + rate-limit docs (flat <provider>.md)
│   ├── notes/                     # General research notes (adversarial review, paper methodology draft)
│   ├── bibliographic_review/      # Bibliographic-review pipeline (Kessler coupling + Leiden clustering, Donthu-2021 audit)
│   │   ├── scripts/               # 8-step orchestration: 01_corpus_pull → 08_generate_dashboard
│   │   ├── configs/               # bib_review.yaml (search, networks, clustering, render)
│   │   ├── outputs/               # themes_manifest.{json,md}, clusters.json, dashboard.html, sentinels
│   │   ├── paper/                 # bib_review.tex (body) + standalone wrapper + references.bib + figures
│   │   ├── notes/                 # Donthu-2021 9-item checklist + dashboard story
│   │   ├── tests/                 # 63-test pytest suite (corpus, networks, clustering, manifest, render, dashboard)
│   │   └── README.md              # Pipeline diagram + run modes + outputs table + troubleshooting
│   └── systematic_literature_review/  # v2.3 SLR pipeline
│       ├── scripts/                   # Subfolders: steps/, lib/, tools/, _archive/, migrations/, tests/ (run_pipeline.py + post_flight_audit.py at root)
│       ├── scripts/tests/             # Pytest suite (259 tests)
│       ├── scripts/migrations/        # Config migrators (v2.1→v2.2→v2.3, seed_id v2)
│       ├── configs/                   # YAML search configurations
│       ├── runs/                      # Pipeline output per review (runs/old/ for archived pre-refactor runs)
│       ├── diagnostics/               # External-benchmark signal evaluation (Sezer 2020)
│       ├── notes/                     # Architecture spec, enhancement RFCs
│       ├── README.md                  # Usage guide
│       └── DEVELOPMENT.md             # Developer docs
├── qam_projects/                   # Quoniam industry projects (confidential)
│   └── strategy_specific_models/  # ML models for investment strategies
├── quantlets/                      # QuantLet platform planning + AMLEF submission guide
├── docs/                           # GitHub Pages website
│   ├── index.html                 # Landing page
│   ├── team.html                  # Team members & bios
│   ├── publications.html          # Publication browser
│   ├── research.html              # Research overview & gaps
│   ├── resources.html             # Tools & resources
│   ├── news.html                  # News & updates
│   ├── what-is-an-slr.html        # Wiki: SLR methodology
│   ├── factor-zoo.html            # Wiki: factor zoo & replication crisis
│   ├── cross-sectional-return-prediction.html  # Wiki: cross-sectional ML predictability
│   ├── signal-to-weights.html     # Wiki: portfolio construction
│   ├── open-science-in-finance.html  # Wiki: reproducibility & open science
│   ├── narrative-risk.html        # Wiki: narrative risk in empirical finance
│   ├── css/, js/, data/, assets/  # Website resources
│   └── scripts/                   # Python data collection scripts
│       ├── fetch_team_info.py     # Fetch ORCID IDs from OpenAlex
│       ├── fetch_openalex.py      # Fetch publications
│       ├── analyze_research_gaps.py # Identify research gaps
│       ├── verify_publications.py # Verify publication authors
│       ├── check_links.py         # Validate website links
│       ├── download_team_photos.py # Download team member photos
│       └── fetch_logos.py         # Fetch partner logos
├── environment.yml                 # Conda environment spec
├── README.md
├── CONTRIBUTING.md
└── LICENSE

Getting Started

View the Website

Visit: https://digital-ai-finance.github.io/Applied-Machine-Learning-in-Empirical-Finance/

Update Data

To refresh publication and team data from OpenAlex:

# Set up environment
conda env create -f environment.yml
conda activate applied-ml-finance

# Fetch team information
python docs/scripts/fetch_team_info.py

# Fetch publications
python docs/scripts/fetch_openalex.py

# Analyze research gaps
python docs/scripts/analyze_research_gaps.py

Verify References

Check BibTeX entries against CrossRef/OpenAlex and TeX citation consistency (free, no API keys required):

# Full check: citation consistency + API verification
python -m shared.reference_checker paper_1/paper.tex

# Consistency only (no API calls)
python -m shared.reference_checker paper_1/paper.tex --consistency-only

# Bib-only: verify entries against APIs
python -m shared.reference_checker paper_1/references.bib

See shared/reference_checker/README.md for all options.
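As a rough illustration of what a TeX/BibTeX citation-consistency check involves, here is a minimal sketch. It is not the implementation in shared/reference_checker; the function names and regular expressions are hypothetical and cover only simple \cite-style commands and plain @type{key, entries.

```python
import re

# Hypothetical patterns: \cite, \citep, \citet*, with optional [..] arguments.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Matches the entry type and key of "@article{key," style BibTeX entries.
BIB_ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")


def cited_keys(tex: str) -> set[str]:
    """Collect every citation key used in the TeX source."""
    keys: set[str] = set()
    for group in CITE_RE.findall(tex):
        keys.update(k.strip() for k in group.split(",") if k.strip())
    return keys


def bib_keys(bib: str) -> set[str]:
    """Collect every entry key defined in the BibTeX source."""
    skip = {"comment", "string", "preamble"}
    return {key for kind, key in BIB_ENTRY_RE.findall(bib) if kind.lower() not in skip}


def consistency_report(tex: str, bib: str) -> dict[str, list[str]]:
    """Report keys cited but undefined, and keys defined but never cited."""
    cited, defined = cited_keys(tex), bib_keys(bib)
    return {
        "missing_in_bib": sorted(cited - defined),
        "uncited_in_tex": sorted(defined - cited),
    }


tex = r"Factor models \cite{gu2020,kelly2019} and \citep[see][]{harvey2016}."
bib = "@article{gu2020,\n  title={A}\n}\n@book{harvey2016,\n  title={B}\n}\n@misc{unused2021,\n  title={C}\n}"
print(consistency_report(tex, bib))
# {'missing_in_bib': ['kelly2019'], 'uncited_in_tex': ['unused2021']}
```

A check of this kind needs no network access, which is presumably why a consistency-only mode can run without API calls.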

Verify Claims (LLM screening)

Screen each (sentence, \cite{key}) pair against the cited paper's abstract using Gemini 2.5 Pro (default) or Perplexity Sonar. Requires GEMINI_API_KEY (and PERPLEXITY_API_KEY for the default Gemini fallback) in .env. Costs ~$0.03–0.38 per wiki-article run.

# Single article -> Excel report + Zotero push
python -m shared.claim_checker docs/assets/wiki/factor-zoo/factor-zoo.tex \
    -o out/factor-zoo.xlsx

See shared/claim_checker/README.md for backend selection, cost reference, and the API-key URL table. The standalone python -m shared.parity_check <tex> CLI exposes the tex/pdf/html parity gate without invoking the paid pipeline (see shared/parity_check/README.md).
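To make the (sentence, \cite{key}) unit concrete, the sketch below shows one way such pairs could be extracted from a TeX file before any LLM screening. It is an assumption-laden illustration, not the claim_checker's code: claim_pairs and both regular expressions are hypothetical, and the sentence splitter is deliberately naive (it will mis-split on abbreviations such as "et al. (2021)" followed by a capital letter).

```python
import re

# Hypothetical patterns; real TeX needs a more careful parser.
CITE_RE = re.compile(r"\\cite[a-zA-Z]*\*?(?:\[[^\]]*\])*\{([^}]*)\}")
# Split after ., ! or ? when the next token starts with a capital or a TeX command.
SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z\\])")


def claim_pairs(tex: str) -> list[tuple[str, str]]:
    """Return one (sentence, citation key) pair per key cited in each sentence."""
    pairs: list[tuple[str, str]] = []
    normalised = " ".join(tex.split())  # collapse line breaks and repeated spaces
    for sentence in SENTENCE_SPLIT_RE.split(normalised):
        for group in CITE_RE.findall(sentence):
            for key in group.split(","):
                if key.strip():
                    pairs.append((sentence, key.strip()))
    return pairs


tex = (
    r"Many factors fail to replicate \cite{harvey2016,hou2020}. "
    r"This sentence has no citation. "
    r"Trees help \citep{gu2020}."
)
for sentence, key in claim_pairs(tex):
    print(key, "<-", sentence)
```

Each pair would then be sent to the screening backend together with the abstract of the paper behind the key; sentences without a citation produce no pair and therefore no API cost.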

SLR Pipeline Flow

Queries and seeds are peer entry sources (step 01); only seeds drive citation snowballing (step 02). All sources then merge at the 02a dedup union, so query breadth and seed citation-network depth are both preserved. The diagram below mirrors the per-run dashboard flow diagram (step 10).

STEP 01: query_openalex    (entry sources, all on the same level)

┌──────────────────────┐   ┌──────────────────────┐   ┌──────────────────────┐
│ Queries              │   │ Seeds                │   │ Practitioner         │
│ query_groups →       │   │ curated anchor       │   │ discovery            │
│ OpenAlex hits        │   │ papers               │   │ (grey-lit/industry)  │
└───────────┬──────────┘   └───────────┬──────────┘   └───────────┬──────────┘
            │                          │                          │
            │                          ▼                          │
            │              ┌──────────────────────┐               │
            │              │ STEP 02: snowball    │               │
            │              │ forward + backward   │               │
            │              │ FROM SEEDS ONLY      │               │
            │              │ depth 1..N           │               │
            │              └───────────┬──────────┘               │
            │                          │ seed neighbourhood       │
            ▼                          ▼                          ▼
┌────────────────────────────────────────────────────────────────────────────┐
│ STEP 02a: dedup        UNION of all sources  (01 ∪ 02)                     │
│ 3-pass: exact-ID · working-paper→journal · cross-journal collapse          │
└──────────────────────────────────────┬─────────────────────────────────────┘
                                       │ 02_deduped_corpus.json
┌────────────────────────────────────────────────────────────────────────────┐
│ STEP 02b: prefilter + enrich    (6-stage cascade, short-circuiting)        │
│ 0 relevance → 1 year → 2 Gate-B → 3 abstract-enrich (4 APIs)               │
│             → 4 abstract-presence → 5 signal-enrich                        │
└──────────────────────────────────────┬─────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────┐
│ STEP 03 score (relevance + quality, P10)  →  03c signal analytics          │
│   →  STEP 04 per-theme selection  (6 themes, target_corpus 55)             │
└──────────────────────────────────────┬─────────────────────────────────────┘
┌────────────────────────────────────────────────────────────────────────────┐
│ STEP 05-10  bibtex · zotero · latex · obsidian · PRISMA viz · dashboard    │
└────────────────────────────────────────────────────────────────────────────┘
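The union at step 02a can be pictured with a small sketch of the first (exact-ID) dedup pass. This is only an illustration under assumptions: the Record class, its fields, and union_exact_id are hypothetical names, and the working-paper→journal and cross-journal passes from the diagram are not shown.

```python
from dataclasses import dataclass, field


@dataclass
class Record:
    """Hypothetical minimal corpus record; the real schema is not shown here."""
    work_id: str
    title: str
    sources: set[str] = field(default_factory=set)


def union_exact_id(*batches: list[Record]) -> list[Record]:
    """Merge batches, keeping one record per ID and the union of its provenance."""
    merged: dict[str, Record] = {}
    for batch in batches:
        for rec in batch:
            key = rec.work_id.strip().lower()  # normalise IDs before comparing
            if key in merged:
                merged[key].sources |= rec.sources
            else:
                merged[key] = Record(key, rec.title, set(rec.sources))
    return list(merged.values())


queries = [Record("W1", "Paper A", {"query"}), Record("W2", "Paper B", {"query"})]
seeds = [Record("W2", "Paper B", {"seed"})]
snowball = [Record("w3", "Paper C", {"snowball"}), Record("W1", "Paper A", {"snowball"})]

corpus = union_exact_id(queries, seeds, snowball)
print([(r.work_id, sorted(r.sources)) for r in corpus])
# [('w1', ['query', 'snowball']), ('w2', ['query', 'seed']), ('w3', ['snowball'])]
```

Keeping the set of sources on each merged record is one way to preserve the information that a paper was reached both by a query and through the seed neighbourhood.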

Post-Flight Audit (V1, V3, V5, V9)

After running the SLR pipeline, run the post-flight audit script to populate the verification banner artifacts:

conda activate applied-ml-finance
python literature_review/systematic_literature_review/scripts/post_flight_audit.py \
  --run-dir literature_review/systematic_literature_review/runs/<run-id>

This writes:

  • {run-dir}/selection_audit.md: V1 (borderline-case rationale) + V3 (leak-path reproduction).
  • {run-dir}/pytest_summary.txt: V5 (full SLR pytest suite) + V9 (partial-signal aggregate test).

Use --skip-pytest to write only the audit markdown (faster iteration).

Run Bibliographic Review

The bibliographic review maps the field's intellectual structure via bibliographic coupling (Kessler 1963) and co-citation networks. It shares the SLR's lexical perimeter (same query groups + tilt-keyword AND filter), so structural divergence is interpretable as citation behaviour rather than retrieval scope. The methodology is audited against the nine-item bibliometric checklist of Donthu et al. (2021).

conda activate applied-ml-finance
cd literature_review/bibliographic_review
python scripts/run_bib_review.py --config configs/bib_review.yaml --non-interactive

Outputs (selected):

  • outputs/themes_manifest.json (Pydantic-validated, canonical machine output)
  • outputs/dashboard.html (offline single-file dashboard, Jaccard heatmap + cluster cards)
  • paper/build/bib_review_standalone.pdf (>= 8-page LaTeX paper)

See literature_review/bibliographic_review/README.md for environment setup, pipeline diagram, run modes (single-step, dry-run, smoke), output catalogue, cluster-labelling workflow, reference-checker hook, and troubleshooting.
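For readers new to the two network types named in this section, the following minimal sketch computes both from a toy citation map. It illustrates the standard definitions only and is not the pipeline's code: the function names and the refs structure (paper ID mapped to the set of works it cites) are assumptions.

```python
from collections import Counter
from itertools import combinations


def coupling_strength(refs: dict[str, set[str]]) -> dict[tuple[str, str], int]:
    """Bibliographic coupling: number of references two citing papers share."""
    strengths: dict[tuple[str, str], int] = {}
    for a, b in combinations(sorted(refs), 2):
        shared = len(refs[a] & refs[b])
        if shared:
            strengths[(a, b)] = shared
    return strengths


def cocitation_counts(refs: dict[str, set[str]]) -> Counter:
    """Co-citation: number of papers whose reference lists contain both works."""
    counts: Counter = Counter()
    for cited in refs.values():
        for pair in combinations(sorted(cited), 2):
            counts[pair] += 1
    return counts


# Toy data: P1..P3 are corpus papers, R1..R4 are works they cite.
refs = {
    "P1": {"R1", "R2", "R3"},
    "P2": {"R2", "R3"},
    "P3": {"R4"},
}
print(coupling_strength(refs))              # {('P1', 'P2'): 2}
print(cocitation_counts(refs)[("R2", "R3")])  # 2
```

Coupling links the citing papers (P1 and P2 share two references), whereas co-citation links the cited works (R2 and R3 appear together in two reference lists), so the two networks describe the same citation data from opposite directions.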

Verify Source Correctness (unified)

For end-to-end source verification, the check-sources Claude Code skill orchestrates reference_checker first (free, fail-fast) and only escalates to the paid claim_checker when the cite/key graph resolves. Inside Claude Code:

check sources in docs/assets/wiki/factor-zoo/factor-zoo.tex

See .claude/skills/source-checker/SKILL.md and the Source-Checker wiki page.
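The fail-fast ordering described above can be sketched as a small gate function. This is a hypothetical illustration of the control flow only, not the skill's implementation: check_sources, free_check, and paid_check are invented names, and the two checkers are passed in as plain callables rather than invoked through their real CLIs.

```python
from typing import Callable


def check_sources(
    tex_path: str,
    free_check: Callable[[str], bool],
    paid_check: Callable[[str], dict],
) -> dict:
    """Run the free check first; call the paid check only if the free one passes."""
    if not free_check(tex_path):
        # Stop early: no paid API call is made for a file with broken citations.
        return {"stage": "reference_checker", "ok": False, "claims": None}
    return {"stage": "claim_checker", "ok": True, "claims": paid_check(tex_path)}


# Stub checkers standing in for the real tools (assumptions for the example).
def always_fail(path: str) -> bool:
    return False


def fake_claims(path: str) -> dict:
    return {"screened": 3}


print(check_sources("paper.tex", always_fail, fake_claims)["stage"])        # reference_checker
print(check_sources("paper.tex", lambda p: True, fake_claims)["claims"])   # {'screened': 3}
```

Ordering the checks this way means the paid stage is never billed for a document whose cite/key graph does not resolve.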

Project Management

We use GitHub's built-in tools for tracking progress and collaboration:

  • Issues — Task tracking with direct links to code, commits, and branches
  • Projects (Kanban Board) — Overview of PhD progress and milestones
  • Wiki — Comprehensive project documentation: setup guides, pipeline docs, conventions, and architecture

Research Questions

Key open questions driving this research:

Portfolio Optimization

  • How can deep reinforcement learning handle multi-asset portfolios with realistic constraints?
  • What is the optimal way to incorporate transaction costs into ML-based portfolio optimization?
  • Can transformer architectures capture cross-asset dependencies effectively?

Risk Management

  • How can we develop uncertainty-aware deep learning models for VaR/ES estimation?
  • What architectures work best for real-time risk monitoring with streaming data?
  • How should ML risk models be validated to meet regulatory requirements?

Methodology

  • What pre-training strategies work for financial time series foundation models?
  • How can causal inference be integrated with ML predictions for portfolio decisions?
  • What transfer learning approaches work across financial markets?

Partners

University of Twente

BMS Financial Engineering - Academic research partner providing theoretical foundations and research methodology.

Quoniam Asset Management

Frankfurt-based quantitative asset manager providing industry context, practical use cases, and data access.

Research Networks

  • COST Action Fintech and AI in Finance
  • MSCA Digital Finance

Publications

Team publications are automatically fetched from OpenAlex and displayed on the website. The data includes:

  • 160+ publications from team members
  • 80+ ML+Finance relevant papers
  • 1,000+ total citations

License

This project is licensed under the MIT License - see the LICENSE file for details.

Contact

  • Academic: University of Twente, BMS Financial Engineering
  • Industry: Quoniam Asset Management, Frankfurt

Acknowledgments

  • OpenAlex for open academic publication data
  • Digital-AI-Finance organization for hosting
  • COST Action CA19130 Fintech and AI in Finance
  • MSCA Digital Finance network

(c) Joerg Osterrieder 2025-2026