automatic-research
Reproducibility framework for LLM-agent research in finance: statistical methodology, agent-replication harness, research guidelines.
Publications
Can Large Language Models Trade? Testing Financial Theories with LLM Agents in Market Simulations
This paper presents a realistic simulated stock market where large language models (LLMs) act as heterogeneous competing trading agents. The open-source framework incorporates a persistent order book with market and limit orders, partial fills, dividends, and equilibrium clearing alongside agents wi...
| Property | Value |
|---|---|
| arXiv | 2504.10789 |
| Year | 2025 |
Authors: Alejandro Lopez-Lira
Information
| Property | Value |
|---|---|
| Language | Python |
| Stars | 0 |
| Forks | 0 |
| Watchers | 0 |
| Open Issues | 0 |
| License | MIT License |
| Created | 2026-04-28 |
| Last Updated | 2026-04-28 |
| Last Push | 2026-04-28 |
| Contributors | 1 |
| Default Branch | main |
| Visibility | public |
Topics
conformal-prediction finance llm-agents multiple-testing reproducibility bayesian-hierarchical
Datasets
This repository includes 1 dataset:
| Dataset | Format | Size |
|---|---|---|
| prd.json | .json | 30.47 KB |
Reproducibility
No specific reproducibility files found.
Status
- Issues: Enabled
- Wiki: Disabled
- Pages: Enabled
README
automatic-research
Reproducibility framework for LLM-agent research in finance.
This repository is the inaugural artifact of the digital-ai-finance GitHub organisation. It hosts a methodological program around one question: when different LLM agents perform the same research task, do they produce different solutions, and what statistical machinery, multiple-testing methodology, and reporting standards do we need so that conclusions remain trustworthy?
What v0.1 ships
Four written artifacts plus a Python package skeleton plus three toy notebooks.
- Critical review of Lopez-Lira (2025), "Can Large Language Models Trade?" (arXiv:2504.10789). 10 to 15 pages. Sympathetic but rigorous. Covers four critique angles: reproducibility, validity of the "real-market features" claim, statistical inference under multiple comparisons, and generalizability across LLM versions. Source in paper/review/.
- Statistical framework reference document. 20 to 30 pages. Defines a variability taxonomy (model, prompt, run) and three statistical lenses that coexist under it: frequentist multiple-testing with dependent-test corrections (Romano-Wolf step-down, bootstrap reality check), Bayesian hierarchical partial-pooling models, and decision-theoretic robustness via split conformal prediction with Mondrian conditioning. Source in paper/framework/.
- Research guidelines and checklists. A reporting-standards.md that maps the framework to existing reproducibility standards (ACM/NeurIPS, AEA, PRISMA 2020, TRIPOD-AI), an essential one-page checklist and an advanced checklist, plus three short annexes for academic finance, ML, and industry-quant audiences. In guidelines/.
- Python package skeleton plus three toy notebooks. automatic_research package with stats stubs (multiple_testing, agreement, bayesian, decision_theoretic), a ResearchTaskManifest schema, a ResearchAgentAdapter Protocol, a live LangGraphAdapter plus Protocol-conforming AutoGenAdapter and CrewAIAdapter sibling stubs, and three toy notebooks illustrating the variability decomposition, multiple-testing corrections, and the canonical PyMC hierarchical model. In src/automatic_research/ and notebooks/.
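The variability taxonomy (model, prompt, run) can be illustrated with a minimal sketch. It assumes a balanced design in which prompts are nested within models, and it uses the law of total variance with population variances, so the three components sum exactly to the total. The function name `decompose_variability`, the array layout, and the synthetic data are assumptions made for illustration; they are not the package's API or the framework's definition of the taxonomy.

```python
import numpy as np


def decompose_variability(scores: np.ndarray) -> dict[str, float]:
    """Split the variance of a balanced (model, prompt, run) array into three parts.

    Hypothetical helper: prompts are treated as nested within models, and all
    variances are population variances (ddof=0), so the parts sum to the total.
    """
    if scores.ndim != 3:
        raise ValueError("expected shape (n_models, n_prompts, n_runs)")
    cell_means = scores.mean(axis=2)       # mean over runs, per (model, prompt) cell
    model_means = cell_means.mean(axis=1)  # mean over prompts, per model
    return {
        "model": float(model_means.var()),               # between-model variance
        "prompt": float(cell_means.var(axis=1).mean()),  # between-prompt, within model
        "run": float(scores.var(axis=2).mean()),         # run-to-run noise within a cell
        "total": float(scores.var()),
    }


# Synthetic data only: 3 models x 4 prompts x 5 runs, not real agent output.
rng = np.random.default_rng(0)
model_effect = rng.normal(0.0, 1.0, size=(3, 1, 1))
prompt_effect = rng.normal(0.0, 0.5, size=(3, 4, 1))
run_noise = rng.normal(0.0, 0.2, size=(3, 4, 5))
scores = model_effect + prompt_effect + run_noise

parts = decompose_variability(scores)
shares = {k: parts[k] / parts["total"] for k in ("model", "prompt", "run")}
print(shares)
```

The shares give a first descriptive look at where disagreement comes from. An unbalanced design, or prompts crossed with models rather than nested, would need a proper variance-components model instead of this identity.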
Multi-paper agenda
This repo is the kickoff of a multi-paper research stream:
- Paper 1 (this v0.1). Position paper, critical review, statistical framework reference document.
- Paper 2 (v0.2). Systematic survey of LLM-finance research through the reproducibility lens, PRISMA 2020 compliant.
- Paper 3 (v0.3). Framework methodology paper plus large-scale empirical replication (multi-LLM, multi-task) with the framework self-applied.
- Paper 4+. Domain applications: asset pricing, behavioral finance, market microstructure with LLM agents.
What v0.1 does NOT ship
By design: no empirical replication of Lopez-Lira; no large-scale multi-LLM API runs; no production-grade harness; no web app or dashboard; no formal proofs; no peer-review submission; no live AutoGen or CrewAI adapters; no multi-language stack.
Empirical credibility is earned in paper 3. v0.1 is methodological. Lock-in commitments (notation, LangGraph adapter) are explicit, time-bounded, and surfaced in paper/framework/CHANGELOG.md and the v0.1.0 release notes (KnownLimitations).
Vendor agnosticism stance
The framework is Protocol-first, not vendor-first. All agent integrations implement a common automatic_research.adapters.ResearchAgentAdapter Protocol with a run(manifest, model_id) -> RunArtifact method. In v0.1 only LangGraphAdapter is live; AutoGenAdapter and CrewAIAdapter exist as sibling Protocol-conforming stubs that raise NotImplementedError, demonstrating non-lock-in by structure rather than by promise. Methodology prose in paper/framework/sections/ is gated by a vendor-agnosticism check (no vendor model names in the methodology body); vendor identifiers live only in adapter code and in the model_versions field of the task manifest.
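The Protocol-first pattern described above could look roughly like the following sketch. Only the names ResearchAgentAdapter, ResearchTaskManifest, RunArtifact, the run(manifest, model_id) signature, and the model_versions field come from this README. The other fields (`task_id`, `payload`) and the `EchoAdapter` and `StubAdapter` classes are hypothetical stand-ins, and the real package may be structured differently.

```python
from dataclasses import dataclass, field
from typing import Any, Protocol, runtime_checkable


@dataclass
class ResearchTaskManifest:
    # Hypothetical fields; only model_versions is named in the README.
    task_id: str
    model_versions: dict[str, str] = field(default_factory=dict)


@dataclass
class RunArtifact:
    # Hypothetical container for the output of one run.
    task_id: str
    model_id: str
    payload: dict[str, Any]


@runtime_checkable
class ResearchAgentAdapter(Protocol):
    def run(self, manifest: ResearchTaskManifest, model_id: str) -> RunArtifact: ...


class EchoAdapter:
    """Toy stand-in for a live adapter; it calls no agent framework."""

    def run(self, manifest: ResearchTaskManifest, model_id: str) -> RunArtifact:
        return RunArtifact(manifest.task_id, model_id, {"status": "ok"})


class StubAdapter:
    """Conforms to the Protocol structurally but is not implemented."""

    def run(self, manifest: ResearchTaskManifest, model_id: str) -> RunArtifact:
        raise NotImplementedError("adapter not implemented in this sketch")


manifest = ResearchTaskManifest(task_id="toy-task", model_versions={"model-a": "v1"})
for adapter in (EchoAdapter(), StubAdapter()):
    # runtime_checkable only verifies that a run attribute exists, not its signature.
    conforms = isinstance(adapter, ResearchAgentAdapter)
    try:
        result = adapter.run(manifest, "model-a")
        print(type(adapter).__name__, conforms, result.payload)
    except NotImplementedError as exc:
        print(type(adapter).__name__, conforms, f"not implemented: {exc}")
```

Because conformance is structural, neither adapter inherits from a vendor base class. A harness that depends only on the Protocol can swap one adapter for another without code changes.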
Repository layout
automatic-research/
├── LICENSE (MIT)
├── README.md (this file)
├── CLAUDE.md (repo-local agent instructions, house style)
├── prd.json (v0.1 PRD with 10 stories and acceptance criteria)
├── pyproject.toml (Python 3.11+, ruff, mypy, pytest, nbconvert)
├── claude-logs/ (per-session process logs)
├── prompt-logs/ (per-prompt logs)
├── planning-logs/ (planning artifacts)
├── paper/
│ ├── shared/ (refs.bib, style.sty, macros.tex)
│ ├── review/ (Lopez-Lira critical review)
│ └── framework/ (statistical framework reference doc)
├── guidelines/ (reporting-standards, checklists, annexes)
├── src/automatic_research/ (Python package)
│ ├── manifest.py
│ ├── harness.py
│   ├── adapters/ (langgraph live; autogen, crewai stubs)
│ └── stats/ (multiple_testing, agreement, bayesian, decision_theoretic)
├── notebooks/ (3 toy notebooks)
├── tests/
├── scripts/ (lint_house_style, check_*)
└── .github/workflows/ (ci, release)
Quick start
git clone https://github.com/digital-ai-finance/automatic-research.git
cd automatic-research
pip install -e .
pytest
Build the papers (requires a working LaTeX distribution and latexmk):
Execute the toy notebooks:
jupyter nbconvert --to notebook --execute notebooks/01_variability_axes.ipynb --output /tmp/out.ipynb
House style
This project follows a strict no-em-dash, no-en-dash punctuation rule across all generated text (LaTeX, markdown, code comments, commit messages). Use commas, parentheses, colons, hyphens, or sentence breaks instead. The rule is enforced by scripts/lint_house_style.py in CI.
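A check of this kind can be sketched in a few lines. The version below is an illustration under stated assumptions, not the contents of scripts/lint_house_style.py: the function name `find_dash_violations` and the reporting format are hypothetical, and the dashes are written as Unicode escapes so the snippet would pass its own rule.

```python
import re

# U+2013 (en dash) and U+2014 (em dash), written as escapes on purpose.
FORBIDDEN_DASHES = re.compile("[\u2013\u2014]")


def find_dash_violations(text: str) -> list[tuple[int, int, str]]:
    """Return (line_number, column, codepoint) for each forbidden dash, 1-indexed."""
    violations = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for match in FORBIDDEN_DASHES.finditer(line):
            codepoint = f"U+{ord(match.group()):04X}"
            violations.append((line_no, match.start() + 1, codepoint))
    return violations


# Sample text: line 1 is clean, line 2 has an en dash, line 3 has an em dash.
sample = "Use commas, not dashes.\nA range like 10\u201315 fails.\nSo does this \u2014 aside."
for line_no, col, codepoint in find_dash_violations(sample):
    print(f"line {line_no}, col {col}: {codepoint}")
```

In CI, a wrapper would typically read each tracked text file, call the function, and exit non-zero if any violation is returned. Plain hyphens are not flagged.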
Citing
Until a tagged release exists, please cite this repository as work-in-progress:
@misc{automatic_research_2026,
title = {{automatic-research}: a reproducibility framework for {LLM}-agent research in finance},
author = {Osterrieder, J{\"o}rg and {Digital AI Finance contributors}},
year = {2026},
howpublished = {\url{https://github.com/digital-ai-finance/automatic-research}},
}
Once v0.1.0 is released, replace with the tagged Zenodo or arXiv DOI.
License
MIT. See LICENSE.