automatic-research

Reproducibility framework for LLM-agent research in finance: statistical methodology, agent-replication harness, research guidelines.


Publications

Can Large Language Models Trade? Testing Financial Theories with LLM Agents in Market Simulations

This paper presents a realistic simulated stock market where large language models (LLMs) act as heterogeneous competing trading agents. The open-source framework incorporates a persistent order book with market and limit orders, partial fills, dividends, and equilibrium clearing alongside agents wi...

Property	Value
arXiv	2504.10789
Year	2025

Authors: Alejandro Lopez-Lira

Information

Property	Value
Language	Python
Stars	0
Forks	0
Watchers	0
Open Issues	0
License	MIT License
Created	2026-04-28
Last Updated	2026-04-28
Last Push	2026-04-28
Contributors	1
Default Branch	main
Visibility	public

Topics

conformal-prediction finance llm-agents multiple-testing reproducibility bayesian-hierarchical

Datasets

This repository includes 1 dataset:

Dataset	Format	Size
prd.json	.json	30.47 KB

Reproducibility

No specific reproducibility files found.

Status

Issues: Enabled
Wiki: Disabled
Pages: Enabled

README

automatic-research

Reproducibility framework for LLM-agent research in finance.

This repository is the inaugural artifact of the digital-ai-finance GitHub organisation. It hosts a methodological program around one question: when different LLM agents perform the same research task, do they produce different solutions, and what statistical machinery, multiple-testing methodology, and reporting standards do we need so that conclusions remain trustworthy?

What v0.1 ships

Four written artifacts, a Python package skeleton, and three toy notebooks:

Critical review of Lopez-Lira (2025), "Can Large Language Models Trade?" (arXiv:2504.10789). 10 to 15 pages. Sympathetic but rigorous. Covers four critique angles: reproducibility, validity of the "real-market features" claim, statistical inference under multiple comparisons, and generalizability across LLM versions. Source in paper/review/.
Statistical framework reference document. 20 to 30 pages. Defines a variability taxonomy (model, prompt, run) and three statistical lenses that coexist under it: frequentist multiple-testing with dependent-test corrections (Romano-Wolf step-down, bootstrap reality check), Bayesian hierarchical partial-pooling models, and decision-theoretic robustness via split conformal prediction with Mondrian conditioning. Source in paper/framework/.
Research guidelines and checklists. A reporting-standards.md that maps the framework to existing reproducibility standards (ACM/NeurIPS, AEA, PRISMA 2020, TRIPOD-AI), an essential one-page checklist and an advanced checklist, plus three short annexes for academic finance, ML, and industry-quant audiences. In guidelines/.
Python package skeleton plus three toy notebooks. automatic_research package with stats stubs (multiple_testing, agreement, bayesian, decision_theoretic), a ResearchTaskManifest schema, a ResearchAgentAdapter Protocol, a live LangGraphAdapter plus Protocol-conforming AutoGenAdapter and CrewAIAdapter sibling stubs, and three toy notebooks illustrating the variability decomposition, multiple-testing corrections, and the canonical PyMC hierarchical model. In src/automatic_research/ and notebooks/.
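To make the third lens more concrete, here is a minimal NumPy sketch of split conformal prediction with Mondrian conditioning. It is an illustration under stated assumptions, not the contents of the stats stubs: the function names mondrian_quantiles and prediction_interval are hypothetical, the nonconformity score is assumed to be the absolute residual on a held-out calibration set, and the Mondrian category is assumed to be a plain label (for example, which agent produced the prediction).

```python
import math

import numpy as np


def mondrian_quantiles(residuals, groups, alpha=0.1):
    """Per-group conformal quantile of absolute calibration residuals.

    residuals: |y - y_hat| on calibration points (assumed score function).
    groups: one category label per calibration point (the Mondrian taxonomy).
    """
    residuals = np.asarray(residuals, dtype=float)
    groups = np.asarray(groups)
    out = {}
    for g in np.unique(groups):
        scores = np.sort(residuals[groups == g])
        n = scores.size
        # Finite-sample rank ceil((n + 1) * (1 - alpha)), computed within the group.
        k = math.ceil((n + 1) * (1 - alpha))
        # Too few calibration points in this group: the interval is unbounded.
        out[str(g)] = float(scores[k - 1]) if k <= n else math.inf
    return out


def prediction_interval(y_hat, group, quantiles):
    """Symmetric interval around a point prediction for the given group."""
    q = quantiles[str(group)]
    return (y_hat - q, y_hat + q)


if __name__ == "__main__":
    # Synthetic residuals only; the group names are placeholders.
    rng = np.random.default_rng(0)
    groups = np.repeat(["agent_a", "agent_b"], 200)
    scale = np.where(groups == "agent_a", 0.5, 2.0)
    residuals = np.abs(rng.normal(0.0, scale))
    q = mondrian_quantiles(residuals, groups, alpha=0.1)
    print(q)
    print(prediction_interval(1.0, "agent_b", q))
```

Calibrating each category separately is what distinguishes the Mondrian variant from plain split conformal prediction: the noisier group gets a wider interval instead of both groups sharing one pooled quantile.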

Multi-paper agenda

This repo is the kickoff of a multi-paper research stream:

Paper 1 (this v0.1). Position paper, critical review, statistical framework reference document.
Paper 2 (v0.2). Systematic survey of LLM-finance research through the reproducibility lens, PRISMA 2020 compliant.
Paper 3 (v0.3). Framework methodology paper plus large-scale empirical replication (multi-LLM, multi-task) with the framework self-applied.
Paper 4+. Domain applications: asset pricing, behavioral finance, market microstructure with LLM agents.

What v0.1 does NOT ship

By design: no empirical replication of Lopez-Lira; no large-scale multi-LLM API runs; no production-grade harness; no web app or dashboard; no formal proofs; no peer-review submission; no live AutoGen or CrewAI adapters; no multi-language stack.

Empirical credibility is earned in paper 3. v0.1 is methodological. Lock-in commitments (notation, LangGraph adapter) are explicit, time-bounded, and surfaced in paper/framework/CHANGELOG.md and the v0.1.0 release notes (KnownLimitations).

Vendor agnosticism stance

The framework is Protocol-first, not vendor-first. All agent integrations implement a common automatic_research.adapters.ResearchAgentAdapter Protocol with a run(manifest, model_id) -> RunArtifact method. In v0.1 only LangGraphAdapter is live; AutoGenAdapter and CrewAIAdapter exist as sibling Protocol-conforming stubs that raise NotImplementedError, demonstrating non-lock-in by structure rather than by promise. Methodology prose in paper/framework/sections/ is gated by a vendor-agnosticism check (no vendor model names in the methodology body); vendor identifiers live only in adapter code and in the model_versions field of the task manifest.
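The Protocol-first idea can be sketched as follows. This is a hedged illustration, not the package source: only the names ResearchAgentAdapter, ResearchTaskManifest, RunArtifact, AutoGenAdapter, model_versions, and the run(manifest, model_id) signature come from this README. Every other field, the EchoAdapter class, the run_all helper, and the assumption that model_versions maps a model identifier to a version string are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Protocol, runtime_checkable


@dataclass(frozen=True)
class ResearchTaskManifest:
    # task_id and prompt are hypothetical fields for this sketch.
    task_id: str
    prompt: str
    # Assumed shape: model identifier -> version string.
    model_versions: dict[str, str] = field(default_factory=dict)


@dataclass(frozen=True)
class RunArtifact:
    # All fields are hypothetical.
    task_id: str
    model_id: str
    output: str


@runtime_checkable
class ResearchAgentAdapter(Protocol):
    def run(self, manifest: ResearchTaskManifest, model_id: str) -> RunArtifact: ...


class EchoAdapter:
    """Hypothetical toy adapter that calls no LLM; it only echoes the prompt."""

    def run(self, manifest: ResearchTaskManifest, model_id: str) -> RunArtifact:
        return RunArtifact(manifest.task_id, model_id, f"[{model_id}] {manifest.prompt}")


class AutoGenAdapter:
    """Stub with the same method shape, mirroring the README's description."""

    def run(self, manifest: ResearchTaskManifest, model_id: str) -> RunArtifact:
        raise NotImplementedError("not live in this sketch")


def run_all(adapter: ResearchAgentAdapter, manifest: ResearchTaskManifest) -> list[RunArtifact]:
    """Run one task once per model listed in the manifest (hypothetical helper)."""
    return [adapter.run(manifest, model_id) for model_id in manifest.model_versions]


if __name__ == "__main__":
    manifest = ResearchTaskManifest("t1", "hello", {"model-a": "v1", "model-b": "v2"})
    for artifact in run_all(EchoAdapter(), manifest):
        print(artifact)
    # Structural typing: the stub conforms without inheriting from anything.
    print(isinstance(AutoGenAdapter(), ResearchAgentAdapter))
```

Because conformance is structural, harness code like run_all depends only on the method shape, so swapping one adapter for another requires no change outside the adapter module.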

Repository layout

automatic-research/
├── LICENSE                            (MIT)
├── README.md                          (this file)
├── CLAUDE.md                          (repo-local agent instructions, house style)
├── prd.json                           (v0.1 PRD with 10 stories and acceptance criteria)
├── pyproject.toml                     (Python 3.11+, ruff, mypy, pytest, nbconvert)
├── claude-logs/                       (per-session process logs)
├── prompt-logs/                       (per-prompt logs)
├── planning-logs/                     (planning artifacts)
├── paper/
│   ├── shared/                        (refs.bib, style.sty, macros.tex)
│   ├── review/                        (Lopez-Lira critical review)
│   └── framework/                     (statistical framework reference doc)
├── guidelines/                        (reporting-standards, checklists, annexes)
├── src/automatic_research/            (Python package)
│   ├── manifest.py
│   ├── harness.py
│   ├── adapters/                      (langgraph live; autogen, crewai stubs)
│   └── stats/                         (multiple_testing, agreement, bayesian, decision_theoretic)
├── notebooks/                         (3 toy notebooks)
├── tests/
├── scripts/                           (lint_house_style, check_*)
└── .github/workflows/                 (ci, release)

Quick start

git clone https://github.com/digital-ai-finance/automatic-research.git
cd automatic-research
pip install -e .
pytest

Build the papers (requires a working LaTeX distribution and latexmk):

latexmk -pdf paper/review/main.tex
latexmk -pdf paper/framework/main.tex

Execute the toy notebooks:

jupyter nbconvert --to notebook --execute notebooks/01_variability_axes.ipynb --output /tmp/out.ipynb

House style

This project follows a strict no-em-dash, no-en-dash punctuation rule across all generated text (LaTeX, markdown, code comments, commit messages). Use commas, parentheses, colons, hyphens, or sentence breaks instead. The rule is enforced by scripts/lint_house_style.py in CI.
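A check of this kind can be sketched in a few lines. The snippet below is an assumption about how such a lint might work, not the contents of scripts/lint_house_style.py; the function name find_violations and the reported tuple format are hypothetical. The forbidden characters are written as Unicode escapes so that the checker itself passes the rule.

```python
import re

# U+2013 is the en dash, U+2014 is the em dash.
FORBIDDEN = re.compile("[\u2013\u2014]")
NAMES = {"\u2013": "en dash", "\u2014": "em dash"}


def find_violations(text):
    """Return (line, column, name) for every forbidden dash, 1-indexed."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in FORBIDDEN.finditer(line):
            hits.append((lineno, match.start() + 1, NAMES[match.group()]))
    return hits


if __name__ == "__main__":
    sample = "fine - hyphen\nbad \u2014 dash"
    for lineno, col, name in find_violations(sample):
        print(f"line {lineno}, column {col}: {name}")
```

In CI, a wrapper would typically read each tracked text file, call the function, and exit non-zero if any hit is returned.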

Citing

Until a tagged release exists, please cite this repository as work-in-progress:

@misc{automatic_research_2026,
  title  = {{automatic-research}: a reproducibility framework for {LLM}-agent research in finance},
  author = {Osterrieder, J{\"o}rg and {Digital AI Finance contributors}},
  year   = {2026},
  howpublished = {\url{https://github.com/digital-ai-finance/automatic-research}},
}

Once v0.1.0 is released, replace with the tagged Zenodo or arXiv DOI.

License

MIT. See LICENSE.