Data Management Plan | Narrative Digital Finance

1 Data collection and documentation

1.1 What data will you collect, observe, generate or reuse?

We use publicly available data from financial markets (accessed via original source URLs) and commercial databases (accessed via provider APIs). Public data is not re-stored by us. Commercial data remains on provider infrastructure per license agreements. Project-generated code, datasets, and research outputs are archived on Zenodo with DOIs.

Commercial Data Sources (Licensed):

Source	Description	Format	Volume	Time Range
RavenPack	News headlines with sentiment scores	CSV, Parquet	~3 GB	2000-2025
Deutsche Borse T7	Nanosecond-level trading data for FESX and DAX futures	Binary, CSV	~50 GB	Jan 2021 - Sep 2024

Public Data Sources:

Source	Description	Format	URL
BIS gingado	Central bank speeches worldwide (1996-2025)	Text, PDF	bis.org/cbspeeches/
St. Louis FED FRED	Macroeconomic indicators (CPI, PPI, GDP, etc.)	CSV via API	fred.stlouisfed.org/

Generated Research Outputs:

Output	Type	Format	Repository
Analysis code and methods	Software	Python	Zenodo
CB Speech Transcripts Dataset	Dataset	CSV	Zenodo
Daily Evergreen Narrative Sentiment	Dataset	CSV	Zenodo
Macro Regime Detection Notebooks	Software	Jupyter	Zenodo
Research posters and preprints	Publication	PDF	Zenodo, SSRN, arXiv

Note on data volume: Total data volume increased from the original estimate of ~10 GB to ~55 GB due to the addition of Deutsche Borse T7 high-frequency trading data (~50 GB) through a research collaboration established after project start.

1.2 How will the data be collected, observed or generated?

The quality of the collected data is checked during the first work package and is an integral part of the research. We apply statistical methods to address data shortcomings. Accompanying documents describe each dataset, its quality, and the methods used to check its consistency.
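The plan does not name the specific checks, so the following is only a minimal sketch of what a completeness and consistency check on a daily series could look like. The function names, the weekday calendar, and the threshold of 5 are illustrative assumptions, not the project's actual procedure.

```python
from datetime import date, timedelta
from statistics import median


def missing_weekdays(dates):
    """Return weekdays (Mon-Fri) absent between the first and last observed date."""
    observed = set(dates)
    day, last = min(observed), max(observed)
    gaps = []
    while day <= last:
        if day.weekday() < 5 and day not in observed:
            gaps.append(day)
        day += timedelta(days=1)
    return gaps


def mad_outliers(values, threshold=5.0):
    """Return indices whose distance from the median exceeds threshold * MAD."""
    center = median(values)
    mad = median(abs(v - center) for v in values)
    if mad == 0:
        return []  # no spread, so no point can be ranked as an outlier
    return [i for i, v in enumerate(values) if abs(v - center) / mad > threshold]
```

A real check would use the exchange or publication calendar instead of plain weekdays, since public holidays would otherwise be reported as gaps.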

Data Collection Methods:

API Access: FRED API (Python fredapi), BIS API for central bank speeches
Commercial Data: RavenPack SQL access, Deutsche Borse secure file transfer
NLP Processing: HuggingFace transformers, BERT/FinBERT embeddings, BERTopic modeling, Named Entity Recognition (NER) for institution and region tagging
LLM-based Analysis: OpenAI gpt-4o-mini for daily narrative sentiment scoring (Evergreen narratives)
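As an illustration of the API access route, the sketch below builds a request URL for the FRED observations endpoint and parses a response. The project itself uses the fredapi package; this version calls the REST endpoint directly only to stay free of dependencies. The helper names and the sample payload values are assumptions made for the example.

```python
import json
from urllib.parse import urlencode

FRED_OBSERVATIONS_URL = "https://api.stlouisfed.org/fred/series/observations"


def build_fred_url(series_id, api_key, start="2000-01-01"):
    """Build the request URL for one FRED series in JSON format."""
    params = {
        "series_id": series_id,
        "api_key": api_key,
        "file_type": "json",
        "observation_start": start,
    }
    return f"{FRED_OBSERVATIONS_URL}?{urlencode(params)}"


def parse_observations(payload):
    """Convert a FRED JSON response into (date, value) pairs.

    FRED reports missing observations as the string ".", mapped here to None.
    """
    rows = []
    for obs in json.loads(payload)["observations"]:
        value = None if obs["value"] == "." else float(obs["value"])
        rows.append((obs["date"], value))
    return rows


# Made-up payload in the shape FRED returns; the numbers are not real data.
sample = '{"observations": [{"date": "2024-01-01", "value": "1.5"}, {"date": "2024-02-01", "value": "."}]}'
rows = parse_observations(sample)
```

Keeping missing values as None rather than dropping them lets the later quality checks count gaps explicitly.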

Quality Assurance:

Data checked for consistency and completeness using statistical methods
Change point detection to identify structural economic shifts
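The plan does not state which change point method is used. As a hedged illustration of the idea, the sketch below finds a single shift in the mean by choosing the split that minimizes the within-segment squared error. The function name and the min_segment parameter are assumptions; production work would more likely rely on a dedicated library and allow multiple change points.

```python
def best_mean_shift(series, min_segment=2):
    """Return (index, cost) of the split that minimizes within-segment squared error.

    The returned index is the position of the first observation after the shift.
    """
    n = len(series)
    if n < 2 * min_segment:
        raise ValueError("series too short for the requested segment length")

    def sse(segment):
        # Sum of squared deviations from the segment mean.
        segment_mean = sum(segment) / len(segment)
        return sum((x - segment_mean) ** 2 for x in segment)

    best_index, best_cost = None, float("inf")
    for k in range(min_segment, n - min_segment + 1):
        cost = sse(series[:k]) + sse(series[k:])
        if cost < best_cost:
            best_index, best_cost = k, cost
    return best_index, best_cost
```

This exhaustive search is quadratic in the series length, which is acceptable for monthly macroeconomic indicators but not for high-frequency data.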

Versioning: Code and databases are versioned with Git. Data releases receive DOIs via Zenodo integration.

1.3 What documentation and metadata will you provide with the data?

The data, their sources, and the collection processes are documented in detail.

Metadata Standards:

Zenodo metadata (DataCite schema) for all archived outputs
Standard README documentation with variable descriptions and usage instructions
Data dictionaries with variable names, types, and descriptions
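A data dictionary of the kind listed above could be drafted automatically from a CSV file and then completed by hand. The sketch below is one possible approach; the column names in the usage example are hypothetical, and the type inference is deliberately simple.

```python
import csv
import io


def infer_type(values):
    """Return 'integer', 'float', or 'string' for a column's non-empty values."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    for label, cast in (("integer", int), ("float", float)):
        try:
            for v in non_empty:
                cast(v)
            return label
        except ValueError:
            continue
    return "string"


def build_data_dictionary(csv_text, descriptions=None):
    """Return one entry (name, type, description) per CSV column."""
    descriptions = descriptions or {}
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    return [
        {
            "name": column,
            "type": infer_type([row[column] for row in rows]),
            "description": descriptions.get(column, ""),
        }
        for column in (reader.fieldnames or [])
    ]


# Hypothetical columns, for illustration only.
example_csv = "date,sentiment,n_articles\n2024-01-02,0.12,35\n2024-01-03,-0.05,41\n"
dictionary = build_data_dictionary(example_csv, {"sentiment": "Daily sentiment score"})
```

Dates are reported as strings here; a fuller version would recognise ISO dates as their own type.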

Documentation Provided:

README.md files in all repositories with variable descriptions
Inline code comments where helpful
Dependency files (requirements.txt) for reproducibility
Jupyter notebooks with step-by-step methodology
CC-BY 4.0 licensing for all project outputs

2 Ethics, legal and security issues

2.1 How will ethical issues be addressed and handled?

No personal data or other sensitive data is used in the project. The conditions have been discussed with the data providers. No special security standards are required by the data providers for this data.

Ethical Considerations:

All financial data is aggregated market data with no individual identification
Central bank speeches are official public communications
Research outputs are intended for academic purposes

NLP Bias Assessment Framework:

Sentiment distribution analysis by sector and geographic coverage
Temporal drift monitoring for NLP model outputs
Limitations section included in all publications
English-language dominance acknowledged in central bank speech analysis
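The first two items of this framework can be illustrated with a small sketch: average sentiment per group (for example sector or region), and a simple drift measure that compares the most recent window of scores with the earliest one. The record layout, the field name score, and the drift definition are assumptions for the example; the project may use different statistics.

```python
from collections import defaultdict
from statistics import mean


def mean_by_group(records, key):
    """Average the 'score' field for each value of `key` (e.g. sector or region)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[key]].append(record["score"])
    return {group: mean(scores) for group, scores in groups.items()}


def drift(scores, window):
    """Mean of the last `window` scores minus the mean of the first `window` scores."""
    if len(scores) < 2 * window:
        raise ValueError("need at least two full windows")
    return mean(scores[-window:]) - mean(scores[:window])
```

Large gaps between group means, or a drift value far from zero, would be a prompt for closer inspection rather than proof of bias.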

Data Provider Agreements:

Provider	Agreement Type	Terms
RavenPack	Research subscription	Academic use only
Deutsche Borse	Research collaboration	Project-specific usage

Ethics Approval: Not required (non-human-subjects research per SNSF guidelines)

2.2 How will data access and security be managed?

We use private cloud solutions. No sensitive data or personal data is collected in the project.

Access Control:

Data Category	Access Level	Storage Location
Public data (FRED, BIS)	Open access	Original source URLs (not re-stored by us)
Commercial data	Provider access only	Provider cloud infrastructure (not redistributable)
Project code	Open access	GitHub and Zenodo (public)
Project datasets	Open access	Zenodo (CC-BY 4.0)

Security Measures: Commercial data remains on provider infrastructure per license agreements. Relevant project code is public on GitHub and Zenodo.

2.3 How will you handle copyright and Intellectual Property Rights issues?

The project is based on data that are largely publicly available. Raw data records from commercial providers are subject to license restrictions and may not be redistributed.

Copyright Framework:

Data	Copyright Holder	Our Rights	License
RavenPack news	RavenPack Inc.	Academic use only	Proprietary
Deutsche Borse data	Deutsche Borse AG	Research collaboration	Agreement
BIS speeches	BIS/Speakers	Full reuse (cite)	Public
Our code	Project team	Open source	MIT License
Our datasets	Project team	Open access	CC-BY 4.0

Publication Rights:

Yes: Aggregated statistics and derived features
Yes: Code, models, and methodology
Yes: Project-generated datasets (Zenodo)
No: Raw commercial data redistribution

3 Data storage and preservation

3.1 How will your data be stored and backed-up during the research?

Storage capacity is ample relative to the project's limited data volume. Relevant data is managed in databases hosted on GitHub, Zenodo, and private cloud solutions. Backups and versions of the data are created continuously.

Primary Storage:

Repository	Content	Capacity	Backup
Zenodo	Datasets, code, posters, preprints	50 GB per record	CERN Data Centre
GitHub	Code, documentation	Unlimited	Git version control
Private cloud	Working copies	As needed	Regular backup

3.2 What is your data preservation plan?

Most relevant data is stored on Zenodo for long-term preservation. There is no obligation to destroy the data.

10-Year Retention Strategy:

All project outputs archived on Zenodo with persistent DOIs
Zenodo hosted at CERN Data Centre with guaranteed long-term preservation
Public data remains at original sources (FRED, BIS)
Commercial data is the providers' responsibility (not archived by us)

DOI Versioning Policy:

Zenodo provides a Concept DOI (resolves to the latest version) and a Version DOI (resolves to a specific snapshot)
New version created for any data change; README updated for documentation changes
CHANGELOG.md maintained in each Zenodo deposit
Citation format included in all README files
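A citation string for the README files could be generated from the deposit metadata. The sketch below assumes a layout loosely modelled on the citations Zenodo displays; the function names, the exact layout, and the argument values in any call are placeholders, not the project's official citation.

```python
def doi_url(doi):
    """Return the resolvable https://doi.org/ link for a DOI."""
    return f"https://doi.org/{doi}"


def format_citation(authors, year, title, doi, resource_type="Data set", version=None):
    """Format a citation; the version is included only when one is given."""
    version_part = f" (Version {version})" if version else ""
    return (
        f"{authors} ({year}). {title}{version_part} "
        f"[{resource_type}]. Zenodo. {doi_url(doi)}"
    )
```

Passing a Version DOI cites one specific snapshot, while passing the Concept DOI yields a citation that always resolves to the latest version.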

Deutsche Borse T7 Data: Due to the research collaboration agreement, raw HFT data cannot be archived publicly. Analysis methodology and derived features are documented to enable reproducibility.

4 Data sharing and reuse

4.1 How and where will the data be shared?

The relevant code developed during the project together with all necessary accompanying documentation will be stored on Zenodo. Zenodo offers safe storage for all data and research outputs in CERN's Data Centre.

Published Research Outputs:

Output	Repository	DOI	Status
World Central Banker Speech Transcripts (1996-2025)	Zenodo	10.5281/zenodo.18034730	Published
Daily Evergreen Narrative Sentiment (2004-2025)	Zenodo	10.5281/zenodo.18036051	Published
Macroeconomic Regime Detection Notebooks	Zenodo	10.5281/zenodo.18157708	Published
HFT Market Quality Poster (QuantMinds 2024)	Zenodo	10.5281/zenodo.18167476	Published
CB Communications AI Framework Poster (Freiburg 2025)	Zenodo	10.5281/zenodo.18167572	Published
HFT Impact on Market Liquidity (Preprint)	SSRN	Pending	Submitted
Systematic Literature Review (Financial Innovation)	arXiv	Pending	Under Review
Project Website	GitHub Pages	N/A	Live

Repository URLs:

Website: digital-ai-finance.github.io/Narrative-Digital-Finance/
Code and Datasets: zenodo.org

Sharing by Data Type:

Data Type	Shareable	Location/Reason
Public data (FRED, BIS)	Yes	Available at original sources (not re-distributed)
Commercial data (RavenPack, Deutsche Borse)	No	License restrictions (remains on provider servers)
Project code	Yes	Zenodo and GitHub (MIT License)
Project datasets	Yes	Zenodo (CC-BY 4.0)
Research posters	Yes	Zenodo (CC-BY 4.0)
Preprints	Yes	SSRN, arXiv (Open Access)

4.2 Are there any necessary limitations to protect sensitive data?

We do not use sensitive data in the project. The data come from conventional data providers and are originally collected from public sources.

Commercial Data Post-Project Access:

Commercial data agreements are project-duration specific
Analyses using commercial data are reproducible via methodology documentation
Derived features (shareable) are archived with sufficient detail

4.3 All digital repositories I will choose conform to the FAIR Data Principles.

Yes

FAIR Data Principles Implementation:

Principle	GitHub	Zenodo
F1 - Findable	Public repository, searchable	DOIs for all deposits
A1 - Accessible	HTTPS access	HTTPS, no authentication required
I1 - Interoperable	Standard Python/Jupyter formats	Standard formats (CSV, JSON, PDF)
R1 - Reusable	MIT License, README files	CC-BY 4.0, DataCite metadata

4.4 I will choose digital repositories maintained by a non-profit organisation.

Yes

Repository Operators:

Zenodo: Operated by CERN (European Organization for Nuclear Research) - non-profit
arXiv: Operated by Cornell University - non-profit
GitHub: Operated by Microsoft - commercial, but free for public repositories
