DMP Comparison | Narrative Digital Finance

1. Data collection and documentation

1.1 What data will you collect, observe, generate or reuse?

Original

We will use publicly available data on financial markets as well as data from commercial databases that we purchase, e.g. from Refinitiv. The data will be in CSV format and total about 10 GB. The data are financial data, stored as integers and characters.

Enhanced

Commercial Data Sources (Licensed):

Source	Format	Volume
RavenPack	CSV, Parquet	~3 GB
Deutsche Borse T7	Binary, CSV	~50 GB

Public Data Sources:

Source	Format	URL
BIS Gigando	Text, PDF	bis.org/cbspeeches/
St. Louis FED FRED	CSV via API	fred.stlouisfed.org/

Generated Research Outputs:

Analysis code (Python) - Zenodo
CB Speech Transcripts - Zenodo DOI
Daily Evergreen Sentiment - Zenodo DOI
Macro Regime Notebooks - Zenodo DOI
Research posters/preprints - Zenodo/SSRN/arXiv

1.2 How will the data be collected, observed or generated?

Original

Quality assurance

- The data are checked for consistency and completeness using statistical methods.

Versioning

- The code and the database are versioned with the help of the ZHAW internal tools.

The quality and consistency of the collected data will be checked during the first work package; this check is an integral part of the research. Where the data have shortcomings, we will apply statistical methods to address them. Several documents will describe the dataset, its quality, and the methods used to check its consistency.

Enhanced

Data Collection Methods:

API Access: FRED API, BIS API for central bank speeches
Commercial: RavenPack SQL, Deutsche Borse SFTP
NLP: HuggingFace transformers, FinBERT, BERTopic, NER tagging
LLM: OpenAI gpt-4o-mini for narrative sentiment
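
The API-based collection step could look roughly like the following minimal sketch for FRED. It is an illustration, not the project's actual pipeline: the function names are hypothetical, the API key must be supplied by the user, and only the public `series/observations` endpoint with JSON output is assumed. Missing observations, which FRED reports as ".", are mapped to None so that later quality checks can count them.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

FRED_OBSERVATIONS_URL = "https://api.stlouisfed.org/fred/series/observations"


def build_fred_url(series_id, api_key, start=None):
    """Build the request URL for one FRED series in JSON format."""
    params = {"series_id": series_id, "api_key": api_key, "file_type": "json"}
    if start is not None:
        params["observation_start"] = start  # ISO date, e.g. "2020-01-01"
    return FRED_OBSERVATIONS_URL + "?" + urlencode(params)


def parse_observations(payload):
    """Turn a FRED JSON payload into (date, value) pairs.

    FRED marks missing values with "."; these become None so that the
    gaps stay visible to later completeness checks.
    """
    rows = []
    for obs in payload["observations"]:
        raw = obs["value"]
        rows.append((obs["date"], None if raw == "." else float(raw)))
    return rows


def fetch_series(series_id, api_key, start=None):
    """Download and parse one series (requires network access and a key)."""
    with urlopen(build_fred_url(series_id, api_key, start)) as response:
        return parse_observations(json.load(response))
```

Keeping URL construction and parsing separate from the network call means the parsing logic can be tested offline against a stored response.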

Quality Assurance:

Statistical consistency checks, change point detection.
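
As an illustration of what such checks might involve, the sketch below counts missing values and duplicate dates, and locates a single mean shift with a simple CUSUM statistic. The plan does not name a specific change point method, so CUSUM is an assumption chosen for brevity, and both function names are hypothetical. Observations are assumed to be (date, value) pairs, with None marking a missing value.

```python
from statistics import mean


def completeness_report(rows):
    """Count missing values and duplicate dates in (date, value) pairs."""
    dates = [d for d, _ in rows]
    missing = sum(1 for _, v in rows if v is None)
    return {
        "n_rows": len(rows),
        "n_missing": missing,
        "n_duplicate_dates": len(dates) - len(set(dates)),
    }


def cusum_change_point(values):
    """Return the index where a single mean shift is most likely.

    Uses the cumulative sum of deviations from the overall mean; the
    largest absolute excursion marks the last observation before the shift.
    The function always returns an index, so a separate significance test
    is needed to decide whether a shift is actually present.
    """
    overall = mean(values)
    running, best_idx, best_abs = 0.0, 0, 0.0
    for i, x in enumerate(values):
        running += x - overall
        if abs(running) > best_abs:
            best_idx, best_abs = i, abs(running)
    return best_idx + 1  # first index of the new regime
```

For a series of five values at 1.0 followed by five values at 5.0, `cusum_change_point` returns 5, the first index of the higher regime.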

Versioning:

Git version control. Data releases receive DOIs via Zenodo.

1.3 What documentation and metadata will you provide with the data?

Original

The data, the data sources, and the survey processes are documented in detail. Information on the project and the data will be made available to university employees so that further projects can be developed in this area.

Enhanced

Metadata Standards:

Zenodo metadata (DataCite schema)
README documentation with variable descriptions
Data dictionaries
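
A data dictionary can be drafted automatically from a CSV file and then completed by hand. The following is a minimal sketch under assumptions: the function names are hypothetical, the type inference is deliberately crude (integer, float, or string), and the file is assumed to have a header row.

```python
import csv
import io


def _infer_type(values):
    """Classify a column as integer, float, or string."""
    if not values:
        return "string"  # nothing to infer from
    for label, cast in (("integer", int), ("float", float)):
        try:
            for v in values:
                cast(v)
            return label
        except ValueError:
            continue
    return "string"


def build_data_dictionary(csv_text, descriptions):
    """Describe each CSV column: name, inferred type, and description.

    `descriptions` maps column names to human-written text; columns
    without an entry are flagged so they are not forgotten.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    rows = list(reader)
    entries = []
    for name in reader.fieldnames:
        values = [r[name] for r in rows if r[name] != ""]
        entries.append({
            "name": name,
            "type": _infer_type(values),
            "description": descriptions.get(name, "TODO: describe"),
        })
    return entries
```

The resulting list can be written to JSON or rendered as a table in the README, so that variable descriptions stay next to the data they describe.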

Documentation Provided:

README.md files in all repositories
Jupyter notebooks with methodology
CC-BY 4.0 licensing for all outputs

2. Ethics, legal and security issues

2.1 How will ethical issues be addressed and handled?

Original

No personal data or other sensitive data is used in the project. Accordingly, the university's internal security standards are applied.

The conditions have already been discussed with the data providers. No special security standards are required by the data providers for this data.

Enhanced

Ethical Considerations:

All financial data is aggregated market data with no individual identification. Central bank speeches are official public communications.

NLP Bias Assessment Framework:

Sentiment distribution analysis by sector/geography
Temporal drift monitoring
English-language dominance acknowledged
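
The first two items of this framework can be sketched in a few lines. This is an illustration only: the record layout (a dict with a `score` field plus grouping fields such as `sector`) and both function names are assumptions, and a difference in window means is just one simple way to monitor drift.

```python
from collections import defaultdict
from statistics import mean


def sentiment_by_group(records, key):
    """Mean sentiment score per group (e.g. sector or geography)."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec["score"])
    return {g: mean(scores) for g, scores in groups.items()}


def mean_drift(scores, baseline_n, recent_n):
    """Mean of the last `recent_n` scores minus the mean of the first
    `baseline_n` scores; `scores` must be in chronological order."""
    if baseline_n < 1 or recent_n < 1:
        raise ValueError("window sizes must be positive")
    if baseline_n + recent_n > len(scores):
        raise ValueError("baseline and recent windows overlap")
    return mean(scores[-recent_n:]) - mean(scores[:baseline_n])
```

Large gaps between group means, or a drift value far from zero, would be a prompt for manual review rather than proof of bias.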

Data Provider Agreements:

Provider	Terms
RavenPack	Research subscription, academic use only
Deutsche Borse	Research collaboration, project-specific

2.2 How will data access and security be managed?

Original

Access to the data is only granted to team members. The university IT service guarantees the security of data and processes. No sensitive data or personal data is collected in the project.

Enhanced

We use private cloud solutions. No sensitive data or personal data is collected.

Access Control Matrix:

Data Category	Access	Storage
Public data (FRED, BIS)	Open	Original sources
Commercial data	Provider only	Provider infrastructure
Project code/datasets	Open	GitHub/Zenodo

2.3 How will you handle copyright and Intellectual Property Rights issues?

Original

The project is based on data that are largely publicly available. The raw data records may not be published without restriction.

Enhanced

Copyright Framework:

Data	Copyright	License
Commercial data	Providers	Proprietary
BIS data	Public	Public domain
Our code	Project team	MIT License
Our datasets	Project team	CC-BY 4.0

3. Data storage and preservation

3.1 How will your data be stored and backed-up during the research?

Original

The available storage capacity is very large, while the amount of data in the project remains very limited.

The data is managed in the form of databases on the ZHAW internal GitHub. Backups and versions of the data are continuously created.

Enhanced

Relevant data is managed on GitHub, Zenodo, and private cloud solutions.

Primary Storage:

Repository	Content	Backup
Zenodo	Datasets, code, posters	CERN Data Centre
GitHub	Code, docs	Git version control
Private cloud	Working copies	Regular backup

3.2 What is your data preservation plan?

Original

The data is stored long term on the ZHAW internal GitHub and managed by the ZHAW using the existing tools. There is no obligation to destroy the data.

Enhanced

Most relevant data is stored on Zenodo for long-term preservation.

10-Year Retention Strategy:

All outputs archived on Zenodo with DOIs
Zenodo at CERN with 10+ year retention (SNSF compliant)
Public data at original sources

DOI Versioning Policy:

Concept DOI + Version DOI per Zenodo
CHANGELOG.md in each deposit
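
One way to keep CHANGELOG.md entries consistent across deposits is to generate them from a small helper. The sketch below is hypothetical: the entry layout is an assumption, and the version DOI is passed in by the caller after Zenodo has minted it.

```python
from datetime import date


def changelog_entry(version, version_doi, changes, released=None):
    """Format one CHANGELOG.md entry for a deposit version."""
    released = released or date.today()
    lines = [
        f"## {version} ({released.isoformat()})",
        "",
        f"DOI: {version_doi}",
        "",
    ]
    lines += [f"- {change}" for change in changes]
    return "\n".join(lines) + "\n"
```

Recording the version DOI in each entry ties every documented change to exactly one archived version, while the concept DOI continues to resolve to the latest one.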

Deutsche Borse Note: Raw HFT data cannot be archived publicly. Methodology documented for reproducibility.

Compliance: Quarterly self-audit, PI responsible.

4. Data sharing and reuse

4.1 How and where will the data be shared?

Original

The code developed during the project, together with all necessary accompanying documentation, will be stored in a GitHub repository. Data archiving will be done through Zenodo, which offers safe storage for all data and research outputs in CERN's Data Centre and integrates easily with GitHub.

Enhanced

The relevant code developed during the project will be stored on Zenodo.

Published Research Outputs (7 items):

Output	Repository	Status
CB Speech Transcripts	Zenodo	Published
Evergreen Narrative Sentiment	Zenodo	Published
Macro Regime Detection	Zenodo	Published
HFT Poster (QuantMinds)	Zenodo	Published
CB AI Framework Poster	Zenodo	Published
HFT Preprint	SSRN	Pending
SLR Preprint	arXiv	Under Review

Sharing by Data Type:

Data	Shareable	Location
Public data	Yes	Original sources
Commercial data	No	Provider servers
Project code	Yes	Zenodo/GitHub (MIT)
Project datasets	Yes	Zenodo (CC-BY 4.0)

4.2 Are there any necessary limitations to protect sensitive data?

Original

We do not use sensitive data in the project. The data come from conventional data providers and are originally collected from public sources.

Enhanced

Commercial Data Post-Project:

Agreements are project-duration specific
Methodology documented for reproducibility
Derived features archived

4.3 All digital repositories I will choose conform to the FAIR Data Principles.

Original

Yes

Enhanced

Answer: Yes

FAIR Compliance Matrix:

Principle	GitHub	Zenodo
F1 - Findable	Public, searchable	DOIs for all deposits
A1 - Accessible	HTTPS access	HTTPS, no auth
I1 - Interoperable	Python/Jupyter	CSV, JSON, PDF
R1 - Reusable	MIT License	CC-BY 4.0, DataCite

4.4 I will choose digital repositories maintained by a non-profit organisation.

Original

Yes

Enhanced

Answer: Yes

Repository Operators:

Zenodo: CERN (non-profit)
arXiv: Cornell University (non-profit)
GitHub: Microsoft (commercial, free for public repositories)
