Mutual Fund Style Drift Literature Corpus

Complete Replication Guide | Version 1.0 | 2025-12-28

65 Final Papers | 6,580 Total Citations | 3 Pipeline Phases | 10 Python Scripts

1. Overview and Key Innovation

This guide documents the complete methodology for constructing a corpus of 65 peer-reviewed papers on mutual fund style drift. The corpus was built in two stages: (1) two independent searches using different strategies (Phases 1 and 2), and (2) a merge followed by unified keyword-based relevance classification (Phase 3).

Key Innovation

Rather than relying solely on journal prestige or citation counts, we applied domain-specific keyword filtering that mirrors how experts identify relevant literature. Papers must contain established terminology from the style drift literature (e.g., "style drift", "active share", "closet indexing") to be classified as relevant.

Phase 1: 58-Corpus          Phase 2: 53-Corpus          Phase 3: Merge
     |                            |                          |
     v                            v                          v
+-------------+             +-------------+            +-------------+
| 19 OpenAlex |             | Broad Search|            | Load Both   |
|   Queries   |             | Top Journals|            |   Corpora   |
+-------------+             +-------------+            +-------------+
     |                            |                          |
     v                            v                          v
+-------------+             +-------------+            +-------------+
| 5400+ Raw   |             | 2600+ Raw   |            | Deduplicate |
|   Papers    |             |   Papers    |            |   by DOI    |
+-------------+             +-------------+            +-------------+
     |                            |                          |
     v                            v                          v
+-------------+             +-------------+            +-------------+
| Stage 1-2   |             | Journal +   |            | 111 Unique  |
| Filters     |             | Citation    |            |   Papers    |
+-------------+             +-------------+            +-------------+
     |                            |                          |
     v                            v                          v
+-------------+             +-------------+            +-------------+
| Stage 3-4   |             | Weak Kw     |            | Keyword     |
| Journals+Kw |             | Filter      |            | Classify    |
+-------------+             +-------------+            +-------------+
     |                            |                          |
     v                            v                          v
+-------------+             +-------------+            +-------------+
|  58 Papers  |             |  53 Papers  |            |  65 RELEVANT|
|   (100%)    |             |  (13.2%)    |            |  46 REVIEW  |
+-------------+             +-------------+            +-------------+
                

2. Prerequisites and Dependencies

Python Environment

Requirement | Value
Python Version | 3.8+
Required Packages | requests, matplotlib
Optional Packages | pandas (for CSV manipulation), numpy (for statistics)

API Configuration

API | Endpoint | Rate Limit | Auth
OpenAlex | https://api.openalex.org | 10 requests/second with polite pool (mailto header) | None required, but mailto header recommended
Semantic Scholar | https://api.semanticscholar.org/graph/v1 | 1 request/3 seconds without API key | Optional API key for higher limits

Important: Include a mailto header in OpenAlex requests to access the polite pool (10 req/sec). Without it, rate limits are lower.
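
A minimal sketch of how a polite-pool request could be assembled and throttled to the limits in the table above. OpenAlex accepts the contact address either as a mailto query parameter or inside the User-Agent header; this sketch uses the query parameter. The helper names (build_openalex_url, MinIntervalLimiter), the /works route, and the example address are assumptions for illustration and are not taken from the project's scripts.

```python
import time
from urllib.parse import urlencode

OPENALEX_WORKS_URL = "https://api.openalex.org/works"


def build_openalex_url(query: str, mailto: str, per_page: int = 100) -> str:
    """Return a works-search URL that identifies the caller for the polite pool."""
    params = {"search": query, "per-page": per_page, "mailto": mailto}
    return f"{OPENALEX_WORKS_URL}?{urlencode(params)}"


class MinIntervalLimiter:
    """Keep consecutive calls at least min_interval seconds apart."""

    def __init__(self, min_interval: float):
        self.min_interval = min_interval
        self._last_call = None  # monotonic timestamp of the previous call

    def wait(self) -> float:
        """Sleep if needed; return the number of seconds actually slept."""
        slept = 0.0
        if self._last_call is not None:
            remaining = self.min_interval - (time.monotonic() - self._last_call)
            if remaining > 0:
                time.sleep(remaining)
                slept = remaining
        self._last_call = time.monotonic()
        return slept


openalex_limiter = MinIntervalLimiter(0.1)  # 10 requests/second (polite pool)
semantic_limiter = MinIntervalLimiter(3.0)  # 1 request per 3 seconds (no API key)
```

Calling openalex_limiter.wait() before each request, and passing the built URL to requests.get, would keep a query loop inside the documented limits.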

3. Phase 1: 58-Corpus Construction

Location: literature/sections/introduction/

Step 1: OpenAlex Search (15_openalex_literature_search.py)

Execute 19 targeted queries against OpenAlex API

Search Queries (19 total)

Organized into 5 tiers by specificity:

  • "mutual fund style drift"
  • "fund style misclassification"
  • "style consistency" AND "mutual fund"
  • "investment style" AND "fund" AND "deviation"
  • "closet indexing"
  • "active share" AND "mutual fund"
  • "benchmark mismatch" AND "fund"
  • "style timing" AND "fund"
  • "window dressing" AND "mutual fund"
  • "return-based style analysis"
  • "holdings-based style analysis"
  • "Sharpe" AND "style analysis"
  • "style box" AND "Morningstar"
  • "fund manager incentives" AND "risk"
  • "tournament" AND "mutual fund"
  • "flow-performance" AND "mutual fund"
  • "fund misrepresentation"
  • "alpha" AND "style" AND "fund"
  • "fund classification"

Stage 1 Filters (Relevance)

Filter | Value
year_range | 1990-2025
type | journal-article
language | English
has_doi | True
has_abstract | True

Stage 2 Filters (Quality)

Filter | Value
citations_min | 10
or_top_journal | True
or_recent_with_3_cites | 2020+ with 3+ citations
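
The Stage 1 and Stage 2 filters can be read as two predicates over a paper record. The sketch below assumes the or_ prefixes mean that any one Stage 2 condition is sufficient. The dictionary keys (year, type, language, doi, abstract, citations, journal), the "en" language code, and the two-journal TOP_JOURNALS set are hypothetical stand-ins, since the guide does not list the script's field names or its Phase 1 top-journal list.

```python
TOP_JOURNALS = {"Journal of Finance", "Review of Financial Studies"}  # illustrative subset


def passes_stage1(paper: dict) -> bool:
    """Relevance filters: year range, type, language, DOI and abstract present."""
    return (
        1990 <= (paper.get("year") or 0) <= 2025
        and paper.get("type") == "journal-article"
        and paper.get("language") == "en"
        and bool(paper.get("doi"))
        and bool(paper.get("abstract"))
    )


def passes_stage2(paper: dict) -> bool:
    """Quality filters: any one of the three conditions is sufficient (assumed)."""
    cites = paper.get("citations") or 0
    return (
        cites >= 10
        or paper.get("journal") in TOP_JOURNALS
        or ((paper.get("year") or 0) >= 2020 and cites >= 3)
    )
```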

Outputs

  • raw_results.json (5400+ papers)
  • stage1_results.json (after relevance filters)
  • stage2_results.json (1594 papers after quality filters)

Step 2: Corpus Reduction (16_corpus_reduction.py)

Apply finance journal and keyword relevance filters

Stage 3: Journal Filter

Finance/Economics journals only

Exception: SSRN papers with 50+ citations pass
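
A minimal sketch of the Stage 3 rule with its SSRN exception. How the script recognises SSRN records is not stated in this guide, so the "ssrn" substring test, the record keys, and the two-entry FINANCE_JOURNALS set are assumptions.

```python
FINANCE_JOURNALS = {"journal of finance", "financial analysts journal"}  # illustrative subset


def passes_stage3(paper: dict) -> bool:
    """Keep finance/economics journals, plus SSRN papers with 50+ citations."""
    venue = (paper.get("journal") or "").strip().lower()
    if venue in FINANCE_JOURNALS:
        return True
    # Assumption: SSRN records carry "ssrn" somewhere in the venue name
    is_ssrn = "ssrn" in venue
    return is_ssrn and (paper.get("citations") or 0) >= 50
```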

Stage 4: Keyword Relevance Filter

Papers must contain at least one of these terms in title or abstract:

style drift, style consistency, style timing, closet index, closet indexing, active share, benchmark mismatch, misclassif*, misrepresent*, window dressing, return-based style, returns-based style, holdings-based style, style analysis, style box

Context-dependent: "investment style" counts as a match only if "fund" also appears in the text
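
The Stage 4 terms can be turned into patterns, with a trailing * read as "any word ending". This is a sketch under that assumption, not the script's actual matching logic; the names compile_term and stage4_matches are hypothetical.

```python
import re

STAGE4_TERMS = [
    "style drift", "style consistency", "style timing", "closet index",
    "closet indexing", "active share", "benchmark mismatch", "misclassif*",
    "misrepresent*", "window dressing", "return-based style",
    "returns-based style", "holdings-based style", "style analysis", "style box",
]


def compile_term(term: str) -> "re.Pattern":
    """Build a case-insensitive regex; a trailing * matches any word ending."""
    if term.endswith("*"):
        body = re.escape(term[:-1]) + r"\w*"
    else:
        body = re.escape(term)
    return re.compile(r"\b" + body, re.IGNORECASE)


PATTERNS = [(term, compile_term(term)) for term in STAGE4_TERMS]


def stage4_matches(title, abstract):
    """Return the Stage 4 terms found in the title or abstract."""
    text = f"{title or ''} {abstract or ''}"
    matched = [term for term, pattern in PATTERNS if pattern.search(text)]
    # Context-dependent term: only counts when "fund" also appears
    if re.search(r"\binvestment style", text, re.IGNORECASE) and re.search(
        r"fund", text, re.IGNORECASE
    ):
        matched.append("investment style")
    return matched
```

A paper passes Stage 4 when stage4_matches returns a non-empty list. Terms are anchored only at the start of a word, so "style box" also matches "style boxes".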

Step 3: Metadata Enrichment (17_enrich_corpus_metadata.py)

Fetch full metadata from OpenAlex and Semantic Scholar

  • Full abstract text
  • OpenAlex concepts/topics
  • Author affiliations
  • Reference counts
  • Open access status

Result: 58 papers with 100% relevance rate (all passed keyword filter by construction)
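
One practical detail when fetching full abstract text: OpenAlex returns abstracts as an inverted index (each word mapped to its positions) in the abstract_inverted_index field, not as plain text. The sketch below shows one way to rebuild the text; the function name is an assumption, and the enrichment script may handle this differently.

```python
from typing import Dict, List, Optional


def rebuild_abstract(inverted_index: Optional[Dict[str, List[int]]]) -> str:
    """Reconstruct plain text from a word -> positions mapping."""
    if not inverted_index:
        return ""
    placed = []
    for word, positions in inverted_index.items():
        for position in positions:
            placed.append((position, word))
    placed.sort()  # order words by their position in the abstract
    return " ".join(word for _, word in placed)
```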

4. Phase 2: 53-Corpus Construction

Location: literature/scripts/

Methodology

Broad search filtered by publication in top finance journals:

Top 12 Journal Filter

  • Journal of Finance
  • Journal of Financial Economics
  • Review of Financial Studies
  • Journal of Financial and Quantitative Analysis
  • Financial Analysts Journal
  • Journal of Portfolio Management
  • Journal of Banking & Finance
  • Management Science
  • Journal of Monetary Economics
  • Review of Finance
  • Journal of Financial Markets
  • Journal of Corporate Finance

Scripts

Script | Output
01_openalex_search.py | raw_papers.csv
02_supplementary_search.py | merged_corpus.csv
05_relevance_filter.py | final_corpus.csv (53 papers)

Issue: This phase has no explicit keyword relevance filter, so the corpus includes tangentially related papers (only 7 of the 53 pass the Phase 3 classifier)

5. Phase 3: Merging and Classification

Location: literature/scripts/

Step 1: Merge Corpora

Source | File
58-Corpus | literature/sections/introduction/openalex_output/enriched_corpus.json
53-Corpus | literature/data/final_corpus.csv

Deduplication: DOI-based matching (case-insensitive)

Result: 111 unique papers (0 duplicates found)
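
A minimal sketch of case-insensitive DOI deduplication. Stripping resolver prefixes such as https://doi.org/ goes beyond the case-insensitive rule stated above and is an assumption, added because a JSON source and a CSV source may store the same DOI in different forms. The function names and the "doi" key are hypothetical.

```python
DOI_PREFIXES = (
    "https://doi.org/", "http://doi.org/",
    "https://dx.doi.org/", "http://dx.doi.org/", "doi:",
)


def normalize_doi(doi):
    """Lowercase a DOI and drop a leading resolver prefix; None if empty."""
    if not doi:
        return None
    cleaned = doi.strip().lower()
    for prefix in DOI_PREFIXES:
        if cleaned.startswith(prefix):
            cleaned = cleaned[len(prefix):]
            break
    return cleaned or None


def merge_by_doi(first, second):
    """Concatenate two corpora, keeping the first record for each DOI."""
    merged, seen, duplicates = [], set(), 0
    for paper in list(first) + list(second):
        key = normalize_doi(paper.get("doi"))
        if key is not None and key in seen:
            duplicates += 1
            continue
        if key is not None:
            seen.add(key)
        merged.append(paper)
    return merged, duplicates
```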

Step 2: Keyword Classification (21_relevance_classifier.py)

Apply the relevance keywords listed in Section 6 to all 111 merged papers

Classification Algorithm

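# Keyword lists as given in Section 6
RELEVANCE_KEYWORDS = (
    "style drift", "style consistency", "style timing", "active share",
    "closet index", "closet indexing", "return-based style",
    "returns-based style", "holdings-based style", "style analysis",
    "style box", "misclassif", "misrepresent", "window dressing",
    "fund style", "style rotation", "style risk",
)
CONTEXT_KEYWORDS = ("investment style", "style deviation",
                    "benchmark deviation", "style shift")

def classify_paper(title: str, abstract: str) -> str:
    # Treat a missing title or abstract as empty text
    text = ((title or "") + " " + (abstract or "")).lower()

    # Direct keywords: a substring match is enough on its own
    for keyword in RELEVANCE_KEYWORDS:
        if keyword in text:
            return "RELEVANT"

    # Context-dependent keywords count only alongside "fund" or "mutual"
    if "fund" in text or "mutual" in text:
        for keyword in CONTEXT_KEYWORDS:
            if keyword in text:
                return "RELEVANT"

    return "NEEDS_REVIEW"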
            
Step 3: Generate Outputs

  • classified_papers.json (111 papers with classification)
  • final_relevant_corpus.json (65 papers)
  • final_relevant_corpus.bib
  • final_corpus_apa.html
  • corpus_statistics.json
  • corpus_statistics.html
  • charts/ (5 PNG visualizations)

6. Relevance Keywords Reference

Direct Keywords (17 terms)

Match triggers RELEVANT classification immediately:

style drift, style consistency, style timing, active share, closet index, closet indexing, return-based style, returns-based style, holdings-based style, style analysis, style box, misclassif, misrepresent, window dressing, fund style, style rotation, style risk

Context-Dependent Keywords (4 terms)

Only match if "fund" or "mutual" also appears:

investment style, style deviation, benchmark deviation, style shift

Keyword Match Frequency (Final Corpus)

Keyword | Papers Matched
investment style [+fund] | 20
style analysis | 12
active share | 11
closet index/indexing | 9
fund style | 6
window dressing | 6
style drift | 5
return-based style | 4
misclassification | 3
style timing | 3
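
A frequency table like the one above can be tallied by collecting every keyword each paper matches. Because a single paper can match several keywords, the counts can sum to more than the corpus size. The sketch uses only a subset of the Section 6 lists, and the function names and record keys are hypothetical.

```python
from collections import Counter

DIRECT = ("style drift", "style analysis", "active share", "fund style")  # subset of the 17
CONTEXT = ("investment style",)  # subset of the 4


def matched_keywords(title, abstract):
    """Return every keyword found, tagging context-dependent ones with [+fund]."""
    text = f"{title or ''} {abstract or ''}".lower()
    hits = [kw for kw in DIRECT if kw in text]
    if "fund" in text or "mutual" in text:
        hits += [f"{kw} [+fund]" for kw in CONTEXT if kw in text]
    return hits


def keyword_frequency(papers):
    """Count, per keyword, how many papers match it."""
    counts = Counter()
    for paper in papers:
        counts.update(matched_keywords(paper.get("title"), paper.get("abstract")))
    return counts
```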

7. Final Results and Validation

65 Total Papers | 6,580 Citations | 101.4 Mean Cites | 27 Journals

Corpus Composition

Source | Relevant | Total | Rate
58-Corpus (targeted search) | 58 | 58 | 100%
53-Corpus (journal prestige) | 7 | 53 | 13.2%

Validation Checks

  • All 65 papers have DOIs
  • All 65 papers match at least one relevance keyword
  • No duplicates exist in final corpus
  • Year range is within 1990-2025
  • All papers are from finance/economics journals or SSRN with 50+ cites
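
The first four checks above could be automated along these lines. This is a sketch with hypothetical record keys (doi, year) and a caller-supplied has_keyword predicate; the journal check is left out because it depends on the project's journal list.

```python
from typing import Callable, Dict, List


def validate_corpus(papers: List[Dict], has_keyword: Callable[[Dict], bool]) -> List[str]:
    """Return a list of problems found; an empty list means every check passed."""
    problems = []
    dois = [(p.get("doi") or "").strip().lower() for p in papers]
    if any(not d for d in dois):
        problems.append("paper without DOI")
    present = [d for d in dois if d]
    if len(set(present)) != len(present):
        problems.append("duplicate DOI")
    if any(not 1990 <= (p.get("year") or 0) <= 2025 for p in papers):
        problems.append("year outside 1990-2025")
    if any(not has_keyword(p) for p in papers):
        problems.append("paper without relevance keyword")
    return problems
```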

8. The Seven Additional Papers

These papers from the 53-corpus matched relevance keywords and were added to the final corpus:

# | Title | Journal | Cites | Matched Keyword
1 | Mutual Fund Styles | Journal of Financial Economics | 490 | fund style
2 | On Mutual Fund Investment Styles | Review of Financial Studies | 422 | investment style [+fund]
3 | Liquidity, Investment Style, and the Relation between Fund Size and Fund Performance | Journal of Financial and Quantitative Analysis | 332 | investment style [+fund]
4 | Active Share and Mutual Fund Performance | Financial Analysts Journal | 296 | active share
5 | Mutual Fund Misclassification: Evidence Based on Style Analysis | Financial Analysts Journal | 162 | style analysis, misclassification
6 | Equity Style Timing | Financial Analysts Journal | 69 | style timing
7 | Diseconomies of Scale in Quantitative and Fundamental Investment Styles | Journal of Financial and Quantitative Analysis | 13 | investment style [+fund]

9. Complete Execution Guide

Execute these commands in order to fully reproduce the 65-paper corpus. The cd paths are relative to the project root, so return there before starting each phase:

Phase 1: 58-Corpus Construction

cd literature/sections/introduction/

python 15_openalex_literature_search.py
python 16_corpus_reduction.py
python 17_enrich_corpus_metadata.py
python 18_generate_apa_html.py

Phase 2: 53-Corpus Construction (if starting fresh)

cd literature/scripts/

python 01_openalex_search.py
python 02_supplementary_search.py
python 05_relevance_filter.py

Phase 3: Merging and Classification

cd literature/scripts/

python 21_relevance_classifier.py
python 22_final_corpus_apa_html.py
python 23_corpus_statistics.py

Expected Final Outputs:
  • classification_output/final_relevant_corpus.json - 65 papers
  • classification_output/final_corpus_apa.html - APA citations
  • classification_output/corpus_statistics.html - Statistics dashboard
  • classification_output/charts/ - 5 PNG visualizations
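
The ten commands can also be driven from a single runner. The sketch below assumes both directories sit under one project root and that each script is meant to run from inside its own directory; run_pipeline and its dry_run flag are hypothetical and not part of the project.

```python
import subprocess
import sys
from pathlib import Path

PIPELINE = [
    ("literature/sections/introduction", [
        "15_openalex_literature_search.py",
        "16_corpus_reduction.py",
        "17_enrich_corpus_metadata.py",
        "18_generate_apa_html.py",
    ]),
    ("literature/scripts", [
        "01_openalex_search.py",
        "02_supplementary_search.py",
        "05_relevance_filter.py",
    ]),
    ("literature/scripts", [
        "21_relevance_classifier.py",
        "22_final_corpus_apa_html.py",
        "23_corpus_statistics.py",
    ]),
]


def run_pipeline(root: Path, dry_run: bool = True):
    """Run every script in order from its own directory; stop on first failure."""
    commands = []
    for rel_dir, scripts in PIPELINE:
        for script in scripts:
            cwd = root / rel_dir
            commands.append((cwd, script))
            if not dry_run:
                # check=True raises CalledProcessError if a script exits non-zero
                subprocess.run([sys.executable, script], cwd=cwd, check=True)
    return commands
```

With dry_run=True the function only lists what would be executed, which is a cheap way to confirm the order before spending API calls.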

Generated: 2025-12-28 18:38:25 | Script: 24_replication_guide.py | Mutual Fund Style Drift SLR