1. Introduction
Since the Capital Asset Pricing Model reduced expected returns to a single market factor (Sharpe, 1964), the cross-section of expected stock returns has been one of the most intensively studied areas in financial economics. Early anomalies such as the size effect (Banz, 1981) and the value premium (Fama and French, 1993) challenged the CAPM and motivated multi-factor models. Harvey, Liu, and Zhu (2016) counted 316 distinct characteristics proposed as return predictors in the literature through roughly 2012, and the count has continued to grow.
This proliferation raises a fundamental question: how many of these factors represent genuine economic phenomena, and how many are statistical artifacts of data mining? The answer matters not only for academic theory but also for practitioners who must decide which signals to trust when constructing portfolios.
For an overview of how factor signals are translated into portfolio weights, see our companion article on quantitative portfolio construction.
2. The Growth of the Factor Zoo
2.1 A Brief History of Factor Discovery
The CAPM (Sharpe, 1964) posited that expected returns are determined solely by exposure to market risk. The first cracks appeared with the size effect: Banz (1981) showed that small-capitalization stocks earned higher average returns than the CAPM predicted. Fama and French (1993) formalized the value and size effects into the three-factor model (market, SMB, HML), which became the workhorse of empirical asset pricing for over a decade. Momentum (Jegadeesh and Titman, 1993; Carhart, 1997) and profitability (Novy-Marx, 2013) followed, each documented as a robust anomaly in its own right.
The pace of factor discovery accelerated through the 2000s and 2010s. Harvey, Liu, and Zhu (2016) catalogued 316 factors proposed in top finance and economics journals through roughly 2012, noting that 59 new factors were discovered between 2010 and 2012 alone. In his 2011 AFA presidential address, Cochrane (2011) gave the phenomenon its memorable name: "We also thought that the cross-section of expected returns came from the CAPM. Now we have a zoo of new factors."
2.2 Competing Factor Models
As the zoo grew, several research groups proposed parsimonious factor models to organize it. Table 1 summarizes the major contenders.
| Model | Factors | Year | Key Reference |
|---|---|---|---|
| CAPM | Market | 1964 | Sharpe (1964) |
| Fama–French 3 | Market, SMB, HML | 1993 | Fama and French (1993) |
| Carhart 4 | +Momentum (UMD) | 1997 | Carhart (1997) |
| Fama–French 5 | +Profitability (RMW), Investment (CMA) | 2015 | Fama and French (2015) |
| q-Factor | Market, Size, I/A, ROE | 2015 | Hou, Xue, and Zhang (2015) |
| Mispricing | Market, SMB, MGMT, PERF | 2017 | Stambaugh and Yuan (2017) |
No single model dominates across all test assets, and spanning tests frequently reject each model's ability to price the others' factors. The q-factor model (Hou, Xue, and Zhang, 2015) is grounded in the investment CAPM and, according to its authors, largely subsumes the Fama–French five-factor model in spanning tests, though this claim remains contested. The mispricing factor model (Stambaugh and Yuan, 2017) takes an explicitly behavioral perspective. This unresolved competition underscores the difficulty of separating risk premia from mispricing with a small number of factors.
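The logic of a spanning test can be illustrated with a small regression: a candidate factor's returns are regressed on a constant plus a benchmark model's factors, and a significant intercept (alpha) indicates that the benchmark does not span the candidate. The following is a minimal sketch under simplifying assumptions (classical OLS standard errors, no joint test across several candidates); the function names and the toy data are hypothetical and are not taken from any of the cited papers.

```python
import math


def invert(matrix):
    """Gauss-Jordan inverse with partial pivoting for a small square matrix."""
    n = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def spanning_alpha(target, factors):
    """Regress `target` on a constant plus each series in `factors`.

    Returns (alpha, t_alpha) using classical OLS standard errors.
    """
    n = len(target)
    X = [[1.0] + [f[t] for f in factors] for t in range(n)]
    k = len(X[0])
    xtx = [[sum(X[t][i] * X[t][j] for t in range(n)) for j in range(k)]
           for i in range(k)]
    xty = [sum(X[t][i] * target[t] for t in range(n)) for i in range(k)]
    xtx_inv = invert(xtx)
    beta = [sum(xtx_inv[i][j] * xty[j] for j in range(k)) for i in range(k)]
    resid = [target[t] - sum(X[t][i] * beta[i] for i in range(k))
             for t in range(n)]
    sigma2 = sum(e * e for e in resid) / (n - k)
    se_alpha = math.sqrt(sigma2 * xtx_inv[0][0])
    return beta[0], beta[0] / se_alpha


# Hypothetical toy data: one benchmark factor and one candidate factor.
benchmark = [0.01, -0.01, 0.02, -0.02]
noise = [0.001, 0.001, -0.001, -0.001]
candidate = [0.002 + 0.5 * b + e for b, e in zip(benchmark, noise)]

alpha, t_alpha = spanning_alpha(candidate, [benchmark])
print(f"alpha = {alpha:.4f}, t(alpha) = {t_alpha:.2f}")
```

Published spanning tests typically use long monthly samples, heteroskedasticity-robust standard errors, and joint tests across several factors; the sketch only shows the mechanics of a single regression.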
3. The Replication Crisis
The volume of proposed factors inevitably raised concerns about false discoveries. Several large-scale replication studies have attempted to quantify the problem, arriving at strikingly different conclusions.
3.1 Pessimistic Evidence
Hou, Xue, and Zhang (2020) conducted the largest replication exercise at the time, testing 452 anomalies from the literature. Using value-weighted returns with NYSE breakpoints to mitigate the influence of microcap stocks, they found that 65% of anomalies fail to clear even the conventional single-test hurdle of |t| ≥ 1.96. At the multiple-testing-adjusted threshold of |t| ≥ 2.78, the failure rate rises to 82%. The trading frictions literature fared worst: 102 of 106 anomalies (96%) failed to replicate. Their conclusion was stark: "Capital markets are more efficient than previously recognized."
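To see why the weighting scheme matters, the sketch below contrasts equal-weighted and value-weighted portfolio returns and computes a simple t-statistic for a return series. It assumes plain (not autocorrelation-adjusted) standard errors and uses made-up numbers; it does not implement NYSE breakpoints, and the function names are hypothetical.

```python
import math
from statistics import mean, stdev


def equal_weighted_return(returns):
    """Average return with every stock weighted equally."""
    return mean(returns)


def value_weighted_return(returns, market_caps):
    """Portfolio return with weights proportional to market capitalization."""
    total_cap = sum(market_caps)
    return sum(r * cap for r, cap in zip(returns, market_caps)) / total_cap


def t_statistic(series):
    """t-statistic for the null that the mean of `series` is zero."""
    n = len(series)
    return mean(series) / (stdev(series) / math.sqrt(n))


# Hypothetical one-month snapshot: one microcap with a large return
# and two large caps with modest returns.
returns = [0.10, 0.01, 0.02]
caps = [50.0, 5000.0, 10000.0]

ew = equal_weighted_return(returns)
vw = value_weighted_return(returns, caps)
print(f"equal-weighted: {ew:.4f}, value-weighted: {vw:.4f}")

# Hypothetical monthly long-short returns for one anomaly.
long_short = [0.01, 0.03, 0.02, 0.00, 0.04]
print(f"t-statistic: {t_statistic(long_short):.2f}")
```

In the toy snapshot the microcap dominates the equal-weighted return but barely moves the value-weighted one, which is the mechanism behind the lower replication rates under value weighting.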
McLean and Pontiff (2016) took a different approach, studying 97 published predictors before and after publication. They found that portfolio returns were 26% lower out of sample (after the original sample period ended but before publication) and 58% lower after publication. The 26% decline provides an upper bound on the effect of data mining, while the additional 32 percentage points are attributed to investors learning about and trading against the anomalies.
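The decomposition is simple arithmetic, sketched below. The function name and the 1% in-sample monthly return are hypothetical; only the 26% and 58% declines come from the text.

```python
def decompose_decay(oos_decline, post_publication_decline):
    """Split the total post-publication decline into two parts.

    The out-of-sample decline bounds the data-mining effect from above;
    the remainder is the part attributed to informed trading.
    """
    return {
        "data_mining_upper_bound": oos_decline,
        "publication_effect": post_publication_decline - oos_decline,
    }


parts = decompose_decay(0.26, 0.58)

# Hypothetical anomaly earning 1.00% per month in the original sample.
in_sample = 1.00
out_of_sample = in_sample * (1 - 0.26)      # return after the sample ends
post_publication = in_sample * (1 - 0.58)   # return after publication
print(parts, round(out_of_sample, 2), round(post_publication, 2))
```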
3.2 Optimistic Evidence
Jensen, Kelly, and Pedersen (2023) reached a more favorable verdict. Testing 153 characteristics across 93 countries, they found that the majority of factors can be replicated, that factors cluster into 13 economically meaningful themes, and that the evidence is strengthened rather than weakened by the large number of observed factors when evaluated through a Bayesian lens. Their comprehensive dataset is publicly available at jkpfactors.com, covering global factor returns.
Chen and Zimmermann (2022) similarly found that 98% of the 161 characteristics that were clearly significant in the original papers produce long-short portfolios with t-statistics above 1.96 when replicated faithfully. A regression of reproduced t-statistics on original t-statistics yields a slope of 0.88 and an R² of 82%, suggesting that while effect sizes may be somewhat overstated, the literature is more credible than the pessimistic assessments imply.
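The slope and R² reported above come from a univariate regression, which can be sketched as follows. The t-statistics in the example are invented for illustration and are not the authors' data; the function name is hypothetical.

```python
def ols_slope_r2(x, y):
    """Univariate OLS of y on x with an intercept.

    Returns (slope, intercept, r_squared).
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = sxy ** 2 / (sxx * syy)
    return slope, intercept, r_squared


# Hypothetical original and reproduced t-statistics for five predictors.
original_t = [2.1, 2.8, 3.5, 4.2, 6.0]
reproduced_t = [1.7, 2.6, 3.0, 3.9, 5.1]

slope, intercept, r_squared = ols_slope_r2(original_t, reproduced_t)
print(f"slope = {slope:.2f}, R^2 = {r_squared:.2f}")
```

A slope below one means reproduced t-statistics are systematically smaller than the originals, while a high R² means the ranking of predictors is largely preserved.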
3.3 Reconciling the Views
The apparent contradiction between 35% and 98% replication rates reflects both methodological choices and a deeper conceptual disagreement about what "replication" means. At the methodological level, Hou, Xue, and Zhang (2020) use NYSE breakpoints and value-weighted returns, which effectively remove microcap stocks that drive many anomalies but are difficult to trade in practice. Chen and Zimmermann (2022) replicate more closely to the original methodology of each paper, preserving microcap exposure. Jensen, Kelly, and Pedersen (2023) use a Bayesian framework that accounts for the multiple testing problem differently from frequentist t-statistic thresholds.
But the disagreement also runs deeper. Hou, Xue, and Zhang (2020) are testing whether factors survive in investable universes under standardized methodology, a form of scientific replication that asks whether the economic phenomenon is real and exploitable. Chen and Zimmermann (2022) are testing whether published results can be faithfully reproduced, a form of statistical replication that asks whether the original authors' computations were correct. These are different epistemological standards, and the gap between 35% and 98% partly reflects which standard each study applies. Linnainmaa and Roberts (2018) add a temporal dimension to this debate, showing that many accounting-based anomalies documented in post-1963 data do not survive in the pre-COMPUSTAT era (1926–1963), suggesting they may be artifacts of data mining rather than persistent economic phenomena.
Table 2 summarizes the four studies.
| Study | Factors Tested | Replication Rate | Threshold | Key Finding |
|---|---|---|---|---|
| Hou, Xue, and Zhang (2020) | 452 | 35% | |t| > 1.96 | Microcap-mitigated; most anomalies fail |
| McLean and Pontiff (2016) | 97 | n/a* | Return magnitude | 26% OOS decline, 58% post-publication decline |
| Chen and Zimmermann (2022) | 161 | 98% | |t| > 1.96 | Close-to-original replication |
| Jensen, Kelly, and Pedersen (2023) | 153 | Majority | Bayesian | 13 themes, 93 countries |
*McLean and Pontiff measure return magnitude decline, not statistical significance rates; their metric is not directly comparable to the other studies.
The reconciliation carries practical significance: factors that survive after removing microcaps and applying stricter thresholds are more likely to be investable. The optimistic studies, by contrast, confirm that the academic literature is not fundamentally broken, even if many factors have limited practical relevance.
4. Methodological Foundations
4.1 The Multiple Testing Problem
When hundreds of hypotheses are tested, some will appear significant by chance. At the conventional 5% significance level, testing 400 independent hypotheses would produce approximately 20 "discoveries" even if none were real. Harvey, Liu, and Zhu (2016) argued that the traditional t-statistic threshold of 1.96 is inadequate and proposed a minimum hurdle of t > 3.0.
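The arithmetic behind this example is short enough to check directly. The sketch below assumes independent tests with every null hypothesis true; the function names are hypothetical.

```python
def expected_false_discoveries(n_tests, alpha):
    """Expected number of rejections when every null hypothesis is true."""
    return n_tests * alpha


def prob_at_least_one_false_discovery(n_tests, alpha):
    """Family-wise error rate under independence and all-true nulls."""
    return 1.0 - (1.0 - alpha) ** n_tests


n_false = expected_false_discoveries(400, 0.05)
fwer = prob_at_least_one_false_discovery(400, 0.05)
print(f"expected false discoveries: {n_false:.0f}")
print(f"probability of at least one: {fwer:.6f}")
```

With 400 tests, at least one spurious "discovery" is a near certainty, which is why a single-test threshold says little about the credibility of any one published factor.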
Table 3 shows how multiple testing corrections raise the bar.
| Method | Implied t-stat | Controls | Reference |
|---|---|---|---|
| Conventional | 1.96 | Single test | — |
| BHY (316 observed, 5%) | 2.78 | False discovery rate | Harvey, Liu, and Zhu (2016) |
| BHY (est. total tests, 5%) | 3.18 | FDR + missing factors | Harvey, Liu, and Zhu (2016) |
| Holm (316 factors) | 3.64 | Family-wise error rate | Harvey, Liu, and Zhu (2016) |
| Bonferroni (316 factors) | 3.78 | Family-wise error rate | Harvey, Liu, and Zhu (2016) |
| Practical minimum | 3.0 | Approximate FDR | Harvey, Liu, and Zhu (2016) |
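The corrections in the table can be sketched in a few lines. The Bonferroni hurdle has a closed form under a normal approximation, whereas Holm and BHY (Benjamini-Hochberg-Yekutieli) are sequential procedures whose implied hurdle depends on the observed p-values. This is an illustrative sketch with hypothetical function names and made-up p-values, not the authors' code.

```python
from statistics import NormalDist


def bonferroni_t_threshold(n_tests, alpha=0.05):
    """Two-sided normal critical value when alpha is split across n_tests."""
    return NormalDist().inv_cdf(1.0 - alpha / (2.0 * n_tests))


def holm_reject(p_values, alpha=0.05):
    """Holm step-down procedure (family-wise error rate control).

    Returns rejection flags in the original order of `p_values`.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # stop at the first failure
    return reject


def bhy_reject(p_values, alpha=0.05):
    """BHY step-up procedure (false discovery rate control under dependence)."""
    m = len(p_values)
    c_m = sum(1.0 / j for j in range(1, m + 1))
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / (m * c_m):
            cutoff_rank = rank  # keep the largest rank that passes
    reject = [False] * m
    for idx in order[:cutoff_rank]:
        reject[idx] = True
    return reject


# With 316 tests the Bonferroni hurdle comes out close to the 3.78 in the table.
print(f"Bonferroni hurdle, 316 tests: {bonferroni_t_threshold(316):.2f}")

# Hypothetical p-values for five candidate factors.
p_values = [0.001, 0.008, 0.012, 0.04, 0.30]
print("Holm:", holm_reject(p_values))
print("BHY: ", bhy_reject(p_values))
```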
Harvey and Liu (2020) extended this work with a double-bootstrap method that calibrates both false discovery (Type I) and missed discovery (Type II) errors simultaneously, highlighting the inherent trade-off: as the threshold rises, fewer false factors pass, but more genuine ones are missed. Harvey and Liu (2021) proposed a bootstrap framework that tests individual stocks directly rather than relying on portfolio sorts, providing a natural control for the multiple testing problem.
4.2 Publication Bias and Data Snooping
The multiple testing problem is compounded by publication bias. Journals preferentially publish significant results, leaving null findings in the "file drawer." Harvey, Liu, and Zhu (2016) estimated that 71% of all factors tried are missing from the published record, implying that the effective number of tests far exceeds the 316 published ones. Lo and MacKinlay (1990) were among the first to formalize data-snooping biases in asset pricing tests, showing that grouping stocks by a characteristic discovered in the same dataset inflates apparent predictability. White (2000) generalized this insight into a statistical framework for testing whether the best model found through specification search has genuine predictive superiority, or whether the result reflects data snooping.
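A small simulation makes the data-snooping problem concrete: if many strategies with zero true mean return are tried, the best one will look significant by conventional standards. The sketch below is a simplified illustration and is not White's (2000) reality check; the number of strategies, sample length, volatility, and seed are all arbitrary assumptions.

```python
import math
import random
from statistics import mean, stdev


def t_statistic(series):
    """t-statistic for the null that the mean of `series` is zero."""
    n = len(series)
    return mean(series) / (stdev(series) / math.sqrt(n))


def simulate_noise_t_stats(n_strategies=200, n_months=240, seed=0):
    """t-statistics of strategies whose true mean return is exactly zero."""
    rng = random.Random(seed)
    t_stats = []
    for _ in range(n_strategies):
        returns = [rng.gauss(0.0, 0.05) for _ in range(n_months)]
        t_stats.append(t_statistic(returns))
    return t_stats


t_stats = simulate_noise_t_stats()
best_abs_t = max(abs(t) for t in t_stats)
share_significant = sum(abs(t) > 1.96 for t in t_stats) / len(t_stats)

print(f"largest |t| among pure-noise strategies: {best_abs_t:.2f}")
print(f"share with |t| > 1.96: {share_significant:.1%}")
```

Reporting only the best-performing strategy from such a search, without accounting for the search itself, is the bias that formal data-snooping tests are designed to correct.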
Together, multiple testing and publication bias create an environment in which even honest researchers, making individually reasonable choices, can collectively generate a body of literature with a high false discovery rate.
5. Taming the Zoo
5.1 Shrinkage and Regularization
Rather than picking individual "winning" factors, several approaches compress the zoo into a low-dimensional structure. Kozak, Nagel, and Santosh (2020) construct a stochastic discount factor using Bayesian shrinkage, showing that a small number of principal components, not individual characteristics, explains the cross-section. Their key insight is that the quest for a sparse characteristics-based factor model is misguided: the pricing information is distributed across many correlated characteristics, and principal components capture this more efficiently than any small set of named factors.
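In stylized form, the shrinkage idea can be written as a ridge-type estimator of the SDF coefficients. The notation below is an assumption introduced for illustration: F_t is the vector of factor returns, with sample mean \hat{\mu} and sample covariance \hat{\Sigma} = Q D Q^{\top}, and \gamma > 0 is a shrinkage parameter chosen out of sample.

```latex
M_t = 1 - b^{\top}\left(F_t - \mathbb{E}[F_t]\right), \qquad
\hat{b} = \left(\hat{\Sigma} + \gamma I\right)^{-1}\hat{\mu}

% In principal-component coordinates (subscript P), with eigenvalues d_j:
\hat{b}_{P,j} = \frac{\hat{\mu}_{P,j}}{d_j + \gamma}
              = \frac{d_j}{d_j + \gamma}\cdot\frac{\hat{\mu}_{P,j}}{d_j}
```

The factor d_j / (d_j + \gamma) is close to one for high-variance principal components and close to zero for low-variance ones, so the estimator effectively keeps the leading components and discounts the rest.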
Feng, Giglio, and Xiu (2020) take a different approach with double-selection LASSO, which evaluates new factors while controlling for omitted variable bias from the existing high-dimensional factor set. Across 135 factors, most new proposals are found to be redundant, though a few (notably profitability) have genuine marginal explanatory power.
5.2 Machine Learning Approaches
Machine learning methods offer a more flexible framework for extracting the latent factor structure. Kelly, Pruitt, and Su (2019) introduced Instrumented Principal Component Analysis (IPCA), which allows for latent factors with time-varying loadings by using observable characteristics as instruments. Five IPCA factors explain the cross-section significantly more accurately than existing factor models, and among a large collection of characteristics, only about ten are statistically significant at the 1% level.
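The structure of IPCA can be stated compactly. In the notation assumed here, r_{i,t+1} is the excess return on stock i, z_{i,t} is its vector of observable characteristics, f_{t+1} is the vector of latent factors, and \Gamma_{\alpha} and \Gamma_{\beta} are the parameter matrices to be estimated.

```latex
r_{i,t+1} = \alpha_{i,t} + \beta_{i,t}\, f_{t+1} + \varepsilon_{i,t+1}

\alpha_{i,t} = z_{i,t}^{\top}\Gamma_{\alpha} + \nu_{\alpha,i,t}, \qquad
\beta_{i,t}  = z_{i,t}^{\top}\Gamma_{\beta} + \nu_{\beta,i,t}
```

Characteristics enter only through the loadings, so a characteristic matters for expected returns to the extent that it helps identify a stock's exposure to the latent factors; testing \Gamma_{\alpha} = 0 asks whether characteristics predict returns beyond those exposures.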
Gu, Kelly, and Xiu (2020) conducted a comprehensive comparison of machine learning methods for return prediction, including LASSO, elastic net, random forests, gradient boosting, and neural networks. Trees and neural networks performed best, with gains traced to their ability to capture nonlinear interactions among predictors. All methods agreed on the dominant signals: variations of momentum, liquidity, and volatility.
Chen, Pelger, and Zhu (2024) extended this to deep neural networks with a no-arbitrage criterion function and an adversarial approach to constructing the most informative test assets. Their model explains 8% of total individual stock return variation (roughly twice the benchmark) and 23% of expected returns at the individual stock level.
5.3 How Many Factors Are Enough?
A practical question for both researchers and portfolio managers is the effective dimensionality of the factor space. Green, Hand, and Zhang (2017) tested 94 characteristics simultaneously and found that only a handful provide independent information about average monthly returns once the others are controlled for. Swade, Hanauer, Lohre, and Blitz (2024) addressed the spanning question directly, starting from 153 published U.S. equity factors and finding that approximately 15 factors are sufficient to span the entire factor zoo. Common three-to-five factor models are insufficient, but the full zoo of 153 factors is highly redundant. Notably, the specific factor representatives that span the zoo vary over time, underscoring the importance of continuous factor evaluation rather than a fixed factor set.
Table 4 summarizes these approaches.
| Approach | Method | Effective Factors | Key Reference |
|---|---|---|---|
| PCA + shrinkage | Bayesian SDF | Few PCs | Kozak, Nagel, and Santosh (2020) |
| Double-selection LASSO | Penalized regression | Variable | Feng, Giglio, and Xiu (2020) |
| IPCA | Instrumented PCA | 5 latent | Kelly, Pruitt, and Su (2019) |
| Neural networks | Deep learning | Learned | Chen, Pelger, and Zhu (2024) |
| Spanning tests | Factor rotation | ~15 | Swade, Hanauer, Lohre, and Blitz (2024) |
6. Implications for Research and Practice
The factor zoo debate carries direct implications for how researchers evaluate new signals and how practitioners build portfolios.
For researchers, the literature suggests several guidelines. New factor discoveries should clear a t-statistic threshold of at least 3.0 (Harvey, Liu, and Zhu, 2016). Out-of-sample validation, especially in international markets, is essential for establishing robustness (Jensen, Kelly, and Pedersen, 2023). Economic motivation, not just statistical significance, should underpin any claimed factor: those grounded in theory (such as the investment CAPM behind the q-factor model) tend to survive replication better than purely empirical discoveries. Open replication databases such as those provided by Chen and Zimmermann (2022) and Jensen, Kelly, and Pedersen (2023) make independent verification more practical than ever.
For practitioners, the key insight is that sparse factor models with three to five factors are insufficient to capture the cross-section of expected returns, but the full zoo is heavily redundant. Machine learning methods that extract the underlying latent structure, such as IPCA (Kelly, Pruitt, and Su, 2019) or deep factor models (Chen, Pelger, and Zhu, 2024), offer a principled alternative to hand-picking factors, though they are not immune to overfitting and require careful out-of-sample validation. The factor zoo debate also shapes the signal-to-weights pipeline: the choice of which signals to trust is the prerequisite question before any portfolio construction decision.
7. Conclusion
The factor zoo grew from a handful of anomalies in the 1980s to over 400 proposed factors by the 2020s. The replication crisis debate has clarified that the literature is neither fundamentally broken nor entirely reliable: the truth depends heavily on methodological choices about microcap inclusion, t-statistic thresholds, and the treatment of multiple testing. Pessimistic estimates suggest that only 35% of anomalies clear |t| > 1.96 and only 18% clear |t| > 2.78 once microcaps are downweighted (Hou, Xue, and Zhang, 2020), while optimistic estimates place the replication rate at 98% when factors are tested faithfully against their original methodology (Chen and Zimmermann, 2022).
The field has responded with increasingly sophisticated tools. Multiple testing adjustments raise the bar for new discoveries, shrinkage methods compress the cross-section into principal components, and machine learning extracts latent factor structures directly from data. The emerging consensus is that the effective dimensionality of the factor space lies somewhere around 10 to 15 factors, far fewer than the zoo's 400-plus inhabitants but more than the three to five of traditional models.
For the applied researcher and the quantitative portfolio manager alike, the factor zoo and the replication crisis are not merely academic curiosities. They determine which signals deserve a place in the portfolio construction pipeline and, ultimately, which risks are worth bearing.