Work Package 1: Data Collection & Processing
Lead: Prof. Jeffrey Chu (Renmin University of China) | Duration: Months 1-4 | Status: Completed (Adapted)
Research Context
The rapid expansion of consumer lending markets, particularly in China, has created unprecedented demand for sophisticated credit risk assessment methodologies. Traditional credit scoring approaches, developed primarily for bank-based lending in Western economies, face significant limitations when applied to the diverse and dynamic landscape of peer-to-peer (P2P) lending platforms. This work package addresses the fundamental challenge of acquiring and processing data suitable for graph-based credit risk modeling.
Theoretical Foundation
Credit risk assessment has evolved substantially since Altman’s (1968) seminal work on discriminant analysis for bankruptcy prediction. The field has progressed through logistic regression models (Ohlson, 1980), neural networks (Tam & Kiang, 1992), and ensemble methods (Lessmann et al., 2015). However, these approaches treat borrowers as independent observations, ignoring potential network effects and peer influences that characterize modern lending ecosystems.
The theoretical motivation for graph-based approaches stems from financial contagion literature (Allen & Gale, 2000) and social network analysis in economics (Jackson, 2008). In P2P lending contexts, borrowers exhibit homophily patterns where individuals with similar characteristics demonstrate correlated default behavior, a phenomenon documented by Freedman and Jin (2017) and Lin et al. (2013).
Objectives
- Establish comprehensive P2P lending datasets for credit risk modeling
- Process and clean data for graph-based analysis
- Create standardized data formats for cross-institutional research
- Develop reproducible data pipelines for academic replication
Literature on P2P Lending Data
Market Overview
The global P2P lending market has experienced substantial growth, with Chinese platforms dominating until regulatory intervention in 2018-2020. Academic research has utilized various data sources:
| Study | Platform | Sample Size | Key Findings |
|---|---|---|---|
| Serrano-Cinca et al. (2015) | LendingClub | 24,449 | Grade and purpose predict default |
| Emekter et al. (2015) | LendingClub | 61,119 | Credit grade most significant predictor |
| Malekipirbazari & Aksakalli (2015) | LendingClub | 38,735 | Random forest outperforms logistic regression |
| Jiang et al. (2018) | Renrendai | 56,451 | Social connections reduce default risk |
Data Quality Challenges
P2P lending data presents unique challenges for academic research:
- Selection Bias: Platforms vary in screening criteria, affecting sample representativeness
- Survivorship Bias: Failed platforms leave incomplete records
- Feature Heterogeneity: Inconsistent variable definitions across platforms
- Temporal Non-Stationarity: Market conditions and regulations evolve rapidly
Original Plan: Ant Group Data
The original proposal planned to utilize Ant Group’s consumer lending dataset, which would have provided access to:
- Over 1 billion users in the Alipay ecosystem
- Comprehensive behavioral data from payment transactions
- Social network connections through Alipay contacts
- Alternative credit features from e-commerce activity
This dataset would have represented the state-of-the-art in Chinese consumer finance data, enabling analysis of:
- Network Effects: How peer behavior influences individual default risk
- Alternative Data: Predictive power of non-traditional credit features
- Market Dynamics: Real-time credit risk in rapidly evolving markets
Regulatory Constraints
Applications to access Ant Group data were rejected due to:
- Personal Information Protection Law (PIPL) of China (2021)
- Data Security Law requirements
- Platform-specific data governance policies
This outcome reflects broader trends in data protection globally, necessitating research designs compatible with privacy regulations.
Adaptive Research Design
Following the data access constraints, the research pivoted to publicly available P2P lending datasets that enable rigorous academic investigation while ensuring reproducibility and compliance.
Primary Datasets
Bondora Dataset
European P2P Lending Platform | Estonia
Bondora provides one of the most comprehensive public P2P lending datasets, containing 134,529 loans originated between 2009 and 2020, described by 112 features. The platform operates under EU regulatory frameworks, ensuring data quality and standardization. Key variables include borrower demographics, employment information, existing liabilities, and detailed loan performance metrics.
Academic Usage: Cited in over 200 peer-reviewed publications
LendingClub Dataset
US P2P Lending Platform | 2007-2018
The LendingClub dataset represents the largest publicly available P2P lending data source, containing over 2.26 million loans with 150+ features. As the first P2P platform to register with the SEC, LendingClub data meets stringent disclosure requirements. The dataset includes FICO score ranges, debt-to-income ratios, employment history, and loan-level performance data.
Academic Usage: Foundation for the majority of P2P lending research in finance journals
German Credit Dataset
UCI Machine Learning Repository | Classic Benchmark
The German Credit dataset (Hofmann, 1994) remains a standard benchmark in credit scoring research despite its age. Containing 1,000 instances with 20 attributes, it enables direct comparison with decades of prior work. The dataset includes categorical features (employment status, housing) and numerical features (credit amount, duration) with binary classification targets.
Academic Usage: Essential benchmark for methodological comparison
Home Credit Default Risk Dataset
Kaggle Competition | Global Consumer Finance
Released through a Kaggle competition, this dataset contains 307,511 loan applications with rich feature sets including bureau data, previous applications, and payment behavior. The competition attracted 7,198 teams, generating extensive documentation of preprocessing approaches and baseline models.
Academic Usage: Standard for comparing advanced ML techniques
Data Processing Methodology
Preprocessing Pipeline
The data processing pipeline follows established best practices in credit scoring literature (Lessmann et al., 2015; Thomas et al., 2017):
Stage 1: Data Cleaning
- Missing value imputation using multiple imputation by chained equations (MICE)
- Outlier detection via isolation forests and domain-based rules
- Duplicate removal through record linkage algorithms
Stage 2: Feature Engineering
- Temporal aggregation of payment behavior
- Ratio-based features (debt-to-income, credit utilization)
- Categorical encoding using target encoding with regularization
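The regularized target encoding mentioned above can be sketched as smoothed target encoding, where per-category default rates are shrunk toward the global mean to stabilize rare categories. Column names and the smoothing strength `m` are illustrative, not taken from the project code:

```python
import pandas as pd

def smoothed_target_encode(df, cat_col, target_col, m=10.0):
    """Encode a categorical column by its target mean, shrunk toward the
    global mean; larger m means more shrinkage for rare categories."""
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return df[cat_col].map(smoothed)

# Toy example: loan purpose vs. default flag
df = pd.DataFrame({
    "purpose": ["car", "car", "home", "home", "home", "travel"],
    "default": [1, 0, 0, 0, 1, 1],
})
df["purpose_te"] = smoothed_target_encode(df, "purpose", "default", m=2.0)
```

The rare category "travel" (one observation, 100% default) is pulled toward the 50% global default rate, which is exactly the regularization effect the pipeline relies on.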
Stage 3: Normalization
- Continuous features standardized to zero mean, unit variance
- Categorical features one-hot encoded with rare category grouping
Feature Categories
| Category | Example Features | Theoretical Basis |
|---|---|---|
| Demographic | Age, gender, marital status | Socioeconomic risk factors |
| Financial | Income, debt, assets | Ability to repay |
| Employment | Tenure, industry, stability | Income stability |
| Credit History | Delinquencies, inquiries | Past payment behavior |
| Loan Characteristics | Amount, term, purpose | Contract risk profile |
| Behavioral | Payment patterns, usage | Revealed preferences |
Graph Construction Framework
Theoretical Motivation
The construction of borrower similarity graphs draws from multiple theoretical perspectives:
- Homophily Theory (McPherson et al., 2001): Individuals with similar characteristics tend to form connections and exhibit similar behaviors
- Social Learning (Bandura, 1977): Borrowers may learn from and emulate peer behavior
- Information Spillovers (Banerjee, 1992): Shared information environments create correlated outcomes through herding
Similarity Metrics
For borrowers $i$ and $j$ with feature vectors $\mathbf{x}_i$ and $\mathbf{x}_j$:
Cosine Similarity: \(\text{sim}_{\text{cos}}(i,j) = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{||\mathbf{x}_i|| \cdot ||\mathbf{x}_j||}\)
Euclidean Distance (normalized): \(\text{sim}_{\text{euc}}(i,j) = \frac{1}{1 + ||\mathbf{x}_i - \mathbf{x}_j||_2}\)
Combined Similarity: \(\text{sim}(i,j) = \alpha \cdot \text{sim}_{\text{cos}}(i,j) + (1-\alpha) \cdot \text{sim}_{\text{euc}}(i,j)\)
Edge Construction
Edges are created between borrower pairs exceeding similarity threshold $\tau$:
\[A_{ij} = \mathbb{1}[\text{sim}(i,j) > \tau]\]
The threshold $\tau$ is optimized through cross-validation to maximize downstream prediction performance while maintaining computational tractability.
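The combined similarity and thresholding rule above can be sketched in NumPy; the values of $\alpha$ and $\tau$ here are illustrative defaults, not the cross-validated ones:

```python
import numpy as np

def similarity_graph(X, alpha=0.5, tau=0.7):
    """Binary adjacency matrix A_ij = 1[sim(i,j) > tau] from combined
    cosine / normalized-Euclidean similarity of borrower features X (n, d)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    cos = (X @ X.T) / (norms * norms.T)                   # cosine similarity
    sq = np.sum(X ** 2, axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
    euc = 1.0 / (1.0 + dist)                              # 1 / (1 + ||x_i - x_j||)
    sim = alpha * cos + (1 - alpha) * euc                 # combined similarity
    A = (sim > tau).astype(int)
    np.fill_diagonal(A, 0)                                # no self-loops
    return A

# Three borrowers: the first two are similar, the third is dissimilar
X = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
A = similarity_graph(X)
```

The resulting symmetric adjacency matrix connects only the first two borrowers; edge density can then be computed as `A.sum() / (len(A) * (len(A) - 1))`.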
Quality Assurance
Data Validation Checks
| Check | Method | Threshold |
|---|---|---|
| Missing values | Percentage by feature | <5% for critical features |
| Class imbalance | Default rate | 10-30% typical range |
| Feature correlation | Variance inflation factor | VIF < 5 |
| Temporal consistency | Trend analysis | No structural breaks |
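The VIF check in the table follows directly from its definition, $\text{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the remaining features. A self-contained sketch (not the project's validation code):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of feature matrix X (n, d):
    VIF_j = 1 / (1 - R_j^2), R_j^2 from regressing column j on the others."""
    n, d = X.shape
    out = np.empty(d)
    for j in range(d):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])  # intercept + others
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.1 * rng.normal(size=200)])  # col 2 nearly collinear with col 0
vifs = vif(X)
```

Columns 0 and 2 fail the VIF < 5 threshold because they are nearly collinear, while the independent column 1 passes; one of the collinear features would be dropped or combined.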
Reproducibility Measures
- All preprocessing code version-controlled on GitHub
- Random seeds fixed for reproducible results
- Data dictionaries documenting all transformations
- Unit tests for data pipeline validation
Deliverables
| Deliverable | Status | Description |
|---|---|---|
| Dataset acquisition | Completed | 4 public datasets obtained |
| Preprocessing pipeline | Completed | Python scripts with documentation |
| Feature engineering | Completed | 50+ derived features |
| Graph construction | Completed | Multiple similarity metrics |
| Data dictionary | Completed | Variable definitions and sources |
| Quality report | Completed | Validation checks and statistics |
Key Findings
The data collection and processing phase yielded several important insights:
- Dataset Sufficiency: Public P2P datasets provide adequate scale and feature richness for GNN research, addressing concerns about proprietary data access
- Cross-Platform Validity: Homophily patterns in default behavior appear consistently across geographic and regulatory contexts
- Feature Importance: Behavioral features (payment history) exhibit strongest predictive power, followed by credit history and loan characteristics
- Graph Density: Similarity threshold of 0.7 produces graphs with ~0.1% edge density, balancing information richness with computational efficiency
References
- Allen, F., & Gale, D. (2000). Financial contagion. Journal of Political Economy, 108(1), 1-33.
- Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23(4), 589-609.
- Freedman, S., & Jin, G. Z. (2017). The information value of online social networks: Lessons from peer-to-peer lending. International Journal of Industrial Organization, 51, 185-222.
- Lessmann, S., et al. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring. European Journal of Operational Research, 247(1), 124-136.
- Lin, M., Prabhala, N. R., & Viswanathan, S. (2013). Judging borrowers by the company they keep. Management Science, 59(1), 17-35.
Related Publications
- Liu, Y. & Osterrieder, J. “Why are Global P2P Lending Platforms Exiting Peer-to-Peer Models?” (Financial Innovation - Under Review)
- Liu, Y., Osterrieder, J., et al. “Credit Risk Prediction via Graph Neural Networks” (JMIS Submission)
Next Steps
Data prepared for WP2: Graph-Based Methodology development.
(c) Joerg Osterrieder 2025