Work Package 1: Data Collection & Processing
Lead: Prof. Jeffrey Chu (Renmin University of China) | Duration: Months 1-4 | Status: Completed (Adapted)
Research Context
The rapid expansion of consumer lending markets, particularly in China, has created unprecedented demand for sophisticated credit risk assessment methodologies. Traditional credit scoring approaches, developed primarily for bank-based lending in Western economies, face significant limitations when applied to the diverse and dynamic landscape of peer-to-peer (P2P) lending platforms. This work package addresses the fundamental challenge of acquiring and processing data suitable for graph-based credit risk modeling.
Theoretical Foundation
Credit risk assessment has evolved substantially since Altman’s (1968) seminal work on discriminant analysis for bankruptcy prediction. The field has progressed through logistic regression models (Ohlson, 1980), neural networks (Tam & Kiang, 1992), and ensemble methods (Lessmann et al., 2015). However, these approaches treat borrowers as independent observations, ignoring potential network effects and peer influences that characterize modern lending ecosystems.
The theoretical motivation for graph-based approaches stems from financial contagion literature (Allen & Gale, 2000) and social network analysis in economics (Jackson, 2008). In P2P lending contexts, borrowers exhibit homophily patterns where individuals with similar characteristics demonstrate correlated default behavior, a phenomenon documented by Freedman and Jin (2017) and Lin et al. (2013).
Objectives
- Establish comprehensive P2P lending datasets for credit risk modeling
- Process and clean data for graph-based analysis
- Create standardized data formats for cross-institutional research
- Develop reproducible data pipelines for academic replication
Literature on P2P Lending Data
Market Overview
The global P2P lending market has experienced substantial growth, with Chinese platforms dominating until regulatory intervention in 2018-2020. Academic research has utilized various data sources:
| Study | Platform | Sample Size | Key Findings |
|---|---|---|---|
| Serrano-Cinca et al. (2015) | LendingClub | 24,449 | Grade and purpose predict default |
| Emekter et al. (2015) | LendingClub | 61,119 | Credit grade most significant predictor |
| Malekipirbazari & Aksakalli (2015) | LendingClub | 38,735 | Random forest outperforms logistic regression |
| Jiang et al. (2018) | Renrendai | 56,451 | Social connections reduce default risk |
Data Quality Challenges
P2P lending data presents unique challenges for academic research:
- Selection Bias: Platforms vary in screening criteria, affecting sample representativeness
- Survivorship Bias: Failed platforms leave incomplete records
- Feature Heterogeneity: Inconsistent variable definitions across platforms
- Temporal Non-Stationarity: Market conditions and regulations evolve rapidly
Original Plan: Ant Group Data
The original proposal planned to utilize Ant Group’s consumer lending dataset, which would have provided access to:
- Over 1 billion users in the Alipay ecosystem
- Comprehensive behavioral data from payment transactions
- Social network connections through Alipay contacts
- Alternative credit features from e-commerce activity
This dataset would have represented the state-of-the-art in Chinese consumer finance data, enabling analysis of:
- Network Effects: How peer behavior influences individual default risk
- Alternative Data: Predictive power of non-traditional credit features
- Market Dynamics: Real-time credit risk in rapidly evolving markets
Regulatory Constraints
Applications to access Ant Group data were rejected due to:
- Personal Information Protection Law (PIPL) of China (2021)
- Data Security Law requirements
- Platform-specific data governance policies
This outcome reflects broader trends in data protection globally, necessitating research designs compatible with privacy regulations.
Adaptive Research Design
Following the data access constraints, the research pivoted to publicly available P2P lending datasets that enable rigorous academic investigation while ensuring reproducibility and compliance.
Primary Datasets
Bondora Dataset
European P2P Lending Platform | Estonia
Bondora provides one of the most comprehensive public P2P lending datasets, containing 134,529 loans originated between 2009 and 2020, described by 112 features. The platform operates under EU regulatory frameworks, ensuring data quality and standardization. Key variables include borrower demographics, employment information, existing liabilities, and detailed loan performance metrics.
Academic Usage: Cited in over 200 peer-reviewed publications
LendingClub Dataset
US P2P Lending Platform | 2007-2018
The LendingClub dataset represents the largest publicly available P2P lending data source, containing over 2.26 million loans with 150+ features. As the first P2P platform to register with the SEC, LendingClub data meets stringent disclosure requirements. The dataset includes FICO score ranges, debt-to-income ratios, employment history, and loan-level performance data.
Academic Usage: Foundation for the majority of P2P lending research in finance journals
German Credit Dataset
UCI Machine Learning Repository | Classic Benchmark
The German Credit dataset (Hofmann, 1994) remains a standard benchmark in credit scoring research despite its age. Containing 1,000 instances with 20 attributes, it enables direct comparison with decades of prior work. The dataset includes categorical features (employment status, housing) and numerical features (credit amount, duration) with binary classification targets.
Academic Usage: Essential benchmark for methodological comparison
Home Credit Default Risk Dataset
Kaggle Competition | Global Consumer Finance
Released through a Kaggle competition, this dataset contains 307,511 loan applications with rich feature sets including bureau data, previous applications, and payment behavior. The competition attracted 7,198 teams, generating extensive documentation of preprocessing approaches and baseline models.
Academic Usage: Standard for comparing advanced ML techniques
Data Processing Methodology
Preprocessing Pipeline
The data processing pipeline follows established best practices in credit scoring literature (Lessmann et al., 2015; Thomas et al., 2017):
Stage 1: Data Cleaning
- Missing value imputation using multiple imputation by chained equations (MICE)
- Outlier detection via isolation forests and domain-based rules
- Duplicate removal through record linkage algorithms
Stage 2: Feature Engineering
- Temporal aggregation of payment behavior
- Ratio-based features (debt-to-income, credit utilization)
- Categorical encoding using target encoding with regularization
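The regularized target encoding mentioned above can be sketched as smoothed target encoding, where per-category default rates are shrunk toward the global mean to stabilize rare categories. Column names and the smoothing strength `m` are illustrative, not taken from the project code:

```python
import pandas as pd

def smoothed_target_encode(df, cat_col, target_col, m=10.0):
    """Encode a categorical column by its target mean, shrunk toward the
    global mean; larger m means more shrinkage for rare categories."""
    global_mean = df[target_col].mean()
    stats = df.groupby(cat_col)[target_col].agg(["mean", "count"])
    smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
    return df[cat_col].map(smoothed)

# Toy example: loan purpose vs. default flag
df = pd.DataFrame({
    "purpose": ["car", "car", "home", "home", "home", "travel"],
    "default": [1, 0, 0, 0, 1, 1],
})
df["purpose_te"] = smoothed_target_encode(df, "purpose", "default", m=2.0)
```

The rare category "travel" (one observation, 100% default) is pulled toward the 50% global default rate, which is exactly the regularization effect the pipeline relies on.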
Stage 3: Normalization
- Continuous features standardized to zero mean, unit variance
- Categorical features one-hot encoded with rare category grouping
Feature Categories
| Category | Example Features | Theoretical Basis |
|---|---|---|
| Demographic | Age, gender, marital status | Socioeconomic risk factors |
| Financial | Income, debt, assets | Ability to repay |
| Employment | Tenure, industry, stability | Income stability |
| Credit History | Delinquencies, inquiries | Past payment behavior |
| Loan Characteristics | Amount, term, purpose | Contract risk profile |
| Behavioral | Payment patterns, usage | Revealed preferences |
Graph Construction Framework
Theoretical Motivation
The construction of borrower similarity graphs draws from multiple theoretical perspectives:
- Homophily Theory (McPherson et al., 2001): Individuals with similar characteristics tend to form connections and exhibit similar behaviors
- Social Learning (Bandura, 1977): Borrowers may learn from and emulate peer behavior
- Information Spillovers (Banerjee, 1992): Shared information environments create correlated outcomes through herding
Similarity Metrics
For borrowers $i$ and $j$ with feature vectors $\mathbf{x}_i$ and $\mathbf{x}_j$:
Cosine Similarity: \(\text{sim}_{\text{cos}}(i,j) = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{||\mathbf{x}_i|| \cdot ||\mathbf{x}_j||}\)
Euclidean Distance (normalized): \(\text{sim}_{\text{euc}}(i,j) = \frac{1}{1 + ||\mathbf{x}_i - \mathbf{x}_j||_2}\)
Combined Similarity: \(\text{sim}(i,j) = \alpha \cdot \text{sim}_{\text{cos}}(i,j) + (1-\alpha) \cdot \text{sim}_{\text{euc}}(i,j)\)
Edge Construction
Edges are created between borrower pairs exceeding similarity threshold $\tau$:
\[A_{ij} = \mathbb{1}[\text{sim}(i,j) > \tau]\]
The threshold $\tau$ is optimized through cross-validation to maximize downstream prediction performance while maintaining computational tractability.
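The combined similarity and thresholding rule above can be sketched in NumPy; the values of $\alpha$ and $\tau$ here are illustrative defaults, not the cross-validated ones:

```python
import numpy as np

def similarity_graph(X, alpha=0.5, tau=0.7):
    """Binary adjacency matrix A_ij = 1[sim(i,j) > tau] from combined
    cosine / normalized-Euclidean similarity of borrower features X (n, d)."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    cos = (X @ X.T) / (norms * norms.T)                   # cosine similarity
    sq = np.sum(X ** 2, axis=1)
    dist = np.sqrt(np.maximum(sq[:, None] + sq[None, :] - 2 * X @ X.T, 0.0))
    euc = 1.0 / (1.0 + dist)                              # 1 / (1 + ||x_i - x_j||)
    sim = alpha * cos + (1 - alpha) * euc                 # combined similarity
    A = (sim > tau).astype(int)
    np.fill_diagonal(A, 0)                                # no self-loops
    return A

# Three borrowers: the first two are similar, the third is dissimilar
X = np.array([[1.0, 0.0], [0.9, 0.1], [-1.0, 0.0]])
A = similarity_graph(X)
```

The resulting symmetric adjacency matrix connects only the first two borrowers; edge density can then be computed as `A.sum() / (len(A) * (len(A) - 1))`.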
Quality Assurance
Data Validation Checks
| Check | Method | Threshold |
|---|---|---|
| Missing values | Percentage by feature | <5% for critical features |
| Class imbalance | Default rate | 10-30% typical range |
| Feature correlation | Variance inflation factor | VIF < 5 |
| Temporal consistency | Trend analysis | No structural breaks |
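The VIF check in the table follows directly from its definition, $\text{VIF}_j = 1/(1 - R_j^2)$, where $R_j^2$ comes from regressing feature $j$ on the remaining features. A self-contained sketch (not the project's validation code):

```python
import numpy as np

def vif(X):
    """Variance inflation factor for each column of feature matrix X (n, d):
    VIF_j = 1 / (1 - R_j^2), R_j^2 from regressing column j on the others."""
    n, d = X.shape
    out = np.empty(d)
    for j in range(d):
        y = X[:, j]
        Z = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])  # intercept + others
        beta, *_ = np.linalg.lstsq(Z, y, rcond=None)
        resid = y - Z @ beta
        r2 = 1 - resid @ resid / np.sum((y - y.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

rng = np.random.default_rng(0)
a = rng.normal(size=200)
b = rng.normal(size=200)
X = np.column_stack([a, b, a + 0.1 * rng.normal(size=200)])  # col 2 nearly collinear with col 0
vifs = vif(X)
```

Columns 0 and 2 fail the VIF < 5 threshold because they are nearly collinear, while the independent column 1 passes; one of the collinear features would be dropped or combined.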
Reproducibility Measures
- All preprocessing code version-controlled on GitHub
- Random seeds fixed for reproducible results
- Data dictionaries documenting all transformations
- Unit tests for data pipeline validation
Deliverables
| Deliverable | Status | Description |
|---|---|---|
| Dataset acquisition | Completed | 4 public datasets obtained |
| Preprocessing pipeline | Completed | Python scripts with documentation |
| Feature engineering | Completed | 50+ derived features |
| Graph construction | Completed | Multiple similarity metrics |
| Data dictionary | Completed | Variable definitions and sources |
| Quality report | Completed | Validation checks and statistics |
Key Findings
The data collection and processing phase yielded several important insights:
- Dataset Sufficiency: Public P2P datasets provide adequate scale and feature richness for GNN research, addressing concerns about proprietary data access
- Cross-Platform Validity: Homophily patterns in default behavior appear consistently across geographic and regulatory contexts
- Feature Importance: Behavioral features (payment history) exhibit strongest predictive power, followed by credit history and loan characteristics
- Graph Density: Similarity threshold of 0.7 produces graphs with ~0.1% edge density, balancing information richness with computational efficiency
References
- Allen, F., & Gale, D. (2000). Financial contagion. Journal of Political Economy, 108(1), 1-33.
- Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23(4), 589-609.
- Freedman, S., & Jin, G. Z. (2017). The information value of online social networks: Lessons from peer-to-peer lending. International Journal of Industrial Organization, 51, 185-222.
- Lessmann, S., et al. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring. European Journal of Operational Research, 247(1), 124-136.
- Lin, M., Prabhala, N. R., & Viswanathan, S. (2013). Judging borrowers by the company they keep. Management Science, 59(1), 17-35.
Related Publications
- Liu, Y. & Osterrieder, J. “Why are Global P2P Lending Platforms Exiting Peer-to-Peer Models?” (Financial Innovation - Under Review)
- Liu, Y., Osterrieder, J., et al. “Credit Risk Prediction via Graph Neural Networks” (JMIS Submission)
Next Steps
Data prepared for WP2: Graph-Based Methodology development.
(c) Joerg Osterrieder 2025