Work Package 1: Data Collection & Processing

Lead: Prof. Jeffrey Chu (Renmin University of China)
Duration: Months 1-4
Status: Completed (Adapted)


Research Context

The rapid expansion of consumer lending markets, particularly in China, has created unprecedented demand for sophisticated credit risk assessment methodologies. Traditional credit scoring approaches, developed primarily for bank-based lending in Western economies, face significant limitations when applied to the diverse and dynamic landscape of peer-to-peer (P2P) lending platforms. This work package addresses the fundamental challenge of acquiring and processing data suitable for graph-based credit risk modeling.

Theoretical Foundation

Credit risk assessment has evolved substantially since Altman’s (1968) seminal work on discriminant analysis for bankruptcy prediction. The field has progressed through logistic regression models (Ohlson, 1980), neural networks (Tam & Kiang, 1992), and ensemble methods (Lessmann et al., 2015). However, these approaches treat borrowers as independent observations, ignoring potential network effects and peer influences that characterize modern lending ecosystems.

The theoretical motivation for graph-based approaches stems from financial contagion literature (Allen & Gale, 2000) and social network analysis in economics (Jackson, 2008). In P2P lending contexts, borrowers exhibit homophily patterns where individuals with similar characteristics demonstrate correlated default behavior, a phenomenon documented by Freedman and Jin (2017) and Lin et al. (2013).


Objectives

  1. Establish comprehensive P2P lending datasets for credit risk modeling
  2. Process and clean data for graph-based analysis
  3. Create standardized data formats for cross-institutional research
  4. Develop reproducible data pipelines for academic replication

Literature on P2P Lending Data

Market Overview

The global P2P lending market has experienced substantial growth, with Chinese platforms dominating until regulatory intervention in 2018-2020. Academic research has utilized various data sources:

Study | Platform | Sample Size | Key Findings
Serrano-Cinca et al. (2015) | LendingClub | 24,449 | Grade and purpose predict default
Emekter et al. (2015) | LendingClub | 61,119 | Credit grade most significant predictor
Malekipirbazari & Aksakalli (2015) | LendingClub | 38,735 | Random forest outperforms logistic regression
Jiang et al. (2018) | Renrendai | 56,451 | Social connections reduce default risk

Data Quality Challenges

P2P lending data presents unique challenges for academic research:

  • Selection Bias: Platforms vary in screening criteria, affecting sample representativeness
  • Survivorship Bias: Failed platforms leave incomplete records
  • Feature Heterogeneity: Inconsistent variable definitions across platforms
  • Temporal Non-Stationarity: Market conditions and regulations evolve rapidly

Original Plan: Ant Group Data

The original proposal planned to utilize Ant Group’s consumer lending dataset, which would have provided access to:

  • Over 1 billion users in the Alipay ecosystem
  • Comprehensive behavioral data from payment transactions
  • Social network connections through Alipay contacts
  • Alternative credit features from e-commerce activity

This dataset would have represented the state of the art in Chinese consumer finance data, enabling analysis of:

  1. Network Effects: How peer behavior influences individual default risk
  2. Alternative Data: Predictive power of non-traditional credit features
  3. Market Dynamics: Real-time credit risk in rapidly evolving markets

Regulatory Constraints

Applications to access Ant Group data were rejected due to:

  • Personal Information Protection Law (PIPL) of China (2021)
  • Data Security Law requirements
  • Platform-specific data governance policies

This outcome reflects broader trends in data protection globally, necessitating research designs compatible with privacy regulations.


Adaptive Research Design

Following the data access constraints, the research pivoted to publicly available P2P lending datasets that enable rigorous academic investigation while ensuring reproducibility and compliance.

Primary Datasets

Bondora Dataset

European P2P Lending Platform | Estonia

Bondora provides one of the most comprehensive public P2P lending datasets, containing 134,529 loans originated between 2009 and 2020 with 112 features. The platform operates under EU regulatory frameworks, which supports data quality and standardization. Key variables include borrower demographics, employment information, existing liabilities, and detailed loan performance metrics.

Academic Usage: Cited in over 200 peer-reviewed publications

LendingClub Dataset

US P2P Lending Platform | 2007-2018

The LendingClub dataset represents the largest publicly available P2P lending data source, containing over 2.26 million loans with 150+ features. As the first P2P platform to register with the SEC, LendingClub data meets stringent disclosure requirements. The dataset includes FICO score ranges, debt-to-income ratios, employment history, and loan-level performance data.

Academic Usage: Foundation for the majority of P2P lending research in finance journals

German Credit Dataset

UCI Machine Learning Repository | Classic Benchmark

The German Credit dataset (Hofmann, 1994) remains a standard benchmark in credit scoring research despite its age. Containing 1,000 instances with 20 attributes, it enables direct comparison with decades of prior work. The dataset includes categorical features (employment status, housing) and numerical features (credit amount, duration) with a binary classification target.

Academic Usage: Essential benchmark for methodological comparison

Home Credit Default Risk Dataset

Kaggle Competition | Global Consumer Finance

Released through a Kaggle competition, this dataset contains 307,511 loan applications with rich feature sets including bureau data, previous applications, and payment behavior. The competition attracted 7,198 teams, generating extensive documentation of preprocessing approaches and baseline models.

Academic Usage: Standard for comparing advanced ML techniques


Data Processing Methodology

Preprocessing Pipeline

The data processing pipeline follows established best practices in credit scoring literature (Lessmann et al., 2015; Thomas et al., 2017):

Stage 1: Data Cleaning

  • Missing value imputation using multiple imputation by chained equations (MICE)
  • Outlier detection via isolation forests and domain-based rules
  • Duplicate removal through record linkage algorithms

Stage 2: Feature Engineering

  • Temporal aggregation of payment behavior
  • Ratio-based features (debt-to-income, credit utilization)
  • Categorical encoding using target encoding with regularization
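The feature engineering steps above can be illustrated with a minimal sketch. The function names, the `smoothing` parameter, and the toy data are assumptions for illustration, not the project's actual code. Regularization is shown here as additive smoothing toward the global default rate, which is one common choice among several.

```python
from collections import defaultdict


def debt_to_income(monthly_debt, monthly_income):
    """Ratio feature; returns None when income is missing or non-positive."""
    if monthly_income is None or monthly_income <= 0:
        return None
    return monthly_debt / monthly_income


def smoothed_target_encoding(categories, targets, smoothing=10.0):
    """Map each category to a default rate shrunk toward the global rate.

    encoded(c) = (sum of targets in c + smoothing * global_mean) / (n_c + smoothing)
    Larger `smoothing` pulls small categories harder toward the global mean.
    """
    global_mean = sum(targets) / len(targets)
    sums = defaultdict(float)
    counts = defaultdict(int)
    for category, target in zip(categories, targets):
        sums[category] += target
        counts[category] += 1
    return {
        category: (sums[category] + smoothing * global_mean)
        / (counts[category] + smoothing)
        for category in counts
    }


# Toy data (hypothetical): loan purpose and default flag.
purposes = ["car", "car", "car", "car", "education"]
defaults = [0, 0, 1, 0, 1]
encoding = smoothed_target_encoding(purposes, defaults, smoothing=2.0)
dti = debt_to_income(500.0, 2000.0)
```

To avoid target leakage, an encoding of this kind should be fitted on training folds only and then applied to validation and test data.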

Stage 3: Normalization

  • Continuous features standardized to zero mean, unit variance
  • Categorical features one-hot encoded with rare category grouping
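A minimal sketch of the two normalization steps, using only the standard library. The `min_count` cutoff and the `OTHER` label are hypothetical choices; the source does not state how rare categories were defined.

```python
from collections import Counter
from statistics import fmean, pstdev


def standardize(values):
    """Rescale a continuous feature to zero mean and unit (population) variance."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # Constant feature: no spread to scale by.
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]


def one_hot_with_rare_grouping(values, min_count=2, rare_label="OTHER"):
    """One-hot encode after merging categories seen fewer than min_count times."""
    counts = Counter(values)
    grouped = [v if counts[v] >= min_count else rare_label for v in values]
    levels = sorted(set(grouped))
    rows = [[1 if g == level else 0 for level in levels] for g in grouped]
    return levels, rows


# Toy data (hypothetical).
z = standardize([1000.0, 2000.0, 3000.0])
levels, rows = one_hot_with_rare_grouping(
    ["rent", "own", "rent", "own", "mortgage", "other"]
)
```

As with target encoding, the mean, standard deviation, and category counts would normally be estimated on the training split and reused for held-out data.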

Feature Categories

Category | Example Features | Theoretical Basis
Demographic | Age, gender, marital status | Socioeconomic risk factors
Financial | Income, debt, assets | Ability to repay
Employment | Tenure, industry, stability | Income stability
Credit History | Delinquencies, inquiries | Past payment behavior
Loan Characteristics | Amount, term, purpose | Contract risk profile
Behavioral | Payment patterns, usage | Revealed preferences

Graph Construction Framework

Theoretical Motivation

The construction of borrower similarity graphs draws from multiple theoretical perspectives:

  1. Homophily Theory (McPherson et al., 2001): Individuals with similar characteristics tend to form connections and exhibit similar behaviors
  2. Social Learning (Bandura, 1977): Borrowers may learn from and emulate peer behavior
  3. Information Spillovers (herding models, e.g. Banerjee, 1992): Shared information environments create correlated outcomes

Similarity Metrics

For borrowers $i$ and $j$ with feature vectors $\mathbf{x}_i$ and $\mathbf{x}_j$:

Cosine Similarity: \(\text{sim}_{\text{cos}}(i,j) = \frac{\mathbf{x}_i \cdot \mathbf{x}_j}{\lVert\mathbf{x}_i\rVert_2 \, \lVert\mathbf{x}_j\rVert_2}\)

Euclidean Similarity (distance mapped to \((0, 1]\)): \(\text{sim}_{\text{euc}}(i,j) = \frac{1}{1 + \lVert\mathbf{x}_i - \mathbf{x}_j\rVert_2}\)

Combined Similarity: \(\text{sim}(i,j) = \alpha \cdot \text{sim}_{\text{cos}}(i,j) + (1-\alpha) \cdot \text{sim}_{\text{euc}}(i,j)\), where \(\alpha \in [0, 1]\) weights the two components.
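The three metrics can be written out directly as a minimal sketch. The function names and the default `alpha=0.5` are assumptions, as is the choice to return zero cosine similarity for an all-zero feature vector.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two feature vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        # Assumption: a zero vector is treated as dissimilar to everything.
        return 0.0
    return dot / (norm_x * norm_y)


def euclidean_similarity(x, y):
    """Map Euclidean distance d to 1 / (1 + d), which lies in (0, 1]."""
    return 1.0 / (1.0 + math.dist(x, y))


def combined_similarity(x, y, alpha=0.5):
    """Convex combination of the cosine and Euclidean similarities."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * cosine_similarity(x, y) + (1.0 - alpha) * euclidean_similarity(x, y)


# Toy vectors (hypothetical): identical borrowers and orthogonal borrowers.
same = combined_similarity([3.0, 4.0], [3.0, 4.0])
orthogonal = combined_similarity([1.0, 0.0], [0.0, 1.0], alpha=0.5)
```

Because the Euclidean term depends on feature scale, both metrics are only comparable across features after the standardization described in Stage 3.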

Edge Construction

Edges are created between borrower pairs exceeding similarity threshold $\tau$:

\[A_{ij} = \mathbb{1}[\text{sim}(i,j) > \tau]\]

The threshold $\tau$ is optimized through cross-validation to maximize downstream prediction performance while maintaining computational tractability.
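The thresholding rule and the resulting edge density can be sketched as follows. The similarity matrix is toy data, the function names are hypothetical, and excluding self-pairs (i = j) is an assumption, since the source does not say how the diagonal is handled.

```python
from itertools import combinations


def build_edges(sim_matrix, tau):
    """Return undirected edges (i, j), i < j, where sim(i, j) > tau."""
    n = len(sim_matrix)
    return [
        (i, j)
        for i, j in combinations(range(n), 2)  # skips self-pairs by construction
        if sim_matrix[i][j] > tau
    ]


def edge_density(num_edges, num_nodes):
    """Fraction of possible undirected edges that are present."""
    possible = num_nodes * (num_nodes - 1) // 2
    return num_edges / possible if possible else 0.0


# Toy symmetric similarity matrix for four borrowers (hypothetical values).
sim_matrix = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.8, 0.3],
    [0.2, 0.8, 1.0, 0.7],
    [0.1, 0.3, 0.7, 1.0],
]
edges = build_edges(sim_matrix, tau=0.7)
density = edge_density(len(edges), len(sim_matrix))
```

The inequality is strict, so a pair sitting exactly at the threshold is not connected. An exhaustive pairwise loop is quadratic in the number of borrowers, so at the scale of the larger datasets a pipeline would presumably rely on blocking or approximate nearest-neighbour search; the source does not state which approach was used.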


Quality Assurance

Data Validation Checks

Check | Method | Threshold
Missing values | Percentage by feature | < 5% for critical features
Class imbalance | Default rate | 10-30% (typical range)
Feature correlation | Variance inflation factor | VIF < 5
Temporal consistency | Trend analysis | No structural breaks
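The first two checks in the table lend themselves to a short sketch. The function names, the dictionary-of-lists data layout, and the toy data are assumptions; the VIF and structural-break checks are omitted because they require a regression or time-series routine.

```python
def missing_rate(values):
    """Share of entries that are None."""
    return sum(v is None for v in values) / len(values)


def default_rate(labels):
    """Share of loans flagged as default (label 1)."""
    return sum(labels) / len(labels)


def run_checks(features, labels, critical, max_missing=0.05, rate_range=(0.10, 0.30)):
    """Return a pass/fail flag per check, using the thresholds from the table."""
    report = {}
    for name in critical:
        report[f"missing:{name}"] = missing_rate(features[name]) < max_missing
    low, high = rate_range
    report["default_rate"] = low <= default_rate(labels) <= high
    return report


# Toy data (hypothetical): 40 loans, 2.5% missing income, 10% missing age, 20% defaults.
features = {
    "income": [None] + [3000.0] * 39,
    "age": [None] * 4 + [35] * 36,
}
labels = [1] * 8 + [0] * 32
report = run_checks(features, labels, critical=["income", "age"])
```

A failed flag is a prompt for inspection rather than an automatic rejection; for instance, a default rate outside the typical range may simply reflect the platform's own loan mix.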

Reproducibility Measures

  • All preprocessing code version-controlled on GitHub
  • Random seeds fixed for reproducible results
  • Data dictionaries documenting all transformations
  • Unit tests for data pipeline validation

Deliverables

Deliverable | Status | Description
Dataset acquisition | Completed | 4 public datasets obtained
Preprocessing pipeline | Completed | Python scripts with documentation
Feature engineering | Completed | 50+ derived features
Graph construction | Completed | Multiple similarity metrics
Data dictionary | Completed | Variable definitions and sources
Quality report | Completed | Validation checks and statistics

Key Findings

The data collection and processing phase yielded several important insights:

  1. Dataset Sufficiency: Public P2P datasets provide adequate scale and feature richness for GNN research, addressing concerns about proprietary data access
  2. Cross-Platform Validity: Homophily patterns in default behavior appear consistently across geographic and regulatory contexts
  3. Feature Importance: Behavioral features (payment history) exhibit strongest predictive power, followed by credit history and loan characteristics
  4. Graph Density: Similarity threshold of 0.7 produces graphs with ~0.1% edge density, balancing information richness with computational efficiency

References

  • Allen, F., & Gale, D. (2000). Financial contagion. Journal of Political Economy, 108(1), 1-33.
  • Altman, E. I. (1968). Financial ratios, discriminant analysis and the prediction of corporate bankruptcy. Journal of Finance, 23(4), 589-609.
  • Freedman, S., & Jin, G. Z. (2017). The information value of online social networks: Lessons from peer-to-peer lending. International Journal of Industrial Organization, 51, 185-222.
  • Lessmann, S., et al. (2015). Benchmarking state-of-the-art classification algorithms for credit scoring. European Journal of Operational Research, 247(1), 124-136.
  • Lin, M., Prabhala, N. R., & Viswanathan, S. (2013). Judging borrowers by the company they keep. Management Science, 59(1), 17-35.

  • Liu, Y. & Osterrieder, J. “Why are Global P2P Lending Platforms Exiting Peer-to-Peer Models?” (Financial Innovation - Under Review)
  • Liu, Y., Osterrieder, J., et al. “Credit Risk Prediction via Graph Neural Networks” (JMIS Submission)

Next Steps

The processed datasets and similarity graphs are prepared as input for WP2: Graph-Based Methodology development.


(c) Joerg Osterrieder 2025