Assignment Overview

Groups of 3 students work together throughout the semester to apply data science techniques to a finance-related dataset of their choice. Every project must include Exploratory Data Analysis (Module 3) plus 4 technical topics — one from each of four different modules (choose 4 out of Modules 4–8). The project culminates in a live in-class presentation (L45/L46/L48) and a GitHub repository containing all code and slides.

  • Find and use your own finance dataset (stocks, crypto, credit, macro, ESG, etc.)
  • Perform EDA using Module 3 techniques (mandatory for every project)
  • Apply 4 additional techniques — one from each of 4 different modules (M4–M8)
  • Submit via GitHub: Jupyter notebooks + PowerPoint slides
  • Present live in class (10–15 minutes per group)
  • Peer review another group's submission (counts toward your grade)
  • All deliverables in English

Topic Selection

Every project includes EDA (mandatory) plus 4 topics chosen from 4 different modules. You skip one of Modules 4–8 — choose the combination that best fits your research question.

MANDATORY: Exploratory Data Analysis (Module 3: L13–L20)

Every project must begin with a thorough EDA. Use any combination of Module 3 techniques to understand your dataset before applying ML methods.

| Technique | Lesson | What to Demonstrate |
|---|---|---|
| Descriptive Statistics | L13 | Summary stats, distribution shapes, outlier detection, skewness/kurtosis |
| Distributions | L14 | Fit distributions to data, QQ-plots, compare empirical vs. theoretical |
| Hypothesis Testing | L15 | t-tests, normality tests, significance of observed patterns |
| Correlation Analysis | L16 | Correlation matrix, significance testing, spurious correlation awareness |
| Matplotlib & Seaborn | L17–L18 | Publication-quality charts, heatmaps, pairplots, distribution plots |
| Data Storytelling | L20 | Narrative-driven visualizations that communicate findings clearly |
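
As an illustration of the Descriptive Statistics row, here is a minimal sketch of how skewness, excess kurtosis, and IQR-based outlier flags could be computed. The function names `describe_returns` and `iqr_outliers` are hypothetical, the data is synthetic, and the 1.5 × IQR rule is only one common convention, not a course requirement.

```python
import numpy as np


def describe_returns(returns):
    """Return mean, std, skewness and excess kurtosis of a 1-D return series."""
    x = np.asarray(returns, dtype=float)
    mean = x.mean()
    centered = x - mean
    m2 = np.mean(centered ** 2)  # population variance
    m3 = np.mean(centered ** 3)
    m4 = np.mean(centered ** 4)
    return {
        "mean": mean,
        "std": np.sqrt(m2),
        "skewness": m3 / m2 ** 1.5,
        "excess_kurtosis": m4 / m2 ** 2 - 3.0,  # 0 for a normal distribution
    }


def iqr_outliers(returns, k=1.5):
    """Boolean mask of points outside [Q1 - k*IQR, Q3 + k*IQR]."""
    x = np.asarray(returns, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


# Synthetic heavy-tailed "returns" stand in for a real dataset here.
rng = np.random.default_rng(0)
fake_returns = 0.01 * rng.standard_t(df=4, size=1000)
stats = describe_returns(fake_returns)
n_outliers = int(iqr_outliers(fake_returns).sum())
print(stats, n_outliers)
```

On real data, a clearly positive excess kurtosis is a hint that a normal distribution understates tail risk, which is worth checking before applying methods that assume normality.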

Choose 1 topic from each of 4 different modules:

Module 4: Regression

| Topic | Lesson | What to Demonstrate |
|---|---|---|
| Linear Regression | L21 | OLS regression, coefficient interpretation, R², residual analysis |
| Regularization | L22 | Ridge/Lasso comparison, cross-validated lambda selection |
| Regression Metrics | L23 | MSE, RMSE, MAE, R², cross-validation comparison |
| Factor Models | L24 | Multi-factor regression, Fama–French style analysis |
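
For the Regularization row, the sketch below shows one possible way to compare Ridge and Lasso with a cross-validated penalty. It assumes scikit-learn, which calls the penalty `alpha` rather than lambda. The data, the candidate grid, and the 5-fold setting are illustrative choices only.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first two of ten features drive the target.
rng = np.random.default_rng(42)
X = rng.normal(size=(200, 10))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

alphas = np.logspace(-3, 2, 30)  # candidate penalty strengths (lambda)

# Standardize first so the penalty treats all features on the same scale.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas, cv=5)).fit(X, y)
lasso = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5)).fit(X, y)

ridge_coef = ridge[-1].coef_
lasso_coef = lasso[-1].coef_
print("Ridge lambda:", ridge[-1].alpha_)
print("Lasso lambda:", lasso[-1].alpha_)
print("Features kept by Lasso:", np.flatnonzero(lasso_coef))
```

Ridge shrinks all coefficients toward zero, while Lasso can set some exactly to zero, so comparing the two coefficient vectors side by side is a natural way to present the result.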

Module 5: Classification

| Topic | Lesson | What to Demonstrate |
|---|---|---|
| Logistic Regression | L25 | Binary classification, odds ratios, probability calibration |
| Decision Trees | L26 | Tree building, Random Forest, feature importance |
| Classification Metrics | L27 | Confusion matrix, ROC-AUC, precision/recall, threshold tuning |
| Class Imbalance | L28 | SMOTE, class weights, stratified CV, PR curves |
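
Several of these rows can be combined in a few lines. The sketch below assumes scikit-learn and synthetic imbalanced data. It uses a stratified split and class weights (not SMOTE), then reports ROC-AUC, odds ratios, and a confusion matrix at a 0.5 threshold. All numbers in the data-generating step are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: one informative feature, one pure-noise feature.
rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 2))
logit = -1.5 + 1.2 * X[:, 0]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# Stratify so train and test keep the same share of the minority class.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

clf = LogisticRegression(class_weight="balanced").fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]

auc = roc_auc_score(y_te, proba)
odds_ratios = np.exp(clf.coef_[0])  # multiplicative change in odds per unit
cm = confusion_matrix(y_te, (proba >= 0.5).astype(int))
print(f"AUC={auc:.3f}", "odds ratios:", odds_ratios)
print(cm)
```

Changing the 0.5 cutoff and recomputing the confusion matrix is the simplest form of threshold tuning.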

Module 6: Unsupervised

| Topic | Lesson | What to Demonstrate |
|---|---|---|
| KMeans Clustering | L29 | Elbow method, silhouette score, cluster interpretation |
| Hierarchical Clustering | L30 | Dendrograms, linkage methods, correlation-based clustering |
| PCA | L31 | Dimensionality reduction, scree plots, loadings interpretation |
| ML Pipeline | L32 | sklearn Pipeline, cross-validation, hyperparameter tuning |
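
The Pipeline, PCA, and KMeans rows fit together naturally. The following is a sketch under stated assumptions: scikit-learn, synthetic data with three obvious groups, and the silhouette score (rather than the elbow method) used to choose the number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.metrics import silhouette_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: three well-separated groups in five dimensions.
rng = np.random.default_rng(1)
centers = np.array([[0.0] * 5, [8.0] * 5, [-8.0] * 5])
X = np.vstack([c + rng.normal(size=(100, 5)) for c in centers])

pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=2)),
    ("kmeans", KMeans(n_clusters=3, n_init=10, random_state=0)),
])

# Try several cluster counts and keep the one with the best silhouette score.
scores = {}
for k in range(2, 6):
    pipe.set_params(kmeans__n_clusters=k)
    labels_k = pipe.fit_predict(X)
    scores[k] = silhouette_score(pipe[:-1].transform(X), labels_k)
best_k = max(scores, key=scores.get)

# Refit with the selected k and inspect the PCA step.
pipe.set_params(kmeans__n_clusters=best_k)
labels = pipe.fit_predict(X)
explained = pipe.named_steps["pca"].explained_variance_ratio_
print(best_k, scores, explained)
```

Keeping scaling and PCA inside the pipeline means they are refit together with the clustering step, which avoids accidentally reusing a transformation fitted on different data.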

Module 7: Deep Learning

| Topic | Lesson | What to Demonstrate |
|---|---|---|
| Perceptron | L33 | Single-layer neural network, linear separability, convergence |
| MLP & Activations | L34 | Multi-layer network, activation functions, hidden layers |
| Backpropagation | L35 | Gradient descent, learning rate, loss curves |
| Overfitting Prevention | L36 | Dropout, early stopping, regularization, validation curves |
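
To make the Perceptron row concrete, here is a small sketch of the classic perceptron learning rule in plain NumPy. It is not tied to whichever deep learning framework the course uses. The function name `train_perceptron` is hypothetical, and the data is synthetic and constructed to be linearly separable with a margin, which is the condition under which the rule is guaranteed to converge.

```python
import numpy as np


def train_perceptron(X, y, lr=1.0, max_epochs=1000):
    """Classic perceptron rule. Labels y must be in {-1, +1}.

    Returns (weights, bias, epochs_used, converged).
    """
    w = np.zeros(X.shape[1])
    b = 0.0
    for epoch in range(1, max_epochs + 1):
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (xi @ w + b) <= 0:  # misclassified (or on the boundary)
                w += lr * yi * xi
                b += lr * yi
                mistakes += 1
        if mistakes == 0:  # a full pass without errors: converged
            return w, b, epoch, True
    return w, b, max_epochs, False


# Synthetic, linearly separable data: the label is the sign of x1 - x2,
# and points too close to the boundary are dropped to guarantee a margin.
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
X = X[np.abs(X[:, 0] - X[:, 1]) > 0.5]
y = np.where(X[:, 0] - X[:, 1] > 0, 1, -1)

w, b, epochs, converged = train_perceptron(X, y)
accuracy = np.mean(np.sign(X @ w + b) == y)
print(converged, epochs, accuracy)
```

Removing the margin filter, or flipping a few labels, is an easy way to demonstrate what happens when the data is no longer linearly separable.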

Module 8: NLP & Text

| Topic | Lesson | What to Demonstrate |
|---|---|---|
| Text Preprocessing | L37 | Tokenization, stopword removal, lemmatization, vocabulary |
| BOW & TF-IDF | L38 | Term frequency analysis, document-term matrix, feature extraction |
| Word Embeddings | L39 | Word2Vec, similarity analysis, embedding visualization |
| Sentiment Analysis | L40 | Dictionary-based or ML-based sentiment scoring |
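
For the BOW & TF-IDF row, a document-term matrix could be built roughly as follows. This sketch assumes scikit-learn's `TfidfVectorizer` with its built-in English stopword list. The headlines are invented for illustration and do not come from any real news source.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical headlines, made up for illustration only.
headlines = [
    "Stocks rally as inflation cools",
    "Bank shares fall after downgrade",
    "Tech stocks rally on strong earnings",
    "Analysts issue downgrade for bank stocks",
]

vectorizer = TfidfVectorizer(lowercase=True, stop_words="english")
tfidf = vectorizer.fit_transform(headlines)  # sparse document-term matrix

vocab = vectorizer.vocabulary_  # maps each term to its column index
first_row = tfidf.toarray()[0]  # TF-IDF weights of the first headline
top_term = max(vocab, key=lambda term: first_row[vocab[term]])
print(tfidf.shape, top_term)
```

Terms that appear in few headlines receive higher weights than terms such as "stocks" that appear in most of them, which is what makes TF-IDF useful for finding distinctive words.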

Peer Review

After final submission, each group reviews another group's work. Rate each criterion 1–5 and provide constructive comments. Copy the template below into your review file.

# Peer Review

**Reviewed Group:** [Group Name]
**Reviewer(s):** [Your Names]
**Date:** [Date]

## 1. Data Quality & Preparation (Score: _/5)
Comments:

## 2. Technical Depth (Score: _/5)
Comments:

## 3. Analysis & Interpretation (Score: _/5)
Comments:

## 4. Code Quality (Score: _/5)
Comments:

## 5. Presentation & Storytelling (Score: _/5)
Comments:

## Overall Impression
[2-3 sentences summarizing strengths and areas for improvement]

## Total Score: _/25

Example Projects

Three sample projects illustrating how to combine EDA with techniques from different modules.

What Drives Cryptocurrency Returns? (Modules: M3, M4, M6, M7, M8)

Dataset: Daily prices for 20 cryptocurrencies from CoinGecko API (free), plus Bitcoin dominance, trading volume, and S&P 500 as benchmark. ~2 years of daily data.

| # | Technique | Module | What They Do |
|---|---|---|---|
| 1 | Descriptive Statistics (L13) | M3 | Summary stats per coin: mean return, volatility, skewness, kurtosis. Compare distributions to normal. |
| 2 | Correlation Analysis (L16) | M3 | Correlation matrix across coins. Identify clusters of co-moving assets. Test significance of correlations. |
| 3 | Linear Regression (L21) | M4 | Regress altcoin returns on Bitcoin + S&P 500. Interpret beta, R-squared. Which altcoins are Bitcoin-independent? |
| 4 | KMeans Clustering (L29) | M6 | Cluster coins by risk/return/volume profiles. Name clusters (e.g., "stablecoins", "high-beta altcoins", "DeFi tokens"). |
| 5 | MLP (L34) | M7 | Train neural network to predict next-day return direction from technical indicators (RSI, MACD, volume). |
| 6 | Sentiment Analysis (L40) | M8 | Scrape crypto news headlines, score sentiment, correlate with returns. Does news predict price movements? |

Quantitative deliverable: Regression table showing beta/alpha for each coin, cluster profiles with radar charts, neural network confusion matrix for direction prediction, sentiment-return correlation time series.
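
The regression in row 3 of this example could be sketched as below. The returns are synthetic with a known Bitcoin beta of 1.5, so the estimate can be checked against the truth. A real project would load actual return series, and might use statsmodels instead of NumPy to get standard errors and p-values.

```python
import numpy as np

# Synthetic daily returns; a real project would load them from its dataset.
rng = np.random.default_rng(7)
n_days = 500
btc = rng.normal(0.0, 0.04, n_days)
sp500 = rng.normal(0.0, 0.01, n_days)
altcoin = 0.001 + 1.5 * btc + 0.3 * sp500 + rng.normal(0.0, 0.02, n_days)

# Design matrix with an intercept column: altcoin = alpha + b1*btc + b2*sp500.
X = np.column_stack([np.ones(n_days), btc, sp500])
coef, *_ = np.linalg.lstsq(X, altcoin, rcond=None)
alpha, beta_btc, beta_sp500 = coef

# R-squared: share of the altcoin's return variance explained by the model.
residuals = altcoin - X @ coef
r_squared = 1.0 - residuals.var() / altcoin.var()
print(f"alpha={alpha:.4f} beta_btc={beta_btc:.2f} "
      f"beta_sp500={beta_sp500:.2f} R2={r_squared:.2f}")
```

A coin with a low R-squared in this regression is one whose returns are largely unexplained by Bitcoin and the S&P 500, which is one way to read "Bitcoin-independent".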

Predicting Loan Defaults (Modules: M3, M4, M5, M6, M7)

Dataset: Lending Club open dataset (Kaggle, ~50k loans) with features like income, debt-to-income, credit score, loan amount, employment length. Binary target: default vs. fully paid.

| # | Technique | Module | What They Do |
|---|---|---|---|
| 1 | Descriptive Statistics (L13) | M3 | Summarize loan features, check for skewed distributions, detect outliers in income/DTI. |
| 2 | Distributions (L14) | M3 | Plot histograms and QQ-plots for loan amount, income. Test normality assumptions. |
| 3 | Factor Models (L24) | M4 | Multi-factor regression analysis of default drivers. Which factors have largest impact? Interpret coefficients. |
| 4 | Logistic Regression (L25) | M5 | Baseline classifier. Interpret odds ratios: "each 1-unit increase in DTI multiplies default odds by 1.3x." |
| 5 | Decision Trees (L26) | M5 | Random Forest for feature importance ranking. Which variables matter most? Compare to logistic baseline. |
| 6 | PCA (L31) | M6 | Reduce 20+ features to 5–7 principal components. Interpret loadings. Does PCA-reduced model match full model? |
| 7 | MLP (L34) | M7 | Neural network classifier. Compare performance to logistic regression and Random Forest. Does complexity help? |

Quantitative deliverable: Model comparison table (Logistic vs. RF vs. MLP, with/without PCA) showing AUC, F1, precision@90% recall. Feature importance analysis. Cost analysis: "rejecting 100 more applicants saves EUR X in defaults but loses EUR Y in interest income."
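
The odds-ratio statement in row 4 is easy to misread as a change in probability. The short sketch below shows the arithmetic: an odds ratio of 1.3 multiplies the odds p / (1 − p), not the probability p itself. The helper name `apply_odds_ratio` and the baseline probabilities are hypothetical.

```python
def apply_odds_ratio(p, odds_ratio):
    """Probability after multiplying the odds p / (1 - p) by odds_ratio."""
    odds = p / (1.0 - p)
    new_odds = odds * odds_ratio
    return new_odds / (1.0 + new_odds)


# Hypothetical baseline default probabilities, chosen only for illustration.
for baseline in (0.05, 0.10, 0.30):
    updated = apply_odds_ratio(baseline, 1.3)
    print(f"{baseline:.2f} -> {updated:.4f}")
```

For instance, a 10% baseline default probability moves to about 12.6%, not to 13%, and the gap between the two readings widens as the baseline probability grows.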

Does Financial News Sentiment Predict Stock Returns? (Modules: M3, M4, M5, M7, M8)

Dataset: 6 months of headlines from a financial news API (e.g., NewsAPI free tier or scraped RSS from Reuters/Bloomberg) for 10 DAX stocks, plus daily stock returns from Yahoo Finance.

| # | Technique | Module | What They Do |
|---|---|---|---|
| 1 | Descriptive Statistics (L13) | M3 | Summary stats of returns per stock. Compare volatility, skewness across companies. |
| 2 | Correlation Analysis (L16) | M3 | Correlate sentiment scores with return series. Are they related? Test significance. |
| 3 | Linear Regression (L21) | M4 | Regress next-day returns on today's sentiment score. Is sentiment a leading indicator? Control for momentum and volume. |
| 4 | Regularization (L22) | M4 | Lasso regression with all TF-IDF features: which words predict returns? Ridge vs. Lasso comparison. Cross-validate lambda. |
| 5 | Logistic Regression (L25) | M5 | Classify positive/negative return days from sentiment features. Is classification easier than regression? |
| 6 | Text Preprocessing (L37) | M8 | Tokenize headlines, remove stopwords, lemmatize. Build vocabulary. Handle financial jargon ("bearish", "rally", "downgrade"). |
| 7 | TF-IDF (L38) | M8 | Convert headlines to TF-IDF vectors. Identify most distinctive words per stock. Visualize term importance. |
| 8 | Sentiment Analysis (L40) | M8 | Score each headline (positive/negative/neutral). Compare dictionary-based (VADER) vs. FinBERT. Aggregate daily sentiment per stock. |
| 9 | MLP (L34) | M7 | Neural network for return prediction from text features. Compare to linear regression. Does non-linearity help? |

Quantitative deliverable: Regression table with sentiment coefficients per stock, Lasso-selected "predictive words" list, classification confusion matrix, rolling 30-day sentiment vs. return scatter with R-squared, neural network vs. linear model performance comparison.
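
Row 3 of this example depends on aligning today's sentiment with tomorrow's return. The sketch below shows one way to do that alignment with pandas. The data is synthetic, built so that sentiment leads returns by one day with a true slope of 0.5, and the column names are hypothetical. It omits the momentum and volume controls mentioned in the table.

```python
import numpy as np
import pandas as pd

# Synthetic daily data in which today's sentiment drives tomorrow's return.
rng = np.random.default_rng(3)
n_days = 300
sentiment = rng.normal(size=n_days)
noise = rng.normal(scale=0.1, size=n_days)
returns = np.empty(n_days)
returns[0] = noise[0]
returns[1:] = 0.5 * sentiment[:-1] + noise[1:]

df = pd.DataFrame(
    {"sentiment": sentiment, "ret": returns},
    index=pd.bdate_range("2024-01-01", periods=n_days),
)

# Align today's sentiment with the NEXT day's return, then drop the last row,
# which has no next-day return. Regressing on same-day returns instead would
# answer a different question.
df["next_ret"] = df["ret"].shift(-1)
aligned = df.dropna(subset=["next_ret"])

slope, intercept = np.polyfit(aligned["sentiment"], aligned["next_ret"], deg=1)
same_day_corr = df["sentiment"].corr(df["ret"])
print(f"slope={slope:.3f} same-day corr={same_day_corr:.3f}")
```

Getting the shift direction wrong silently turns a forecasting test into a look-ahead exercise, so it is worth printing a few aligned rows and checking the dates by hand.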

Recommended Data Sources

  • Yahoo Finance: Stock prices and fundamentals, via the yfinance Python library
  • FRED: Federal Reserve Economic Data (macro indicators, interest rates)
  • CoinGecko API: Cryptocurrency prices, market cap, volume (free tier)
  • Kaggle Datasets: Lending Club, credit card fraud, stock data, and more
  • ECB Statistical Data Warehouse: European economic data, exchange rates, monetary statistics
  • Alpha Vantage: Stock, forex, crypto data with free API key
  • NewsAPI: Financial news headlines (free tier, 100 requests/day)