Semester Group Assignment: Data Science with Python

Assignment Overview

Groups of 3 students work together throughout the semester to apply data science techniques to a finance-related dataset of their choice. Every project must include Exploratory Data Analysis (Module 3) plus 4 technical topics, one from each of four different modules (choose 4 out of Modules 4-8). The project culminates in a live in-class presentation (L45/L46/L48) and a GitHub repository containing all code and slides.

Find and use your own finance dataset (stocks, crypto, credit, macro, ESG, etc.)
Perform EDA using Module 3 techniques (mandatory for every project)
Apply 4 additional techniques, one from each of 4 different modules (M4-M8)
Submit via GitHub: Jupyter notebooks + PowerPoint slides
Present live in class (10-15 minutes per group)
Peer review another group's submission (counts toward your grade)
All deliverables in English

Topic Selection

Every project includes EDA (mandatory) plus 4 topics chosen from 4 different modules. You skip one of Modules 4-8; choose the combination that best fits your research question.

Exploratory Data Analysis (mandatory), Module 3: L13-L20

Every project must begin with a thorough EDA. Use any combination of Module 3 techniques to understand your dataset before applying ML methods.

Technique	Lesson	What to Demonstrate
Descriptive Statistics	L13	Summary stats, distribution shapes, outlier detection, skewness/kurtosis
Distributions	L14	Fit distributions to data, QQ-plots, compare empirical vs. theoretical
Hypothesis Testing	L15	t-tests, normality tests, significance of observed patterns
Correlation Analysis	L16	Correlation matrix, significance testing, spurious correlation awareness
Matplotlib & Seaborn	L17-L18	Publication-quality charts, heatmaps, pairplots, distribution plots
Data Storytelling	L20	Narrative-driven visualizations that communicate findings clearly
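
As a rough illustration of the descriptive statistics, outlier detection, and normality testing rows above, here is a minimal sketch. The data is synthetic (Student-t draws standing in for daily returns), and the variable names are illustrative assumptions, not part of the course material.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic heavy-tailed "daily returns" (assumption: stands in for a real price series).
rng = np.random.default_rng(42)
returns = pd.Series(rng.standard_t(df=3, size=1000) * 0.01, name="daily_return")

# Descriptive statistics: location, spread, and shape of the distribution.
summary = {
    "mean": returns.mean(),
    "std": returns.std(),
    "skewness": stats.skew(returns),
    "excess_kurtosis": stats.kurtosis(returns),  # 0 for a normal distribution
}

# Outlier detection with the 1.5 * IQR rule.
q1, q3 = returns.quantile([0.25, 0.75])
iqr = q3 - q1
outliers = returns[(returns < q1 - 1.5 * iqr) | (returns > q3 + 1.5 * iqr)]

# Normality test (Jarque-Bera): H0 = the returns are normally distributed.
jb_stat, jb_pvalue = stats.jarque_bera(returns)

print(pd.Series(summary).round(4))
print(f"IQR outliers: {len(outliers)} of {len(returns)}")
print(f"Jarque-Bera p-value: {jb_pvalue:.4g}")
```

With real data, replace the synthetic series with returns computed from your own dataset, for example via `prices.pct_change().dropna()`.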

Choose 1 topic from each of 4 different modules:

Module 4: Regression

Topic	Lesson	What to Demonstrate
Linear Regression	L21	OLS regression, coefficient interpretation, R², residual analysis
Regularization	L22	Ridge/Lasso comparison, cross-validated lambda selection
Regression Metrics	L23	MSE, RMSE, MAE, R², cross-validation comparison
Factor Models	L24	Multi-factor regression, Fama-French style analysis
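
The Ridge/Lasso comparison with cross-validated penalty selection could look roughly like the sketch below. It uses synthetic data in which only three of ten features matter. Note that scikit-learn calls the penalty strength `alpha` rather than lambda; the grid of candidate values is an arbitrary assumption.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic regression problem: 3 informative features, 7 pure-noise features.
rng = np.random.default_rng(0)
n_samples, n_features = 300, 10
X = rng.normal(size=(n_samples, n_features))
true_coef = np.array([1.5, -2.0, 0.8] + [0.0] * 7)
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Candidate penalty strengths (sklearn's "alpha" plays the role of lambda).
alphas = np.logspace(-3, 1, 30)

# Standardize first so the penalty treats all features on the same scale.
ridge = make_pipeline(StandardScaler(), RidgeCV(alphas=alphas)).fit(X, y)
lasso = make_pipeline(StandardScaler(), LassoCV(alphas=alphas, cv=5, random_state=0)).fit(X, y)

ridge_model = ridge.named_steps["ridgecv"]
lasso_model = lasso.named_steps["lassocv"]

print("Ridge alpha:", ridge_model.alpha_, "non-zero coefs:", np.count_nonzero(ridge_model.coef_))
print("Lasso alpha:", lasso_model.alpha_, "non-zero coefs:", np.count_nonzero(lasso_model.coef_))
```

Ridge shrinks all coefficients but keeps them non-zero, while Lasso can set uninformative coefficients exactly to zero, which is what makes it usable for feature selection.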

Module 5: Classification

Topic	Lesson	What to Demonstrate
Logistic Regression	L25	Binary classification, odds ratios, probability calibration
Decision Trees	L26	Tree building, Random Forest, feature importance
Classification Metrics	L27	Confusion matrix, ROC-AUC, precision/recall, threshold tuning
Class Imbalance	L28	SMOTE, class weights, stratified CV, PR curves
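
For the class imbalance topic, one possible starting point is to compare an unweighted and a class-weighted logistic regression under stratified cross-validation. The sketch below assumes a synthetic dataset with roughly 5% positives; SMOTE is left out because it requires the separate imbalanced-learn package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score, confusion_matrix, recall_score
from sklearn.model_selection import StratifiedKFold, cross_val_predict

# Synthetic imbalanced data: about 95% negatives, 5% positives (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)

# Stratified folds keep the class ratio roughly constant in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for label, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    model = LogisticRegression(class_weight=class_weight, max_iter=1000)
    # Out-of-fold probabilities for the positive class.
    proba = cross_val_predict(model, X, y, cv=cv, method="predict_proba")[:, 1]
    pred = (proba >= 0.5).astype(int)
    results[label] = {
        "recall": recall_score(y, pred),
        "pr_auc": average_precision_score(y, proba),  # area under the PR curve
        "confusion": confusion_matrix(y, pred),
    }

for label, metrics in results.items():
    print(label, f"recall={metrics['recall']:.2f}", f"PR-AUC={metrics['pr_auc']:.2f}")
```

Class weighting typically raises minority-class recall at the default 0.5 threshold at the cost of more false positives, which is why the confusion matrix and PR curve are more informative here than accuracy.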

Module 6: Unsupervised

Topic	Lesson	What to Demonstrate
KMeans Clustering	L29	Elbow method, silhouette score, cluster interpretation
Hierarchical Clustering	L30	Dendrograms, linkage methods, correlation-based clustering
PCA	L31	Dimensionality reduction, scree plots, loadings interpretation
ML Pipeline	L32	sklearn Pipeline, cross-validation, hyperparameter tuning
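
The elbow method and silhouette score for KMeans can be sketched as follows. The three blob centers are chosen by hand so that the "right" answer is known in advance; real financial data will rarely separate this cleanly.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

# Synthetic data with three well-separated groups (assumption).
X, _ = make_blobs(
    n_samples=300, centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=1.0, random_state=0,
)
X_scaled = StandardScaler().fit_transform(X)

inertias = {}    # within-cluster sum of squares, used for the elbow plot
silhouettes = {}  # mean silhouette score, higher is better
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X_scaled, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)
print("Silhouette by k:", {k: round(s, 3) for k, s in silhouettes.items()})
print("Best k by silhouette:", best_k)
```

Plotting `inertias` against k gives the elbow chart; the silhouette score offers a second, less subjective criterion for the same choice.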

Module 7: Deep Learning

Topic	Lesson	What to Demonstrate
Perceptron	L33	Single-layer neural network, linear separability, convergence
MLP & Activations	L34	Multi-layer network, activation functions, hidden layers
Backpropagation	L35	Gradient descent, learning rate, loss curves
Overfitting Prevention	L36	Dropout, early stopping, regularization, validation curves
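
To make the perceptron row concrete, here is a minimal NumPy sketch of the classic perceptron learning rule on two linearly separable point clouds. The data, learning rate, and epoch limit are illustrative assumptions; the course lesson may present the algorithm differently.

```python
import numpy as np

# Two linearly separable clouds with labels +1 and -1 (synthetic assumption).
rng = np.random.default_rng(1)
X_pos = rng.normal(loc=[2.0, 2.0], scale=0.5, size=(50, 2))
X_neg = rng.normal(loc=[-2.0, -2.0], scale=0.5, size=(50, 2))
X = np.vstack([X_pos, X_neg])
y = np.array([1] * 50 + [-1] * 50)

w = np.zeros(2)  # weights
b = 0.0          # bias
learning_rate = 1.0
errors_per_epoch = []

for epoch in range(100):
    errors = 0
    for xi, yi in zip(X, y):
        # A point is misclassified when the signed score is not positive.
        if yi * (xi @ w + b) <= 0:
            w += learning_rate * yi * xi
            b += learning_rate * yi
            errors += 1
    errors_per_epoch.append(errors)
    if errors == 0:  # converged: a full pass without any update
        break

accuracy = np.mean(np.sign(X @ w + b) == y)
print("Epochs run:", len(errors_per_epoch), "training accuracy:", accuracy)
```

On data that is not linearly separable the loop never reaches zero errors, which is one way to demonstrate the limitation that motivates multi-layer networks.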

Module 8: NLP & Text

Topic	Lesson	What to Demonstrate
Text Preprocessing	L37	Tokenization, stopword removal, lemmatization, vocabulary
BOW & TF-IDF	L38	Term frequency analysis, document-term matrix, feature extraction
Word Embeddings	L39	Word2Vec, similarity analysis, embedding visualization
Sentiment Analysis	L40	Dictionary-based or ML-based sentiment scoring

Peer Review

After final submission, each group reviews another group's work. Rate each criterion 1-5 and provide constructive comments. Copy the template below into your review file.

# Peer Review

**Reviewed Group:** [Group Name]
**Reviewer(s):** [Your Names]
**Date:** [Date]

## 1. Data Quality & Preparation (Score: _/5)
Comments:

## 2. Technical Depth (Score: _/5)
Comments:

## 3. Analysis & Interpretation (Score: _/5)
Comments:

## 4. Code Quality (Score: _/5)
Comments:

## 5. Presentation & Storytelling (Score: _/5)
Comments:

## Overall Impression
[2-3 sentences summarizing strengths and areas for improvement]

## Total Score: _/25

Example Projects

Three sample projects illustrating how to combine EDA with techniques from different modules.

What Drives Cryptocurrency Returns? (Modules: M3, M4, M6, M7, M8)

Dataset: Daily prices for 20 cryptocurrencies from CoinGecko API (free), plus Bitcoin dominance, trading volume, and S&P 500 as benchmark. ~2 years of daily data.

#	Technique	Module	What They Do
1	Descriptive Statistics (L13)	M3	Summary stats per coin: mean return, volatility, skewness, kurtosis. Compare distributions to normal.
2	Correlation Analysis (L16)	M3	Correlation matrix across coins. Identify clusters of co-moving assets. Test significance of correlations.
3	Linear Regression (L21)	M4	Regress altcoin returns on Bitcoin + S&P 500. Interpret beta, R-squared. Which altcoins are Bitcoin-independent?
4	KMeans Clustering (L29)	M6	Cluster coins by risk/return/volume profiles. Name clusters (e.g., "stablecoins", "high-beta altcoins", "DeFi tokens").
5	MLP (L34)	M7	Train neural network to predict next-day return direction from technical indicators (RSI, MACD, volume).
6	Sentiment Analysis (L40)	M8	Scrape crypto news headlines, score sentiment, correlate with returns. Does news predict price movements?

Quantitative deliverable: Regression table showing beta/alpha for each coin, cluster profiles with radar charts, neural network confusion matrix for direction prediction, sentiment-return correlation timeseries.
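
The alpha/beta regression in step 3 of this example could be estimated roughly as follows. All return series here are simulated with known coefficients so the estimate can be checked; the numbers are assumptions, not real market data.

```python
import numpy as np

# Simulated daily returns (assumption): an altcoin driven by Bitcoin and the S&P 500.
rng = np.random.default_rng(7)
n_days = 500
btc = rng.normal(0.0, 0.03, n_days)
sp500 = rng.normal(0.0, 0.01, n_days)
true_alpha, true_beta_btc, true_beta_sp = 0.0005, 1.2, 0.4
altcoin = (true_alpha + true_beta_btc * btc + true_beta_sp * sp500
           + rng.normal(0.0, 0.02, n_days))

# OLS via least squares: the column of ones estimates alpha (the intercept).
X = np.column_stack([np.ones(n_days), btc, sp500])
coef, *_ = np.linalg.lstsq(X, altcoin, rcond=None)
alpha_hat, beta_btc_hat, beta_sp_hat = coef

# R-squared: share of return variance explained by the two factors.
residuals = altcoin - X @ coef
r_squared = 1.0 - residuals.var() / altcoin.var()

print(f"alpha={alpha_hat:.4f}, beta_BTC={beta_btc_hat:.2f}, "
      f"beta_SP500={beta_sp_hat:.2f}, R^2={r_squared:.2f}")
```

A low R-squared and a Bitcoin beta near zero would be the signature of a "Bitcoin-independent" coin in this setup.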

Predicting Loan Defaults (Modules: M3, M4, M5, M6, M7)

Dataset: Lending Club open dataset (Kaggle, ~50k loans) with features like income, debt-to-income, credit score, loan amount, employment length. Binary target: default vs. fully paid.

#	Technique	Module	What They Do
1	Descriptive Statistics (L13)	M3	Summarize loan features, check for skewed distributions, detect outliers in income/DTI.
2	Distributions (L14)	M3	Plot histograms and QQ-plots for loan amount, income. Test normality assumptions.
3	Factor Models (L24)	M4	Multi-factor regression analysis of default drivers. Which factors have largest impact? Interpret coefficients.
4	Logistic Regression (L25)	M5	Baseline classifier. Interpret odds ratios: "each 1-unit increase in DTI multiplies default odds by 1.3."
5	Decision Trees (L26)	M5	Random Forest for feature importance ranking. Which variables matter most? Compare to logistic baseline.
6	PCA (L31)	M6	Reduce 20+ features to 5-7 principal components. Interpret loadings. Does PCA-reduced model match full model?
7	MLP (L34)	M7	Neural network classifier. Compare performance to logistic regression and Random Forest. Does complexity help?

Quantitative deliverable: Model comparison table (Logistic vs. RF vs. MLP, with/without PCA) showing AUC, F1, precision@90% recall. Feature importance analysis. Cost analysis: "rejecting 100 more applicants saves EUR X in defaults but loses EUR Y in interest income."
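
The odds-ratio reading used in step 4 follows from exponentiating a logistic regression coefficient. The sketch below simulates defaults where each unit of DTI multiplies the odds by 1.3 and then recovers that value; the data-generating numbers are assumptions, and a very large `C` is used to make scikit-learn's default regularization negligible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Simulated loans (assumption): log-odds of default rise by log(1.3) per unit of DTI.
rng = np.random.default_rng(3)
n_loans = 20000
dti = rng.normal(0.0, 2.0, n_loans)  # centered debt-to-income measure
log_odds = -2.0 + np.log(1.3) * dti
prob_default = 1.0 / (1.0 + np.exp(-log_odds))
default = rng.binomial(1, prob_default)

# Large C means almost no penalty, so coefficients are close to plain maximum likelihood.
model = LogisticRegression(C=1e6, max_iter=1000).fit(dti.reshape(-1, 1), default)

# exp(coefficient) = multiplicative change in the odds per 1-unit increase.
odds_ratio = float(np.exp(model.coef_[0, 0]))
print(f"Estimated odds ratio per unit of DTI: {odds_ratio:.2f}")
```

An odds ratio multiplies the odds, not the probability, so the change in default probability depends on the starting level.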

Does Financial News Sentiment Predict Stock Returns? (Modules: M3, M4, M5, M7, M8)

Dataset: 6 months of headlines from a financial news API (e.g., NewsAPI free tier or scraped RSS from Reuters/Bloomberg) for 10 DAX stocks, plus daily stock returns from Yahoo Finance.

#	Technique	Module	What They Do
1	Descriptive Statistics (L13)	M3	Summary stats of returns per stock. Compare volatility, skewness across companies.
2	Correlation Analysis (L16)	M3	Correlate sentiment scores with return series. Are they related? Test significance.
3	Linear Regression (L21)	M4	Regress next-day returns on today's sentiment score. Is sentiment a leading indicator? Control for momentum and volume.
4	Regularization (L22)	M4	Lasso regression with all TF-IDF features: which words predict returns? Ridge vs. Lasso comparison. Cross-validate lambda.
5	Logistic Regression (L25)	M5	Classify positive/negative return days from sentiment features. Is classification easier than regression?
6	Text Preprocessing (L37)	M8	Tokenize headlines, remove stopwords, lemmatize. Build vocabulary. Handle financial jargon ("bearish", "rally", "downgrade").
7	TF-IDF (L38)	M8	Convert headlines to TF-IDF vectors. Identify most distinctive words per stock. Visualize term importance.
8	Sentiment Analysis (L40)	M8	Score each headline (positive/negative/neutral). Compare dictionary-based (VADER) vs. FinBERT. Aggregate daily sentiment per stock.
9	MLP (L34)	M7	Neural network for return prediction from text features. Compare to linear regression. Does non-linearity help?

Quantitative deliverable: Regression table with sentiment coefficients per stock, Lasso-selected "predictive words" list, classification confusion matrix, rolling 30-day sentiment vs. return scatter with R-squared, neural network vs. linear model performance comparison.
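
Testing whether sentiment is a leading indicator requires aligning today's sentiment with tomorrow's return. A minimal pandas sketch of that alignment, plus a rolling 30-day correlation, is given below. The series are simulated so that returns react to the previous day's sentiment by construction; the effect size is an assumption.

```python
import numpy as np
import pandas as pd

# Simulated business-day series (assumption): returns depend on yesterday's sentiment.
rng = np.random.default_rng(11)
dates = pd.bdate_range("2024-01-01", periods=250)
sentiment = pd.Series(rng.normal(0.0, 1.0, len(dates)), index=dates, name="sentiment")
noise = rng.normal(0.0, 0.01, len(dates))
returns = (0.005 * sentiment.shift(1).fillna(0.0) + noise).rename("return")

df = pd.concat([sentiment, returns], axis=1)

# shift(-1) pulls tomorrow's return onto today's row, so each row pairs
# information available today with the outcome observed one day later.
df["next_day_return"] = df["return"].shift(-1)
df = df.dropna()  # the last day has no next-day return

same_day_corr = df["sentiment"].corr(df["return"])
lead_corr = df["sentiment"].corr(df["next_day_return"])
rolling_corr = df["sentiment"].rolling(30).corr(df["next_day_return"])

print(f"Same-day correlation: {same_day_corr:.2f}")
print(f"Sentiment vs. next-day return: {lead_corr:.2f}")
print(f"Latest rolling 30-day correlation: {rolling_corr.iloc[-1]:.2f}")
```

Getting the shift direction wrong silently leaks future information into the features, so it is worth checking the alignment on a few rows by hand.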

Recommended Data Sources

Source	Description
Yahoo Finance	Stock prices, fundamentals via yfinance Python library
FRED	Federal Reserve Economic Data: macro indicators, interest rates
CoinGecko API	Cryptocurrency prices, market cap, volume (free tier)
Kaggle Datasets	Lending Club, credit card fraud, stock data, and more
ECB Statistical Data Warehouse	European economic data, exchange rates, monetary statistics
Alpha Vantage	Stock, forex, crypto data with free API key
NewsAPI	Financial news headlines (free tier, 100 requests/day)