Collaborative Data Analysis Project: Finance Focus
Groups of 3 students work together throughout the semester to apply data science techniques to a finance-related dataset of their choice. Every project must include Exploratory Data Analysis (Module 3) plus 4 technical topics, one from each of four different modules (choose 4 out of Modules 4-8). The project culminates in a live in-class presentation (L45/L46/L48) and a GitHub repository containing all code and slides.
Every project includes EDA (mandatory) plus 4 topics chosen from 4 different modules. You skip one of Modules 4-8; choose the combination that best fits your research question.
Every project must begin with a thorough EDA. Use any combination of Module 3 techniques to understand your dataset before applying ML methods.
| Technique | Lesson | What to Demonstrate |
|---|---|---|
| Descriptive Statistics | L13 | Summary stats, distribution shapes, outlier detection, skewness/kurtosis |
| Distributions | L14 | Fit distributions to data, QQ-plots, compare empirical vs. theoretical |
| Hypothesis Testing | L15 | t-tests, normality tests, significance of observed patterns |
| Correlation Analysis | L16 | Correlation matrix, significance testing, spurious correlation awareness |
| Matplotlib & Seaborn | L17-L18 | Publication-quality charts, heatmaps, pairplots, distribution plots |
| Data Storytelling | L20 | Narrative-driven visualizations that communicate findings clearly |
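
As a starting point for the descriptive statistics row above, the following is a minimal sketch of a per-column summary. The synthetic return series, the column names, and the 3-sigma outlier rule are assumptions made for illustration; a real project would load its own dataset and may prefer a different outlier definition.

```python
import numpy as np
import pandas as pd

# Synthetic daily returns stand in for a real dataset (assumption for illustration).
rng = np.random.default_rng(seed=0)
returns = pd.DataFrame({
    "asset_a": rng.normal(0.0005, 0.01, size=500),
    # Student-t draws have fatter tails than the normal distribution.
    "asset_b": rng.standard_t(df=3, size=500) * 0.01,
})


def describe_returns(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column mean, volatility, skewness, excess kurtosis, and outlier count."""
    z_scores = (df - df.mean()) / df.std()
    return pd.DataFrame({
        "mean": df.mean(),
        "std": df.std(),
        "skew": df.skew(),
        "excess_kurtosis": df.kurt(),  # pandas reports excess kurtosis (normal = 0)
        "n_outliers": (z_scores.abs() > 3).sum(),  # simple 3-sigma rule
    })


summary = describe_returns(returns)
print(summary.round(4))
```

Comparing the `excess_kurtosis` column against 0 gives a quick first check of how far each series departs from normal tails before moving on to QQ-plots or formal normality tests.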
Choose 1 topic from each of 4 different modules:
Module 4 (L21-L24):
| Topic | Lesson | What to Demonstrate |
|---|---|---|
| Linear Regression | L21 | OLS regression, coefficient interpretation, R², residual analysis |
| Regularization | L22 | Ridge/Lasso comparison, cross-validated lambda selection |
| Regression Metrics | L23 | MSE, RMSE, MAE, R², cross-validation comparison |
| Factor Models | L24 | Multi-factor regression, Fama-French style analysis |
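
The Ridge/Lasso comparison with a cross-validated penalty could be sketched as follows. This is only an illustration on synthetic data: the feature count, the coefficient values, and the penalty grid are assumptions. Note that scikit-learn calls the penalty strength `alpha` rather than lambda.

```python
import numpy as np
from sklearn.linear_model import LassoCV, RidgeCV
from sklearn.preprocessing import StandardScaler

# Synthetic data: only the first three features drive the target (assumption).
rng = np.random.default_rng(seed=1)
n_samples, n_features = 300, 10
X = rng.normal(size=(n_samples, n_features))
true_coef = np.array([1.5, -2.0, 0.8] + [0.0] * (n_features - 3))
y = X @ true_coef + rng.normal(scale=0.5, size=n_samples)

# Penalized regressions are sensitive to feature scale, so standardize first.
X_scaled = StandardScaler().fit_transform(X)
alphas = np.logspace(-3, 2, 30)  # candidate penalty strengths

ridge = RidgeCV(alphas=alphas, cv=5).fit(X_scaled, y)
lasso = LassoCV(alphas=alphas, cv=5, random_state=0).fit(X_scaled, y)

print("Ridge alpha:", ridge.alpha_)
print("Lasso alpha:", lasso.alpha_)
print("Non-zero Lasso coefficients:", int(np.sum(lasso.coef_ != 0)))
```

Counting the non-zero Lasso coefficients next to the Ridge coefficients is one simple way to show the difference between shrinkage and variable selection.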

Module 5 (L25-L28):
| Topic | Lesson | What to Demonstrate |
|---|---|---|
| Logistic Regression | L25 | Binary classification, odds ratios, probability calibration |
| Decision Trees | L26 | Tree building, Random Forest, feature importance |
| Classification Metrics | L27 | Confusion matrix, ROC-AUC, precision/recall, threshold tuning |
| Class Imbalance | L28 | SMOTE, class weights, stratified CV, PR curves |
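
For the class imbalance row, a minimal sketch of class weights combined with stratified cross-validation might look like this. The data are synthetic with roughly 5% positives (an assumption). SMOTE is left out because it needs the separate `imbalanced-learn` package.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate

# Synthetic imbalanced data: about 95% negatives, 5% positives (assumption).
X, y = make_classification(
    n_samples=2000, n_features=10, n_informative=5,
    weights=[0.95, 0.05], random_state=0,
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

results = {}
for label, class_weight in [("unweighted", None), ("balanced", "balanced")]:
    model = LogisticRegression(class_weight=class_weight, max_iter=1000)
    cv_out = cross_validate(
        model, X, y, cv=cv, scoring=["recall", "average_precision"]
    )
    results[label] = {
        "recall": cv_out["test_recall"].mean(),
        # average_precision summarizes the precision-recall curve
        "average_precision": cv_out["test_average_precision"].mean(),
    }

for label, metrics in results.items():
    print(label, {name: round(value, 3) for name, value in metrics.items()})
```

Reporting recall and average precision side by side avoids the trap of a high accuracy that simply reflects the majority class.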

Module 6 (L29-L32):
| Topic | Lesson | What to Demonstrate |
|---|---|---|
| KMeans Clustering | L29 | Elbow method, silhouette score, cluster interpretation |
| Hierarchical Clustering | L30 | Dendrograms, linkage methods, correlation-based clustering |
| PCA | L31 | Dimensionality reduction, scree plots, loadings interpretation |
| ML Pipeline | L32 | sklearn Pipeline, cross-validation, hyperparameter tuning |
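
The ML Pipeline row combines preprocessing, dimensionality reduction, and a model in a single cross-validated object. A hedged sketch on synthetic data follows; the step names (`scale`, `pca`, `clf`) and the parameter grid are illustrative choices, not course requirements.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (assumption for illustration).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=6, random_state=0
)

# Each step is refit inside every CV fold, which avoids leaking
# scaling or PCA information from the validation part of the fold.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Parameters are addressed as <step name>__<parameter name>.
param_grid = {
    "pca__n_components": [2, 5, 10],
    "clf__C": [0.1, 1.0, 10.0],
}

search = GridSearchCV(pipe, param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```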

Module 7 (L33-L36):
| Topic | Lesson | What to Demonstrate |
|---|---|---|
| Perceptron | L33 | Single-layer neural network, linear separability, convergence |
| MLP & Activations | L34 | Multi-layer network, activation functions, hidden layers |
| Backpropagation | L35 | Gradient descent, learning rate, loss curves |
| Overfitting Prevention | L36 | Dropout, early stopping, regularization, validation curves |
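
Loss curves, early stopping, and L2 regularization can be demonstrated with scikit-learn's `MLPClassifier`, as in the sketch below. The architecture, learning rate, and synthetic data are assumptions. This estimator does not offer dropout, so a project that wants to show dropout would need a deep learning framework instead.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic classification data (assumption for illustration).
X, y = make_classification(
    n_samples=1000, n_features=15, n_informative=8, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)
scaler = StandardScaler().fit(X_train)  # fit on training data only

mlp = MLPClassifier(
    hidden_layer_sizes=(32, 16),
    activation="relu",
    alpha=1e-3,               # L2 penalty on the weights
    learning_rate_init=0.01,
    early_stopping=True,      # holds out part of the training set for validation
    validation_fraction=0.2,
    n_iter_no_change=10,      # patience before stopping
    max_iter=500,
    random_state=0,
)
mlp.fit(scaler.transform(X_train), y_train)

print("epochs run:", mlp.n_iter_)
print("final training loss:", round(mlp.loss_curve_[-1], 4))
print("best validation accuracy:", round(max(mlp.validation_scores_), 3))
print("test accuracy:", round(mlp.score(scaler.transform(X_test), y_test), 3))
```

Plotting `mlp.loss_curve_` and `mlp.validation_scores_` against the epoch number gives the loss and validation curves named in the table.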

Module 8 (L37-L40):
| Topic | Lesson | What to Demonstrate |
|---|---|---|
| Text Preprocessing | L37 | Tokenization, stopword removal, lemmatization, vocabulary |
| BOW & TF-IDF | L38 | Term frequency analysis, document-term matrix, feature extraction |
| Word Embeddings | L39 | Word2Vec, similarity analysis, embedding visualization |
| Sentiment Analysis | L40 | Dictionary-based or ML-based sentiment scoring |
After final submission, each group reviews another group's work. Rate each criterion 1-5 and provide constructive comments. Copy the template below into your review file.
The three sample projects below illustrate how to combine EDA with techniques from different modules.
Dataset: Daily prices for 20 cryptocurrencies from CoinGecko API (free), plus Bitcoin dominance, trading volume, and S&P 500 as benchmark. ~2 years of daily data.
| # | Technique | Module | What They Do |
|---|---|---|---|
| 1 | Descriptive Statistics (L13) | M3 | Summary stats per coin: mean return, volatility, skewness, kurtosis. Compare distributions to normal. |
| 2 | Correlation Analysis (L16) | M3 | Correlation matrix across coins. Identify clusters of co-moving assets. Test significance of correlations. |
| 3 | Linear Regression (L21) | M4 | Regress altcoin returns on Bitcoin + S&P 500. Interpret beta, R-squared. Which altcoins are Bitcoin-independent? |
| 4 | KMeans Clustering (L29) | M6 | Cluster coins by risk/return/volume profiles. Name clusters (e.g., "stablecoins", "high-beta altcoins", "DeFi tokens"). |
| 5 | MLP (L34) | M7 | Train neural network to predict next-day return direction from technical indicators (RSI, MACD, volume). |
| 6 | Sentiment Analysis (L40) | M8 | Scrape crypto news headlines, score sentiment, correlate with returns. Does news predict price movements? |
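
The linear regression step in this sample project (altcoin returns on Bitcoin and S&P 500 returns) can be sketched with plain NumPy. All series below are simulated, and the betas of 1.2 and 0.3 are arbitrary values chosen so the fit has something to recover. A real project would use actual return data and would likely report standard errors as well.

```python
import numpy as np

# Simulated daily returns, roughly two years (assumption for illustration).
rng = np.random.default_rng(seed=42)
n_days = 730
btc = rng.normal(0.001, 0.04, n_days)
sp500 = rng.normal(0.0004, 0.01, n_days)
# Hypothetical altcoin: BTC beta 1.2, S&P 500 beta 0.3, plus idiosyncratic noise.
alt = 0.0002 + 1.2 * btc + 0.3 * sp500 + rng.normal(0.0, 0.02, n_days)

# Design matrix with an intercept column; solve ordinary least squares.
X = np.column_stack([np.ones(n_days), btc, sp500])
coef, *_ = np.linalg.lstsq(X, alt, rcond=None)
intercept, beta_btc, beta_sp500 = coef

# R-squared: share of the altcoin's return variance explained by the two factors.
fitted = X @ coef
ss_res = np.sum((alt - fitted) ** 2)
ss_tot = np.sum((alt - alt.mean()) ** 2)
r_squared = 1 - ss_res / ss_tot

print(f"beta_btc={beta_btc:.3f}, beta_sp500={beta_sp500:.3f}, R^2={r_squared:.3f}")
```

A low R-squared together with a small Bitcoin beta would be the signature of a "Bitcoin-independent" altcoin in the sense of the table above.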
Dataset: Lending Club open dataset (Kaggle, ~50k loans) with features like income, debt-to-income, credit score, loan amount, employment length. Binary target: default vs. fully paid.
| # | Technique | Module | What They Do |
|---|---|---|---|
| 1 | Descriptive Statistics (L13) | M3 | Summarize loan features, check for skewed distributions, detect outliers in income/DTI. |
| 2 | Distributions (L14) | M3 | Plot histograms and QQ-plots for loan amount, income. Test normality assumptions. |
| 3 | Factor Models (L24) | M4 | Multi-factor regression analysis of default drivers. Which factors have largest impact? Interpret coefficients. |
| 4 | Logistic Regression (L25) | M5 | Baseline classifier. Interpret odds ratios: "each 1-unit increase in DTI multiplies the odds of default by 1.3." |
| 5 | Decision Trees (L26) | M5 | Random Forest for feature importance ranking. Which variables matter most? Compare to logistic baseline. |
| 6 | PCA (L31) | M6 | Reduce 20+ features to 5-7 principal components. Interpret loadings. Does PCA-reduced model match full model? |
| 7 | MLP (L34) | M7 | Neural network classifier. Compare performance to logistic regression and Random Forest. Does complexity help? |
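
The odds-ratio interpretation in the logistic regression row is easy to misread as a statement about probabilities. The short sketch below converts an odds ratio of 1.3 into the implied coefficient and shows its effect on a few baseline default probabilities. The baselines are hypothetical values picked for illustration.

```python
import math


def apply_odds_ratio(p_baseline: float, odds_ratio: float) -> float:
    """Probability after multiplying the baseline odds by odds_ratio."""
    odds = p_baseline / (1 - p_baseline)
    new_odds = odds * odds_ratio
    return new_odds / (1 + new_odds)


odds_ratio = 1.3
# A logistic regression coefficient is the natural log of the odds ratio.
coefficient = math.log(odds_ratio)
print(f"implied coefficient: {coefficient:.4f}")

# Hypothetical baseline default probabilities.
for p in (0.05, 0.10, 0.30):
    p_new = apply_odds_ratio(p, odds_ratio)
    print(f"baseline {p:.2f} -> {p_new:.4f} after a 1-unit increase")
```

The same odds ratio moves the probability by different amounts depending on the baseline, which is why odds ratios and probability changes should be reported separately.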
Dataset: 6 months of headlines from a financial news API (e.g., NewsAPI free tier or scraped RSS from Reuters/Bloomberg) for 10 DAX stocks, plus daily stock returns from Yahoo Finance.
| # | Technique | Module | What They Do |
|---|---|---|---|
| 1 | Descriptive Statistics (L13) | M3 | Summary stats of returns per stock. Compare volatility, skewness across companies. |
| 2 | Correlation Analysis (L16) | M3 | Correlate sentiment scores with return series. Are they related? Test significance. |
| 3 | Linear Regression (L21) | M4 | Regress next-day returns on today's sentiment score. Is sentiment a leading indicator? Control for momentum and volume. |
| 4 | Regularization (L22) | M4 | Lasso regression with all TF-IDF features: which words predict returns? Ridge vs. Lasso comparison. Cross-validate lambda. |
| 5 | Logistic Regression (L25) | M5 | Classify positive/negative return days from sentiment features. Is classification easier than regression? |
| 6 | Text Preprocessing (L37) | M8 | Tokenize headlines, remove stopwords, lemmatize. Build vocabulary. Handle financial jargon ("bearish", "rally", "downgrade"). |
| 7 | TF-IDF (L38) | M8 | Convert headlines to TF-IDF vectors. Identify most distinctive words per stock. Visualize term importance. |
| 8 | Sentiment Analysis (L40) | M8 | Score each headline (positive/negative/neutral). Compare dictionary-based (VADER) vs. FinBERT. Aggregate daily sentiment per stock. |
| 9 | MLP (L34) | M7 | Neural network for return prediction from text features. Compare to linear regression. Does non-linearity help? |
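
Regressing next-day returns on today's sentiment depends on aligning the two series without look-ahead. One way to set this up with pandas is sketched below. The six rows are toy numbers invented for illustration, and the column names are assumptions.

```python
import pandas as pd

# Toy values invented for illustration (not real data).
dates = pd.date_range("2024-01-01", periods=6, freq="B")  # business days
df = pd.DataFrame(
    {
        "sentiment": [0.2, -0.1, 0.4, 0.0, -0.3, 0.1],
        "ret": [0.010, -0.005, 0.002, 0.008, -0.012, 0.004],
    },
    index=dates,
)

# Target: the return on the NEXT trading day, placed on today's row.
# shift(-1) pulls tomorrow's value up by one row.
df["next_day_ret"] = df["ret"].shift(-1)

# The last day has no next-day return, so it is dropped.
model_df = df.dropna()

# Features known at the end of day t: sentiment and today's return (momentum control).
features = model_df[["sentiment", "ret"]]
target = model_df["next_day_ret"]

print(model_df)
print("sentiment vs next-day return correlation:",
      round(features["sentiment"].corr(target), 3))
```

Shifting the target rather than the features keeps every row's predictors restricted to information available on that day.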
yfinance Python library