Collaborative Data Analysis Project — Finance Focus
Groups of three students work together throughout the semester to apply data science techniques to a finance-related dataset of their choice. Every project must include Exploratory Data Analysis (Module 3) plus four technical topics — one from each of four different modules, chosen from Modules 4–8. That means exactly one of Modules 4–8 is skipped; choose the combination that best fits your research question. The project culminates in a live in-class presentation (L45/L46/L48) and a GitHub repository containing all code and slides.
Every project must begin with a thorough EDA. Use any combination of Module 3 techniques to understand your dataset before applying ML methods.
| Technique | Lesson | What to Demonstrate |
|---|---|---|
| Descriptive Statistics | L13 | Summary stats, distribution shapes, outlier detection, skewness/kurtosis |
| Distributions | L14 | Fit distributions to data, QQ-plots, compare empirical vs. theoretical |
| Hypothesis Testing | L15 | t-tests, normality tests, significance of observed patterns |
| Correlation Analysis | L16 | Correlation matrix, significance testing, spurious correlation awareness |
| Matplotlib & Seaborn | L17–L18 | Publication-quality charts, heatmaps, pairplots, distribution plots |
| Data Storytelling | L20 | Narrative-driven visualizations that communicate findings clearly |
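The descriptive-statistics, outlier-detection, and normality-testing steps above can be sketched as follows. This is a minimal illustration using synthetic returns (the asset names and parameters are placeholders, not real data); pandas and SciPy are assumed to be installed.

```python
import numpy as np
import pandas as pd
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic daily returns standing in for your own dataset
returns = pd.DataFrame({
    "BTC": rng.normal(0.001, 0.04, 500),
    "ETH": rng.normal(0.001, 0.05, 500),
    "SPX": rng.normal(0.0004, 0.01, 500),
})

summary = returns.describe().T            # mean, std, quartiles per asset
summary["skew"] = returns.skew()          # asymmetry of each distribution
summary["kurtosis"] = returns.kurtosis()  # excess kurtosis (fat tails)

# Simple outlier detection: flag observations beyond 3 standard deviations
z = (returns - returns.mean()) / returns.std()
outliers = (z.abs() > 3).sum()

# Normality check per asset (ties into Hypothesis Testing, L15)
pvals = returns.apply(lambda col: stats.shapiro(col).pvalue)
```

In your project, replace the synthetic frame with your loaded dataset and pair the numbers with the Module 3 visualizations (histograms, QQ-plots, heatmaps).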
Choose 1 topic from each of 4 different modules:
Module 4 (Lessons L21–L24):
| Topic | Lesson | What to Demonstrate |
|---|---|---|
| Linear Regression | L21 | OLS regression, coefficient interpretation, R², residual analysis |
| Regularization | L22 | Ridge/Lasso comparison, cross-validated lambda selection |
| Regression Metrics | L23 | MSE, RMSE, MAE, R², cross-validation comparison |
| Factor Models | L24 | Multi-factor regression, Fama–French style analysis |
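A minimal OLS sketch for the Module 4 deliverables (coefficient interpretation, R², residual analysis). The data is synthetic — a single market factor with a known beta of 0.8 — so you can see what the fitted numbers should recover; scikit-learn is assumed.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
# Synthetic example: asset returns driven by a market factor plus noise
market = rng.normal(0, 0.02, 300)
asset = 0.8 * market + rng.normal(0, 0.01, 300)  # true beta = 0.8

X = market.reshape(-1, 1)
model = LinearRegression().fit(X, asset)

beta = model.coef_[0]       # sensitivity to the market factor
alpha = model.intercept_    # return not explained by the factor
r2 = model.score(X, asset)  # share of variance explained

# Residuals should look like unstructured noise; plot them to check
residuals = asset - model.predict(X)
```

The same pattern extends to Factor Models (L24): stack several factors as columns of `X` and interpret one coefficient per factor.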
Module 5 (Lessons L25–L28):
| Topic | Lesson | What to Demonstrate |
|---|---|---|
| Logistic Regression | L25 | Binary classification, odds ratios, probability calibration |
| Decision Trees | L26 | Tree building, Random Forest, feature importance |
| Classification Metrics | L27 | Confusion matrix, ROC-AUC, precision/recall, threshold tuning |
| Class Imbalance | L28 | SMOTE, class weights, stratified CV, PR curves |
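The Module 5 topics fit together naturally: a logistic baseline, its odds ratios, and the classification metrics computed on a held-out set. A minimal sketch on a synthetic imbalanced dataset (roughly 20% positives, standing in for something like loan defaults), scikit-learn assumed:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a default/fully-paid dataset, ~20% positive class
X, y = make_classification(n_samples=1000, n_features=8,
                           weights=[0.8, 0.2], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
proba = clf.predict_proba(X_te)[:, 1]  # predicted probability of the positive class

cm = confusion_matrix(y_te, clf.predict(X_te))  # 2x2: TN/FP over FN/TP
auc = roc_auc_score(y_te, proba)                # threshold-free ranking quality

# Odds ratios: exp(coef) = multiplicative change in odds per unit feature increase
odds_ratios = np.exp(clf.coef_[0])
```

For the Class Imbalance topic (L28), swap in `class_weight="balanced"` or SMOTE and compare precision/recall curves before and after.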
Module 6 (Lessons L29–L32):
| Topic | Lesson | What to Demonstrate |
|---|---|---|
| KMeans Clustering | L29 | Elbow method, silhouette score, cluster interpretation |
| Hierarchical Clustering | L30 | Dendrograms, linkage methods, correlation-based clustering |
| PCA | L31 | Dimensionality reduction, scree plots, loadings interpretation |
| ML Pipeline | L32 | sklearn Pipeline, cross-validation, hyperparameter tuning |
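For KMeans (L29), the elbow method and silhouette score both sweep over candidate values of k. A minimal sketch on synthetic per-asset features (two planted groups, so the right answer is known); scikit-learn assumed. Note the scaling step — distance-based clustering is misleading on unscaled features.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(7)
# Hypothetical per-asset features: mean return, volatility, average volume
features = np.vstack([
    rng.normal([0.0, 0.01, 1.0], 0.1, (20, 3)),  # a "low-vol" group
    rng.normal([0.5, 0.90, 3.0], 0.1, (20, 3)),  # a "high-beta" group
])
X = StandardScaler().fit_transform(features)  # scale before clustering

# Sweep k and score each clustering; plot inertia for the elbow chart
scores = {}
for k in range(2, 7):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    scores[k] = silhouette_score(X, km.labels_)

best_k = max(scores, key=scores.get)  # k with the highest silhouette
```

On real data the silhouette peak is rarely this clean; report the full curve, not just the winner, and interpret what each cluster contains.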
Module 7 (Lessons L33–L36):
| Topic | Lesson | What to Demonstrate |
|---|---|---|
| Perceptron | L33 | Single-layer neural network, linear separability, convergence |
| MLP & Activations | L34 | Multi-layer network, activation functions, hidden layers |
| Backpropagation | L35 | Gradient descent, learning rate, loss curves |
| Overfitting Prevention | L36 | Dropout, early stopping, regularization, validation curves |
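The MLP and overfitting-prevention topics can be combined in one short sketch: a two-hidden-layer network with ReLU activations and early stopping on a validation split. Synthetic data and scikit-learn's `MLPClassifier` are assumed (your project may use another framework).

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=800, n_features=10, random_state=2)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=2)

mlp = MLPClassifier(
    hidden_layer_sizes=(32, 16),  # two hidden layers
    activation="relu",
    early_stopping=True,          # stop when validation score plateaus
    validation_fraction=0.2,
    max_iter=500,
    random_state=2,
)
mlp.fit(X_tr, y_tr)

acc = mlp.score(X_te, y_te)
loss_history = mlp.loss_curve_  # plot this as your training loss curve
```

Plotting `loss_curve_` (and `validation_scores_` when early stopping is on) covers the "loss curves" and "validation curves" deliverables directly.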
Module 8 (Lessons L37–L40):
| Topic | Lesson | What to Demonstrate |
|---|---|---|
| Text Preprocessing | L37 | Tokenization, stopword removal, lemmatization, vocabulary |
| BOW & TF-IDF | L38 | Term frequency analysis, document-term matrix, feature extraction |
| Word Embeddings | L39 | Word2Vec, similarity analysis, embedding visualization |
| Sentiment Analysis | L40 | Dictionary-based or ML-based sentiment scoring |
After final submission, each group reviews another group's work. Rate each criterion 1–5 and provide constructive comments. Copy the template below into your review file.
The following three sample projects illustrate how to combine EDA with techniques from different modules.
Sample Project 1
Dataset: Daily prices for 20 cryptocurrencies from the CoinGecko API (free), plus Bitcoin dominance, trading volume, and the S&P 500 as a benchmark. ~2 years of daily data.
| # | Technique | Module | What They Do |
|---|---|---|---|
| 1 | Descriptive Statistics (L13) | M3 | Summary stats per coin: mean return, volatility, skewness, kurtosis. Compare distributions to normal. |
| 2 | Correlation Analysis (L16) | M3 | Correlation matrix across coins. Identify clusters of co-moving assets. Test significance of correlations. |
| 3 | Linear Regression (L21) | M4 | Regress altcoin returns on Bitcoin + S&P 500. Interpret beta, R-squared. Which altcoins are Bitcoin-independent? |
| 4 | KMeans Clustering (L29) | M6 | Cluster coins by risk/return/volume profiles. Name clusters (e.g., "stablecoins", "high-beta altcoins", "DeFi tokens"). |
| 5 | MLP (L34) | M7 | Train neural network to predict next-day return direction from technical indicators (RSI, MACD, volume). |
| 6 | Sentiment Analysis (L40) | M8 | Scrape crypto news headlines, score sentiment, correlate with returns. Does news predict price movements? |
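Step 2 of this project (correlation with significance testing) is the kind of check worth sketching, because a correlation coefficient alone says nothing about significance. A minimal illustration with synthetic returns (one coin constructed to co-move with Bitcoin, one independent); SciPy assumed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(5)
btc = rng.normal(0, 0.04, 400)
eth = 0.7 * btc + rng.normal(0, 0.02, 400)  # constructed to co-move with BTC
doge = rng.normal(0, 0.06, 400)             # constructed to be independent

# Pearson correlation plus the p-value of the no-correlation null
r_eth, p_eth = stats.pearsonr(btc, eth)
r_doge, p_doge = stats.pearsonr(btc, doge)
```

With 20 coins you are running ~190 pairwise tests, so mention multiple-testing (e.g. a Bonferroni adjustment) when you interpret which correlations are "significant".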
Sample Project 2
Dataset: Lending Club open dataset (Kaggle, ~50k loans) with features like income, debt-to-income, credit score, loan amount, employment length. Binary target: default vs. fully paid.
| # | Technique | Module | What They Do |
|---|---|---|---|
| 1 | Descriptive Statistics (L13) | M3 | Summarize loan features, check for skewed distributions, detect outliers in income/DTI. |
| 2 | Distributions (L14) | M3 | Plot histograms and QQ-plots for loan amount, income. Test normality assumptions. |
| 3 | Factor Models (L24) | M4 | Multi-factor regression analysis of default drivers. Which factors have largest impact? Interpret coefficients. |
| 4 | Logistic Regression (L25) | M5 | Baseline classifier. Interpret odds ratios: "each 1-unit increase in DTI multiplies default odds by 1.3x." |
| 5 | Decision Trees (L26) | M5 | Random Forest for feature importance ranking. Which variables matter most? Compare to logistic baseline. |
| 6 | PCA (L31) | M6 | Reduce 20+ features to 5–7 principal components. Interpret loadings. Does PCA-reduced model match full model? |
| 7 | MLP (L34) | M7 | Neural network classifier. Compare performance to logistic regression and Random Forest. Does complexity help? |
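Step 6 of this project (PCA on correlated borrower features) can be sketched as below. The features are synthetic stand-ins with deliberate correlation (loan amount tracks income, credit score tracks DTI) so the first components have something to find; scikit-learn assumed. Standardizing first matters, since PCA is scale-sensitive.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
# Synthetic correlated borrower features (placeholders for the real columns)
income = rng.normal(60, 15, 500)
loan_amt = 0.3 * income + rng.normal(0, 3, 500)   # tracks income
dti = rng.normal(20, 5, 500)
credit = -0.5 * dti + rng.normal(700, 20, 500)    # inversely tracks DTI
X = StandardScaler().fit_transform(
    np.column_stack([income, loan_amt, dti, credit]))

pca = PCA(n_components=3).fit(X)
explained = pca.explained_variance_ratio_  # bar-plot this as the scree chart
loadings = pca.components_                 # rows = components, cols = features
scores = pca.transform(X)                  # reduced data for downstream models
```

To answer "does the PCA-reduced model match the full model?", refit the classifier on `scores` and compare its metrics to the full-feature version.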
Sample Project 3
Dataset: 6 months of headlines from a financial news API (e.g., NewsAPI free tier or scraped RSS from Reuters/Bloomberg) for 10 DAX stocks, plus daily stock returns from Yahoo Finance.
| # | Technique | Module | What They Do |
|---|---|---|---|
| 1 | Descriptive Statistics (L13) | M3 | Summary stats of returns per stock. Compare volatility, skewness across companies. |
| 2 | Correlation Analysis (L16) | M3 | Correlate sentiment scores with return series. Are they related? Test significance. |
| 3 | Linear Regression (L21) | M4 | Regress next-day returns on today's sentiment score. Is sentiment a leading indicator? Control for momentum and volume. |
| 4 | Regularization (L22) | M4 | Lasso regression with all TF-IDF features — which words predict returns? Ridge vs. Lasso comparison. Cross-validate lambda. |
| 5 | Logistic Regression (L25) | M5 | Classify positive/negative return days from sentiment features. Is classification easier than regression? |
| 6 | Text Preprocessing (L37) | M8 | Tokenize headlines, remove stopwords, lemmatize. Build vocabulary. Handle financial jargon ("bearish", "rally", "downgrade"). |
| 7 | TF-IDF (L38) | M8 | Convert headlines to TF-IDF vectors. Identify most distinctive words per stock. Visualize term importance. |
| 8 | Sentiment Analysis (L40) | M8 | Score each headline (positive/negative/neutral). Compare dictionary-based (VADER) vs. FinBERT. Aggregate daily sentiment per stock. |
| 9 | MLP (L34) | M7 | Neural network for return prediction from text features. Compare to linear regression. Does non-linearity help? |
Data source: the yfinance Python library (daily stock returns from Yahoo Finance).