| | |
|---|---|
| Course | Methods and Algorithms — MSc Data Science |
| Weight | 60% of final grade |
| Group Size | 2–3 students |
Each group selects a real-world finance or business problem, sources an appropriate dataset, and builds a complete ML pipeline applying five of the six course topics. The choice of topics determines a difficulty multiplier applied to the technical-analysis score. The project simulates an end-to-end data science workflow: problem formulation, data acquisition, exploratory analysis, model development, evaluation, and presentation.
Topic Difficulty Points
| Topic | Difficulty Points |
|---|---|
| L01: Linear Regression | 1 |
| L02: Logistic Regression | 1 |
| L03: KNN & K-Means | 2 |
| L04: Random Forests & Boosting | 2 |
| L05: PCA & t-SNE | 3 |
| L06: Embeddings & Reinforcement Learning | 4 |
| Total Possible | 13 points |
Difficulty Multiplier
| Omitted Topic | Remaining Points | Multiplier | Max Technical Score |
|---|---|---|---|
| L01 or L02 (1 pt) | 12 | 1.00 | 50 / 50 |
| L03 or L04 (2 pts) | 11 | 0.96 | 48 / 50 |
| L05 (3 pts) | 10 | 0.92 | 46 / 50 |
| L06 (4 pts) | 9 | 0.88 | 44 / 50 |
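The multipliers above follow a simple pattern: each difficulty point of the omitted topic beyond the first costs 0.04. A minimal sketch reproducing the table, with hypothetical helper names:

```python
# Hypothetical helpers reproducing the multiplier table:
# omitting a topic worth p points yields a multiplier of 1.00 - 0.04 * (p - 1).
def technical_multiplier(omitted_points: int) -> float:
    """Multiplier applied to the 50-point technical-analysis score."""
    if omitted_points not in (1, 2, 3, 4):
        raise ValueError("omitted topic must be worth 1-4 difficulty points")
    return round(1.00 - 0.04 * (omitted_points - 1), 2)

def max_technical_score(omitted_points: int) -> float:
    """Maximum achievable technical score out of 50."""
    return 50 * technical_multiplier(omitted_points)
```

For example, omitting L05 (3 points) caps the technical score at 50 × 0.92 = 46.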
Group Formation
- Size: 2–3 students
- Selection: Self-selected with instructor approval by Session 2
- Diversity Encouraged: Mix of backgrounds (finance, CS, statistics) is beneficial
- Two-Person Bonus: Groups of 2 receive 3 bonus points on final score
- Individual Accountability: During Q&A, any member can be asked about any part
- If unable to form a group by Session 2, notify instructor
3a. GitHub Repository
Required structure:
```
group-project/
├── README.md
├── requirements.txt
├── data/
│   └── README.md        (data dictionary, source, license)
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_modeling.ipynb
│   └── 04_evaluation.ipynb
├── presentation/
│   └── slides.pptx
└── report/
    └── report.pdf
```
Requirements:
- Minimum 10 meaningful commits distributed across all members
- All members must have at least 2 commits
- Reproducible: set random seeds, include requirements.txt
- README.md: project title, group members, problem statement, instructions to reproduce results
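A minimal reproducibility setup, seeding every random number generator the pipeline touches (extend analogously if you use a framework such as PyTorch or TensorFlow):

```python
import random

import numpy as np

SEED = 42

random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy's global RNG

# scikit-learn estimators and splitters also accept an explicit random_state,
# which is more robust than relying on the global NumPy seed, e.g.:
#   train_test_split(X, y, test_size=0.2, random_state=SEED)
#   RandomForestClassifier(random_state=SEED)
```

Set the seeds once at the top of each notebook so that re-running it end to end reproduces the same splits and model results.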
3b. Written Report (10–15 pages)
| Section | Length | Content |
|---|---|---|
| Executive Summary | 0.5 pg | Problem, approach, key findings |
| Problem Definition & Data | 1.5 pg | Business context, research question, dataset, EDA |
| Methodology | 4–5 pg | One subsection per method: theory, implementation, hyperparameters |
| Results & Comparison | 3–4 pg | Quantitative results, model comparison, interpretation |
| Business Insights | 1 pg | Actionable recommendations for stakeholders |
| Limitations & Reflection | 1 pg | What didn't work, assumptions, lessons learned |
Formatting: 11pt font, 1.15 line spacing, captioned figures/tables, APA or IEEE references.
3c. Presentation (15 min + 5 min Q&A)
- 15 slides maximum (excluding title and references)
- All members present roughly equal amounts
- Suggested structure: Title → Problem & data → ML pipeline diagram → Results per method → Comparison → Business insights → Limitations
3d. Peer Review
- Each group reviews ONE other group's repository (assigned by instructor)
- Rate criteria on 1–5 scale with written justification
- Constructive, specific, actionable feedback
- Due 7 days after presentation day
- Submitted via the peer review form
Dataset Requirements
- Real-world data (not synthetic or toy datasets)
- Minimum size: 1,000 observations and 10 features
- Appropriate target variable(s) for the problem
- Documented: source and license in data/README.md
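A quick sanity check that a candidate dataset meets the minimum size requirement (the `meets_minimum` helper is hypothetical, and assumes a single target column passed by name):

```python
import pandas as pd

def meets_minimum(df: pd.DataFrame, target: str,
                  min_rows: int = 1000, min_features: int = 10) -> bool:
    """True if df has the target column, >= min_rows observations,
    and >= min_features feature columns (target excluded)."""
    n_features = df.shape[1] - 1  # exclude the target column
    return (
        target in df.columns
        and len(df) >= min_rows
        and n_features >= min_features
    )
```

Running this during EDA, before investing in modeling, avoids discovering too late that a dataset falls short of the requirement.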
Suggested Sources
Kaggle, UCI ML Repository, Yahoo Finance, FRED, ECB Statistical Data Warehouse, World Bank Open Data, SimFin, Financial PhraseBank
Approval Process
Submit 1-paragraph description + link by Session 2. Instructor approves or suggests alternatives.
Topic Credit Guidelines
| Topic | For Full Credit (5/5) | Acceptable Partial (3–4/5) |
|---|---|---|
| L03: KNN + K-Means | Both applied (KNN for classification/regression, K-Means for clustering) | One with strong analysis |
| L04: RF + Boosting | At least one ensemble (RF/Bagging) + one boosting method (XGBoost/LightGBM) compared | Only RF or only boosting |
| L05: PCA + t-SNE | PCA for dimensionality reduction + t-SNE for visualization | Only PCA applied |
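As an illustration of the L04 full-credit bar (one ensemble compared against one boosting method), a minimal sketch: it uses scikit-learn's built-in GradientBoostingClassifier as a stand-in for XGBoost/LightGBM, and a synthetic dataset purely for demonstration (the project itself requires real data):

```python
# Compare a bagging ensemble (Random Forest) against a boosting method.
# GradientBoostingClassifier stands in for XGBoost/LightGBM; the synthetic
# dataset is for illustration only -- the project requires real-world data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

For full credit the comparison should go beyond accuracy numbers: discuss hyperparameters tried and why one family outperforms the other on your data.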