1. ML Pipeline Challenge
Course: Methods and Algorithms, MSc Data Science
Weight: 60% of final grade
Group Size: 2–3 students

Each group selects a real-world finance or business problem, sources an appropriate dataset, and builds a complete ML pipeline that applies 5 of the 6 course topics. The omitted topic determines a difficulty multiplier that is applied to the technical analysis score. The project simulates an end-to-end data science workflow: problem formulation, data acquisition, exploratory analysis, model development, evaluation, and presentation.

Topic Difficulty Points

Topic | Difficulty Points
L01: Linear Regression | 1
L02: Logistic Regression | 1
L03: KNN & K-Means | 2
L04: Random Forests & Boosting | 2
L05: PCA & t-SNE | 3
L06: Embeddings & Reinforcement Learning | 4
Total Possible | 13 points

Difficulty Multiplier

Omitted Topic | Remaining Points | Multiplier | Max Technical Score
L01 or L02 (1 pt) | 12 | 1.00 | 50 / 50
L03 or L04 (2 pts) | 11 | 0.96 | 48 / 50
L05 (3 pts) | 10 | 0.92 | 46 / 50
L06 (4 pts) | 9 | 0.88 | 44 / 50
Every combination can still earn an A-range grade.
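
The multiplier arithmetic can be sketched in a few lines of Python. This is a minimal illustration, not an official grading script: the point values and multipliers are transcribed from the two tables above, while the function names, the rounding to two decimals, and the assumption that the multiplier is applied by straight multiplication to a raw score out of 50 are illustrative choices.

```python
# Values transcribed from the tables above; all names are illustrative.
TOPIC_POINTS = {"L01": 1, "L02": 1, "L03": 2, "L04": 2, "L05": 3, "L06": 4}
MULTIPLIER_BY_REMAINING = {12: 1.00, 11: 0.96, 10: 0.92, 9: 0.88}
MAX_TECHNICAL_SCORE = 50


def multiplier_for(omitted: str) -> float:
    """Look up the multiplier for the single topic a group leaves out."""
    remaining = sum(TOPIC_POINTS.values()) - TOPIC_POINTS[omitted]
    return MULTIPLIER_BY_REMAINING[remaining]


def adjusted_technical_score(raw_score: float, omitted: str) -> float:
    """Scale a raw technical score (0 to 50) by the difficulty multiplier.

    Assumption: the multiplier is applied by plain multiplication and the
    result is rounded to two decimals.
    """
    if not 0 <= raw_score <= MAX_TECHNICAL_SCORE:
        raise ValueError("raw_score must be between 0 and 50")
    return round(raw_score * multiplier_for(omitted), 2)


if __name__ == "__main__":
    for topic in TOPIC_POINTS:
        print(topic, multiplier_for(topic), adjusted_technical_score(50, topic))
```

Under these assumptions, a group that omits L06 and earns a perfect raw score ends up with 44 / 50, which matches the last row of the table.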
2. Group Formation
3. Deliverables

3a. GitHub Repository

Required structure:

group-project/
├── README.md
├── requirements.txt
├── data/
│   └── README.md       (data dictionary, source, license)
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_modeling.ipynb
│   └── 04_evaluation.ipynb
├── presentation/
│   └── slides.pptx
└── report/
    └── report.pdf
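
Before submitting, a group could check its repository against the required layout with a short script. The sketch below is one possible approach, not a course-provided tool: the file list is copied from the tree above, and the names `REQUIRED_PATHS` and `missing_paths` are hypothetical.

```python
from pathlib import Path

# Paths copied from the required structure above, relative to the repo root.
REQUIRED_PATHS = [
    "README.md",
    "requirements.txt",
    "data/README.md",
    "notebooks/01_eda.ipynb",
    "notebooks/02_preprocessing.ipynb",
    "notebooks/03_modeling.ipynb",
    "notebooks/04_evaluation.ipynb",
    "presentation/slides.pptx",
    "report/report.pdf",
]


def missing_paths(repo_root):
    """Return the required files that do not exist under repo_root."""
    root = Path(repo_root)
    return [rel for rel in REQUIRED_PATHS if not (root / rel).is_file()]
```

Calling `missing_paths("group-project")` returns an empty list when every required file is present, and otherwise lists what still needs to be added. The check only confirms that the files exist; it says nothing about their content.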

Requirements:

3b. Written Report (10–15 pages)

Section | Length | Content
Executive Summary | 0.5 pg | Problem, approach, key findings
Problem Definition & Data | 1.5 pg | Business context, research question, dataset, EDA
Methodology | 4–5 pg | One subsection per method: theory, implementation, hyperparameters
Results & Comparison | 3–4 pg | Quantitative results, model comparison, interpretation
Business Insights | 1 pg | Actionable recommendations for stakeholders
Limitations & Reflection | 1 pg | What didn't work, assumptions, lessons learned

Formatting: 11pt font, 1.15 line spacing, captioned figures/tables, APA or IEEE references.

3c. Presentation (15 min + 5 min Q&A)

3d. Peer Review

4. Dataset Requirements

Suggested Sources

Kaggle, UCI ML Repository, Yahoo Finance, FRED, ECB Statistical Data Warehouse, World Bank Open Data, SimFin, Financial PhraseBank

Approval Process

Submit a one-paragraph description of the dataset, with a link, by Session 2. The instructor approves it or suggests alternatives.

5. Combined Topic Requirements

Topic | For Full Credit (5/5) | Acceptable Partial (3–4/5)
L03: KNN + K-Means | Both applied (KNN for classification/regression, K-Means for clustering) | One with strong analysis
L04: RF + Boosting | At least one ensemble (RF/Bagging) + one boosting method (XGBoost/LightGBM) compared | Only RF or only boosting
L05: PCA + t-SNE | PCA for dimensionality reduction + t-SNE for visualization | Only PCA applied
Exception: If your dataset genuinely doesn't support one technique, explain why in your report. This demonstrates analytical thinking and will not be penalized.
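
As a rough illustration of what "both applied" could look like for the three combined topics, here is a minimal scikit-learn sketch. It rests on several assumptions: the data is synthetic (`make_classification`) rather than a real finance dataset, scikit-learn's `GradientBoostingClassifier` stands in for XGBoost/LightGBM to avoid extra dependencies, and every hyperparameter is an arbitrary placeholder, not a recommended setting.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.manifold import TSNE
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a real dataset (assumption: binary target, 10 numeric features).
X, y = make_classification(
    n_samples=300, n_features=10, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)
X_scaled = StandardScaler().fit_transform(X)

# L03: KNN for classification (scaled inside a pipeline), K-Means for clustering.
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)
cluster_labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X_scaled)

# L04: one bagging-style ensemble and one boosting method, compared on the same split.
models = {
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}
accuracy = {"knn": knn.score(X_test, y_test)}
for name, model in models.items():
    accuracy[name] = model.fit(X_train, y_train).score(X_test, y_test)

# L05: PCA for dimensionality reduction, t-SNE for a 2-D view intended for plotting.
X_pca = PCA(n_components=5).fit_transform(X_scaled)
X_tsne = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(X_scaled)

print(accuracy)
print(X_pca.shape, X_tsne.shape)
```

A real project would go further than this sketch: tune hyperparameters with cross-validation, justify the number of clusters and components, and report metrics suited to the business problem instead of plain accuracy.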