| | |
|---|---|
| Course | Methods and Algorithms — MSc Data Science |
| Weight | 60% of final grade |
| Group Size | 2–3 students |
Each group selects a real-world finance or business problem, sources an appropriate dataset, and builds a complete ML pipeline applying five of the six course topics. The choice of topics determines a difficulty multiplier applied to the technical-analysis score. The project simulates an end-to-end data science workflow: problem formulation, data acquisition, exploratory analysis, model development, evaluation, and presentation.
Topic Difficulty Points
| Topic | Difficulty Points |
|---|---|
| L01: Linear Regression | 1 |
| L02: Logistic Regression | 1 |
| L03: KNN & K-Means | 2 |
| L04: Random Forests & Boosting | 2 |
| L05: PCA & t-SNE | 3 |
| L06: Embeddings & Reinforcement Learning | 4 |
| Total Possible | 13 points |
Difficulty Multiplier
| Omitted Topic | Remaining Points | Multiplier | Max Technical Score |
|---|---|---|---|
| L01 or L02 (1 pt) | 12 | 1.00 | 50 / 50 |
| L03 or L04 (2 pts) | 11 | 0.96 | 48 / 50 |
| L05 (3 pts) | 10 | 0.92 | 46 / 50 |
| L06 (4 pts) | 9 | 0.88 | 44 / 50 |
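The multipliers above follow a simple pattern: each difficulty point of the omitted topic beyond the first costs 0.04. A minimal sketch reproducing the table, with hypothetical helper names:

```python
# Hypothetical helpers reproducing the multiplier table:
# omitting a topic worth p points yields a multiplier of 1.00 - 0.04 * (p - 1).
def technical_multiplier(omitted_points: int) -> float:
    """Multiplier applied to the 50-point technical-analysis score."""
    if omitted_points not in (1, 2, 3, 4):
        raise ValueError("omitted topic must be worth 1-4 difficulty points")
    return round(1.00 - 0.04 * (omitted_points - 1), 2)

def max_technical_score(omitted_points: int) -> float:
    """Maximum achievable technical score out of 50."""
    return 50 * technical_multiplier(omitted_points)
```

For example, omitting L05 (3 points) caps the technical score at 50 × 0.92 = 46.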
Group Formation
- Size: 2–3 students
- Selection: Self-selected with instructor approval by Session 2
- Diversity Encouraged: Mix of backgrounds (finance, CS, statistics) is beneficial
- Two-Person Bonus: Groups of 2 receive 3 bonus points on final score
- Individual Accountability: During Q&A, any member can be asked about any part
- If unable to form a group by Session 2, notify instructor
3a. GitHub Repository
Required structure:
```
group-project/
├── README.md
├── requirements.txt
├── data/
│   └── README.md        (data dictionary, source, license)
├── notebooks/
│   ├── 01_eda.ipynb
│   ├── 02_preprocessing.ipynb
│   ├── 03_modeling.ipynb
│   └── 04_evaluation.ipynb
├── presentation/
│   └── slides.pptx
└── report/
    └── report.pdf
```
Requirements:
- Minimum 10 meaningful commits distributed across all members
- All members must have at least 2 commits
- Reproducible: set random seeds, include requirements.txt
- README.md: project title, group members, problem statement, instructions to reproduce results
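A minimal reproducibility setup, seeding every random number generator the pipeline touches (extend analogously if you use a framework such as PyTorch or TensorFlow):

```python
import random

import numpy as np

SEED = 42

random.seed(SEED)     # Python's built-in RNG
np.random.seed(SEED)  # NumPy's global RNG

# scikit-learn estimators and splitters also accept an explicit random_state,
# which is more robust than relying on the global NumPy seed, e.g.:
#   train_test_split(X, y, test_size=0.2, random_state=SEED)
#   RandomForestClassifier(random_state=SEED)
```

Set the seeds once at the top of each notebook so that re-running it end to end reproduces the same splits and model results.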
3b. Written Report (10–15 pages)
| Section | Length | Content |
|---|---|---|
| Executive Summary | 0.5 pg | Problem, approach, key findings |
| Problem Definition & Data | 1.5 pg | Business context, research question, dataset, EDA |
| Methodology | 4–5 pg | One subsection per method: theory, implementation, hyperparameters |
| Results & Comparison | 3–4 pg | Quantitative results, model comparison, interpretation |
| Business Insights | 1 pg | Actionable recommendations for stakeholders |
| Limitations & Reflection | 1 pg | What didn't work, assumptions, lessons learned |
Formatting: 11pt font, 1.15 line spacing, captioned figures/tables, APA or IEEE references.
3c. Presentation (15 min + 5 min Q&A)
- 15 slides maximum (excluding title and references)
- All members present roughly equal amounts
- Suggested structure: Title → Problem & data → ML pipeline diagram → Results per method → Comparison → Business insights → Limitations
3d. Peer Review
- Each group reviews ONE other group's repository (assigned by instructor)
- Rate criteria on 1–5 scale with written justification
- Constructive, specific, actionable feedback
- Due 7 days after presentation day
- Submitted via the peer review form
Dataset Requirements
- Real-world data (not synthetic or toy datasets)
- Minimum size: 1,000 observations and 10 features
- Appropriate target variable(s) for the problem
- Documented: source and license in data/README.md
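A quick sanity check that a candidate dataset meets the minimum size requirement (the `meets_minimum` helper is hypothetical, and assumes a single target column passed by name):

```python
import pandas as pd

def meets_minimum(df: pd.DataFrame, target: str,
                  min_rows: int = 1000, min_features: int = 10) -> bool:
    """True if df has the target column, >= min_rows observations,
    and >= min_features feature columns (target excluded)."""
    n_features = df.shape[1] - 1  # exclude the target column
    return (
        target in df.columns
        and len(df) >= min_rows
        and n_features >= min_features
    )
```

Running this during EDA, before investing in modeling, avoids discovering too late that a dataset falls short of the requirement.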
Suggested Sources
Kaggle, UCI ML Repository, Yahoo Finance, FRED, ECB Statistical Data Warehouse, World Bank Open Data, SimFin, Financial PhraseBank
Approval Process
Submit 1-paragraph description + link by Session 2. Instructor approves or suggests alternatives.
Topic Credit Guidelines
| Topic | For Full Credit (5/5) | Acceptable Partial (3–4/5) |
|---|---|---|
| L03: KNN + K-Means | Both applied (KNN for classification/regression, K-Means for clustering) | One with strong analysis |
| L04: RF + Boosting | At least one ensemble (RF/Bagging) + one boosting method (XGBoost/LightGBM) compared | Only RF or only boosting |
| L05: PCA + t-SNE | PCA for dimensionality reduction + t-SNE for visualization | Only PCA applied |
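As an illustration of the L04 full-credit bar (one ensemble compared against one boosting method), a minimal sketch: it uses scikit-learn's built-in GradientBoostingClassifier as a stand-in for XGBoost/LightGBM, and a synthetic dataset purely for demonstration (the project itself requires real data):

```python
# Compare a bagging ensemble (Random Forest) against a boosting method.
# GradientBoostingClassifier stands in for XGBoost/LightGBM; the synthetic
# dataset is for illustration only -- the project requires real-world data.
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

models = {
    "Random Forest": RandomForestClassifier(n_estimators=200, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_test, model.predict(X_test))
    print(f"{name}: test accuracy = {scores[name]:.3f}")
```

For full credit the comparison should go beyond accuracy numbers: discuss hyperparameters tried and why one family outperforms the other on your data.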