
Natural-Language-Processing-Details

Task-oriented NLP educational materials for undergraduate courses



Information

| Property | Value |
| Language | Jupyter Notebook |
| Stars | 0 |
| Forks | 0 |
| Watchers | 0 |
| Open Issues | 0 |
| License | No License |
| Created | 2025-11-29 |
| Last Updated | 2026-02-19 |
| Last Push | 2025-12-19 |
| Contributors | 1 |
| Default Branch | master |
| Visibility | private |

Notebooks

This repository contains 40 notebook(s):

| Notebook | Language | Type |

| nlp_basics_homework | PYTHON | jupyter |

| nlp_basics_solutions | PYTHON | jupyter |

| statistical_analysis | PYTHON | jupyter |

| zipf_law_analysis | PYTHON | jupyter |

| zipf_law_analysis | PYTHON | jupyter |

| classification_basic | PYTHON | jupyter |

| classification_basic_solutions | PYTHON | jupyter |

| clustering_basic | PYTHON | jupyter |

| clustering_basic_solutions | PYTHON | jupyter |

| ner_basic | PYTHON | jupyter |

| ner_basic_solutions | PYTHON | jupyter |

| semantic_search_basic | PYTHON | jupyter |

| semantic_search_basic_solutions | PYTHON | jupyter |

| sentiment_basic | PYTHON | jupyter |

| sentiment_basic_solutions | PYTHON | jupyter |

| text_generation_basic | PYTHON | jupyter |

| text_generation_basic_solutions | PYTHON | jupyter |

| classification_intermediate | PYTHON | jupyter |

| classification_intermediate_solutions | PYTHON | jupyter |

| clustering_intermediate | PYTHON | jupyter |

| clustering_intermediate_solutions | PYTHON | jupyter |

| ner_intermediate | PYTHON | jupyter |

| ner_intermediate_solutions | PYTHON | jupyter |

| semantic_search_intermediate | PYTHON | jupyter |

| semantic_search_intermediate_solutions | PYTHON | jupyter |

| sentiment_intermediate | PYTHON | jupyter |

| sentiment_intermediate_solutions | PYTHON | jupyter |

| text_generation_intermediate | PYTHON | jupyter |

| text_generation_intermediate_solutions | PYTHON | jupyter |

| cnn_text_classification | PYTHON | jupyter |

| feedforward_from_scratch | PYTHON | jupyter |

| rnn_sentiment_example | PYTHON | jupyter |

| transformer_attention_demo | PYTHON | jupyter |

| classification_tutorial | PYTHON | jupyter |

| ner_analysis | PYTHON | jupyter |

| embedding_analysis | PYTHON | jupyter |

| sentiment_analysis | PYTHON | jupyter |

| summarization_analysis | PYTHON | jupyter |

| compare_models | PYTHON | jupyter |

| ngram_analysis | PYTHON | jupyter |

Datasets

This repository includes 67 dataset(s):

| Dataset | Format | Size |

| data | | 0.0 KB |

| processed | | 0.0 KB |

| embeddings | | 0.0 KB |

| embeddings_metadata.json | .json | 736.77 KB |

| headlines_embeddings.npy | .npy | 15000.12 KB |

| models | | 0.0 KB |

| classification | | 0.0 KB |

| sentiment | | 0.0 KB |

| text_generation | | 0.0 KB |

| samples | | 0.0 KB |

| sample_20251001_190000_1.txt | .txt | 0.22 KB |

| sample_20251001_190000_2.txt | .txt | 0.23 KB |

| sample_20251001_190000_3.txt | .txt | 0.22 KB |

| sample_20251001_190033_1.txt | .txt | 1.45 KB |

| sample_20251001_190033_2.txt | .txt | 1.47 KB |

| sample_20251001_190033_3.txt | .txt | 1.46 KB |

| visualizations | | 0.0 KB |

| clustering_comparison.png | .png | 842.78 KB |

| pca_visualization.png | .png | 584.71 KB |

| similarity_distribution.png | .png | 86.02 KB |

| tsne_visualization.png | .png | 294.86 KB |

| raw | | 0.0 KB |

| articles | | 0.0 KB |

| news_articles_dataset.csv | .csv | 545.38 KB |

| test.csv | .csv | 81.69 KB |

| train.csv | .csv | 381.88 KB |

| val.csv | .csv | 81.97 KB |

| basic | | 0.0 KB |

| VERIFICATION_REPORT.md | .md | 4.77 KB |

| generate_headlines.py | .py | 10.62 KB |

| news_headlines_dataset.csv | .csv | 27.88 KB |

| nlp_basics_homework.ipynb | .ipynb | 14.71 KB |

| nlp_basics_solutions.ipynb | .ipynb | 547.96 KB |

| extended | | 0.0 KB |

| news_headlines_extended.csv | .csv | 724.04 KB |

| statistical_analysis.ipynb | .ipynb | 571.07 KB |

| test.csv | .csv | 108.68 KB |

| train.csv | .csv | 507.07 KB |

| val.csv | .csv | 108.38 KB |

| zipf_law_analysis.ipynb | .ipynb | 352.02 KB |

| extended_ner | | 0.0 KB |

| README.md | .md | 8.53 KB |

| extended_sentiment | | 0.0 KB |

| README.md | .md | 5.94 KB |

| news_headlines_extended_sentiment.csv | .csv | 857.62 KB |

| test.csv | .csv | 128.66 KB |

| train.csv | .csv | 600.29 KB |

| val.csv | .csv | 128.81 KB |

| large | | 0.0 KB |

| news_headlines_large.csv | .csv | 1824.02 KB |

| zipf_law_analysis.ipynb | .ipynb | 346.81 KB |

| results | | 0.0 KB |

| classification | | 0.0 KB |

| decision_tree_predictions.json | .json | 168.8 KB |

| detailed_results.json | .json | 2.94 KB |

| logistic_regression_predictions.json | .json | 168.49 KB |

| model_comparison.csv | .csv | 0.99 KB |

| naive_bayes_predictions.json | .json | 167.88 KB |

| neural_network_predictions.json | .json | 168.53 KB |

| random_forest_predictions.json | .json | 168.51 KB |

| svm_predictions.json | .json | 168.53 KB |

| sentiment | | 0.0 KB |

| model_comparison.csv | .csv | 0.46 KB |

| datasets | | 0.0 KB |

| index.html | .html | 8.46 KB |

| js | | 0.0 KB |

| dataset-browser.js | .js | 7.94 KB |

Reproducibility

This repository includes reproducibility tools:

  • Python requirements.txt

Status

  • Issues: Enabled
  • Wiki: Disabled
  • Pages: Disabled

README

Natural Language Processing - Educational Materials

  • Repository: https://git.fhgr.ch/digital-finance/Natural-Language-Processing-Details
  • Organization: Digital Finance @ FHGR
  • Purpose: Task-oriented NLP course materials from classification to generation
  • GitLab Pages: https://osterrijoerg.git.fhgr.ch/digital-finance/Natural-Language-Processing-Details

Quick Start

Prerequisites

  • Python 3.8+
  • LaTeX distribution (MiKTeX, TeX Live, or MacTeX)
  • Required packages:
    pip install pandas numpy matplotlib seaborn scikit-learn sentence-transformers torch jupyter
    

Task-Based Organization

This repository is organized by NLP tasks (what you want to accomplish) rather than methodologies (how you accomplish it):

tasks/                              # NLP Task Implementations
├── classification/                 # Text Classification
│   └── 6 models: Logistic, Naive Bayes, Decision Tree, Random Forest, SVM, Neural Net
├── text_generation/                # Text Generation
│   └── N-gram & Neural language models
├── clustering/                     # Document Organization
│   └── K-Means, Hierarchical, DBSCAN + PCA, t-SNE, UMAP
├── semantic_search/                # Similarity Search
│   └── Embedding-based semantic search
├── summarization/                  # Text Summarization
│   └── Extractive (TextRank) & Abstractive (BART)
├── sentiment_analysis/             # Sentiment Classification
│   └── 4 models: Logistic, SVM, LSTM, BERT
└── ner/                            # Named Entity Recognition
    └── BERT-based token classification with BIO tagging

foundations/                        # Foundational Knowledge
└── neural_networks/                # 7 Architectures (not a task, but building blocks)
    ├── Perceptron → Transformer → Autoencoder
    └── notebooks/                  # 4 implementation notebooks

exercises/                          # Practice Exercises
├── basic/                          # 14 notebooks (7 tasks × 2)
└── intermediate/                   # 14 notebooks (7 tasks × 2)

data/                               # All Datasets & Artifacts
├── raw/                            # Original CSV datasets
├── processed/                      # Generated embeddings, models, visualizations
└── results/                        # Model outputs and predictions

Why Task-Based?

Traditional (Method-Based):

  • "Learn supervised learning"
  • Abstract, methodology-focused
  • "I studied embeddings"

Task-Based (This Repo):

  • "Build a text classifier"
  • Concrete, goal-focused
  • "I built a semantic search system"

Students learn WHAT to solve (tasks) before HOW to solve it (methods).

The Seven Core NLP Tasks

1. Classification (tasks/classification/)

  • What: Assign categories to text
  • Example: "President announces policy" → Politics
  • Models: 6 algorithms (85-95% accuracy)
  • Use Cases: Spam filtering, sentiment analysis, topic categorization

Quick Start:

cd tasks/classification
python train_models.py                    # Train all 6 models (~3 min)
cd presentation && python generate_charts.py   # Generate 26 charts
pdflatex 20251006_1256_supervised_tutorial.tex # Compile 39-slide presentation

Output: Trained models in data/processed/models/classification/*.pkl
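
To make the idea of "assign categories to text" concrete, here is a minimal sketch of one of the six model families (Naive Bayes), written with the standard library only. It is not the code in train_models.py; the function names `train_naive_bayes` and `predict` and the four toy headlines are assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict

def train_naive_bayes(texts, labels):
    """Count words per class for a multinomial Naive Bayes model."""
    word_counts = defaultdict(Counter)   # label -> word -> count
    class_counts = Counter(labels)       # label -> number of documents
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def predict(text, model):
    """Return the label with the highest log-probability (add-one smoothing)."""
    word_counts, class_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        score = math.log(n_docs / total_docs)            # class prior
        total_words = sum(word_counts[label].values())
        for token in text.lower().split():
            if token not in vocab:
                continue                                 # skip words never seen in training
            count = word_counts[label][token]
            score += math.log((count + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Toy data, invented for illustration only
texts = ["president announces new policy", "senate votes on policy bill",
         "team wins championship game", "striker scores in final game"]
labels = ["politics", "politics", "sports", "sports"]
model = train_naive_bayes(texts, labels)
print(predict("president signs bill", model))  # politics
```

Whitespace tokenization keeps the sketch short; a real pipeline would normally use a proper tokenizer and a held-out test split before quoting an accuracy figure.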

2. Text Generation (tasks/text_generation/)

  • What: Generate coherent, human-like text
  • Example: "The president" → "announced a new economic policy"
  • Models: 5-gram statistical + LSTM neural
  • Use Cases: Content creation, chatbots, code completion

Quick Start:

cd tasks/text_generation
python train_5gram.py                     # Train 5-gram model (~30 sec)
python generate_half_page.py              # Generate 200-word samples

Output: Model in data/processed/models/text_generation/5gram_extended.pkl
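
The statistical side of this task can be sketched in a few lines: an n-gram model stores, for every context of n-1 words, how often each next word followed it, and generation samples from those counts. The sketch below is an assumption-laden illustration, not the contents of ngram_model.py or train_5gram.py; `train_ngram`, `generate`, and the toy corpus are hypothetical names and data, and n must be at least 2.

```python
import random
from collections import Counter, defaultdict

def train_ngram(sentences, n=5):
    """Map each (n-1)-word context to a Counter of observed next words."""
    model = defaultdict(Counter)
    for sentence in sentences:
        # Pad so the first real word also has a full context
        tokens = ["<s>"] * (n - 1) + sentence.lower().split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            model[context][tokens[i + n - 1]] += 1
    return model

def generate(model, n=5, max_words=20, seed=0):
    """Sample words one at a time until </s> or max_words is reached."""
    rng = random.Random(seed)
    context = ("<s>",) * (n - 1)
    words = []
    for _ in range(max_words):
        candidates = model.get(context)
        if not candidates:
            break                         # unseen context: stop early
        next_word = rng.choices(list(candidates),
                                weights=list(candidates.values()))[0]
        if next_word == "</s>":
            break
        words.append(next_word)
        context = context[1:] + (next_word,)
    return " ".join(words)

# Toy corpus, invented for illustration; n=3 because the corpus is tiny
corpus = ["the president announced a new economic policy",
          "the president announced a new trade deal"]
trigram_model = train_ngram(corpus, n=3)
print(generate(trigram_model, n=3))
```

A larger n gives more fluent output but needs far more data, because most long contexts are never seen in training.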

3. Clustering (tasks/clustering/)

  • What: Group similar documents without labels
  • Example: Automatically discover that sports headlines cluster together
  • Algorithms: K-Means, Hierarchical, DBSCAN + PCA, t-SNE, UMAP
  • Use Cases: Topic discovery, document organization, exploratory analysis

Quick Start:

cd tasks/clustering/presentation
python generate_charts.py                 # Generate 24 charts (~3 min)
pdflatex 20251003_2206_unsupervised_tutorial.tex  # Compile 33-slide presentation

Input: Pre-computed embeddings from semantic_search task
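
As a rough picture of what K-Means does with those embeddings, the sketch below alternates between assigning each point to its nearest centroid and moving each centroid to the mean of its members. It uses 2-D toy points instead of real embeddings, and `kmeans` and `squared_distance` are illustrative names, not functions from this repository.

```python
import random

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iterations=20, seed=0):
    """Plain K-Means: returns one cluster label per point and the centroids."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)     # start from k randomly chosen data points
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point
        labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                  for p in points]
        # Update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:                   # keep the old centroid if a cluster is empty
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return labels, centroids

# Two well-separated toy groups, invented for illustration
points = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.2),
          (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
labels, centroids = kmeans(points, k=2)
print(labels)
```

The cluster numbers themselves are arbitrary; only which points share a label is meaningful.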

4. Semantic Search (tasks/semantic_search/)

  • What: Find similar documents by meaning (not keywords)
  • Example: "president policy" finds "leader regulation" (different words, same meaning)
  • Model: Sentence-transformers (384-D embeddings)
  • Use Cases: Document retrieval, Q&A, recommendations, duplicate detection

Quick Start:

cd tasks/semantic_search
python generate_embeddings.py             # Generate 10K embeddings (~2 min)
python semantic_search.py --interactive   # Interactive search demo

Output: data/processed/embeddings/headlines_embeddings.npy (15 MB)
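
Once every headline is a vector, search reduces to ranking documents by cosine similarity to the query vector. The sketch below shows that ranking on hand-made 3-D vectors; real embeddings would come from the sentence-transformers model, and `cosine_similarity`, `search`, and `float32_matrix_bytes` are hypothetical helper names. The last helper checks the 15 MB figure under the assumption that the vectors are stored as 4-byte floats.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def search(query_vec, doc_vecs, top_k=2):
    """Return (similarity, document index) pairs, best match first."""
    scored = [(cosine_similarity(query_vec, vec), idx)
              for idx, vec in enumerate(doc_vecs)]
    return sorted(scored, reverse=True)[:top_k]

def float32_matrix_bytes(rows, cols):
    """Raw size of a rows x cols matrix, assuming 4 bytes per value."""
    return rows * cols * 4

# Toy 3-D "embeddings", invented for illustration
doc_vecs = [(1.0, 0.0, 0.0), (0.9, 0.1, 0.0), (0.0, 0.0, 1.0)]
query_vec = (1.0, 0.0, 0.1)
print(search(query_vec, doc_vecs))
print(float32_matrix_bytes(10_000, 384))  # 15360000 bytes, roughly 15 MB
```

Cosine similarity ignores vector length and compares direction only, which is why it is a common choice for sentence embeddings.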

5. Summarization (tasks/summarization/)

  • What: Generate concise summaries of longer texts
  • Example: 70-word article → 7-word headline
  • Models: TextRank (extractive) + BART (abstractive)
  • Use Cases: News aggregation, document digests, meeting notes

Quick Start:

cd tasks/summarization
python extractive_summary.py              # TextRank (~2-3 min)
python train_abstractive.py               # BART fine-tuning (30-60 min GPU)
cd presentation && python generate_charts.py
pdflatex 20251123_1604_summarization_tutorial.tex

Output: Summaries with ROUGE metrics, trained BART model
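
For readers new to ROUGE, the simplest variant (ROUGE-1) counts overlapping words between a generated summary and a reference. The following is a minimal sketch of that unigram computation, assuming whitespace tokenization and no stemming; the repository's scripts may rely on a dedicated ROUGE package instead, and `rouge_1` is an illustrative name.

```python
from collections import Counter

def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())          # clipped word matches
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return precision, recall, f1

# Example sentences invented for illustration
precision, recall, f1 = rouge_1("president announces policy",
                                "president announces new economic policy")
print(precision, recall, f1)
```

Here all three candidate words appear in the five-word reference, so precision is 3/3 and recall is 3/5.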

6. Sentiment Analysis (tasks/sentiment_analysis/)

  • What: Determine emotional tone (positive, negative, neutral)
  • Example: "The president announced a groundbreaking policy" → Positive
  • Models: Logistic Regression, SVM, LSTM, BERT (4 models)
  • Use Cases: Brand monitoring, customer feedback, market sentiment

Quick Start:

cd tasks/sentiment_analysis
python create_sentiment_labels.py         # Label 10K headlines
python train_sentiment_models.py          # Train 4 models (~5 min)
cd presentation && python generate_all_charts.py
pdflatex 20251128_1150_sentiment_tutorial.tex

Output: Trained models, BERT achieves 61.7% accuracy
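
An accuracy figure for a three-class problem is easier to judge next to a simple baseline, such as always predicting the most frequent class. The sketch below computes both numbers on made-up labels; it says nothing about the actual class distribution in this repository's sentiment dataset, and `accuracy` and `majority_baseline` are illustrative names.

```python
from collections import Counter

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def majority_baseline(y_true):
    """Accuracy obtained by always predicting the most frequent label."""
    most_common_count = Counter(y_true).most_common(1)[0][1]
    return most_common_count / len(y_true)

# Made-up labels, for illustration only
y_true = ["positive", "neutral", "negative", "neutral", "positive", "neutral"]
y_pred = ["positive", "neutral", "neutral", "neutral", "negative", "neutral"]
print(accuracy(y_true, y_pred))
print(majority_baseline(y_true))
```

A model is only adding value to the extent that its accuracy exceeds the baseline on the same test split.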

7. Named Entity Recognition (tasks/ner/)

  • What: Extract entities (people, places, organizations) from text
  • Example: "President Biden announced policy in Washington" → PERSON: Biden, GPE: Washington
  • Model: BERT fine-tuned for token classification with BIO tagging
  • Use Cases: Information extraction, knowledge graphs, document indexing

Quick Start:

cd tasks/ner
python annotate_entities.py               # Annotate 10K headlines (~3 min)
python train_ner_model.py                 # Fine-tune BERT (~15 min GPU)
cd presentation && python generate_all_charts.py
pdflatex 20251128_1202_ner_tutorial.tex

Output: NER model with ~0.92 F1, entity-tagged dataset
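
BIO tagging labels each token as the Beginning of an entity, Inside one, or Outside any entity. A token classifier outputs one tag per token, and a small decoding step turns the tags back into entity spans. The sketch below shows that decoding step on the example sentence from above; `bio_to_entities` is a hypothetical helper, not a function from train_ner_model.py.

```python
def bio_to_entities(tokens, tags):
    """Group B-/I- tagged tokens into (entity_type, text) pairs."""
    entities = []
    current_tokens, current_type = [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current_tokens:                      # close the previous entity
                entities.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [token], tag[2:]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)            # continue the open entity
        else:
            # "O", or an I- tag that does not continue the open entity
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_tokens, current_type = [], None
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities

tokens = "President Biden announced policy in Washington".split()
tags = ["O", "B-PERSON", "O", "O", "O", "B-GPE"]
print(bio_to_entities(tokens, tags))  # [('PERSON', 'Biden'), ('GPE', 'Washington')]
```

In this sketch a stray I- tag without a matching B- tag is treated as outside any entity; other decoders choose to start a new entity there instead.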

Foundational Knowledge

Neural Networks (foundations/neural_networks/)

Why "foundations"? Neural networks aren't an NLP task - they're the building blocks for solving tasks.

7 Architectures Covered:

  1. Perceptron - Linear classification
  2. Feedforward - Universal approximation
  3. CNN - Spatial patterns
  4. RNN - Short-term sequence memory
  5. LSTM - Long-term sequence memory
  6. Transformer - Attention mechanism
  7. Autoencoder - Representation learning
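
The first architecture on the list is small enough to write out in full. The sketch below trains a perceptron with the classic update rule (add the error times the input to the weights) on the logical AND function, which is linearly separable. The names `train_perceptron` and `perceptron_predict` are illustrative and do not refer to code in the notebooks.

```python
def perceptron_predict(weights, bias, x):
    """Output 1 if the weighted sum is positive, otherwise 0."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else 0

def train_perceptron(samples, labels, epochs=20, lr=1.0):
    """Perceptron learning rule: nudge the weights whenever a prediction is wrong."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            error = y - perceptron_predict(weights, bias, x)   # -1, 0 or +1
            weights = [w + lr * error * xi for w, xi in zip(weights, x)]
            bias += lr * error
    return weights, bias

# Logical AND: only (1, 1) maps to 1
samples = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 0, 0, 1]
weights, bias = train_perceptron(samples, targets)
print([perceptron_predict(weights, bias, x) for x in samples])  # [0, 0, 0, 1]
```

The same loop never settles on XOR, which is not linearly separable; that limitation is what motivates the multi-layer feedforward network next on the list.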

3 Presentation Variants:

  • Advanced/Comprehensive (89 slides) - All 7 architectures
  • Feedforward Standard (58 slides) - Iris classification example
  • Feedforward UAT (50 slides) - Universal approximation theorem proof

Complete Directory Structure

NLP_Data/
├── tasks/                          # MAIN: NLP Task Implementations
│   ├── classification/
│   │   ├── train_models.py
│   │   ├── README.md               # Comprehensive task guide
│   │   └── presentation/           # 39-slide tutorial, 26 charts
│   ├── text_generation/
│   │   ├── ngram_model.py
│   │   ├── train_5gram.py
│   │   ├── neural_lm.py
│   │   ├── ngram_analysis.ipynb
│   │   └── README.md
│   ├── clustering/
│   │   ├── README.md
│   │   └── presentation/           # 33-slide tutorial, 24 charts
│   └── semantic_search/
│       ├── generate_embeddings.py
│       ├── semantic_search.py
│       ├── embedding_analysis.ipynb
│       ├── README.md
│       └── presentation/           # 28-slide tutorial, 18 charts
├── foundations/                    # Foundational Techniques
│   └── neural_networks/
│       ├── README.md               # When to use which architecture
│       └── presentation/
│           ├── advanced_comprehensive/     # 89 slides
│           ├── feedforward_standard/       # 58 slides
│           └── feedforward_uat/            # 50 slides
├── data/                           # All Datasets & Artifacts
│   ├── raw/                        # Original CSV datasets
│   │   ├── basic/                  # 400 headlines
│   │   ├── extended/               # 10,000 headlines + splits
│   │   ├── articles/               # 1,000 articles + splits
│   │   └── large/                  # Experimental
│   ├── processed/                  # Generated artifacts
│   │   ├── embeddings/             # 10,000 × 384 vectors (15 MB)
│   │   ├── models/
│   │   │   ├── classification/     # 6 trained classifiers (15 MB)
│   │   │   └── text_generation/    # 5-gram model (902 KB)
│   │   ├── visualizations/         # PNG preview images
│   │   └── samples/                # Generated text
│   └── results/                    # Model outputs
│       └── classification/         # Predictions, metrics
├── generators/                     # Dataset Generation
│   ├── generate_headlines.py
│   ├── generate_extended_headlines.py
│   ├── generate_articles.py
│   └── create_splits.py
├── docs/                           # Documentation
│   ├── DATASET_OVERVIEW.md
│   ├── ZIPF_LAW_ANALYSIS.md
│   ├── EDUCATIONAL_PRESENTATION_FRAMEWORK.md
│   └── presentations/              # Presentation docs
├── wiki_pages/                     # 13 Comprehensive Wiki Pages
│   ├── home.md
│   ├── Setup-and-Prerequisites.md
│   ├── Task-Classification.md      # NEW: 1,748 words
│   ├── Task-Text-Generation.md     # NEW: 1,737 words
│   ├── Task-Clustering.md          # NEW: 1,985 words
│   ├── Task-Semantic-Search.md     # NEW: 1,849 words
│   ├── Foundations-Neural-Networks.md  # NEW: 1,840 words
│   ├── Dataset-Overview.md         # NEW: 1,448 words
│   ├── Repository-Structure.md     # NEW: 1,585 words
│   ├── Reproducibility-Guide.md    # NEW: 1,766 words
│   ├── FAQ.md                      # NEW: 1,869 words
│   └── Contributing.md             # NEW: 1,708 words
├── public_source/                  # GitLab Pages Source (NEW)
│   ├── index.html                  # Landing page
│   ├── css/style.css               # Purple/gray theme
│   ├── slides/index.html           # All presentations
│   ├── datasets/index.html         # Interactive browser
│   └── docs/index.html             # Documentation hub
├── scripts/                        # GitLab Pages Generation (NEW)
│   ├── generate_dataset_samples.py
│   └── generate_pages_index.py
├── archive/                        # Historical Versions
│   ├── scripts/                    # Old generation scripts
│   └── presentations/              # Old .tex/.pdf versions
├── .gitlab-ci.yml                  # GitLab Pages Deployment (NEW)
├── CLAUDE.md                       # Project instructions
├── README.md                       # This file
└── template_beamer_final.tex       # Beamer template

Educational Progression

Week 1-2: Basic Text Analysis

  • Dataset: data/raw/basic/ (400 headlines)
  • Topics: Tokenization, word counts, Zipf's law
  • Exercises: basic/classification_basic.ipynb, basic/text_generation_basic.ipynb
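
Zipf's law says that a word's frequency is roughly inversely proportional to its rank, so rank × frequency stays in the same ballpark across the vocabulary. A minimal sketch of the rank-frequency table, assuming whitespace tokenization and using three invented sentences rather than the 400-headline dataset (`rank_frequency` is an illustrative name):

```python
from collections import Counter

def rank_frequency(texts):
    """Return (rank, word, frequency) tuples, most frequent word first."""
    counts = Counter(token for text in texts for token in text.lower().split())
    return [(rank, word, freq)
            for rank, (word, freq) in enumerate(counts.most_common(), start=1)]

# Toy texts, invented for illustration
texts = ["the president met the senate",
         "the senate passed the bill",
         "the bill"]
table = rank_frequency(texts)
for rank, word, freq in table[:3]:
    print(rank, word, freq, rank * freq)
```

On a corpus this small the pattern is only hinted at; plotting log(rank) against log(frequency) on the full dataset should give something close to a straight line if the law holds.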

Week 3-4: Semantic Search Task

  • Learn: How to find similar documents by meaning
  • Code: tasks/semantic_search/
  • Presentation: 28 slides, hands-on embedding tutorial
  • Output: Search system that finds semantically similar headlines
  • Exercises: basic/semantic_search_basic.ipynb, intermediate/semantic_search_intermediate.ipynb

Week 5-6: Classification Task

  • Learn: How to categorize text automatically
  • Code: tasks/classification/
  • Presentation: 39 slides, 6 model comparison
  • Output: Trained classifier (85-95% accuracy)
  • Exercises: intermediate/classification_intermediate.ipynb

Week 7-8: Clustering Task

  • Learn: How to discover topics without labels
  • Code: tasks/clustering/
  • Presentation: 33 slides, unsupervised methods
  • Output: Document clusters, 2D visualizations
  • Exercises: basic/clustering_basic.ipynb, intermediate/clustering_intermediate.ipynb

Week 9-10: Neural Networks Foundations

  • Learn: Building blocks for advanced NLP
  • Code: foundations/neural_networks/
  • Presentations: 3 variants (choose based on audience)
  • Notebooks: feedforward_from_scratch, rnn_sentiment_example, cnn_text_classification, transformer_attention_demo
  • Output: Understanding of 7 architectures
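
The attention mechanism behind the Transformer fits in a few lines: each query is compared with every key, the scaled scores are turned into weights with a softmax, and the output is the weighted average of the values. The sketch below is a single-head, pure-Python illustration with invented 2-D vectors; it is not taken from the transformer_attention_demo notebook, and `softmax` and `attention` are illustrative names.

```python
import math

def softmax(scores):
    peak = max(scores)                       # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention; returns outputs and attention weights."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in keys]
        weights = softmax(scores)
        output = [sum(w * v[j] for w, v in zip(weights, values))
                  for j in range(len(values[0]))]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights

# Toy vectors, invented for illustration
queries = [[1.0, 0.0]]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[10.0, 0.0], [0.0, 10.0]]
outputs, weights = attention(queries, keys, values)
print(weights[0], outputs[0])
```

The query matches the first key more closely, so the first value receives the larger weight in the output.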

Week 11-12: Text Generation Task

  • Learn: How to generate coherent text
  • Code: tasks/text_generation/
  • Presentation: 50 slides, N-gram to Transformer
  • Output: Working text generator (n-gram + neural)
  • Exercises: basic/text_generation_basic.ipynb, intermediate/text_generation_intermediate.ipynb

Week 13-14: Summarization Task

  • Learn: How to create concise summaries
  • Code: tasks/summarization/
  • Presentation: 45 slides, extractive and abstractive methods
  • Output: TextRank and BART summarizers
  • Exercises: Basic summarization exercises (to be created)

Week 15-16: Sentiment Analysis Task

  • Learn: How to detect emotional tone
  • Code: tasks/sentiment_analysis/
  • Presentation: 35 slides, 4 model comparison
  • Output: Sentiment classifier (BERT achieves 61.7% accuracy)
  • Exercises: basic/sentiment_basic.ipynb, intermediate/sentiment_intermediate.ipynb

Week 17-18: Named Entity Recognition Task

  • Learn: How to extract structured information
  • Code: tasks/ner/
  • Presentation: 40 slides, BIO tagging and BERT-NER
  • Output: Entity extractor (~0.92 F1 score)
  • Exercises: basic/ner_basic.ipynb, intermediate/ner_intermediate.ipynb

Common Workflows

I want to... (Quick-Finding Guide)

| I want to... | Go to... |
| Build a text classifier | tasks/classification/ |
| Generate new text | tasks/text_generation/ |
| Find similar documents | tasks/semantic_search/ |
| Cluster documents | tasks/clustering/ |
| Summarize articles | tasks/summarization/ |
| Analyze sentiment | tasks/sentiment_analysis/ |
| Extract entities | tasks/ner/ |
| Learn neural architectures | foundations/neural_networks/ |
| Practice with exercises | exercises/basic/ or exercises/intermediate/ |
| Understand the datasets | docs/DATASET_OVERVIEW.md |
| Regenerate everything | wiki_pages/Reproducibility-Guide.md |
| View presentations online | GitLab Pages (see URL above) |

Generate All Presentations

Each task has its own presentation:

# Classification (39 slides)
cd tasks/classification/presentation
python generate_charts.py && pdflatex 20251006_1256_supervised_tutorial.tex

# Clustering (33 slides)
cd tasks/clustering/presentation
python generate_charts.py && pdflatex 20251003_2206_unsupervised_tutorial.tex

# Semantic Search (28 slides)
cd tasks/semantic_search/presentation
python generate_charts.py && pdflatex 20251003_1430_tsne_tutorial.tex

# Neural Networks (choose variant)
cd foundations/neural_networks/presentation
python generate_charts.py && python generate_graphviz_charts.py
cd advanced_comprehensive
pdflatex 20251024_0452_neural_networks.tex

GitLab Pages (NEW!)

Public Website: https://osterrijoerg.git.fhgr.ch/digital-finance/Natural-Language-Processing-Details

Features:

  • All presentation PDFs online
  • Interactive dataset browser (search, filter, sort)
  • Complete documentation
  • No git clone needed for students

Deploy:

git push origin master  # Automatic deployment via .gitlab-ci.yml

Documentation

Complete Wiki (13 Pages)

Upload wiki_pages/*.md to GitLab Wiki for comprehensive course documentation:

  • Setup & Getting Started (4 pages)
  • Task Tutorials (4 pages) - One per task
  • Reference (5 pages) - Datasets, structure, FAQ, contributing

Technical Documentation

  • CLAUDE.md - Complete project documentation and workflows
  • Task READMEs - Detailed guide for each task
  • docs/ - Dataset specs, Zipf analysis, pedagogical framework

Key Features

  • Task-Oriented: Learn by doing (classify, generate, search, cluster)
  • Complete Implementations: Working code for all tasks
  • Publication-Quality: 200+ professional charts across presentations
  • Fully Reproducible: All generation scripts included
  • Interactive Learning: Jupyter notebooks with hands-on exercises
  • Professional Slides: Beamer presentations (Madrid theme, 8pt, purple accent)
  • Comprehensive Docs: 13 wiki pages, task READMEs, technical specs
  • GitLab Pages: Public website with dataset browser

Statistics

  • 7 Core Tasks (classification, generation, clustering, search, summarization, sentiment, NER)
  • 1 Foundation (neural networks with 7 architectures + 4 implementation notebooks)
  • 13 Presentations (420+ slides total)
  • 10,000 Headlines (main dataset for most tasks)
  • 1,000 Articles (summarization dataset)
  • 240+ Charts (all publication-quality, modular structure)
  • 13 Wiki Pages (20,000+ words of documentation)
  • 7 Task READMEs (comprehensive task guides)
  • 28 Exercise Notebooks (14 basic + 14 intermediate with solutions)

Contributing

See wiki_pages/Contributing.md for:

  • Adding new tasks
  • Updating presentations
  • Version control policy
  • Quality checklist

License

Educational materials for FHGR courses. Not for commercial use.


  • Last Updated: 2025-11-28
  • Maintainer: Digital Finance @ FHGR
  • Repository: https://git.fhgr.ch/digital-finance/Natural-Language-Processing-Details
  • Website: https://osterrijoerg.git.fhgr.ch/digital-finance/Natural-Language-Processing-Details


(c) Joerg Osterrieder 2025