Natural-Language-Processing-Details
Task-oriented NLP educational materials for undergraduate courses
Information
| Property | Value |
|---|---|
| Language | Jupyter Notebook |
| Stars | 0 |
| Forks | 0 |
| Watchers | 0 |
| Open Issues | 0 |
| License | No License |
| Created | 2025-11-29 |
| Last Updated | 2026-02-19 |
| Last Push | 2025-12-19 |
| Contributors | 1 |
| Default Branch | master |
| Visibility | private |
Notebooks
This repository contains 40 notebook(s):
| Notebook | Language | Type |
|---|---|---|
| nlp_basics_homework | PYTHON | jupyter |
| nlp_basics_solutions | PYTHON | jupyter |
| statistical_analysis | PYTHON | jupyter |
| zipf_law_analysis | PYTHON | jupyter |
| zipf_law_analysis | PYTHON | jupyter |
| classification_basic | PYTHON | jupyter |
| classification_basic_solutions | PYTHON | jupyter |
| clustering_basic | PYTHON | jupyter |
| clustering_basic_solutions | PYTHON | jupyter |
| ner_basic | PYTHON | jupyter |
| ner_basic_solutions | PYTHON | jupyter |
| semantic_search_basic | PYTHON | jupyter |
| semantic_search_basic_solutions | PYTHON | jupyter |
| sentiment_basic | PYTHON | jupyter |
| sentiment_basic_solutions | PYTHON | jupyter |
| text_generation_basic | PYTHON | jupyter |
| text_generation_basic_solutions | PYTHON | jupyter |
| classification_intermediate | PYTHON | jupyter |
| classification_intermediate_solutions | PYTHON | jupyter |
| clustering_intermediate | PYTHON | jupyter |
| clustering_intermediate_solutions | PYTHON | jupyter |
| ner_intermediate | PYTHON | jupyter |
| ner_intermediate_solutions | PYTHON | jupyter |
| semantic_search_intermediate | PYTHON | jupyter |
| semantic_search_intermediate_solutions | PYTHON | jupyter |
| sentiment_intermediate | PYTHON | jupyter |
| sentiment_intermediate_solutions | PYTHON | jupyter |
| text_generation_intermediate | PYTHON | jupyter |
| text_generation_intermediate_solutions | PYTHON | jupyter |
| cnn_text_classification | PYTHON | jupyter |
| feedforward_from_scratch | PYTHON | jupyter |
| rnn_sentiment_example | PYTHON | jupyter |
| transformer_attention_demo | PYTHON | jupyter |
| classification_tutorial | PYTHON | jupyter |
| ner_analysis | PYTHON | jupyter |
| embedding_analysis | PYTHON | jupyter |
| sentiment_analysis | PYTHON | jupyter |
| summarization_analysis | PYTHON | jupyter |
| compare_models | PYTHON | jupyter |
| ngram_analysis | PYTHON | jupyter |
Datasets
This repository includes 67 dataset(s):
| Dataset | Format | Size |
|---|---|---|
| data | | 0.0 KB |
| processed | | 0.0 KB |
| embeddings | | 0.0 KB |
| embeddings_metadata.json | .json | 736.77 KB |
| headlines_embeddings.npy | .npy | 15000.12 KB |
| models | | 0.0 KB |
| classification | | 0.0 KB |
| sentiment | | 0.0 KB |
| text_generation | | 0.0 KB |
| samples | | 0.0 KB |
| sample_20251001_190000_1.txt | .txt | 0.22 KB |
| sample_20251001_190000_2.txt | .txt | 0.23 KB |
| sample_20251001_190000_3.txt | .txt | 0.22 KB |
| sample_20251001_190033_1.txt | .txt | 1.45 KB |
| sample_20251001_190033_2.txt | .txt | 1.47 KB |
| sample_20251001_190033_3.txt | .txt | 1.46 KB |
| visualizations | | 0.0 KB |
| clustering_comparison.png | .png | 842.78 KB |
| pca_visualization.png | .png | 584.71 KB |
| similarity_distribution.png | .png | 86.02 KB |
| tsne_visualization.png | .png | 294.86 KB |
| raw | | 0.0 KB |
| articles | | 0.0 KB |
| news_articles_dataset.csv | .csv | 545.38 KB |
| test.csv | .csv | 81.69 KB |
| train.csv | .csv | 381.88 KB |
| val.csv | .csv | 81.97 KB |
| basic | | 0.0 KB |
| VERIFICATION_REPORT.md | .md | 4.77 KB |
| generate_headlines.py | .py | 10.62 KB |
| news_headlines_dataset.csv | .csv | 27.88 KB |
| nlp_basics_homework.ipynb | .ipynb | 14.71 KB |
| nlp_basics_solutions.ipynb | .ipynb | 547.96 KB |
| extended | | 0.0 KB |
| news_headlines_extended.csv | .csv | 724.04 KB |
| statistical_analysis.ipynb | .ipynb | 571.07 KB |
| test.csv | .csv | 108.68 KB |
| train.csv | .csv | 507.07 KB |
| val.csv | .csv | 108.38 KB |
| zipf_law_analysis.ipynb | .ipynb | 352.02 KB |
| extended_ner | | 0.0 KB |
| README.md | .md | 8.53 KB |
| extended_sentiment | | 0.0 KB |
| README.md | .md | 5.94 KB |
| news_headlines_extended_sentiment.csv | .csv | 857.62 KB |
| test.csv | .csv | 128.66 KB |
| train.csv | .csv | 600.29 KB |
| val.csv | .csv | 128.81 KB |
| large | | 0.0 KB |
| news_headlines_large.csv | .csv | 1824.02 KB |
| zipf_law_analysis.ipynb | .ipynb | 346.81 KB |
| results | | 0.0 KB |
| classification | | 0.0 KB |
| decision_tree_predictions.json | .json | 168.8 KB |
| detailed_results.json | .json | 2.94 KB |
| logistic_regression_predictions.json | .json | 168.49 KB |
| model_comparison.csv | .csv | 0.99 KB |
| naive_bayes_predictions.json | .json | 167.88 KB |
| neural_network_predictions.json | .json | 168.53 KB |
| random_forest_predictions.json | .json | 168.51 KB |
| svm_predictions.json | .json | 168.53 KB |
| sentiment | | 0.0 KB |
| model_comparison.csv | .csv | 0.46 KB |
| datasets | | 0.0 KB |
| index.html | .html | 8.46 KB |
| js | | 0.0 KB |
| dataset-browser.js | .js | 7.94 KB |
Reproducibility
This repository includes reproducibility tools:
- Python requirements.txt
Status
- Issues: Enabled
- Wiki: Disabled
- Pages: Disabled
README
Natural Language Processing - Educational Materials
- Repository: https://git.fhgr.ch/digital-finance/Natural-Language-Processing-Details
- Organization: Digital Finance @ FHGR
- Purpose: Task-oriented NLP course materials from classification to generation
- GitLab Pages: https://osterrijoerg.git.fhgr.ch/digital-finance/Natural-Language-Processing-Details
Quick Start
Prerequisites
- Python 3.8+
- LaTeX distribution (MiKTeX, TeX Live, or MacTeX)
- Required packages:
Task-Based Organization
This repository is organized by NLP tasks (what you want to accomplish) rather than methodologies (how you accomplish it):
tasks/ # NLP Task Implementations
├── classification/ # Text Classification
│   └── 6 models: Logistic, Naive Bayes, Decision Tree, Random Forest, SVM, Neural Net
├── text_generation/ # Text Generation
│   └── N-gram & Neural language models
├── clustering/ # Document Organization
│   └── K-Means, Hierarchical, DBSCAN + PCA, t-SNE, UMAP
├── semantic_search/ # Similarity Search
│   └── Embedding-based semantic search
├── summarization/ # Text Summarization
│   └── Extractive (TextRank) & Abstractive (BART)
├── sentiment_analysis/ # Sentiment Classification
│   └── 4 models: Logistic, SVM, LSTM, BERT
└── ner/ # Named Entity Recognition
    └── BERT-based token classification with BIO tagging
foundations/ # Foundational Knowledge
└── neural_networks/ # 7 Architectures (not a task, but building blocks)
    ├── Perceptron → Transformer → Autoencoder
    └── notebooks/ # 4 implementation notebooks
exercises/ # Practice Exercises
├── basic/ # 14 notebooks (7 tasks × 2)
└── intermediate/ # 14 notebooks (7 tasks × 2)
data/ # All Datasets & Artifacts
├── raw/ # Original CSV datasets
├── processed/ # Generated embeddings, models, visualizations
└── results/ # Model outputs and predictions
Why Task-Based?
Traditional (Method-Based):
- "Learn supervised learning"
- Abstract, methodology-focused
- "I studied embeddings"

Task-Based (This Repo):
- "Build a text classifier"
- Concrete, goal-focused
- "I built a semantic search system"
Students learn WHAT to solve (tasks) before HOW to solve it (methods).
The Seven Core NLP Tasks
1. Classification (tasks/classification/)
- What: Assign categories to text
- Example: "President announces policy" → Politics
- Models: 6 algorithms (85-95% accuracy)
- Use Cases: Spam filtering, sentiment analysis, topic categorization
Quick Start:
cd tasks/classification
python train_models.py # Train all 6 models (~3 min)
cd presentation && python generate_charts.py # Generate 26 charts
pdflatex 20251006_1256_supervised_tutorial.tex # Compile 39-slide presentation
Output: Trained models in data/processed/models/classification/*.pkl
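To make the idea of assigning a category to a headline concrete, here is a minimal sketch of one of the six listed model families, Naive Bayes, written in plain Python. It is not the code in train_models.py; the class name, the whitespace tokenizer, and the toy headlines are all assumptions made for illustration.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return text.lower().split()


class NaiveBayesClassifier:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            total_words = sum(self.word_counts[label].values())
            for tok in tokens:
                count = self.word_counts[label][tok] + 1
                score += math.log(count / (total_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy headlines, not taken from the repository's datasets.
train_texts = [
    "president announces new policy",
    "senate votes on budget bill",
    "team wins championship final",
    "striker scores twice in derby",
]
train_labels = ["Politics", "Politics", "Sports", "Sports"]

clf = NaiveBayesClassifier().fit(train_texts, train_labels)
prediction = clf.predict("president signs budget policy")
print(prediction)  # Politics
```

On real data you would train on train.csv and report accuracy on test.csv; the four-headline corpus here only shows the mechanics.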
2. Text Generation (tasks/text_generation/)
- What: Generate coherent, human-like text
- Example: "The president" → "announced a new economic policy"
- Models: 5-gram statistical + LSTM neural
- Use Cases: Content creation, chatbots, code completion
Quick Start:
cd tasks/text_generation
python train_5gram.py # Train 5-gram model (~30 sec)
python generate_half_page.py # Generate 200-word samples
Output: Model in data/processed/models/text_generation/5gram_extended.pkl
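The statistical side of this task can be sketched in a few lines. The example below is an assumption-laden illustration, not the contents of ngram_model.py or train_5gram.py: the function names are hypothetical, and it defaults to trigrams (n=3) because the toy corpus is far too small for 5-grams. Passing n=5 would give the same procedure the task describes.

```python
import random
from collections import Counter, defaultdict


def train_ngram(sentences, n=3):
    """Count which token follows each (n-1)-token context."""
    model = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] * (n - 1) + sentence.lower().split() + ["</s>"]
        for i in range(len(tokens) - n + 1):
            context = tuple(tokens[i:i + n - 1])
            model[context][tokens[i + n - 1]] += 1
    return model


def generate(model, n=3, max_tokens=20, seed=0):
    """Sample tokens one at a time, proportionally to their observed counts."""
    rng = random.Random(seed)
    context = ("<s>",) * (n - 1)
    output = []
    for _ in range(max_tokens):
        followers = model.get(context)
        if not followers:
            break
        words, counts = zip(*followers.items())
        token = rng.choices(words, weights=counts, k=1)[0]
        if token == "</s>":
            break
        output.append(token)
        # Slide the context window forward by one token.
        context = (context + (token,))[1:]
    return " ".join(output)


# Hypothetical toy corpus, not taken from the repository's datasets.
corpus = [
    "the president announced a new economic policy",
    "the president announced a trade deal",
    "the senate approved a new budget",
]
model = train_ngram(corpus, n=3)
text = generate(model, n=3, seed=0)
print(text)
```

Larger n gives more fluent output on a large corpus but memorizes more of the training text, which is the trade-off a 5-gram model sits on.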
3. Clustering (tasks/clustering/)
- What: Group similar documents without labels
- Example: Automatically discover that sports headlines cluster together
- Algorithms: K-Means, Hierarchical, DBSCAN + PCA, t-SNE, UMAP
- Use Cases: Topic discovery, document organization, exploratory analysis
Quick Start:
cd tasks/clustering/presentation
python generate_charts.py # Generate 24 charts (~3 min)
pdflatex 20251003_2206_unsupervised_tutorial.tex # Compile 33-slide presentation
Input: Pre-computed embeddings from semantic_search task
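As a rough illustration of the first algorithm in the list, the sketch below implements K-Means (Lloyd's algorithm) with NumPy. It is not the repository's clustering code: the function name is hypothetical, and the six 2-D points stand in for the 10,000 × 384 embedding matrix that the real task takes as input.

```python
import numpy as np


def kmeans(X, k, n_iter=50, seed=0):
    """Lloyd's algorithm: assign to the nearest centroid, then recompute means."""
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Squared Euclidean distance from every point to every centroid.
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # Keep the old centroid if a cluster ends up empty.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids


# Toy stand-in for document embeddings: two well-separated groups.
X = np.array([
    [0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
    [10.0, 10.0], [10.0, 11.0], [11.0, 10.0],
])
labels, centroids = kmeans(X, k=2)
print(labels)
```

Cluster ids are arbitrary, so compare which points share a label rather than the label values themselves.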
4. Semantic Search (tasks/semantic_search/)
- What: Find similar documents by meaning (not keywords)
- Example: "president policy" finds "leader regulation" (different words, same meaning)
- Model: Sentence-transformers (384-D embeddings)
- Use Cases: Document retrieval, Q&A, recommendations, duplicate detection
Quick Start:
cd tasks/semantic_search
python generate_embeddings.py # Generate 10K embeddings (~2 min)
python semantic_search.py --interactive # Interactive search demo
Output: data/processed/embeddings/headlines_embeddings.npy (15 MB)
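Once embeddings exist, the search step itself is a nearest-neighbour lookup. The sketch below ranks rows by cosine similarity with NumPy; it is an illustration under assumptions, not the contents of semantic_search.py. The function name is hypothetical and the 3-D vectors stand in for the 384-D sentence embeddings.

```python
import numpy as np


def top_k_similar(query_vec, embeddings, k=3):
    """Return (row index, cosine similarity) pairs for the k most similar rows."""
    emb_norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    q_norm = query_vec / np.linalg.norm(query_vec)
    scores = emb_norm @ q_norm
    top = np.argsort(-scores)[:k]
    return [(int(i), float(scores[i])) for i in top]


# Toy vectors; in the real task these would presumably be loaded with
# np.load("data/processed/embeddings/headlines_embeddings.npy").
embeddings = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])
query = np.array([1.0, 0.0, 0.1])
results = top_k_similar(query, embeddings, k=2)
print(results)
```

The query vector must come from the same embedding model as the stored rows, otherwise the similarities are meaningless.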
5. Summarization (tasks/summarization/)
- What: Generate concise summaries of longer texts
- Example: 70-word article → 7-word headline
- Models: TextRank (extractive) + BART (abstractive)
- Use Cases: News aggregation, document digests, meeting notes
Quick Start:
cd tasks/summarization
python extractive_summary.py # TextRank (~2-3 min)
python train_abstractive.py # BART fine-tuning (30-60 min GPU)
cd presentation && python generate_charts.py
pdflatex 20251123_1604_summarization_tutorial.tex
Output: Summaries with ROUGE metrics, trained BART model
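Since the output is scored with ROUGE metrics, it may help to see what the simplest variant measures. The sketch below computes ROUGE-1 (unigram overlap) by hand; it is a simplified illustration with whitespace tokenization and no stemming, and the function name and example sentences are assumptions rather than anything taken from the repository.

```python
from collections import Counter


def rouge_1(candidate, reference):
    """Unigram-overlap precision, recall and F1 between two texts."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Counter intersection clips each word's count to the smaller of the two.
    overlap = sum((cand & ref).values())
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    f1 = 0.0 if overlap == 0 else 2 * precision * recall / (precision + recall)
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical generated summary and reference headline.
scores = rouge_1(
    "president announces new economic policy",
    "president unveils new economic policy today",
)
print(scores)  # precision 0.8, recall about 0.667, f1 about 0.727
```

Four of the five candidate words appear in the six-word reference, which is where the 0.8 precision and 4/6 recall come from.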
6. Sentiment Analysis (tasks/sentiment_analysis/)
- What: Determine emotional tone (positive, negative, neutral)
- Example: "The president announced a groundbreaking policy" → Positive
- Models: Logistic Regression, SVM, LSTM, BERT (4 models)
- Use Cases: Brand monitoring, customer feedback, market sentiment
Quick Start:
cd tasks/sentiment_analysis
python create_sentiment_labels.py # Label 10K headlines
python train_sentiment_models.py # Train 4 models (~5 min)
cd presentation && python generate_all_charts.py
pdflatex 20251128_1150_sentiment_tutorial.tex
Output: Trained models; BERT achieves 61.7% accuracy
7. Named Entity Recognition (tasks/ner/)
- What: Extract entities (people, places, organizations) from text
- Example: "President Biden announced policy in Washington" → PERSON: Biden, GPE: Washington
- Model: BERT fine-tuned for token classification with BIO tagging
- Use Cases: Information extraction, knowledge graphs, document indexing
Quick Start:
cd tasks/ner
python annotate_entities.py # Annotate 10K headlines (~3 min)
python train_ner_model.py # Fine-tune BERT (~15 min GPU)
cd presentation && python generate_all_charts.py
pdflatex 20251128_1202_ner_tutorial.tex
Output: NER model with ~0.92 F1, entity-tagged dataset
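BIO tagging labels each token as the Beginning of an entity, Inside one, or Outside any entity. The sketch below shows how a tag sequence is turned back into entity spans, using the example sentence from this section. The function name is hypothetical and the tags are written by hand; in the real pipeline they would come from the fine-tuned model.

```python
def bio_to_entities(tokens, tags):
    """Group BIO-tagged tokens into (entity_type, text) spans."""
    entities = []
    current_type, current_tokens = None, []
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            # A new entity starts; close any span that is still open.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)
        else:
            # "O", or an I- tag that does not continue the open span:
            # close the open span and drop this token.
            if current_tokens:
                entities.append((current_type, " ".join(current_tokens)))
            current_type, current_tokens = None, []
    if current_tokens:
        entities.append((current_type, " ".join(current_tokens)))
    return entities


tokens = ["President", "Biden", "announced", "policy", "in", "Washington"]
tags = ["O", "B-PERSON", "O", "O", "O", "B-GPE"]
entities = bio_to_entities(tokens, tags)
print(entities)  # [('PERSON', 'Biden'), ('GPE', 'Washington')]
```

The B- prefix is what lets two adjacent entities of the same type be told apart, which a plain per-token type label could not do.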
Foundational Knowledge
Neural Networks (foundations/neural_networks/)
Why "foundations"? Neural networks aren't an NLP task - they're the building blocks for solving tasks.
7 Architectures Covered:
1. Perceptron - Linear classification
2. Feedforward - Universal approximation
3. CNN - Spatial patterns
4. RNN - Short-term sequence memory
5. LSTM - Long-term sequence memory
6. Transformer - Attention mechanism
7. Autoencoder - Representation learning

3 Presentation Variants:
- Advanced/Comprehensive (89 slides) - All 7 architectures
- Feedforward Standard (58 slides) - Iris classification example
- Feedforward UAT (50 slides) - Universal approximation theorem proof
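The first architecture in the list is small enough to write out in full. The sketch below trains a perceptron on the logical AND function in plain Python; the function names and the choice of AND as the toy problem are assumptions for illustration and are not taken from the repository's notebooks.

```python
def train_perceptron(samples, labels, epochs=10, lr=1.0):
    """Rosenblatt perceptron: update weights only on misclassified samples."""
    weights = [0.0] * len(samples[0])
    bias = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):  # y is +1 or -1
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:
                weights = [w + lr * y * xi for w, xi in zip(weights, x)]
                bias += lr * y
    return weights, bias


def predict(weights, bias, x):
    """Classify x by the sign of the linear activation."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) + bias > 0 else -1


# Logical AND with labels in {-1, +1}: linearly separable, so training converges.
X = [(0, 0), (0, 1), (1, 0), (1, 1)]
y = [-1, -1, -1, 1]
w, b = train_perceptron(X, y, epochs=20)
print([predict(w, b, x) for x in X])  # [-1, -1, -1, 1]
```

Replacing AND with XOR makes the data non-separable, and the same loop never settles, which is the usual motivation for moving on to feedforward networks.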
Complete Directory Structure
NLP_Data/
├── tasks/ # MAIN: NLP Task Implementations
│   ├── classification/
│   │   ├── train_models.py
│   │   ├── README.md # Comprehensive task guide
│   │   └── presentation/ # 39-slide tutorial, 26 charts
│   ├── text_generation/
│   │   ├── ngram_model.py
│   │   ├── train_5gram.py
│   │   ├── neural_lm.py
│   │   ├── ngram_analysis.ipynb
│   │   └── README.md
│   ├── clustering/
│   │   ├── README.md
│   │   └── presentation/ # 33-slide tutorial, 24 charts
│   └── semantic_search/
│       ├── generate_embeddings.py
│       ├── semantic_search.py
│       ├── embedding_analysis.ipynb
│       ├── README.md
│       └── presentation/ # 28-slide tutorial, 18 charts
│
├── foundations/ # Foundational Techniques
│   └── neural_networks/
│       ├── README.md # When to use which architecture
│       └── presentation/
│           ├── advanced_comprehensive/ # 89 slides
│           ├── feedforward_standard/ # 58 slides
│           └── feedforward_uat/ # 50 slides
│
├── data/ # All Datasets & Artifacts
│   ├── raw/ # Original CSV datasets
│   │   ├── basic/ # 400 headlines
│   │   ├── extended/ # 10,000 headlines + splits
│   │   ├── articles/ # 1,000 articles + splits
│   │   └── large/ # Experimental
│   ├── processed/ # Generated artifacts
│   │   ├── embeddings/ # 10,000 × 384 vectors (15 MB)
│   │   ├── models/
│   │   │   ├── classification/ # 6 trained classifiers (15 MB)
│   │   │   └── text_generation/ # 5-gram model (902 KB)
│   │   ├── visualizations/ # PNG preview images
│   │   └── samples/ # Generated text
│   └── results/ # Model outputs
│       └── classification/ # Predictions, metrics
│
├── generators/ # Dataset Generation
│   ├── generate_headlines.py
│   ├── generate_extended_headlines.py
│   ├── generate_articles.py
│   └── create_splits.py
│
├── docs/ # Documentation
│   ├── DATASET_OVERVIEW.md
│   ├── ZIPF_LAW_ANALYSIS.md
│   ├── EDUCATIONAL_PRESENTATION_FRAMEWORK.md
│   └── presentations/ # Presentation docs
│
├── wiki_pages/ # 13 Comprehensive Wiki Pages
│   ├── home.md
│   ├── Setup-and-Prerequisites.md
│   ├── Task-Classification.md # NEW: 1,748 words
│   ├── Task-Text-Generation.md # NEW: 1,737 words
│   ├── Task-Clustering.md # NEW: 1,985 words
│   ├── Task-Semantic-Search.md # NEW: 1,849 words
│   ├── Foundations-Neural-Networks.md # NEW: 1,840 words
│   ├── Dataset-Overview.md # NEW: 1,448 words
│   ├── Repository-Structure.md # NEW: 1,585 words
│   ├── Reproducibility-Guide.md # NEW: 1,766 words
│   ├── FAQ.md # NEW: 1,869 words
│   └── Contributing.md # NEW: 1,708 words
│
├── public_source/ # GitLab Pages Source (NEW)
│   ├── index.html # Landing page
│   ├── css/style.css # Purple/gray theme
│   ├── slides/index.html # All presentations
│   ├── datasets/index.html # Interactive browser
│   └── docs/index.html # Documentation hub
│
├── scripts/ # GitLab Pages Generation (NEW)
│   ├── generate_dataset_samples.py
│   └── generate_pages_index.py
│
├── archive/ # Historical Versions
│   ├── scripts/ # Old generation scripts
│   └── presentations/ # Old .tex/.pdf versions
│
├── .gitlab-ci.yml # GitLab Pages Deployment (NEW)
├── CLAUDE.md # Project instructions
├── README.md # This file
└── template_beamer_final.tex # Beamer template
Educational Progression
Week 1-2: Basic Text Analysis
- Dataset: data/raw/basic/ (400 headlines)
- Topics: Tokenization, word counts, Zipf's law
- Exercises: basic/classification_basic.ipynb, basic/text_generation_basic.ipynb
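The three topics for these weeks fit together in one short script: tokenize, count, then check how frequency falls off with rank. The sketch below is an illustration only; the function names, the regular-expression tokenizer, and the toy headlines are assumptions and are not taken from zipf_law_analysis.ipynb.

```python
import math
import re
from collections import Counter


def rank_frequency(texts):
    """Return (rank, word, count) triples sorted by descending frequency."""
    counts = Counter()
    for text in texts:
        counts.update(re.findall(r"[a-z']+", text.lower()))
    return [(rank, word, count)
            for rank, (word, count) in enumerate(counts.most_common(), start=1)]


def zipf_slope(table):
    """Least-squares slope of log(count) against log(rank).

    Needs at least two distinct words, otherwise the variance is zero.
    """
    xs = [math.log(rank) for rank, _, _ in table]
    ys = [math.log(count) for _, _, count in table]
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var = sum((x - mean_x) ** 2 for x in xs)
    return cov / var


# Hypothetical toy headlines; the course uses data/raw/basic/ instead.
headlines = [
    "The president met the senate",
    "The team won the final",
    "The market fell",
]
table = rank_frequency(headlines)
slope = zipf_slope(table)
print(table[0], slope)
```

Zipf's law predicts a slope close to -1 on a large natural-language corpus; a three-headline sample only shows that the slope is negative.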
Week 3-4: Semantic Search Task
- Learn: How to find similar documents by meaning
- Code: tasks/semantic_search/
- Presentation: 28 slides, hands-on embedding tutorial
- Output: Search system that finds semantically similar headlines
- Exercises: basic/semantic_search_basic.ipynb, intermediate/semantic_search_intermediate.ipynb
Week 5-6: Classification Task
- Learn: How to categorize text automatically
- Code: tasks/classification/
- Presentation: 39 slides, 6 model comparison
- Output: Trained classifier (85-95% accuracy)
- Exercises: intermediate/classification_intermediate.ipynb
Week 7-8: Clustering Task
- Learn: How to discover topics without labels
- Code: tasks/clustering/
- Presentation: 33 slides, unsupervised methods
- Output: Document clusters, 2D visualizations
- Exercises: basic/clustering_basic.ipynb, intermediate/clustering_intermediate.ipynb
Week 9-10: Neural Networks Foundations
- Learn: Building blocks for advanced NLP
- Code: foundations/neural_networks/
- Presentations: 3 variants (choose based on audience)
- Notebooks: feedforward_from_scratch, rnn_sentiment_example, cnn_text_classification, transformer_attention_demo
- Output: Understanding of 7 architectures
Week 11-12: Text Generation Task
- Learn: How to generate coherent text
- Code: tasks/text_generation/
- Presentation: 50 slides, N-gram to Transformer
- Output: Working text generator (n-gram + neural)
- Exercises: basic/text_generation_basic.ipynb, intermediate/text_generation_intermediate.ipynb
Week 13-14: Summarization Task
- Learn: How to create concise summaries
- Code: tasks/summarization/
- Presentation: 45 slides, extractive and abstractive methods
- Output: TextRank and BART summarizers
- Exercises: Basic summarization exercises (to be created)
Week 15-16: Sentiment Analysis Task
- Learn: How to detect emotional tone
- Code: tasks/sentiment_analysis/
- Presentation: 35 slides, 4 model comparison
- Output: Sentiment classifier (BERT achieves 61.7% accuracy)
- Exercises: basic/sentiment_basic.ipynb, intermediate/sentiment_intermediate.ipynb
Week 17-18: Named Entity Recognition Task
- Learn: How to extract structured information
- Code: tasks/ner/
- Presentation: 40 slides, BIO tagging and BERT-NER
- Output: Entity extractor (~0.92 F1 score)
- Exercises: basic/ner_basic.ipynb, intermediate/ner_intermediate.ipynb
Common Workflows
I want to... (Quick-Finding Guide)
| I want to... | Go to... |
|---|---|
| Build a text classifier | tasks/classification/ |
| Generate new text | tasks/text_generation/ |
| Find similar documents | tasks/semantic_search/ |
| Cluster documents | tasks/clustering/ |
| Summarize articles | tasks/summarization/ |
| Analyze sentiment | tasks/sentiment_analysis/ |
| Extract entities | tasks/ner/ |
| Learn neural architectures | foundations/neural_networks/ |
| Practice with exercises | exercises/basic/ or exercises/intermediate/ |
| Understand the datasets | docs/DATASET_OVERVIEW.md |
| Regenerate everything | wiki_pages/Reproducibility-Guide.md |
| View presentations online | GitLab Pages (see URL above) |
Generate All Presentations
Each task has its own presentation:
# Classification (39 slides)
cd tasks/classification/presentation
python generate_charts.py && pdflatex 20251006_1256_supervised_tutorial.tex
# Clustering (33 slides)
cd tasks/clustering/presentation
python generate_charts.py && pdflatex 20251003_2206_unsupervised_tutorial.tex
# Semantic Search (28 slides)
cd tasks/semantic_search/presentation
python generate_charts.py && pdflatex 20251003_1430_tsne_tutorial.tex
# Neural Networks (choose variant)
cd foundations/neural_networks/presentation
python generate_charts.py && python generate_graphviz_charts.py
cd advanced_comprehensive
pdflatex 20251024_0452_neural_networks.tex
GitLab Pages (NEW!)
Public Website: https://osterrijoerg.git.fhgr.ch/digital-finance/Natural-Language-Processing-Details
Features:
- All presentation PDFs online
- Interactive dataset browser (search, filter, sort)
- Complete documentation
- No git clone needed for students
Deploy:
Documentation
Complete Wiki (13 Pages)
Upload wiki_pages/*.md to GitLab Wiki for comprehensive course documentation:
- Setup & Getting Started (4 pages)
- Task Tutorials (4 pages) - One per task
- Reference (5 pages) - Datasets, structure, FAQ, contributing
Technical Documentation
- CLAUDE.md - Complete project documentation and workflows
- Task READMEs - Detailed guide for each task
- docs/ - Dataset specs, Zipf analysis, pedagogical framework
Key Features
- Task-Oriented: Learn by doing (classify, generate, search, cluster)
- Complete Implementations: Working code for all tasks
- Publication-Quality: 200+ professional charts across presentations
- Fully Reproducible: All generation scripts included
- Interactive Learning: Jupyter notebooks with hands-on exercises
- Professional Slides: Beamer presentations (Madrid theme, 8pt, purple accent)
- Comprehensive Docs: 13 wiki pages, task READMEs, technical specs
- GitLab Pages: Public website with dataset browser
Statistics
- 7 Core Tasks (classification, generation, clustering, search, summarization, sentiment, NER)
- 1 Foundation (neural networks with 7 architectures + 4 implementation notebooks)
- 13 Presentations (420+ slides total)
- 10,000 Headlines (main dataset for most tasks)
- 1,000 Articles (summarization dataset)
- 240+ Charts (all publication-quality, modular structure)
- 13 Wiki Pages (20,000+ words of documentation)
- 7 Task READMEs (comprehensive task guides)
- 28 Exercise Notebooks (14 basic + 14 intermediate with solutions)
Contributing
See wiki_pages/Contributing.md for:
- Adding new tasks
- Updating presentations
- Version control policy
- Quality checklist
License
Educational materials for FHGR courses. Not for commercial use.
Last Updated: 2025-11-28
Maintainer: Digital Finance @ FHGR
Repository: https://git.fhgr.ch/digital-finance/Natural-Language-Processing-Details
Website: https://osterrijoerg.git.fhgr.ch/digital-finance/Natural-Language-Processing-Details
(c) Joerg Osterrieder 2025