AI Orchestrator

A Specialized and Secure AI Orchestrator for Swiss Financial Compliance


WP3: Long Document Understanding & Field Extraction



Overview

| Attribute | Value |
| --- | --- |
| Duration | M4-M15 |
| FHGR Hours | 600h |
| Wecan Hours | 500h |
| Total Hours | 1,100h |
| Lead | FHGR Research Lead |

Objectives

  1. Develop robust OCR pipeline for Swiss compliance documents
  2. Implement hierarchical attention for 50-100 page documents
  3. Achieve 90% extraction accuracy on extended documents
  4. Handle variable layouts, tables, and multi-language content
  5. Validate on 100+ real-world documents

Technical Approach

OCR Pipeline Architecture

Scanned Document
       |
       v
+----------------+
| Pre-processing |  <-- Deskew, denoise, enhance
+----------------+
       |
       v
+----------------+
| OCR Ensemble   |  <-- PyMuPDF + Tesseract + PaddleOCR
+----------------+
       |
       v
+----------------+
| Layout Analysis|  <-- Docling, LayoutLM
+----------------+
       |
       v
+----------------+
| Text Merging   |  <-- Confidence-weighted fusion
+----------------+
       |
       v
Structured Output
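
The text-merging stage above can be sketched as a confidence-weighted vote over the engines' readings. This is a minimal illustration, assuming each engine reports a per-region confidence in [0, 1]; the class and function names are hypothetical, and a production pipeline would first align regions across engines before fusing.

```python
from dataclasses import dataclass

@dataclass
class OcrResult:
    """Output of one OCR engine for a single text region."""
    engine: str
    text: str
    confidence: float  # engine-reported confidence in [0, 1]

def fuse_results(results: list[OcrResult]) -> str:
    """Confidence-weighted fusion: pick the reading whose summed
    engine confidence is highest across all engines."""
    votes: dict[str, float] = {}
    for r in results:
        votes[r.text] = votes.get(r.text, 0.0) + r.confidence
    return max(votes, key=votes.get)

# Three engines read the same region; one confuses 'i' with 'l'.
region = [
    OcrResult("tesseract", "Zurich", 0.82),
    OcrResult("paddleocr", "Zurich", 0.91),
    OcrResult("pymupdf", "Zurlch", 0.40),
]
print(fuse_results(region))  # -> Zurich
```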

Hierarchical Attention for Long Documents

Document (100 pages)
       |
       v
+------------------+
| Page-Level       |  <-- Process each page
| Attention        |
+------------------+
       |
       v
+------------------+
| Section-Level    |  <-- Group by document sections
| Attention        |
+------------------+
       |
       v
+------------------+
| Document-Level   |  <-- Global context
| Attention        |
+------------------+
       |
       v
Extracted Fields

Technology Evaluation

OCR Candidates

| Technology | Strengths | Weaknesses | Status |
| --- | --- | --- | --- |
| PyMuPDF | Fast, native PDF | No scanned support | Evaluate |
| Tesseract | Open source, multilingual | Accuracy varies | Evaluate |
| PaddleOCR | High accuracy, tables | Chinese-focused | Evaluate |
| Docling | Layout-aware | Newer, less tested | Evaluate |
| EasyOCR | Simple API | Less accurate | Evaluate |

Evaluation Metrics

| Metric | Description | Target |
| --- | --- | --- |
| Character Error Rate (CER) | Character-level accuracy | <5% |
| Word Error Rate (WER) | Word-level accuracy | <10% |
| Table Detection | Tables correctly identified | >95% |
| Layout Accuracy | Structure preserved | >90% |
| Processing Speed | Pages per minute | >10 |
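
CER and WER are both edit-distance ratios: Levenshtein distance between reference and hypothesis, normalized by reference length, computed over characters for CER and over tokens for WER. A self-contained sketch:

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance, usable over characters or token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: char-level edits / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: token-level edits / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

print(round(cer("kitten", "sitting"), 3))  # -> 0.5
```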

Activities

M4-M6: Technology Selection

| Activity | Owner | Output |
| --- | --- | --- |
| Install and configure PyMuPDF + Tesseract | FHGR | Working pipeline |
| Install and configure Docling | FHGR | Working pipeline |
| Install and configure PaddleOCR | FHGR | Working pipeline |
| Create evaluation test set (50 docs) | FHGR | Test dataset |
| Complete digitization benchmark | FHGR | Benchmark results |
| Select primary OCR technology | FHGR | Selection report |
| Document technology rationale | FHGR | D3.1 |

M7-M12: Development

| Activity | Owner | Output |
| --- | --- | --- |
| Implement hierarchical attention | FHGR | Attention module |
| Develop table extraction | FHGR | Table parser |
| Handle multi-language content | FHGR | Language detection |
| Integrate with WP2 models | FHGR | Unified pipeline |
| Build extraction prototypes | FHGR | D3.2 |

M13-M15: Validation

| Activity | Owner | Output |
| --- | --- | --- |
| Validate on 100 documents | FHGR | Validation results |
| Optimize performance | FHGR | Performance report |
| Document validation results | FHGR | D3.3 |

Deliverables

| ID | Deliverable | Due | Owner | Status |
| --- | --- | --- | --- | --- |
| D3.1 | Technology evaluation report | M6 | FHGR | Complete |
| D3.2 | Document extraction prototypes | M12 | Wecan | Complete |
| D3.3 | Validation report (100 docs) | M15 | FHGR | Complete |

All deliverable templates are complete; see deliverables/ for details.


Document Processing Pipeline

Input Handling

| Format | Handling | Notes |
| --- | --- | --- |
| Native PDF | Direct text extraction | Preserve layout |
| Scanned PDF | OCR + layout analysis | Multi-pass if needed |
| Image files | OCR | JPEG, PNG, TIFF |
| Mixed mode | Detect and route | Per-page decision |
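
The per-page routing decision for mixed-mode documents can be sketched as a check on the page's embedded text layer: a scanned page typically yields an empty or near-empty layer. The 50-character threshold below is an assumed heuristic, not a project-specified value, and the function name is illustrative.

```python
def route_page(embedded_text: str, min_chars: int = 50) -> str:
    """Per-page routing for mixed-mode PDFs: if the page carries a
    usable embedded text layer, extract it directly; otherwise send
    the rendered page image to OCR. min_chars is an assumed cutoff."""
    return "native" if len(embedded_text.strip()) >= min_chars else "ocr"

# Simulated text layers for three pages of a mixed-mode document.
pages = ["", "x" * 400, "   \n  "]
print([route_page(p) for p in pages])  # -> ['ocr', 'native', 'ocr']
```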

Field Extraction Types

| Field Type | Method | Accuracy Target |
| --- | --- | --- |
| Text fields | NER + context | 95% |
| Numeric values | Pattern + validation | 98% |
| Dates | Pattern + normalization | 98% |
| Tables | Structure detection | 90% |
| Checkboxes | Visual detection | 95% |
| Signatures | Presence detection | 90% |
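
For dates and numeric values, "pattern + validation/normalization" can be illustrated as regex matching followed by a semantic check. This sketch assumes Swiss conventions (DD.MM.YYYY dates, apostrophe thousands separators); the patterns and names are illustrative, not the project's actual extractors.

```python
import re
from datetime import date
from typing import Optional

DATE_RE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")    # DD.MM.YYYY
AMOUNT_RE = re.compile(r"\d{1,3}(?:'\d{3})*(?:\.\d{2})?")     # e.g. 1'250.00

def normalize_date(text: str) -> Optional[date]:
    """Pattern + normalization: match, then validate against the
    calendar so impossible dates such as 31.02.2024 are rejected."""
    m = DATE_RE.search(text)
    if not m:
        return None
    d, month, y = map(int, m.groups())
    try:
        return date(y, month, d)
    except ValueError:
        return None

def parse_amount(text: str) -> Optional[float]:
    """Pattern + validation for Swiss-formatted amounts."""
    m = AMOUNT_RE.search(text)
    return float(m.group().replace("'", "")) if m else None

print(normalize_date("Datum: 31.12.2024"))   # -> 2024-12-31
print(parse_amount("Betrag: CHF 1'250.00"))  # -> 1250.0
```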

Language Support

| Language | Priority | Training Data |
| --- | --- | --- |
| German | High | 40% of corpus |
| French | High | 30% of corpus |
| Italian | Medium | 15% of corpus |
| English | Medium | 15% of corpus |
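
Routing a page to the right OCR language model requires language detection first. As a toy illustration only, a stop-word tally can separate the four target languages; a production system would use a trained detector (e.g. fastText), and the word lists here are an assumption, not project data.

```python
STOPWORDS = {
    "de": {"der", "die", "das", "und", "nicht", "mit", "für"},
    "fr": {"le", "les", "et", "ne", "pas", "avec", "pour"},
    "it": {"il", "che", "e", "non", "con", "per", "della"},
    "en": {"the", "and", "not", "with", "for", "is", "of"},
}

def detect_language(text: str) -> str:
    """Score each language by stop-word hits; return the best match."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words)
              for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(detect_language("Der Kunde und die Bank"))  # -> de
```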

Long Document Strategies

Challenge: Context Window Limits

Current LLMs have context windows of roughly 4K-128K tokens. At an assumed ~500 tokens per page, a 100-page compliance document is on the order of 50,000 tokens, so it can exceed the usable window once prompts and extraction instructions are added.

Solutions Evaluated

| Strategy | Description | Trade-offs |
| --- | --- | --- |
| Chunking | Split document, process chunks | Context loss at boundaries |
| Sliding Window | Overlapping chunks | Redundant processing |
| Hierarchical | Page -> Section -> Document | Complexity, but preserves context |
| Map-Reduce | Extract per page, aggregate | May miss cross-page references |
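
The sliding-window strategy in the table can be sketched in a few lines: chunks of a fixed size advance by `size - overlap` tokens, so every boundary is seen twice at the cost of reprocessing the overlap. This assumes tokenization has already happened; the sizes are illustrative.

```python
def sliding_chunks(tokens: list[str], size: int = 1000, overlap: int = 200):
    """Yield overlapping chunks; the overlap reduces context loss at
    boundaries at the cost of redundant processing."""
    step = size - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + size]

tokens = [f"t{i}" for i in range(2500)]
chunks = list(sliding_chunks(tokens, size=1000, overlap=200))
print(len(chunks))  # -> 3 chunks covering all 2,500 tokens
```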

Selected Approach: Hierarchical Attention

  1. Page Level: Extract all fields from each page
  2. Section Level: Group pages by document section, resolve cross-page entities
  3. Document Level: Validate consistency, resolve conflicts
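
The document-level step above can be sketched as a merge of per-page extractions with conflict resolution by confidence. This is a minimal illustration under the assumption that each page yields `field -> (value, confidence)` pairs; the field names and data shapes are hypothetical.

```python
def aggregate(page_fields: list[dict]) -> dict:
    """Document-level consistency step: merge per-page extractions,
    keeping the highest-confidence value for each field name."""
    best: dict[str, tuple[str, float]] = {}
    for fields in page_fields:                      # one dict per page
        for name, (value, conf) in fields.items():
            if name not in best or conf > best[name][1]:
                best[name] = (value, conf)
    return {name: value for name, (value, _) in best.items()}

# Page 2 re-reads the client name badly but adds a new field.
pages = [
    {"client_name": ("ACME AG", 0.97)},
    {"client_name": ("ACME A6", 0.61), "iban": ("CH93 0076 2011 6238 5295 7", 0.99)},
]
print(aggregate(pages))
```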

Objective Alignment

| Objective | WP3 Contribution |
| --- | --- |
| OBJ1: 90% Document Accuracy | Primary owner (extraction accuracy) |
| OBJ4: <2 Hours Processing | Performance optimization |
| OBJ8: 500 Multilingual Documents | Multi-language OCR |

GitHub Issue: #430 - Document Accuracy Blind Assessment Protocol


Milestone Checkpoints

MS1 (M4)

MS2 (M6)

MS3 (M12)

MS4 (M15)


Integration Points

From WP2

| Input | Description | Timeline |
| --- | --- | --- |
| Domain-adapted models | Fine-tuned LLMs | M6 |
| Hallucination detection | Validation methods | M6 |
| Annotated dataset | Training/test data | M12 |

To WP4

| Output | Description | Timeline |
| --- | --- | --- |
| Extracted fields | Structured data | M12 |
| Confidence scores | Per-field certainty | M12 |
| Document structure | Section/page hierarchy | M12 |

Performance Targets

| Metric | Target | Measurement |
| --- | --- | --- |
| Accuracy | 90% field-level | Blind assessment |
| Speed | 10+ pages/minute | Benchmark suite |
| Memory | <24GB peak | GPU monitoring |
| Languages | 4 (DE, FR, IT, EN) | Per-language metrics |
