AI Orchestrator

A Specialized and Secure AI Orchestrator for Swiss Financial Compliance


WP3: Long Document Understanding & Field Extraction



Overview

| Attribute | Value |
| --- | --- |
| Duration | M4-M15 |
| FHGR Hours | 600h |
| Wecan Hours | 500h |
| Total Hours | 1,100h |
| Lead | FHGR Research Lead |

Objectives

  1. Develop robust OCR pipeline for Swiss compliance documents
  2. Implement hierarchical attention for 50-100 page documents
  3. Achieve 90% extraction accuracy on extended documents
  4. Handle variable layouts, tables, and multi-language content
  5. Validate on 100+ real-world documents

Technical Approach

OCR Pipeline Architecture

Scanned Document
       |
       v
+----------------+
| Pre-processing |  <-- Deskew, denoise, enhance
+----------------+
       |
       v
+----------------+
| OCR Ensemble   |  <-- PyMuPDF + Tesseract + PaddleOCR
+----------------+
       |
       v
+----------------+
| Layout Analysis|  <-- Docling, LayoutLM
+----------------+
       |
       v
+----------------+
| Text Merging   |  <-- Confidence-weighted fusion
+----------------+
       |
       v
Structured Output
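
The text-merging stage above can be sketched as a confidence-weighted vote over the engines' readings. This is a minimal illustration, assuming each engine reports a per-region confidence in [0, 1]; the class and function names are hypothetical, and a production pipeline would first align regions across engines before fusing.

```python
from dataclasses import dataclass

@dataclass
class OcrResult:
    """Output of one OCR engine for a single text region."""
    engine: str
    text: str
    confidence: float  # engine-reported confidence in [0, 1]

def fuse_results(results: list[OcrResult]) -> str:
    """Confidence-weighted fusion: pick the reading whose summed
    engine confidence is highest across all engines."""
    votes: dict[str, float] = {}
    for r in results:
        votes[r.text] = votes.get(r.text, 0.0) + r.confidence
    return max(votes, key=votes.get)

# Three engines read the same region; one confuses 'i' with 'l'.
region = [
    OcrResult("tesseract", "Zurich", 0.82),
    OcrResult("paddleocr", "Zurich", 0.91),
    OcrResult("pymupdf", "Zurlch", 0.40),
]
print(fuse_results(region))  # -> Zurich
```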

Hierarchical Attention for Long Documents

Document (100 pages)
       |
       v
+------------------+
| Page-Level       |  <-- Process each page
| Attention        |
+------------------+
       |
       v
+------------------+
| Section-Level    |  <-- Group by document sections
| Attention        |
+------------------+
       |
       v
+------------------+
| Document-Level   |  <-- Global context
| Attention        |
+------------------+
       |
       v
Extracted Fields

Technology Evaluation

OCR Candidates

| Technology | Strengths | Weaknesses | Status |
| --- | --- | --- | --- |
| PyMuPDF | Fast, native PDF | No scanned support | Evaluate |
| Tesseract | Open source, multilingual | Accuracy varies | Evaluate |
| PaddleOCR | High accuracy, tables | Chinese-focused | Evaluate |
| Docling | Layout-aware | Newer, less tested | Evaluate |
| EasyOCR | Simple API | Less accurate | Evaluate |

Evaluation Metrics

| Metric | Description | Target |
| --- | --- | --- |
| Character Error Rate (CER) | Character-level accuracy | <5% |
| Word Error Rate (WER) | Word-level accuracy | <10% |
| Table Detection | Tables correctly identified | >95% |
| Layout Accuracy | Structure preserved | >90% |
| Processing Speed | Pages per minute | >10 |
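
CER and WER are both edit-distance ratios: Levenshtein distance between reference and hypothesis, normalized by reference length, computed over characters for CER and over tokens for WER. A self-contained sketch:

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance, usable over characters or token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1]

def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: char-level edits / reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)

def wer(ref: str, hyp: str) -> float:
    """Word Error Rate: token-level edits / reference word count."""
    return edit_distance(ref.split(), hyp.split()) / max(len(ref.split()), 1)

print(round(cer("kitten", "sitting"), 3))  # -> 0.5
```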

Activities

M4-M6: Technology Selection

| Activity | Owner | Output |
| --- | --- | --- |
| Install and configure PyMuPDF + Tesseract | FHGR | Working pipeline |
| Install and configure Docling | FHGR | Working pipeline |
| Install and configure PaddleOCR | FHGR | Working pipeline |
| Create evaluation test set (50 docs) | FHGR | Test dataset |
| Complete digitization benchmark | FHGR | Benchmark results |
| Select primary OCR technology | FHGR | Selection report |
| Document technology rationale | FHGR | D3.1 |

M7-M12: Development

| Activity | Owner | Output |
| --- | --- | --- |
| Implement hierarchical attention | FHGR | Attention module |
| Develop table extraction | FHGR | Table parser |
| Handle multi-language content | FHGR | Language detection |
| Integrate with WP2 models | FHGR | Unified pipeline |
| Build extraction prototypes | FHGR | D3.2 |

M13-M15: Validation

| Activity | Owner | Output |
| --- | --- | --- |
| Validate on 100 documents | FHGR | Validation results |
| Optimize performance | FHGR | Performance report |
| Document validation results | FHGR | D3.3 |

Deliverables

| ID | Deliverable | Due | Owner | Status |
| --- | --- | --- | --- | --- |
| D3.1 | Technology evaluation report | M6 | FHGR | Complete |
| D3.2 | Document extraction prototypes | M12 | Wecan | Complete |
| D3.3 | Validation report (100 docs) | M15 | FHGR | Complete |

All deliverable templates are complete; see deliverables/ for details.


Document Processing Pipeline

Input Handling

| Format | Handling | Notes |
| --- | --- | --- |
| Native PDF | Direct text extraction | Preserve layout |
| Scanned PDF | OCR + layout analysis | Multi-pass if needed |
| Image files | OCR | JPEG, PNG, TIFF |
| Mixed mode | Detect and route | Per-page decision |
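
The per-page routing decision for mixed-mode documents can be sketched as a check on the page's embedded text layer: a scanned page typically yields an empty or near-empty layer. The 50-character threshold below is an assumed heuristic, not a project-specified value, and the function name is illustrative.

```python
def route_page(embedded_text: str, min_chars: int = 50) -> str:
    """Per-page routing for mixed-mode PDFs: if the page carries a
    usable embedded text layer, extract it directly; otherwise send
    the rendered page image to OCR. min_chars is an assumed cutoff."""
    return "native" if len(embedded_text.strip()) >= min_chars else "ocr"

# Simulated text layers for three pages of a mixed-mode document.
pages = ["", "x" * 400, "   \n  "]
print([route_page(p) for p in pages])  # -> ['ocr', 'native', 'ocr']
```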

Field Extraction Types

| Field Type | Method | Accuracy Target |
| --- | --- | --- |
| Text fields | NER + context | 95% |
| Numeric values | Pattern + validation | 98% |
| Dates | Pattern + normalization | 98% |
| Tables | Structure detection | 90% |
| Checkboxes | Visual detection | 95% |
| Signatures | Presence detection | 90% |
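
For dates and numeric values, "pattern + validation/normalization" can be illustrated as regex matching followed by a semantic check. This sketch assumes Swiss conventions (DD.MM.YYYY dates, apostrophe thousands separators); the patterns and names are illustrative, not the project's actual extractors.

```python
import re
from datetime import date
from typing import Optional

DATE_RE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")    # DD.MM.YYYY
AMOUNT_RE = re.compile(r"\d{1,3}(?:'\d{3})*(?:\.\d{2})?")     # e.g. 1'250.00

def normalize_date(text: str) -> Optional[date]:
    """Pattern + normalization: match, then validate against the
    calendar so impossible dates such as 31.02.2024 are rejected."""
    m = DATE_RE.search(text)
    if not m:
        return None
    d, month, y = map(int, m.groups())
    try:
        return date(y, month, d)
    except ValueError:
        return None

def parse_amount(text: str) -> Optional[float]:
    """Pattern + validation for Swiss-formatted amounts."""
    m = AMOUNT_RE.search(text)
    return float(m.group().replace("'", "")) if m else None

print(normalize_date("Datum: 31.12.2024"))   # -> 2024-12-31
print(parse_amount("Betrag: CHF 1'250.00"))  # -> 1250.0
```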

Language Support

| Language | Priority | Training Data |
| --- | --- | --- |
| German | High | 40% of corpus |
| French | High | 30% of corpus |
| Italian | Medium | 15% of corpus |
| English | Medium | 15% of corpus |
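
Routing a page to the right OCR language model requires language detection first. As a toy illustration only, a stop-word tally can separate the four target languages; a production system would use a trained detector (e.g. fastText), and the word lists here are an assumption, not project data.

```python
STOPWORDS = {
    "de": {"der", "die", "das", "und", "nicht", "mit", "für"},
    "fr": {"le", "les", "et", "ne", "pas", "avec", "pour"},
    "it": {"il", "che", "e", "non", "con", "per", "della"},
    "en": {"the", "and", "not", "with", "for", "is", "of"},
}

def detect_language(text: str) -> str:
    """Score each language by stop-word hits; return the best match."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words)
              for lang, sw in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(detect_language("Der Kunde und die Bank"))  # -> de
```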

Long Document Strategies

Challenge: Context Window Limits

Current LLMs have context windows of roughly 4K-128K tokens. At an assumed ~500 tokens per page, a 100-page compliance document is on the order of 50,000 tokens, so it can exceed the usable window once prompts and extraction instructions are added.

Solutions Evaluated

| Strategy | Description | Trade-offs |
| --- | --- | --- |
| Chunking | Split document, process chunks | Context loss at boundaries |
| Sliding Window | Overlapping chunks | Redundant processing |
| Hierarchical | Page -> Section -> Document | Complexity, but preserves context |
| Map-Reduce | Extract per page, aggregate | May miss cross-page references |
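
The sliding-window strategy in the table can be sketched in a few lines: chunks of a fixed size advance by `size - overlap` tokens, so every boundary is seen twice at the cost of reprocessing the overlap. This assumes tokenization has already happened; the sizes are illustrative.

```python
def sliding_chunks(tokens: list[str], size: int = 1000, overlap: int = 200):
    """Yield overlapping chunks; the overlap reduces context loss at
    boundaries at the cost of redundant processing."""
    step = size - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + size]

tokens = [f"t{i}" for i in range(2500)]
chunks = list(sliding_chunks(tokens, size=1000, overlap=200))
print(len(chunks))  # -> 3 chunks covering all 2,500 tokens
```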

Selected Approach: Hierarchical Attention

  1. Page Level: Extract all fields from each page
  2. Section Level: Group pages by document section, resolve cross-page entities
  3. Document Level: Validate consistency, resolve conflicts
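
The document-level step above can be sketched as a merge of per-page extractions with conflict resolution by confidence. This is a minimal illustration under the assumption that each page yields `field -> (value, confidence)` pairs; the field names and data shapes are hypothetical.

```python
def aggregate(page_fields: list[dict]) -> dict:
    """Document-level consistency step: merge per-page extractions,
    keeping the highest-confidence value for each field name."""
    best: dict[str, tuple[str, float]] = {}
    for fields in page_fields:                      # one dict per page
        for name, (value, conf) in fields.items():
            if name not in best or conf > best[name][1]:
                best[name] = (value, conf)
    return {name: value for name, (value, _) in best.items()}

# Page 2 re-reads the client name badly but adds a new field.
pages = [
    {"client_name": ("ACME AG", 0.97)},
    {"client_name": ("ACME A6", 0.61), "iban": ("CH93 0076 2011 6238 5295 7", 0.99)},
]
print(aggregate(pages))
```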

Objective Alignment

| Objective | WP3 Contribution |
| --- | --- |
| OBJ1: 90% Document Accuracy | Primary owner (extraction accuracy) |
| OBJ4: <2 Hours Processing | Performance optimization |
| OBJ8: 500 Multilingual Documents | Multi-language OCR |

GitHub Issue: #430 - Document Accuracy Blind Assessment Protocol


Milestone Checkpoints

MS1 (M4)

MS2 (M6)

MS3 (M12)

MS4 (M15)


Integration Points

From WP2

| Input | Description | Timeline |
| --- | --- | --- |
| Domain-adapted models | Fine-tuned LLMs | M6 |
| Hallucination detection | Validation methods | M6 |
| Annotated dataset | Training/test data | M12 |

To WP4

| Output | Description | Timeline |
| --- | --- | --- |
| Extracted fields | Structured data | M12 |
| Confidence scores | Per-field certainty | M12 |
| Document structure | Section/page hierarchy | M12 |

Performance Targets

| Metric | Target | Measurement |
| --- | --- | --- |
| Accuracy | 90% field-level | Blind assessment |
| Speed | 10+ pages/minute | Benchmark suite |
| Memory | <24GB peak | GPU monitoring |
| Languages | 4 (DE, FR, IT, EN) | Per-language metrics |
