A Specialized and Secure AI Orchestrator for Swiss Financial Compliance
| Attribute | Value |
|---|---|
| Duration | M4-M15 |
| FHGR Hours | 600h |
| Wecan Hours | 500h |
| Total Hours | 1,100h |
| Lead | FHGR Research Lead |
```
Scanned Document
        |
        v
+----------------+
| Pre-processing | <-- Deskew, denoise, enhance
+----------------+
        |
        v
+----------------+
|  OCR Ensemble  | <-- PyMuPDF + Tesseract + PaddleOCR
+----------------+
        |
        v
+----------------+
| Layout Analysis| <-- Docling, LayoutLM
+----------------+
        |
        v
+----------------+
|  Text Merging  | <-- Confidence-weighted fusion
+----------------+
        |
        v
Structured Output
```
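The confidence-weighted fusion step can be sketched as a token-level vote across engines. The engine names, confidence values, and voting rule below are illustrative assumptions, not the project's final design:

```python
# Confidence-weighted fusion of OCR outputs (illustrative sketch).

def fuse_ocr_outputs(candidates):
    """Pick, per token position, the word backed by the highest summed
    engine confidence. `candidates` maps engine -> (tokens, confidence)."""
    max_len = max(len(tokens) for tokens, _ in candidates.values())
    fused = []
    for i in range(max_len):
        votes = {}  # word -> summed confidence across engines
        for tokens, conf in candidates.values():
            if i < len(tokens):
                votes[tokens[i]] = votes.get(tokens[i], 0.0) + conf
        fused.append(max(votes, key=votes.get))
    return " ".join(fused)

candidates = {
    "tesseract": ("Zurich account ba1ance".split(), 0.80),
    "paddleocr": ("Zurich account balance".split(), 0.92),
    "pymupdf":   ("Zurich acc0unt balance".split(), 0.75),
}
print(fuse_ocr_outputs(candidates))  # -> Zurich account balance
```

A real implementation would also need to align tokens across engines before voting, since OCR outputs rarely agree on token boundaries.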
```
Document (100 pages)
        |
        v
+------------------+
|   Page-Level     | <-- Process each page
|   Attention      |
+------------------+
        |
        v
+------------------+
|  Section-Level   | <-- Group by document sections
|   Attention      |
+------------------+
        |
        v
+------------------+
|  Document-Level  | <-- Global context
|   Attention      |
+------------------+
        |
        v
 Extracted Fields
```
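A minimal sketch of the page → section → document aggregation, with a toy extractor standing in for the model call (an assumption; WP2's fine-tuned models would do the real extraction):

```python
# Hierarchical extraction sketch: page -> section -> document.

def extract_fields(text):
    # Toy stand-in extractor: capitalized tokens as candidate field values.
    return {w for w in text.split() if w[:1].isupper()}

def hierarchical_extract(pages, sections):
    """`sections` maps a section name to the page indices it covers."""
    page_fields = [extract_fields(p) for p in pages]           # page level
    section_fields = {
        name: set().union(*(page_fields[i] for i in idxs))     # section level
        for name, idxs in sections.items()
    }
    document_fields = set().union(*section_fields.values())    # document level
    return page_fields, section_fields, document_fields

pages = ["Client Alice Meier", "Domicile Geneva", "Risk profile low"]
sections = {"identity": [0, 1], "risk": [2]}
_, sec, doc = hierarchical_extract(pages, sections)
```

The point of the hierarchy is that each level only sees already-reduced output from the level below, which keeps every individual model call within the context window.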
| Technology | Strengths | Weaknesses | Status |
|---|---|---|---|
| PyMuPDF | Fast native-PDF extraction | No OCR for scanned pages | Evaluate |
| Tesseract | Open source, multilingual | Accuracy varies | Evaluate |
| PaddleOCR | High accuracy, tables | Chinese-focused | Evaluate |
| Docling | Layout-aware | Newer, less tested | Evaluate |
| EasyOCR | Simple API | Less accurate | Evaluate |
| Metric | Description | Target |
|---|---|---|
| Character Error Rate (CER) | Character-level accuracy | <5% |
| Word Error Rate (WER) | Word-level accuracy | <10% |
| Table Detection | Tables correctly identified | >95% |
| Layout Accuracy | Structure preserved | >90% |
| Processing Speed | Pages per minute | >10 |
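CER and WER can be computed with a plain edit-distance implementation; a production benchmark would more likely use a library such as jiwer (an assumption). A minimal sketch:

```python
# Character/Word Error Rate via Levenshtein distance (pure-Python sketch).

def levenshtein(a, b):
    """Edit distance between two sequences (strings or token lists)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def cer(reference, hypothesis):
    return levenshtein(reference, hypothesis) / len(reference)

def wer(reference, hypothesis):
    ref, hyp = reference.split(), hypothesis.split()
    return levenshtein(ref, hyp) / len(ref)

print(wer("saldo per ende jahr", "saldo par ende jahr"))  # -> 0.25
```

Both rates are normalized by the reference length, so the <5% CER and <10% WER targets read directly off these functions.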
| Activity | Owner | Output |
|---|---|---|
| Install and configure PyMuPDF+Tesseract | FHGR | Working pipeline |
| Install and configure Docling | FHGR | Working pipeline |
| Install and configure PaddleOCR | FHGR | Working pipeline |
| Create evaluation test set (50 docs) | FHGR | Test dataset |
| Complete digitization benchmark | FHGR | Benchmark results |
| Select primary OCR technology | FHGR | Selection report |
| Document technology rationale | FHGR | D3.1 |
| Activity | Owner | Output |
|---|---|---|
| Implement hierarchical attention | FHGR | Attention module |
| Develop table extraction | FHGR | Table parser |
| Handle multi-language content | FHGR | Language detection |
| Integrate with WP2 models | FHGR | Unified pipeline |
| Build extraction prototypes | FHGR | D3.2 |
| Activity | Owner | Output |
|---|---|---|
| Validate on 100 documents | FHGR | Validation results |
| Optimize performance | FHGR | Performance report |
| Document validation results | FHGR | D3.3 |
| ID | Deliverable | Due | Owner | Status |
|---|---|---|---|---|
| D3.1 | Technology evaluation report | M6 | FHGR | Complete |
| D3.2 | Document extraction prototypes | M12 | Wecan | Complete |
| D3.3 | Validation report (100 docs) | M15 | FHGR | Complete |
All deliverable templates are complete; see deliverables/ for details.
| Format | Handling | Notes |
|---|---|---|
| Native PDF | Direct text extraction | Preserve layout |
| Scanned PDF | OCR + layout analysis | Multi-pass if needed |
| Image files | OCR | JPEG, PNG, TIFF |
| Mixed mode | Detect and route | Per-page decision |
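The per-page routing decision can be sketched as follows; the character threshold is an illustrative assumption. With PyMuPDF, the inputs would come from `page.get_text()` and `page.get_images()`:

```python
# Per-page routing sketch: direct extraction vs. OCR (mixed-mode policy).

def route_page(native_text, has_images, min_chars=25):
    """Decide the handling path for a single page."""
    if len(native_text.strip()) >= min_chars:
        return "direct"   # usable native text layer: extract directly
    if has_images:
        return "ocr"      # scanned page: send to the OCR ensemble
    return "empty"        # nothing to extract

print(route_page("Kontoauszug per 31.12.2024, Saldo CHF 12'450.00", False))
# -> direct
print(route_page("", True))
# -> ocr
```

Routing per page rather than per document matters because compliance dossiers often interleave born-digital pages with scanned attachments.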
| Field Type | Method | Accuracy Target |
|---|---|---|
| Text fields | NER + context | 95% |
| Numeric values | Pattern + validation | 98% |
| Dates | Pattern + normalization | 98% |
| Tables | Structure detection | 90% |
| Checkboxes | Visual detection | 95% |
| Signatures | Presence detection | 90% |
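For dates and numeric values, pattern matching plus normalization might look like the sketch below; the Swiss-style formats (DD.MM.YYYY dates, apostrophe thousands separators) are assumptions about the target corpus:

```python
# Pattern + normalization sketch for date and numeric fields.
import re

DATE_RE = re.compile(r"\b(\d{1,2})\.(\d{1,2})\.(\d{4})\b")     # 31.12.2024
AMOUNT_RE = re.compile(r"\b\d{1,3}(?:'\d{3})*(?:\.\d{2})?\b")  # 12'450.00

def normalize_date(text):
    """Return the first matched date as ISO 8601, or None."""
    m = DATE_RE.search(text)
    if not m:
        return None
    d, mo, y = m.groups()
    return f"{y}-{int(mo):02d}-{int(d):02d}"

def parse_amount(text):
    """Return the first matched amount as a float, or None."""
    m = AMOUNT_RE.search(text)
    return float(m.group().replace("'", "")) if m else None

print(normalize_date("Per 31.12.2024"))     # -> 2024-12-31
print(parse_amount("Saldo CHF 12'450.00"))  # -> 12450.0
```

Validation (e.g. rejecting month 13, or amounts outside plausible ranges) would sit on top of these parsers, which is what pushes the numeric and date targets to 98%.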
| Language | Priority | Training Data |
|---|---|---|
| German | High | 40% of corpus |
| French | High | 30% of corpus |
| Italian | Medium | 15% of corpus |
| English | Medium | 15% of corpus |
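Per-document language routing could start from a simple function-word heuristic like the sketch below; a real pipeline would more likely use a trained detector such as fastText or langdetect (an assumption):

```python
# Function-word language heuristic for the four corpus languages (sketch).

STOPWORDS = {
    "de": {"der", "die", "das", "und", "nicht", "mit", "für"},
    "fr": {"le", "la", "les", "et", "pas", "avec", "pour"},
    "it": {"il", "gli", "e", "non", "con", "per", "una"},
    "en": {"the", "and", "not", "with", "for", "of", "is"},
}

def detect_language(text):
    """Return the language whose function words overlap the text most."""
    tokens = set(text.lower().split())
    scores = {lang: len(tokens & words) for lang, words in STOPWORDS.items()}
    return max(scores, key=scores.get)

print(detect_language("Kontoauszug für die Kundin und der Saldo"))  # -> de
```

The detected language would then select the matching OCR model and extraction prompts downstream.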
Current LLMs have context windows of roughly 4K-128K tokens; a 100-page document can exceed even the larger of these limits.
| Strategy | Description | Trade-offs |
|---|---|---|
| Chunking | Split document, process chunks | Context loss at boundaries |
| Sliding Window | Overlapping chunks | Redundant processing |
| Hierarchical | Page -> Section -> Document | Complexity, but preserves context |
| Map-Reduce | Extract per page, aggregate | May miss cross-page references |
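The sliding-window strategy can be sketched in a few lines; the window and overlap sizes are illustrative assumptions:

```python
# Sliding-window chunking sketch: overlapping windows so that entities
# spanning a chunk boundary appear intact in at least one window.

def sliding_chunks(tokens, window=1000, overlap=200):
    step = window - overlap
    for start in range(0, max(len(tokens) - overlap, 1), step):
        yield tokens[start:start + window]

tokens = list(range(2500))
chunks = list(sliding_chunks(tokens, window=1000, overlap=200))
# -> 3 windows starting at token 0, 800, and 1600
```

The overlap is the price paid for boundary safety: each token inside an overlap region is processed twice, which is the "redundant processing" trade-off noted in the table.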
| Objective | WP3 Contribution |
|---|---|
| OBJ1: 90% Document Accuracy | Primary owner (extraction accuracy) |
| OBJ4: < 2 Hours Processing | Performance optimization |
| OBJ8: 500 Multilingual Documents | Multi-language OCR |
GitHub Issue: #430 - Document Accuracy Blind Assessment Protocol
| Input | Description | Timeline |
|---|---|---|
| Domain-adapted models | Fine-tuned LLMs | M6 |
| Hallucination detection | Validation methods | M6 |
| Annotated dataset | Training/test data | M12 |
| Output | Description | Timeline |
|---|---|---|
| Extracted fields | Structured data | M12 |
| Confidence scores | Per-field certainty | M12 |
| Document structure | Section/page hierarchy | M12 |
| Metric | Target | Measurement |
|---|---|---|
| Accuracy | 90% field-level | Blind assessment |
| Speed | 10+ pages/minute | Benchmark suite |
| Memory | <24GB peak | GPU monitoring |
| Languages | 4 (DE, FR, IT, EN) | Per-language metrics |
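The pages-per-minute target could be measured with a small harness like this sketch, where `process_page` is a placeholder for the full digitization pipeline:

```python
# Throughput measurement sketch for the 10+ pages/minute target.
import time

def pages_per_minute(process_page, pages):
    """Run the pipeline over `pages` and return throughput in pages/min."""
    start = time.perf_counter()
    for page in pages:
        process_page(page)
    elapsed = time.perf_counter() - start
    return 60 * len(pages) / elapsed
```

In the benchmark suite this would be run over a fixed document set with GPU memory sampled alongside, so the speed and memory targets are measured in the same pass.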