
bank-contacts

FINMA-regulated banks and asset managers contact database for academic research



Information

| Property | Value |
|---|---|
| Language | Python |
| Stars | 0 |
| Forks | 0 |
| Watchers | 0 |
| Open Issues | 0 |
| License | No License |
| Created | 2026-01-20 |
| Last Updated | 2026-01-21 |
| Last Push | 2026-01-21 |
| Contributors | 1 |
| Default Branch | master |
| Visibility | private |

Datasets

This repository includes 11 datasets:

| Dataset | Format | Size |
|---|---|---|
| data | | 0.0 KB |
| asset_managers.json | .json | 458.82 KB |
| banks.json | .json | 62.31 KB |
| banks_scraped.json | .json | 41.46 KB |
| combined.csv | .csv | 208.56 KB |
| combined.json | .json | 522.19 KB |
| combined_final.json | .json | 532.89 KB |
| final.json | .json | 1079.72 KB |
| fintech.json | .json | 1.06 KB |
| scraping_report.html | .html | 2.63 KB |
| data.json | .json | 1079.72 KB |

Reproducibility

This repository includes reproducibility tools:

  • Python requirements.txt

Status

  • Issues: Enabled
  • Wiki: Disabled
  • Pages: Enabled

README

FINMA Regulated Entities Contact Database

A comprehensive database of FINMA-regulated Swiss financial institutions with executive contacts, board members, and quantitative role identification for academic research outreach.

Features

  • 2,171 institutions (270 banks, 1,897 asset managers, 4 fintech)
  • Executive contacts from ZEFIX, company websites, and annual reports
  • Quantitative role identification (Head of Quant, Data Science, Risk Analytics)
  • Email inference with validation using DNS/SMTP verification
  • Stealth scraping with anti-detection measures

Data Sources

| Source | Description | URL / Method |
|---|---|---|
| FINMA | Official list of authorized institutions | finma.ch |
| ZEFIX | Swiss Commercial Register (board members) | zefix.admin.ch |
| Websites | Company team/management pages | Deep scraping |
| PDFs | Annual report org charts | Text extraction |

Data Fields

| Field | Availability | Source |
|---|---|---|
| Institution Name | 100% | FINMA |
| City/Canton | 100% | FINMA |
| License Type | 100% | FINMA |
| Website | ~90% | Scraped |
| Board Members | ~92% | ZEFIX |
| Executives | ~69% | Website/PDF |
| Quant Contacts | ~14-23% | Website/PDF |
| High-Confidence Emails | ~37-55% | Inferred+Validated |

Quick Start

Prerequisites

# Create Python 3.11 environment (required for undetected-chromedriver)
conda create -n selenium_scraper python=3.11
conda activate selenium_scraper

# Install dependencies
pip install -r requirements.txt

Run Full Pipeline

cd scripts
python scraping_orchestrator.py --input combined.json

This runs all stages:

  1. ZEFIX scraping (~6 hours)
  2. Website deep scraping (~12 hours)
  3. PDF extraction (~4 hours)
  4. Email validation (~2 hours)

Run Individual Stages

# Stage 1: ZEFIX (board members)
python zefix_scraper.py --input combined.json --output zefix_enriched.json

# Stage 2: Website scraping (executives, quant roles)
python website_deep_scraper.py --input zefix_enriched.json --output website_enriched.json

# Stage 3: PDF extraction (annual reports)
python pdf_extractor.py --input website_enriched.json --output pdf_enriched.json

# Stage 4: Email validation
python email_validator.py --input pdf_enriched.json --output final.json

# Generate exports
python export_quants.py --input final.json
python generate_site.py

Test with Small Sample

python scraping_orchestrator.py --limit 10 --visible

Output Files

| File | Description |
|---|---|
| data/final.json | Complete enriched data |
| data/quant_contacts.csv | Quantitative role contacts for outreach |
| data/quant_contacts_high_conf.csv | High-confidence emails only |
| data/scraping_report.html | Visual progress report |
| docs/index.html | Interactive GitHub Pages site |

Pipeline Architecture

combined.json
    |
    v
[ZEFIX Scraper] --> zefix_enriched.json
    |                (board members, UIDs)
    v
[Website Scraper] --> website_enriched.json
    |                 (executives, quant roles, LinkedIn)
    v
[PDF Extractor] --> pdf_enriched.json
    |               (annual report extraction)
    v
[Email Validator] --> final.json
    |                 (DNS/SMTP validation)
    v
[Exports]
    ├── quant_contacts.csv
    ├── scraping_report.html
    └── GitHub Pages site
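
The chaining in the diagram, where each stage reads the file written by the previous one, can be sketched as follows. This is only an illustration: `STAGES` and `build_commands` are hypothetical names, and the real `scraping_orchestrator.py` may wire the stages differently. The script names, file names, and the `--input`/`--output` flags are taken from the individual-stage commands in this README.

```python
from typing import List, Tuple

# (script, output file) pairs in the order shown in the diagram.
STAGES: List[Tuple[str, str]] = [
    ("zefix_scraper.py", "zefix_enriched.json"),
    ("website_deep_scraper.py", "website_enriched.json"),
    ("pdf_extractor.py", "pdf_enriched.json"),
    ("email_validator.py", "final.json"),
]


def build_commands(first_input: str = "combined.json") -> List[List[str]]:
    """Return one argv list per stage, feeding each output into the next stage."""
    commands: List[List[str]] = []
    current_input = first_input
    for script, output in STAGES:
        commands.append(
            ["python", script, "--input", current_input, "--output", output]
        )
        current_input = output  # the next stage reads what this one wrote
    return commands
```

Printing `" ".join(argv)` for each entry of `build_commands()` gives the four stage commands in order; the function only builds the command lines and does not run anything.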

Script Descriptions

| Script | Purpose |
|---|---|
| selenium_core.py | Stealth browser driver with anti-detection |
| zefix_scraper.py | ZEFIX web interface scraper (bypasses API) |
| website_deep_scraper.py | Deep website scraping for team pages |
| pdf_extractor.py | Annual report org chart extraction |
| email_validator.py | DNS/SMTP email validation |
| scraping_orchestrator.py | Pipeline coordinator |
| export_quants.py | CSV export for quant contacts |
| generate_site.py | GitHub Pages site generator |

Configuration

Edit config/scraping_config.yaml:

zefix:
  min_delay: 3.0       # Seconds between requests
  max_delay: 6.0
  max_requests_per_session: 25

website:
  min_delay: 2.0
  max_delay: 5.0
  max_pages_per_site: 5

general:
  headless: true       # Set false to see browser
  checkpoint_interval: 10
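
A quick sanity check of these settings before a long run might look like the sketch below. It assumes the YAML file has already been loaded into a plain dict (for example with a YAML library); `CONFIG` mirrors the values shown here, and `find_config_problems` is a hypothetical helper, not part of the repository's scripts.

```python
from typing import Any, Dict, List

# The settings from config/scraping_config.yaml as a plain dict,
# i.e. roughly what a YAML loader would return (assumption).
CONFIG: Dict[str, Dict[str, Any]] = {
    "zefix": {"min_delay": 3.0, "max_delay": 6.0, "max_requests_per_session": 25},
    "website": {"min_delay": 2.0, "max_delay": 5.0, "max_pages_per_site": 5},
    "general": {"headless": True, "checkpoint_interval": 10},
}


def find_config_problems(config: Dict[str, Dict[str, Any]]) -> List[str]:
    """Return human-readable problems; an empty list means the config looks sane."""
    problems: List[str] = []
    for section in ("zefix", "website"):
        sec = config.get(section, {})
        lo, hi = sec.get("min_delay"), sec.get("max_delay")
        if not isinstance(lo, (int, float)) or not isinstance(hi, (int, float)):
            problems.append(f"{section}: min_delay and max_delay must be numbers")
        elif lo < 0 or hi < lo:
            problems.append(f"{section}: need 0 <= min_delay <= max_delay")
    interval = config.get("general", {}).get("checkpoint_interval")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(interval, int) or isinstance(interval, bool) or interval < 1:
        problems.append("general: checkpoint_interval must be a positive integer")
    return problems
```

With the values shown in this section, `find_config_problems(CONFIG)` returns an empty list; swapping `min_delay` and `max_delay` in a section produces one message for that section.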

Checkpoint Recovery

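The pipeline saves progress every 10 institutions (the checkpoint_interval setting). If a run is interrupted, rerun the orchestrator to continue from the last checkpoint: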

# Resume from checkpoint (default)
python scraping_orchestrator.py

# Start fresh (ignore checkpoint)
python scraping_orchestrator.py --no-resume

Checkpoints stored in: data/checkpoints/
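
The resume behaviour described in this section could be implemented along these lines. This is a minimal sketch under stated assumptions: the checkpoint layout (`next_index` plus `results`), the file name passed in by the caller, and the functions `save_checkpoint`, `load_checkpoint`, and `run` are all hypothetical and may not match what `scraping_orchestrator.py` actually stores in data/checkpoints/.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Any, Callable, Dict, List

CHECKPOINT_INTERVAL = 10  # mirrors checkpoint_interval in the config


def save_checkpoint(path: Path, next_index: int, results: List[Any]) -> None:
    """Write to a temp file, then rename, so an interrupt never leaves a partial file."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        json.dump({"next_index": next_index, "results": results}, handle)
    os.replace(tmp_name, path)


def load_checkpoint(path: Path, resume: bool = True) -> Dict[str, Any]:
    """Return saved state, or a fresh state when resume is off or no file exists."""
    if resume and path.exists():
        with path.open(encoding="utf-8") as handle:
            return json.load(handle)
    return {"next_index": 0, "results": []}


def run(
    institutions: List[Any],
    path: Path,
    process: Callable[[Any], Any],
    resume: bool = True,
) -> List[Any]:
    """Process institutions in order, checkpointing every CHECKPOINT_INTERVAL items."""
    state = load_checkpoint(path, resume)
    results = state["results"]
    for index in range(state["next_index"], len(institutions)):
        results.append(process(institutions[index]))
        if (index + 1) % CHECKPOINT_INTERVAL == 0:
            save_checkpoint(path, index + 1, results)
    save_checkpoint(path, len(institutions), results)
    return results
```

Passing `resume=False` corresponds to the `--no-resume` flag: the saved state is ignored and processing starts again from the first institution. At most `CHECKPOINT_INTERVAL - 1` institutions are reprocessed after an interruption.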

Quantitative Role Keywords

The system searches for these roles (EN/DE/FR):

  • Head of Quantitative Research
  • Quant Analyst / Researcher
  • Chief Data Officer / Head of Data Science
  • Head of Risk Analytics / Risk Modeling
  • Quantitative Strategist
  • Machine Learning / AI Lead
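
As an illustration of how such keyword matching might work, the sketch below turns the English role names above into case-insensitive regular expressions. The patterns and the `is_quant_role` helper are assumptions for illustration; the German and French variants are not listed in this README, so none are guessed here, and the actual matching logic in the scraper may differ.

```python
import re
from typing import List

# English patterns derived from the role list above (illustrative only).
ROLE_PATTERNS: List[str] = [
    r"head of quantitative research",
    r"quant(?:itative)? (?:analyst|researcher)",
    r"chief data officer",
    r"head of data science",
    r"head of risk (?:analytics|modell?ing)",  # accepts "modeling" and "modelling"
    r"quantitative strategist",
    r"(?:machine learning|ai) lead",
]

# Word boundaries keep "AI Lead" from matching inside words such as "Retail Lead".
_ROLE_RE = re.compile(r"\b(?:" + "|".join(ROLE_PATTERNS) + r")\b", re.IGNORECASE)


def is_quant_role(title: str) -> bool:
    """Return True if a job title contains one of the quantitative role patterns."""
    return _ROLE_RE.search(title) is not None
```

For example, `is_quant_role("Senior Quant Analyst")` matches the second pattern, while `is_quant_role("Chief Executive Officer")` matches none.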

Rate Limiting

To avoid detection and blocking:

  • ZEFIX: 1 request per 3-6 seconds, max 25/session
  • Websites: 1 request per 2-5 seconds, max 40/session
  • PDF downloads: 1 per second
  • Total runtime: ~24 hours for full scrape
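
A randomized delay with a per-session request cap, as listed above, can be sketched with the standard library alone. `SessionRateLimiter` is a hypothetical class for illustration, not the repository's implementation; what "starting a new session" means in practice (for example restarting the browser) is left to the caller.

```python
import random
import time
from typing import Callable, Optional


class SessionRateLimiter:
    """Sleep a random interval before each request and cap requests per session."""

    def __init__(
        self,
        min_delay: float,
        max_delay: float,
        max_per_session: int,
        sleep: Callable[[float], None] = time.sleep,
        rng: Optional[random.Random] = None,
    ) -> None:
        if not 0 <= min_delay <= max_delay:
            raise ValueError("need 0 <= min_delay <= max_delay")
        self.min_delay = min_delay
        self.max_delay = max_delay
        self.max_per_session = max_per_session
        self._sleep = sleep  # injectable so tests do not really sleep
        self._rng = rng if rng is not None else random.Random()
        self.requests_in_session = 0

    def session_exhausted(self) -> bool:
        return self.requests_in_session >= self.max_per_session

    def start_new_session(self) -> None:
        self.requests_in_session = 0

    def wait(self) -> float:
        """Block for a random delay, count the request, and return the delay used."""
        if self.session_exhausted():
            raise RuntimeError("session budget used up; call start_new_session()")
        delay = self._rng.uniform(self.min_delay, self.max_delay)
        self._sleep(delay)
        self.requests_in_session += 1
        return delay
```

With the ZEFIX values, `SessionRateLimiter(3.0, 6.0, 25)` would allow 25 calls to `wait()`, each sleeping between 3 and 6 seconds, before a new session must be started.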

View Results

Online

Visit GitHub Pages site

Locally

# `start` is the Windows command; use `open` on macOS or `xdg-open` on Linux
start data/scraping_report.html    # Progress report
start docs/index.html              # Interactive table

Command Line

# Check ZEFIX coverage
python -c "import json; d=json.load(open('data/final.json')); print(f'Board members: {sum(1 for i in d if i.get(\"board_members\"))}/{len(d)}')"

# Check quant contacts
python -c "import json; d=json.load(open('data/final.json')); q=sum(len(i.get('quant_contacts',[])) for i in d); print(f'Quant contacts: {q}')"
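
For repeated checks, the one-liners above can be generalized into a small helper. This sketch assumes, as the one-liners do, that data/final.json holds a list of institution dicts; the field names `board_members` and `quant_contacts` come from those commands, while `coverage` and `report` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Sequence


def coverage(institutions: List[Dict[str, Any]], field: str) -> float:
    """Percentage of institutions with a non-empty value for `field`."""
    if not institutions:
        return 0.0
    filled = sum(1 for inst in institutions if inst.get(field))
    return 100.0 * filled / len(institutions)


def report(
    path: Path,
    fields: Sequence[str] = ("board_members", "quant_contacts"),
) -> Dict[str, float]:
    """Load the JSON file and return coverage per field, rounded to one decimal."""
    with path.open(encoding="utf-8") as handle:
        institutions = json.load(handle)
    return {field: round(coverage(institutions, field), 1) for field in fields}
```

Calling `report(Path("data/final.json"))` returns a dict mapping each field to its coverage percentage. Note that this counts institutions with at least one quant contact, whereas the second one-liner counts contacts in total.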

Data Protection

  • Only publicly available information is collected
  • Data sourced from official Swiss government registers and public company websites
  • No personal data beyond publicly listed executive roles
  • For data removal requests, please open an issue

License

Data is sourced from public Swiss government registers. This repository is for academic research purposes.

Disclaimer

This database is provided for academic research purposes only. While we strive for accuracy:

  • Executive data may change as people move between roles
  • Email addresses are inferred and may not be accurate
  • Always verify contact information before use


Last updated: January 2026