InvoiceSense: OCR-Driven Document Understanding

Project Overview

InvoiceSense is a Python-based pipeline for end-to-end document understanding, built on the Ghega dataset. It extracts text from invoice images via OCR, applies rule-based NLP to pull out key fields (Voltage, Storage Temperature, Power Dissipation), and evaluates extraction quality against ground truth.

Key Features

Dataset Loader: Groups `.png`, `.blocks.csv`, and `.groundtruth.csv` files into per-document structures.
OCR Extraction:
- `extract_text_from_image`: full-page Tesseract OCR
- `extract_text_from_blocks`: CSV block-level OCR text
Rule-Based Entity Extraction: Regex-driven extraction of Voltage, Storage Temperature, and Power Dissipation.
Pipeline Integration: One-command runner to perform OCR → NLP on a single document or the full dataset.
Evaluation Scripts:
- `enhanced_evaluate.py`: value-only metrics
- `per_field_evaluate.py`: per-field precision and recall
Debug Tools: Inspect OCR text, regex matches, and ground truth side by side for any document.

Pipeline Components

1. Data Loader

The loader scans a directory of Ghega files, pairing each invoice image (`.png`) with its block-level OCR output (`.blocks.csv`) and ground-truth CSV (`.groundtruth.csv`), yielding a unified Python object per document.
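The grouping step can be sketched roughly as follows. This is a minimal illustration, not the project's actual loader: the `Document` class, the `group_documents` function, and the assumption that the three files of a document share the same name up to the suffix are all hypothetical.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, Optional

# Suffixes named in the text, mapped to hypothetical attribute names.
SUFFIXES = {
    ".groundtruth.csv": "groundtruth",
    ".blocks.csv": "blocks",
    ".png": "image",
}


@dataclass
class Document:
    """Paths belonging to one document; any of the three may be missing."""
    doc_id: str
    image: Optional[Path] = None
    blocks: Optional[Path] = None
    groundtruth: Optional[Path] = None


def group_documents(root: Path) -> Dict[str, Document]:
    """Group files under `root` by the part of the file name before the suffix.

    ASSUMPTION: files of one document share an identical name prefix.
    """
    docs: Dict[str, Document] = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        for suffix, attr in SUFFIXES.items():
            if path.name.endswith(suffix):
                doc_id = path.name[: -len(suffix)]
                doc = docs.setdefault(doc_id, Document(doc_id))
                setattr(doc, attr, path)
                break
    return docs
```

Keeping missing files as `None` rather than dropping the document makes incomplete triples easy to report in a summary.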

2. OCR Extraction

Tesseract produces full-page text from each image, and the OCRopus block CSVs (`.blocks.csv`) supply block-level text. Using both sources is intended to improve recall on handwritten or low-contrast fields.

Example: Raw invoice image → Tesseract text + block-level CSV alignments.
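The two text sources might look like the sketch below. The function names are borrowed from the feature list, but the bodies, signatures, and especially the CSV layout are assumptions: the block reader expects a header row with a column named `text`, which may not match the real Ghega block files. The image path uses `pytesseract.image_to_string`, which needs the Tesseract binary installed.

```python
import csv
from pathlib import Path


def extract_text_from_image(image_path: Path) -> str:
    """Full-page OCR; requires Tesseract plus the pytesseract and Pillow packages."""
    # Imported lazily so the CSV reader below works without OCR dependencies.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def extract_text_from_blocks(blocks_csv: Path, text_column: str = "text") -> str:
    """Join the text of every block row, one block per line.

    ASSUMPTION: the CSV has a header row and a column called `text_column`.
    Rows with empty or missing text are skipped.
    """
    with open(blocks_csv, newline="", encoding="utf-8") as fh:
        rows = csv.DictReader(fh)
        texts = [(row.get(text_column) or "").strip() for row in rows]
    return "\n".join(t for t in texts if t)
```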

3. Entity Extraction

Regex rules pinpoint numerical fields: Voltage, Storage Temperature, and Power Dissipation. Each field type has tailored patterns to handle units, decimals, and common OCR errors.

Example: detected entities overlaid on OCR text for review and debugging.
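A rough sketch of what such rules could look like is shown here. The patterns are illustrative guesses rather than the project's actual regexes: they assume each value follows its label within about 20 characters, that storage temperature is written as a range, and they tolerate only one OCR quirk (a degree sign read as `o` or `º`).

```python
import re
from typing import Dict, Optional

NUM = r"[-+]?\d+(?:\.\d+)?"   # signed integer or decimal
GAP = r"[^\d+-]{0,20}"        # short filler between label and value (assumed limit)

# Hypothetical patterns; the named group `value` holds number(s) plus unit.
PATTERNS = {
    "Voltage": re.compile(
        rf"voltage{GAP}(?P<value>{NUM}\s*(?:mV|kV|V))\b", re.IGNORECASE
    ),
    "Storage Temperature": re.compile(
        rf"storage\s+temp(?:erature)?{GAP}"
        rf"(?P<value>{NUM}\s*(?:to|~)\s*{NUM}\s*[°ºo]?\s*C)\b",
        re.IGNORECASE,
    ),
    "Power Dissipation": re.compile(
        rf"power\s+dissipation{GAP}(?P<value>{NUM}\s*(?:mW|W))\b", re.IGNORECASE
    ),
}


def extract_entities(text: str) -> Dict[str, Optional[str]]:
    """Return the first match per field, or None when a field is not found."""
    result: Dict[str, Optional[str]] = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        # Collapse internal whitespace so line breaks inside a value vanish.
        result[field] = " ".join(match.group("value").split()) if match else None
    return result


sample = (
    "Supply Voltage: 5.5 V\n"
    "Storage Temperature: -65 to +150 °C\n"
    "Power Dissipation 500 mW"
)
print(extract_entities(sample))
```

Returning `None` for a missing field, instead of omitting the key, keeps every result the same shape for the evaluation step.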

4. Evaluation & Debug

Run the built-in evaluators to compare extracted values against ground truth: `enhanced_evaluate.py` reports value-only metrics, and `per_field_evaluate.py` reports precision and recall per field. Quick-start debug scripts let you inspect one document’s OCR text, regex matches, and ground truth side by side.

Sample output: field-level precision & recall across the Ghega test set.
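Per-field precision and recall can be computed along these lines. This is a sketch under stated assumptions, not the logic of `per_field_evaluate.py`: values are compared as exact strings after removing whitespace and case, and a wrong prediction is counted as both a false positive and a false negative. The names `normalize` and `per_field_scores` are hypothetical.

```python
from typing import Dict, Iterable, Optional, Sequence, Tuple

Record = Dict[str, Optional[str]]


def normalize(value: Optional[str]) -> Optional[str]:
    """Strip all whitespace and lowercase; empty strings become None."""
    if value is None:
        return None
    cleaned = "".join(value.split()).lower()
    return cleaned or None


def per_field_scores(
    pairs: Iterable[Tuple[Record, Record]], fields: Sequence[str]
) -> Dict[str, Dict[str, float]]:
    """Compute precision and recall per field from (predicted, truth) pairs."""
    counts = {f: {"tp": 0, "fp": 0, "fn": 0} for f in fields}
    for predicted, truth in pairs:
        for f in fields:
            p, t = normalize(predicted.get(f)), normalize(truth.get(f))
            if p is not None and p == t:
                counts[f]["tp"] += 1
                continue
            if p is not None:   # predicted something wrong or spurious
                counts[f]["fp"] += 1
            if t is not None:   # a true value was missed or mismatched
                counts[f]["fn"] += 1
    scores = {}
    for f, c in counts.items():
        predicted_total = c["tp"] + c["fp"]
        truth_total = c["tp"] + c["fn"]
        scores[f] = {
            "precision": c["tp"] / predicted_total if predicted_total else 0.0,
            "recall": c["tp"] / truth_total if truth_total else 0.0,
        }
    return scores
```

Exact matching after normalization is strict; a numeric comparison with unit conversion would be more forgiving but needs field-specific parsing.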

Usage & Testing

After cloning and installing (see the repo README), you can:

- `python data/dataset_loader.py`: load and summarize documents
- `python pipeline/integration.py`: run the full OCR → NLP pipeline
- `python evaluation/enhanced_evaluate.py`: compute value-only metrics
- `python evaluation/per_field_evaluate.py`: compute precision and recall
- `pytest`: run unit tests for OCR, NLP, and integration