Shera Jafaritabar

Project Overview

InvoiceSense is a Python-based pipeline for end-to-end document understanding, built on the Ghega dataset. It extracts text from invoice images via OCR, applies rule-based NLP to pull out key fields (Voltage, Storage Temperature, Power Dissipation), and evaluates extraction quality against ground truth.

Key Features

Pipeline Components

1. Data Loader

The loader scans a directory of Ghega files, pairing each invoice image (`.png`) with its block-level OCR output (`.blocks.csv`) and ground-truth CSV (`.groundtruth.csv`), yielding a unified Python object per document.
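The pairing step can be sketched as follows. This is a minimal illustration, not the project's actual loader: the `Document` class and `iter_documents` function are hypothetical names, and the sketch assumes that the three files of a document share one base name (for example `doc_001.png`, `doc_001.blocks.csv`, `doc_001.groundtruth.csv`), which may not match the real Ghega naming scheme.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator


@dataclass
class Document:
    """Hypothetical container for one document and its companion files."""
    doc_id: str
    image: Path
    blocks_csv: Path
    groundtruth_csv: Path


def iter_documents(root: Path) -> Iterator[Document]:
    """Yield a Document for every image that has both companion CSV files."""
    for image in sorted(root.rglob("*.png")):
        doc_id = image.stem  # assumed shared base name, e.g. "doc_001"
        blocks = image.with_name(doc_id + ".blocks.csv")
        truth = image.with_name(doc_id + ".groundtruth.csv")
        # Skip incomplete documents instead of failing half-way through a scan.
        if blocks.is_file() and truth.is_file():
            yield Document(doc_id, image, blocks, truth)
```

Yielding documents lazily keeps memory use flat on large directories, and skipping incomplete triples means later stages can rely on all three paths being present.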

2. OCR Extraction

Using Tesseract, we extract full-page text and also parse OCRopus block CSVs for structured text. This dual approach improves recall on handwritten or low-contrast fields.

OCR output example

Example: Raw invoice image → Tesseract text + block-level CSV alignments.
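One possible way to combine the two text sources is sketched below. `pytesseract.image_to_string` is the standard pytesseract call for full-page OCR; everything else is an assumption made for illustration: the helper names are hypothetical, the block text is assumed to sit in the last CSV column, and the merge rule (append block texts missing from the full-page output) is a simple stand-in rather than the project's actual logic.

```python
import csv
from pathlib import Path
from typing import List


def tesseract_text(image_path: Path) -> str:
    """Full-page OCR; needs pytesseract, Pillow, and the Tesseract binary."""
    # Imported lazily so the CSV helpers below work without Tesseract installed.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def block_texts(blocks_csv: Path, text_column: int = -1) -> List[str]:
    """Read one text string per block; the column position is an assumption."""
    texts = []
    with open(blocks_csv, newline="", encoding="utf-8") as fh:
        for row in csv.reader(fh):
            if row and row[text_column].strip():
                texts.append(row[text_column].strip())
    return texts


def combined_text(full_page: str, blocks: List[str]) -> str:
    """Append block texts that the full-page pass did not already produce."""
    extra = [b for b in blocks if b not in full_page]
    return "\n".join([full_page.strip(), *extra]).strip()
```

With this rule, a block that Tesseract already recognised verbatim is not duplicated, while a block it missed is still available to the extraction step.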

3. Entity Extraction

Regex rules pinpoint numerical fields: Voltage, Storage Temperature, and Power Dissipation. Each field type has tailored patterns to handle units, decimals, and common OCR errors.

Entity extraction example

Detected entities overlaid on OCR text for review and debugging.
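To make the idea of field-specific patterns concrete, here is a small sketch. The patterns, the `PATTERNS` table, and the `extract_fields` helper are illustrative assumptions, not the rules shipped in the repository. The sketch tolerates a comma as decimal separator and the common OCR confusions of `O`/`o` for `0` and `l`/`I` for `1`, but only after a leading real digit, to avoid matching ordinary words.

```python
import re
from typing import Dict, List, Tuple

# A number as OCR might render it: optional sign, a leading digit, then digits
# or look-alike letters, with "." or "," as the decimal separator.
NUMBER = r"[-+]?\d[0-9OolI]*(?:[.,][0-9OolI]+)?"

# Hypothetical patterns: each captures (number, unit).
PATTERNS: Dict[str, re.Pattern] = {
    "Voltage": re.compile(rf"({NUMBER})\s*(mV|kV|V)\b"),
    "Storage Temperature": re.compile(rf"({NUMBER})\s*°?\s*([CF])\b"),
    "Power Dissipation": re.compile(rf"({NUMBER})\s*(mW|kW|W)\b"),
}

# Map look-alike characters back to digits and "," to ".".
_OCR_FIXES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1", ",": "."})


def normalize_number(raw: str) -> float:
    """Undo the OCR substitutions allowed by NUMBER and parse as float."""
    return float(raw.translate(_OCR_FIXES))


def extract_fields(text: str) -> Dict[str, List[Tuple[float, str]]]:
    """Return every (value, unit) match per field; no unit conversion is done."""
    return {
        name: [(normalize_number(num), unit) for num, unit in pattern.findall(text)]
        for name, pattern in PATTERNS.items()
    }


sample = "Supply Voltage: 1O V\nStorage Temperature: -55 to +150 °C\nPower Dissipation 3,5 W"
fields = extract_fields(sample)
```

On this sample the misread `1O V` is recovered as 10 V and `3,5 W` as 3.5 W. Note that only the upper bound of the range `-55 to +150 °C` is captured, because the lower bound is not followed by a unit; ranges would need a dedicated pattern.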

4. Evaluation & Debug

Built-in evaluators compare extracted values against ground truth: `enhanced_evaluate.py` reports raw numeric matches, and `per_field_evaluate.py` reports precision and recall per field. Quick-start debug scripts let you inspect a single document’s OCR output, regex matches, and ground truth side by side.

Evaluation metrics

Sample output: field-level precision & recall across the Ghega test set.
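For reference, per-field precision and recall can be computed along the following lines. This is a sketch under one common set of definitions, not a copy of the project's evaluator: a prediction counts as a true positive when it equals the ground truth within a tolerance, a wrong or unsupported prediction is a false positive, and a missed or mis-extracted ground-truth value is a false negative. The function name, record layout, and `tol` parameter are assumptions.

```python
from typing import Dict, List, Optional, Tuple

Record = Dict[str, Optional[float]]  # field name -> value, or None if absent


def per_field_scores(
    predictions: List[Record],
    truths: List[Record],
    fields: List[str],
    tol: float = 1e-6,
) -> Dict[str, Tuple[float, float]]:
    """Return {field: (precision, recall)} over aligned prediction/truth records."""
    scores = {}
    for field in fields:
        tp = fp = fn = 0
        for pred, truth in zip(predictions, truths):
            p, t = pred.get(field), truth.get(field)
            if p is not None and t is not None and abs(p - t) <= tol:
                tp += 1
            else:
                if p is not None:
                    fp += 1  # extracted a value that is wrong or has no ground truth
                if t is not None:
                    fn += 1  # ground-truth value was missed or mis-extracted
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        scores[field] = (precision, recall)
    return scores


preds = [{"Voltage": 5.0}, {"Voltage": 12.0}, {}]
truths = [{"Voltage": 5.0}, {"Voltage": 9.0}, {"Voltage": 3.3}]
scores = per_field_scores(preds, truths, ["Voltage"])
```

In this toy example one of two predictions is correct (precision 0.5) and one of three ground-truth values is recovered (recall 1/3). A mis-extracted value counts against both metrics, which is a deliberate choice here; other conventions are possible.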

Usage & Testing

After cloning and installing (see the repo README), you can: