1. Data Loader
The loader scans a directory of Ghega files, pairing each invoice image (`.png`) with its block-level OCR output (`.blocks.csv`) and ground-truth CSV (`.groundtruth.csv`), yielding a unified Python object per document.
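The pairing step can be sketched roughly as follows. This is a minimal illustration, not the project's actual loader: the `Document` dataclass and `iter_documents` function are hypothetical names, and the sketch assumes the three files of a document share a base name (for example `doc.png`, `doc.blocks.csv`, `doc.groundtruth.csv`).

```python
from dataclasses import dataclass
from pathlib import Path
from typing import Iterator, Optional


@dataclass
class Document:
    """Hypothetical container; the project's real class may differ."""
    doc_id: str
    image_path: Path
    blocks_path: Optional[Path]
    groundtruth_path: Optional[Path]


def iter_documents(root: Path) -> Iterator[Document]:
    """Yield one Document per .png, attaching sibling CSVs when they exist."""
    for image_path in sorted(root.rglob("*.png")):
        # Drop only the trailing ".png" so dotted base names stay intact.
        base = image_path.with_suffix("")
        blocks = Path(f"{base}.blocks.csv")
        truth = Path(f"{base}.groundtruth.csv")
        yield Document(
            doc_id=image_path.stem,
            image_path=image_path,
            blocks_path=blocks if blocks.exists() else None,
            groundtruth_path=truth if truth.exists() else None,
        )
```

Missing CSVs are reported as `None` rather than raising, so a summary pass can count incomplete documents instead of stopping at the first one.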
InvoiceSense is a Python-based pipeline for end-to-end document understanding, built on the Ghega dataset. It extracts text from invoice images via OCR, applies rule-based NLP to pull out key fields (Voltage, Storage Temperature, Power Dissipation), and evaluates extraction quality against ground truth.
extract_text_from_image: full-page Tesseract OCR
extract_text_from_blocks: CSV block-level OCR text
enhanced_evaluate.py: value-only metrics
per_field_evaluate.py: per-field precision & recall
Using Tesseract, we extract full-page text and also parse OCRopus block CSVs for structured text. Combining both sources is intended to improve recall on handwritten or low-contrast fields, since text missed by one OCR pass may still appear in the other.
Example: Raw invoice image → Tesseract text + block-level CSV alignments.
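The two text sources could be read along these lines. The function names match those listed in this document, but the signatures and bodies here are assumptions, as is the `text` column name in the block CSV (the real file layout may differ, and may have no header row). `pytesseract` and Pillow are imported lazily so the CSV path works without Tesseract installed.

```python
import csv
from pathlib import Path


def extract_text_from_image(image_path: Path) -> str:
    """Full-page OCR with Tesseract via pytesseract (requires Tesseract)."""
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as img:
        return pytesseract.image_to_string(img)


def extract_text_from_blocks(blocks_path: Path, text_column: str = "text") -> str:
    """Join the text of every non-empty block row, one block per line.

    The column name is an assumption for illustration.
    """
    with open(blocks_path, newline="", encoding="utf-8") as fh:
        rows = csv.DictReader(fh)
        return "\n".join(
            row[text_column].strip() for row in rows if row.get(text_column)
        )
```

Keeping the two extractors separate lets later stages run the field rules on each text independently and compare the results.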
Regex rules pinpoint three numerical fields: Voltage, Storage Temperature, and Power Dissipation.
Each field type has tailored patterns to handle units, decimals, and common OCR errors.
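As an illustration of what such field patterns might look like, here is a small sketch. The patterns and the `extract_fields` helper are hypothetical, not the project's actual rules; they assume each value follows its label within about 40 non-digit characters, accept a comma as a decimal mark, and treat `oC` or `0C` as a misread `°C`.

```python
import re
from typing import Dict, Optional

# Signed number with "." or "," as the decimal mark (OCR often confuses them).
NUM = r"[-+]?\d+(?:[.,]\d+)?"

# Illustrative patterns only; the real rules are not shown in this document.
FIELD_PATTERNS: Dict[str, re.Pattern] = {
    "Voltage": re.compile(
        rf"voltage\D{{0,40}}?(?P<value>{NUM})\s*(?P<unit>[mk]?V)\b",
        re.IGNORECASE,
    ),
    "Storage Temperature": re.compile(
        rf"storage\s+temp\w*\D{{0,40}}?"
        rf"(?P<value>{NUM}(?:\s*(?:to|~)\s*{NUM})?)"
        r"\s*(?P<unit>°\s?[CF]|[o0]C)\b",
        re.IGNORECASE,
    ),
    "Power Dissipation": re.compile(
        rf"power\s+dissipation\D{{0,40}}?(?P<value>{NUM})\s*(?P<unit>[mµu]?W)\b",
        re.IGNORECASE,
    ),
}


def extract_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first match per field as 'value unit', or None if absent."""
    results: Dict[str, Optional[str]] = {}
    for field, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        if match is None:
            results[field] = None
            continue
        # Normalize decimal commas and collapse internal whitespace.
        value = re.sub(r"\s+", " ", match.group("value")).replace(",", ".")
        unit = match.group("unit")
        if field == "Storage Temperature":
            # Repair a misread degree sign ("oC", "0C") and drop stray spaces.
            unit = "°" + unit[-1].upper()
        results[field] = f"{value} {unit}"
    return results
```

Anchoring each pattern on its label keeps a bare `5 V` elsewhere on the page from being reported as the Voltage field, at the cost of missing values whose label was itself misread.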
Detected entities overlaid on OCR text for review and debugging.
Run built-in evaluators to compare extracted values against ground truth:
enhanced_evaluate.py for raw numeric matches, and
per_field_evaluate.py for precision/recall per field.
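For reference, per-field precision and recall can be computed along these lines. This is a sketch under one common convention, not necessarily what per_field_evaluate.py does: an exact string match counts as a true positive, a wrong or spurious extraction as a false positive, and a missed or wrongly extracted ground-truth value as a false negative. The `per_field_scores` name is hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Fields = Dict[str, Optional[str]]


def per_field_scores(
    pairs: Iterable[Tuple[Fields, Fields]]
) -> Dict[str, Dict[str, float]]:
    """Compute precision and recall per field from (predicted, truth) pairs."""
    counts = defaultdict(lambda: {"tp": 0, "fp": 0, "fn": 0})
    for predicted, truth in pairs:
        for field in set(predicted) | set(truth):
            pred, gold = predicted.get(field), truth.get(field)
            if pred is not None and pred == gold:
                counts[field]["tp"] += 1
            else:
                if pred is not None:
                    counts[field]["fp"] += 1  # wrong or spurious extraction
                if gold is not None:
                    counts[field]["fn"] += 1  # true value missed or mis-extracted
    scores: Dict[str, Dict[str, float]] = {}
    for field, c in counts.items():
        p_den = c["tp"] + c["fp"]
        r_den = c["tp"] + c["fn"]
        scores[field] = {
            "precision": c["tp"] / p_den if p_den else 0.0,
            "recall": c["tp"] / r_den if r_den else 0.0,
        }
    return scores
```

Under this convention a wrong value is penalized in both precision and recall, which is stricter than counting it only as a false positive.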
Quick-start debug scripts let you inspect one document’s OCR vs. regex vs. truth.
Sample output: field-level precision & recall across the Ghega test set.
After cloning and installing (see the repo README), you can:
python data/dataset_loader.py – load & summarize documents
python pipeline/integration.py – run full OCR → NLP pipeline
python evaluation/enhanced_evaluate.py – compute value-only metrics
python evaluation/per_field_evaluate.py – compute precision & recall
pytest – run unit tests for OCR, NLP, and integration