PDF2Data – From Documents to Clean Datasets

Turning unstructured PDFs into analysis-ready data pipelines.

Roadmap

  1. Phase 1 – Structured Extraction (MVP)
    • Upload PDFs → extract predefined fields / tables.
    • Support invoices, contracts, inspection reports.
    • Export CSV / JSON; basic preview UI.
  2. Phase 2 – Data Cleansing Pipeline
    • Rule-based + LLM-assisted deduplication, type casting, currency/date normalisation.
    • Interactive cleaning UI with bulk actions & undo.
  3. Phase 3 – Intelligent Analytics
    • Merge multi-file datasets, compute KPIs, and detect anomalies.
    • Natural-language querying (chat over the extracted data).
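
Example: rule-based cleansing (Phase 2)

To make the rule-based half of the Phase 2 cleansing step concrete, here is a minimal sketch in Python using only the standard library. It is an illustration under stated assumptions, not the project's implementation: the field names (invoice_no, date, total), the list of date layouts, the decimal-comma heuristic, and the function names are all hypothetical. The LLM-assisted part is not shown.

```python
from datetime import datetime
from decimal import Decimal, InvalidOperation

# Hypothetical date layouts, tried in order; a real pipeline would
# likely configure these per document source.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%d.%m.%Y")


def normalise_date(raw):
    """Return an ISO 8601 date string, or None if no known layout matches."""
    text = raw.strip()
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalise_amount(raw):
    """Strip currency symbols and thousands separators; return Decimal or None."""
    cleaned = "".join(ch for ch in raw if ch.isdigit() or ch in ".,-")
    # Assumption: a comma followed by exactly two trailing digits is a
    # decimal comma (e.g. "1.234,56"); otherwise commas group thousands.
    if "," in cleaned and len(cleaned.rsplit(",", 1)[1]) == 2:
        cleaned = cleaned.replace(".", "").replace(",", ".")
    else:
        cleaned = cleaned.replace(",", "")
    try:
        return Decimal(cleaned)
    except InvalidOperation:
        return None


def clean_rows(rows):
    """Normalise each row, then drop exact duplicates on (invoice_no, date, total)."""
    seen = set()
    out = []
    for row in rows:
        record = {
            "invoice_no": row["invoice_no"].strip().upper(),
            "date": normalise_date(row["date"]),
            "total": normalise_amount(row["total"]),
        }
        key = (record["invoice_no"], record["date"], record["total"])
        if key not in seen:
            seen.add(key)
            out.append(record)
    return out
```

Values that cannot be parsed come back as None rather than raising, so an interactive cleaning UI could surface them for manual review. Deduplication here is exact-match after normalisation; fuzzy matching would need a separate rule.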