How to Extract Data from Scanned Documents

Scanned documents are one of the hardest data sources to work with. They're images, not text — which means you can't search them, copy from them, or feed them into a database. But they contain valuable data: invoices from vendors who still email scans, legacy records stored as TIFF files, forms filled out by hand and photographed.

Here's how to turn scanned documents into structured, usable JSON data.

The Challenge with Scanned Documents

A scanned PDF isn't like a digital PDF. When you open a digital PDF, the text is stored as characters — your computer knows what each word says. A scanned document is just a picture. The computer sees pixels, not text.

This means you need two steps:

1. OCR: convert the image to text
2. Extraction: turn that text into structured data

Many OCR tools stop at step 1: they give you a wall of raw text and leave you to figure out what it means. Schema-based extraction covers step 2 by interpreting the text in context and returning structured JSON.

How OCR + Extraction Works

Step 1: Image Processing

The scanned document is analyzed for text regions. The OCR engine identifies characters, words, and their positions on the page. Modern OCR handles:

Printed text in most common fonts
Low-resolution scans (300 DPI is ideal, but lower works)
Slightly skewed or rotated pages
Multi-column layouts
Tables and structured content

Step 2: Text Reconstruction

Raw OCR output is assembled into readable text, preserving the document's structure — headings, paragraphs, tables, and lists.
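To make the reconstruction step more concrete, here is a minimal sketch of one way word boxes could be reassembled into lines of text. The Word class, the reconstruct_lines function, and the pixel tolerance are hypothetical names and values chosen for illustration; they are not Smole's implementation, and a real engine also has to deal with columns, tables, and rotated text.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A hypothetical OCR word box: recognized text plus its position."""
    text: str
    x: float  # left edge of the box, in pixels
    y: float  # vertical center of the box, in pixels


def reconstruct_lines(words, line_tolerance=8.0):
    """Group word boxes into text lines, top to bottom and left to right."""
    lines = []  # each entry is [reference_y, [words on that line]]
    for word in sorted(words, key=lambda w: w.y):
        # A word joins the current line if it sits close to that line's first word.
        if lines and abs(word.y - lines[-1][0]) <= line_tolerance:
            lines[-1][1].append(word)
        else:
            lines.append([word.y, [word]])
    # Within each line, order the words by their horizontal position.
    return [
        " ".join(w.text for w in sorted(line_words, key=lambda w: w.x))
        for _, line_words in lines
    ]


# OCR engines do not necessarily return words in reading order.
words = [
    Word("INV-2025-0394", x=320, y=52),
    Word("Invoice", x=40, y=50),
    Word("No.", x=180, y=51),
    Word("2025-11-01", x=180, y=90),
    Word("Date", x=40, y=91),
]
text_lines = reconstruct_lines(words)
print(text_lines)  # ['Invoice No. INV-2025-0394', 'Date 2025-11-01']
```

Grouping by vertical position first and sorting horizontally second is the simplest workable ordering for single-column pages; multi-column layouts would need the page split into regions before this step.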

Step 3: Schema-Based Extraction

The reconstructed text is analyzed against your JSON schema. Instead of you writing regex patterns to find invoice numbers or dates, the extraction engine understands the document semantically and maps data to your schema fields.
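One way to picture what "mapping data to schema fields" involves is to list the leaf fields a schema asks for. The sketch below walks a trimmed-down invoice schema and prints each field path with its declared type. The field_paths helper is a hypothetical illustration, not part of any Smole API, and it only understands the "object", "array", and scalar types used in this article.

```python
def field_paths(schema, prefix=""):
    """Yield (path, type) pairs for every leaf field in a JSON schema."""
    schema_type = schema.get("type")
    if schema_type == "object":
        for name, sub_schema in schema.get("properties", {}).items():
            path = f"{prefix}.{name}" if prefix else name
            yield from field_paths(sub_schema, path)
    elif schema_type == "array":
        # "[]" marks a field that repeats once per array item.
        yield from field_paths(schema.get("items", {}), prefix + "[]")
    else:
        yield prefix, schema_type


schema = {
    "type": "object",
    "properties": {
        "vendor": {
            "type": "object",
            "properties": {
                "name": {"type": "string"},
                "tax_id": {"type": "string"},
            },
        },
        "invoice_number": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "description": {"type": "string"},
                    "quantity": {"type": "number"},
                },
            },
        },
        "total": {"type": "number"},
    },
}

paths = dict(field_paths(schema))
for path, field_type in paths.items():
    print(f"{path}: {field_type}")
```

Each printed path, such as vendor.tax_id or line_items[].quantity, is a slot the extraction has to fill from the reconstructed text, which is why descriptive names and sensible nesting matter.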

Extracting Data from a Scanned Invoice

Here's a concrete example. You have a scanned invoice — a photograph or a scan from a vendor.

Define Your Schema

{
  "type": "object",
  "properties": {
    "vendor": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "address": { "type": "string" },
        "tax_id": { "type": "string" }
      }
    },
    "invoice_number": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "due_date": { "type": "string", "format": "date" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "total": { "type": "number" }
        }
      }
    },
    "subtotal": { "type": "number" },
    "tax_rate": { "type": "number" },
    "tax_amount": { "type": "number" },
    "total": { "type": "number" }
  }
}

Upload the Scan

curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@scanned-invoice.pdf" \
  -F "schemaId=YOUR_SCHEMA_ID" \
  -F "ocr=true"

Smole detects scanned documents automatically, but you can explicitly enable OCR with the ocr flag.
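For reference, the same upload can be prepared from Python using only the standard library. This is a sketch under stated assumptions: the endpoint, the Authorization header, and the form field names (file, schemaId, ocr) are taken from the curl command above, while the build_upload_request helper and the generic Content-Type for the file part are illustrative choices. The response format is not described in this article, so the sketch only builds the request and leaves sending it to the caller.

```python
import urllib.request
import uuid

API_URL = "https://api.smole.tech/api/pipeline/file"


def build_upload_request(file_bytes, filename, schema_id, api_key, ocr=True):
    """Build a multipart/form-data POST mirroring the curl command above."""
    boundary = uuid.uuid4().hex
    fields = [("schemaId", schema_id)]
    if ocr:
        # Only ocr=true appears in the article, so the field is omitted otherwise.
        fields.append(("ocr", "true"))

    parts = []
    for name, value in fields:
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    # The filename is inserted as-is; names containing quotes would need escaping.
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    )
    parts.append(file_header.encode("utf-8") + file_bytes + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))

    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


# Usage (not executed here because it needs a real file, key, and schema ID):
#   with open("scanned-invoice.pdf", "rb") as f:
#       req = build_upload_request(f.read(), "scanned-invoice.pdf",
#                                  "YOUR_SCHEMA_ID", "YOUR_API_KEY")
#   with urllib.request.urlopen(req) as resp:
#       print(resp.read().decode("utf-8"))
```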

Get Structured JSON

{
  "vendor": {
    "name": "Müller Office Supplies GmbH",
    "address": "Hauptstr. 42, 10115 Berlin",
    "tax_id": "DE123456789"
  },
  "invoice_number": "INV-2025-0394",
  "date": "2025-11-01",
  "due_date": "2025-12-01",
  "line_items": [
    { "description": "A4 Paper (500 sheets)", "quantity": 10, "unit_price": 4.50, "total": 45.00 },
    { "description": "Toner Cartridge Black", "quantity": 2, "unit_price": 35.00, "total": 70.00 }
  ],
  "subtotal": 115.00,
  "tax_rate": 0.19,
  "tax_amount": 21.85,
  "total": 136.85
}

Even from a scan, you get clean, typed data you can feed directly into your system.
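Because the amounts arrive as numbers, a cheap sanity check is possible before the data goes anywhere else: the invoice's own arithmetic should add up. The sketch below is an optional client-side check, not something the article says Smole performs, and check_invoice and its one-cent tolerance are hypothetical choices. A misread digit in any amount usually shows up as a mismatch.

```python
from decimal import Decimal


def money(value):
    """Convert a JSON number to Decimal via str to avoid binary float noise."""
    return Decimal(str(value))


def check_invoice(invoice, tolerance=Decimal("0.01")):
    """Return a list of arithmetic inconsistencies found in an extracted invoice."""
    problems = []
    items = invoice["line_items"]

    for i, item in enumerate(items):
        expected = money(item["quantity"]) * money(item["unit_price"])
        if abs(expected - money(item["total"])) > tolerance:
            problems.append(f"line_items[{i}].total should be {expected}")

    subtotal = sum((money(item["total"]) for item in items), Decimal("0"))
    if abs(subtotal - money(invoice["subtotal"])) > tolerance:
        problems.append(f"subtotal should be {subtotal}")

    tax = money(invoice["subtotal"]) * money(invoice["tax_rate"])
    if abs(tax - money(invoice["tax_amount"])) > tolerance:
        problems.append(f"tax_amount should be {tax}")

    total = money(invoice["subtotal"]) + money(invoice["tax_amount"])
    if abs(total - money(invoice["total"])) > tolerance:
        problems.append(f"total should be {total}")

    return problems


invoice = {
    "line_items": [
        {"description": "A4 Paper (500 sheets)", "quantity": 10, "unit_price": 4.50, "total": 45.00},
        {"description": "Toner Cartridge Black", "quantity": 2, "unit_price": 35.00, "total": 70.00},
    ],
    "subtotal": 115.00,
    "tax_rate": 0.19,
    "tax_amount": 21.85,
    "total": 136.85,
}

print(check_invoice(invoice))  # [] because 45 + 70 = 115, 115 * 0.19 = 21.85, 115 + 21.85 = 136.85

# Simulate a misread digit in the grand total (3 read as 8).
misread = dict(invoice, total=186.85)
print(check_invoice(misread))
```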

Supported Image Formats

Smole processes scanned documents in any common format:

PDF — Scanned PDFs (single and multi-page)
PNG — Screenshots, phone captures
JPEG — Photographs of documents
TIFF — Common in enterprise scanning workflows
BMP, WEBP, GIF — Also supported
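If you batch-upload files, a small pre-flight filter can skip files that are obviously not in the list above. This is a minimal sketch: the extension set is an assumption derived from the format names listed here (including the common .jpg and .tif spellings), and a file extension says nothing about whether the content is actually valid.

```python
from pathlib import Path

# Assumed extensions for the formats listed above.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".webp", ".gif",
}


def is_supported(filename):
    """Return True if the file extension matches one of the listed formats."""
    return Path(filename).suffix.lower() in SUPPORTED_EXTENSIONS


candidates = ["invoice-0394.PDF", "receipt.jpeg", "archive/page-001.tif", "notes.docx"]
to_upload = [name for name in candidates if is_supported(name)]
print(to_upload)  # ['invoice-0394.PDF', 'receipt.jpeg', 'archive/page-001.tif']
```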

Tips for Better OCR Results

Document Quality

Resolution: 300 DPI produces the best results. Lower resolutions work but may reduce accuracy on small text.
Contrast: High contrast between text and background helps. Avoid washed-out or overexposed scans.
Alignment: Slight skew is handled automatically, but significantly rotated documents may need straightening.
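To judge whether an image is near the 300 DPI mark, divide its pixel dimensions by the physical size of the page it shows. The short sketch below does that arithmetic for an A4 page; effective_dpi is a hypothetical helper, and the result is only meaningful when the page fills the frame.

```python
A4_WIDTH_IN = 8.27  # 210 mm expressed in inches


def effective_dpi(pixels, inches):
    """Pixels per inch along one dimension of the page."""
    return pixels / inches


# A flatbed scan of an A4 page that is 2480 pixels wide.
scan_dpi = round(effective_dpi(2480, A4_WIDTH_IN))

# A downscaled phone photo of the same page, 1240 pixels wide.
photo_dpi = round(effective_dpi(1240, A4_WIDTH_IN))

print(scan_dpi, photo_dpi)  # 300 150
```

At roughly 150 DPI, body text is usually still readable, but small print such as footnotes or tax IDs is where accuracy tends to drop first.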

Schema Design

Use descriptive field names — The extraction engine uses field names as context. vendor_tax_id is more informative than id_field_3.
Expect imperfections — OCR can occasionally misread characters (0 vs O, 1 vs l). Use numeric types for amounts so the extraction engine knows to interpret ambiguous characters as digits.
Group related fields — Nesting related fields (like vendor name + address in a vendor object) helps the extraction engine understand relationships.
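The point about numeric types can be illustrated with a toy example: once a field is known to be a number, look-alike letters can be read as digits. The coerce_number function and its character table are hypothetical and only demonstrate the idea; they are not how Smole resolves ambiguity, and the same substitution would be wrong for a string field such as an invoice number.

```python
# Letters that OCR commonly confuses with digits, mapped to the digit they resemble.
LOOKALIKES = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def coerce_number(raw):
    """Interpret an OCR string as a number, or return None if it is not one."""
    cleaned = raw.strip().translate(LOOKALIKES)
    try:
        return float(cleaned)
    except ValueError:
        return None


print(coerce_number("4.5O"))    # 4.5   (letter O read as zero)
print(coerce_number("l15.00"))  # 115.0 (lowercase L read as one)
print(coerce_number("n/a"))     # None
```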

Real-World Use Cases

Digitizing Legacy Archives

Organizations often have years of paper records that were scanned to PDF. These archives contain contracts, personnel files, financial records, and compliance documents. Schema-based extraction turns them into searchable, queryable data.

Processing Vendor Invoices

Many vendors, especially smaller ones, still send invoices as scans or photos. Automating extraction from these documents eliminates manual data entry in accounts payable workflows.

Handling Forms and Applications

Insurance claims, tax forms, HR documents, and customer applications are frequently scanned. Extracting structured data from these documents speeds up processing and reduces errors.

Try It Yourself

Upload a scanned document in the Playground — even a photo from your phone works. Define a schema and see the extraction results in seconds.

For integration details, see the API documentation.
