How to Extract Data from Scanned Documents

February 20, 2026 · Smole Team

Scanned documents are one of the hardest data sources to work with. They're images, not text — which means you can't search them, copy from them, or feed them into a database. But they contain valuable data: invoices from vendors who still email scans, legacy records stored as TIFF files, forms filled out by hand and photographed.

Here's how to turn scanned documents into structured, usable JSON data.

The Challenge with Scanned Documents

A scanned PDF isn't like a digital PDF. When you open a digital PDF, the text is stored as characters — your computer knows what each word says. A scanned document is just a picture. The computer sees pixels, not text.

This means you need two steps:

  1. OCR — Convert the image to text
  2. Extraction — Turn that text into structured data

Many OCR tools stop at step 1: they return a wall of raw text and leave you to work out what it means. Schema-based extraction covers step 2. It interprets the text in context and returns structured JSON.

How OCR + Extraction Works

Step 1: Image Processing

The scanned document is analyzed for text regions. The OCR engine identifies characters, words, and their positions on the page. Modern OCR handles:

  • Printed text in most common fonts
  • Low-resolution scans (300 DPI is ideal, but lower resolutions still work)
  • Skewed or rotated pages
  • Multi-column layouts
  • Tables and structured content

Step 2: Text Reconstruction

Raw OCR output is assembled into readable text, preserving the document's structure — headings, paragraphs, tables, and lists.
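To make the reconstruction step concrete, here is a minimal sketch of one common approach: grouping word boxes into lines by vertical position, then ordering each line left to right. The Word type, the coordinates, and the line_tolerance value are illustrative assumptions, not Smole's internal representation, and a real engine also has to deal with columns, tables, and skew.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Word:
    text: str
    x: float  # left edge of the word box, in pixels (hypothetical OCR output)
    y: float  # vertical centre of the word box, in pixels


def reconstruct_lines(words: List[Word], line_tolerance: float = 8.0) -> List[str]:
    """Group word boxes into text lines, top to bottom, then left to right."""
    lines: List[List[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # Stay on the current line while the word is vertically close to
        # the first word of that line; otherwise start a new line.
        if lines and abs(word.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


words = [
    Word("INV-2025-0394", 180, 41),
    Word("Invoice", 20, 40),
    Word("No.", 110, 42),
    Word("Date:", 20, 70),
    Word("2025-11-01", 90, 71),
]
print(reconstruct_lines(words))
# ['Invoice No. INV-2025-0394', 'Date: 2025-11-01']
```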

Step 3: Schema-Based Extraction

The reconstructed text is analyzed against your JSON schema. Instead of writing regex patterns by hand to find invoice numbers or dates, you let the extraction engine interpret the document semantically and map the data to your schema fields.
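For comparison, this is roughly what the hand-written regex alternative looks like. The sample text and patterns are made up for illustration. The point is that each pattern is tied to one vendor's wording, so a second layout silently yields nothing.

```python
import re

raw_text = """Müller Office Supplies GmbH
Invoice No. INV-2025-0394
Date: 2025-11-01
Total: 136.85"""

# Each pattern encodes one vendor's exact labels and formats.
PATTERNS = {
    "invoice_number": re.compile(r"Invoice No\.\s*(\S+)"),
    "date": re.compile(r"Date:\s*(\d{4}-\d{2}-\d{2})"),
    "total": re.compile(r"Total:\s*([\d.]+)"),
}


def extract_with_regex(text: str) -> dict:
    """Return the first match for each field, or None when a pattern misses."""
    result = {}
    for field, pattern in PATTERNS.items():
        match = pattern.search(text)
        result[field] = match.group(1) if match else None
    return result


print(extract_with_regex(raw_text))

# A vendor using German labels and date format matches none of the patterns.
other_vendor = "Rechnungsnr. 0394\nDatum: 01.11.2025\nGesamt: 136,85"
print(extract_with_regex(other_vendor))
```

Every new layout means another set of patterns to write and maintain, which is the work a schema is meant to replace.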

Extracting Data from a Scanned Invoice

Here's a concrete example. You have a scanned invoice — a photograph or a scan from a vendor.

Define Your Schema

{
  "type": "object",
  "properties": {
    "vendor": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "address": { "type": "string" },
        "tax_id": { "type": "string" }
      }
    },
    "invoice_number": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "due_date": { "type": "string", "format": "date" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "total": { "type": "number" }
        }
      }
    },
    "subtotal": { "type": "number" },
    "tax_rate": { "type": "number" },
    "tax_amount": { "type": "number" },
    "total": { "type": "number" }
  }
}

Upload the Scan

curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@scanned-invoice.pdf" \
  -F "schemaId=YOUR_SCHEMA_ID" \
  -F "ocr=true"

Smole detects scanned documents automatically, but you can also request OCR explicitly by sending the ocr=true form field, as in the request above.
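If you would rather call the endpoint from code, the same request can be sketched in Python using only the standard library. The URL, the Bearer header, and the file, schemaId, and ocr fields come from the curl example. The helper names are hypothetical, and the assumption that the response body is the extracted JSON is ours, so check the API documentation for the actual response shape.

```python
import json
import mimetypes
import urllib.request
import uuid
from pathlib import Path

API_URL = "https://api.smole.tech/api/pipeline/file"  # endpoint from the curl example


def build_multipart(fields: dict, filename: str, file_bytes: bytes) -> tuple:
    """Encode form fields plus one file part as multipart/form-data."""
    boundary = uuid.uuid4().hex
    file_type = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    chunks = []
    for name, value in fields.items():
        chunks.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode()
        )
    chunks.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            f"Content-Type: {file_type}\r\n\r\n"
        ).encode()
    )
    chunks.append(file_bytes + b"\r\n")
    chunks.append(f"--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def upload_scan(file_path: str, schema_id: str, api_key: str) -> dict:
    """POST a scan with OCR enabled and return the parsed response."""
    path = Path(file_path)
    body, content_type = build_multipart(
        {"schemaId": schema_id, "ocr": "true"}, path.name, path.read_bytes()
    )
    request = urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )
    # Assumption: the response body is the extracted JSON document.
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

Calling upload_scan("scanned-invoice.pdf", "YOUR_SCHEMA_ID", "YOUR_API_KEY") would then mirror the curl command.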

Get Structured JSON

{
  "vendor": {
    "name": "Müller Office Supplies GmbH",
    "address": "Hauptstr. 42, 10115 Berlin",
    "tax_id": "DE123456789"
  },
  "invoice_number": "INV-2025-0394",
  "date": "2025-11-01",
  "due_date": "2025-12-01",
  "line_items": [
    { "description": "A4 Paper (500 sheets)", "quantity": 10, "unit_price": 4.50, "total": 45.00 },
    { "description": "Toner Cartridge Black", "quantity": 2, "unit_price": 35.00, "total": 70.00 }
  ],
  "subtotal": 115.00,
  "tax_rate": 0.19,
  "tax_amount": 21.85,
  "total": 136.85
}

Even from a scan, you get clean, typed data you can feed directly into your system.
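Because the output is typed, it is also easy to sanity-check before it reaches your system. The sketch below re-derives the sums in the example response. The function name and the one-cent tolerance are our own choices, and it assumes tax_rate is a fraction (0.19 for 19%), as in the sample.

```python
from decimal import Decimal


def check_invoice_totals(invoice: dict, tolerance: str = "0.01") -> list:
    """Return a list of inconsistencies; an empty list means the sums agree."""
    tol = Decimal(tolerance)

    def dec(value) -> Decimal:
        # Go through str() so floats like 4.5 become exact decimals.
        return Decimal(str(value))

    problems = []
    for i, item in enumerate(invoice["line_items"]):
        expected = dec(item["quantity"]) * dec(item["unit_price"])
        if abs(expected - dec(item["total"])) > tol:
            problems.append(f"line_items[{i}].total: expected {expected}, got {item['total']}")

    subtotal = sum((dec(item["total"]) for item in invoice["line_items"]), Decimal("0"))
    if abs(subtotal - dec(invoice["subtotal"])) > tol:
        problems.append(f"subtotal: expected {subtotal}, got {invoice['subtotal']}")

    tax = dec(invoice["subtotal"]) * dec(invoice["tax_rate"])
    if abs(tax - dec(invoice["tax_amount"])) > tol:
        problems.append(f"tax_amount: expected {tax}, got {invoice['tax_amount']}")

    total = dec(invoice["subtotal"]) + dec(invoice["tax_amount"])
    if abs(total - dec(invoice["total"])) > tol:
        problems.append(f"total: expected {total}, got {invoice['total']}")
    return problems


invoice = {
    "line_items": [
        {"description": "A4 Paper (500 sheets)", "quantity": 10, "unit_price": 4.50, "total": 45.00},
        {"description": "Toner Cartridge Black", "quantity": 2, "unit_price": 35.00, "total": 70.00},
    ],
    "subtotal": 115.00,
    "tax_rate": 0.19,
    "tax_amount": 21.85,
    "total": 136.85,
}
print(check_invoice_totals(invoice))  # []
```

A non-empty result is a useful signal that a digit was misread and the document deserves a manual look.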

Supported Image Formats

Smole processes scanned documents in the following common formats:

  • PDF — Scanned PDFs (single and multi-page)
  • PNG — Screenshots, phone captures
  • JPEG — Photographs of documents
  • TIFF — Common in enterprise scanning workflows
  • BMP, WEBP, GIF — Also supported
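File extensions are not always trustworthy, so it can help to check the first bytes of a file before uploading. This is a client-side sketch based on the well-known signatures of the formats listed above. It is not part of the Smole API and it ignores rarer variants such as BigTIFF.

```python
from typing import Optional

# (leading bytes, format name) for the formats in the list above.
SIGNATURES = [
    (b"%PDF-", "pdf"),
    (b"\x89PNG\r\n\x1a\n", "png"),
    (b"\xff\xd8\xff", "jpeg"),
    (b"II*\x00", "tiff"),  # little-endian TIFF
    (b"MM\x00*", "tiff"),  # big-endian TIFF
    (b"BM", "bmp"),
    (b"GIF87a", "gif"),
    (b"GIF89a", "gif"),
]


def detect_format(header: bytes) -> Optional[str]:
    """Guess the file format from its first bytes, or return None if unknown."""
    for magic, name in SIGNATURES:
        if header.startswith(magic):
            return name
    # WEBP is a RIFF container: "RIFF", four size bytes, then "WEBP".
    if header[:4] == b"RIFF" and header[8:12] == b"WEBP":
        return "webp"
    return None


print(detect_format(b"%PDF-1.7\n"))       # pdf
print(detect_format(b"\xff\xd8\xff\xe0"))  # jpeg
print(detect_format(b"plain text"))        # None
```

In practice you would pass the first dozen or so bytes read from the file in binary mode.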

Tips for Better OCR Results

Document Quality

  • Resolution: 300 DPI produces the best results. Lower resolutions work but may reduce accuracy on small text.
  • Contrast: High contrast between text and background helps. Avoid washed-out or overexposed scans.
  • Alignment: Slight skew is handled automatically, but significantly rotated documents may need straightening.
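To judge whether a scan is near the 300 DPI mark, you can estimate its effective resolution from the pixel dimensions and the physical page size. The sketch below assumes the image covers a full A4 page. The function name is ours, and a phone photo with margins around the page will read lower than the figure it reports.

```python
MM_PER_INCH = 25.4
A4_MM = (210.0, 297.0)  # width, height


def effective_dpi(pixel_width: int, pixel_height: int, page_mm=A4_MM) -> float:
    """Return the lower of the horizontal and vertical dots per inch."""
    width_in = page_mm[0] / MM_PER_INCH
    height_in = page_mm[1] / MM_PER_INCH
    return min(pixel_width / width_in, pixel_height / height_in)


print(round(effective_dpi(2480, 3508)))  # 300
print(round(effective_dpi(1240, 1754)))  # 150
```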

Schema Design

  • Use descriptive field names — The extraction engine uses field names as context. vendor_tax_id is more informative than id_field_3.
  • Expect imperfections — OCR can occasionally misread characters (0 vs O, 1 vs l). Use numeric types for amounts so the extraction engine knows to interpret ambiguous characters as digits.
  • Group related fields — Nesting related fields (like vendor name + address in a vendor object) helps the extraction engine understand relationships.
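To illustrate what interpreting ambiguous characters as digits means, here is a small client-side sketch that repairs an amount captured as a string. It covers only the two confusions mentioned above (O for 0, l for 1) and is an illustration of the idea, not a description of how the extraction engine works internally.

```python
# Letters commonly misread in place of digits, per the examples above.
CONFUSABLES = str.maketrans({"O": "0", "l": "1"})


def repair_amount(raw: str) -> float:
    """Map look-alike letters to digits, then parse the amount.

    Raises ValueError if the cleaned text is still not a number.
    """
    return float(raw.strip().translate(CONFUSABLES))


print(repair_amount("1O5.5O"))  # 105.5
print(repair_amount("l36.85"))  # 136.85
```

Declaring the field as a number in the schema should make this kind of fallback unnecessary in most cases, but it is a cheap safeguard for free-text fields that carry amounts.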

Real-World Use Cases

Digitizing Legacy Archives

Organizations often have years of paper records that were scanned to PDF. These archives contain contracts, personnel files, financial records, and compliance documents. Schema-based extraction turns them into searchable, queryable data.

Processing Vendor Invoices

Many vendors, especially smaller ones, still send invoices as scans or photos. Automating extraction from these documents eliminates manual data entry in accounts payable workflows.

Handling Forms and Applications

Insurance claims, tax forms, HR documents, and customer applications are frequently scanned. Extracting structured data from these documents speeds up processing and reduces errors.

Try It Yourself

Upload a scanned document in the Playground — even a photo from your phone works. Define a schema and see the extraction results in seconds.

For integration details, see the API documentation.