How to Extract Data from Scanned Documents
Scanned documents are one of the hardest data sources to work with. They're images, not text — which means you can't search them, copy from them, or feed them into a database. But they contain valuable data: invoices from vendors who still email scans, legacy records stored as TIFF files, forms filled out by hand and photographed.
Here's how to turn scanned documents into structured, usable JSON data.
The Challenge with Scanned Documents
A scanned PDF isn't like a digital PDF. When you open a digital PDF, the text is stored as characters — your computer knows what each word says. A scanned document is just a picture. The computer sees pixels, not text.
This means you need two steps:
1. OCR — Convert the image to text
2. Extraction — Turn that text into structured data
Most tools stop at step 1. They give you a wall of raw text and leave you to figure out what it means. That's where schema-based extraction comes in — it understands the context and gives you structured JSON.
How OCR + Extraction Works
Step 1: Image Processing
The scanned document is analyzed for text regions. The OCR engine identifies characters, words, and their positions on the page. Modern OCR handles:
- Printed text in any font
- Low-resolution scans (300 DPI is ideal, but lower works)
- Skewed or rotated pages
- Multi-column layouts
- Tables and structured content
Step 2: Text Reconstruction
Raw OCR output is assembled into readable text, preserving the document's structure — headings, paragraphs, tables, and lists.
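To make the reconstruction step concrete, here is a minimal sketch of one way to assemble word boxes into reading-order lines. The `Word` structure, the pixel coordinates, and the `line_tolerance` value are assumptions made for illustration; real OCR engines expose richer layout data, and this is not a description of how Smole does it internally.

```python
from dataclasses import dataclass


@dataclass
class Word:
    text: str
    x: float  # left edge of the word box, in pixels (hypothetical OCR output)
    y: float  # vertical centre of the word box, in pixels


def reconstruct_lines(words: list[Word], line_tolerance: float = 8.0) -> list[str]:
    """Group words into lines by vertical position, then order each line left to right."""
    lines: list[list[Word]] = []
    for word in sorted(words, key=lambda w: w.y):
        # A word joins the current line if its centre is within the tolerance
        # of that line's first word; otherwise it starts a new line.
        if lines and abs(word.y - lines[-1][0].y) <= line_tolerance:
            lines[-1].append(word)
        else:
            lines.append([word])
    return [" ".join(w.text for w in sorted(line, key=lambda w: w.x)) for line in lines]


words = [
    Word("INV-2025-0394", x=310, y=52),
    Word("Invoice", x=40, y=50),
    Word("No.", x=180, y=51),
    Word("2025-11-01", x=180, y=90),
    Word("Date", x=40, y=91),
]
print(reconstruct_lines(words))
```

A fixed tolerance is the simplest choice; skewed pages or multi-column layouts need more than this, which is why reconstruction is treated as its own step.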
Step 3: Schema-Based Extraction
The reconstructed text is analyzed against your JSON schema. Instead of you writing regex patterns to find invoice numbers or dates, the extraction engine understands the document semantically and maps data to your schema fields.
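For contrast, this is roughly what the hand-written regex approach looks like. The pattern and the sample strings are illustrative assumptions; the point is only that a pattern tied to one label stops working as soon as a vendor words the field differently.

```python
import re
from typing import Optional

# Hand-written pattern that only knows the English label "Invoice No."
INVOICE_NUMBER = re.compile(r"Invoice\s+No\.?\s*:?\s*(\S+)", re.IGNORECASE)


def find_invoice_number(text: str) -> Optional[str]:
    """Return the token following an 'Invoice No.' label, or None if the label is absent."""
    match = INVOICE_NUMBER.search(text)
    return match.group(1) if match else None


print(find_invoice_number("Invoice No. INV-2025-0394"))       # label matches the pattern
print(find_invoice_number("Rechnungsnummer: INV-2025-0394"))  # different label, no match
```

Each new layout or language means another pattern to write and maintain, which is the work schema-based extraction is meant to remove.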
Extracting Data from a Scanned Invoice
Here's a concrete example. You have a scanned invoice — a photograph or a scan from a vendor.
Define Your Schema
{
  "type": "object",
  "properties": {
    "vendor": {
      "type": "object",
      "properties": {
        "name": { "type": "string" },
        "address": { "type": "string" },
        "tax_id": { "type": "string" }
      }
    },
    "invoice_number": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "due_date": { "type": "string", "format": "date" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "total": { "type": "number" }
        }
      }
    },
    "subtotal": { "type": "number" },
    "tax_rate": { "type": "number" },
    "tax_amount": { "type": "number" },
    "total": { "type": "number" }
  }
}
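If you want to sanity-check results against a schema like this on your side, a small type checker is enough for a first pass. The sketch below is an assumption-laden helper, not part of any Smole SDK: it covers only the object, array, string, and number types used above, ignores "format", and treats missing properties as acceptable. A full JSON Schema validator library would be the better choice in production.

```python
def conforms(value, schema: dict) -> bool:
    """Check a value against a small subset of JSON Schema (object, array, string, number)."""
    kind = schema.get("type")
    if kind == "object":
        if not isinstance(value, dict):
            return False
        props = schema.get("properties", {})
        # Only properties declared in the schema are checked; extras are ignored.
        return all(conforms(value[key], props[key]) for key in value if key in props)
    if kind == "array":
        items = schema.get("items", {})
        return isinstance(value, list) and all(conforms(v, items) for v in value)
    if kind == "string":
        return isinstance(value, str)
    if kind == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    return True  # no type constraint given


# A cut-down version of the invoice schema, for illustration.
schema = {
    "type": "object",
    "properties": {
        "invoice_number": {"type": "string"},
        "line_items": {
            "type": "array",
            "items": {"type": "object", "properties": {"quantity": {"type": "number"}}},
        },
    },
}

print(conforms({"invoice_number": "INV-2025-0394", "line_items": [{"quantity": 10}]}, schema))
print(conforms({"invoice_number": 394}, schema))
```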
Upload the Scan
curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@scanned-invoice.pdf" \
  -F "schemaId=YOUR_SCHEMA_ID" \
  -F "ocr=true"
Smole detects scanned documents automatically, but you can explicitly enable OCR with the ocr flag.
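The same upload can be sketched in Python with only the standard library. The endpoint, header, and form field names below are copied from the curl command; everything else (the helper name, the generic `application/octet-stream` content type for the file part, omitting the `ocr` field when it is not requested) is an assumption, so check the API documentation before relying on it. The function builds the request without sending it.

```python
import uuid
import urllib.request

API_URL = "https://api.smole.tech/api/pipeline/file"  # endpoint from the curl example


def build_upload_request(api_key: str, schema_id: str, filename: str,
                         file_bytes: bytes, ocr: bool = True) -> urllib.request.Request:
    """Build (but do not send) the multipart POST that the curl command performs."""
    boundary = uuid.uuid4().hex
    fields = [("schemaId", schema_id)]
    if ocr:
        fields.append(("ocr", "true"))  # only the value shown in the curl example is used

    parts = []
    for name, value in fields:
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n{value}\r\n'.encode()
        )
    # File part: headers, blank line, raw bytes, trailing CRLF.
    parts.append(
        f'--{boundary}\r\nContent-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: application/octet-stream\r\n\r\n".encode()
        + file_bytes + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())

    return urllib.request.Request(
        API_URL,
        data=b"".join(parts),
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


request = build_upload_request(
    "YOUR_API_KEY", "YOUR_SCHEMA_ID", "scanned-invoice.pdf", b"%PDF-1.4 sample bytes"
)
# To send it: urllib.request.urlopen(request), then parse the response body as JSON.
```

In practice you would read `file_bytes` from disk with `pathlib.Path(...).read_bytes()`; a third-party HTTP client would shorten the multipart handling considerably.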
Get Structured JSON
{
  "vendor": {
    "name": "Müller Office Supplies GmbH",
    "address": "Hauptstr. 42, 10115 Berlin",
    "tax_id": "DE123456789"
  },
  "invoice_number": "INV-2025-0394",
  "date": "2025-11-01",
  "due_date": "2025-12-01",
  "line_items": [
    { "description": "A4 Paper (500 sheets)", "quantity": 10, "unit_price": 4.50, "total": 45.00 },
    { "description": "Toner Cartridge Black", "quantity": 2, "unit_price": 35.00, "total": 70.00 }
  ],
  "subtotal": 115.00,
  "tax_rate": 0.19,
  "tax_amount": 21.85,
  "total": 136.85
}
Even from a scan, you get clean, typed data you can feed directly into your system.
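Because the amounts arrive as numbers, you can cross-check them before they reach your system, which is a cheap way to catch an OCR misread. The sketch below is one possible check, assuming the field names from the schema above and amounts rounded to two decimal places; the helper names are made up for this example.

```python
from decimal import Decimal

CENT = Decimal("0.01")


def dec(value) -> Decimal:
    """Convert a JSON number to Decimal via its string form to avoid float artefacts."""
    return Decimal(str(value))


def cents(value: Decimal) -> Decimal:
    """Round an amount to two decimal places for comparison."""
    return value.quantize(CENT)


def check_totals(invoice: dict) -> list[str]:
    """Return a list of inconsistencies; an empty list means the numbers agree."""
    problems = []
    for item in invoice["line_items"]:
        if cents(dec(item["quantity"]) * dec(item["unit_price"])) != cents(dec(item["total"])):
            problems.append(f"line total mismatch: {item['description']}")
    line_sum = sum((dec(i["total"]) for i in invoice["line_items"]), Decimal(0))
    if cents(line_sum) != cents(dec(invoice["subtotal"])):
        problems.append("subtotal does not equal the sum of line totals")
    if cents(dec(invoice["subtotal"]) * dec(invoice["tax_rate"])) != cents(dec(invoice["tax_amount"])):
        problems.append("tax_amount does not equal subtotal * tax_rate")
    if cents(dec(invoice["subtotal"]) + dec(invoice["tax_amount"])) != cents(dec(invoice["total"])):
        problems.append("total does not equal subtotal + tax_amount")
    return problems


invoice = {
    "line_items": [
        {"description": "A4 Paper (500 sheets)", "quantity": 10, "unit_price": 4.50, "total": 45.00},
        {"description": "Toner Cartridge Black", "quantity": 2, "unit_price": 35.00, "total": 70.00},
    ],
    "subtotal": 115.00,
    "tax_rate": 0.19,
    "tax_amount": 21.85,
    "total": 136.85,
}
print(check_totals(invoice))
```

For the example invoice the figures are consistent: 10 × 4.50 + 2 × 35.00 = 115.00, 115.00 × 0.19 = 21.85, and 115.00 + 21.85 = 136.85. Any non-empty result is a signal to route the document to manual review.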
Supported Image Formats
Smole processes scanned documents in any common format:
- PDF — Scanned PDFs (single and multi-page)
- PNG — Screenshots, phone captures
- JPEG — Photographs of documents
- TIFF — Common in enterprise scanning workflows
- BMP, WEBP, GIF — Also supported
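A quick client-side check can reject unsupported files before upload. The extension set below is an assumed mapping from the formats listed above to their usual file extensions; the API itself may decide by file content instead of by name.

```python
from pathlib import Path

# Usual extensions for the formats listed above (an assumed mapping).
SUPPORTED_EXTENSIONS = {
    ".pdf", ".png", ".jpg", ".jpeg", ".tif", ".tiff", ".bmp", ".webp", ".gif",
}


def is_supported(path: str) -> bool:
    """Return True if the file extension matches one of the listed formats."""
    return Path(path).suffix.lower() in SUPPORTED_EXTENSIONS


print(is_supported("archive/scan_0042.TIFF"))
print(is_supported("notes.docx"))
```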
Tips for Better OCR Results
Document Quality
- Resolution: 300 DPI produces the best results. Lower resolutions work but may reduce accuracy on small text.
- Contrast: High contrast between text and background helps. Avoid washed-out or overexposed scans.
- Alignment: Slight skew is handled automatically, but significantly rotated documents may need straightening.
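To judge whether an image meets the 300 DPI guideline, divide its pixel dimensions by the physical page size. The sketch below assumes the image covers a full A4 page in portrait orientation (8.27 × 11.69 inches); a phone photo with margins around the page will have a lower effective resolution than this estimate suggests.

```python
A4_INCHES = (8.27, 11.69)  # width and height of an A4 page in inches


def estimate_dpi(width_px: int, height_px: int,
                 page_inches: tuple[float, float] = A4_INCHES) -> float:
    """Effective resolution of a full-page image, taking the weaker of the two axes."""
    return min(width_px / page_inches[0], height_px / page_inches[1])


print(round(estimate_dpi(2480, 3508)))  # full A4 page at roughly 300 DPI
print(round(estimate_dpi(1240, 1754)))  # same page at roughly 150 DPI
```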
Schema Design
- Use descriptive field names — The extraction engine uses field names as context. vendor_tax_id is more informative than id_field_3.
- Expect imperfections — OCR can occasionally misread characters (0 vs O, 1 vs l). Use numeric types for amounts so the extraction engine knows to interpret ambiguous characters as digits.
- Group related fields — Nesting related fields (like vendor name + address in a vendor object) helps the extraction engine understand relationships.
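The effect of a numeric type hint on ambiguous characters can be illustrated with a small helper. This is only a sketch of the idea, not how the extraction engine works internally; the character mapping and the function name are assumptions, and it handles only a dot as the decimal separator.

```python
# Letters that OCR commonly confuses with digits, mapped to the digit they resemble.
AMBIGUOUS = str.maketrans({"O": "0", "o": "0", "l": "1", "I": "1"})


def to_number(raw: str) -> float:
    """Interpret an OCR token as a number, resolving look-alike letters to digits."""
    return float(raw.strip().translate(AMBIGUOUS))


print(to_number("1O5.5O"))  # letter O read as zero
print(to_number("2l.85"))   # lowercase L read as one
```

The same substitution would corrupt a string field such as a vendor name, which is why the declared type matters: it tells the engine when this interpretation is safe.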
Real-World Use Cases
Digitizing Legacy Archives
Organizations often have years of paper records that were scanned to PDF. These archives contain contracts, personnel files, financial records, and compliance documents. Schema-based extraction turns them into searchable, queryable data.
Processing Vendor Invoices
Many vendors, especially smaller ones, still send invoices as scans or photos. Automating extraction from these documents eliminates manual data entry in accounts payable workflows.
Handling Forms and Applications
Insurance claims, tax forms, HR documents, and customer applications are frequently scanned. Extracting structured data from these documents speeds up processing and reduces errors.
Try It Yourself
Upload a scanned document in the Playground — even a photo from your phone works. Define a schema and see the extraction results in seconds.
For integration details, see the API documentation.
