Getting Started with Document Extraction

Document extraction is the process of converting unstructured documents—PDFs, images, scanned files—into structured, machine-readable data. With Smole, this process takes minutes instead of weeks.

Why Document Extraction Matters

Every day, businesses process thousands of documents: invoices, receipts, contracts, forms, and reports. Traditionally, extracting data from these documents required either manual data entry (slow and error-prone) or building complex custom pipelines (expensive and time-consuming).

AI-powered extraction offers a third option: you describe the fields you need, and a model reads the document and returns them as structured data.

How Smole Works

Smole uses a three-step process:

1. Upload your document - Send any PDF, image, or scanned document to our API
2. Define your schema - Tell us what data you want to extract using a simple JSON schema
3. Get structured data - Receive clean, validated JSON that matches your schema
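
As a rough illustration, the three steps above could be sketched in Python as follows. This is a minimal sketch, not the actual Smole client: the "document" field name, the base64 encoding, and both helper functions are assumptions made for the example, and the network call is left out because the endpoint and authentication are not described in this article.

```python
import base64
import json


def build_extraction_request(document_bytes: bytes, schema: dict) -> str:
    """Steps 1 and 2: bundle the document and the schema into one JSON body."""
    payload = {
        # Assumed field name; the bytes are base64-encoded so they fit in JSON.
        "document": base64.b64encode(document_bytes).decode("ascii"),
        "schema": schema,
    }
    return json.dumps(payload)


def parse_extraction_response(body: str) -> dict:
    """Step 3: decode the JSON text returned by the service into a dict."""
    return json.loads(body)


# Placeholder bytes stand in for a real PDF file read from disk.
schema = {"vendor_name": "string", "total_amount": "number"}
request_body = build_extraction_request(b"%PDF-1.7 example bytes", schema)
```

Check the API documentation for the real request format before relying on any of these names.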

Your First Extraction

Here's a simple example. To extract data from an invoice, send a schema that lists the fields you want and their types:

{
  "schema": {
    "vendor_name": "string",
    "invoice_number": "string",
    "total_amount": "number",
    "line_items": [
      {
        "description": "string",
        "quantity": "number",
        "unit_price": "number"
      }
    ]
  }
}

The API will return structured data like:

{
  "vendor_name": "Acme Corp",
  "invoice_number": "INV-2024-001",
  "total_amount": 1250.00,
  "line_items": [
    {
      "description": "Widget Pro",
      "quantity": 5,
      "unit_price": 250.00
    }
  ]
}
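
One way to confirm that a response has the shape the schema asked for is a small recursive check. The sketch below assumes the shorthand used in this example ("string", "number", a one-element list for repeated items, and nested objects) and assumes every field is required; `matches_schema` is a hypothetical helper, not part of the Smole API.

```python
def matches_schema(value, schema) -> bool:
    """Return True if value has the shape described by the shorthand schema."""
    if schema == "string":
        return isinstance(value, str)
    if schema == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if isinstance(schema, list) and len(schema) == 1:
        # A one-element list describes the shape of every item.
        return isinstance(value, list) and all(
            matches_schema(item, schema[0]) for item in value
        )
    if isinstance(schema, dict):
        # Assumption: every schema field must be present, with no extras.
        return (
            isinstance(value, dict)
            and value.keys() == schema.keys()
            and all(matches_schema(value[key], sub) for key, sub in schema.items())
        )
    return False


invoice_schema = {
    "vendor_name": "string",
    "invoice_number": "string",
    "total_amount": "number",
    "line_items": [
        {"description": "string", "quantity": "number", "unit_price": "number"}
    ],
}

invoice = {
    "vendor_name": "Acme Corp",
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.00,
    "line_items": [
        {"description": "Widget Pro", "quantity": 5, "unit_price": 250.00}
    ],
}

is_valid = matches_schema(invoice, invoice_schema)
```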

Best Practices

Start simple - Begin with a few key fields, then expand your schema
Use descriptive field names - The AI uses field names as hints for extraction
Test with varied documents - Ensure your schema works across different document formats
Validate your output - Always verify extracted data for critical workflows
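
For the last practice, a simple cross-check on the invoice example might look like the sketch below. It assumes the total is exactly the sum of quantity times unit price; real invoices often add tax, shipping, or discounts, so treat `line_items_match_total` as a hypothetical starting point and not a complete rule.

```python
import math


def line_items_match_total(invoice: dict, tolerance: float = 0.01) -> bool:
    """Check that the line items add up to the extracted total."""
    expected = sum(
        item["quantity"] * item["unit_price"] for item in invoice["line_items"]
    )
    # Compare with a small absolute tolerance to absorb rounding differences.
    return math.isclose(expected, invoice["total_amount"], abs_tol=tolerance)


extracted = {
    "vendor_name": "Acme Corp",
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.00,
    "line_items": [
        {"description": "Widget Pro", "quantity": 5, "unit_price": 250.00}
    ],
}

totals_agree = line_items_match_total(extracted)
```

A failed check does not prove the extraction is wrong, but it is a useful signal that a document deserves manual review.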

Next Steps

Ready to try it yourself? Head to the Playground to test document extraction with your own files, or check out our API documentation for integration details.
