Getting Started with Document Extraction

January 15, 2024 · Smole Team

Document extraction is the process of converting unstructured documents—PDFs, images, scanned files—into structured, machine-readable data. With Smole, this process takes minutes instead of weeks.

Why Document Extraction Matters

Every day, businesses process thousands of documents: invoices, receipts, contracts, forms, and reports. Traditionally, extracting data from these documents required either manual data entry (slow and error-prone) or building complex custom pipelines (expensive and time-consuming).

Modern AI-powered extraction removes much of that trade-off: you describe the fields you want, and the model finds them in the document.

How Smole Works

Smole uses a three-step process:

  1. Upload your document - Send any PDF, image, or scanned document to our API
  2. Define your schema - Tell us what data you want to extract using a simple JSON schema
  3. Get structured data - Receive clean, validated JSON that matches your schema
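
The three steps above could be sketched in Python roughly as follows. This is a minimal sketch and not Smole's documented client: the payload keys (`document`, `schema`), the base64 encoding, and the helper name `build_extraction_request` are all assumptions made for illustration, so take the real request format from the API documentation. The sketch only builds a request body and never sends it.

```python
import base64
import json


def build_extraction_request(document_bytes: bytes, schema: dict) -> str:
    """Bundle a document and a schema into one JSON request body.

    The "document" and "schema" keys are hypothetical; they are not taken
    from Smole's API reference.
    """
    payload = {
        # Step 1: the uploaded file, base64-encoded so it fits inside JSON.
        "document": base64.b64encode(document_bytes).decode("ascii"),
        # Step 2: the fields to extract and their expected types.
        "schema": schema,
    }
    return json.dumps(payload)


schema = {"vendor_name": "string", "total_amount": "number"}
body = build_extraction_request(b"%PDF-1.7 example bytes", schema)
# Step 3 would be sending `body` to the API and parsing the JSON reply.
```

Keeping payload construction separate from the network call makes it easy to inspect or log exactly what would be sent before wiring in a real HTTP client.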

Your First Extraction

Here's a simple schema for extracting data from an invoice:

{
  "schema": {
    "vendor_name": "string",
    "invoice_number": "string",
    "total_amount": "number",
    "line_items": [
      {
        "description": "string",
        "quantity": "number",
        "unit_price": "number"
      }
    ]
  }
}

The API will return structured data like:

{
  "vendor_name": "Acme Corp",
  "invoice_number": "INV-2024-001",
  "total_amount": 1250.00,
  "line_items": [
    {
      "description": "Widget Pro",
      "quantity": 5,
      "unit_price": 250.00
    }
  ]
}
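
If you want to confirm on your side that a response has the shape you asked for, a small recursive check is enough for the shorthand used in this post. The sketch below is our own illustration, not part of Smole: it assumes only the `"string"` and `"number"` type names, nested objects, and one-element lists shown above, and it treats missing or extra keys as a mismatch.

```python
def conforms(value, schema) -> bool:
    """Return True if `value` has the shape described by `schema`."""
    if schema == "string":
        return isinstance(value, str)
    if schema == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if isinstance(schema, list) and len(schema) == 1:
        # A one-element list describes the shape of every item in an array.
        return isinstance(value, list) and all(conforms(v, schema[0]) for v in value)
    if isinstance(schema, dict):
        return (
            isinstance(value, dict)
            and value.keys() == schema.keys()
            and all(conforms(value[k], s) for k, s in schema.items())
        )
    return False  # Unknown schema construct.


invoice_schema = {
    "vendor_name": "string",
    "invoice_number": "string",
    "total_amount": "number",
    "line_items": [
        {"description": "string", "quantity": "number", "unit_price": "number"}
    ],
}
result = {
    "vendor_name": "Acme Corp",
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.00,
    "line_items": [
        {"description": "Widget Pro", "quantity": 5, "unit_price": 250.00}
    ],
}
ok = conforms(result, invoice_schema)
```

A stricter or looser policy (for example, allowing optional fields) would need a richer schema format than the shorthand shown here.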

Best Practices

  • Start simple - Begin with a few key fields, then expand your schema
  • Use descriptive field names - The AI uses field names as hints for extraction
  • Test with varied documents - Ensure your schema works across different document formats
  • Validate your output - Always verify extracted data for critical workflows
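
As one illustration of validating output, the invoice fields used earlier allow a simple arithmetic cross-check: the line items should add up to the stated total. This is a sketch under our own assumptions (the function names and the 0.01 tolerance are hypothetical), and a mismatch is a reason to review the document, not proof of an extraction error, since real invoices often include tax, shipping, or discounts.

```python
from decimal import Decimal


def line_items_total(invoice: dict) -> Decimal:
    """Sum quantity * unit_price over all line items, using exact decimals."""
    return sum(
        (
            Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
            for item in invoice["line_items"]
        ),
        Decimal("0"),
    )


def total_is_consistent(invoice: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Check that the stated total matches the line items within a tolerance."""
    stated = Decimal(str(invoice["total_amount"]))
    return abs(stated - line_items_total(invoice)) <= tolerance


invoice = {
    "vendor_name": "Acme Corp",
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.00,
    "line_items": [
        {"description": "Widget Pro", "quantity": 5, "unit_price": 250.00}
    ],
}
consistent = total_is_consistent(invoice)
```

Converting through `str` before building each `Decimal` avoids carrying binary floating-point rounding into the comparison.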

Next Steps

Ready to try it yourself? Head to the Playground to test document extraction with your own files, or check out our API documentation for integration details.