Getting Started with Document Extraction
Document extraction is the process of converting unstructured documents—PDFs, images, scanned files—into structured, machine-readable data. With Smole, this process takes minutes instead of weeks.
Why Document Extraction Matters
Every day, businesses process thousands of documents: invoices, receipts, contracts, forms, and reports. Traditionally, extracting data from these documents required either manual data entry (slow and error-prone) or building complex custom pipelines (expensive and time-consuming).
AI-powered extraction removes that trade-off: you describe the fields you want, and the model finds them in the document.
How Smole Works
Smole uses a three-step process:
- Upload your document - Send any PDF, image, or scanned document to our API
- Define your schema - Tell us what data you want to extract using a simple JSON schema
- Get structured data - Receive clean, validated JSON that matches your schema
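The three steps above can be sketched in Python as follows. This is a minimal illustration, not Smole's documented client: the endpoint URL, the `document` and `schema` payload field names, the base64 encoding, and the bearer-token header are all assumptions made for the example. Check the API documentation for the real values.

```python
import base64
import json
import urllib.request

# Hypothetical endpoint: the article does not give the real URL.
API_URL = "https://api.example.com/v1/extract"


def build_extraction_request(document: bytes, schema: dict, api_key: str) -> urllib.request.Request:
    """Bundle steps 1 and 2 (document + schema) into a single HTTP request."""
    payload = {
        # Assumed field name; the binary file is base64-encoded so it fits in JSON.
        "document": base64.b64encode(document).decode("ascii"),
        "schema": schema,
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",  # assumed auth scheme
        },
        method="POST",
    )


def extract(document: bytes, schema: dict, api_key: str) -> dict:
    """Step 3: send the request and parse the structured JSON response."""
    request = build_extraction_request(document, schema, api_key)
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

Keeping request construction separate from the network call makes the payload easy to inspect and test before anything is sent.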
Your First Extraction
Here's a simple example extracting data from an invoice:
{
  "schema": {
    "vendor_name": "string",
    "invoice_number": "string",
    "total_amount": "number",
    "line_items": [
      {
        "description": "string",
        "quantity": "number",
        "unit_price": "number"
      }
    ]
  }
}
The API will return structured data like:
{
  "vendor_name": "Acme Corp",
  "invoice_number": "INV-2024-001",
  "total_amount": 1250.00,
  "line_items": [
    {
      "description": "Widget Pro",
      "quantity": 5,
      "unit_price": 250.00
    }
  ]
}
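To make the relationship between the schema and the response concrete, here is a small Python sketch that checks whether a result has the shape the schema describes. The `conforms` helper is hypothetical and is not part of any Smole SDK. It assumes only the conventions visible in the example above: `"string"` and `"number"` as type names, nested objects, and a one-element list describing array items.

```python
def conforms(value, schema) -> bool:
    """Return True if `value` has the shape described by `schema`."""
    if schema == "string":
        return isinstance(value, str)
    if schema == "number":
        # bool is a subclass of int in Python, so exclude it explicitly.
        return isinstance(value, (int, float)) and not isinstance(value, bool)
    if isinstance(schema, list) and len(schema) == 1:
        # A one-element list describes the shape of every item in an array.
        return isinstance(value, list) and all(conforms(item, schema[0]) for item in value)
    if isinstance(schema, dict):
        # Every field named in the schema must be present and well-shaped.
        return isinstance(value, dict) and all(
            key in value and conforms(value[key], sub) for key, sub in schema.items()
        )
    return False


schema = {
    "vendor_name": "string",
    "invoice_number": "string",
    "total_amount": "number",
    "line_items": [
        {"description": "string", "quantity": "number", "unit_price": "number"}
    ],
}

result = {
    "vendor_name": "Acme Corp",
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.00,
    "line_items": [
        {"description": "Widget Pro", "quantity": 5, "unit_price": 250.00}
    ],
}

print(conforms(result, schema))
```

A check like this catches a response where, for instance, `total_amount` comes back as the string "1250.00" instead of a number.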
Best Practices
- Start simple - Begin with a few key fields, then expand your schema
- Use descriptive field names - The AI uses field names as hints for extraction
- Test with varied documents - Ensure your schema works across different document formats
- Validate your output - Always verify extracted data for critical workflows
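One way to act on the last point for invoices is an arithmetic cross-check: compare the sum of the line items against the extracted total. The sketch below is an illustration under stated assumptions, not a Smole feature. The function name and the 0.01 tolerance are hypothetical, and the field names come from the invoice example earlier in this article.

```python
from decimal import Decimal


def line_items_match_total(result: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Check that sum(quantity * unit_price) agrees with total_amount."""
    # Convert through str() so float values such as 250.0 become exact decimals.
    line_total = sum(
        (
            Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"]))
            for item in result["line_items"]
        ),
        Decimal("0"),
    )
    return abs(line_total - Decimal(str(result["total_amount"]))) <= tolerance


invoice = {
    "vendor_name": "Acme Corp",
    "invoice_number": "INV-2024-001",
    "total_amount": 1250.00,
    "line_items": [
        {"description": "Widget Pro", "quantity": 5, "unit_price": 250.00}
    ],
}

print(line_items_match_total(invoice))
```

Real invoices often include tax, shipping, or discounts that are not listed as line items, so a mismatch is best treated as a prompt for manual review rather than proof of an extraction error.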
Next Steps
Ready to try it yourself? Head to the Playground to test document extraction with your own files, or check out our API documentation for integration details.
