How to Convert PDFs to JSON with an API
PDFs are everywhere — invoices, contracts, reports, forms — but the data inside them is trapped. You can't query a PDF. You can't pipe it into a database. You need that data as structured JSON.
This guide covers how to convert PDFs to JSON using a REST API, from simple text-based PDFs to scanned documents that need OCR.
Why PDF to JSON?
PDFs are designed for display, not for data. That becomes a problem when you need to:
- Pull line items from invoices into your accounting system
- Extract terms and dates from contracts into a database
- Feed report data into dashboards or analytics tools
- Automate data entry from scanned forms
You need a way to turn that PDF into structured data your code can work with.
The Two Types of PDFs
Not all PDFs are equal. How you handle them depends on what's inside:
Digital (text-based) PDFs
These PDFs contain actual text data. You can select and copy text from them. They're created by software like Word, Google Docs, or reporting tools.
Processing: Text is extracted directly — fast and accurate.
Scanned PDFs (image-based)
These are essentially images wrapped in a PDF. They come from scanners, phone cameras, or fax machines. You can't select text in them.
Processing: OCR (Optical Character Recognition) converts the image to text first, then extraction happens on the OCR output.
Smole handles both types automatically. Upload any PDF, and the system detects whether OCR is needed.
Converting a PDF to JSON: Step by Step
1. Create a Schema
First, define what data you want to extract. A JSON Schema tells the API what fields to look for:
{
  "type": "object",
  "properties": {
    "document_title": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "author": { "type": "string" },
    "sections": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "heading": { "type": "string" },
          "content": { "type": "string" }
        }
      }
    }
  }
}
Register it via the API:
curl -X POST https://api.smole.tech/api/schemas \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{ "name": "report-schema", "schema": { ... } }'
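The same registration could be done from JavaScript. This is a minimal sketch, assuming Node 18+ (global fetch). The helper name registerSchema and its injectable fetchImpl parameter are illustrative, not part of the API. The response format is not shown in this guide, so the parsed JSON is returned as-is rather than assuming which field holds the schema ID.

```javascript
// Hypothetical helper: registers a JSON Schema under a name.
// Endpoint, headers, and body shape mirror the curl example above.
async function registerSchema(name, schema, apiKey, fetchImpl = fetch) {
  const response = await fetchImpl('https://api.smole.tech/api/schemas', {
    method: 'POST',
    headers: {
      'Authorization': `Bearer ${apiKey}`,
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({ name, schema })
  });
  if (!response.ok) {
    throw new Error(`Schema registration failed with HTTP ${response.status}`);
  }
  // Assumption: the response body is JSON. Check the API documentation
  // for the field that carries the schema ID.
  return response.json();
}

// The report schema from step 1, written as a JavaScript object.
const reportSchema = {
  type: 'object',
  properties: {
    document_title: { type: 'string' },
    date: { type: 'string', format: 'date' },
    author: { type: 'string' },
    sections: {
      type: 'array',
      items: {
        type: 'object',
        properties: {
          heading: { type: 'string' },
          content: { type: 'string' }
        }
      }
    }
  }
};

// Usage (performs a real network call, so it is left commented out):
// registerSchema('report-schema', reportSchema, 'YOUR_API_KEY').then(console.log);
```

Passing fetchImpl as a parameter is only a convenience for testing the helper without network access.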
2. Upload and Extract
Send your PDF to the pipeline endpoint:
curl -X POST https://api.smole.tech/api/pipeline/file \
  -F "file=@report.pdf" \
  -F "schemaId=YOUR_SCHEMA_ID" \
  -H "Authorization: Bearer YOUR_API_KEY"
This single call handles everything: conversion, OCR (if needed), and extraction.
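For use from code rather than the command line, the upload might look like the sketch below. It assumes Node 18+ (global fetch, FormData, and Blob). The helper name uploadPdf and the fetchImpl parameter are illustrative, and the response is returned unparsed beyond JSON decoding because its fields are not described here.

```javascript
// Hypothetical helper: uploads PDF bytes and starts extraction.
// Field names (file, schemaId) and the endpoint mirror the curl example above.
async function uploadPdf(pdfBytes, fileName, schemaId, apiKey, fetchImpl = fetch) {
  const form = new FormData();
  form.append('file', new Blob([pdfBytes], { type: 'application/pdf' }), fileName);
  form.append('schemaId', schemaId);

  // Content-Type is not set manually: fetch adds the multipart boundary itself.
  const response = await fetchImpl('https://api.smole.tech/api/pipeline/file', {
    method: 'POST',
    headers: { 'Authorization': `Bearer ${apiKey}` },
    body: form
  });
  if (!response.ok) {
    throw new Error(`Upload failed with HTTP ${response.status}`);
  }
  // Assumption: the response is JSON and includes the pipeline ID used for polling.
  return response.json();
}
```

In Node, the bytes can come from fs.promises.readFile('report.pdf'); in a browser, a File from an upload input can be appended to the form directly.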
3. Get Your JSON
Poll for the result:
curl https://api.smole.tech/api/pipeline/PIPELINE_ID \
  -H "Authorization: Bearer YOUR_API_KEY"
You'll get back clean JSON matching your schema:
{
  "document_title": "Q4 Sales Report",
  "date": "2025-12-31",
  "author": "Finance Team",
  "sections": [
    {
      "heading": "Revenue Summary",
      "content": "Total revenue for Q4 was €2.4M, a 15% increase over Q3..."
    },
    {
      "heading": "Regional Breakdown",
      "content": "DACH region contributed 45% of total revenue..."
    }
  ]
}
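The polling step can be wrapped in a small loop. The sketch below assumes Node 18+ and makes no assumption about how the API signals completion: the caller supplies an isDone predicate. The helper name pollPipeline, the option names, and the default interval and attempt count are all illustrative.

```javascript
// Resolves after the given number of milliseconds.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

// Hypothetical helper: polls the pipeline endpoint until isDone(body) is true.
// isDone receives the parsed JSON response and decides whether to stop.
async function pollPipeline(pipelineId, apiKey, isDone, options = {}) {
  const { intervalMs = 2000, maxAttempts = 30, fetchImpl = fetch } = options;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const response = await fetchImpl(
      `https://api.smole.tech/api/pipeline/${encodeURIComponent(pipelineId)}`,
      { headers: { 'Authorization': `Bearer ${apiKey}` } }
    );
    if (!response.ok) {
      throw new Error(`Polling failed with HTTP ${response.status}`);
    }
    const body = await response.json();
    if (isDone(body)) return body;
    // Wait before the next attempt, but not after the last one.
    if (attempt < maxAttempts) await sleep(intervalMs);
  }
  throw new Error(`Pipeline ${pipelineId} not finished after ${maxAttempts} attempts`);
}

// Usage (the "status" field and its value are hypothetical; check the API docs):
// pollPipeline('PIPELINE_ID', 'YOUR_API_KEY', body => body.status === 'completed');
```

Capping the number of attempts keeps a stuck document from blocking the caller indefinitely.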
Handling Different PDF Types
Invoices
{
  "type": "object",
  "properties": {
    "vendor_name": { "type": "string" },
    "invoice_number": { "type": "string" },
    "date": { "type": "string", "format": "date" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string" },
          "quantity": { "type": "number" },
          "unit_price": { "type": "number" },
          "total": { "type": "number" }
        }
      }
    },
    "subtotal": { "type": "number" },
    "tax": { "type": "number" },
    "total": { "type": "number" }
  }
}
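Because this schema captures both line items and totals, the extracted numbers can be cross-checked before they reach an accounting system. The sketch below is one possible sanity check, not part of the API. It assumes the simple relations quantity × unit_price = line total, sum of line totals = subtotal, and subtotal + tax = total, which will not hold for invoices with discounts, shipping, or other adjustments. The function names and the half-cent tolerance are illustrative.

```javascript
// Compare two amounts with a small tolerance to absorb rounding.
function approxEqual(a, b, tolerance = 0.005) {
  return Math.abs(a - b) <= tolerance;
}

// Returns a list of human-readable problems; an empty list means the
// extracted numbers are internally consistent under the assumed relations.
function checkInvoiceTotals(invoice) {
  const problems = [];
  const items = invoice.line_items ?? [];

  items.forEach((item, index) => {
    if (!approxEqual(item.quantity * item.unit_price, item.total)) {
      problems.push(`line_items[${index}]: quantity * unit_price does not match total`);
    }
  });

  const itemSum = items.reduce((sum, item) => sum + item.total, 0);
  if (!approxEqual(itemSum, invoice.subtotal)) {
    problems.push('subtotal does not match the sum of line item totals');
  }
  if (!approxEqual(invoice.subtotal + invoice.tax, invoice.total)) {
    problems.push('total does not match subtotal + tax');
  }
  return problems;
}
```

A non-empty result does not necessarily mean the extraction is wrong; it flags documents worth a manual look.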
Contracts
{
  "type": "object",
  "properties": {
    "parties": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "role": { "type": "string" }
        }
      }
    },
    "effective_date": { "type": "string", "format": "date" },
    "term_months": { "type": "integer" },
    "governing_law": { "type": "string" },
    "key_obligations": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
Reports and Forms
The same approach works for any document type. Define the fields you care about, and the extraction handles the rest — regardless of layout or formatting.
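Since the approach is schema-driven, a light check on the returned JSON can be generic too. The sketch below walks a result against the small subset of JSON Schema used in this guide (type, properties, items, and "format": "date"). It is an illustration, not a replacement for a full validator such as Ajv, and the function name findTypeMismatches is hypothetical.

```javascript
// Checks a value against a small subset of JSON Schema: "type" (object, array,
// string, number, integer), "properties", "items", and "format": "date".
// Returns a list of mismatch descriptions; an empty list means no mismatch found.
function findTypeMismatches(value, schema, path = '$') {
  const mismatches = [];
  const typeOk = {
    object: v => v !== null && typeof v === 'object' && !Array.isArray(v),
    array: v => Array.isArray(v),
    string: v => typeof v === 'string',
    number: v => typeof v === 'number' && Number.isFinite(v),
    integer: v => Number.isInteger(v)
  }[schema.type];

  if (typeOk && !typeOk(value)) {
    mismatches.push(`${path}: expected ${schema.type}`);
    return mismatches;
  }
  if (schema.type === 'string' && schema.format === 'date' &&
      !/^\d{4}-\d{2}-\d{2}$/.test(value)) {
    mismatches.push(`${path}: expected a YYYY-MM-DD date`);
  }
  if (schema.type === 'object') {
    for (const [key, subSchema] of Object.entries(schema.properties ?? {})) {
      // Missing fields are skipped: the schemas in this guide mark nothing as required.
      if (value[key] !== undefined) {
        mismatches.push(...findTypeMismatches(value[key], subSchema, `${path}.${key}`));
      }
    }
  }
  if (schema.type === 'array' && schema.items) {
    value.forEach((item, i) => {
      mismatches.push(...findTypeMismatches(item, schema.items, `${path}[${i}]`));
    });
  }
  return mismatches;
}
```

The date check only tests the YYYY-MM-DD shape, so a value like 2025-13-40 would still pass.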
Batch Processing
For high-volume scenarios, process multiple PDFs through the same pipeline:
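import { readFile } from 'node:fs/promises';
// API_KEY and schemaId are assumed to be defined earlier (for example, read from configuration).
const files = ['invoice1.pdf', 'invoice2.pdf', 'invoice3.pdf'];
// Build the multipart body with the same fields as the curl upload example.
async function createFormData(file, schemaId) {
  const form = new FormData();
  form.append('file', new Blob([await readFile(file)]), file);
  form.append('schemaId', schemaId);
  return form;
}
// Promise.allSettled, unlike Promise.all, does not reject the whole batch when one upload fails.
const results = await Promise.allSettled(
  files.map(async file =>
    fetch('https://api.smole.tech/api/pipeline/file', {
      method: 'POST',
      headers: { 'Authorization': `Bearer ${API_KEY}` },
      body: await createFormData(file, schemaId)
    })
  )
);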
Each document is processed independently, so failures in one don't affect the others.
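Starting every upload at once is fine for three files, but for larger batches it may be worth capping how many requests are in flight. The sketch below is a generic concurrency limiter, not an API feature; whether the service enforces rate limits is not stated here. The name mapWithConcurrency is illustrative, and worker stands for any async function, such as one that uploads a single file.

```javascript
// Runs worker(item, index) for every item with at most `limit` calls in flight.
// Returns results in input order, shaped like Promise.allSettled entries, so a
// failure in one item does not stop the others.
async function mapWithConcurrency(items, limit, worker) {
  const results = new Array(items.length);
  let nextIndex = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane() {
    while (nextIndex < items.length) {
      const index = nextIndex++;
      try {
        results[index] = { status: 'fulfilled', value: await worker(items[index], index) };
      } catch (reason) {
        results[index] = { status: 'rejected', reason };
      }
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, runLane));
  return results;
}

// Usage (uploadOne is a hypothetical function that uploads a single file):
// const results = await mapWithConcurrency(files, 5, file => uploadOne(file));
```

Because JavaScript runs these lanes on a single thread, claiming an index with nextIndex++ needs no locking.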
Tips for Better Results
- Be specific with field names: invoice_date works better than date1 because the name gives context about what to look for
- Match your schema to the document: if the document has a table, use an array in your schema
- Use appropriate types: "type": "number" for amounts, "format": "date" for dates
- Start simple: extract the most important fields first, then expand your schema
Try It Now
Upload a PDF in the Playground and see the extraction results in seconds. No setup required — just drag, drop, and define your schema.
For API integration details, see the documentation.
