Convert PDF to CSV: Extract Tabular Data via API
You have a stack of PDFs — invoices, reports, statements — and you need that data in a spreadsheet. Manually copying rows from PDFs into Excel is painful and error-prone. Here's how to automate it: extract structured JSON from PDFs via API, then convert to CSV.
The Pipeline: PDF → JSON → CSV
There's no direct "PDF to CSV" converter that works reliably, because PDFs don't have a concept of rows and columns. The reliable approach is:
- Extract structured JSON from the PDF using a schema
- Convert the JSON to CSV with a simple script
The schema-based extraction step is what makes this work — it understands the document and returns clean, typed data that maps directly to spreadsheet columns.
Step 1: Extract Data from the PDF
Define a schema for the data you want as CSV columns:
{
  "type": "object",
  "properties": {
    "rows": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "date": { "type": "string", "format": "date" },
          "description": { "type": "string" },
          "category": { "type": "string" },
          "amount": { "type": "number" }
        }
      }
    }
  }
}
Upload the PDF:
curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@expense-report.pdf" \
  -F "schemaId=YOUR_SCHEMA_ID"
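If you'd rather upload from Python than the shell, the same request can be sketched with the `requests` library. This mirrors the curl call above; `YOUR_API_KEY` and `YOUR_SCHEMA_ID` are placeholders, and `extract_pdf` is just an illustrative wrapper name:

```python
import requests

# Same endpoint and form fields as the curl example above
API_URL = "https://api.smole.tech/api/pipeline/file"

def extract_pdf(path, api_key, schema_id):
    """Upload a PDF and return the structured JSON extracted for the schema."""
    with open(path, "rb") as f:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data={"schemaId": schema_id},
        )
    resp.raise_for_status()  # surface HTTP errors instead of silently continuing
    return resp.json()
```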
Get JSON back:
{
  "rows": [
    { "date": "2025-11-01", "description": "Office supplies", "category": "Operations", "amount": 234.50 },
    { "date": "2025-11-03", "description": "Client dinner", "category": "Entertainment", "amount": 187.00 },
    { "date": "2025-11-08", "description": "Software license", "category": "IT", "amount": 599.00 }
  ]
}
Step 2: Convert JSON to CSV
Python
import csv
import json

# api_response is the raw JSON string returned by the API
data = json.loads(api_response)
rows = data["rows"]

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
JavaScript (Node.js)
const fs = require("fs");

// data is the parsed JSON response from the API
const rows = data.rows;
const headers = Object.keys(rows[0]);
const csv = [
  headers.join(","),
  ...rows.map(row => headers.map(h => JSON.stringify(row[h] ?? "")).join(","))
].join("\n");

fs.writeFileSync("output.csv", csv);
Command Line (with jq)
echo "date,description,category,amount" > output.csv
echo "$JSON_RESPONSE" | jq -r '.rows[] | [.date, .description, .category, .amount] | @csv' >> output.csv
Real-World Example: Invoices to CSV
Extract invoice data and flatten to spreadsheet rows:
import csv

def invoices_to_csv(extracted_invoices, output_path):
    """Convert extracted invoice JSON to CSV with one row per line item."""
    rows = []
    for inv in extracted_invoices:
        for item in inv.get("line_items", []):
            rows.append({
                "invoice_number": inv.get("invoice_number", ""),
                "vendor": inv.get("vendor", {}).get("name", ""),
                "date": inv.get("date", ""),
                "item": item.get("description", ""),
                "quantity": item.get("quantity", 0),
                "unit_price": item.get("unit_price", 0),
                "total": item.get("total", 0),
                "invoice_total": inv.get("total", 0),
            })
    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

invoices_to_csv(all_invoices, "invoices.csv")
Output CSV:
invoice_number,vendor,date,item,quantity,unit_price,total,invoice_total
INV-2025-001,Acme Corp,2025-11-15,Widget Pro,5,24.99,124.95,148.69
INV-2025-001,Acme Corp,2025-11-15,Shipping,1,23.74,23.74,148.69
INV-2025-002,TechParts GmbH,2025-11-20,Server RAM,4,189.00,756.00,899.64
Batch Processing: Many PDFs to One CSV
Process an entire folder of PDFs into a single CSV:
import csv
from pathlib import Path

all_rows = []
for pdf in Path("./invoices").glob("*.pdf"):
    result = extract(str(pdf), schema_id)  # Your extraction function
    for item in result.get("line_items", []):
        all_rows.append({
            "file": pdf.name,
            "vendor": result.get("vendor", {}).get("name", ""),
            "date": result.get("date", ""),
            "item": item.get("description", ""),
            "amount": item.get("total", 0),
        })

# Write combined CSV
with open("all_invoices.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=all_rows[0].keys())
    writer.writeheader()
    writer.writerows(all_rows)
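In a real batch, one malformed PDF shouldn't abort the whole run. A defensive variant of the loop above can be sketched like this; `extract` is still your extraction function from the snippet above, and the column set is trimmed for brevity:

```python
from pathlib import Path

def collect_rows(folder, extract, schema_id):
    """Run extraction over every PDF in a folder, skipping files that fail."""
    rows, failed = [], []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            result = extract(str(pdf), schema_id)
        except Exception as err:
            failed.append((pdf.name, str(err)))  # record the failure and move on
            continue
        for item in result.get("line_items", []):
            rows.append({
                "file": pdf.name,
                "item": item.get("description", ""),
                "amount": item.get("total", 0),
            })
    return rows, failed
```

The `failed` list lets you retry or inspect problem files after the batch completes instead of losing the entire run.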
Handling Different Document Types
The same PDF-to-CSV approach works for any document type — just adjust the schema:
- Bank statements → Date, Description, Debit, Credit, Balance
- Inventory lists → SKU, Name, Quantity, Location, Price
- Employee records → Name, ID, Department, Start Date, Salary
- Purchase orders → PO Number, Vendor, Item, Quantity, Total
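For instance, a bank-statement schema might look like this (a sketch following the same shape as the expense schema in Step 1, with the column names from the list above):

```json
{
  "type": "object",
  "properties": {
    "transactions": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "date": { "type": "string", "format": "date" },
          "description": { "type": "string" },
          "debit": { "type": "number" },
          "credit": { "type": "number" },
          "balance": { "type": "number" }
        }
      }
    }
  }
}
```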
Tips
- Design your schema for flat output — CSV works best with flat rows, so keep your schema's array items simple (no deep nesting)
- Use format: "date" for date columns — this ensures consistent ISO date formatting in your CSV
- Handle missing data — some documents won't have every field; default to empty strings or 0 in your CSV conversion script
- Test with one document first — Verify the schema produces the columns you want before running a batch
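The missing-data tip can be sketched as a small normalization helper that guarantees every row has the same columns before it reaches `csv.DictWriter` (the function name and defaults are illustrative):

```python
def normalize_row(row, columns, defaults=None):
    """Fill missing fields so every CSV row has the same columns."""
    defaults = defaults or {}
    # Unknown keys are dropped; missing keys fall back to a default or ""
    return {col: row.get(col, defaults.get(col, "")) for col in columns}
```

For example, `normalize_row({"date": "2025-11-01"}, ["date", "amount"], {"amount": 0})` returns `{"date": "2025-11-01", "amount": 0}`, so a document missing the amount still produces a well-formed row.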
Try It Now
Upload a PDF in the Playground, define a schema matching the table columns, and see the structured JSON output. From there, converting to CSV is a few lines of code.
For API details, see the documentation.
Related articles
How to Convert PDFs to JSON with an API
A practical guide to converting PDF documents into structured JSON data using a REST API. Covers digital PDFs, scanned documents, and batch processing.
How to Extract Tables from PDFs into Structured Data
Extract tables from PDF documents into structured JSON or CSV. Handle multi-column layouts, merged cells, and inconsistent formatting with schema-based extraction.
Extract Structured Data from Documents with Python
How to extract structured JSON data from PDFs, scanned documents, and Word files using Python. Complete code examples with requests, error handling, and batch processing.
