Convert PDF to CSV: Extract Tabular Data via API
You have a stack of PDFs — invoices, reports, statements — and you need that data in a spreadsheet. Manually copying rows from PDFs into Excel is painful and error-prone. Here's how to automate it: extract structured JSON from PDFs via API, then convert to CSV.
The Pipeline: PDF → JSON → CSV
There's no direct "PDF to CSV" converter that works reliably, because PDFs don't have a concept of rows and columns. The reliable approach is:
- Extract structured JSON from the PDF using a schema
- Convert the JSON to CSV with a simple script
The schema-based extraction step is what makes this work — it understands the document and returns clean, typed data that maps directly to spreadsheet columns.
Step 1: Extract Data from the PDF
Define a schema for the data you want as CSV columns:
{
  "type": "object",
  "properties": {
    "rows": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "date": { "type": "string", "format": "date" },
          "description": { "type": "string" },
          "category": { "type": "string" },
          "amount": { "type": "number" }
        }
      }
    }
  }
}
Upload the PDF:
curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@expense-report.pdf" \
  -F "schemaId=YOUR_SCHEMA_ID"
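If you'd rather upload from Python than the shell, the same request can be sketched with the `requests` library. This mirrors the curl call above; `YOUR_API_KEY` and `YOUR_SCHEMA_ID` are placeholders, and `extract_pdf` is just an illustrative wrapper name:

```python
import requests

# Same endpoint and form fields as the curl example above
API_URL = "https://api.smole.tech/api/pipeline/file"

def extract_pdf(path, api_key, schema_id):
    """Upload a PDF and return the structured JSON extracted for the schema."""
    with open(path, "rb") as f:
        resp = requests.post(
            API_URL,
            headers={"Authorization": f"Bearer {api_key}"},
            files={"file": f},
            data={"schemaId": schema_id},
        )
    resp.raise_for_status()  # surface HTTP errors instead of silently continuing
    return resp.json()
```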
Get JSON back:
{
  "rows": [
    { "date": "2025-11-01", "description": "Office supplies", "category": "Operations", "amount": 234.50 },
    { "date": "2025-11-03", "description": "Client dinner", "category": "Entertainment", "amount": 187.00 },
    { "date": "2025-11-08", "description": "Software license", "category": "IT", "amount": 599.00 }
  ]
}
Step 2: Convert JSON to CSV
Python
import csv
import json

# api_response is the raw JSON string returned by the API
data = json.loads(api_response)
rows = data["rows"]

with open("output.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()
    writer.writerows(rows)
JavaScript (Node.js)
const fs = require("fs");

// data is the parsed JSON response from the API
const rows = data.rows;
const headers = Object.keys(rows[0]);
const csv = [
  headers.join(","),
  ...rows.map(row => headers.map(h => JSON.stringify(row[h] ?? "")).join(","))
].join("\n");

fs.writeFileSync("output.csv", csv);
Command Line (with jq)
echo "date,description,category,amount" > output.csv
echo "$JSON_RESPONSE" | jq -r '.rows[] | [.date, .description, .category, .amount] | @csv' >> output.csv
Real-World Example: Invoices to CSV
Extract invoice data and flatten to spreadsheet rows:
import csv

def invoices_to_csv(extracted_invoices, output_path):
    """Convert extracted invoice JSON to CSV with one row per line item."""
    rows = []
    for inv in extracted_invoices:
        for item in inv.get("line_items", []):
            rows.append({
                "invoice_number": inv.get("invoice_number", ""),
                "vendor": inv.get("vendor", {}).get("name", ""),
                "date": inv.get("date", ""),
                "item": item.get("description", ""),
                "quantity": item.get("quantity", 0),
                "unit_price": item.get("unit_price", 0),
                "total": item.get("total", 0),
                "invoice_total": inv.get("total", 0),
            })
    with open(output_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

invoices_to_csv(all_invoices, "invoices.csv")
Output CSV:
invoice_number,vendor,date,item,quantity,unit_price,total,invoice_total
INV-2025-001,Acme Corp,2025-11-15,Widget Pro,5,24.99,124.95,148.69
INV-2025-001,Acme Corp,2025-11-15,Shipping,1,23.74,23.74,148.69
INV-2025-002,TechParts GmbH,2025-11-20,Server RAM,4,189.00,756.00,899.64
Batch Processing: Many PDFs to One CSV
Process an entire folder of PDFs into a single CSV:
import csv
from pathlib import Path

all_rows = []
for pdf in Path("./invoices").glob("*.pdf"):
    result = extract(str(pdf), schema_id)  # Your extraction function
    for item in result.get("line_items", []):
        all_rows.append({
            "file": pdf.name,
            "vendor": result.get("vendor", {}).get("name", ""),
            "date": result.get("date", ""),
            "item": item.get("description", ""),
            "amount": item.get("total", 0),
        })

# Write combined CSV
with open("all_invoices.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=all_rows[0].keys())
    writer.writeheader()
    writer.writerows(all_rows)
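In a real batch, one malformed PDF shouldn't abort the whole run. A defensive variant of the loop above can be sketched like this; `extract` is still your extraction function from the snippet above, and the column set is trimmed for brevity:

```python
from pathlib import Path

def collect_rows(folder, extract, schema_id):
    """Run extraction over every PDF in a folder, skipping files that fail."""
    rows, failed = [], []
    for pdf in sorted(Path(folder).glob("*.pdf")):
        try:
            result = extract(str(pdf), schema_id)
        except Exception as err:
            failed.append((pdf.name, str(err)))  # record the failure and move on
            continue
        for item in result.get("line_items", []):
            rows.append({
                "file": pdf.name,
                "item": item.get("description", ""),
                "amount": item.get("total", 0),
            })
    return rows, failed
```

The `failed` list lets you retry or inspect problem files after the batch completes instead of losing the entire run.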
Handling Different Document Types
The same PDF-to-CSV approach works for any document type — just adjust the schema:
- Bank statements → Date, Description, Debit, Credit, Balance
- Inventory lists → SKU, Name, Quantity, Location, Price
- Employee records → Name, ID, Department, Start Date, Salary
- Purchase orders → PO Number, Vendor, Item, Quantity, Total
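For instance, a bank-statement schema might look like this (a sketch following the same shape as the expense schema in Step 1, with the column names from the list above):

```json
{
  "type": "object",
  "properties": {
    "transactions": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "date": { "type": "string", "format": "date" },
          "description": { "type": "string" },
          "debit": { "type": "number" },
          "credit": { "type": "number" },
          "balance": { "type": "number" }
        }
      }
    }
  }
}
```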
Tips
- Design your schema for flat output — CSV works best with flat rows, so keep your schema's array items simple (no deep nesting)
- Use format: "date" for date columns — this ensures consistent ISO date formatting in your CSV
- Handle missing data — some documents won't have every field; default to empty strings or 0 in your CSV conversion script
- Test with one document first — Verify the schema produces the columns you want before running a batch
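The missing-data tip can be sketched as a small normalization helper that guarantees every row has the same columns before it reaches `csv.DictWriter` (the function name and defaults are illustrative):

```python
def normalize_row(row, columns, defaults=None):
    """Fill missing fields so every CSV row has the same columns."""
    defaults = defaults or {}
    # Unknown keys are dropped; missing keys fall back to a default or ""
    return {col: row.get(col, defaults.get(col, "")) for col in columns}
```

For example, `normalize_row({"date": "2025-11-01"}, ["date", "amount"], {"amount": 0})` returns `{"date": "2025-11-01", "amount": 0}`, so a document missing the amount still produces a well-formed row.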
Try It Now
Upload a PDF in the Playground, define a schema matching the table columns, and see the structured JSON output. From there, converting to CSV is a few lines of code.
For API details, see the documentation.
Related articles
How to Convert PDFs to JSON with an API
A practical guide to converting PDF documents into structured JSON data using a REST API. Covers digital PDFs, scanned documents, and batch processing.
How to Extract Tables from PDFs into Structured Data
Extract tables from PDF documents into structured JSON or CSV. Handle multi-column layouts, merged cells, and inconsistent formatting with schema-based extraction.
Extract Structured Data from Documents with Python
How to extract structured JSON data from PDFs, scanned documents, and Word files using Python. Complete code examples with requests, error handling, and batch processing.
