How to Extract Tables from PDFs into Structured Data
Tables in PDFs are one of the most frustrating data sources to work with. The data looks perfectly structured on screen, but copy-pasting gives you a jumbled mess. Traditional PDF parsers struggle with merged cells, multi-line rows, and inconsistent column alignment.
Schema-based extraction addresses this: you tell the API what columns to expect, and it returns clean, structured data.
Why PDF Tables Are Hard
PDFs don't actually have a concept of "tables." What you see as a table is just text positioned at specific coordinates on a page, sometimes with lines drawn between them, sometimes not. This means:
- Copy-paste scrambles column order
- PDF parsers can't reliably detect where one column ends and another begins
- Headers might span multiple columns
- Rows can wrap across multiple lines
- Some tables have borders, others don't
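To make the problem concrete, here is a minimal sketch of what a layout-based parser has to work with. The fragment coordinates and the `naive_rows` helper are hypothetical and written for illustration; they are not taken from any particular PDF library.

```python
from collections import defaultdict

# Hypothetical text fragments as a PDF stores them: (x, y, text).
# There is no row or column information, only positions on the page.
FRAGMENTS = [
    (50, 700, "Product"), (200, 700, "SKU"), (300, 700, "Price"),
    (50, 680, "Widget Pro"), (200, 680, "WP-100"), (300, 680, "24.99"),
    (50, 660, "Widget Enterprise"), (200, 660, "WE-300"), (300, 660, "89.99"),
    (50, 648, "(annual licence)"),  # wrapped continuation of the row above
]


def naive_rows(fragments):
    """Group fragments into rows by identical y, ordered top to bottom."""
    by_y = defaultdict(list)
    for x, y, text in fragments:
        by_y[y].append((x, text))
    # PDF y coordinates grow upward, so sort descending for reading order.
    ordered = sorted(by_y.items(), reverse=True)
    return [[text for _, text in sorted(cells)] for _, cells in ordered]


rows = naive_rows(FRAGMENTS)
for row in rows:
    print(row)
```

The wrapped line lands in a row of its own with a single cell, so a parser that trusts coordinates alone would report four rows where a reader sees three.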
Schema-based extraction sidesteps these problems. Instead of reconstructing a grid from coordinates, it reads the content and maps it onto the fields you define.
Extracting a Simple Table
Say you have a PDF with a product pricing table. Define what each row should look like:
{
"type": "object",
"properties": {
"table_title": { "type": "string" },
"rows": {
"type": "array",
"items": {
"type": "object",
"properties": {
"product_name": { "type": "string" },
"sku": { "type": "string" },
"unit_price": { "type": "number" },
"quantity_available": { "type": "integer" }
}
}
}
}
}
Upload the PDF:
curl -X POST https://api.smole.tech/api/pipeline/file \
-H "Authorization: Bearer YOUR_API_KEY" \
-F "file=@pricing-sheet.pdf" \
-F "schemaId=YOUR_SCHEMA_ID"
Get structured output:
{
"table_title": "Q1 2026 Product Pricing",
"rows": [
{ "product_name": "Widget Pro", "sku": "WP-100", "unit_price": 24.99, "quantity_available": 1500 },
{ "product_name": "Widget Lite", "sku": "WL-200", "unit_price": 12.49, "quantity_available": 3200 },
{ "product_name": "Widget Enterprise", "sku": "WE-300", "unit_price": 89.99, "quantity_available": 450 }
]
}
Every row is typed according to the schema: prices as numbers, quantities as integers, text as strings.
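If you want to confirm the types on your side before loading the rows into a database, a lightweight check is enough. The sketch below is an assumption about how you might do that in plain Python; `type_errors` and `ROW_PROPERTIES` are illustrative names, not part of the API, and the second sample row is deliberately broken to show a failure.

```python
# JSON Schema type names mapped to the Python types json.loads produces.
JSON_TYPES = {"string": str, "number": (int, float), "integer": int}

# The "items.properties" part of the pricing schema shown earlier.
ROW_PROPERTIES = {
    "product_name": {"type": "string"},
    "sku": {"type": "string"},
    "unit_price": {"type": "number"},
    "quantity_available": {"type": "integer"},
}


def type_errors(rows, properties):
    """Return (row_index, field, value) for every missing or mistyped field."""
    errors = []
    for index, row in enumerate(rows):
        for name, spec in properties.items():
            value = row.get(name)
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, JSON_TYPES[spec["type"]]):
                errors.append((index, name, value))
    return errors


rows = [
    {"product_name": "Widget Pro", "sku": "WP-100", "unit_price": 24.99, "quantity_available": 1500},
    # Deliberately wrong for the demo: the price is a string.
    {"product_name": "Widget Lite", "sku": "WL-200", "unit_price": "12.49", "quantity_available": 3200},
]
print(type_errors(rows, ROW_PROPERTIES))
```

For anything beyond the three basic types used here, a full JSON Schema validator is likely the better tool.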
Handling Complex Tables
Multiple Tables in One Document
If a PDF contains several tables, define each in your schema:
{
"type": "object",
"properties": {
"revenue_table": {
"type": "array",
"items": {
"type": "object",
"properties": {
"region": { "type": "string" },
"q1": { "type": "number" },
"q2": { "type": "number" },
"q3": { "type": "number" },
"q4": { "type": "number" }
}
}
},
"expense_table": {
"type": "array",
"items": {
"type": "object",
"properties": {
"category": { "type": "string" },
"budgeted": { "type": "number" },
"actual": { "type": "number" },
"variance": { "type": "number" }
}
}
}
}
}
The extraction engine identifies which table is which based on context and column headers.
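Multi-table schemas repeat the same array-of-objects shape, so it can be convenient to generate them. The following is a small sketch with hypothetical helper names (`table_schema`, `document_schema`); it only builds the JSON structure shown above and makes no assumptions about the API itself.

```python
import json


def table_schema(columns):
    """Build the array-of-objects schema for one table from {column: json_type}."""
    return {
        "type": "array",
        "items": {
            "type": "object",
            "properties": {name: {"type": json_type} for name, json_type in columns.items()},
        },
    }


def document_schema(tables):
    """Combine several named tables into one top-level object schema."""
    return {
        "type": "object",
        "properties": {name: table_schema(columns) for name, columns in tables.items()},
    }


schema = document_schema({
    "revenue_table": {"region": "string", "q1": "number", "q2": "number", "q3": "number", "q4": "number"},
    "expense_table": {"category": "string", "budgeted": "number", "actual": "number", "variance": "number"},
})
print(json.dumps(schema, indent=2))
```

Adding a third table then means adding one more entry to the dictionary rather than another twelve lines of JSON.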
Tables with Merged Cells
Financial statements and reports often merge cells for category headers. A flat schema with category and subcategory fields captures that hierarchy:
{
"type": "object",
"properties": {
"line_items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"category": { "type": "string" },
"subcategory": { "type": "string" },
"amount": { "type": "number" }
}
}
}
}
}
A merged "Operating Expenses" header spanning multiple rows gets assigned as the category for each sub-row.
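Because the category is repeated on every sub-row, the original hierarchy is easy to rebuild after extraction. The sketch below assumes a result shaped like the schema above; the sample values are invented for illustration and `group_by_category` is a hypothetical helper.

```python
from collections import defaultdict


def group_by_category(line_items):
    """Collect (subcategory, amount) pairs under their shared category."""
    grouped = defaultdict(list)
    for item in line_items:
        grouped[item["category"]].append((item["subcategory"], item["amount"]))
    return dict(grouped)


# Hypothetical extraction result, for illustration only.
line_items = [
    {"category": "Operating Expenses", "subcategory": "Salaries", "amount": 120000.0},
    {"category": "Operating Expenses", "subcategory": "Rent", "amount": 30000.0},
    {"category": "Revenue", "subcategory": "Product Sales", "amount": 410000.0},
]

grouped = group_by_category(line_items)
totals = {category: sum(amount for _, amount in rows) for category, rows in grouped.items()}
print(totals)
```

Per-category totals computed this way can also be compared against any subtotal rows printed in the document.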
Tables Without Borders
Many PDFs use whitespace alignment instead of visible borders. This doesn't matter for schema-based extraction, which reads the content, not the lines.
Invoice Line Items
Invoice line items are the most common table extraction use case, and they almost always follow the same pattern:
{
"type": "object",
"properties": {
"line_items": {
"type": "array",
"items": {
"type": "object",
"properties": {
"description": { "type": "string" },
"quantity": { "type": "number" },
"unit_price": { "type": "number" },
"tax_rate": { "type": "number" },
"total": { "type": "number" }
}
}
},
"subtotal": { "type": "number" },
"tax_total": { "type": "number" },
"grand_total": { "type": "number" }
}
}
This works regardless of whether the invoice uses "Qty" or "Quantity", "Unit Price" or "Price/Unit", or any other header variation.
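Extracting the totals alongside the rows also lets you cross-check the numbers. Here is a minimal sketch of such a check. It assumes line totals are pre-tax and the currency has two decimal places; both are assumptions about your invoices, not guarantees from the API. The function names and sample invoice are hypothetical.

```python
from decimal import Decimal


def money(value):
    """Convert a JSON number to a Decimal rounded to two decimal places."""
    return Decimal(str(value)).quantize(Decimal("0.01"))


def invoice_issues(invoice):
    """Return human-readable descriptions of totals that do not add up."""
    issues = []
    items = invoice["line_items"]
    for index, item in enumerate(items):
        expected = money(Decimal(str(item["quantity"])) * Decimal(str(item["unit_price"])))
        if expected != money(item["total"]):
            issues.append(f"line {index}: expected total {expected}, got {item['total']}")
    # Assumption: line totals are pre-tax, so they sum to the subtotal.
    line_sum = sum((money(item["total"]) for item in items), Decimal("0.00"))
    if line_sum != money(invoice["subtotal"]):
        issues.append(f"subtotal: lines sum to {line_sum}, got {invoice['subtotal']}")
    if money(invoice["subtotal"]) + money(invoice["tax_total"]) != money(invoice["grand_total"]):
        issues.append("grand_total: subtotal + tax_total does not match")
    return issues


# Hypothetical extraction result, for illustration only.
invoice = {
    "line_items": [
        {"description": "Widget Pro", "quantity": 2, "unit_price": 24.99, "tax_rate": 0.1, "total": 49.98},
        {"description": "Widget Lite", "quantity": 1, "unit_price": 12.49, "tax_rate": 0.1, "total": 12.49},
    ],
    "subtotal": 62.47,
    "tax_total": 6.25,
    "grand_total": 68.72,
}
issues = invoice_issues(invoice)
print(issues)
```

`Decimal` is used instead of floats so that cent-level comparisons are exact. A non-empty result is a signal to review the document by hand, since the mismatch may come from the invoice itself rather than from extraction.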
Financial Statements
Extract balance sheets, income statements, and cash flow tables:
{
"type": "object",
"properties": {
"statement_type": { "type": "string" },
"period": { "type": "string" },
"entries": {
"type": "array",
"items": {
"type": "object",
"properties": {
"line_item": { "type": "string" },
"current_period": { "type": "number" },
"prior_period": { "type": "number" }
}
}
}
}
}
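With both periods in each entry, period-over-period changes follow directly from the extracted data. The sketch below assumes a result shaped like the schema above; `period_changes` and the sample figures are hypothetical.

```python
def period_changes(entries):
    """Compute absolute and percentage change for each statement line."""
    changes = []
    for entry in entries:
        current = entry["current_period"]
        prior = entry["prior_period"]
        delta = current - prior
        # A percentage is undefined when the prior period is zero.
        percent = None if prior == 0 else round(delta / abs(prior) * 100, 1)
        changes.append({"line_item": entry["line_item"], "change": delta, "percent_change": percent})
    return changes


# Hypothetical extraction result, for illustration only.
entries = [
    {"line_item": "Revenue", "current_period": 1200.0, "prior_period": 1000.0},
    {"line_item": "New Product Line", "current_period": 50.0, "prior_period": 0.0},
]
changes = period_changes(entries)
print(changes)
```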
Tips for Better Table Extraction
- Name your array item properties after column headers: if the table has a "Description" column, use "description" in your schema
- Use number for numeric columns: this ensures amounts, quantities, and percentages come back as numbers, not strings
- Extract the table summary too: include fields for totals, subtotals, and other summary rows alongside the array
- Test with varying formats: the same schema should work across different table layouts
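The first tip can be applied mechanically. As a sketch, the hypothetical helper below turns a column header into a snake_case property name. The naming convention is an assumption made here to match the examples in this article, not a requirement of the API.

```python
import re


def header_to_property(header):
    """Turn a column header such as "Unit Price" into a snake_case key."""
    words = re.findall(r"[A-Za-z0-9]+", header)
    return "_".join(word.lower() for word in words)


headers = ["Description", "Qty", "Unit Price", "Price/Unit"]
properties = {header_to_property(h): {"type": "string"} for h in headers}
print(list(properties))
```

The generated types are placeholders; numeric columns still need to be switched to number by hand, as the second tip suggests.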
Try It Now
Upload a PDF with tables in the Playground and see structured data come back in seconds. Define your schema to match the table columns, and let the extraction handle the rest.
For API integration details, see the documentation.
