Extract Data from Spreadsheets (XLSX) via API
Spreadsheets are everywhere in business — budgets, inventory lists, project trackers, sales reports, HR records. They're flexible, which is why people use them for everything. But that flexibility makes them hard to process programmatically: merged cells, multiple sheets, inconsistent formatting, formulas, and hidden columns.
Schema-based extraction cuts through the complexity. Define what data you want, upload the spreadsheet, and get clean JSON back.
Why Not Just Parse the XLSX Directly?
Libraries like openpyxl (Python) or xlsx (JavaScript) can read Excel files, but you still need to handle:
- Inconsistent structure — Column A is the name in one file, column B in another
- Multiple sheets — Which sheet has the data you need?
- Merged cells — Headers spanning multiple columns
- Empty rows and columns — Padding, spacing, notes scattered throughout
- Formula results vs raw values — Some cells contain formulas, others contain static values
- Formatting as data — Currency symbols, percentage signs, date formats
Schema-based extraction handles all of this. It converts the spreadsheet to a structured format, then extracts the data you defined — regardless of layout quirks.
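To make the "formatting as data" item concrete, here is a rough sketch of the cleanup a hand-written parser would need for a single cell. It is not how the extraction service works internally. The helper name parse_cell_number is hypothetical, and the sketch assumes English-style formatting (comma as thousands separator, period as decimal point).

```python
import re


def parse_cell_number(text: str) -> float:
    """Turn a display-formatted cell such as '-2.1%' or '$1,250.00' into a float.

    Assumption: ',' is a thousands separator and '.' is the decimal point.
    """
    s = text.strip().replace("\u2212", "-")  # normalize the Unicode minus sign
    negative = s.startswith("(") and s.endswith(")")  # accounting-style negative
    cleaned = re.sub(r"[^\d.\-]", "", s)  # drop currency symbols, %, commas, spaces
    if not re.search(r"\d", cleaned):
        raise ValueError(f"no number found in {text!r}")
    value = float(cleaned)
    return -value if negative else value


print(parse_cell_number("-2.1%"))       # -2.1
print(parse_cell_number("$1,250.00"))   # 1250.0
print(parse_cell_number("(1,250.00)"))  # -1250.0
```

Even this small helper breaks on European-style numbers such as "1.250,00", which is the kind of per-file variation that makes hand-written parsing tedious.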
Extracting Data from an Excel File
Example: A Sales Report Spreadsheet
{
  "type": "object",
  "properties": {
    "report_title": { "type": "string" },
    "reporting_period": { "type": "string" },
    "sales_data": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "product": { "type": "string" },
          "region": { "type": "string" },
          "units_sold": { "type": "integer" },
          "revenue": { "type": "number" },
          "growth_percent": { "type": "number" }
        }
      }
    },
    "total_revenue": { "type": "number" },
    "total_units": { "type": "integer" }
  }
}
Upload and extract:
curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@sales-report.xlsx" \
  -F "schemaId=YOUR_SCHEMA_ID"
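For readers working in Python, the same upload can be sketched with only the standard library. The endpoint, the file and schemaId form fields, and the Bearer header are taken from the curl command above. The helper names build_multipart and extract_file are hypothetical, and treating the response body as JSON is an assumption based on the result shown in this article.

```python
import json
import urllib.request
import uuid
from pathlib import Path

API_URL = "https://api.smole.tech/api/pipeline/file"  # endpoint from the curl example


def build_multipart(fields, file_field, filename, content):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    chunks = []
    for name, value in fields.items():
        chunks.append(f"--{boundary}\r\n".encode())
        chunks.append(f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode())
        chunks.append(f"{value}\r\n".encode())
    chunks.append(f"--{boundary}\r\n".encode())
    chunks.append(
        f'Content-Disposition: form-data; name="{file_field}"; filename="{filename}"\r\n'.encode()
    )
    chunks.append(b"Content-Type: application/octet-stream\r\n\r\n")
    chunks.append(content)
    chunks.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(chunks), f"multipart/form-data; boundary={boundary}"


def extract_file(path, schema_id, api_key):
    """Upload a spreadsheet and return the parsed response (assumed to be JSON)."""
    file_path = Path(path)
    body, content_type = build_multipart(
        {"schemaId": schema_id}, "file", file_path.name, file_path.read_bytes()
    )
    request = urllib.request.Request(
        API_URL,
        data=body,
        method="POST",
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

A call would look like extract_file("sales-report.xlsx", "YOUR_SCHEMA_ID", "YOUR_API_KEY"). urllib raises urllib.error.HTTPError on non-2xx responses, so production code would want to catch that.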
Result:
{
  "report_title": "Q4 2025 Sales Report",
  "reporting_period": "October - December 2025",
  "sales_data": [
    { "product": "Widget Pro", "region": "DACH", "units_sold": 1250, "revenue": 31237.50, "growth_percent": 12.5 },
    { "product": "Widget Pro", "region": "Nordics", "units_sold": 840, "revenue": 20997.60, "growth_percent": 8.3 },
    { "product": "Widget Lite", "region": "DACH", "units_sold": 3100, "revenue": 38750.00, "growth_percent": -2.1 },
    { "product": "Widget Lite", "region": "Nordics", "units_sold": 2200, "revenue": 27500.00, "growth_percent": 15.7 }
  ],
  "total_revenue": 118485.10,
  "total_units": 7390
}
Note how growth_percent comes back as a number (-2.1), not a string ("-2.1%"). The extraction handles format conversion automatically.
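Because the result carries both the rows and the summary fields, the totals can be cross-checked against the row data. The sketch below does this for the sample result above; check_totals is a hypothetical helper, and the 0.01 tolerance for floating-point rounding is an arbitrary choice.

```python
import math


def check_totals(result: dict, tolerance: float = 0.01) -> dict:
    """Compare the extracted totals with sums recomputed from the rows."""
    rows = result["sales_data"]
    revenue_sum = sum(row["revenue"] for row in rows)
    units_sum = sum(row["units_sold"] for row in rows)
    return {
        "revenue_matches": math.isclose(
            revenue_sum, result["total_revenue"], abs_tol=tolerance
        ),
        "units_match": units_sum == result["total_units"],
    }


# The sample result from this article, trimmed to the fields being checked.
result = {
    "sales_data": [
        {"product": "Widget Pro", "region": "DACH", "units_sold": 1250, "revenue": 31237.50},
        {"product": "Widget Pro", "region": "Nordics", "units_sold": 840, "revenue": 20997.60},
        {"product": "Widget Lite", "region": "DACH", "units_sold": 3100, "revenue": 38750.00},
        {"product": "Widget Lite", "region": "Nordics", "units_sold": 2200, "revenue": 27500.00},
    ],
    "total_revenue": 118485.10,
    "total_units": 7390,
}

print(check_totals(result))  # {'revenue_matches': True, 'units_match': True}
```

A mismatch here usually points to a missed or duplicated row, which makes this a cheap sanity check before loading the data anywhere else.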
Common Spreadsheet Extraction Scenarios
Budget Tracking
{
  "type": "object",
  "properties": {
    "department": { "type": "string" },
    "fiscal_year": { "type": "string" },
    "line_items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "category": { "type": "string" },
          "budgeted": { "type": "number" },
          "actual": { "type": "number" },
          "variance": { "type": "number" }
        }
      }
    },
    "total_budget": { "type": "number" },
    "total_actual": { "type": "number" }
  }
}
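Once a budget sheet is in this shape, downstream checks become short. As an illustration, the sketch below lists the categories where spending exceeded the budget. The sample data is invented, over_budget_items is a hypothetical helper, and the variance field is left out of the sample because its sign convention differs between spreadsheets.

```python
def over_budget_items(budget: dict) -> list:
    """Return (category, overspend) pairs where actual exceeds budgeted."""
    return [
        (item["category"], round(item["actual"] - item["budgeted"], 2))
        for item in budget["line_items"]
        if item["actual"] > item["budgeted"]
    ]


# Invented sample data that follows the budget schema above.
sample_budget = {
    "department": "Marketing",
    "fiscal_year": "2025",
    "line_items": [
        {"category": "Events", "budgeted": 20000.0, "actual": 23500.0},
        {"category": "Software", "budgeted": 12000.0, "actual": 11200.0},
    ],
    "total_budget": 32000.0,
    "total_actual": 34700.0,
}

print(over_budget_items(sample_budget))  # [('Events', 3500.0)]
```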
Inventory Lists
{
  "type": "object",
  "properties": {
    "items": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "sku": { "type": "string" },
          "name": { "type": "string" },
          "category": { "type": "string" },
          "quantity_on_hand": { "type": "integer" },
          "reorder_point": { "type": "integer" },
          "unit_cost": { "type": "number" },
          "location": { "type": "string" }
        }
      }
    }
  }
}
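With quantity_on_hand and reorder_point extracted as integers, a reorder report is a simple filter. The sketch below uses invented sample data; items_to_reorder is a hypothetical helper, and treating "at or below the reorder point" as the trigger is an assumption that may not match every inventory policy.

```python
def items_to_reorder(inventory: dict) -> list:
    """Return the SKUs whose stock is at or below the reorder point."""
    return [
        item["sku"]
        for item in inventory["items"]
        if item["quantity_on_hand"] <= item["reorder_point"]
    ]


# Invented sample data that follows the inventory schema above.
sample_inventory = {
    "items": [
        {"sku": "WP-001", "name": "Widget Pro", "quantity_on_hand": 12, "reorder_point": 25},
        {"sku": "WL-002", "name": "Widget Lite", "quantity_on_hand": 140, "reorder_point": 50},
        {"sku": "WX-003", "name": "Widget Max", "quantity_on_hand": 30, "reorder_point": 30},
    ]
}

print(items_to_reorder(sample_inventory))  # ['WP-001', 'WX-003']
```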
Employee Directories
{
  "type": "object",
  "properties": {
    "employees": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "employee_id": { "type": "string" },
          "department": { "type": "string" },
          "title": { "type": "string" },
          "email": { "type": "string", "format": "email" },
          "start_date": { "type": "string", "format": "date" }
        }
      }
    }
  }
}
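The format hints on email and start_date can also be checked on the client side. This article does not say how strictly the service enforces them, so the sketch below treats that as unknown and validates the two fields itself. It assumes dates come back in ISO YYYY-MM-DD form (the JSON Schema "date" format); validate_employee is a hypothetical helper, and the email pattern is deliberately loose.

```python
import re
from datetime import date

# Loose shape check: something@something.something, no spaces.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")


def validate_employee(employee: dict) -> list:
    """Return the names of the fields that fail their format check."""
    problems = []
    if not EMAIL_PATTERN.match(str(employee.get("email") or "")):
        problems.append("email")
    try:
        date.fromisoformat(str(employee.get("start_date") or ""))
    except ValueError:
        problems.append("start_date")
    return problems


print(validate_employee({"email": "a.b@example.com", "start_date": "2024-03-01"}))  # []
print(validate_employee({"email": "not-an-email", "start_date": "03/01/2024"}))
# ['email', 'start_date']
```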
Supported Formats
| Format | Extension | Description |
|---|---|---|
| Excel (modern) | .xlsx | Standard Excel format |
| Excel with macros | .xlsm | Macro-enabled workbook (macros are ignored, data is extracted) |
| CSV | .csv | Comma-separated values (also supported via pipeline) |
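A small pre-upload check can catch unsupported files early. The sketch below only knows the three extensions listed in the table above; is_supported_spreadsheet is a hypothetical helper, and the service may accept other document types that this table does not cover.

```python
from pathlib import Path

# Extensions taken from the table above.
SPREADSHEET_EXTENSIONS = {".xlsx", ".xlsm", ".csv"}


def is_supported_spreadsheet(filename: str) -> bool:
    """Check the file extension against the table, ignoring case."""
    return Path(filename).suffix.lower() in SPREADSHEET_EXTENSIONS


print(is_supported_spreadsheet("sales-report.xlsx"))  # True
print(is_supported_spreadsheet("Q4-BUDGET.XLSM"))     # True
print(is_supported_spreadsheet("legacy.xls"))         # False
```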
Tips for Better Spreadsheet Extraction
- Name your fields after column headers: if the spreadsheet has a "Product Name" column, use product_name in your schema
- Use arrays for tabular data: spreadsheet rows map naturally to JSON arrays
- Include summary fields: extract totals and subtotals alongside row data for validation
- Use integer for counts: quantities, headcounts, and other whole numbers should use the integer type
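The first tip can be automated when a schema is drafted from an existing sheet. The sketch below converts a column header into a snake_case field name; header_to_field is a hypothetical helper, and dropping punctuation such as "%" is a simplification, so the result may need a manual rename.

```python
import re


def header_to_field(header: str) -> str:
    """Convert a column header like 'Product Name' to 'product_name'."""
    words = re.findall(r"[A-Za-z0-9]+", header)  # keep letters and digits only
    return "_".join(word.lower() for word in words)


print(header_to_field("Product Name"))     # product_name
print(header_to_field("Units Sold (Q4)"))  # units_sold_q4
print(header_to_field("Growth %"))         # growth
```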
Try It Now
Upload a spreadsheet in the Playground and see structured data in seconds. It works with .xlsx, .xlsm, and .csv files, however messy the formatting.
For API details, see the documentation.
