Convert Word Documents (DOCX) to JSON via API
Word documents are a staple of business operations. Contracts are drafted in Word. Reports are written in Word. Forms, proposals, and policy documents — all Word. But when you need to pull structured data out of them, you're stuck with manual copy-paste or fragile parsing scripts.
This guide shows how to convert DOCX files to structured JSON using a REST API.
Why DOCX to JSON?
Word documents store content as rich text with formatting — bold, italic, tables, headers. But your systems don't need formatting. They need data: the contract start date, the vendor name, the line items, the approval status.
Converting DOCX to JSON lets you:
- Feed document data into databases without manual entry
- Automate workflows that depend on document content
- Build searchable indexes from document archives
- Integrate document data with your existing tools and APIs
How DOCX Extraction Works
Step 1: Document Conversion
The DOCX file is converted to clean Markdown. This step preserves the document's structure — headings, paragraphs, tables, and lists — while stripping away formatting noise like fonts, colors, and page layout.
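To make the conversion step concrete, here is a minimal sketch of the idea using only Python's standard library. It is not the service's actual converter: it relies on the fact that a DOCX file is a ZIP archive whose body text lives in `word/document.xml`, and it assumes heading styles carry IDs like `Heading1` or `Heading2` (true for Word's default English templates, not guaranteed elsewhere). The function name `docx_to_markdown` is hypothetical.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace used inside word/document.xml
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_to_markdown(docx_bytes: bytes) -> str:
    """Rough DOCX -> Markdown conversion: headings and plain paragraphs only."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    lines = []
    for para in root.iter(f"{W}p"):
        # Join the text of all runs; bold, fonts, and colors are simply ignored.
        text = "".join(node.text or "" for node in para.iter(f"{W}t")).strip()
        if not text:
            continue
        style = para.find(f"{W}pPr/{W}pStyle")
        style_id = style.get(f"{W}val", "") if style is not None else ""
        level = style_id[len("Heading"):]
        # Assumption: heading styles are named "Heading1", "Heading2", ...
        if style_id.startswith("Heading") and level.isdigit():
            text = "#" * min(int(level), 6) + " " + text
        lines.append(text)
    return "\n\n".join(lines)
```

This sketch flattens table cells into ordinary paragraphs and ignores lists, so a real converter has to do considerably more. It only illustrates how structure can be kept while formatting is dropped.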
Step 2: Schema-Based Extraction
The Markdown content is analyzed against your JSON schema. The extraction engine understands the document's context and maps content to your schema fields.
Unlike regex-based parsing, schema-based extraction does not depend on exact positions or text patterns, so it is far more tolerant of layout changes. A new paragraph, a reordered section, or different formatting typically won't change the results.
Extracting Data from a Word Document
Example: A Proposal Document
Say you receive proposals in Word format and need to extract key details for your CRM.
Define a schema:
{
  "type": "object",
  "properties": {
    "company_name": { "type": "string" },
    "contact_person": { "type": "string" },
    "proposal_date": { "type": "string", "format": "date" },
    "project_name": { "type": "string" },
    "project_description": { "type": "string" },
    "deliverables": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "timeline": { "type": "string" },
          "cost": { "type": "number" }
        }
      }
    },
    "total_cost": { "type": "number" },
    "payment_terms": { "type": "string" },
    "validity_period": { "type": "string" }
  }
}
Upload the DOCX:
curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@proposal.docx" \
  -F "schemaId=YOUR_SCHEMA_ID"
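The same upload can be sketched in Python with only the standard library. The endpoint, the `file` and `schemaId` form fields, and the Bearer header are taken from the curl command above. The helper names `build_multipart` and `extract_docx`, the 60-second timeout, and the assumption that the response body is the extracted JSON are illustrative and not confirmed by the API documentation.

```python
import json
import os
import urllib.request
import uuid

API_URL = "https://api.smole.tech/api/pipeline/file"  # endpoint from the curl example
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def build_multipart(fields: dict, filename: str, file_bytes: bytes):
    """Encode text fields plus one file part as a multipart/form-data body."""
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode()
        )
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            f"Content-Type: {DOCX_MIME}\r\n\r\n"
        ).encode()
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def extract_docx(path: str, schema_id: str, api_key: str) -> dict:
    """POST a DOCX file and return the parsed JSON response.

    Assumption: the response body is a JSON document.
    """
    with open(path, "rb") as fh:
        body, content_type = build_multipart(
            {"schemaId": schema_id}, os.path.basename(path), fh.read()
        )
    request = urllib.request.Request(
        API_URL,
        data=body,
        headers={"Authorization": f"Bearer {api_key}", "Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)
```

A call would look like `extract_docx("proposal.docx", "YOUR_SCHEMA_ID", "YOUR_API_KEY")`. In production code you would also want to catch `urllib.error.HTTPError` and inspect the status code.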
Get structured JSON:
{
  "company_name": "Digital Solutions AG",
  "contact_person": "Sarah Hoffmann",
  "proposal_date": "2025-10-15",
  "project_name": "E-Commerce Platform Redesign",
  "project_description": "Complete redesign of the customer-facing e-commerce platform including mobile optimization and checkout flow improvements.",
  "deliverables": [
    { "name": "UX Research & Wireframes", "timeline": "4 weeks", "cost": 12000 },
    { "name": "UI Design", "timeline": "3 weeks", "cost": 9000 },
    { "name": "Frontend Development", "timeline": "8 weeks", "cost": 32000 },
    { "name": "Testing & Launch", "timeline": "2 weeks", "cost": 7000 }
  ],
  "total_cost": 60000,
  "payment_terms": "30% upfront, 40% at midpoint, 30% on delivery",
  "validity_period": "30 days"
}
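Before writing a result like this into a CRM, it can be worth a quick consistency check. The sketch below is one possible check rather than part of the API: it verifies that the deliverable costs add up to `total_cost`. The function name `check_proposal_total` and the tolerance are assumptions. Missing fields are handled because the schema above does not mark any property as required.

```python
def check_proposal_total(proposal: dict, tolerance: float = 0.01) -> bool:
    """Return True when the deliverable costs add up to total_cost."""
    deliverables = proposal.get("deliverables") or []
    # Treat a missing or null cost as 0 so one incomplete row does not crash the check.
    line_total = sum(item.get("cost") or 0 for item in deliverables)
    total = proposal.get("total_cost")
    return total is not None and abs(line_total - total) <= tolerance


# Cost figures from the example response above
proposal = {
    "deliverables": [
        {"name": "UX Research & Wireframes", "cost": 12000},
        {"name": "UI Design", "cost": 9000},
        {"name": "Frontend Development", "cost": 32000},
        {"name": "Testing & Launch", "cost": 7000},
    ],
    "total_cost": 60000,
}
print(check_proposal_total(proposal))  # True: 12000 + 9000 + 32000 + 7000 = 60000
```

A failed check does not necessarily mean the extraction is wrong, since the source document itself may contain an arithmetic error. It is still a useful signal for routing a record to manual review.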
Common DOCX Extraction Scenarios
Contracts and Agreements
Extract parties, dates, terms, obligations, and clauses from legal documents:
{
  "type": "object",
  "properties": {
    "agreement_type": { "type": "string" },
    "parties": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "role": { "type": "string" }
        }
      }
    },
    "effective_date": { "type": "string", "format": "date" },
    "termination_date": { "type": "string", "format": "date" },
    "key_terms": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
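Because both date fields use `"format": "date"`, their values should arrive as ISO 8601 strings such as `2025-01-01`, which makes downstream date arithmetic straightforward. The following sketch computes a contract's term in days. The function name `contract_term_days` and the sample values are hypothetical, and it assumes both dates were actually found in the document.

```python
from datetime import date


def contract_term_days(contract: dict) -> int:
    """Return the number of days between effective_date and termination_date."""
    start = date.fromisoformat(contract["effective_date"])
    end = date.fromisoformat(contract["termination_date"])
    if end < start:
        raise ValueError("termination_date is earlier than effective_date")
    return (end - start).days


# Hypothetical extraction result, for illustration only
contract = {"effective_date": "2025-01-01", "termination_date": "2025-12-31"}
print(contract_term_days(contract))  # 364
```

`date.fromisoformat` raises `ValueError` on malformed input, so the same call doubles as a format check on the extracted values.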
Reports
Pull key metrics and findings from business reports:
{
  "type": "object",
  "properties": {
    "report_title": { "type": "string" },
    "reporting_period": { "type": "string" },
    "key_metrics": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "metric": { "type": "string" },
          "value": { "type": "string" },
          "change": { "type": "string" }
        }
      }
    },
    "summary": { "type": "string" },
    "recommendations": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
HR Documents
Extract employee details from onboarding forms, reviews, or policy documents.
DOCX vs PDF: Extraction Differences
| Aspect | DOCX | PDF |
|---|---|---|
| Text quality | Always clean text | May need OCR for scans |
| Structure | Headings, lists, tables preserved | Layout-dependent |
| Tables | Well-structured | Can be inconsistent |
| Processing speed | Faster | Slightly slower (especially with OCR) |
DOCX files generally produce better extraction results because the text is already structured. PDFs, especially scanned ones, add an extra layer of complexity.
Handling Tables in Word Documents
Word documents often contain tables: line items, comparison matrices, schedules. To extract one, model the table as an array of objects, with one object per row and one property per column:
{
  "type": "object",
  "properties": {
    "table_data": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "column_a": { "type": "string" },
          "column_b": { "type": "number" },
          "column_c": { "type": "string" }
        }
      }
    }
  }
}
Name your array item properties to match the table's column headers for best results.
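If you maintain schemas for many tables, deriving property names from the headers can be automated. The sketch below is one way to do that, not a feature of the API: it turns header text into snake_case names, matching the naming style of the schemas in this guide, and builds the array-of-objects schema around them. The helper names `column_to_property` and `table_schema` and the sample headers are hypothetical.

```python
import json
import re


def column_to_property(header: str) -> str:
    """Turn a column header such as 'Unit Price (EUR)' into 'unit_price_eur'."""
    slug = re.sub(r"[^0-9a-z]+", "_", header.strip().lower()).strip("_")
    return slug or "column"


def table_schema(columns: dict, array_name: str = "table_data") -> dict:
    """Build a schema for a table; `columns` maps header text to a JSON Schema type."""
    properties = {
        column_to_property(header): {"type": json_type}
        for header, json_type in columns.items()
    }
    return {
        "type": "object",
        "properties": {
            array_name: {
                "type": "array",
                "items": {"type": "object", "properties": properties},
            }
        },
    }


# Hypothetical headers of a line-item table
schema = table_schema(
    {"Item": "string", "Quantity": "number", "Unit Price (EUR)": "number"},
    array_name="line_items",
)
print(json.dumps(schema, indent=2))
```

Two headers that normalize to the same name would overwrite each other in this sketch, so check for collisions before using it on tables with near-duplicate column titles.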
Try It Now
Upload a Word document in the Playground and define your schema. See structured JSON results in seconds — no code required.
For API integration, see the documentation.
