Convert Word Documents (DOCX) to JSON via API

Word documents are a staple of business operations. Contracts are drafted in Word. Reports are written in Word. Forms, proposals, and policy documents — all Word. But when you need to pull structured data out of them, you're stuck with manual copy-paste or fragile parsing scripts.

This guide shows how to convert DOCX files to structured JSON using a REST API.

Why DOCX to JSON?

Word documents store content as rich text with formatting — bold, italic, tables, headers. But your systems don't need formatting. They need data: the contract start date, the vendor name, the line items, the approval status.

Converting DOCX to JSON lets you:

Feed document data into databases without manual entry
Automate workflows that depend on document content
Build searchable indexes from document archives
Integrate document data with your existing tools and APIs

How DOCX Extraction Works

Step 1: Document Conversion

The DOCX file is converted to clean Markdown. This step preserves the document's structure — headings, paragraphs, tables, and lists — while stripping away formatting noise like fonts, colors, and page layout.

Step 2: Schema-Based Extraction

The Markdown content is analyzed against your JSON schema. The extraction engine understands the document's context and maps content to your schema fields.

Unlike regex-based parsing, schema-based extraction does not depend on exact positions or label wording, so it is much less likely to break when the document layout changes. A new paragraph, a reordered section, or different formatting should not change the results.
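To make the contrast concrete, here is a minimal Python sketch of the fragile approach. The pattern and both sample texts are invented for illustration; they do not come from a real document or from the API.

```python
import re

# Hypothetical pattern written against one specific layout: "Total: <amount>".
TOTAL_PATTERN = re.compile(r"Total:\s*([\d,]+)")


def parse_total(text):
    """Return the total as an int, or None if the layout does not match."""
    match = TOTAL_PATTERN.search(text)
    if match is None:
        return None
    return int(match.group(1).replace(",", ""))


# The same information in two layouts (both samples are made up).
original_layout = "Project: Redesign\nTotal: 60,000\n"
revised_layout = "Project: Redesign\nTotal project cost (EUR)\n60,000\n"

print(parse_total(original_layout))  # 60000
print(parse_total(revised_layout))   # None
```

The amount is still present in the revised text, but the pattern no longer finds it. A schema field such as total_cost describes what you want rather than where it appears, which is why it tolerates this kind of change better.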

Extracting Data from a Word Document

Example: A Proposal Document

Say you receive proposals in Word format and need to extract key details for your CRM.

Define a schema:

{
  "type": "object",
  "properties": {
    "company_name": { "type": "string" },
    "contact_person": { "type": "string" },
    "proposal_date": { "type": "string", "format": "date" },
    "project_name": { "type": "string" },
    "project_description": { "type": "string" },
    "deliverables": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "timeline": { "type": "string" },
          "cost": { "type": "number" }
        }
      }
    },
    "total_cost": { "type": "number" },
    "payment_terms": { "type": "string" },
    "validity_period": { "type": "string" }
  }
}

Upload the DOCX along with the ID of your schema:

curl -X POST https://api.smole.tech/api/pipeline/file \
  -H "Authorization: Bearer YOUR_API_KEY" \
  -F "file=@proposal.docx" \
  -F "schemaId=YOUR_SCHEMA_ID"
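
The same upload can be sketched in Python using only the standard library. The endpoint, the form field names (file, schemaId), and the Bearer header come from the curl command above. Everything else is an assumption made for illustration: the helper names, the 60-second timeout, and the expectation that the response body is plain JSON.

```python
import json
import os
import urllib.request
import uuid

API_URL = "https://api.smole.tech/api/pipeline/file"  # endpoint from the curl example
DOCX_MIME = "application/vnd.openxmlformats-officedocument.wordprocessingml.document"


def build_upload_request(file_bytes, filename, schema_id, api_key):
    """Build a multipart/form-data POST equivalent to the curl command."""
    boundary = uuid.uuid4().hex
    schema_part = (
        f"--{boundary}\r\n"
        'Content-Disposition: form-data; name="schemaId"\r\n\r\n'
        f"{schema_id}\r\n"
    ).encode()
    file_header = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
        f"Content-Type: {DOCX_MIME}\r\n\r\n"
    ).encode()
    closing = f"\r\n--{boundary}--\r\n".encode()
    return urllib.request.Request(
        API_URL,
        data=schema_part + file_header + file_bytes + closing,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": f"multipart/form-data; boundary={boundary}",
        },
    )


def extract(path, schema_id, api_key):
    """Upload a DOCX file and return the parsed response (assumed to be JSON)."""
    with open(path, "rb") as fh:
        request = build_upload_request(
            fh.read(), os.path.basename(path), schema_id, api_key
        )
    with urllib.request.urlopen(request, timeout=60) as response:
        return json.load(response)


# Usage (performs a real network call, so it is left commented out):
# result = extract("proposal.docx", "YOUR_SCHEMA_ID", "YOUR_API_KEY")
```

Keeping request construction separate from the network call makes the multipart body easy to inspect or unit test. Production code would also want error handling for non-2xx responses, which this sketch omits.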

Get structured JSON:

{
  "company_name": "Digital Solutions AG",
  "contact_person": "Sarah Hoffmann",
  "proposal_date": "2025-10-15",
  "project_name": "E-Commerce Platform Redesign",
  "project_description": "Complete redesign of the customer-facing e-commerce platform including mobile optimization and checkout flow improvements.",
  "deliverables": [
    { "name": "UX Research & Wireframes", "timeline": "4 weeks", "cost": 12000 },
    { "name": "UI Design", "timeline": "3 weeks", "cost": 9000 },
    { "name": "Frontend Development", "timeline": "8 weeks", "cost": 32000 },
    { "name": "Testing & Launch", "timeline": "2 weeks", "cost": 7000 }
  ],
  "total_cost": 60000,
  "payment_terms": "30% upfront, 40% at midpoint, 30% on delivery",
  "validity_period": "30 days"
}
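
Before loading a result like this into a CRM, it can be worth a quick consistency check. The sketch below assumes the response has the shape shown above and verifies that the deliverable costs add up to total_cost. The function name and the half-cent tolerance are illustrative choices, not part of the API.

```python
import json
import math


def check_totals(proposal, tolerance=0.005):
    """Return (sum of deliverable costs, True if it matches total_cost)."""
    line_sum = sum(item["cost"] for item in proposal["deliverables"])
    matches = math.isclose(line_sum, proposal["total_cost"], abs_tol=tolerance)
    return line_sum, matches


# Trimmed copy of the response shown above.
response_text = """
{
  "deliverables": [
    {"name": "UX Research & Wireframes", "timeline": "4 weeks", "cost": 12000},
    {"name": "UI Design", "timeline": "3 weeks", "cost": 9000},
    {"name": "Frontend Development", "timeline": "8 weeks", "cost": 32000},
    {"name": "Testing & Launch", "timeline": "2 weeks", "cost": 7000}
  ],
  "total_cost": 60000
}
"""

proposal = json.loads(response_text)
line_sum, consistent = check_totals(proposal)
print(line_sum, consistent)  # 60000 True
```

A mismatch does not necessarily mean the extraction failed, since the source document itself may contain an arithmetic error, but it is a cheap signal that a record deserves a manual look.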

Common DOCX Extraction Scenarios

Contracts and Agreements

Extract parties, dates, terms, obligations, and clauses from legal documents:

{
  "type": "object",
  "properties": {
    "agreement_type": { "type": "string" },
    "parties": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "role": { "type": "string" }
        }
      }
    },
    "effective_date": { "type": "string", "format": "date" },
    "termination_date": { "type": "string", "format": "date" },
    "key_terms": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}
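
Because effective_date and termination_date are declared with the date format, the extracted values can be parsed and compared directly. This is a minimal sketch that assumes the API returns ISO 8601 dates (YYYY-MM-DD), as in the proposal example earlier. The sample record and the function name are made up for illustration.

```python
from datetime import date


def contract_term_days(contract):
    """Return the contract length in days, validating the date order."""
    effective = date.fromisoformat(contract["effective_date"])
    termination = date.fromisoformat(contract["termination_date"])
    if termination <= effective:
        raise ValueError("termination_date must fall after effective_date")
    return (termination - effective).days


# Hypothetical extraction result matching the schema above.
sample_contract = {
    "agreement_type": "Service Agreement",
    "parties": [
        {"name": "Example Client GmbH", "role": "Client"},
        {"name": "Example Vendor Ltd", "role": "Provider"},
    ],
    "effective_date": "2025-01-01",
    "termination_date": "2025-12-31",
    "key_terms": ["Net 30 payment", "90-day termination notice"],
}

print(contract_term_days(sample_contract))  # 364
```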

Reports

Pull key metrics and findings from business reports:

{
  "type": "object",
  "properties": {
    "report_title": { "type": "string" },
    "reporting_period": { "type": "string" },
    "key_metrics": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "metric": { "type": "string" },
          "value": { "type": "string" },
          "change": { "type": "string" }
        }
      }
    },
    "summary": { "type": "string" },
    "recommendations": {
      "type": "array",
      "items": { "type": "string" }
    }
  }
}

HR Documents

Extract employee details from onboarding forms, reviews, or policy documents.

DOCX vs PDF: Extraction Differences

Aspect            DOCX                                PDF
Text quality      Always clean text                   May need OCR for scans
Structure         Headings, lists, tables preserved   Layout-dependent
Tables            Well-structured                     Can be inconsistent
Processing speed  Faster                              Slightly slower (especially with OCR)

DOCX files generally produce better extraction results because the text is already structured. PDFs, especially scanned ones, add an extra layer of complexity.

Handling Tables in Word Documents

Word documents often contain tables: line items, comparison matrices, schedules. To extract one, model the table in your schema as an array of objects, with one object per row and one property per column:

{
  "type": "object",
  "properties": {
    "table_data": {
      "type": "array",
      "items": {
        "type": "object",
        "properties": {
          "column_a": { "type": "string" },
          "column_b": { "type": "number" },
          "column_c": { "type": "string" }
        }
      }
    }
  }
}

Name your array item properties to match the table's column headers for best results.
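
One way to follow that advice is to derive the property names from the headers themselves. The sketch below builds a table schema from a mapping of column header to JSON type. The snake_case convention mirrors the schemas in this guide; the helper names and sample headers are hypothetical, and how closely the names must match the headers is not something this guide specifies.

```python
import json
import re


def to_property_name(header):
    """Turn a column header such as 'Unit Price (EUR)' into 'unit_price_eur'."""
    return re.sub(r"[^a-z0-9]+", "_", header.lower()).strip("_")


def table_schema(columns, array_name="table_data"):
    """Build an object schema with one array property describing the table rows."""
    properties = {
        to_property_name(header): {"type": json_type}
        for header, json_type in columns.items()
    }
    return {
        "type": "object",
        "properties": {
            array_name: {
                "type": "array",
                "items": {"type": "object", "properties": properties},
            }
        },
    }


# Hypothetical headers from a line-item table.
schema = table_schema(
    {"Item": "string", "Quantity": "number", "Unit Price (EUR)": "number"},
    array_name="line_items",
)
print(json.dumps(schema, indent=2))
```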

Try It Now

Upload a Word document in the Playground and define your schema. See structured JSON results in seconds — no code required.

For API integration, see the documentation.
