
We Crawled the Cloudflare Blog and Turned It Into Structured Data

March 11, 2026 · Smole Team

Cloudflare just shipped something interesting: a /crawl endpoint that lets you crawl an entire website with a single API call. You give it a URL, it discovers pages, follows links, and hands back each page as clean markdown.

We immediately had a question: what happens if you pipe that markdown straight into Smole?

The answer: you get a pipeline that turns any website into a folder of structured JSON files, shaped exactly the way you define them. We spent an afternoon building it, testing it against real sites, and hitting every edge case we could find. Here's what happened.

The idea

The concept is dead simple. Cloudflare crawls a website and gives you markdown. Smole takes markdown and gives you structured JSON. Connect the two:

Website → Cloudflare /crawl → Markdown → Smole → Structured JSON

No headless browsers to manage. No HTML parsing. No scraping infrastructure. Just two API calls per page.
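The whole pipeline fits in a few lines. Here's a minimal sketch using only the standard library — the endpoint URLs and payload shapes below are assumptions for illustration, not the documented Cloudflare /crawl or Smole API contracts:

```python
import json
import urllib.request

# Hypothetical endpoints -- the real Cloudflare /crawl and Smole
# URLs, auth, and payload fields may differ.
CRAWL_URL = "https://api.cloudflare.com/client/v4/crawl"
EXTRACT_URL = "https://api.smole.tech/v1/extract"

def build_crawl_request(site_url: str, depth: int = 1) -> dict:
    """Payload for step 1: website in, markdown pages out."""
    return {"url": site_url, "depth": depth, "format": "markdown"}

def build_extract_request(markdown: str, schema: dict) -> dict:
    """Payload for step 2: markdown plus schema in, structured JSON out."""
    return {"input": markdown, "schema": schema}

def post_json(url: str, payload: dict, token: str) -> dict:
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode(),
        headers={"Authorization": f"Bearer {token}",
                 "Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=120) as resp:
        return json.load(resp)

def site_to_json(site_url: str, schema: dict, token: str) -> list[dict]:
    """Website -> Cloudflare /crawl -> markdown -> Smole -> structured JSON."""
    crawl = post_json(CRAWL_URL, build_crawl_request(site_url), token)
    return [post_json(EXTRACT_URL,
                      build_extract_request(page["markdown"], schema), token)
            for page in crawl["pages"]]
```

That's the entire architecture: one crawl call fans out into one extraction call per page.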

First test: our own site

We started with smole.tech — 5 pages, depth of 1, no JavaScript rendering needed. Within 30 seconds we had structured data for every page: titles, descriptions, headings, all outbound links. It just worked.

That gave us enough confidence to try something more interesting.

The Cloudflare blog

We pointed the crawler at blog.cloudflare.com. The front page alone returned 36,000 characters of markdown with 165 links. After filtering out navigation, author pages, tag pages, and CDN image URLs, we had 20 actual blog post links.

Then we defined a schema — not a simple one, but something that would actually test the extraction:

{
  "title": { "type": "string" },
  "authors": { "type": "array", "items": { "type": "string" } },
  "publishDate": { "type": "string" },
  "summary": { "type": "string" },
  "keyTopics": { "type": "array", "items": { "type": "string" } },
  "technologiesMentioned": { "type": "array", "items": { "type": "string" } },
  "problemStatement": { "type": "string" },
  "solution": { "type": "string" },
  "keyTakeaways": { "type": "array", "items": { "type": "string" } },
  "relatedProducts": { "type": "array", "items": { "type": "string" } },
  "externalReferences": {
    "type": "array",
    "items": { "type": "object", "properties": { "name": { "type": "string" }, "url": { "type": "string" } } }
  }
}

We asked for authors (plural), technologies, a problem/solution breakdown, key takeaways, and even external references with their URLs. Eleven fields, some of them nested arrays of objects.

Here's what came back for a post about RFC 9457:

{
  "title": "Slashing agent token costs by 98% with RFC 9457-compliant error responses",
  "authors": ["Sam Marsh"],
  "publishDate": "2026-03-11",
  "summary": "Cloudflare now returns RFC 9457-compliant structured responses for all 1xxx-class error paths to AI agents, replacing traditional HTML error pages with efficient, machine-readable error messages.",
  "keyTopics": ["structured error responses", "RFC 9457", "error handling efficiency"],
  "technologiesMentioned": ["Markdown", "JSON", "HTTP", "RFC 9457"],
  "problemStatement": "AI agents are still receiving the same HTML error pages designed for human users which waste time and resources when errors occur.",
  "solution": "Cloudflare now returns RFC 9457-compliant structured Markdown and JSON error payloads to AI agents.",
  "keyTakeaways": [
    "Structured responses reduce payload size and token usage by more than 98%.",
    "Cloudflare's structured error responses provide actionable guidance for agents.",
    "Site owners do not need to configure anything; responses are automatically integrated."
  ],
  "relatedProducts": ["Cloudflare Workers", "AI Gateway"],
  "externalReferences": [
    { "name": "RFC 9457", "url": "https://www.rfc-editor.org/rfc/rfc9457" }
  ]
}

Every field filled in correctly. The extraction identified the right author, pulled the actual RFC link, separated the problem from the solution, and distilled three specific takeaways from a 32,000-character post. For a post with three authors, it correctly returned all three.

We ran this against 3 blog posts — 3 for 3, all with rich, accurate extractions. The same schema, applied across different posts with different structures and topics, consistently returned useful data.

Where it got interesting: Funda.nl

Confident from the blog test, we tried something harder — Funda.nl, the biggest real estate platform in the Netherlands. The idea was to crawl their property listings in Eindhoven and extract structured data: address, price, rooms, energy label, listing agent.

This is where things broke down, and the failures were more instructive than the successes.

Problem 1: Bot detection. Funda serves a verification page to automated requests. Instead of property listings, we got: "We houden ons platform graag veilig en spamvrij" ("We like to keep our platform safe and spam-free") — a polite Dutch bot wall. Even with Cloudflare's headless browser rendering enabled, Funda detected the automated access and blocked it.

Problem 2: JavaScript-heavy content. Even when we got past the initial page, Funda loads its actual listing data via API calls after the page renders. The crawler captured the page shell — navigation, footer, marketing copy — but zero property data. The listings simply weren't in the HTML that the crawler saw.

Problem 3: Query parameter stripping. The search URL we used (/zoeken/koop?selected_area=["eindhoven"]) had its query parameters stripped by the crawl endpoint. Instead of the Eindhoven search results, we got the generic search page.

None of these are Smole limitations — the extraction never even got a chance to run because the crawling step couldn't get to the actual content. But they're important to understand if you're thinking about using this approach.

What we learned

This works incredibly well for content-rich, server-rendered sites

Blogs, documentation sites, WordPress sites, static pages — anything where the content is in the HTML and there's no bot protection. The Cloudflare-to-Smole pipeline handles these effortlessly.

The key insight is that Cloudflare returns markdown, not HTML. Markdown strips away all the layout noise and gives you pure content with semantic structure intact. That's exactly what LLM-based extraction needs to work well.

Schema design is where the leverage is

The difference between a basic schema (title, description) and a detailed one (problem statement, solution, takeaways, external references) is enormous — and it costs almost the same to run. Once you have the markdown, you might as well extract everything useful from it.

Fields that can't be found in the content come back as null rather than hallucinated values. This means you can design an ambitious schema and trust that the output is honest about what it found.
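That null-over-hallucination contract is easy to act on downstream. A small helper — ours, not part of any API — that flags which required fields the extraction couldn't find, so you can route incomplete records for review:

```python
def missing_fields(record: dict, required: tuple[str, ...]) -> list[str]:
    """Names of required fields the extraction reported as null or absent."""
    return [field for field in required if record.get(field) is None]

# A record where the extraction couldn't find a publish date.
record = {"title": "Some post", "authors": ["Sam Marsh"], "publishDate": None}
missing_fields(record, ("title", "authors", "publishDate"))
# → ["publishDate"]
```

Because absent fields are honestly null, this check is reliable — you never have to guess whether a value was found or invented.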

The two-step approach beats the one-shot for real use cases

Crawling and extracting in one pass is convenient for demos, but in practice you want to separate discovery from extraction. Crawl a page, look at the links you got, filter out the noise, then send only the relevant URLs for extraction.

The Cloudflare blog front page had 165 links. Only 20 were actual blog posts. If we'd blindly extracted from all of them, we'd have wasted API calls on tag pages, author profiles, and CDN image URLs.
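The filtering step is plain URL hygiene. A sketch of the kind of filter we mean — the skip prefixes and suffixes here are illustrative and need tuning per site:

```python
from urllib.parse import urlparse

# Illustrative filter rules -- adjust per target site.
SKIP_PREFIXES = ("/author/", "/tag/", "/search", "/subscribe")
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".gif", ".svg", ".webp")

def is_post_link(url: str, host: str = "blog.cloudflare.com") -> bool:
    """True if a crawled link looks like an actual blog post."""
    parsed = urlparse(url)
    return (
        parsed.netloc == host
        and parsed.path not in ("", "/")                    # skip front page
        and not parsed.path.startswith(SKIP_PREFIXES)       # skip site chrome
        and not parsed.path.lower().endswith(IMAGE_SUFFIXES)  # skip CDN images
    )

links = [
    "https://blog.cloudflare.com/some-post/",
    "https://blog.cloudflare.com/author/sam-marsh/",
    "https://blog.cloudflare.com/tag/ai/",
    "https://cf-assets.www.cloudflare.com/hero.png",
]
posts = [u for u in links if is_post_link(u)]
# → ["https://blog.cloudflare.com/some-post/"]
```

A few lines of filtering like this is what took our 165 raw links down to the 20 worth extracting.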

Bot detection is the real bottleneck, not technology

The crawling technology works. The extraction technology works. What stops you is whether the target site lets you in. This isn't a limitation of Cloudflare or Smole — it's a fact of the modern web. Sites that don't want to be scraped have effective ways to prevent it.

The good news: the sites where this approach works best — blogs, docs, public directories, open data sources — are also the sites least likely to block automated access.

What you can do with this

Think about any scenario where you need structured data from a collection of web pages:

  • Competitive intelligence — extract pricing, features, and positioning from competitor websites
  • Content aggregation — build datasets from blogs, news sites, or research publications
  • Lead generation — extract company info, team details, and contact information from business directories
  • Market research — pull product specs, reviews, and pricing from e-commerce sites
  • Knowledge bases — turn documentation sites into structured, searchable datasets

For each of these, you define a schema once and run the pipeline against as many pages as you need. The output is clean JSON, ready for a database, spreadsheet, or downstream application.
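The last step — turning extraction results into that folder of JSON files — is a few lines of standard library. A sketch (the URL-to-filename scheme is our own convention, not anything the APIs prescribe):

```python
import json
import re
from pathlib import Path

def slug(url: str) -> str:
    """Filesystem-safe filename derived from a page URL."""
    return re.sub(r"[^a-z0-9]+", "-", url.lower()).strip("-")

def write_dataset(results: dict[str, dict], out_dir: str) -> list[Path]:
    """Write one JSON file per crawled page, named after its URL."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths = []
    for url, data in results.items():
        path = out / f"{slug(url)}.json"
        path.write_text(json.dumps(data, indent=2))
        paths.append(path)
    return paths
```

From there the files load straight into whatever comes next — a database import, a spreadsheet, or a search index.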

We've open-sourced everything we built for this experiment at github.com/smole-ai/smole-examples.