Turn any website
into LLM-ready data.
Scrape, crawl and search the entire web with one API. crawlfox returns clean markdown, structured JSON and inline citations — ready for your RAG pipeline.
Everything you need to feed your LLM the web.
One unified API. Five primitives. Zero infrastructure to maintain.
Any URL → clean markdown in ~600ms.
Reader-mode cleaning, JS execution, automatic encoding fixes. Strips ads, popups, navs and trackers so your model only sees the signal.
10× smaller. 5× faster.
Tighter payloads. Lower latency.
Map an entire site from one seed URL.
Breadth-first, polite, dedup'd. We follow robots.txt and respect rate limits — you get the sitemap and the markdown.
Semantic search across freshly-crawled indexes.
Returns ranked passages with exact source URLs and confidence scores. Drop-in replacement for embedded vector stores.
Pass a JSON schema. Get back structured data.
Our extractor maps any page to your shape — with field-level confidence scores. Returns nulls rather than hallucinating.
Rotating residential proxies, 90 countries.
Handle WAFs, captchas and the 1% of sites that block everything else.
A 2-line dependency.
Production-grade infrastructure.
Use any language with HTTP. The examples below show Python, TypeScript, Go, and curl — same response shape, every time.
# pip install crawlfox
from crawlfox import Crawlfox

fox = Crawlfox(api_key="cfx_live_…")

# scrape a single page
page = fox.scrape(
    url="https://stripe.com/docs/api",
    formats=["markdown", "links"],
    extract={
        "endpoints": "list[str]",
        "auth_method": "string",
    },
)

print(page.markdown)  # clean reader text
print(page.data)      # structured JSON
{
  "url": "https://stripe.com/docs/api",
  "markdown": "# Stripe API Reference\n\nThe Stripe API…",
  "links": ["…/charges", "…/payment_intents", + 47],
  "data": {
    "endpoints": ["/v1/charges", "/v1/customers", …],
    "auth_method": "Bearer token (sk_live_…)"
  }
}
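As a rough illustration of what "ready for your RAG pipeline" could look like on the consuming side, the sketch below splits the `markdown` field of a response like the one above into chunks and tags each chunk with its source URL. It assumes the response has already been decoded into a Python dict; `chunk_markdown`, `to_records` and the record layout are hypothetical helpers for this example, not part of crawlfox.

```python
from typing import Any


def chunk_markdown(markdown: str, max_chars: int = 500) -> list[str]:
    """Pack blank-line-separated paragraphs into chunks of up to max_chars.

    A single paragraph longer than max_chars is kept whole rather than split.
    """
    chunks: list[str] = []
    current = ""
    for para in markdown.split("\n\n"):
        para = para.strip()
        if not para:
            continue
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = para
    if current:
        chunks.append(current)
    return chunks


def to_records(response: dict[str, Any], max_chars: int = 500) -> list[dict[str, Any]]:
    """Attach the source URL to every chunk so citations survive ingestion."""
    chunks = chunk_markdown(response.get("markdown", ""), max_chars)
    return [
        {"source": response["url"], "chunk_index": i, "text": text}
        for i, text in enumerate(chunks)
    ]


# Sample payload mirroring the response shape shown above (values shortened).
sample = {
    "url": "https://stripe.com/docs/api",
    "markdown": "# Stripe API Reference\n\nThe Stripe API…",
    "data": {"auth_method": "Bearer token (sk_live_…)"},
}

records = to_records(sample, max_chars=30)
for record in records:
    print(record)
```

Each record could then be embedded and written to whichever vector store you use; keeping `source` on every chunk is what lets answers cite the original page.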
Streaming responses
Get tokens the moment they're extracted. Drop into your RAG ingestion without buffering.
Stealth mode & rotating proxies
Residential IPs across 90 countries. Handles WAFs, captchas, and the 1% that block everything else.
JS rendering, optional
Headless Chromium for SPAs and infinite-scroll. Skip it for 10× faster static pages.
Webhooks & queues
Fire-and-forget for big jobs. Get a callback when 50,000 pages are done.
Schema extraction
Define a JSON schema; we'll fit any page to it — with field-level confidence scores.
Four stages.
Zero guesswork.
The same pipeline that runs in our production cluster runs on a free-tier request. Scroll to follow a single URL through the entire stack.
Seed in,
sitemap out.
Breadth-first walk from any URL. Respects robots.txt, sitemaps, and your include / exclude rules.
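To make the crawl stage concrete, here is a minimal sketch of a breadth-first walk with de-duplication, a robots.txt check and include / exclude patterns. It is not crawlfox's implementation: the `LINKS` graph stands in for real HTTP fetches, and the `crawl` function, its parameters and the glob-style patterns are assumptions made for this example. Only the standard library is used.

```python
from collections import deque
from fnmatch import fnmatchcase
from urllib.robotparser import RobotFileParser

# Hypothetical in-memory link graph standing in for real page fetches.
LINKS = {
    "https://example.com/": [
        "https://example.com/docs",
        "https://example.com/blog",
        "https://example.com/admin",
    ],
    "https://example.com/docs": ["https://example.com/docs/api", "https://example.com/"],
    "https://example.com/blog": ["https://example.com/docs"],
    "https://example.com/docs/api": [],
    "https://example.com/admin": [],
}


def crawl(seed, links, robots, include=("*",), exclude=()):
    """Breadth-first walk from seed, visiting each allowed URL exactly once."""
    seen = {seed}
    queue = deque([seed])
    order = []
    while queue:
        url = queue.popleft()
        order.append(url)
        for nxt in links.get(url, []):
            if nxt in seen:
                continue  # de-duplicate: every URL is considered only once
            seen.add(nxt)
            if not robots.can_fetch("*", nxt):
                continue  # disallowed by robots.txt
            if not any(fnmatchcase(nxt, pattern) for pattern in include):
                continue  # outside the include rules
            if any(fnmatchcase(nxt, pattern) for pattern in exclude):
                continue  # matched an exclude rule
            queue.append(nxt)
    return order


# Parse robots.txt rules from memory instead of fetching them.
robots = RobotFileParser()
robots.parse(["User-agent: *", "Disallow: /admin"])

sitemap = crawl("https://example.com/", LINKS, robots, exclude=["*/blog*"])
print(sitemap)
```

Using a queue rather than a stack is what makes the walk breadth-first, so pages closest to the seed are discovered before deeper ones.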
Polite,
parallel.
Per-host throttling, automatic retries, and a global rate budget so you never get rate-limited.
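Per-host throttling and retries can be sketched as follows. This is an illustration under stated assumptions, not a description of crawlfox's scheduler: `HostThrottle`, `fetch_with_retries` and `FakeClock` are hypothetical names, the global rate budget is left out, and a fake clock replaces real sleeping so the example runs instantly.

```python
import time
from urllib.parse import urlsplit


class HostThrottle:
    """Enforce a minimum interval between requests to the same host."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self.clock = clock
        self.sleep = sleep
        self.next_allowed = {}  # host -> earliest time of the next request

    def wait(self, url):
        """Block until the URL's host may be hit again; return the delay used."""
        host = urlsplit(url).netloc
        now = self.clock()
        delay = max(0.0, self.next_allowed.get(host, now) - now)
        if delay:
            self.sleep(delay)
        self.next_allowed[host] = now + delay + self.min_interval
        return delay


def fetch_with_retries(fetch, url, throttle, attempts=3):
    """Call fetch(url), retrying on any exception up to `attempts` total tries."""
    last_error = None
    for _ in range(attempts):
        throttle.wait(url)
        try:
            return fetch(url)
        except Exception as error:  # broad on purpose for this sketch
            last_error = error
    raise last_error


class FakeClock:
    """Deterministic clock: sleeping just advances the current time."""

    def __init__(self):
        self.now = 0.0

    def time(self):
        return self.now

    def sleep(self, seconds):
        self.now += seconds


clock = FakeClock()
throttle = HostThrottle(1.0, clock=clock.time, sleep=clock.sleep)

# Two requests to the same host are spaced out; a different host is not delayed.
urls = ["https://a.example/1", "https://a.example/2", "https://b.example/1"]
delays = [throttle.wait(url) for url in urls]
print(delays)

calls = []


def flaky(url):
    """Hypothetical fetcher that fails twice before succeeding."""
    calls.append(url)
    if len(calls) < 3:
        raise ConnectionError("transient failure")
    return "ok"


result = fetch_with_retries(flaky, "https://c.example/", throttle)
print(result, len(calls))
```

Keeping the throttle state per host means a slow or strict site only delays its own requests, while the rest of the crawl proceeds in parallel.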
Reader-mode
on steroids.
A purpose-built parser strips chrome, fixes encoding, resolves URLs and normalizes whitespace into model-friendly markdown.
.cookie-banner .ad .modal
→ # Pricing
→ ## Free tier
→ 1,000 pages / month
Schema,
not soup.
A purpose-trained reasoning model fits the clean page to your JSON schema. Returns nulls instead of inventing data.
{
  "plan": "Free",
  "price_per_mo": 0,
  "page_quota": 1000,
  "_conf": 0.97
}
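The "nulls instead of inventing data" contract can be illustrated with a small sketch. The extractor itself is a model and is not reproduced here; `fit_to_schema`, the `SCHEMA` mapping of field names to Python types, and the `trial_days` field are all hypothetical, chosen only to show a missing or ill-typed field becoming `None` rather than a guess.

```python
# Hypothetical schema: field name -> accepted Python type(s).
SCHEMA = {
    "plan": str,
    "price_per_mo": (int, float),
    "page_quota": int,
    "trial_days": int,  # not present on the page in this example
}


def fit_to_schema(candidate, schema):
    """Keep fields whose value matches the expected type; all others become None."""
    fitted = {}
    for field, expected in schema.items():
        value = candidate.get(field)
        # bool is a subclass of int in Python, so reject it explicitly;
        # this sketch has no boolean fields.
        ok = isinstance(value, expected) and not isinstance(value, bool)
        fitted[field] = value if ok else None
    return fitted


# Candidate values as they might come off a page: one field has the wrong type
# and one is missing entirely.
candidate = {
    "plan": "Free",
    "price_per_mo": 0,
    "page_quota": "1,000 pages",
    "_conf": 0.97,
}

record = fit_to_schema(candidate, SCHEMA)
print(record)
```

Downstream code can then treat `None` as "not found on the page" and decide whether to retry, fall back, or skip the field.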
Ready for
your model.
Stream straight into your vector store, RAG pipeline, or fine-tune dataset. One API. No glue code.
Start free. Scale when ready.
No seat charges. No proxy fees. No data-egress surprise bills.
Hobby
For experiments, side-projects and learning.
- 1,000 pages / month
- Scrape, crawl, search
- Community Discord
Builder
For production apps shipping AI features.
- 50,000 pages / month
- Schema extraction + streaming
- Stealth proxies included
- Email + Slack support
Scale
For teams crawling millions of pages.
- Unlimited pages
- Dedicated infra + SLAs
- SOC 2, HIPAA, custom DPAs
- Solutions engineer
Send the fox.
Keep the data.
Join 200+ teams shipping AI products on crawlfox. Be running in 60 seconds.