API reference
One endpoint for single URLs (/v1/scrape), one for batches (/v1/batch), one for search engines (/v1/serp/{engine}). Send a URL or a query, get back rendered content or ranked results. Authentication is a Bearer API key.
Authentication
Send your API key as Authorization: Bearer <key> on every request. Keys look like cfx_<id>_<secret>. Get one from the API Keys page. The plaintext secret is shown exactly once, when the key is issued, so save it then. If you lose it, rotate the key from the same page.
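As a minimal sketch of the header described above, the helper below builds the request headers from a key. The function name auth_headers is hypothetical, and the only validation is the documented cfx_ prefix, since the exact layout of <id> and <secret> is not specified here.

```python
def auth_headers(api_key: str) -> dict:
    """Build the headers sent with every request (hypothetical helper)."""
    # Only the documented "cfx_" prefix is checked; the internal layout of
    # <id> and <secret> is not specified, so nothing else is validated.
    if not api_key.startswith("cfx_"):
        raise ValueError("expected a key of the form cfx_<id>_<secret>")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }
```

In practice the key would typically come from the environment, for example auth_headers(os.environ["CRAWLFOX_API_KEY"]), so that it never lands in source control.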
POST /v1/scrape
Scrape a single URL. Returns the requested body formats plus metadata.
Request body
{
  "url": "https://example.com/", // required
  "formats": ["markdown", "html", "text", "json"], // optional, default ["markdown"]
  "skip_cache": false, // optional, force a fresh fetch
  "extract_main_content": false, // optional, drop nav/footer chrome
  "proxy_url": null // optional, BYO proxy
}
formats is multi-select. Each entry produces a key under outputs; all of them come from the same single fetch + extract upstream, so picking multiple costs only a few extra ms of formatter time.
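The request body above can be assembled with a small helper. This is a sketch only: scrape_body and ALLOWED_FORMATS are hypothetical names, and the client-side check simply mirrors the four formats listed in this reference rather than anything the server is known to enforce.

```python
ALLOWED_FORMATS = ("markdown", "html", "text", "json")


def scrape_body(url, formats=("markdown",), skip_cache=False,
                extract_main_content=False, proxy_url=None):
    """Assemble a /v1/scrape request body (hypothetical helper)."""
    unknown = [f for f in formats if f not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unsupported formats: {unknown}")
    body = {
        "url": url,
        # dict.fromkeys keeps first-seen order while dropping repeats.
        "formats": list(dict.fromkeys(formats)),
        "skip_cache": skip_cache,
        "extract_main_content": extract_main_content,
    }
    # Only send proxy_url when the caller actually brings a proxy.
    if proxy_url is not None:
        body["proxy_url"] = proxy_url
    return body
```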
Response shape
{
  "outputs": {
    "markdown": "# Example Domain\n...",
    "html": "<html>...</html>",
    "text": "Example Domain ...",
    "json": "{\"title\":\"Example Domain\",...}"
  },
  "metadata": {
    "title": "Example Domain",
    "description": null,
    "language": "en",
    "links": ["https://www.iana.org/domains/example"]
  },
  "cached": false
}
cached is true when the body came from our in-process cache (sub-100ms) and false when we fetched fresh from the target. Cache TTL is short enough that you almost always get a fresh-enough page; pass skip_cache: true to force a refetch.
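In the sample response the json output is itself a JSON-encoded string, so it needs a second decode before use. The sketch below assumes that is always the case; unpack_scrape is a hypothetical name, and the function takes the already-parsed response body.

```python
import json


def unpack_scrape(response: dict) -> dict:
    """Normalise a parsed /v1/scrape response (hypothetical helper)."""
    outputs = dict(response.get("outputs", {}))
    raw_json = outputs.get("json")
    # Assumption: the "json" format arrives as a string, as in the sample
    # response, and must be decoded again to get a Python object.
    if isinstance(raw_json, str):
        outputs["json"] = json.loads(raw_json)
    return {
        "outputs": outputs,
        "metadata": response.get("metadata", {}),
        "cached": bool(response.get("cached", False)),
    }
```

Only the formats that were requested appear under outputs, so callers should look keys up with .get rather than assume all four are present.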
POST /v1/batch
Bulk variant of /v1/scrape. Same options per request; takes a urls array (max 100). Returns an array of results in the same order.
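Because a single call accepts at most 100 URLs and returns results in request order, longer lists have to be split client-side and the results zipped back onto their URLs. A minimal sketch, with batch_bodies and pair_results as hypothetical helper names:

```python
from typing import Iterator, Sequence

MAX_BATCH = 100  # documented per-request limit on the urls array


def batch_bodies(urls: Sequence[str], formats=("markdown",)) -> Iterator[dict]:
    """Yield /v1/batch request bodies of at most MAX_BATCH URLs each."""
    for i in range(0, len(urls), MAX_BATCH):
        yield {"urls": list(urls[i:i + MAX_BATCH]), "formats": list(formats)}


def pair_results(urls: Sequence[str], results: Sequence[dict]) -> list:
    """Zip results back onto their URLs, relying on the same-order guarantee."""
    if len(urls) != len(results):
        raise ValueError("result count does not match URL count")
    return list(zip(urls, results))
```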
curl -X POST https://api.crawlfox.io/v1/batch \
  -H "Authorization: Bearer $CRAWLFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com/", "https://example.org/"],
    "formats": ["markdown"]
  }'
POST /v1/serp/{engine}
Search the open web through a real search engine and get back ranked organic results — no HTML parsing on your side. {engine} is one of google, bing, or duckduckgo. The request body and response shape are identical across all three — only the path segment changes. Same Bearer auth as the rest of the API.
Engine endpoints
| engine | path | notes |
|---|---|---|
| google | /v1/serp/google | Full operator support (site:, intitle:, date ranges). Highest result quality. |
| bing | /v1/serp/bing | Same query operators. Often surfaces different long-tail results than Google. |
| duckduckgo | /v1/serp/duckduckgo | No tracking, no personalisation. Best for region-neutral, account-neutral results. |
Request body
{
  "q": "rust async tutorial", // required, non-empty
  "num": 20, // optional, 1–100 (default 10)
  "start": 0, // optional, pagination offset
  "country": "us", // optional, 2-letter ISO
  "language": "en", // optional, 2-letter ISO
  "proxy_id": null, // optional, managed pool entry
  "proxy_url": null // optional, BYO proxy (wins over proxy_id)
}
Advanced Google operators (site:, intitle:, inurl:, date ranges) are accepted inline in q and composed into the target URL automatically.
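The constraints on q, num, and start can be checked before a request is sent, and start can be stepped to page through results. This is a sketch under one stated assumption: that start counts results rather than pages, which this reference does not spell out. serp_body and serp_pages are hypothetical names.

```python
from typing import Iterator, Optional


def serp_body(q: str, num: int = 10, start: int = 0,
              country: Optional[str] = None,
              language: Optional[str] = None) -> dict:
    """Validate and assemble a /v1/serp/{engine} request body."""
    if not q.strip():
        raise ValueError("q is required and must be non-empty")
    if not 1 <= num <= 100:
        raise ValueError("num must be between 1 and 100")
    if start < 0:
        raise ValueError("start must be zero or positive")
    body = {"q": q, "num": num, "start": start}
    if country:
        body["country"] = country
    if language:
        body["language"] = language
    return body


def serp_pages(q: str, total: int, num: int = 20) -> Iterator[dict]:
    """Yield one body per page until `total` results are covered.

    Assumes `start` is a result offset, not a page index.
    """
    for start in range(0, total, num):
        yield serp_body(q, num=min(num, total - start), start=start)
```

Operators go straight into the query string, for example serp_body("async runtime site:tokio.rs intitle:tutorial").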
Response shape
{
  "engine": "google",
  "query": "rust async tutorial",
  "results": [
    {
      "rank": 1,
      "url": "https://rust-lang.github.io/async-book/",
      "title": "Asynchronous Programming in Rust",
      "description": "An introduction to asynchronous programming in Rust...",
      "ad": false
    },
    {
      "rank": 2,
      "url": "https://tokio.rs/tokio/tutorial",
      "title": "Tutorial — Tokio",
      "description": "Tokio is an asynchronous runtime for the Rust...",
      "ad": false
    }
  ],
  "page": { "start": 0, "num": 20 }
}
rank is 1-based across the organic block. Ads and instant-answers use negative ranks so a sort on rank keeps them above organic without breaking the contiguous 1..N positions. The shape is identical across engines so you can swap {engine} without touching the deserialiser.
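The rank convention makes it easy to separate organic results from ads and instant answers. A small sketch, with split_results as a hypothetical name; it treats any rank below 1 as non-organic, which goes slightly beyond the stated "negative ranks" to cover a possible 0.

```python
def split_results(results: list) -> tuple:
    """Return (extras, organic), each sorted by rank.

    extras holds ads and instant answers (rank < 1); organic holds the
    1-based positions.
    """
    ordered = sorted(results, key=lambda r: r["rank"])
    extras = [r for r in ordered if r["rank"] < 1]
    organic = [r for r in ordered if r["rank"] >= 1]
    return extras, organic
```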
Streaming variant
POST /v1/serp/{engine}/stream takes the same body and returns newline-delimited JSON: one row per result as it lands. Useful when you want to start ranking before the full page resolves, or to short-circuit early on a known-good result.
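Consuming newline-delimited JSON means decoding one line at a time and stopping as soon as a row is good enough. The sketch below works on any iterable of lines; first_match is a hypothetical name, and it assumes each streamed row has the same fields as an entry in results.

```python
import json
from typing import Callable, Iterable, Optional


def first_match(lines: Iterable[bytes],
                wanted: Callable[[dict], bool]) -> Optional[dict]:
    """Return the first streamed row that satisfies `wanted`, else None."""
    for raw in lines:
        # Skip blank keep-alive lines that some NDJSON streams emit.
        if not raw.strip():
            continue
        row = json.loads(raw)
        if wanted(row):
            # Returning here stops reading; the rest of the stream is
            # never consumed.
            return row
    return None
```

With the requests library, the lines could come from requests.post(..., stream=True) followed by resp.iter_lines(); closing the response after an early return releases the connection.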
Quick start
curl -X POST https://api.crawlfox.io/v1/scrape \
  -H "Authorization: Bearer $CRAWLFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com/",
    "formats": ["markdown", "html", "text", "json"]
  }'

const res = await fetch("https://api.crawlfox.io/v1/scrape", {
  method: "POST",
  headers: {
    authorization: `Bearer ${process.env.CRAWLFOX_API_KEY}`,
    "content-type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com/",
    formats: ["markdown"],
  }),
});
const data = await res.json();
console.log(data.outputs.markdown);

import os
import requests
resp = requests.post(
    "https://api.crawlfox.io/v1/scrape",
    headers={"Authorization": f"Bearer {os.environ['CRAWLFOX_API_KEY']}"},
    json={"url": "https://example.com/", "formats": ["markdown"]},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["outputs"]["markdown"])
SERP / search
curl -X POST https://api.crawlfox.io/v1/serp/google \
  -H "Authorization: Bearer $CRAWLFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "q": "rust async tutorial",
    "num": 20,
    "country": "us",
    "language": "en"
  }'

curl -X POST https://api.crawlfox.io/v1/serp/bing \
  -H "Authorization: Bearer $CRAWLFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"q": "rust async tutorial", "num": 20}'

curl -X POST https://api.crawlfox.io/v1/serp/duckduckgo \
  -H "Authorization: Bearer $CRAWLFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"q": "rust async tutorial", "num": 20}'

const res = await fetch("https://api.crawlfox.io/v1/serp/google", {
  method: "POST",
  headers: {
    authorization: `Bearer ${process.env.CRAWLFOX_API_KEY}`,
    "content-type": "application/json",
  },
  body: JSON.stringify({ q: "rust async tutorial", num: 20 }),
});
const { results } = await res.json();
for (const r of results) console.log(r.rank, r.url, r.title);

import os
import requests
resp = requests.post(
    "https://api.crawlfox.io/v1/serp/google",
    headers={"Authorization": f"Bearer {os.environ['CRAWLFOX_API_KEY']}"},
    json={"q": "rust async tutorial", "num": 20},
    timeout=60,
)
resp.raise_for_status()
for r in resp.json()["results"]:
    print(r["rank"], r["url"], r["title"])
Errors
Every failure returns a structured envelope with a stable code and a retryable hint. Switch on code in your client; the human-readable title / message / remediation are designed to surface to end-users without modification.
{
  "code": "UPSTREAM_TIMEOUT",
  "status": 504,
  "retryable": true,
  "title": "Request timed out",
  "message": "The target site did not respond in time.",
  "remediation": "Retry. If a URL consistently times out, try a more specific path."
}
Codes
| code | retryable | when |
|---|---|---|
| MISSING_URL | no | No URL in the request body. |
| INVALID_URL | no | The URL didn't parse. |
| UPSTREAM_UNREACHABLE | yes | Proxy couldn't tunnel to the target. Often means the URL isn't a public website (ad-tech endpoint, DNS infra). |
| UPSTREAM_NOT_FOUND | no | Upstream returned 404 / 410. |
| UPSTREAM_TIMEOUT | yes | Site didn't respond within the deadline. |
| UPSTREAM_RATE_LIMITED | yes | Site rate-limited us (HTTP 429). |
| UPSTREAM_SERVER_ERROR | yes | Site returned 5xx. |
| UPSTREAM_GEO_BLOCKED | no | HTTP 451: site refused on legal/regional grounds. Try a proxy in a different region. |
| BOT_WALL | no | Anti-bot protection blocked the request after the unblocker tier exhausted its rungs. |
| NO_PUBLIC_CONTENT | no | Page loaded but had no extractable content (rare; usually a login wall — those now surface as method=auth_wall instead). |
| INTERNAL_ERROR | yes | Bug on our side. Retry; if it persists, contact support with the code. |
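The retryable flag in the envelope is enough to drive a generic retry loop without hard-coding the table above. A minimal sketch with exponential backoff: ApiError and call_with_retry are hypothetical names, the attempt count and delays are arbitrary choices, and send is any callable that performs one request and returns (status, parsed_body).

```python
import time
from typing import Callable, Tuple


class ApiError(Exception):
    """Raised when a request fails and will not be retried further."""

    def __init__(self, envelope: dict):
        super().__init__(f"{envelope.get('code')}: {envelope.get('message')}")
        self.code = envelope.get("code")
        self.retryable = bool(envelope.get("retryable"))


def call_with_retry(send: Callable[[], Tuple[int, dict]],
                    attempts: int = 4,
                    base_delay: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> dict:
    """Call send() until it succeeds, retrying only retryable errors."""
    for attempt in range(attempts):
        status, payload = send()
        if status < 400:
            return payload
        # Give up at once on non-retryable codes, or when out of attempts.
        if not payload.get("retryable") or attempt == attempts - 1:
            raise ApiError(payload)
        # Backoff doubles each round: 0.5s, 1s, 2s, ...
        sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Callers can still switch on err.code after the loop gives up, for example to suggest a different proxy region on UPSTREAM_GEO_BLOCKED.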
Rate limits
Each key has a per-minute token bucket sized by plan. Exceeding it returns HTTP 429 with a retry-after header (seconds). See your current plan + usage on the dashboard.
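A client can honour the retry-after header by sleeping for the advertised number of seconds before trying again. The sketch below only computes the wait; retry_after_seconds is a hypothetical name, and the default and cap values are arbitrary safety choices rather than documented behaviour.

```python
from typing import Mapping, Optional


def retry_after_seconds(status: int, headers: Mapping[str, str],
                        default: float = 1.0,
                        cap: float = 60.0) -> Optional[float]:
    """Return how long to wait after a 429, or None for any other status."""
    if status != 429:
        return None
    # Header names are case-insensitive, so normalise before the lookup.
    lowered = {k.lower(): v for k, v in headers.items()}
    try:
        wait = float(lowered.get("retry-after", default))
    except (TypeError, ValueError):
        # Fall back if the header is present but not a plain number.
        wait = default
    return max(0.0, min(wait, cap))
```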