
API reference

One endpoint for single URLs (/v1/scrape), one for batches (/v1/batch), one for search engines (/v1/serp/{engine}). Send a URL or a query, get back rendered content or ranked results. Authentication is a Bearer API key.

Authentication

Send your API key as Authorization: Bearer <key> on every request. Keys look like cfx_<id>_<secret>. Get one from the API Keys page. The plaintext secret is shown exactly once, when the key is issued, so save it then. If you lose it, rotate the key from the same page.
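A minimal Python sketch of building that header. The helper name auth_headers and the shape check are illustrative assumptions, not part of the API; the only documented facts used are the Bearer scheme and the cfx_<id>_<secret> key shape.

```python
def auth_headers(api_key: str) -> dict[str, str]:
    """Return request headers carrying the key as a Bearer token."""
    # Sanity check against the documented cfx_<id>_<secret> shape.
    # Split at most twice so underscores inside the secret are tolerated
    # (whether secrets can contain underscores is an assumption).
    parts = api_key.split("_", 2)
    if len(parts) != 3 or parts[0] != "cfx" or not all(parts):
        raise ValueError("expected a key shaped like cfx_<id>_<secret>")
    return {
        "Authorization": f"Bearer {api_key}",
        "Content-Type": "application/json",
    }


# Placeholder key for illustration only; load the real one from the environment.
headers = auth_headers("cfx_demo_notarealsecret")
```

Failing early on a malformed key catches a common setup mistake (an empty or truncated environment variable) before any request is sent.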

POST /v1/scrape

Scrape a single URL. Returns the requested body formats plus metadata.

Request body

{
  "url":              "https://example.com/",   // required
  "formats":          ["markdown", "html", "text", "json"],  // optional, default ["markdown"]
  "skip_cache":       false,                    // optional, force a fresh fetch
  "extract_main_content": false,                // optional, drop nav/footer chrome
  "proxy_url":        null                      // optional, BYO proxy
}

formats is multi-select. Each entry produces a key under outputs; all of them come from the same single fetch + extract upstream, so picking multiple costs only a few extra ms of formatter time.

Response shape

{
  "outputs": {
    "markdown": "# Example Domain\n...",
    "html": "<html>...</html>",
    "text": "Example Domain ...",
    "json": "{\"title\":\"Example Domain\",...}"
  },
  "metadata": {
    "title": "Example Domain",
    "description": null,
    "language": "en",
    "links": ["https://www.iana.org/domains/example"]
  },
  "cached": false
}

cached is true when the body came from our in-process cache (sub-100 ms responses) and false when it was fetched fresh from the target. The cache TTL is short, so a cached page is rarely stale; pass skip_cache: true to force a refetch.
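The request and response shapes above could be handled along these lines. This is a sketch under stated assumptions: the function names are hypothetical, the four format names are taken from the request example and assumed to be the complete set, and the json output is assumed to arrive as a JSON-encoded string, as the sample response suggests.

```python
import json
from typing import Any, Optional

# Assumption: these four names, taken from the request example, are the full set.
ALLOWED_FORMATS = {"markdown", "html", "text", "json"}


def build_scrape_body(
    url: str,
    formats: Optional[list[str]] = None,
    skip_cache: bool = False,
) -> dict[str, Any]:
    """Build a /v1/scrape request body, defaulting formats to ["markdown"]."""
    chosen = formats or ["markdown"]
    unknown = [f for f in chosen if f not in ALLOWED_FORMATS]
    if unknown:
        raise ValueError(f"unknown formats: {unknown}")
    return {"url": url, "formats": chosen, "skip_cache": skip_cache}


def read_outputs(response: dict[str, Any]) -> dict[str, Any]:
    """Return the outputs map, decoding the json output when it is a string."""
    outputs = dict(response["outputs"])
    if isinstance(outputs.get("json"), str):
        outputs["json"] = json.loads(outputs["json"])
    return outputs
```

A caller that needs a guaranteed fresh page can check response["cached"] and, when it is true, resend the same body built with skip_cache=True.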

POST /v1/batch

Bulk variant of /v1/scrape. Same options per request; takes a urls array (max 100). Returns an array of results in the same order.

cURL
curl -X POST https://api.crawlfox.io/v1/batch \
  -H "Authorization: Bearer $CRAWLFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "urls": ["https://example.com/", "https://example.org/"],
        "formats": ["markdown"]
      }'
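Lists longer than the 100-URL cap have to be split client-side. A minimal sketch, with hypothetical helper names; it relies only on the documented cap and on results coming back in request order.

```python
from typing import Any, Iterator

MAX_BATCH = 100  # documented cap on the urls array


def chunk_urls(urls: list[str], size: int = MAX_BATCH) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for i in range(0, len(urls), size):
        yield urls[i:i + size]


def pair_results(urls: list[str], results: list[Any]) -> list[tuple[str, Any]]:
    """Match each URL with its result, relying on the same-order guarantee."""
    if len(urls) != len(results):
        raise ValueError("result count does not match URL count")
    return list(zip(urls, results))
```

Each chunk becomes the urls value of one POST /v1/batch body; pair_results then ties every returned entry back to the URL that produced it.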

POST /v1/serp/{engine}

Search the open web through a real search engine and get back ranked organic results — no HTML parsing on your side. {engine} is one of google, bing, or duckduckgo. The request body and response shape are identical across all three — only the path segment changes. Same Bearer auth as the rest of the API.

Engine endpoints

| engine | path | notes |
|---|---|---|
| google | /v1/serp/google | Full operator support (site:, intitle:, date ranges). Highest result quality. |
| bing | /v1/serp/bing | Same query operators. Often surfaces different long-tail results than Google. |
| duckduckgo | /v1/serp/duckduckgo | No tracking, no personalisation. Best for region-neutral, account-neutral results. |

Request body

{
  "q":         "rust async tutorial",   // required, non-empty
  "num":      20,                       // optional, 1–100 (default 10)
  "start":    0,                        // optional, pagination offset
  "country":  "us",                     // optional, 2-letter ISO
  "language": "en",                     // optional, 2-letter ISO
  "proxy_id":  null,                    // optional, managed pool entry
  "proxy_url": null                     // optional, BYO proxy (wins over proxy_id)
}

Advanced Google operators (site:, intitle:, inurl:, date ranges) are accepted inline in q and composed into the target URL automatically.
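A sketch of building request bodies client-side, using the documented constraints (non-empty q, num between 1 and 100). The helper names are hypothetical, and page_bodies assumes start is a result offset, so page p begins at p * num; the source only calls it a pagination offset.

```python
from typing import Any, Optional


def build_serp_body(
    q: str,
    num: int = 10,
    start: int = 0,
    country: Optional[str] = None,
    language: Optional[str] = None,
) -> dict[str, Any]:
    """Validate the documented constraints and return a request body."""
    if not q.strip():
        raise ValueError("q must be non-empty")
    if not 1 <= num <= 100:
        raise ValueError("num must be between 1 and 100")
    if start < 0:
        raise ValueError("start must not be negative")
    body: dict[str, Any] = {"q": q, "num": num, "start": start}
    if country:
        body["country"] = country
    if language:
        body["language"] = language
    return body


def with_site(q: str, domain: str) -> str:
    """Prefix a query with an inline site: operator."""
    return f"site:{domain} {q}"


def page_bodies(q: str, pages: int, num: int = 10) -> list[dict[str, Any]]:
    """Bodies for the first `pages` pages (assumes start counts results)."""
    return [build_serp_body(q, num=num, start=p * num) for p in range(pages)]
```

For example, build_serp_body(with_site("async tutorial", "tokio.rs"), num=20) restricts the search to one domain without any extra request field.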

Response shape

{
  "engine": "google",
  "query": "rust async tutorial",
  "results": [
    {
      "rank": 1,
      "url": "https://rust-lang.github.io/async-book/",
      "title": "Asynchronous Programming in Rust",
      "description": "An introduction to asynchronous programming in Rust...",
      "ad": false
    },
    {
      "rank": 2,
      "url": "https://tokio.rs/tokio/tutorial",
      "title": "Tutorial — Tokio",
      "description": "Tokio is an asynchronous runtime for the Rust...",
      "ad": false
    }
  ],
  "page": { "start": 0, "num": 20 }
}

rank is 1-based across the organic block. Ads and instant answers use negative ranks, so an ascending sort on rank keeps them above the organic results without breaking the contiguous 1..N positions. The shape is identical across engines, so you can swap {engine} without touching the deserialiser.
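The rank convention makes it straightforward to separate paid or instant-answer rows from organic ones. A minimal sketch; split_results is a hypothetical helper that relies only on the sign of rank described above.

```python
from typing import Any


def split_results(
    results: list[dict[str, Any]],
) -> tuple[list[dict[str, Any]], list[dict[str, Any]]]:
    """Return (non_organic, organic), each sorted by ascending rank."""
    ordered = sorted(results, key=lambda r: r["rank"])
    non_organic = [r for r in ordered if r["rank"] < 0]  # ads, instant answers
    organic = [r for r in ordered if r["rank"] >= 1]     # contiguous 1..N block
    return non_organic, organic
```

Filtering on the sign of rank rather than on the ad flag also catches instant answers, which the description above places in the negative range as well.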

Streaming variant

POST /v1/serp/{engine}/stream takes the same body and returns newline-delimited JSON: one row per result as it lands. Useful when you want to start ranking before the full page resolves, or to short-circuit early on a known-good result.
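A sketch of consuming the newline-delimited stream and stopping at the first acceptable row. The function names are hypothetical, and it assumes each row carries the same fields as an entry in results. The parser takes any iterable of lines, so it can be exercised without a network call.

```python
import json
from typing import Any, Callable, Iterable, Iterator, Optional, Union


def iter_rows(lines: Iterable[Union[bytes, str]]) -> Iterator[dict[str, Any]]:
    """Decode one JSON object per non-empty line."""
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # tolerate blank or keep-alive lines
        yield json.loads(line)


def first_match(
    lines: Iterable[Union[bytes, str]],
    predicate: Callable[[dict[str, Any]], bool],
) -> Optional[dict[str, Any]]:
    """Return the first row accepted by `predicate`, reading no further."""
    for row in iter_rows(lines):
        if predicate(row):
            return row
    return None
```

With the requests library, the lines could come from a response opened with stream=True via resp.iter_lines(), which yields bytes; closing the response after first_match returns stops the download early.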

Quick start

cURL — single
curl -X POST https://api.crawlfox.io/v1/scrape \
  -H "Authorization: Bearer $CRAWLFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "url": "https://example.com/",
        "formats": ["markdown", "html", "text", "json"]
      }'
JavaScript / TypeScript
const res = await fetch("https://api.crawlfox.io/v1/scrape", {
  method: "POST",
  headers: {
    authorization: `Bearer ${process.env.CRAWLFOX_API_KEY}`,
    "content-type": "application/json",
  },
  body: JSON.stringify({
    url: "https://example.com/",
    formats: ["markdown"],
  }),
});
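// fetch does not reject on HTTP errors; on failure the body is the error envelope (see Errors).
const data = await res.json();
if (!res.ok) throw new Error(`${data.code}: ${data.message}`);
console.log(data.outputs.markdown);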
Python
import os

import requests

resp = requests.post(
    "https://api.crawlfox.io/v1/scrape",
    headers={"Authorization": f"Bearer {os.environ['CRAWLFOX_API_KEY']}"},
    json={"url": "https://example.com/", "formats": ["markdown"]},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["outputs"]["markdown"])

SERP / search

cURL — Google
curl -X POST https://api.crawlfox.io/v1/serp/google \
  -H "Authorization: Bearer $CRAWLFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
        "q": "rust async tutorial",
        "num": 20,
        "country": "us",
        "language": "en"
      }'
cURL — Bing
curl -X POST https://api.crawlfox.io/v1/serp/bing \
  -H "Authorization: Bearer $CRAWLFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"q": "rust async tutorial", "num": 20}'
cURL — DuckDuckGo
curl -X POST https://api.crawlfox.io/v1/serp/duckduckgo \
  -H "Authorization: Bearer $CRAWLFOX_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"q": "rust async tutorial", "num": 20}'
JavaScript / TypeScript
const res = await fetch("https://api.crawlfox.io/v1/serp/google", {
  method: "POST",
  headers: {
    authorization: `Bearer ${process.env.CRAWLFOX_API_KEY}`,
    "content-type": "application/json",
  },
  body: JSON.stringify({ q: "rust async tutorial", num: 20 }),
});
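// fetch does not reject on HTTP errors; on failure the body is the error envelope (see Errors).
const data = await res.json();
if (!res.ok) throw new Error(`${data.code}: ${data.message}`);
for (const r of data.results) console.log(r.rank, r.url, r.title);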
Python
import os

import requests

resp = requests.post(
    "https://api.crawlfox.io/v1/serp/google",
    headers={"Authorization": f"Bearer {os.environ['CRAWLFOX_API_KEY']}"},
    json={"q": "rust async tutorial", "num": 20},
    timeout=60,
)
resp.raise_for_status()
for r in resp.json()["results"]:
    print(r["rank"], r["url"], r["title"])

Errors

Every failure returns a structured envelope with a stable code and a retryable hint. Switch on code in your client; the human-readable title / message / remediation are designed to surface to end-users without modification.

{
  "code": "UPSTREAM_TIMEOUT",
  "status": 504,
  "retryable": true,
  "title": "Request timed out",
  "message": "The target site did not respond in time.",
  "remediation": "Retry. If a URL consistently times out, try a more specific path."
}

Codes

| code | retryable | when |
|---|---|---|
| MISSING_URL | no | No URL in the request body. |
| INVALID_URL | no | The URL didn't parse. |
| UPSTREAM_UNREACHABLE | yes | Proxy couldn't tunnel to the target. Often means the URL isn't a public website (ad-tech endpoint, DNS infra). |
| UPSTREAM_NOT_FOUND | no | Upstream returned 404 / 410. |
| UPSTREAM_TIMEOUT | yes | Site didn't respond within the deadline. |
| UPSTREAM_RATE_LIMITED | yes | Site rate-limited us (HTTP 429). |
| UPSTREAM_SERVER_ERROR | yes | Site returned 5xx. |
| UPSTREAM_GEO_BLOCKED | no | HTTP 451: site refused on legal/regional grounds. Try a proxy in a different region. |
| BOT_WALL | no | Anti-bot protection blocked the request after the unblocker tier exhausted its rungs. |
| NO_PUBLIC_CONTENT | no | Page loaded but had no extractable content (rare; usually a login wall, and those now surface as method=auth_wall instead). |
| INTERNAL_ERROR | yes | Bug on our side. Retry; if it persists, contact support with the code. |
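One way a client might switch on code and retryable is to turn the envelope into an exception. This is a sketch: the class and function names are hypothetical, and it assumes failures arrive with an HTTP status of 400 or above and a JSON body shaped like the envelope shown earlier.

```python
from typing import Any


class ApiError(Exception):
    """Carries the fields of the error envelope."""

    def __init__(self, envelope: dict[str, Any]) -> None:
        self.code = envelope.get("code", "UNKNOWN")
        self.status = envelope.get("status")
        self.retryable = bool(envelope.get("retryable", False))
        self.remediation = envelope.get("remediation", "")
        super().__init__(f"{self.code}: {envelope.get('message', '')}")


def raise_for_envelope(status_code: int, body: dict[str, Any]) -> dict[str, Any]:
    """Return the body on success, raise ApiError on failure."""
    if status_code >= 400:  # assumption: failures use HTTP error statuses
        raise ApiError(body)
    return body
```

Callers can then catch ApiError, retry only when err.retryable is true, and show err.remediation to the user otherwise.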

Rate limits

Each key has a per-minute token bucket sized by plan. Exceeding it returns HTTP 429 with a retry-after header giving the wait in seconds. See your current plan and usage on the dashboard.
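A sketch of honouring that header on a 429. The function names, the three-attempt cap, and the one-second fallback are illustrative assumptions; send is any caller-supplied function returning (status, headers, body), which keeps the retry logic independent of the HTTP library.

```python
import time
from typing import Any, Callable, Mapping


def retry_delay(headers: Mapping[str, str], default: float = 1.0) -> float:
    """Seconds to wait, read from retry-after (header names are case-insensitive)."""
    for name, value in headers.items():
        if name.lower() == "retry-after":
            try:
                return max(0.0, float(value))
            except ValueError:
                return default  # unparseable value: fall back
    return default


def call_with_retry(
    send: Callable[[], tuple[int, Mapping[str, str], Any]],
    max_attempts: int = 3,
    sleep: Callable[[float], None] = time.sleep,
) -> tuple[int, Any]:
    """Call `send`, waiting and retrying while the API answers 429."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, headers, body = send()
        if status != 429 or attempt == max_attempts:
            break
        sleep(retry_delay(headers))
    return status, body
```

Passing sleep as a parameter lets tests substitute a recorder for time.sleep; when the final attempt is still a 429, the caller gets that status back and decides what to do.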