
Crawl API

BETA

The Crawl API is currently in beta. Parameters and response shapes may change in future releases.

The Crawl API is only available on Cloud plans. Contact us for more information.

Asynchronously crawl a website and scrape every discovered page. Submit a starting URL and receive a crawl ID you can poll for status and results. Configure crawl depth, link-following rules, path filters, scrape output formats, and optional webhook notifications. Each scraped page is returned as structured, LLM-ready data.

Endpoints

  • Start a crawl: POST /crawl
  • Get crawl status and results: GET /crawl/{id}
  • List all crawl jobs: GET /crawl
  • Cancel a crawl: DELETE /crawl/{id}
  • Auth: token query parameter (?token=)
  • Content-Type: application/json
  • Response: application/json

Quickstart

curl --request POST \
  --url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://www.browserless.io"
  }'

Response

{
  "success": true,
  "id": "crawl_abc123def456",
  "url": "https://production-sfo.browserless.io/crawl/crawl_abc123def456"
}

The url field is a status-check URL — use it to poll for results via GET /crawl/{id}.

Polling for results

Once you have a crawl ID, poll GET /crawl/{id} to check progress and retrieve scraped pages. Results are paginated — use the next URL to fetch additional pages.

Query parameters

Parameter | Type | Default | Description
--- | --- | --- | ---
token | string | | Your API token (required).
skip | number | 0 | Number of pages to skip for pagination (non-negative integer).

curl --request GET \
  --url 'https://production-sfo.browserless.io/crawl/crawl_abc123def456?token=YOUR_API_TOKEN_HERE'

Response

{
  "status": "completed",
  "total": 15,
  "completed": 15,
  "failed": 0,
  "expiresAt": "2025-07-01T12:00:00.000Z",
  "next": "https://production-sfo.browserless.io/crawl/crawl_abc123def456?skip=10",
  "data": [
    {
      "status": "completed",
      "contentUrl": "https://crawl-artifacts.s3.us-east-1.amazonaws.com/crawls/crawl_abc123def456/page_0a1b2c3d4e5f.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&...",
      "metadata": {
        "title": "Browserless - Headless Browser Automation",
        "description": "Headless browser automation, without the hosting headaches.",
        "language": "en",
        "scrapedAt": "2025-06-30T10:00:00.000Z",
        "sourceURL": "https://www.browserless.io",
        "statusCode": 200,
        "error": null
      }
    }
    // ...more pages
  ]
}
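The start and status requests above combine naturally into a polling loop. Here is a minimal sketch in Python using only the standard library; the function names, poll interval, and retry cap are illustrative choices, and the fetch callable is injectable so the loop can be exercised without a live crawl:

```python
import json
import time
import urllib.request

BASE = "https://production-sfo.browserless.io"

def fetch_status(crawl_id: str, token: str, skip: int = 0) -> dict:
    """GET /crawl/{id}: fetch one page of crawl status and results."""
    url = f"{BASE}/crawl/{crawl_id}?token={token}&skip={skip}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def poll_until_done(crawl_id, token, fetch=fetch_status, interval=2.0, max_polls=150):
    """Poll until the crawl reaches a terminal state; return the final body."""
    terminal = {"completed", "failed", "cancelled"}
    for _ in range(max_polls):
        body = fetch(crawl_id, token)
        if body["status"] in terminal:
            return body
        time.sleep(interval)
    raise TimeoutError(f"crawl {crawl_id} not finished after {max_polls} polls")
```

Injecting `fetch` also makes it easy to add logging or rate limiting without touching the loop itself.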

Depth and scope

Control how deep the crawler follows links and whether it stays within the original domain.

curl --request POST \
  --url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://www.browserless.io",
    "maxDepth": 3,
    "limit": 50,
    "allowSubdomains": true,
    "allowExternalLinks": false
  }'

Sitemap control

The sitemap parameter controls how the crawler uses XML sitemaps for URL discovery:

Mode | Description
--- | ---
"auto" | (default) Attempts to use the sitemap if available; falls back to link extraction.
"force" | Only uses the sitemap for URL discovery. Fails if no sitemap is found.
"skip" | Ignores the sitemap entirely. Only discovers URLs by following on-page links.

curl --request POST \
  --url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://www.browserless.io",
    "sitemap": "force"
  }'

Path filtering

Use includePaths and excludePaths to control which URL paths the crawler visits. Both accept arrays of regex patterns matched against the URL path.

curl --request POST \
  --url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://www.browserless.io",
    "includePaths": ["^/blog"],
    "excludePaths": ["^/blog/draft"]
  }'
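The filter semantics can be mirrored client-side to predict which URLs a crawl will visit. Below is a sketch under the assumption that a path must match at least one includePaths pattern (when that list is non-empty) and no excludePaths pattern, with patterns applied as unanchored regex searches against the path component only; the server's exact rules may differ:

```python
import re
from urllib.parse import urlparse

def path_allowed(url, include_paths=(), exclude_paths=()):
    """Apply includePaths/excludePaths regex filters to a URL's path.

    Assumed semantics: with a non-empty include list, the path must match
    at least one include pattern; any exclude match rejects the path.
    """
    path = urlparse(url).path or "/"
    if include_paths and not any(re.search(p, path) for p in include_paths):
        return False
    return not any(re.search(p, path) for p in exclude_paths)
```

With the request above, /blog/post-1 passes, /blog/draft-2 is excluded, and /pricing never matches the include list.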

Scrape options

Control how each crawled page is scraped with the scrapeOptions object. Choose output formats, filter content, and set per-page timeouts.

curl --request POST \
  --url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://www.browserless.io",
    "limit": 10,
    "scrapeOptions": {
      "formats": ["markdown", "html"],
      "onlyMainContent": true,
      "excludeTags": ["nav", "footer"]
    }
  }'

Webhook notifications

Receive real-time notifications as pages are scraped or when the crawl completes or fails. Provide an HTTPS URL and choose which events to subscribe to.

curl --request POST \
  --url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://www.browserless.io",
    "webhook": {
      "url": "https://your-server.com/webhook",
      "events": ["page", "completed", "failed"]
    }
  }'
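On the receiving side, each delivery can be routed by its event type. In the sketch below, the JSON payload shape (an "event" field plus optional "metadata" and "error" fields) is an assumption for illustration; only the event names "page", "completed", and "failed" come from this document:

```python
import json

def handle_webhook_event(raw_body: bytes) -> str:
    """Route a crawl webhook delivery by event type.

    The payload shape (an "event" field plus optional "metadata"/"error")
    is assumed for illustration; only the event names are documented.
    """
    event = json.loads(raw_body)
    kind = event.get("event")
    if kind == "page":
        source = event.get("metadata", {}).get("sourceURL", "?")
        return f"page scraped: {source}"
    if kind == "completed":
        return "crawl finished"
    if kind == "failed":
        return f"crawl failed: {event.get('error', 'unknown error')}"
    return f"ignored event: {kind!r}"
```

A production receiver would also verify the request actually came from the crawl service before acting on it.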

Cancelling a crawl

Cancel a running crawl by sending a DELETE request with the crawl ID. Pages already scraped remain available.

curl --request DELETE \
  --url 'https://production-sfo.browserless.io/crawl/crawl_abc123def456?token=YOUR_API_TOKEN_HERE'

Success response (200):

{
  "status": "cancelled"
}

If the crawl is already in a terminal state (completed, failed, or cancelled), the API returns a 409 Conflict:

{
  "id": "crawl_abc123def456",
  "status": "completed",
  "message": "Crawl is already completed"
}
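Because a DELETE against an already-finished crawl returns 409 rather than a failure worth aborting on, a client can normalize both outcomes. A sketch in Python; the response-interpretation step is split into its own function so it can be exercised without a network call, and the helper names are illustrative:

```python
import json
import urllib.error
import urllib.request

BASE = "https://production-sfo.browserless.io"

def interpret_cancel(status_code: int, body: dict) -> str:
    """Map a DELETE /crawl/{id} response to a human-readable outcome."""
    if status_code == 200:
        return "cancelled"
    if status_code == 409:
        # Terminal crawls (completed/failed/cancelled) cannot be cancelled.
        return f"already {body.get('status', 'terminal')}"
    raise RuntimeError(f"unexpected cancel response: {status_code}")

def cancel_crawl(crawl_id: str, token: str) -> str:
    req = urllib.request.Request(
        f"{BASE}/crawl/{crawl_id}?token={token}", method="DELETE"
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return interpret_cancel(resp.status, json.load(resp))
    except urllib.error.HTTPError as err:
        # HTTPError is file-like, so the 409 JSON body can be read from it.
        return interpret_cancel(err.code, json.load(err))
```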

Listing all crawls

List all crawl jobs for your account. Results are paginated — use nextCursor to fetch the next page.

Query parameters

Parameter | Type | Default | Description
--- | --- | --- | ---
token | string | | Your API token (required).
limit | number | 20 | Results per page (1–100).
cursor | string | | Opaque pagination cursor from nextCursor in a previous response.
status | string | | Filter by status: "in-progress", "completed", "failed", or "cancelled".

curl --request GET \
  --url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE'

Response

{
  "crawls": [
    {
      "id": "crawl_abc123def456",
      "url": "https://www.browserless.io",
      "status": "completed",
      "total": 15,
      "completed": 15,
      "createdAt": "2025-06-30T09:00:00.000Z",
      "completedAt": "2025-06-30T09:05:00.000Z"
    },
    {
      "id": "crawl_def456abc789",
      "url": "https://docs.browserless.io",
      "status": "in-progress",
      "total": 50,
      "completed": 23,
      "createdAt": "2025-06-30T10:00:00.000Z",
      "completedAt": null
    }
  ],
  "nextCursor": "eyJza2lwIjoxMH0"
}
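Each nextCursor value feeds the next request's cursor parameter. The pagination loop can be sketched as follows; the fetch callable performs GET /crawl and returns the parsed JSON body, and is injected here so the loop is testable without a live account:

```python
def list_all_crawls(token, fetch, status=None, limit=20):
    """Collect every crawl job by following nextCursor until it is null."""
    crawls, cursor = [], None
    while True:
        params = {"token": token, "limit": limit}
        if cursor:
            params["cursor"] = cursor
        if status:
            params["status"] = status  # e.g. "completed"
        page = fetch(params)           # performs GET /crawl with these params
        crawls.extend(page["crawls"])
        cursor = page.get("nextCursor")
        if not cursor:
            return crawls
```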

Request body

POST /crawl

Field | Type | Required | Default | Description
--- | --- | --- | --- | ---
url | string | Yes | | The URL to crawl. Must be http:// or https://.
limit | number | No | 100 | Maximum number of pages to crawl (min 1, clamped to your plan's limit).
maxDepth | number | No | 5 | Maximum link-follow depth from the root URL (0–20).
maxRetries | number | No | 1 | Number of retry attempts per failed page (0–5).
allowExternalLinks | boolean | No | false | Whether to follow links to external domains.
allowSubdomains | boolean | No | false | Whether to follow links to subdomains of the root URL.
sitemap | string | No | "auto" | Sitemap handling strategy: "auto", "force", or "skip".
includePaths | string[] | No | [] | Regex patterns for URL paths to include.
excludePaths | string[] | No | [] | Regex patterns for URL paths to exclude.
delay | number | No | 200 | Delay between requests in milliseconds (0–10,000).
scrapeOptions | object | No | | Options controlling how each page is scraped. See below.
webhook | object | No | | Webhook configuration for crawl event notifications. See below.

scrapeOptions

Field | Type | Required | Default | Description
--- | --- | --- | --- | ---
formats | string[] | No | ["markdown"] | Output formats: "markdown", "html", "rawText".
onlyMainContent | boolean | No | true | Whether to extract only the main content of the page.
includeTags | string[] | No | [] | HTML tag selectors to include.
excludeTags | string[] | No | [] | HTML tag selectors to exclude.
waitFor | number | No | 0 | Time in milliseconds to wait after page load before scraping (0–30,000).
headers | object | No | | Custom HTTP headers to send with each request. The following headers are blocked and will return a 400 error: host, authorization, proxy-authorization, cookie, set-cookie, x-forwarded-for, x-real-ip, forwarded.
timeout | number | No | 30000 | Navigation timeout in milliseconds (1,000–180,000). Defaults to your server's configured timeout (30,000 ms on cloud).
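Because a blocked header in scrapeOptions.headers fails the whole request with a 400, it can be worth pre-checking headers client-side. A small sketch of that check; the server remains the authoritative validator:

```python
# Headers the API rejects in scrapeOptions.headers (from the table above).
BLOCKED_HEADERS = {
    "host", "authorization", "proxy-authorization", "cookie", "set-cookie",
    "x-forwarded-for", "x-real-ip", "forwarded",
}

def blocked_headers_in(headers: dict) -> list:
    """Return any blocked header names present, matched case-insensitively."""
    return sorted(h for h in headers if h.lower() in BLOCKED_HEADERS)
```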

webhook

Field | Type | Required | Default | Description
--- | --- | --- | --- | ---
url | string | Yes | | The HTTPS URL to send webhook events to.
events | string[] | No | ["completed"] | Which events to send: "page", "completed", "failed".

Response fields

POST /crawl (start)

Field | Type | Description
--- | --- | ---
success | boolean | Whether the crawl was started successfully.
id | string | The unique crawl job ID. Use this to poll for status and results.
url | string | Status-check URL for polling results via GET /crawl/{id}.

GET /crawl/{id} (status and results)

Field | Type | Description
--- | --- | ---
status | string | Crawl status: "in-progress", "completed", "failed", or "cancelled".
total | number | Total number of pages discovered.
completed | number | Number of pages successfully scraped.
failed | number | Number of pages that failed to scrape.
expiresAt | string \| null | ISO 8601 timestamp when the crawl results expire (24 hours after completion). null while the crawl is still in progress.
next | string \| null | URL to fetch the next page of results (uses the skip offset parameter). null when all results have been returned.
data | CrawlPageResponse[] | Array of scraped page results. See below.

CrawlPageResponse

Field | Type | Description
--- | --- | ---
status | string | Page status: "queued", "in-progress", "completed", "failed", or "cancelled".
contentUrl | string \| null | Pre-signed S3 URL to fetch the full scraped content for this page. Expires after 1 hour. null if the page has not yet completed.
metadata.title | string \| null | Page title. null if not extracted.
metadata.description | string \| null | Page meta description. null if not extracted.
metadata.language | string \| null | Detected page language. null if not detected.
metadata.scrapedAt | string \| null | ISO 8601 timestamp when the page was scraped. null if not yet scraped.
metadata.sourceURL | string | The original URL that was scraped.
metadata.statusCode | number \| null | HTTP status code of the page response. null if not yet scraped.
metadata.error | string \| null | Error message if the page failed to scrape. null on success.

GET /crawl (list all)

Field | Type | Description
--- | --- | ---
crawls | CrawlListItem[] | Array of crawl jobs.
nextCursor | string \| null | Cursor for fetching the next page of results. null when there are no more results.

CrawlListItem

Field | Type | Description
--- | --- | ---
id | string | The crawl job ID.
url | string | The root URL being crawled.
status | string | Crawl status: "in-progress", "completed", "failed", or "cancelled".
total | number | Total pages discovered.
completed | number | Pages successfully scraped.
createdAt | string | ISO 8601 timestamp when the crawl was created.
completedAt | string \| null | ISO 8601 timestamp when the crawl finished. null if still running.

Error responses

All crawl endpoints may return the following error responses:

Status | Description
--- | ---
400 Bad Request | Invalid parameters, unrecognized query params, invalid regex in includePaths/excludePaths, or blocked headers in scrapeOptions.headers.
401 Unauthorized | Missing or invalid API token.
404 Not Found | Crawl ID does not exist or belongs to another token.
409 Conflict | (DELETE only) The crawl is already in a terminal state (completed, failed, or cancelled).
429 Too Many Requests | Concurrent crawl limit reached for your plan.
503 Service Unavailable | Crawl service is not running or temporarily degraded.
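A common client-side convention (not mandated by the API) is to retry only the transient statuses above, 429 and 503, with exponential backoff, while treating 4xx client errors as permanent:

```python
# Transient statuses from the error table; 4xx client errors are permanent.
RETRYABLE = {429, 503}

def should_retry(status_code: int) -> bool:
    """Whether a failed crawl request is worth retrying."""
    return status_code in RETRYABLE

def backoff_delays(attempts: int = 4, base: float = 1.0) -> list:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(attempts)]
```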