Crawl API
The Crawl API is currently in beta. Parameters and response shapes may change in future releases.
The Crawl API is only available for Cloud plans. Contact us for more information.
Asynchronously crawl a website and scrape every discovered page. Submit a starting URL and receive a crawl ID you can poll for status and results. Configure crawl depth, link-following rules, path filters, scrape output formats, and optional webhook notifications. Each scraped page is returned as structured, LLM-ready data.
Endpoints
- Start a crawl: `POST /crawl`
- Get crawl status and results: `GET /crawl/{id}`
- List all crawl jobs: `GET /crawl`
- Cancel a crawl: `DELETE /crawl/{id}`
- Auth: `token` query parameter (`?token=`)
- Content-Type: `application/json`
- Response: `application/json`
Quickstart
- cURL
- JavaScript
- Python
curl --request POST \
--url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
--header 'Content-Type: application/json' \
--data '{
"url": "https://www.browserless.io"
}'
const TOKEN = "YOUR_API_TOKEN_HERE";
const url = `https://production-sfo.browserless.io/crawl?token=${TOKEN}`;
const headers = {
"Content-Type": "application/json"
};
const data = {
url: "https://www.browserless.io"
};
const startCrawl = async () => {
const response = await fetch(url, {
method: 'POST',
headers: headers,
body: JSON.stringify(data)
});
const result = await response.json();
console.log(result);
};
startCrawl();
import requests
TOKEN = "YOUR_API_TOKEN_HERE"
url = f"https://production-sfo.browserless.io/crawl?token={TOKEN}"
headers = {
"Content-Type": "application/json"
}
data = {
"url": "https://www.browserless.io"
}
response = requests.post(url, headers=headers, json=data)
result = response.json()
print(result)
Response
{
"success": true,
"id": "crawl_abc123def456",
"url": "https://production-sfo.browserless.io/crawl/crawl_abc123def456"
}
The url field is a status-check URL — use it to poll for results via GET /crawl/{id}.
Polling for results
Once you have a crawl ID, poll GET /crawl/{id} to check progress and retrieve scraped pages. Results are paginated — use the next URL to fetch additional pages.
Query parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| token | string | — | Your API token (required). |
| skip | number | 0 | Number of pages to skip for pagination (non-negative integer). |
- cURL
- JavaScript
- Python
curl --request GET \
--url 'https://production-sfo.browserless.io/crawl/crawl_abc123def456?token=YOUR_API_TOKEN_HERE'
const TOKEN = "YOUR_API_TOKEN_HERE";
const crawlId = "crawl_abc123def456";
const url = `https://production-sfo.browserless.io/crawl/${crawlId}?token=${TOKEN}`;
const getCrawlStatus = async () => {
const response = await fetch(url);
const result = await response.json();
console.log(result);
};
getCrawlStatus();
import requests
TOKEN = "YOUR_API_TOKEN_HERE"
crawl_id = "crawl_abc123def456"
url = f"https://production-sfo.browserless.io/crawl/{crawl_id}?token={TOKEN}"
response = requests.get(url)
result = response.json()
print(result)
Response
{
"status": "completed",
"total": 15,
"completed": 15,
"failed": 0,
"expiresAt": "2025-07-01T12:00:00.000Z",
"next": "https://production-sfo.browserless.io/crawl/crawl_abc123def456?skip=10",
"data": [
{
"status": "completed",
"contentUrl": "https://crawl-artifacts.s3.us-east-1.amazonaws.com/crawls/crawl_abc123def456/page_0a1b2c3d4e5f.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&...",
"metadata": {
"title": "Browserless - Headless Browser Automation",
"description": "Headless browser automation, without the hosting headaches.",
"language": "en",
"scrapedAt": "2025-06-30T10:00:00.000Z",
"sourceURL": "https://www.browserless.io",
"statusCode": 200,
"error": null
}
}
// ...more pages
]
}
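Putting the two steps together, a client typically starts a crawl, then polls until the status is terminal and follows the paginated `next` URLs. The sketch below uses only the Python standard library (the examples above use requests); whether the `next` URL already carries the token isn't shown above, so this sketch re-appends it as an assumption:

```python
import json
import time
import urllib.request

BASE = "https://production-sfo.browserless.io"
TERMINAL_STATES = {"completed", "failed", "cancelled"}

def is_terminal(status):
    """A crawl is finished once it reaches any terminal state."""
    return status in TERMINAL_STATES

def poll_crawl(crawl_id, token, interval=5.0):
    """Poll GET /crawl/{id} until the crawl finishes, then follow the
    paginated `next` URLs to collect every scraped page."""
    pages = []
    url = f"{BASE}/crawl/{crawl_id}?token={token}"
    while True:
        with urllib.request.urlopen(url) as resp:
            result = json.load(resp)
        if not is_terminal(result["status"]):
            time.sleep(interval)  # still in progress; wait and re-poll
            continue
        pages.extend(result.get("data", []))
        nxt = result.get("next")
        if nxt is None:
            return result["status"], pages
        # Assumption: re-append the token in case `next` omits it.
        url = nxt + ("&" if "?" in nxt else "?") + f"token={token}"
```

Each collected page carries a `contentUrl`; fetch that pre-signed URL promptly, since it expires after an hour.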
Depth and link control
Control how deep the crawler follows links and whether it stays within the original domain.
- cURL
- JavaScript
- Python
curl --request POST \
--url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
--header 'Content-Type: application/json' \
--data '{
"url": "https://www.browserless.io",
"maxDepth": 3,
"limit": 50,
"allowSubdomains": true,
"allowExternalLinks": false
}'
const TOKEN = "YOUR_API_TOKEN_HERE";
const url = `https://production-sfo.browserless.io/crawl?token=${TOKEN}`;
const headers = {
"Content-Type": "application/json"
};
const data = {
url: "https://www.browserless.io",
maxDepth: 3,
limit: 50,
allowSubdomains: true,
allowExternalLinks: false
};
const startCrawl = async () => {
const response = await fetch(url, {
method: 'POST',
headers: headers,
body: JSON.stringify(data)
});
const result = await response.json();
console.log(result);
};
startCrawl();
import requests
TOKEN = "YOUR_API_TOKEN_HERE"
url = f"https://production-sfo.browserless.io/crawl?token={TOKEN}"
headers = {
"Content-Type": "application/json"
}
data = {
"url": "https://www.browserless.io",
"maxDepth": 3,
"limit": 50,
"allowSubdomains": True,
"allowExternalLinks": False
}
response = requests.post(url, headers=headers, json=data)
result = response.json()
print(result)
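The allowSubdomains and allowExternalLinks flags together decide which discovered links stay in scope. The decision can be pictured roughly like this; the helper is hypothetical and assumes simple host-suffix matching, while the service's exact rules (e.g. www-prefix normalization) may differ:

```python
from urllib.parse import urlparse

def link_allowed(link, root_url, allow_subdomains=False, allow_external=False):
    """Sketch of crawl link scoping. Hypothetical helper for
    illustration; the actual service logic may differ."""
    root_host = urlparse(root_url).hostname
    link_host = urlparse(link).hostname
    if link_host == root_host:
        return True  # same domain: always followed
    if allow_external:
        return True  # any external domain is allowed
    if allow_subdomains and link_host and root_host and \
            link_host.endswith("." + root_host):
        return True  # e.g. docs.example.com under example.com
    return False
```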
Sitemap control
The sitemap parameter controls how the crawler uses XML sitemaps for URL discovery:
| Mode | Description |
|---|---|
"auto" | (default) Attempts to use the sitemap if available, falls back to link extraction. |
"force" | Only uses the sitemap for URL discovery. Fails if no sitemap is found. |
"skip" | Ignores the sitemap entirely. Only discovers URLs by following on-page links. |
- cURL
- JavaScript
- Python
curl --request POST \
--url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
--header 'Content-Type: application/json' \
--data '{
"url": "https://www.browserless.io",
"sitemap": "force"
}'
const TOKEN = "YOUR_API_TOKEN_HERE";
const url = `https://production-sfo.browserless.io/crawl?token=${TOKEN}`;
const headers = {
"Content-Type": "application/json"
};
const data = {
url: "https://www.browserless.io",
sitemap: "force"
};
const startCrawl = async () => {
const response = await fetch(url, {
method: 'POST',
headers: headers,
body: JSON.stringify(data)
});
const result = await response.json();
console.log(result);
};
startCrawl();
import requests
TOKEN = "YOUR_API_TOKEN_HERE"
url = f"https://production-sfo.browserless.io/crawl?token={TOKEN}"
headers = {
"Content-Type": "application/json"
}
data = {
"url": "https://www.browserless.io",
"sitemap": "force"
}
response = requests.post(url, headers=headers, json=data)
result = response.json()
print(result)
Path filtering
Use includePaths and excludePaths to control which URL paths the crawler visits. Both accept arrays of regex patterns matched against the URL path.
- cURL
- JavaScript
- Python
curl --request POST \
--url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
--header 'Content-Type: application/json' \
--data '{
"url": "https://www.browserless.io",
"includePaths": ["^/blog"],
"excludePaths": ["^/blog/draft"]
}'
const TOKEN = "YOUR_API_TOKEN_HERE";
const url = `https://production-sfo.browserless.io/crawl?token=${TOKEN}`;
const headers = {
"Content-Type": "application/json"
};
const data = {
url: "https://www.browserless.io",
includePaths: ["^/blog"],
excludePaths: ["^/blog/draft"]
};
const startCrawl = async () => {
const response = await fetch(url, {
method: 'POST',
headers: headers,
body: JSON.stringify(data)
});
const result = await response.json();
console.log(result);
};
startCrawl();
import requests
TOKEN = "YOUR_API_TOKEN_HERE"
url = f"https://production-sfo.browserless.io/crawl?token={TOKEN}"
headers = {
"Content-Type": "application/json"
}
data = {
"url": "https://www.browserless.io",
"includePaths": ["^/blog"],
"excludePaths": ["^/blog/draft"]
}
response = requests.post(url, headers=headers, json=data)
result = response.json()
print(result)
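Since both lists hold regex patterns applied to the URL path, the filtering can be sketched as follows. This models the likely semantics (a path must match at least one include pattern when any are given, and no exclude pattern); the service's exact matching rules may differ:

```python
import re
from urllib.parse import urlparse

def path_passes(url, include_paths=(), exclude_paths=()):
    """Sketch of includePaths/excludePaths filtering: the URL path must
    match some include pattern (if any are given) and no exclude
    pattern. Illustrative only."""
    path = urlparse(url).path
    if include_paths and not any(re.search(p, path) for p in include_paths):
        return False
    return not any(re.search(p, path) for p in exclude_paths)

# With the patterns from the example above:
include = ["^/blog"]
exclude = ["^/blog/draft"]
```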
Scrape options
Control how each crawled page is scraped with the scrapeOptions object. Choose output formats, filter content, and set per-page timeouts.
- cURL
- JavaScript
- Python
curl --request POST \
--url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
--header 'Content-Type: application/json' \
--data '{
"url": "https://www.browserless.io",
"limit": 10,
"scrapeOptions": {
"formats": ["markdown", "html"],
"onlyMainContent": true,
"excludeTags": ["nav", "footer"]
}
}'
const TOKEN = "YOUR_API_TOKEN_HERE";
const url = `https://production-sfo.browserless.io/crawl?token=${TOKEN}`;
const headers = {
"Content-Type": "application/json"
};
const data = {
url: "https://www.browserless.io",
limit: 10,
scrapeOptions: {
formats: ["markdown", "html"],
onlyMainContent: true,
excludeTags: ["nav", "footer"]
}
};
const startCrawl = async () => {
const response = await fetch(url, {
method: 'POST',
headers: headers,
body: JSON.stringify(data)
});
const result = await response.json();
console.log(result);
};
startCrawl();
import requests
TOKEN = "YOUR_API_TOKEN_HERE"
url = f"https://production-sfo.browserless.io/crawl?token={TOKEN}"
headers = {
"Content-Type": "application/json"
}
data = {
"url": "https://www.browserless.io",
"limit": 10,
"scrapeOptions": {
"formats": ["markdown", "html"],
"onlyMainContent": True,
"excludeTags": ["nav", "footer"]
}
}
response = requests.post(url, headers=headers, json=data)
result = response.json()
print(result)
Webhook notifications
Receive real-time notifications as pages are scraped or when the crawl completes or fails. Provide an HTTPS URL and choose which events to subscribe to.
- cURL
- JavaScript
- Python
curl --request POST \
--url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
--header 'Content-Type: application/json' \
--data '{
"url": "https://www.browserless.io",
"webhook": {
"url": "https://your-server.com/webhook",
"events": ["page", "completed", "failed"]
}
}'
const TOKEN = "YOUR_API_TOKEN_HERE";
const url = `https://production-sfo.browserless.io/crawl?token=${TOKEN}`;
const headers = {
"Content-Type": "application/json"
};
const data = {
url: "https://www.browserless.io",
webhook: {
url: "https://your-server.com/webhook",
events: ["page", "completed", "failed"]
}
};
const startCrawl = async () => {
const response = await fetch(url, {
method: 'POST',
headers: headers,
body: JSON.stringify(data)
});
const result = await response.json();
console.log(result);
};
startCrawl();
import requests
TOKEN = "YOUR_API_TOKEN_HERE"
url = f"https://production-sfo.browserless.io/crawl?token={TOKEN}"
headers = {
"Content-Type": "application/json"
}
data = {
"url": "https://www.browserless.io",
"webhook": {
"url": "https://your-server.com/webhook",
"events": ["page", "completed", "failed"]
}
}
response = requests.post(url, headers=headers, json=data)
result = response.json()
print(result)
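On the receiving side, a handler mostly needs to route each delivery by its event type. The payload shape below (an `event` field naming the event) is an assumption for illustration, not documented here, so verify the actual delivery format before relying on it:

```python
import json

def handle_webhook(body):
    """Route a webhook delivery by event type. The `event` field is a
    hypothetical payload shape used for illustration only."""
    payload = json.loads(body)
    event = payload.get("event")
    if event == "page":
        return "page scraped"        # a single page finished
    if event == "completed":
        return "crawl finished"      # the whole crawl succeeded
    if event == "failed":
        return "crawl failed"        # the crawl ended in failure
    return "ignored"                 # unknown or unsubscribed event
```

In production this would sit behind the HTTPS endpoint you register as `webhook.url`, responding quickly with a 2xx status.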
Cancelling a crawl
Cancel a running crawl by sending a DELETE request with the crawl ID. Pages already scraped remain available.
A successful cancellation returns 200:
{
"status": "cancelled"
}
If the crawl is already in a terminal state (completed, failed, or cancelled), the API returns a 409 Conflict:
{
"id": "crawl_abc123def456",
"status": "completed",
"message": "Crawl is already completed"
}
- cURL
- JavaScript
- Python
curl --request DELETE \
--url 'https://production-sfo.browserless.io/crawl/crawl_abc123def456?token=YOUR_API_TOKEN_HERE'
const TOKEN = "YOUR_API_TOKEN_HERE";
const crawlId = "crawl_abc123def456";
const url = `https://production-sfo.browserless.io/crawl/${crawlId}?token=${TOKEN}`;
const cancelCrawl = async () => {
const response = await fetch(url, {
method: 'DELETE'
});
const result = await response.json();
console.log(result);
};
cancelCrawl();
import requests
TOKEN = "YOUR_API_TOKEN_HERE"
crawl_id = "crawl_abc123def456"
url = f"https://production-sfo.browserless.io/crawl/{crawl_id}?token={TOKEN}"
response = requests.delete(url)
result = response.json()
print(result)
Listing all crawls
List all crawl jobs for your account. Results are paginated — use nextCursor to fetch the next page.
Query parameters
| Parameter | Type | Default | Description |
|---|---|---|---|
| token | string | — | Your API token (required). |
| limit | number | 20 | Results per page (1–100). |
| cursor | string | — | Opaque pagination cursor from nextCursor in a previous response. |
| status | string | — | Filter by status: "in-progress", "completed", "failed", or "cancelled". |
- cURL
- JavaScript
- Python
curl --request GET \
--url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE'
const TOKEN = "YOUR_API_TOKEN_HERE";
const url = `https://production-sfo.browserless.io/crawl?token=${TOKEN}`;
const listCrawls = async () => {
const response = await fetch(url);
const result = await response.json();
console.log(result);
};
listCrawls();
import requests
TOKEN = "YOUR_API_TOKEN_HERE"
url = f"https://production-sfo.browserless.io/crawl?token={TOKEN}"
response = requests.get(url)
result = response.json()
print(result)
Response
{
"crawls": [
{
"id": "crawl_abc123def456",
"url": "https://www.browserless.io",
"status": "completed",
"total": 15,
"completed": 15,
"createdAt": "2025-06-30T09:00:00.000Z",
"completedAt": "2025-06-30T09:05:00.000Z"
},
{
"id": "crawl_def456abc789",
"url": "https://docs.browserless.io",
"status": "in-progress",
"total": 50,
"completed": 23,
"createdAt": "2025-06-30T10:00:00.000Z",
"completedAt": null
}
],
"nextCursor": "eyJza2lwIjoxMH0"
}
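Walking the full list follows the usual cursor pattern: request a page, read nextCursor, repeat until it is null. A stdlib-only sketch (the examples above use requests):

```python
import json
import urllib.request
from urllib.parse import urlencode

BASE = "https://production-sfo.browserless.io"

def list_url(token, cursor=None, limit=20, status=None):
    """Build the GET /crawl listing URL from its query parameters."""
    params = {"token": token, "limit": limit}
    if cursor:
        params["cursor"] = cursor
    if status:
        params["status"] = status
    return f"{BASE}/crawl?{urlencode(params)}"

def iter_crawls(token, **kwargs):
    """Yield every crawl job, following nextCursor until exhausted."""
    cursor = None
    while True:
        with urllib.request.urlopen(list_url(token, cursor=cursor, **kwargs)) as resp:
            page = json.load(resp)
        yield from page.get("crawls", [])
        cursor = page.get("nextCursor")
        if cursor is None:
            return
```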
Request body
POST /crawl
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | — | The URL to crawl. Must be http:// or https://. |
| limit | number | No | 100 | Maximum number of pages to crawl (min 1, clamped to your plan's limit). |
| maxDepth | number | No | 5 | Maximum link-follow depth from the root URL (0–20). |
| maxRetries | number | No | 1 | Number of retry attempts per failed page (0–5). |
| allowExternalLinks | boolean | No | false | Whether to follow links to external domains. |
| allowSubdomains | boolean | No | false | Whether to follow links to subdomains of the root URL. |
| sitemap | string | No | "auto" | Sitemap handling strategy: "auto", "force", or "skip". |
| includePaths | string[] | No | [] | Regex patterns for URL paths to include. |
| excludePaths | string[] | No | [] | Regex patterns for URL paths to exclude. |
| delay | number | No | 200 | Delay between requests in milliseconds (0–10,000). |
| scrapeOptions | object | No | — | Options controlling how each page is scraped. See below. |
| webhook | object | No | — | Webhook configuration for crawl event notifications. See below. |
scrapeOptions
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| formats | string[] | No | ["markdown"] | Output formats: "markdown", "html", "rawText". |
| onlyMainContent | boolean | No | true | Whether to extract only the main content of the page. |
| includeTags | string[] | No | [] | HTML tag selectors to include. |
| excludeTags | string[] | No | [] | HTML tag selectors to exclude. |
| waitFor | number | No | 0 | Time in milliseconds to wait after page load before scraping (0–30,000). |
| headers | object | No | — | Custom HTTP headers to send with each request. The following headers are blocked and will return a 400 error: host, authorization, proxy-authorization, cookie, set-cookie, x-forwarded-for, x-real-ip, forwarded. |
| timeout | number | No | 30000 | Navigation timeout in milliseconds (1,000–180,000). Defaults to your server's configured timeout (30,000ms for cloud). |
webhook
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | — | The HTTPS URL to send webhook events to. |
| events | string[] | No | ["completed"] | Which events to send: "page", "completed", "failed". |
Response fields
POST /crawl (start)
| Field | Type | Description |
|---|---|---|
| success | boolean | Whether the crawl was started successfully. |
| id | string | The unique crawl job ID. Use this to poll for status and results. |
| url | string | Status-check URL for polling results via GET /crawl/{id}. |
GET /crawl/{id} (status and results)
| Field | Type | Description |
|---|---|---|
| status | string | Crawl status: "in-progress", "completed", "failed", or "cancelled". |
| total | number | Total number of pages discovered. |
| completed | number | Number of pages successfully scraped. |
| failed | number | Number of pages that failed to scrape. |
| expiresAt | string \| null | ISO 8601 timestamp when the crawl results expire (24 hours after completion). null while the crawl is still in progress. |
| next | string \| null | URL to fetch the next page of results (uses the skip offset parameter). null when all results have been returned. |
| data | CrawlPageResponse[] | Array of scraped page results. See below. |
CrawlPageResponse
| Field | Type | Description |
|---|---|---|
| status | string | Page status: "queued", "in-progress", "completed", "failed", or "cancelled". |
| contentUrl | string \| null | Pre-signed S3 URL to fetch the full scraped content for this page. Expires after 1 hour. null if the page has not yet completed. |
| metadata.title | string \| null | Page title. null if not extracted. |
| metadata.description | string \| null | Page meta description. null if not extracted. |
| metadata.language | string \| null | Detected page language. null if not detected. |
| metadata.scrapedAt | string \| null | ISO 8601 timestamp when the page was scraped. null if not yet scraped. |
| metadata.sourceURL | string | The original URL that was scraped. |
| metadata.statusCode | number \| null | HTTP status code of the page response. null if not yet scraped. |
| metadata.error | string \| null | Error message if the page failed to scrape. null on success. |
GET /crawl (list all)
| Field | Type | Description |
|---|---|---|
| crawls | CrawlListItem[] | Array of crawl jobs. |
| nextCursor | string \| null | Cursor for fetching the next page of results. null when there are no more results. |
CrawlListItem
| Field | Type | Description |
|---|---|---|
| id | string | The crawl job ID. |
| url | string | The root URL being crawled. |
| status | string | Crawl status: "in-progress", "completed", "failed", or "cancelled". |
| total | number | Total pages discovered. |
| completed | number | Pages successfully scraped. |
| createdAt | string | ISO 8601 timestamp when the crawl was created. |
| completedAt | string \| null | ISO 8601 timestamp when the crawl finished. null if still running. |
Error responses
All crawl endpoints may return the following error responses:
| Status | Description |
|---|---|
| 400 Bad Request | Invalid parameters, unrecognized query params, invalid regex in includePaths/excludePaths, or blocked headers in scrapeOptions.headers. |
| 401 Unauthorized | Missing or invalid API token. |
| 404 Not Found | Crawl ID does not exist or belongs to another token. |
| 409 Conflict | (DELETE only) The crawl is already in a terminal state (completed, failed, or cancelled). |
| 429 Too Many Requests | Concurrent crawl limit reached for your plan. |
| 503 Service Unavailable | Crawl service is not running or temporarily degraded. |