
Crawl API

BETA

The Crawl API is currently in beta. Parameters and response shapes may change in future releases.

The Crawl API is only available on Cloud plans. Contact us for more information.

Asynchronously crawl a website and scrape every discovered page. Submit a starting URL and receive a crawl ID you can poll for status and results. Configure crawl depth, link-following rules, path filters, scrape output formats, and optional webhook notifications. Each scraped page is returned as structured, LLM-ready data.

Endpoints

  • Start a crawl: POST /crawl
  • Get crawl status and results: GET /crawl/{id}
  • List all crawl jobs: GET /crawl
  • Cancel a crawl: DELETE /crawl/{id}
  • Auth: token query parameter (?token=)
  • Content-Type: application/json
  • Response: application/json

Quickstart

curl --request POST \
  --url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://www.browserless.io"
  }'

Response

{
  "success": true,
  "id": "crawl_abc123def456",
  "url": "https://production-sfo.browserless.io/crawl/crawl_abc123def456"
}

The url field is a status-check URL — use it to poll for results via GET /crawl/{id}.

Polling for results

Once you have a crawl ID, poll GET /crawl/{id} to check progress and retrieve scraped pages. Results are paginated — use the next URL to fetch additional pages.

Query parameters

Parameter | Type | Default | Description
--- | --- | --- | ---
token | string | | Your API token (required).
skip | number | 0 | Number of pages to skip for pagination (non-negative integer).

curl --request GET \
  --url 'https://production-sfo.browserless.io/crawl/crawl_abc123def456?token=YOUR_API_TOKEN_HERE'

Response

{
  "status": "completed",
  "total": 15,
  "completed": 15,
  "failed": 0,
  "expiresAt": "2025-07-01T12:00:00.000Z",
  "next": "https://production-sfo.browserless.io/crawl/crawl_abc123def456?skip=10",
  "data": [
    {
      "status": "completed",
      "contentUrl": "https://crawl-artifacts.s3.us-east-1.amazonaws.com/crawls/crawl_abc123def456/page_0a1b2c3d4e5f.json?X-Amz-Algorithm=AWS4-HMAC-SHA256&...",
      "metadata": {
        "title": "Browserless - Headless Browser Automation",
        "description": "Headless browser automation, without the hosting headaches.",
        "language": "en",
        "scrapedAt": "2025-06-30T10:00:00.000Z",
        "sourceURL": "https://www.browserless.io",
        "statusCode": 200,
        "error": null
      }
    }
    // ...more pages
  ]
}
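The start and status requests above combine naturally into a polling loop. Here is a minimal sketch in Python using only the standard library; the function names, poll interval, and retry cap are illustrative choices, and the fetch callable is injectable so the loop can be exercised without a live crawl:

```python
import json
import time
import urllib.request

BASE = "https://production-sfo.browserless.io"

def fetch_status(crawl_id: str, token: str, skip: int = 0) -> dict:
    """GET /crawl/{id}: fetch one page of crawl status and results."""
    url = f"{BASE}/crawl/{crawl_id}?token={token}&skip={skip}"
    with urllib.request.urlopen(url) as resp:
        return json.load(resp)

def poll_until_done(crawl_id, token, fetch=fetch_status, interval=2.0, max_polls=150):
    """Poll until the crawl reaches a terminal state; return the final body."""
    terminal = {"completed", "failed", "cancelled"}
    for _ in range(max_polls):
        body = fetch(crawl_id, token)
        if body["status"] in terminal:
            return body
        time.sleep(interval)
    raise TimeoutError(f"crawl {crawl_id} not finished after {max_polls} polls")
```

Injecting `fetch` also makes it easy to add logging or rate limiting without touching the loop itself.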

Depth and scope

Control how deep the crawler follows links and whether it stays within the original domain.

curl --request POST \
  --url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://www.browserless.io",
    "maxDepth": 3,
    "limit": 50,
    "allowSubdomains": true,
    "allowExternalLinks": false
  }'

Sitemap control

The sitemap parameter controls how the crawler uses XML sitemaps for URL discovery:

Mode | Description
--- | ---
"auto" | (default) Attempts to use the sitemap if available; falls back to link extraction.
"force" | Only uses the sitemap for URL discovery. Fails if no sitemap is found.
"skip" | Ignores the sitemap entirely. Only discovers URLs by following on-page links.

curl --request POST \
  --url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://www.browserless.io",
    "sitemap": "force"
  }'

Path filtering

Use includePaths and excludePaths to control which URL paths the crawler visits. Both accept arrays of regex patterns matched against the URL path.

curl --request POST \
  --url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://www.browserless.io",
    "includePaths": ["^/blog"],
    "excludePaths": ["^/blog/draft"]
  }'
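The filter semantics can be mirrored client-side to predict which URLs a crawl will visit. Below is a sketch under the assumption that a path must match at least one includePaths pattern (when that list is non-empty) and no excludePaths pattern, with patterns applied as unanchored regex searches against the path component only; the server's exact rules may differ:

```python
import re
from urllib.parse import urlparse

def path_allowed(url, include_paths=(), exclude_paths=()):
    """Apply includePaths/excludePaths regex filters to a URL's path.

    Assumed semantics: with a non-empty include list, the path must match
    at least one include pattern; any exclude match rejects the path.
    """
    path = urlparse(url).path or "/"
    if include_paths and not any(re.search(p, path) for p in include_paths):
        return False
    return not any(re.search(p, path) for p in exclude_paths)
```

With the request above, /blog/post-1 passes, /blog/draft-2 is excluded, and /pricing never matches the include list.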

Scrape options

Control how each crawled page is scraped with the scrapeOptions object. Choose output formats, filter content, and set per-page timeouts.

curl --request POST \
  --url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://www.browserless.io",
    "limit": 10,
    "scrapeOptions": {
      "formats": ["markdown", "html"],
      "onlyMainContent": true,
      "excludeTags": ["nav", "footer"]
    }
  }'

Webhook notifications

Receive real-time notifications as pages are scraped or when the crawl completes or fails. Provide an HTTPS URL and choose which events to subscribe to.

curl --request POST \
  --url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://www.browserless.io",
    "webhook": {
      "url": "https://your-server.com/webhook",
      "events": ["page", "completed", "failed"]
    }
  }'
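On the receiving side, each delivery can be routed by its event type. In the sketch below, the JSON payload shape (an "event" field plus optional "metadata" and "error" fields) is an assumption for illustration; only the event names "page", "completed", and "failed" come from this document:

```python
import json

def handle_webhook_event(raw_body: bytes) -> str:
    """Route a crawl webhook delivery by event type.

    The payload shape (an "event" field plus optional "metadata"/"error")
    is assumed for illustration; only the event names are documented.
    """
    event = json.loads(raw_body)
    kind = event.get("event")
    if kind == "page":
        source = event.get("metadata", {}).get("sourceURL", "?")
        return f"page scraped: {source}"
    if kind == "completed":
        return "crawl finished"
    if kind == "failed":
        return f"crawl failed: {event.get('error', 'unknown error')}"
    return f"ignored event: {kind!r}"
```

A production receiver would also verify the request actually came from the crawl service before acting on it.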

Cancelling a crawl

Cancel a running crawl by sending a DELETE request with the crawl ID. Pages already scraped remain available.

curl --request DELETE \
  --url 'https://production-sfo.browserless.io/crawl/crawl_abc123def456?token=YOUR_API_TOKEN_HERE'

Success response (200):

{
  "status": "cancelled"
}

If the crawl is already in a terminal state (completed, failed, or cancelled), the API returns a 409 Conflict:

{
  "id": "crawl_abc123def456",
  "status": "completed",
  "message": "Crawl is already completed"
}
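Because a DELETE against an already-finished crawl returns 409 rather than a failure worth aborting on, a client can normalize both outcomes. A sketch in Python; the response-interpretation step is split into its own function so it can be exercised without a network call, and the helper names are illustrative:

```python
import json
import urllib.error
import urllib.request

BASE = "https://production-sfo.browserless.io"

def interpret_cancel(status_code: int, body: dict) -> str:
    """Map a DELETE /crawl/{id} response to a human-readable outcome."""
    if status_code == 200:
        return "cancelled"
    if status_code == 409:
        # Terminal crawls (completed/failed/cancelled) cannot be cancelled.
        return f"already {body.get('status', 'terminal')}"
    raise RuntimeError(f"unexpected cancel response: {status_code}")

def cancel_crawl(crawl_id: str, token: str) -> str:
    req = urllib.request.Request(
        f"{BASE}/crawl/{crawl_id}?token={token}", method="DELETE"
    )
    try:
        with urllib.request.urlopen(req) as resp:
            return interpret_cancel(resp.status, json.load(resp))
    except urllib.error.HTTPError as err:
        # HTTPError is file-like, so the 409 JSON body can be read from it.
        return interpret_cancel(err.code, json.load(err))
```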

Listing all crawls

List all crawl jobs for your account. Results are paginated — use nextCursor to fetch the next page.

Query parameters

Parameter | Type | Default | Description
--- | --- | --- | ---
token | string | | Your API token (required).
limit | number | 20 | Results per page (1–100).
cursor | string | | Opaque pagination cursor from nextCursor in a previous response.
status | string | | Filter by status: "in-progress", "completed", "failed", or "cancelled".

curl --request GET \
  --url 'https://production-sfo.browserless.io/crawl?token=YOUR_API_TOKEN_HERE'

Response

{
  "crawls": [
    {
      "id": "crawl_abc123def456",
      "url": "https://www.browserless.io",
      "status": "completed",
      "total": 15,
      "completed": 15,
      "createdAt": "2025-06-30T09:00:00.000Z",
      "completedAt": "2025-06-30T09:05:00.000Z"
    },
    {
      "id": "crawl_def456abc789",
      "url": "https://docs.browserless.io",
      "status": "in-progress",
      "total": 50,
      "completed": 23,
      "createdAt": "2025-06-30T10:00:00.000Z",
      "completedAt": null
    }
  ],
  "nextCursor": "eyJza2lwIjoxMH0"
}
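Each nextCursor value feeds the next request's cursor parameter. The pagination loop can be sketched as follows; the fetch callable performs GET /crawl and returns the parsed JSON body, and is injected here so the loop is testable without a live account:

```python
def list_all_crawls(token, fetch, status=None, limit=20):
    """Collect every crawl job by following nextCursor until it is null."""
    crawls, cursor = [], None
    while True:
        params = {"token": token, "limit": limit}
        if cursor:
            params["cursor"] = cursor
        if status:
            params["status"] = status  # e.g. "completed"
        page = fetch(params)           # performs GET /crawl with these params
        crawls.extend(page["crawls"])
        cursor = page.get("nextCursor")
        if not cursor:
            return crawls
```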

Request body

POST /crawl

Field | Type | Required | Default | Description
--- | --- | --- | --- | ---
url | string | Yes | | The URL to crawl. Must be http:// or https://.
limit | number | No | 100 | Maximum number of pages to crawl (min 1, clamped to your plan's limit).
maxDepth | number | No | 5 | Maximum link-follow depth from the root URL (0–20).
maxRetries | number | No | 1 | Number of retry attempts per failed page (0–5).
allowExternalLinks | boolean | No | false | Whether to follow links to external domains.
allowSubdomains | boolean | No | false | Whether to follow links to subdomains of the root URL.
sitemap | string | No | "auto" | Sitemap handling strategy: "auto", "force", or "skip".
includePaths | string[] | No | [] | Regex patterns for URL paths to include.
excludePaths | string[] | No | [] | Regex patterns for URL paths to exclude.
delay | number | No | 200 | Delay between requests in milliseconds (0–10,000).
scrapeOptions | object | No | | Options controlling how each page is scraped. See below.
webhook | object | No | | Webhook configuration for crawl event notifications. See below.

scrapeOptions

Field | Type | Required | Default | Description
--- | --- | --- | --- | ---
formats | string[] | No | ["markdown"] | Output formats: "markdown", "html", "rawText".
onlyMainContent | boolean | No | true | Whether to extract only the main content of the page.
includeTags | string[] | No | [] | HTML tag selectors to include.
excludeTags | string[] | No | [] | HTML tag selectors to exclude.
waitFor | number | No | 0 | Time in milliseconds to wait after page load before scraping (0–30,000).
headers | object | No | | Custom HTTP headers to send with each request. The following headers are blocked and will return a 400 error: host, authorization, proxy-authorization, cookie, set-cookie, x-forwarded-for, x-real-ip, forwarded.
timeout | number | No | 30000 | Navigation timeout in milliseconds (1,000–180,000). Defaults to your server's configured timeout (30,000 ms on cloud).
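Because a blocked header in scrapeOptions.headers fails the whole request with a 400, it can be worth pre-checking headers client-side. A small sketch of that check; the server remains the authoritative validator:

```python
# Headers the API rejects in scrapeOptions.headers (from the table above).
BLOCKED_HEADERS = {
    "host", "authorization", "proxy-authorization", "cookie", "set-cookie",
    "x-forwarded-for", "x-real-ip", "forwarded",
}

def blocked_headers_in(headers: dict) -> list:
    """Return any blocked header names present, matched case-insensitively."""
    return sorted(h for h in headers if h.lower() in BLOCKED_HEADERS)
```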

webhook

Field | Type | Required | Default | Description
--- | --- | --- | --- | ---
url | string | Yes | | The HTTPS URL to send webhook events to.
events | string[] | No | ["completed"] | Which events to send: "page", "completed", "failed".

Response fields

POST /crawl (start)

Field | Type | Description
--- | --- | ---
success | boolean | Whether the crawl was started successfully.
id | string | The unique crawl job ID. Use this to poll for status and results.
url | string | Status-check URL for polling results via GET /crawl/{id}.

GET /crawl/{id} (status and results)

Field | Type | Description
--- | --- | ---
status | string | Crawl status: "in-progress", "completed", "failed", or "cancelled".
total | number | Total number of pages discovered.
completed | number | Number of pages successfully scraped.
failed | number | Number of pages that failed to scrape.
expiresAt | string \| null | ISO 8601 timestamp when the crawl results expire (24 hours after completion). null while the crawl is still in progress.
next | string \| null | URL to fetch the next page of results (uses the skip offset parameter). null when all results have been returned.
data | CrawlPageResponse[] | Array of scraped page results. See below.

CrawlPageResponse

Field | Type | Description
--- | --- | ---
status | string | Page status: "queued", "in-progress", "completed", "failed", or "cancelled".
contentUrl | string \| null | Pre-signed S3 URL to fetch the full scraped content for this page. Expires after 1 hour. null if the page has not yet completed.
metadata.title | string \| null | Page title. null if not extracted.
metadata.description | string \| null | Page meta description. null if not extracted.
metadata.language | string \| null | Detected page language. null if not detected.
metadata.scrapedAt | string \| null | ISO 8601 timestamp when the page was scraped. null if not yet scraped.
metadata.sourceURL | string | The original URL that was scraped.
metadata.statusCode | number \| null | HTTP status code of the page response. null if not yet scraped.
metadata.error | string \| null | Error message if the page failed to scrape. null on success.

GET /crawl (list all)

Field | Type | Description
--- | --- | ---
crawls | CrawlListItem[] | Array of crawl jobs.
nextCursor | string \| null | Cursor for fetching the next page of results. null when there are no more results.

CrawlListItem

Field | Type | Description
--- | --- | ---
id | string | The crawl job ID.
url | string | The root URL being crawled.
status | string | Crawl status: "in-progress", "completed", "failed", or "cancelled".
total | number | Total pages discovered.
completed | number | Pages successfully scraped.
createdAt | string | ISO 8601 timestamp when the crawl was created.
completedAt | string \| null | ISO 8601 timestamp when the crawl finished. null if still running.

Error responses

All crawl endpoints may return the following error responses:

Status | Description
--- | ---
400 Bad Request | Invalid parameters, unrecognized query params, invalid regex in includePaths/excludePaths, or blocked headers in scrapeOptions.headers.
401 Unauthorized | Missing or invalid API token.
404 Not Found | Crawl ID does not exist or belongs to another token.
409 Conflict | (DELETE only) The crawl is already in a terminal state (completed, failed, or cancelled).
429 Too Many Requests | Concurrent crawl limit reached for your plan.
503 Service Unavailable | Crawl service is not running or temporarily degraded.
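A common client-side convention (not mandated by the API) is to retry only the transient statuses above, 429 and 503, with exponential backoff, while treating 4xx client errors as permanent:

```python
# Transient statuses from the error table; 4xx client errors are permanent.
RETRYABLE = {429, 503}

def should_retry(status_code: int) -> bool:
    """Whether a failed crawl request is worth retrying."""
    return status_code in RETRYABLE

def backoff_delays(attempts: int = 4, base: float = 1.0) -> list:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ..."""
    return [base * (2 ** i) for i in range(attempts)]
```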