Smart Scrape API

Intelligently scrape any URL using cascading strategies that automatically escalate from fast HTTP fetching to headless browsers and captcha solving as needed. Specify output formats to receive HTML, markdown, screenshots, PDFs, or extracted links, all in a single request.

Endpoint

  • Method: POST
  • Path: /smart-scrape
  • Auth: token query parameter (?token=)
  • Content-Type: application/json
  • Response: application/json

Quickstart

curl --request POST \
  --url 'https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://news.ycombinator.com/",
    "formats": ["html", "markdown", "links"]
  }'

Response

{
  "ok": true,
  "statusCode": 200,
  "content": "<html lang=\"en\" op=\"news\"><head><meta name=\"referrer\" content=\"origin\">...</html>",
  "contentType": "text/html; charset=utf-8",
  "headers": {
    "content-type": "text/html; charset=utf-8",
    "cache-control": "private; max-age=0"
  },
  "strategy": "http-fetch",
  "attempted": ["http-fetch"],
  "message": null,
  "screenshot": null,
  "pdf": null,
  "markdown": "# Hacker News\n\n[new](newest) | [past](front) | [comments](newcomments) | [ask](ask) | [show](show) | [jobs](jobs) | [submit](submit)\n\n1. [Motorola GrapheneOS devices will be bootloader unlockable/relockable](https://grapheneos.social/...)...",
  "links": [
    "https://news.ycombinator.com/news",
    "https://news.ycombinator.com/newest",
    "https://news.ycombinator.com/front"
  ]
}
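
For callers who prefer Python to curl, the quickstart request might be sketched as follows. It uses only the standard library; the function names build_request and smart_scrape are illustrative and not part of any official client, and the base URL is simply the one shown in the curl example.

```python
import json
import urllib.parse
import urllib.request

# Base URL copied from the curl quickstart; the token is passed as a query parameter.
BASE_URL = "https://production-sfo.browserless.io/smart-scrape"


def build_request(token: str, url: str, formats: list[str]) -> urllib.request.Request:
    """Build the POST request without sending it (hypothetical helper)."""
    query = urllib.parse.urlencode({"token": token})
    body = json.dumps({"url": url, "formats": formats}).encode("utf-8")
    return urllib.request.Request(
        f"{BASE_URL}?{query}",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def smart_scrape(token: str, url: str, formats: list[str]) -> dict:
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(build_request(token, url, formats)) as resp:
        return json.load(resp)


# Example call (performs a real network request, so it is left commented out):
# result = smart_scrape("YOUR_API_TOKEN_HERE", "https://news.ycombinator.com/",
#                       ["html", "markdown", "links"])
# print(result["strategy"], result["attempted"])
```

Separating build_request from smart_scrape keeps the request construction inspectable without touching the network.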

How it works

The Smart Scrape API uses a cascading strategy pipeline to fetch content in the most efficient way possible. It starts with the fastest, cheapest approach and automatically escalates to more powerful strategies only when needed:

  1. Fast HTTP fetch: Makes a lightweight HTTP request that mimics a real browser's network fingerprint. This handles the majority of static and server-rendered sites in under 2 seconds.

  2. Proxied HTTP fetch: If the initial request is blocked (e.g., by datacenter IP detection), the same request is retried through a residential proxy.

  3. Headless browser: If the page requires JavaScript rendering (single-page apps, client-rendered content), a full stealth browser is launched to render the page.

  4. Browser + captcha solving: If a captcha or bot detection challenge is encountered, the browser automatically detects and solves it (supports reCAPTCHA, hCaptcha, Cloudflare Turnstile, and others).

The pipeline stops as soon as a strategy succeeds. The strategy field in the response tells you which approach was used, and the attempted array shows the full sequence of strategies tried.
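
The stop-at-first-success behavior described above can be illustrated with a small local model. This is only a sketch of the control flow, not the service's implementation: run_cascade and the stand-in fetch functions are hypothetical, and the strategy names are the ones that appear in the example responses in this document.

```python
from typing import Callable, Optional

# A strategy is a (name, fetch) pair; fetch returns content or None when it fails.
Strategy = tuple[str, Callable[[str], Optional[str]]]


def run_cascade(url: str, strategies: list[Strategy]) -> dict:
    """Try each strategy in order and stop at the first one that returns content."""
    attempted: list[str] = []
    for name, fetch in strategies:
        attempted.append(name)
        content = fetch(url)
        if content is not None:
            return {"ok": True, "content": content,
                    "strategy": name, "attempted": attempted}
    # Nothing succeeded: report the last strategy tried.
    last = attempted[-1] if attempted else None
    return {"ok": False, "content": None,
            "strategy": last, "attempted": attempted}
```

With two failing strategies followed by a succeeding "browser" strategy, the result reports strategy "browser" and attempted ["http-fetch", "http-proxy", "browser"]; later strategies are never invoked.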

Request body

Field    Type      Required  Default   Description
url      string    Yes       -         The URL to scrape. Must be http:// or https://.
formats  string[]  No        ["html"]  Output formats to include. Options: "html", "markdown", "screenshot", "pdf", "links".
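
A client can check these constraints before sending anything. The sketch below is a hypothetical client-side helper (build_body is not part of the API); how the server itself reports an invalid body is not covered here.

```python
from typing import Optional
from urllib.parse import urlparse

# Format names taken from the request body table.
ALLOWED_FORMATS = {"html", "markdown", "screenshot", "pdf", "links"}


def build_body(url: str, formats: Optional[list[str]] = None) -> dict:
    """Return a request body, raising ValueError for inputs the table rules out."""
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError(f"url must be http:// or https://, got {url!r}")
    body: dict = {"url": url}
    if formats is not None:
        unknown = [f for f in formats if f not in ALLOWED_FORMATS]
        if unknown:
            raise ValueError(f"unknown formats: {unknown}")
        body["formats"] = list(formats)
    # When formats is omitted, the documented default ["html"] applies server-side.
    return body
```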

Output formats

The formats array controls what data is returned. The content field always contains the raw HTML (or parsed JSON for API endpoints) regardless of which formats you request. Additional formats populate their respective response fields.

Markdown

Converts the page content to clean markdown, stripping scripts, styles, and non-visible elements.

JSON body:

{
  "url": "https://news.ycombinator.com/",
  "formats": ["markdown"]
}

cURL:

curl -s -X POST "https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE" -H "Content-Type: application/json" -d '{"url":"https://news.ycombinator.com/","formats":["markdown"]}'

Response:

{
  "ok": true,
  "statusCode": 200,
  "content": "<!DOCTYPE html><html>...</html>",
  "markdown": "# Hacker News\n\n[new](newest) | [past](front) | [comments](newcomments)...",
  "screenshot": null,
  "pdf": null,
  "links": null,
  "strategy": "http-fetch",
  "attempted": ["http-fetch"],
  "message": null
}

Screenshot

Returns a full-page screenshot as a base64-encoded PNG string. Including "screenshot" in formats forces a headless browser to be used.

JSON body:

{
  "url": "https://news.ycombinator.com/",
  "formats": ["screenshot"]
}

cURL:

curl -s -X POST "https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE" -H "Content-Type: application/json" -d '{"url":"https://news.ycombinator.com/","formats":["screenshot"]}'

Response:

{
  "ok": true,
  "statusCode": 200,
  "content": "<!DOCTYPE html><html>...</html>",
  "screenshot": "iVBORw0KGgoAAAANSUhEUgAA...",
  "pdf": null,
  "markdown": null,
  "links": null,
  "strategy": "browser",
  "attempted": ["browser"],
  "message": null
}

PDF

Returns the page as a base64-encoded PDF string. Like "screenshot", including "pdf" forces a headless browser.

JSON body:

{
  "url": "https://news.ycombinator.com/",
  "formats": ["pdf"]
}

cURL:

curl -s -X POST "https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE" -H "Content-Type: application/json" -d '{"url":"https://news.ycombinator.com/","formats":["pdf"]}'

Response:

{
  "ok": true,
  "statusCode": 200,
  "content": "<!DOCTYPE html><html>...</html>",
  "pdf": "JVBERi0xLjQKMSAwIG9iago8PA...",
  "screenshot": null,
  "markdown": null,
  "links": null,
  "strategy": "browser",
  "attempted": ["browser"],
  "message": null
}
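
Because the screenshot and pdf fields are base64 strings, they need decoding before they can be written to disk. A minimal sketch, assuming the response has already been parsed into a dict; decode_artifacts is a hypothetical helper, and the signature check is an extra sanity step rather than something the API requires.

```python
import base64

# Standard file signatures for the two binary formats.
PNG_MAGIC = b"\x89PNG\r\n\x1a\n"
PDF_MAGIC = b"%PDF-"


def decode_artifacts(result: dict) -> dict[str, bytes]:
    """Decode whichever of "screenshot" and "pdf" is present in the response."""
    decoded: dict[str, bytes] = {}
    for field, magic in (("screenshot", PNG_MAGIC), ("pdf", PDF_MAGIC)):
        encoded = result.get(field)
        if encoded is None:
            continue  # format was not requested
        data = base64.b64decode(encoded)
        if not data.startswith(magic):
            raise ValueError(f"{field} does not look like the expected file type")
        decoded[field] = data
    return decoded


# Writing the files afterwards could look like:
# from pathlib import Path
# for name, data in decode_artifacts(result).items():
#     Path("page.png" if name == "screenshot" else "page.pdf").write_bytes(data)
```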

Links

Extracts all links (<a href>) from the page, resolves relative URLs to absolute ones, and keeps only http:// and https:// links.

JSON body:

{
  "url": "https://news.ycombinator.com/",
  "formats": ["links"]
}

cURL:

curl -s -X POST "https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE" -H "Content-Type: application/json" -d '{"url":"https://news.ycombinator.com/","formats":["links"]}'

Response:

{
  "ok": true,
  "statusCode": 200,
  "content": "<!DOCTYPE html><html>...</html>",
  "links": [
    "https://news.ycombinator.com/news",
    "https://news.ycombinator.com/newest",
    "https://news.ycombinator.com/front",
    "https://grapheneos.social/@GrapheneOS/116160393783585567"
  ],
  "screenshot": null,
  "pdf": null,
  "markdown": null,
  "strategy": "http-fetch",
  "attempted": ["http-fetch"],
  "message": null
}
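
Since the returned links are already absolute, post-processing them needs no further URL resolution. As one possible follow-up step, the sketch below separates same-host links from external ones; split_links is a hypothetical helper, and whether the API de-duplicates links itself is not stated, so the helper does it defensively.

```python
from typing import Optional
from urllib.parse import urlparse


def split_links(links: Optional[list[str]], page_url: str) -> tuple[list[str], list[str]]:
    """Split links into (internal, external) relative to the scraped page's host."""
    page_host = urlparse(page_url).hostname
    internal: list[str] = []
    external: list[str] = []
    # dict.fromkeys removes duplicates while preserving order; links may be null.
    for link in dict.fromkeys(links or []):
        if urlparse(link).hostname == page_host:
            internal.append(link)
        else:
            external.append(link)
    return internal, external
```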

Response fields

Field        Type                    Description
ok           boolean                 Whether the scrape succeeded.
statusCode   number | null           The HTTP status code from the target site, or null on network errors.
content      string | object | null  Page content as HTML string, or a parsed JSON object if the target returns application/json. null on failure.
contentType  string | null           The content type of the scraped page.
headers      object                  HTTP response headers from the target site.
strategy     string                  The strategy that produced the result (or was being attempted on failure).
attempted    string[]                All strategies attempted, in order.
message      string | null           Error message on failure, null on success.
screenshot   string | null           Base64-encoded PNG screenshot, when "screenshot" is in formats.
pdf          string | null           Base64-encoded PDF, when "pdf" is in formats.
markdown     string | null           Markdown conversion of the page, when "markdown" is in formats.
links        string[] | null         Extracted links, when "links" is in formats.

JSON auto-parsing

When the target URL returns JSON content (e.g., an API endpoint with Content-Type: application/json), the content field will contain the parsed JSON object rather than a raw string:

{
  "ok": true,
  "statusCode": 200,
  "content": {
    "userId": 1,
    "id": 1,
    "title": "Example post title",
    "body": "Example post body..."
  },
  "contentType": "application/json; charset=utf-8",
  "strategy": "http-fetch",
  "attempted": ["http-fetch"],
  "message": null
}
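
Because content can be either a string or an already-parsed value, client code typically has to branch on its type. A minimal sketch, where split_content is a hypothetical helper operating on the parsed response dict:

```python
from typing import Any, Optional


def split_content(result: dict) -> tuple[Optional[str], Any]:
    """Return (html, data); at most one of the two is not None."""
    content = result.get("content")
    if isinstance(content, str):
        return content, None   # page returned as an HTML string
    return None, content       # parsed JSON, or None on failure
```

Calling json.loads on content again is unnecessary in the JSON case and would fail, since the value is no longer a string.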

Error handling

On failure, the API still responds with HTTP 200, so check the ok field rather than the HTTP status. The body contains ok: false and a message describing the error:

{
  "ok": false,
  "statusCode": null,
  "content": null,
  "contentType": null,
  "headers": {},
  "strategy": "browser-captcha",
  "attempted": ["http-fetch", "http-proxy", "browser", "browser-captcha"],
  "message": "Captcha was detected but could not be solved",
  "screenshot": null,
  "pdf": null,
  "markdown": null,
  "links": null
}
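
Since a failed scrape arrives with HTTP 200, an HTTP client's usual status-code error handling will not catch it. One way to surface failures is to convert ok: false into an exception; ScrapeError and unwrap below are hypothetical names used for illustration only.

```python
class ScrapeError(Exception):
    """Raised when the response body reports ok: false."""

    def __init__(self, message: str, attempted: list[str]):
        super().__init__(message)
        self.attempted = attempted  # strategies tried before giving up


def unwrap(result: dict) -> dict:
    """Return the result unchanged on success; raise ScrapeError on failure."""
    if not result.get("ok"):
        raise ScrapeError(
            result.get("message") or "scrape failed",
            result.get("attempted") or [],
        )
    return result
```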

Configuration options

The /smart-scrape API supports a timeout query parameter to control the maximum time allowed for the scrape operation:

POST /smart-scrape?token=YOUR_API_TOKEN_HERE&timeout=30000

The timeout value is in milliseconds and applies to each strategy attempt. If not specified, the server default timeout is used.
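
Because the timeout applies per strategy attempt, the total wall-clock time of one request can exceed it when the pipeline escalates. The sketch below builds the query string and estimates a client-side timeout. It assumes, without confirmation from this page, that at most the four strategies listed under "How it works" run one after another; endpoint_with_timeout, client_timeout_seconds, and the 5-second margin are illustrative choices.

```python
from urllib.parse import urlencode

# Assumption: the four strategies described under "How it works" may all be tried.
STRATEGY_COUNT = 4


def endpoint_with_timeout(token: str, timeout_ms: int) -> str:
    """Return the request path with token and timeout query parameters."""
    return "/smart-scrape?" + urlencode({"token": token, "timeout": timeout_ms})


def client_timeout_seconds(timeout_ms: int, margin_s: float = 5.0) -> float:
    """Estimate a worst-case client timeout if every strategy uses its full budget."""
    return STRATEGY_COUNT * timeout_ms / 1000 + margin_s
```

Under these assumptions, a 30000 ms per-attempt timeout suggests allowing roughly 125 seconds on the client before giving up.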