Smart Scrape API

Intelligently scrape any URL using cascading strategies that automatically escalate from fast HTTP fetching to headless browsers and captcha solving as needed. Specify output formats to receive HTML, markdown, screenshots, PDFs, or extracted links, all in a single request.

Endpoint

Method: POST
Path: /smart-scrape
Auth: token query parameter (?token=)
Content-Type: application/json
Response: application/json

Quickstart

cURL:

curl --request POST \
  --url 'https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
  "url": "https://news.ycombinator.com/",
  "formats": ["html", "markdown", "links"]
}'

JavaScript:

const TOKEN = "YOUR_API_TOKEN_HERE";
const url = `https://production-sfo.browserless.io/smart-scrape?token=${TOKEN}`;
const headers = {
  "Content-Type": "application/json"
};

const data = {
  url: "https://news.ycombinator.com/",
  formats: ["html", "markdown", "links"]
};

const smartScrape = async () => {
  const response = await fetch(url, {
    method: 'POST',
    headers: headers,
    body: JSON.stringify(data)
  });

  const result = await response.json();
  console.log(result);
};

smartScrape();

Python:

import requests

TOKEN = "YOUR_API_TOKEN_HERE"
url = f"https://production-sfo.browserless.io/smart-scrape?token={TOKEN}"
headers = {
    "Content-Type": "application/json"
}

data = {
    "url": "https://news.ycombinator.com/",
    "formats": ["html", "markdown", "links"]
}

response = requests.post(url, headers=headers, json=data)
result = response.json()

print(result)

Response

{
  "ok": true,
  "statusCode": 200,
  "content": "<html lang=\"en\" op=\"news\"><head><meta name=\"referrer\" content=\"origin\">...</html>",
  "contentType": "text/html; charset=utf-8",
  "headers": {
    "content-type": "text/html; charset=utf-8",
    "cache-control": "private; max-age=0"
  },
  "strategy": "http-fetch",
  "attempted": ["http-fetch"],
  "message": null,
  "screenshot": null,
  "pdf": null,
  "markdown": "# Hacker News\n\n[new](newest) | [past](front) | [comments](newcomments) | [ask](ask) | [show](show) | [jobs](jobs) | [submit](submit)\n\n1. [Motorola GrapheneOS devices will be bootloader unlockable/relockable](https://grapheneos.social/...)...",
  "links": [
    "https://news.ycombinator.com/news",
    "https://news.ycombinator.com/newest",
    "https://news.ycombinator.com/front"
  ]
}

How it works

The Smart Scrape API uses a cascading strategy pipeline to fetch content in the most efficient way possible. It starts with the fastest, cheapest approach and automatically escalates to heavier strategies only when needed:

1. Fast HTTP fetch: Makes a lightweight HTTP request that mimics a real browser's network fingerprint. This handles the majority of static and server-rendered sites in under 2 seconds.
2. Proxied HTTP fetch: If the initial request is blocked (e.g., by IP detection), the same request is retried through the selected proxy network (residential by default, or datacenter if proxy: "datacenter" is set in the request body).
3. Headless browser: If the page requires JavaScript rendering (single-page apps, client-rendered content), a full stealth browser is launched to render the page.
4. Browser + captcha solving: If a captcha or bot detection challenge is encountered, the browser automatically detects and solves it (supports reCAPTCHA, Cloudflare Turnstile, and others).

The pipeline stops as soon as a strategy succeeds. The strategy field in the response tells you which approach was used, and the attempted array shows the full sequence of strategies tried.
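
As a rough illustration of reading these two fields together, the Python sketch below summarizes how far the pipeline escalated for one response. The helper name summarize_escalation is hypothetical, and the sketch assumes only the response fields documented on this page.

```python
def summarize_escalation(result: dict) -> str:
    """Describe which strategy ended the pipeline and what was tried before it.

    `result` is the parsed JSON body returned by /smart-scrape.
    """
    attempted = result.get("attempted") or []
    strategy = result.get("strategy")
    # Strategies that were tried before the one reported in `strategy`.
    earlier = [name for name in attempted if name != strategy]
    outcome = "succeeded" if result.get("ok") else "failed"
    if not earlier:
        return f"{strategy} {outcome} on the first attempt"
    return f"{strategy} {outcome} after trying {', '.join(earlier)}"


if __name__ == "__main__":
    sample = {"ok": True, "strategy": "http-fetch", "attempted": ["http-fetch"]}
    print(summarize_escalation(sample))
```

Logging this summary per request is one way to spot sites that regularly escalate to the browser strategies.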

Captcha handling scope

Smart Scrape only solves captchas that gate access to the page itself — for example, a Cloudflare Turnstile interstitial or a reCAPTCHA that blocks the page from loading. In these cases the browser solves the challenge automatically so the underlying content can be returned.

Captchas that are embedded in a form on the page (e.g., a reCAPTCHA next to a "Submit" button on a contact or signup form) are not solved. Smart Scrape fetches and returns the rendered page but does not fill, interact with, or submit forms, so form-submission captchas are left untouched in the returned HTML. If you need to submit a form behind a captcha, use BrowserQL with the solve mutation instead.

Request body

| Field | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| `url` | `string` | Yes | - | The URL to scrape. Must be `http://` or `https://`. |
| `formats` | `string[]` | No | `["html"]` | Output formats to include. Options: `"html"`, `"markdown"`, `"screenshot"`, `"pdf"`, `"links"`. |
| `proxy` | `string` | No | `"residential"` | Proxy network to route the scrape through: `"residential"` (6 units/MB) or `"datacenter"` (2 units/MB). |
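
The constraints in this table can be checked on the client before a request is sent. The following Python sketch is one possible way to do that; build_request_body is a hypothetical helper, and how the server itself responds to invalid values is not described here, so the checks are an assumption about what is worth catching early.

```python
from urllib.parse import urlparse

# Values taken from the request body table above.
ALLOWED_FORMATS = {"html", "markdown", "screenshot", "pdf", "links"}
ALLOWED_PROXIES = {"residential", "datacenter"}


def build_request_body(url, formats=None, proxy=None):
    """Build a /smart-scrape request body, rejecting values the table rules out."""
    if urlparse(url).scheme not in ("http", "https"):
        raise ValueError("url must start with http:// or https://")
    body = {"url": url}
    if formats is not None:
        unknown = sorted(set(formats) - ALLOWED_FORMATS)
        if unknown:
            raise ValueError(f"unsupported formats: {unknown}")
        body["formats"] = list(formats)
    if proxy is not None:
        if proxy not in ALLOWED_PROXIES:
            raise ValueError(f"unsupported proxy: {proxy!r}")
        body["proxy"] = proxy
    # Omitted optional fields fall back to the documented defaults.
    return body
```

For example, build_request_body("https://news.ycombinator.com/", ["markdown", "links"], proxy="datacenter") returns a dict that can be passed as the JSON payload of the request.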

Output formats

The formats array controls what data is returned. The content field always contains the raw HTML (or parsed JSON for API endpoints) regardless of which formats you request. Additional formats populate their respective response fields.

Markdown

Converts the page content to clean markdown, stripping scripts, styles, and non-visible elements.

JSON body:

{
  "url": "https://news.ycombinator.com/",
  "formats": ["markdown"]
}

cURL:

curl -s -X POST "https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE" -H "Content-Type: application/json" -d '{"url":"https://news.ycombinator.com/","formats":["markdown"]}'

Response:

{
  "ok": true,
  "statusCode": 200,
  "content": "<!DOCTYPE html><html>...</html>",
  "markdown": "# Hacker News\n\n[new](newest) | [past](front) | [comments](newcomments)...",
  "screenshot": null,
  "pdf": null,
  "links": null,
  "strategy": "http-fetch",
  "attempted": ["http-fetch"],
  "message": null
}

Screenshot

Returns a full-page screenshot as a base64-encoded PNG string. Including "screenshot" in formats forces a headless browser to be used.

JSON body:

{
  "url": "https://news.ycombinator.com/",
  "formats": ["screenshot"]
}

cURL:

curl -s -X POST "https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE" -H "Content-Type: application/json" -d '{"url":"https://news.ycombinator.com/","formats":["screenshot"]}'

Response:

{
  "ok": true,
  "statusCode": 200,
  "content": "<!DOCTYPE html><html>...</html>",
  "screenshot": "iVBORw0KGgoAAAANSUhEUgAA...",
  "pdf": null,
  "markdown": null,
  "links": null,
  "strategy": "browser",
  "attempted": ["browser"],
  "message": null
}

PDF

Returns the page as a base64-encoded PDF string. Like "screenshot", including "pdf" forces a headless browser.

JSON body:

{
  "url": "https://news.ycombinator.com/",
  "formats": ["pdf"]
}

cURL:

curl -s -X POST "https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE" -H "Content-Type: application/json" -d '{"url":"https://news.ycombinator.com/","formats":["pdf"]}'

Response:

{
  "ok": true,
  "statusCode": 200,
  "content": "<!DOCTYPE html><html>...</html>",
  "pdf": "JVBERi0xLjQKMSAwIG9iago8PA...",
  "screenshot": null,
  "markdown": null,
  "links": null,
  "strategy": "browser",
  "attempted": ["browser"],
  "message": null
}
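
Because the screenshot and pdf fields arrive as base64 strings, they need to be decoded before they can be opened as files. The Python sketch below shows one way to do this; the helper name save_binary_outputs and the file names page.png and page.pdf are illustrative choices, not part of the API.

```python
import base64
from pathlib import Path


def save_binary_outputs(result: dict, out_dir: str = ".") -> list:
    """Decode the base64 `screenshot` and `pdf` fields and write them to disk.

    Returns the list of paths written. Fields that are null are skipped.
    """
    # Hypothetical output file names chosen for this example.
    targets = {"screenshot": "page.png", "pdf": "page.pdf"}
    written = []
    for field, filename in targets.items():
        encoded = result.get(field)
        if not encoded:
            continue  # format was not requested, or the scrape failed
        path = Path(out_dir) / filename
        path.write_bytes(base64.b64decode(encoded))
        written.append(path)
    return written
```

Passing the parsed response of a request with "formats": ["screenshot", "pdf"] would write both files into out_dir.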

Links

Extracts all links (<a href>) from the page, resolves relative URLs to absolute, and filters to http:// and https:// links only.

JSON body:

{
  "url": "https://news.ycombinator.com/",
  "formats": ["links"]
}

cURL:

curl -s -X POST "https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE" -H "Content-Type: application/json" -d '{"url":"https://news.ycombinator.com/","formats":["links"]}'

Response:

{
  "ok": true,
  "statusCode": 200,
  "content": "<!DOCTYPE html><html>...</html>",
  "links": [
    "https://news.ycombinator.com/news",
    "https://news.ycombinator.com/newest",
    "https://news.ycombinator.com/front",
    "https://grapheneos.social/@GrapheneOS/116160393783585567"
  ],
  "screenshot": null,
  "pdf": null,
  "markdown": null,
  "strategy": "http-fetch",
  "attempted": ["http-fetch"],
  "message": null
}
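
A common follow-up step is to keep only the links that stay on the scraped site. The sketch below does this with Python's standard urllib.parse module; same_host_links is a hypothetical helper, and comparing on the host portion of the URL is just one reasonable filtering rule.

```python
from urllib.parse import urlparse


def same_host_links(result: dict, page_url: str) -> list:
    """Return the extracted links whose host matches the scraped page's host."""
    host = urlparse(page_url).netloc.lower()
    # `links` is null unless "links" was included in `formats`.
    links = result.get("links") or []
    return [link for link in links if urlparse(link).netloc.lower() == host]
```

Applied to the response above with page_url set to https://news.ycombinator.com/, this keeps the three news.ycombinator.com links and drops the grapheneos.social one.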

Response fields

| Field | Type | Description |
| --- | --- | --- |
| `ok` | `boolean` | Whether the scrape succeeded. |
| `statusCode` | `number \| null` | The HTTP status code from the target site, or `null` on network errors. |
| `content` | `string \| object \| null` | Page content as HTML string, or a parsed JSON object if the target returns `application/json`. `null` on failure. |
| `contentType` | `string \| null` | The content type of the scraped page. |
| `headers` | `object` | HTTP response headers from the target site. |
| `strategy` | `string` | The strategy that produced the result (or was being attempted on failure). |
| `attempted` | `string[]` | All strategies attempted, in order. |
| `message` | `string \| null` | Error message on failure, `null` on success. |
| `screenshot` | `string \| null` | Base64-encoded PNG screenshot, when `"screenshot"` is in `formats`. |
| `pdf` | `string \| null` | Base64-encoded PDF, when `"pdf"` is in `formats`. |
| `markdown` | `string \| null` | Markdown conversion of the page, when `"markdown"` is in `formats`. |
| `links` | `string[] \| null` | Extracted links, when `"links"` is in `formats`. |

JSON auto-parsing

When the target URL returns JSON content (e.g., an API endpoint with Content-Type: application/json), the content field will contain the parsed JSON object rather than a raw string:

{
  "ok": true,
  "statusCode": 200,
  "content": {
    "userId": 1,
    "id": 1,
    "title": "Example post title",
    "body": "Example post body..."
  },
  "contentType": "application/json; charset=utf-8",
  "strategy": "http-fetch",
  "attempted": ["http-fetch"],
  "message": null
}
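
Since content can be either a string or an already-parsed object, client code may want to normalize it before further processing. The Python sketch below is one way to handle both cases; content_as_text is a hypothetical helper, and serializing the parsed JSON back to text is only one of several reasonable choices.

```python
import json


def content_as_text(result: dict) -> str:
    """Return `content` as a string whether it arrived as HTML or as parsed JSON."""
    content = result.get("content")
    if content is None:
        return ""  # `content` is null on failure
    if isinstance(content, str):
        return content  # HTML (or other text) is passed through unchanged
    # Parsed JSON: serialize it back to text so callers handle one type.
    return json.dumps(content, indent=2)
```

Callers that want the structured data instead can branch on isinstance(result["content"], str) in the same way and use the object directly.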

Error handling

On failure, the response still returns HTTP 200 with ok: false and a message describing the error:

{
  "ok": false,
  "statusCode": null,
  "content": null,
  "contentType": null,
  "headers": {},
  "strategy": "browser-captcha",
  "attempted": ["http-fetch", "http-proxy", "browser", "browser-captcha"],
  "message": "Captcha was detected but could not be solved",
  "screenshot": null,
  "pdf": null,
  "markdown": null,
  "links": null
}
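
Because a failed scrape still comes back as HTTP 200, checking the HTTP status alone will not detect it; the ok field has to be inspected. The Python sketch below turns ok: false into an exception. ScrapeError and ensure_ok are hypothetical names introduced for this example.

```python
class ScrapeError(Exception):
    """Raised when /smart-scrape reports ok: false (hypothetical helper class)."""

    def __init__(self, message, strategy, attempted):
        super().__init__(message)
        self.strategy = strategy
        self.attempted = attempted


def ensure_ok(result: dict) -> dict:
    """Return `result` unchanged when ok is true; otherwise raise ScrapeError."""
    if result.get("ok"):
        return result
    raise ScrapeError(
        result.get("message") or "scrape failed",
        result.get("strategy"),
        result.get("attempted") or [],
    )
```

Wrapping the parsed response in ensure_ok(...) lets the rest of the code assume content is present, while the exception carries strategy and attempted for logging.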

Using a profile

Scrape authenticated pages by passing a saved profile via the ?profile= query parameter. The browser loads the profile's cookies, localStorage, and IndexedDB before navigating, so the page is accessed as the logged-in user.

cURL:

curl --request POST \
  --url 'https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE&profile=acme-prod' \
  --header 'Content-Type: application/json' \
  --data '{
  "url": "https://app.example.com/dashboard",
  "formats": ["html", "markdown"]
}'

JavaScript:

const TOKEN = "YOUR_API_TOKEN_HERE";
const url = `https://production-sfo.browserless.io/smart-scrape?token=${TOKEN}&profile=acme-prod`;
const headers = {
  "Content-Type": "application/json"
};

const data = {
  url: "https://app.example.com/dashboard",
  formats: ["html", "markdown"]
};

const smartScrape = async () => {
  const response = await fetch(url, {
    method: 'POST',
    headers: headers,
    body: JSON.stringify(data)
  });

  const result = await response.json();
  console.log(result);
};

smartScrape();

Python:

import requests

TOKEN = "YOUR_API_TOKEN_HERE"
url = f"https://production-sfo.browserless.io/smart-scrape?token={TOKEN}&profile=acme-prod"
headers = {
    "Content-Type": "application/json"
}

data = {
    "url": "https://app.example.com/dashboard",
    "formats": ["html", "markdown"]
}

response = requests.post(url, headers=headers, json=data)
result = response.json()

print(result)

Tip: Create and manage profiles via the Authenticated Profiles workflow. The profile name is scoped to your API token, so other tokens cannot access your profiles.

Configuration options

The /smart-scrape API supports a timeout query parameter to control the maximum time allowed for the scrape operation:

POST /smart-scrape?token=YOUR_API_TOKEN_HERE&timeout=30000

The timeout value is in milliseconds and applies to each strategy attempt. If not specified, the server default timeout is used.
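
When several query parameters are involved, building the URL programmatically avoids quoting mistakes. The Python sketch below assembles the endpoint from the token, timeout, and profile parameters described on this page; build_endpoint is a hypothetical helper, and combining timeout and profile in a single request is an assumption, since each is documented separately.

```python
from urllib.parse import urlencode

BASE_URL = "https://production-sfo.browserless.io/smart-scrape"


def build_endpoint(token, timeout_ms=None, profile=None):
    """Return the /smart-scrape URL with the given query parameters.

    `timeout_ms` is in milliseconds and applies to each strategy attempt.
    """
    params = {"token": token}
    if timeout_ms is not None:
        params["timeout"] = int(timeout_ms)
    if profile is not None:
        params["profile"] = profile
    # urlencode percent-encodes values, so tokens or profile names with
    # special characters are transmitted safely.
    return f"{BASE_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_endpoint("YOUR_API_TOKEN_HERE", timeout_ms=30000))
```

Leaving timeout_ms as None omits the parameter, so the server default applies.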
