Smart Scrape API
Intelligently scrape any URL using cascading strategies that automatically escalate from fast HTTP fetching to headless browsers and captcha solving as needed. Specify output formats to receive HTML, markdown, screenshots, PDFs, or extracted links, all in a single request.
Endpoint
- Method: POST
- Path: /smart-scrape
- Auth: token query parameter (?token=)
- Content-Type: application/json
- Response: application/json
Quickstart
cURL:
curl --request POST \
  --url 'https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE' \
  --header 'Content-Type: application/json' \
  --data '{
    "url": "https://news.ycombinator.com/",
    "formats": ["html", "markdown", "links"]
  }'
JavaScript:
const TOKEN = "YOUR_API_TOKEN_HERE";
const url = `https://production-sfo.browserless.io/smart-scrape?token=${TOKEN}`;
const headers = {
  "Content-Type": "application/json"
};
const data = {
  url: "https://news.ycombinator.com/",
  formats: ["html", "markdown", "links"]
};

const smartScrape = async () => {
  const response = await fetch(url, {
    method: 'POST',
    headers: headers,
    body: JSON.stringify(data)
  });
  const result = await response.json();
  console.log(result);
};

smartScrape();
Python:
import requests

TOKEN = "YOUR_API_TOKEN_HERE"
url = f"https://production-sfo.browserless.io/smart-scrape?token={TOKEN}"
headers = {
    "Content-Type": "application/json"
}
data = {
    "url": "https://news.ycombinator.com/",
    "formats": ["html", "markdown", "links"]
}

response = requests.post(url, headers=headers, json=data)
result = response.json()
print(result)
Response
{
  "ok": true,
  "statusCode": 200,
  "content": "<html lang=\"en\" op=\"news\"><head><meta name=\"referrer\" content=\"origin\">...</html>",
  "contentType": "text/html; charset=utf-8",
  "headers": {
    "content-type": "text/html; charset=utf-8",
    "cache-control": "private; max-age=0"
  },
  "strategy": "http-fetch",
  "attempted": ["http-fetch"],
  "message": null,
  "screenshot": null,
  "pdf": null,
  "markdown": "# Hacker News\n\n[new](newest) | [past](front) | [comments](newcomments) | [ask](ask) | [show](show) | [jobs](jobs) | [submit](submit)\n\n1. [Motorola GrapheneOS devices will be bootloader unlockable/relockable](https://grapheneos.social/...)...",
  "links": [
    "https://news.ycombinator.com/news",
    "https://news.ycombinator.com/newest",
    "https://news.ycombinator.com/front"
  ]
}
How it works
The Smart Scrape API uses a cascading strategy pipeline to fetch content in the most efficient way possible. It starts with the fastest, cheapest approach and automatically escalates to more powerful strategies only when needed:
- Fast HTTP fetch: Makes a lightweight HTTP request that mimics a real browser's network fingerprint. This handles the majority of static and server-rendered sites in under 2 seconds.
- Proxied HTTP fetch: If the initial request is blocked (e.g., by datacenter IP detection), the same request is retried through a residential proxy.
- Headless browser: If the page requires JavaScript rendering (single-page apps, client-rendered content), a full stealth browser is launched to render the page.
- Browser + captcha solving: If a captcha or bot detection challenge is encountered, the browser automatically detects and solves it (supports reCAPTCHA, hCaptcha, Cloudflare Turnstile, and others).
The pipeline stops as soon as a strategy succeeds. The strategy field in the response tells you which approach was used, and the attempted array shows the full sequence of strategies tried.
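As a rough illustration of reading these two fields, the sketch below summarizes how far a request escalated. The helper name summarize_escalation is hypothetical and not part of the API; it only relies on the ok, strategy, and attempted fields shown in the sample responses on this page.

```python
def summarize_escalation(result: dict) -> str:
    """Describe which strategy ended the pipeline and how many fallbacks preceded it."""
    attempted = result.get("attempted") or []
    strategy = result.get("strategy")
    outcome = "succeeded" if result.get("ok") else "failed"
    # Every entry before the last one is a strategy the pipeline moved past.
    escalations = max(len(attempted) - 1, 0)
    path = " -> ".join(attempted)
    return f"{strategy} {outcome} after {escalations} escalation(s): {path}"
```

Logging this string per request is one simple way to spot sites that regularly need a browser or captcha solving.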
Request body
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| url | string | Yes | - | The URL to scrape. Must be http:// or https://. |
| formats | string[] | No | ["html"] | Output formats to include. Options: "html", "markdown", "screenshot", "pdf", "links". |
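A client can check these constraints before sending anything. The following is a minimal sketch, assuming you want to reject bad input locally; build_request_body and ALLOWED_FORMATS are hypothetical names for illustration, and the server performs its own validation regardless.

```python
from urllib.parse import urlparse

# Format names taken from the request body table.
ALLOWED_FORMATS = {"html", "markdown", "screenshot", "pdf", "links"}


def build_request_body(url, formats=None):
    """Build the JSON body for /smart-scrape, validating it client-side."""
    scheme = urlparse(url).scheme.lower()
    if scheme not in ("http", "https"):
        raise ValueError(f"url must be http:// or https://, got: {url!r}")

    body = {"url": url}
    if formats is not None:
        unknown = [f for f in formats if f not in ALLOWED_FORMATS]
        if unknown:
            raise ValueError(f"unsupported formats: {unknown}")
        body["formats"] = list(formats)
    # When formats is omitted, the field is left out so the documented default applies.
    return body
```

The returned dictionary can be passed as the json argument of requests.post, as in the Quickstart.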
Output formats
The formats array controls what data is returned. The content field always contains the raw HTML (or parsed JSON for API endpoints) regardless of which formats you request. Additional formats populate their respective response fields.
Markdown
Converts the page content to clean markdown, stripping scripts, styles, and non-visible elements.
JSON body:
{
  "url": "https://news.ycombinator.com/",
  "formats": ["markdown"]
}
cURL:
curl -s -X POST "https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://news.ycombinator.com/","formats":["markdown"]}'
Response:
{
  "ok": true,
  "statusCode": 200,
  "content": "<!DOCTYPE html><html>...</html>",
  "markdown": "# Hacker News\n\n[new](newest) | [past](front) | [comments](newcomments)...",
  "screenshot": null,
  "pdf": null,
  "links": null,
  "strategy": "http-fetch",
  "attempted": ["http-fetch"],
  "message": null
}
Screenshot
Returns a full-page screenshot as a base64-encoded PNG string. Including "screenshot" in formats forces a headless browser to be used.
JSON body:
{
  "url": "https://news.ycombinator.com/",
  "formats": ["screenshot"]
}
cURL:
curl -s -X POST "https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://news.ycombinator.com/","formats":["screenshot"]}'
Response:
{
  "ok": true,
  "statusCode": 200,
  "content": "<!DOCTYPE html><html>...</html>",
  "screenshot": "iVBORw0KGgoAAAANSUhEUgAA...",
  "pdf": null,
  "markdown": null,
  "links": null,
  "strategy": "browser",
  "attempted": ["browser"],
  "message": null
}
PDF
Returns the page as a base64-encoded PDF string. Like "screenshot", including "pdf" forces a headless browser.
JSON body:
{
  "url": "https://news.ycombinator.com/",
  "formats": ["pdf"]
}
cURL:
curl -s -X POST "https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://news.ycombinator.com/","formats":["pdf"]}'
Response:
{
  "ok": true,
  "statusCode": 200,
  "content": "<!DOCTYPE html><html>...</html>",
  "pdf": "JVBERi0xLjQKMSAwIG9iago8PA...",
  "screenshot": null,
  "markdown": null,
  "links": null,
  "strategy": "browser",
  "attempted": ["browser"],
  "message": null
}
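Because the screenshot and pdf fields arrive as base64 strings, they need decoding before they can be written to disk. Here is a minimal sketch using only the standard library; decode_binary_formats is a hypothetical helper name, and it assumes the response has already been parsed into a Python dictionary.

```python
import base64


def decode_binary_formats(result: dict) -> dict:
    """Return decoded bytes for whichever of "screenshot" and "pdf" are populated."""
    decoded = {}
    for field in ("screenshot", "pdf"):
        value = result.get(field)
        # The field is null (None in Python) when the format was not requested.
        if value:
            decoded[field] = base64.b64decode(value)
    return decoded
```

The resulting bytes can then be saved with, for example, pathlib.Path("page.png").write_bytes(decoded["screenshot"]).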
Links
Extracts all links (<a href>) from the page, resolves relative URLs to absolute, and filters to http:// and https:// links only.
JSON body:
{
  "url": "https://news.ycombinator.com/",
  "formats": ["links"]
}
cURL:
curl -s -X POST "https://production-sfo.browserless.io/smart-scrape?token=YOUR_API_TOKEN_HERE" \
  -H "Content-Type: application/json" \
  -d '{"url":"https://news.ycombinator.com/","formats":["links"]}'
Response:
{
  "ok": true,
  "statusCode": 200,
  "content": "<!DOCTYPE html><html>...</html>",
  "links": [
    "https://news.ycombinator.com/news",
    "https://news.ycombinator.com/newest",
    "https://news.ycombinator.com/front",
    "https://grapheneos.social/@GrapheneOS/116160393783585567"
  ],
  "screenshot": null,
  "pdf": null,
  "markdown": null,
  "strategy": "http-fetch",
  "attempted": ["http-fetch"],
  "message": null
}
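Since the returned links are already absolute, post-processing them is straightforward. As one possible use, the sketch below keeps only links on the same host as the scraped page, which is a common first step when crawling a single site. The function name same_site_links is an assumption for illustration, not something the API provides.

```python
from urllib.parse import urlparse


def same_site_links(links, page_url):
    """Keep links whose hostname matches page_url, dropping duplicates but preserving order."""
    host = urlparse(page_url).hostname
    seen = set()
    internal = []
    # links may be None when "links" was not requested.
    for link in links or []:
        if urlparse(link).hostname == host and link not in seen:
            seen.add(link)
            internal.append(link)
    return internal
```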
Response fields
| Field | Type | Description |
|---|---|---|
| ok | boolean | Whether the scrape succeeded. |
| statusCode | number \| null | The HTTP status code from the target site, or null on network errors. |
| content | string \| object \| null | Page content as HTML string, or a parsed JSON object if the target returns application/json. null on failure. |
| contentType | string \| null | The content type of the scraped page. |
| headers | object | HTTP response headers from the target site. |
| strategy | string | The strategy that produced the result (or was being attempted on failure). |
| attempted | string[] | All strategies attempted, in order. |
| message | string \| null | Error message on failure, null on success. |
| screenshot | string \| null | Base64-encoded PNG screenshot, when "screenshot" is in formats. |
| pdf | string \| null | Base64-encoded PDF, when "pdf" is in formats. |
| markdown | string \| null | Markdown conversion of the page, when "markdown" is in formats. |
| links | string[] \| null | Extracted links, when "links" is in formats. |
JSON auto-parsing
When the target URL returns JSON content (e.g., an API endpoint with Content-Type: application/json), the content field will contain the parsed JSON object rather than a raw string:
{
  "ok": true,
  "statusCode": 200,
  "content": {
    "userId": 1,
    "id": 1,
    "title": "Example post title",
    "body": "Example post body..."
  },
  "contentType": "application/json; charset=utf-8",
  "strategy": "http-fetch",
  "attempted": ["http-fetch"],
  "message": null
}
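Client code therefore has to handle content being either a string or an already-parsed value. A small sketch of one way to branch on it follows; split_content is a hypothetical helper, and treating any non-string value (including a top-level JSON array) as parsed data is an assumption rather than documented behavior.

```python
def split_content(result: dict):
    """Return (html, data): exactly one is set on success, both are None on failure."""
    content = result.get("content")
    if isinstance(content, str):
        # HTML pages arrive as a raw string.
        return content, None
    # Parsed JSON arrives as a dict (assumed: possibly a list); None passes through unchanged.
    return None, content
```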
Error handling
On failure, the response still returns HTTP 200 with ok: false and a message describing the error:
{
  "ok": false,
  "statusCode": null,
  "content": null,
  "contentType": null,
  "headers": {},
  "strategy": "browser-captcha",
  "attempted": ["http-fetch", "http-proxy", "browser", "browser-captcha"],
  "message": "Captcha was detected but could not be solved",
  "screenshot": null,
  "pdf": null,
  "markdown": null,
  "links": null
}
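Because a failed scrape still comes back as HTTP 200, checking the HTTP status alone (for example with raise_for_status in requests) will not catch it. The sketch below shows one way to turn ok: false into an exception; SmartScrapeError and ensure_ok are hypothetical names introduced here for illustration.

```python
class SmartScrapeError(Exception):
    """Raised when a Smart Scrape response reports ok: false."""

    def __init__(self, message, strategy, attempted):
        super().__init__(message)
        self.strategy = strategy
        self.attempted = attempted


def ensure_ok(result: dict) -> dict:
    """Return the result unchanged if it succeeded, otherwise raise SmartScrapeError."""
    if not result.get("ok"):
        raise SmartScrapeError(
            result.get("message") or "Smart Scrape failed",
            result.get("strategy"),
            result.get("attempted") or [],
        )
    return result
```

Calling ensure_ok(response.json()) right after the request keeps the failure details (message, last strategy, and strategies tried) available to the caller.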
Configuration options
The /smart-scrape API supports a timeout query parameter to control the maximum time allowed for the scrape operation:
POST /smart-scrape?token=YOUR_API_TOKEN_HERE&timeout=30000
The timeout value is in milliseconds and applies to each strategy attempt. If not specified, the server default timeout is used.
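When building the request URL in code, the timeout can be appended alongside the token. This is a minimal sketch using the standard library; build_endpoint and BASE_URL are hypothetical names, and the host is simply the one used in the examples on this page.

```python
from urllib.parse import urlencode

# Host taken from the examples above; adjust it to the region you use.
BASE_URL = "https://production-sfo.browserless.io/smart-scrape"


def build_endpoint(token, timeout_ms=None):
    """Build the /smart-scrape URL with the token and an optional timeout in milliseconds."""
    params = {"token": token}
    if timeout_ms is not None:
        if timeout_ms <= 0:
            raise ValueError("timeout_ms must be a positive number of milliseconds")
        params["timeout"] = int(timeout_ms)
    # Without a timeout parameter, the server default applies.
    return f"{BASE_URL}?{urlencode(params)}"
```

Since the timeout applies to each strategy attempt, a request that cascades through several strategies may take noticeably longer than a single timeout value, so any client-side HTTP timeout should probably be set with that in mind.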