/scrape API

The /scrape API extracts structured JSON data from pages using CSS selectors. The request body requires a url and an elements array of selectors.

You can check the full OpenAPI schema here.

Quick Start

curl --request POST \
  --url 'https://production-sfo.browserless.io/scrape?token=YOUR_API_TOKEN_HERE' \
  --header 'content-type: application/json' \
  --data '{
    "url": "https://browserless.io/",
    "elements": [
      {
        "selector": "h1"
      }
    ]
  }'

Response

{
  "data": [
    {
      "results": [
        {
          "attributes": [
            { "name": "class", "value": "..." }
          ],
          "height": 120,
          "html": "Headless browser automation, without the hosting headaches",
          "left": 32,
          "text": "Headless browser automation, without the hosting headaches",
          "top": 196,
          "width": 736
        }
      ],
      "selector": "h1"
    }
  ]
}
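
As a minimal sketch of consuming this response in JavaScript, the helper below groups the text of every match under the selector that produced it. The name textsBySelector is hypothetical and not part of the API; the sketch assumes the response shape shown above.

```javascript
// Hypothetical helper, not part of the Browserless API: maps each requested
// selector to the text of every element it matched.
function textsBySelector(response) {
  const out = {};
  for (const entry of response.data) {
    out[entry.selector] = entry.results.map((result) => result.text);
  }
  return out;
}

// Sample object shaped like the response above.
const sampleResponse = {
  data: [
    {
      results: [
        {
          attributes: [{ name: "class", value: "..." }],
          height: 120,
          html: "Headless browser automation, without the hosting headaches",
          left: 32,
          text: "Headless browser automation, without the hosting headaches",
          top: 196,
          width: 736,
        },
      ],
      selector: "h1",
    },
  ],
};

console.log(textsBySelector(sampleResponse));
```

Because results is an array, a selector that matches several elements yields several entries in the returned list.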

Additional Details

BrowserQL

We recommend using BrowserQL, Browserless' first-class browser automation API, to scrape content from any website.

The API uses document.querySelectorAll to retrieve all matches on a page. A more specific selector narrows the returned results. By default, the API navigates to the specified URL, waits for the page to load (including parsing and executing JavaScript), then waits up to 30 seconds for the elements to appear.
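
Because every match is returned, a broad selector can produce a large result set. The sketch below builds a request body with several selectors, one of them narrowed to a specific region of the page. The helper name buildScrapeBody and the selectors are illustrative assumptions, not part of the API.

```javascript
// Hypothetical helper: builds a /scrape request body from a URL and a list
// of CSS selectors. Each selector becomes one entry in the elements array.
function buildScrapeBody(url, selectors) {
  if (!Array.isArray(selectors) || selectors.length === 0) {
    throw new Error("at least one selector is required");
  }
  return {
    url,
    elements: selectors.map((selector) => ({ selector })),
  };
}

// "nav a" matches only links inside <nav>, rather than every link on the page.
const scrapeBody = buildScrapeBody("https://example.com/", ["h1", "nav a"]);
console.log(JSON.stringify(scrapeBody, null, 2));
```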

Specifying Page-Load Behavior

The scrape API lets you control page-load behavior by setting gotoOptions in the JSON body. This object is passed directly to Puppeteer's goto() method.

In the example below, we'll set a waitUntil property and a timeout.

curl --request POST \
  --url 'https://production-sfo.browserless.io/scrape?token=YOUR_API_TOKEN_HERE' \
  --header 'content-type: application/json' \
  --data '{
    "url": "https://example.com/",
    "elements": [
      {
        "selector": "h1"
      }
    ],
    "gotoOptions": {
      "timeout": 10000,
      "waitUntil": "networkidle2"
    }
  }'
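
The same request can be sent from JavaScript. The following is a minimal sketch, assuming a runtime with a global fetch (for example Node.js 18 or later). The helper names buildScrapeRequest and scrape are hypothetical; the endpoint and body are copied from the curl example above.

```javascript
// Endpoint copied from the curl examples in this document.
const SCRAPE_ENDPOINT = "https://production-sfo.browserless.io/scrape";

// Hypothetical helper: returns the URL and fetch() options for a /scrape call.
function buildScrapeRequest(token, body) {
  const url = `${SCRAPE_ENDPOINT}?token=${encodeURIComponent(token)}`;
  const init = {
    method: "POST",
    headers: { "content-type": "application/json" },
    body: JSON.stringify(body),
  };
  return { url, init };
}

// Hypothetical helper: sends the request and returns the parsed JSON response.
async function scrape(token, body) {
  const { url, init } = buildScrapeRequest(token, body);
  const res = await fetch(url, init);
  if (!res.ok) {
    throw new Error(`scrape failed with HTTP ${res.status}`);
  }
  return res.json();
}

// Build (but do not send) a request that uses gotoOptions.
const exampleRequest = buildScrapeRequest("YOUR_API_TOKEN_HERE", {
  url: "https://example.com/",
  elements: [{ selector: "h1" }],
  gotoOptions: { timeout: 10000, waitUntil: "networkidle2" },
});
console.log(exampleRequest.url);
```

To send it, call scrape("YOUR_API_TOKEN_HERE", body) with a real token and await the result.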

Custom behavior with waitFor options

Sometimes it's helpful to perform further actions, or wait for custom events on the page, before extracting data. The waitFor properties allow this.

waitForTimeout

Waits for the given number of milliseconds before continuing execution.

curl --request POST \
  --url 'https://production-sfo.browserless.io/scrape?token=YOUR_API_TOKEN_HERE' \
  --header 'content-type: application/json' \
  --data '{
    "url": "https://example.com/",
    "elements": [
      {
        "selector": "h1"
      }
    ],
    "waitForTimeout": 1000
  }'

waitForSelector

Waits for a selector to appear on the page. If the selector already exists when the method is called, it returns immediately. If the selector does not appear within timeout milliseconds, the method throws an exception.

Example

curl --request POST \
  --url 'https://production-sfo.browserless.io/scrape?token=YOUR_API_TOKEN_HERE' \
  --header 'content-type: application/json' \
  --data '{
    "url": "https://example.com/",
    "elements": [
      {
        "selector": "h1"
      }
    ],
    "waitForSelector": {
      "selector": "h1",
      "timeout": 5000
    }
  }'

waitForFunction

Waits for the provided function to return before continuing. The function can be any valid JS function, including async functions.

Example

JS function

async () => {
  const res = await fetch('https://jsonplaceholder.typicode.com/todos/1');
  const json = await res.json();

  document.querySelector("h1").innerText = json.title;
}

curl --request POST \
  --url 'https://production-sfo.browserless.io/scrape?token=YOUR_API_TOKEN_HERE' \
  --header 'content-type: application/json' \
  --data '{
    "url": "https://example.com/",
    "elements": [
      {
        "selector": "h1"
      }
    ],
    "waitForFunction": {
      "fn": "async()=>{let t=await fetch('\''https://jsonplaceholder.typicode.com/todos/1'\''),e=await t.json();document.querySelector('\''h1'\'').innerText=e.title}",
      "timeout": 5000
    }
  }'
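
Hand-escaping a function for the shell, as in the curl command above, is error-prone. When the request is built in JavaScript, one option is to serialize the function with Function.prototype.toString() and let JSON.stringify handle the quoting. This is a minimal sketch: withWaitForFunction is a hypothetical helper, and it assumes the API accepts the unminified source text just as it accepts the minified form.

```javascript
// The function to run in the page. It is only serialized here, never called,
// so browser-only names such as document are not needed in this process.
const updateHeading = async () => {
  const res = await fetch("https://jsonplaceholder.typicode.com/todos/1");
  const json = await res.json();
  document.querySelector("h1").innerText = json.title;
};

// Hypothetical helper: returns a copy of body with a waitForFunction entry
// whose fn is the function's source text.
function withWaitForFunction(body, fn, timeout) {
  return { ...body, waitForFunction: { fn: fn.toString(), timeout } };
}

const waitBody = withWaitForFunction(
  { url: "https://example.com/", elements: [{ selector: "h1" }] },
  updateHeading,
  5000
);

// JSON.stringify escapes the quotes and newlines inside fn.
console.log(JSON.stringify(waitBody));
```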

waitForEvent

Waits for an event to happen on the page before continuing.

Example

curl --request POST \
  --url 'https://production-sfo.browserless.io/scrape?token=YOUR_API_TOKEN_HERE' \
  --header 'content-type: application/json' \
  --data '{
    "url": "https://example.com/",
    "elements": [
      {
        "selector": "h1"
      }
    ],
    "waitForEvent": {
      "event": "fullscreenchange",
      "timeout": 5000
    }
  }'