Version: v2

/scrape API

Info: Browserless V2 is currently available in production on two domains: production-sfo.browserless.io and production-lon.browserless.io.

The /scrape API returns the contents of a page as a structured JSON response, based on the selectors you specify. You can also set a timeout for elements that are added to the page asynchronously.

By default, the API navigates to the specified URL, waits for the page to load (including parsing and executing JavaScript), then waits up to 30 seconds for the elements to appear. All of these behaviors are configurable and documented in detail below.

At a minimum, you'll need to specify a url and an elements array.
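
As a rough sketch of that minimum, the snippet below builds a request from a url and a list of selectors and sends it with fetch. The helper names buildScrapeRequest and scrape are hypothetical, the default host is one of the two domains listed above, and rejecting an empty selector list is this sketch's own choice rather than documented API behavior.

```javascript
// Build the endpoint and fetch() options for a /scrape call.
// `buildScrapeRequest` and `scrape` are hypothetical helper names.
function buildScrapeRequest(token, url, selectors, host = "production-sfo.browserless.io") {
  if (!url) throw new Error("url is required");
  // Assumption: an empty selector list is treated as a caller mistake.
  if (!Array.isArray(selectors) || selectors.length === 0) {
    throw new Error("at least one selector is required");
  }
  return {
    endpoint: `https://${host}/scrape?token=${encodeURIComponent(token)}`,
    init: {
      method: "POST",
      headers: { "Content-Type": "application/json" },
      body: JSON.stringify({
        url,
        elements: selectors.map((selector) => ({ selector })),
      }),
    },
  };
}

// Send the request with the global fetch (Node 18+ or a browser).
async function scrape(token, url, selectors) {
  const { endpoint, init } = buildScrapeRequest(token, url, selectors);
  const res = await fetch(endpoint, init);
  if (!res.ok) throw new Error(`scrape failed with HTTP ${res.status}`);
  return res.json();
}
```

Keeping the request builder separate from the network call makes the payload easy to inspect or log before anything is sent.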

You can check the full Open API schema here.

Note: If the /scrape API is getting blocked by bot detectors, we recommend trying BrowserQL.

Basic Usage

Below is the most basic usage, where we navigate to the browserless.io website (waiting for page load) and parse out all h1 elements.

Internally we use document.querySelectorAll to retrieve all matches on a page. Using a more specific selector can narrow down the returned results.

JSON Payload

{
  "url": "https://browserless.io/",
  "elements": [
    { "selector": "h1" }
  ]
}

cURL Request

curl -X POST \
  'https://production-sfo.browserless.io/scrape?token=MY_API_TOKEN' \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://browserless.io/",
    "elements": [
      { "selector": "h1" }
    ]
  }'

Response Example

{
  "data": [
    {
      "results": [
        {
          "attributes": [
            { "name": "class", "value": "..." }
          ],
          "height": 120,
          "html": "Headless browser automation, without the hosting headaches",
          "left": 32,
          "text": "Headless browser automation, without the hosting headaches",
          "top": 196,
          "width": 736
        }
      ],
      "selector": "h1"
    }
  ]
}
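
To show how this response shape might be consumed, here is a minimal sketch that collapses it into a map from selector to matched text and looks up a single attribute. The function names textsBySelector and attributeOf are hypothetical, and the sample object is a trimmed copy of the response example above.

```javascript
// Collapse a /scrape response into { selector: [text, ...] }.
function textsBySelector(response) {
  const out = {};
  for (const entry of response.data ?? []) {
    out[entry.selector] = (entry.results ?? []).map((r) => r.text);
  }
  return out;
}

// Look up one attribute value on a single result, or undefined if absent.
function attributeOf(result, name) {
  const hit = (result.attributes ?? []).find((a) => a.name === name);
  return hit ? hit.value : undefined;
}

// Trimmed copy of the response example above.
const sample = {
  data: [
    {
      results: [
        {
          attributes: [{ name: "class", value: "..." }],
          text: "Headless browser automation, without the hosting headaches",
        },
      ],
      selector: "h1",
    },
  ],
};

const texts = textsBySelector(sample);
const cls = attributeOf(sample.data[0].results[0], "class");
```

Because each entry in data corresponds to one requested selector, a selector that matches nothing would presumably show up with an empty results array; the sketch handles that case by returning an empty list.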

Specifying page-load behavior

The /scrape API lets you control page-load behavior by setting gotoOptions in the JSON body. This object is passed directly into Puppeteer's goto() method.

In the example below, we'll set a waitUntil property and a timeout.

JSON Payload

{
  "url": "https://example.com/",
  "elements": [
    { "selector": "h1" }
  ],
  "gotoOptions": {
    "timeout": 10000,
    "waitUntil": "networkidle2"
  }
}

cURL Request

curl -X POST \
  'https://production-sfo.browserless.io/scrape?token=MY_API_TOKEN' \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/",
    "elements": [
      { "selector": "h1" }
    ],
    "gotoOptions": {
      "timeout": 10000,
      "waitUntil": "networkidle2"
    }
  }'
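
Since gotoOptions is handed to Puppeteer's goto(), a typo in waitUntil only surfaces once the request runs. The sketch below checks the value client-side against the lifecycle events Puppeteer's goto() accepts (load, domcontentloaded, networkidle0, networkidle2) before attaching it to a payload. The helper name withGotoOptions is hypothetical, and whether the API itself validates these values is not covered here.

```javascript
// Lifecycle events accepted by Puppeteer's page.goto() `waitUntil` option.
const WAIT_UNTIL = ["load", "domcontentloaded", "networkidle0", "networkidle2"];

// Return a copy of `payload` with validated gotoOptions attached.
// `withGotoOptions` is a hypothetical helper, not part of the API.
function withGotoOptions(payload, gotoOptions) {
  // waitUntil may be a single event or an array of events.
  const events = [].concat(gotoOptions.waitUntil ?? []);
  for (const event of events) {
    if (!WAIT_UNTIL.includes(event)) {
      throw new Error(`unknown waitUntil value: ${event}`);
    }
  }
  return { ...payload, gotoOptions };
}

const gotoPayload = withGotoOptions(
  { url: "https://example.com/", elements: [{ selector: "h1" }] },
  { timeout: 10000, waitUntil: "networkidle2" }
);
```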

Custom behavior with waitFor options

Sometimes it's helpful to perform further actions, or wait for custom events on the page, before getting data. The waitFor properties allow this.

waitForTimeout

Waits for the given number of milliseconds before continuing execution.

JSON Payload

{
  "url": "https://example.com/",
  "elements": [
    { "selector": "h1" }
  ],
  "waitForTimeout": 1000
}

cURL Request

curl -X POST \
  'https://production-sfo.browserless.io/scrape?token=MY_API_TOKEN' \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/",
    "elements": [
      { "selector": "h1" }
    ],
    "waitForTimeout": 1000
  }'

waitForSelector

Waits for a selector to appear in the page. If the selector already exists when the wait begins, it returns immediately. If the selector doesn't appear within timeout milliseconds, an exception is thrown.

Example

JSON payload

{
  "url": "https://example.com/",
  "elements": [
    { "selector": "h1" }
  ],
  "waitForSelector": {
    "selector": "h1",
    "timeout": 5000
  }
}

cURL request

curl -X POST \
  'https://production-sfo.browserless.io/scrape?token=MY_API_TOKEN' \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/",
    "elements": [
      { "selector": "h1" }
    ],
    "waitForSelector": {
      "selector": "h1",
      "timeout": 5000
    }
  }'

waitForFunction

Waits for the provided function to return before continuing. The function can be any valid JS function, including async functions.

Example

JS function

async () => {
  const res = await fetch('https://jsonplaceholder.typicode.com/todos/1');
  const json = await res.json();

  document.querySelector("h1").innerText = json.title;
}

JSON payload

{
  "url": "https://example.com/",
  "elements": [
    { "selector": "h1" }
  ],
  "waitForFunction": {
    "fn": "async()=>{let t=await fetch('https://jsonplaceholder.typicode.com/todos/1'),e=await t.json();document.querySelector('h1').innerText=e.title}",
    "timeout": 5000
  }
}
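
Hand-minifying a function into the fn string is easy to get wrong. One possible approach, sketched below, is to write the function as ordinary source and serialize it with Function.prototype.toString(), letting JSON.stringify handle the quote escaping. The names setTitleFromApi and withWaitForFunction are hypothetical, and the sketch assumes the API accepts unminified function source in fn.

```javascript
// Written as ordinary source so it can be linted and formatted.
// It is evaluated in the page, not in Node, so it must not rely on
// variables from the surrounding scope.
const setTitleFromApi = async () => {
  const res = await fetch("https://jsonplaceholder.typicode.com/todos/1");
  const json = await res.json();
  document.querySelector("h1").innerText = json.title;
};

// Return a copy of `payload` with the function's source text attached.
// `withWaitForFunction` is a hypothetical helper name.
function withWaitForFunction(payload, fn, timeout) {
  return { ...payload, waitForFunction: { fn: fn.toString(), timeout } };
}

const fnPayload = withWaitForFunction(
  { url: "https://example.com/", elements: [{ selector: "h1" }] },
  setTitleFromApi,
  5000
);

// JSON.stringify escapes the quotes and newlines in the function source.
const fnBody = JSON.stringify(fnPayload);
```

Only the source text travels over the wire, so any value the function needs has to be written into its body as a literal.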

cURL request

curl -X POST \
  'https://production-sfo.browserless.io/scrape?token=MY_API_TOKEN' \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/",
    "elements": [
      { "selector": "h1" }
    ],
    "waitForFunction": {
      "fn": "async()=>{let t=await fetch(\"https://jsonplaceholder.typicode.com/todos/1\"),e=await t.json();document.querySelector(\"h1\").innerText=e.title}",
      "timeout": 5000
    }
  }'

waitForEvent

Waits for an event to fire on the page before continuing.

Example

JSON payload

This request is expected to fail, since the event never fires.

{
  "url": "https://example.com/",
  "elements": [
    { "selector": "h1" }
  ],
  "waitForEvent": {
    "event": "fullscreenchange",
    "timeout": 5000
  }
}

cURL request

curl -X POST \
  'https://production-sfo.browserless.io/scrape?token=MY_API_TOKEN' \
  -H 'Cache-Control: no-cache' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/",
    "elements": [
      { "selector": "h1" }
    ],
    "waitForEvent": {
      "event": "fullscreenchange",
      "timeout": 5000
    }
  }'
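
The four waitFor properties differ only in the shape of their value, so a single helper can cover them. The sketch below is one way to do that; withWait and its kind names are hypothetical, the 5000 ms default simply mirrors the examples above, and the helper sets one option per call because this page does not say whether several can be combined in one request.

```javascript
// Return a copy of `payload` with exactly one waitFor option set.
// `withWait` and its `kind` names are hypothetical, not part of the API.
function withWait(payload, kind, value, timeout = 5000) {
  switch (kind) {
    case "timeout": // value: milliseconds to pause
      return { ...payload, waitForTimeout: value };
    case "selector": // value: CSS selector that must appear
      return { ...payload, waitForSelector: { selector: value, timeout } };
    case "function": // value: a function (or its source) run in the page
      return { ...payload, waitForFunction: { fn: String(value), timeout } };
    case "event": // value: name of the event to wait for
      return { ...payload, waitForEvent: { event: value, timeout } };
    default:
      throw new Error(`unknown wait kind: ${kind}`);
  }
}

const basePayload = { url: "https://example.com/", elements: [{ selector: "h1" }] };
const withPause = withWait(basePayload, "timeout", 1000);
const withSelector = withWait(basePayload, "selector", "h1");
const withEvent = withWait(basePayload, "event", "fullscreenchange", 3000);
```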