Scrape structured data

Extract structured data from fully rendered JavaScript pages using Browserless.

Prerequisites

A Browserless API token from your account dashboard

Steps

AI Agent
REST API
Frameworks
BQL

AI Agent

Use the Browserless MCP server to scrape structured data from a webpage with any MCP-compatible AI agent (Claude Desktop, Cursor, Windsurf, ChatGPT, etc.).

1. Connect the MCP server

Send this prompt to your AI agent to install the Browserless MCP server:

Go to https://github.com/browserless/browserless-mcp/blob/main/install.md
and follow the instructions to install the Browserless MCP server
for my client.

2. Scrape a page

Use browserless_smartscraper. It extracts page content in one call with automatic bot-protection handling.

Use the browserless_smartscraper tool to scrape the main content
of https://scraping-sandbox.netlify.app/products as markdown

3. Scrape after interaction

Some pages paginate content, so you have to click through to load every item. Use browserless_agent for pages like these.

Use the browserless_agent tool to navigate to
https://scraping-sandbox.netlify.app/products,
click the "Next" button to load additional pages of products,
then scrape the full page content as markdown

REST API

Use the /scrape REST endpoint to extract structured data from a page. No WebSocket connection is needed.

cURL
JavaScript
Python
Java
C#

cURL

1. Build the request

Append your token to the scrape endpoint as the token query parameter. The selectors you want to extract go in the JSON request body in the next step:

https://production-sfo.browserless.io/scrape?token=YOUR_API_TOKEN_HERE

2. Send the request

curl -X POST \
  "https://production-sfo.browserless.io/scrape?token=YOUR_API_TOKEN_HERE" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://scraping-sandbox.netlify.app/products",
    "elements": [
      { "selector": "h1" },
      { "selector": ".product-title" }
    ]
  }'

3. Check the output

The response is JSON with a data array. Each item corresponds to one selector and lists the matched elements with their attributes, text, HTML, dimensions (width, height), and position (top, left):

{
  "data": [
    {
      "selector": "h1",
      "results": [
        {
          "attributes": [],
          "height": 48,
          "html": "Product Catalog",
          "left": 192,
          "text": "Product Catalog",
          "top": 120,
          "width": 816
        }
      ]
    },
    {
      "selector": ".product-title",
      "results": [
        {
          "attributes": [{ "name": "class", "value": "product-title" }],
          "height": 20,
          "html": "ProBass A100 Wireless Headphones",
          "left": 340,
          "text": "ProBass A100 Wireless Headphones",
          "top": 480,
          "width": 250
        },
        {
          "attributes": [{ "name": "class", "value": "product-title" }],
          "height": 20,
          "html": "Aura Smart Ring Gen 4",
          "left": 620,
          "text": "Aura Smart Ring Gen 4",
          "top": 480,
          "width": 250
        }
      ]
    }
  ]
}
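When only the text matters, the response can be reduced to a simple mapping. The following is a minimal Python sketch, assuming the response shape shown above; the helper name texts_by_selector is hypothetical and not part of any Browserless client library, and the sample is trimmed to the fields it uses.

```python
import json

# Sample response copied from the example output above,
# trimmed to the "selector" and "text" fields used here.
sample_response = json.loads("""
{
  "data": [
    {"selector": "h1", "results": [{"text": "Product Catalog"}]},
    {"selector": ".product-title", "results": [
      {"text": "ProBass A100 Wireless Headphones"},
      {"text": "Aura Smart Ring Gen 4"}
    ]}
  ]
}
""")


def texts_by_selector(response):
    """Map each selector to the list of text values of its matched elements.

    Hypothetical helper for illustration only.
    """
    return {
        item["selector"]: [match["text"] for match in item.get("results", [])]
        for item in response.get("data", [])
    }


print(texts_by_selector(sample_response))
```

A selector that matches nothing is assumed here to come back with an empty results list, which this sketch turns into an empty list for that selector.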

JavaScript

1. Send the request

const response = await fetch(
  'https://production-sfo.browserless.io/scrape?token=YOUR_API_TOKEN_HERE',
  {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({
      url: 'https://scraping-sandbox.netlify.app/products',
      elements: [
        { selector: 'h1' },
        { selector: '.product-title' },
      ],
    }),
  }
);

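// Stop early on HTTP errors instead of parsing an error body.
if (!response.ok) {
  throw new Error(`Scrape failed: ${response.status} ${response.statusText}`);
}

// `data` holds one entry per selector in the request.
const { data } = await response.json();
console.log(data);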

2. Check the output

Run the script with node scrape.mjs. The extracted data is logged to the console as a structured JSON array.

Python

1. Install dependencies

pip install requests

2. Send the request

import requests

response = requests.post(
    'https://production-sfo.browserless.io/scrape?token=YOUR_API_TOKEN_HERE',
    json={
        'url': 'https://scraping-sandbox.netlify.app/products',
        'elements': [
            {'selector': 'h1'},
            {'selector': '.product-title'},
        ],
    },
)

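# Raise on HTTP error responses before parsing the body.
response.raise_for_status()

# 'data' holds one entry per selector in the request.
data = response.json()['data']
print(data)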

3. Check the output

Run the script with python scrape.py. The extracted data is printed as a list of selector results.
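If you scrape several pages or selector sets, it can help to build the endpoint and payload in one place. This is a sketch under stated assumptions, not part of the Browserless API: build_scrape_request and SCRAPE_ENDPOINT are hypothetical names, and the token is URL-encoded with the standard library in case it contains reserved characters.

```python
from urllib.parse import urlencode

# Same regional endpoint used in the examples above.
SCRAPE_ENDPOINT = "https://production-sfo.browserless.io/scrape"


def build_scrape_request(token, page_url, selectors):
    """Return (endpoint, payload) for a /scrape call.

    Hypothetical helper: the token goes in the query string and each
    CSS selector becomes one entry in the "elements" array.
    """
    endpoint = f"{SCRAPE_ENDPOINT}?{urlencode({'token': token})}"
    payload = {
        "url": page_url,
        "elements": [{"selector": selector} for selector in selectors],
    }
    return endpoint, payload


endpoint, payload = build_scrape_request(
    "YOUR_API_TOKEN_HERE",
    "https://scraping-sandbox.netlify.app/products",
    ["h1", ".product-title"],
)
print(endpoint)
print(payload)
```

The returned values can be passed straight to the call shown in the script above, for example requests.post(endpoint, json=payload).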

Java

1. Add dependencies

<!-- https://kong.github.io/unirest-java/ -->
<dependency>
  <groupId>com.konghq</groupId>
  <artifactId>unirest-java</artifactId>
  <version>3.14.5</version>
</dependency>

2. Send the request

import kong.unirest.HttpResponse;
import kong.unirest.Unirest;

public class Scrape {
    public static void main(String[] args) {
        String url = "https://production-sfo.browserless.io/scrape";
        String token = "YOUR_API_TOKEN_HERE";
        String endpoint = String.format("%s?token=%s", url, token);

        // POST the target URL and the selectors to extract as a JSON body.
        HttpResponse<String> response = Unirest.post(endpoint)
            .header("Content-Type", "application/json")
            .body("{\"url\": \"https://scraping-sandbox.netlify.app/products\", \"elements\": [{\"selector\": \"h1\"}, {\"selector\": \".product-title\"}]}")
            .asString();

        System.out.println(response.getBody());

        // Release Unirest's background resources so the JVM can exit.
        Unirest.shutDown();
    }
}

3. Check the output

Run the class. The response body is structured JSON with matched elements for each selector.

C#

1. Send the request

using System; // Console (needed when implicit usings are disabled)
using System.Net.Http;
using System.Text;
using System.Text.Json;

string url = "https://production-sfo.browserless.io/scrape";
string token = "YOUR_API_TOKEN_HERE";
string endpoint = $"{url}?token={token}";

var payload = new
{
    url = "https://scraping-sandbox.netlify.app/products",
    elements = new[] { new { selector = "h1" }, new { selector = ".product-title" } },
};

using (HttpClient httpClient = new HttpClient())
{
    var jsonPayload = JsonSerializer.Serialize(payload);
    var content = new StringContent(jsonPayload, Encoding.UTF8, "application/json");
    var response = await httpClient.PostAsync(endpoint, content);
    string responseBody = await response.Content.ReadAsStringAsync();
    Console.WriteLine(responseBody);
}

2. Check the output

Run the program. The response body is structured JSON with matched elements for each selector.

Frameworks

Use a browser connection to evaluate the fully rendered DOM and extract data with custom logic.

Puppeteer
Playwright

Puppeteer

1. Install dependencies

npm install puppeteer-core

2. Connect and extract

import puppeteer from 'puppeteer-core';

const browser = await puppeteer.connect({
  browserWSEndpoint: 'wss://production-sfo.browserless.io?token=YOUR_API_TOKEN_HERE',
});

try {
  const page = await browser.newPage();
  await page.goto('https://scraping-sandbox.netlify.app/products', { waitUntil: 'networkidle2' });

  const data = await page.evaluate(() => ({
    heading: document.querySelector('h1')?.textContent,
    products: [...document.querySelectorAll('.product-title')].map((el) => el.textContent),
  }));

  console.log(data);
} finally {
  // Always close to release the session even on error.
  await browser.close();
}

3. Check the output

Run the script with node scrape.mjs. The extracted data is logged to the console.

Playwright

1. Install dependencies

npm install playwright-core

2. Connect and extract

import { chromium } from 'playwright-core';

const browser = await chromium.connect(
  'wss://production-sfo.browserless.io/chromium/playwright?token=YOUR_API_TOKEN_HERE'
);

try {
  const page = await browser.newPage();
  await page.goto('https://scraping-sandbox.netlify.app/products', { waitUntil: 'networkidle' });

  const data = await page.evaluate(() => ({
    heading: document.querySelector('h1')?.textContent,
    products: [...document.querySelectorAll('.product-title')].map((el) => el.textContent),
  }));

  console.log(data);
} finally {
  // Always close to release the session even on error.
  await browser.close();
}

3. Check the output

Run the script with node scrape.mjs. The extracted data is logged to the console.

BQL

1. Write the mutation

Navigate to the page and map elements to structured text using CSS selectors:

mutation Scrape {
  goto(url: "https://scraping-sandbox.netlify.app/products", waitUntil: domContentLoaded) {
    status
  }

  heading: mapSelector(selector: "h1") {
    innerText
  }

  products: mapSelector(selector: ".product-title") {
    innerText
  }
}

2. Run it

Paste the mutation into the BQL IDE and click Run.

3. Check the output

The response returns structured JSON with text for each matched element:

{
  "data": {
    "goto": { "status": 200 },
    "heading": [{ "innerText": "Product Catalog" }],
    "products": [
      { "innerText": "ProBass A100 Wireless Headphones" },
      { "innerText": "Aura Smart Ring Gen 4" }
    ]
  }
}
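If you consume the BQL result from your own code, the aliased fields (heading, products) can be unpacked directly. The following is a minimal Python sketch, assuming the response shape shown above; extract_catalog is a hypothetical helper name and the sample JSON is copied from the example output.

```python
import json

# Sample BQL response copied from the example output above.
bql_response = json.loads("""
{
  "data": {
    "goto": { "status": 200 },
    "heading": [{ "innerText": "Product Catalog" }],
    "products": [
      { "innerText": "ProBass A100 Wireless Headphones" },
      { "innerText": "Aura Smart Ring Gen 4" }
    ]
  }
}
""")


def extract_catalog(response):
    """Unpack the aliased mapSelector results into plain Python values.

    Hypothetical helper for illustration only.
    """
    data = response.get("data", {})
    heading_nodes = data.get("heading", [])
    return {
        "status": data.get("goto", {}).get("status"),
        # mapSelector returns a list, so take the first match if there is one.
        "heading": heading_nodes[0]["innerText"] if heading_nodes else None,
        "products": [node["innerText"] for node in data.get("products", [])],
    }


print(extract_catalog(bql_response))
```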

Next steps

Take a Screenshot: capture the page visually
Fill and Submit a Form: automate form interactions before scraping
Authenticated Sessions: scrape pages that require login
