Data Scraping and Extraction

BrowserQL offers three main approaches to extracting data: grab the page HTML with the html mutation, map DOM elements to structured JSON with mapSelector or querySelectorAll, or intercept raw API responses with the response mutation. Choose the approach that fits your downstream processing needs.

Basic Usage

The html mutation returns the full page HTML. Wait for the page to load before extracting to avoid empty results.

mutation ExtractHTML {
  goto(url: "https://www.browserless.io/", waitUntil: domContentLoaded) {
    status
  }

  html {
    html
  }
}
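To run a mutation like the one above from a script, it can be posted as a regular GraphQL-over-HTTP request. The sketch below is a minimal illustration using only the Python standard library. The endpoint argument is a placeholder for your own BQL URL (including whatever authentication it requires), and the helper names build_request and extract_html are hypothetical, not part of BrowserQL.

```python
import json
import urllib.request

# The same mutation as above, kept as a plain string.
EXTRACT_HTML = """
mutation ExtractHTML {
  goto(url: "https://www.browserless.io/", waitUntil: domContentLoaded) {
    status
  }
  html {
    html
  }
}
"""


def build_request(endpoint: str, query: str) -> urllib.request.Request:
    """Wrap a BQL mutation in a JSON POST body of the form {"query": ...}."""
    body = json.dumps({"query": query}).encode("utf-8")
    return urllib.request.Request(
        endpoint,  # assumption: your BQL endpoint URL, supplied by the caller
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_html(endpoint: str) -> str:
    """Send the mutation and return the page HTML from data.html.html."""
    with urllib.request.urlopen(build_request(endpoint, EXTRACT_HTML)) as resp:
        payload = json.load(resp)
    return payload["data"]["html"]["html"]
```

Only extract_html touches the network; build_request can be inspected or logged without sending anything.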

Targeting a Specific Element

Pass a selector to return HTML from a single element instead of the full page:

html(selector: ".navbar_container") {
  html
}

Cleaning the HTML

The clean argument strips non-text nodes (scripts, video, canvas), DOM attributes, and excess whitespace. It can reduce payload size by up to 1,000x:

html(clean: {
  removeAttributes: true
  removeNonTextNodes: true
}) {
  html
}
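To get a feel for what this kind of cleaning removes, the sketch below approximates it locally with Python's html.parser. It is not BQL's implementation: the NON_TEXT_TAGS set and the rough_clean helper are assumptions made for illustration, and the real clean argument may treat edge cases differently.

```python
import re
from html import escape
from html.parser import HTMLParser

# Assumed list of tags whose content is not readable text.
NON_TEXT_TAGS = {"script", "style", "video", "canvas", "svg", "noscript"}


class _RoughCleaner(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a non-text element

    def handle_starttag(self, tag, attrs):
        if tag in NON_TEXT_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.parts.append(f"<{tag}>")  # attributes are dropped

    def handle_endtag(self, tag):
        if tag in NON_TEXT_TAGS:
            self.skip_depth = max(0, self.skip_depth - 1)
        elif self.skip_depth == 0:
            self.parts.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            # Collapse whitespace runs to a single space, then re-escape the text.
            self.parts.append(escape(re.sub(r"\s+", " ", data), quote=False))


def rough_clean(html_text: str) -> str:
    """Drop non-text elements and all attributes; collapse whitespace."""
    cleaner = _RoughCleaner()
    cleaner.feed(html_text)
    cleaner.close()
    return "".join(cleaner.parts)
```

Comparing len(rough_clean(page)) with len(page) on a saved page gives a rough sense of how much of the payload is markup rather than text.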

Creating JSON with mapSelector

mapSelector is designed for pages with repetitive, hierarchical structure: product listings, comment threads, search results, or any repeating pattern. It iterates over a NodeList, similar to document.querySelectorAll, and returns a structured array of objects. Use it to extract attributes, text content, or nested elements.

The query below navigates to Hacker News and extracts the href of every post link:

mutation ScrapeHackerNews {
  goto(
    url: "https://news.ycombinator.com"
    waitUntil: firstContentfulPaint
  ) {
    status
  }

  posts: mapSelector(selector: ".submission .titleline > a", wait: true) {
    link: attribute(name: "href") {
      value
    }
  }
}

Response

{
  "data": {
    "posts": [
      { "link": { "value": "https://churchofturing.github.io/landscapeoflisp.html" } },
      { "link": { "value": "https://www.jjj.de/fxt/fxtbook.pdf" } },
      { "link": { "value": "https://ereader-swedish.fly.dev/" } }
    ]
  }
}
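Downstream code usually wants a flat list rather than the nested link/value objects. A minimal sketch in Python, assuming the response has exactly the shape shown above (extract_links is a hypothetical helper name):

```python
import json

# The response from above, as it would arrive over the wire.
RESPONSE = """
{ "data": { "posts": [
  { "link": { "value": "https://churchofturing.github.io/landscapeoflisp.html" } },
  { "link": { "value": "https://www.jjj.de/fxt/fxtbook.pdf" } },
  { "link": { "value": "https://ereader-swedish.fly.dev/" } }
] } }
"""


def extract_links(payload: dict) -> list[str]:
    """Flatten data.posts[].link.value into a list of URLs."""
    posts = payload.get("data", {}).get("posts") or []  # tolerate a null result
    return [
        post["link"]["value"]
        for post in posts
        if post.get("link") and post["link"].get("value")
    ]


links = extract_links(json.loads(RESPONSE))
```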

Nest mapSelector calls to traverse hierarchical DOM structures. The example below extracts post authors and scores from each submission:

mutation ScrapeHackerNewsMetadata {
  goto(url: "https://news.ycombinator.com") {
    status
  }

  posts: mapSelector(selector: ".subtext .subline") {
    author: mapSelector(selector: ".hnuser") {
      authorName: innerText
    }

    score: mapSelector(selector: ".score") {
      score: innerText
    }
  }
}

When elements match, mapSelector returns an array, whether one or many matched. When no elements are found, it returns null, so check for null before iterating over the result.
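Because every nested mapSelector yields either an array or null, consumer code has to unwrap both levels. The Python sketch below shows one way to do that for the metadata query. The sample payload is hypothetical: its shape is inferred from the aliases in the query (posts, author, authorName, score), and the "120 points" text format is an assumption about the page, not something BQL guarantees.

```python
import re

# Hypothetical payload; the second post has no matching .score element.
SAMPLE = {
    "data": {
        "posts": [
            {"author": [{"authorName": "alice"}], "score": [{"score": "120 points"}]},
            {"author": [{"authorName": "bob"}], "score": None},
        ]
    }
}


def first_field(items, field):
    """Return `field` from the first mapped element, or None for null or empty results."""
    if not items:
        return None
    return items[0].get(field)


def parse_points(text):
    """Pull the leading integer out of text such as '120 points'."""
    match = re.search(r"\d+", text or "")
    return int(match.group()) if match else None


def flatten_posts(payload):
    """Turn the nested arrays into one flat dict per post."""
    rows = []
    for post in payload["data"]["posts"] or []:
        rows.append({
            "author": first_field(post.get("author"), "authorName"),
            "points": parse_points(first_field(post.get("score"), "score")),
        })
    return rows


rows = flatten_posts(SAMPLE)
```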

Scraping Network Responses

The response mutation captures HTTP responses the browser receives, filtered by URL pattern, method, or resource type. BQL waits for the response automatically.

The example below captures the raw document response from a page load:

mutation DocumentResponses {
  goto(url: "https://example.com/", waitUntil: load) {
    status
  }

  response(type: document) {
    url
    body
    headers {
      name
      value
    }
  }
}

Combine filters to narrow the results further. The example below captures only XHR responses to GET requests; operator: and requires both the type and method conditions to match:

mutation AJAXGetCalls {
  goto(url: "https://msn.com/", waitUntil: load) {
    status
  }

  response(type: xhr, method: GET, operator: and) {
    url
    type
    method
    body
    headers {
      name
      value
    }
  }
}
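Each captured entry carries its headers as a list of name/value pairs and its body as text, so a little post-processing is usually needed before the data is usable. The Python sketch below works on a single entry; it assumes body is a string and that JSON payloads announce themselves through Content-Type. The sample entry and the helper names are hypothetical.

```python
import json

# Hypothetical captured entry, using the fields selected in the query above.
ENTRY = {
    "url": "https://example.com/api/items",
    "method": "GET",
    "body": '{"items": [1, 2, 3]}',
    "headers": [
        {"name": "Content-Type", "value": "application/json; charset=utf-8"},
        {"name": "Cache-Control", "value": "no-store"},
    ],
}


def headers_to_dict(headers):
    """Turn [{name, value}, ...] into a dict keyed by lowercased header name."""
    return {h["name"].lower(): h["value"] for h in headers or []}


def decode_body(entry):
    """Parse the body as JSON when Content-Type says so; otherwise return it unchanged."""
    content_type = headers_to_dict(entry.get("headers")).get("content-type", "")
    body = entry.get("body")
    if body is not None and "json" in content_type.lower():
        return json.loads(body)
    return body


data = decode_body(ENTRY)
```

Lowercasing the header names avoids missing a header because of casing differences between servers.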

Using querySelectorAll

The querySelectorAll mutation returns an array of matched elements with their HTML properties. Use it when you need fast element extraction without the nested mapping of mapSelector.

mutation FindLinks {
  goto(url: "https://browserless.io") {
    status
  }

  links: querySelectorAll(selector: "a") {
    innerText
    outerHTML
  }
}

Each result includes innerHTML, innerText, outerHTML, id, className, and childElementCount. Use innerText to get visible text, or outerHTML to get the full element markup.
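Since the listed fields do not include element attributes such as href, one option is to read them out of outerHTML on the client side. The Python sketch below does that for the links alias from the query above. The sample data and helper names are hypothetical, and the result shape (a list of objects with innerText and outerHTML) is inferred from the query.

```python
from html.parser import HTMLParser

# Hypothetical data.links value; the second link wraps an image and has no text.
LINKS = [
    {"innerText": " Docs \n", "outerHTML": '<a href="/docs" class="nav">Docs</a>'},
    {"innerText": "", "outerHTML": '<a href="/"><img src="logo.svg"></a>'},
]


class _FirstTagAttrs(HTMLParser):
    """Record the attributes of the first start tag encountered."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if self.attrs is None:
            self.attrs = dict(attrs)


def href_of(outer_html):
    """Return the href of the outermost element, or None if it has none."""
    parser = _FirstTagAttrs()
    parser.feed(outer_html)
    parser.close()
    return (parser.attrs or {}).get("href")


def summarize_links(links):
    """Keep links with visible text, pairing that text with the href."""
    rows = []
    for link in links or []:
        text = " ".join((link.get("innerText") or "").split())
        if text:
            rows.append({"text": text, "href": href_of(link.get("outerHTML") or "")})
    return rows


summary = summarize_links(LINKS)
```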

Processing Data with JavaScript

The evaluate mutation runs JavaScript in the browser context and returns the result. Use it when you need calculations, filtering, or transformations that go beyond what mapSelector or querySelectorAll support.

mutation CountHeadings {
  goto(url: "https://browserless.io") {
    status
  }

  headingCount: evaluate(
    content: "document.querySelectorAll('h2').length"
  ) {
    value
  }
}

The content field accepts any JavaScript expression or block. Wrap multi-step logic in a function body and use return to pass values back. For examples using await, external scripts, and complex transformations, see Multi-line JavaScript.
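When the JavaScript spans several lines, quoting it by hand inside the mutation gets error-prone. The Python sketch below builds the mutation text instead, using json.dumps to produce the string literals, since the escapes it emits (\", \\, \n, \uXXXX) are also valid in GraphQL strings. The immediately invoked function and the JSON.stringify return value are assumptions about a convenient way to pass multi-step results back, and build_evaluate_mutation is a hypothetical helper.

```python
import json

# Multi-step logic wrapped in a function body that returns a value.
JS_SNIPPET = """
(() => {
  const headings = Array.from(document.querySelectorAll('h2'));
  const texts = headings.map((h) => h.innerText.trim()).filter(Boolean);
  return JSON.stringify(texts);
})()
"""


def build_evaluate_mutation(url: str, js: str) -> str:
    """Return mutation text with `url` and `js` embedded as escaped string literals."""
    return (
        "mutation EvaluateSnippet {\n"
        f"  goto(url: {json.dumps(url)}) {{\n"
        "    status\n"
        "  }\n"
        f"  result: evaluate(content: {json.dumps(js)}) {{\n"
        "    value\n"
        "  }\n"
        "}\n"
    )


query = build_evaluate_mutation("https://browserless.io", JS_SNIPPET)
```

If the snippet returns a JSON string as above, calling json.loads on the returned value recovers the list on the Python side.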

Frequently Asked Questions

What methods does BrowserQL offer for data extraction?

BrowserQL provides HTML extraction, CSS selector-based queries via mapSelector and querySelectorAll, response interception to capture API data, and JavaScript evaluation for custom extraction logic.

Can BrowserQL scrape JavaScript-rendered pages?

Yes. BQL runs a full browser that executes JavaScript before extraction. This means single-page apps, dynamically loaded content, and client-side rendered pages are all fully supported.

How do I handle pagination when scraping?

Use BQL's click and waitForSelector mutations to navigate through pages, or intercept the underlying API responses that supply the paginated data. You can also use conditional logic to loop until no more pages remain.

Does BrowserQL support scraping behind login walls?

Yes. Use the type and click mutations to fill login forms, or inject cookies from a previous session with the setCookies mutation. BQL persists session state across queries when using session management.