Scrape and Extract Data

You have three main options for extracting data:

  • Grab the full or cleaned HTML to parse externally
  • Use the mapping function to generate structured JSON
  • Use requests to grab an API response

Responding with HTML

If you already have a parser set up, you can grab the HTML directly. You can target a selector such as body, or use our cleaning options. For example:

mutation clean_example {
  goto(url: "https://www.browserless.io/") {
    status
  }
  html(clean: {
    removeAttributes: true,
    removeNonTextNodes: true
  }) {
    html
  }
}
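Once the cleaned HTML comes back, parsing is up to you. As a minimal sketch using only the Python standard library (the sample payload below is hypothetical; its shape mirrors the goto and html fields of the mutation above):

```python
import json
from html.parser import HTMLParser

# Hypothetical BQL response, shaped after the clean_example mutation:
# the top-level keys under "data" mirror the mutation's field names.
raw = json.dumps({
    "data": {
        "goto": {"status": 200},
        "html": {"html": "<html><body><h1>Browserless</h1><p>Scrape anything.</p></body></html>"},
    }
})

class TextExtractor(HTMLParser):
    """Collects the non-empty text nodes from an HTML string."""
    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        if data.strip():
            self.chunks.append(data.strip())

extractor = TextExtractor()
extractor.feed(json.loads(raw)["data"]["html"]["html"])
print(extractor.chunks)  # ['Browserless', 'Scrape anything.']
```

In practice you would swap the hard-coded payload for the actual response body of your BQL request, and likely use a fuller parser than html.parser.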
Waiting for Elements

For more reliable data extraction, it's important to wait for the site and elements to be ready before scraping them. Learn more about BrowserQL's built-in wait methods in our Waiting for Things documentation.

Clean Features

The clean argument can remove non-text nodes, strip DOM attributes, and collapse excessive whitespace and newlines. On markup-heavy pages, using clean can reduce the payload size by up to roughly 1,000 times.

Creating a JSON with mapSelector

Our mapSelector mutation is an alternative to typical external parsing. It works like "map" in most functional programming languages, applied to the NodeList you would otherwise get from document.querySelectorAll.

To get arbitrary DOM attributes back, request them via the attribute(name: "data-custom-attribute") field, which returns an object with name and value properties.

To exemplify this feature, the query below does the following:

  1. Navigates to https://news.ycombinator.com.
  2. Creates a map called posts, finding all elements with the .submission .titleline > a selector.
  3. Returns an array of objects, one per element found. Each object is keyed by the alias given to the attribute (link here) and contains a value key holding the attribute's actual value, in this example the href of each element.
mutation scraping_example {
  goto(
    url: "https://news.ycombinator.com",
    waitUntil: firstContentfulPaint
  ) {
    status
  }

  posts: mapSelector(selector: ".submission .titleline > a", wait: true) {
    link: attribute(name: "href") {
      value
    }
  }
}
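The resulting JSON nests each attribute under its alias. A small client-side sketch (the URLs in the sample payload are hypothetical; the posts/link/value shape follows the aliases in the mutation above):

```python
import json

# Hypothetical BQL response for the scraping_example mutation:
# "posts" is a list with one object per matched element, and the
# "link" alias wraps the href attribute's value.
raw = json.dumps({
    "data": {
        "goto": {"status": 200},
        "posts": [
            {"link": {"value": "https://example.com/story-1"}},
            {"link": {"value": "https://example.com/story-2"}},
        ],
    }
})

posts = json.loads(raw)["data"]["posts"]
hrefs = [post["link"]["value"] for post in posts]
print(hrefs)  # ['https://example.com/story-1', 'https://example.com/story-2']
```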

You can also map further nested items: for instance, the query below gets all posts on a page, then nested mapSelector calls list each author and post score. The hierarchy of the response mirrors the hierarchical data modeled in the DOM.

mutation map_selector_example_with_metadata {
  goto(url: "https://news.ycombinator.com") {
    status
  }

  # Get all textual content
  posts: mapSelector(selector: ".subtext .subline") {
    # Get the author(s)
    author: mapSelector(selector: ".hnuser") {
      authorName: innerText
    }

    # Get the post score
    score: mapSelector(selector: ".score") {
      score: innerText
    }
  }
}
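Because nested mapSelector fields are themselves lists, flattening the hierarchy into rows is a straightforward walk. A sketch with a hypothetical payload (author names and scores below are made up; the nesting follows the query above):

```python
import json

# Hypothetical response for the nested query above: each entry in
# "posts" carries its own "author" and "score" sub-lists.
raw = json.dumps({
    "data": {
        "goto": {"status": 200},
        "posts": [
            {"author": [{"authorName": "alice"}], "score": [{"score": "120 points"}]},
            {"author": [{"authorName": "bob"}], "score": []},  # some posts lack a score
        ],
    }
})

rows = []
for post in json.loads(raw)["data"]["posts"]:
    rows.append({
        "author": post["author"][0]["authorName"] if post["author"] else None,
        "score": post["score"][0]["score"] if post["score"] else None,
    })
print(rows)
```

Guarding each sub-list before indexing matters here: elements that lack a nested match come back as empty lists.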

This API always returns a list of results, regardless of whether one or many items are found, and null if none are found.

Scraping Responses

BrowserQL can record responses made by the browser, filtered by URL pattern, method, or type. BQL automatically waits for a matching response; you can disable this with the wait option.

mutation DocumentResponses {
  goto(url: "https://example.com/", waitUntil: load) {
    status
  }
  response(type: document) {
    url
    body
    headers {
      name
      value
    }
  }
}
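Headers arrive as a list of name/value pairs rather than a map, so a common first step is folding them into a dictionary. A sketch with a hypothetical payload (the header values below are illustrative; the response/headers shape follows the mutation above):

```python
import json

# Hypothetical response for the DocumentResponses mutation:
# "headers" is a list of {name, value} objects.
raw = json.dumps({
    "data": {
        "goto": {"status": 200},
        "response": {
            "url": "https://example.com/",
            "body": "<!doctype html>...",
            "headers": [
                {"name": "Content-Type", "value": "text/html; charset=UTF-8"},
                {"name": "Cache-Control", "value": "max-age=604800"},
            ],
        },
    }
})

resp = json.loads(raw)["data"]["response"]
# Lower-case the names so lookups are case-insensitive, as HTTP headers are.
headers = {h["name"].lower(): h["value"] for h in resp["headers"]}
print(headers["content-type"])  # text/html; charset=UTF-8
```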

Using querySelectorAll to scrape

The querySelectorAll mutation provides a simple way to extract elements from a page, similar to the native DOM method. This is particularly useful when you need to quickly grab specific elements without the structured mapping that mapSelector provides.

Here's an example that extracts all links from the Browserless homepage:

mutation FindLinks {
  goto(url: "https://browserless.io") {
    status
  }
  links: querySelectorAll(selector: "a") {
    outerHTML
  }
}

This mutation will:

  1. Navigate to the Browserless homepage
  2. Find all anchor (<a>) elements on the page
  3. Return the complete HTML for each link, including attributes and content

The querySelectorAll mutation returns an array of elements, making it easy to process multiple items of the same type. You can also use other properties like innerHTML, innerText, or specific attributes depending on your needs.
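Since querySelectorAll hands back raw outerHTML strings, pulling a specific attribute out of them is a client-side job. A stdlib-only sketch (the two sample links are hypothetical; the links/outerHTML shape follows the FindLinks mutation above):

```python
import json
from html.parser import HTMLParser

# Hypothetical response for the FindLinks mutation: "links" is a list
# of objects, each holding one element's outerHTML string.
raw = json.dumps({
    "data": {
        "goto": {"status": 200},
        "links": [
            {"outerHTML": '<a href="/pricing">Pricing</a>'},
            {"outerHTML": '<a href="/docs" class="nav">Docs</a>'},
        ],
    }
})

class HrefCollector(HTMLParser):
    """Pulls href attributes out of anchor tags."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self.hrefs.extend(value for name, value in attrs if name == "href")

collector = HrefCollector()
for link in json.loads(raw)["data"]["links"]:
    collector.feed(link["outerHTML"])
print(collector.hrefs)  # ['/pricing', '/docs']
```

If you only need the hrefs, requesting them directly with mapSelector and attribute(name: "href") avoids this parsing step entirely.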

Processing Data with JavaScript

For more complex data processing scenarios where you need to manipulate or transform the extracted data before returning it, you can use the evaluate mutation field. This allows you to run custom JavaScript code in the browser context, giving you full control over data processing.

The evaluate mutation is particularly powerful when you need to:

  • Combine data from multiple sources on the page
  • Perform calculations or transformations
  • Apply complex filtering logic
  • Format data in specific ways before extraction

For detailed examples and best practices on using JavaScript evaluation, see our Multi-line JavaScript documentation.

Next Steps

Ready to explore more advanced data extraction techniques? Check out this related topic: