Scrape and Extract Data

You have three main options for extracting data:

Grab the full or cleaned HTML to parse externally
Use the mapSelector mutation to generate JSON
Use the response mutation to capture an API response

Responding with HTML

If you already have a parser set up, you can grab the page's HTML. You can narrow the output with a selector such as body, shrink it with the clean options, or combine both. For example:

Cleaned HTML, followed by the HTML of a single selector:

mutation clean_example {
  goto(url: "https://www.browserless.io/") {
    status
  }
  html(clean: {
    removeAttributes: true,
    removeNonTextNodes: true
  }) {
    html
  }
}

mutation clean_example {
  goto(url: "https://www.browserless.io/") {
    status
  }
  html(
    selector: ".navbar_container"
    clean: {
      removeAttributes: true,
      removeNonTextNodes: true
    }
  ) {
    html
  }
}
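
If you parse the returned HTML yourself, the Python standard library is often enough. The sketch below assumes the response mirrors the query shape, with the markup under data.html.html. The sample payload and the TextCollector helper are illustrative and were not produced by a real run.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text chunks found in an HTML string."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(bql_response):
    """Return the text chunks from a BQL `html` result.

    Assumption: the response mirrors the query,
    i.e. {"data": {"html": {"html": "..."}}}.
    """
    markup = bql_response["data"]["html"]["html"]
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return parser.chunks


# Hypothetical response, hand-written for illustration.
sample = {
    "data": {
        "goto": {"status": 200},
        "html": {"html": "<div><h1>Pricing</h1><p>Start for free</p></div>"},
    }
}

print(extract_text(sample))
```

Any HTML parser you already use can take the place of html.parser here; only the path to the markup inside the response matters.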

Waiting for Elements

For more reliable data extraction, it's important to wait for the site and elements to be ready before scraping them. Learn more about BrowserQL's built-in wait methods in our Waiting for Things documentation.

Clean Features

The "clean" argument can remove non-text nodes, DOM attributes, and excessive whitespace and newlines. Using "clean" can shrink the payload by a factor of nearly 1,000.
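
To see how much cleaning helps on your own pages, you can compare the size of the raw and cleaned markup. This is a minimal sketch: the payload_reduction helper and both sample strings are hypothetical, and the ratio you get will depend entirely on the page.

```python
def payload_reduction(raw_html, cleaned_html):
    """Return how many times smaller the cleaned payload is, by UTF-8 byte size."""
    raw_size = len(raw_html.encode("utf-8"))
    cleaned_size = len(cleaned_html.encode("utf-8"))
    if cleaned_size == 0:
        raise ValueError("cleaned HTML is empty")
    return raw_size / cleaned_size


# Hand-written stand-ins for the `html` field of two separate queries.
raw_html = '<div class="hero" data-id="1"><p style="color: red">Hello</p></div>'
cleaned_html = "<div><p>Hello</p></div>"

print(f"{payload_reduction(raw_html, cleaned_html):.1f}x smaller")
```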

Creating a JSON with `mapSelector`

The mapSelector mutation is an alternative to parsing HTML yourself. It works like "map" in most functional programming languages: it runs a selector, much as document.querySelectorAll would, and maps each element of the resulting NodeList to the fields you request.

To get arbitrary DOM attributes back, request them with the attribute(name: "data-custom-attribute") field. It returns an object with name and value properties.

To exemplify this feature, the query below does the following:

Navigates to https://news.ycombinator.com.
Creates a map called posts from every element that matches the .submission .titleline > a selector.
Returns an array with one object per matched element. Each object uses the alias given to the attribute (link) as its key, and holds a value key with the attribute's value, here the href of each element.

Mutation, followed by an example response:

mutation scraping_example {
  goto(
    url: "https://news.ycombinator.com", 
    waitUntil: firstContentfulPaint
  ) {
    status
  }

  posts: mapSelector(selector: ".submission .titleline > a", wait: true) {
    link: attribute(name: "href") {
      value
    }
  }
}

{
  "data": {
    "goto": {
      "status": 200
    },
    "posts": [
      {
        "link": {
          "value": "https://churchofturing.github.io/landscapeoflisp.html"
        }
      },
      {
        "link": {
          "value": "https://www.jjj.de/fxt/fxtbook.pdf"
        }
      },
      ...
      {
        "link": {
          "value": "https://ereader-swedish.fly.dev/"
        }
      }
    ]
  }
}
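
On the client side, a response like the one above flattens into a plain list of URLs in a few lines. The sketch below uses two of the values from the example response. The extract_links helper is a hypothetical name, and it treats a null posts field as an empty result.

```python
def extract_links(bql_response):
    """Return the href values stored under the `posts` alias."""
    # mapSelector yields null when nothing matches, so fall back to an empty list.
    posts = bql_response["data"]["posts"] or []
    return [post["link"]["value"] for post in posts]


sample = {
    "data": {
        "goto": {"status": 200},
        "posts": [
            {"link": {"value": "https://churchofturing.github.io/landscapeoflisp.html"}},
            {"link": {"value": "https://www.jjj.de/fxt/fxtbook.pdf"}},
        ],
    }
}

for url in extract_links(sample):
    print(url)
```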

You can also nest mapSelector calls to map items within items. For instance, the query below gets all posts on a page, then uses nested mapSelector calls to list each post's author and score. The response preserves this nesting, so it mirrors the hierarchy of the data in the DOM.

mutation map_selector_example_with_metadata {
  goto(url: "https://news.ycombinator.com") {
    status
  }

  # Get all textual content
  posts: mapSelector(selector: ".subtext .subline") {
    # Get the author(s)
    author: mapSelector(selector: ".hnuser") {
      authorName: innerText
    }

    # Get the post score
    score: mapSelector(selector: ".score") {
      score: innerText
    }
  }
}

mapSelector always returns a list, whether it finds one item or many, and returns null if it finds none.
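
Because nested results can be lists or null, client code usually needs a small guard when flattening them. The sketch below assumes the nested response mirrors the query (each post holds an author list and a score list). The sample data and both helper names are hypothetical.

```python
def first_value(items, key):
    """Return `key` from the first mapped item, or None if the list is null or empty."""
    if not items:
        return None
    return items[0].get(key)


def flatten_posts(bql_response):
    """Turn nested mapSelector results into one flat dict per post."""
    posts = bql_response["data"]["posts"] or []
    return [
        {
            "author": first_value(post.get("author"), "authorName"),
            "score": first_value(post.get("score"), "score"),
        }
        for post in posts
    ]


# Hand-written sample; the second post has no author or score elements.
sample = {
    "data": {
        "goto": {"status": 200},
        "posts": [
            {"author": [{"authorName": "alice"}], "score": [{"score": "42 points"}]},
            {"author": None, "score": None},
        ],
    }
}

print(flatten_posts(sample))
```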

Scraping Responses

BrowserQL can record the responses the browser receives, filtered by URL pattern, method, or type. By default, BQL waits for a matching response; you can disable this with the wait option.

Getting all document responses, followed by loading all GET AJAX responses:

mutation DocumentResponses {
  goto(url: "https://example.com/", waitUntil: load) {
    status
  }
  response(type: document) {
    url
    body
    headers {
      name
      value
    }
  }
}

mutation AJAXGetCalls {
  goto(url: "https://msn.com/", waitUntil: load) {
    status
  }
  response(type: xhr, method: GET, operator: and) {
    url
    type
    method
    body
    headers {
      name
      value
    }
  }
}
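
Recorded responses often carry the API data you are after as a JSON body. The sketch below picks those out on the client side. It assumes the response field comes back as a list of objects shaped like the query (url, body, and headers as name/value pairs). The sample data and helper names are hypothetical.

```python
import json


def headers_to_dict(headers):
    """Turn a list of {name, value} pairs into a dict with lower-cased names."""
    return {h["name"].lower(): h["value"] for h in (headers or [])}


def json_bodies(bql_response):
    """Return (url, parsed_body) pairs for recorded responses whose body is JSON."""
    results = []
    for item in bql_response["data"]["response"] or []:
        content_type = headers_to_dict(item.get("headers")).get("content-type", "")
        if "json" not in content_type:
            continue
        try:
            results.append((item["url"], json.loads(item["body"])))
        except (TypeError, ValueError):
            # Body was missing or not valid JSON despite the header; skip it.
            continue
    return results


# Hand-written sample with one JSON response and one HTML response.
sample = {
    "data": {
        "response": [
            {
                "url": "https://example.com/api/items",
                "body": '{"items": [1, 2, 3]}',
                "headers": [{"name": "Content-Type", "value": "application/json"}],
            },
            {
                "url": "https://example.com/",
                "body": "<html></html>",
                "headers": [{"name": "content-type", "value": "text/html"}],
            },
        ]
    }
}

print(json_bodies(sample))
```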

Using `querySelectorAll` to scrape

The querySelectorAll mutation provides a simple way to extract elements from a page, similar to the native DOM method. This is particularly useful when you need to quickly grab specific elements without the structured mapping that mapSelector provides.

Here's an example that extracts all links from the Browserless homepage:

mutation FindLinks {
  goto(url: "https://browserless.io") {
    status
  }
  links: querySelectorAll(selector: "a") {
    outerHTML
  }
}

This mutation will:

Navigate to the Browserless homepage
Find all anchor (<a>) elements on the page
Return the complete HTML for each link, including attributes and content

The querySelectorAll mutation returns an array of elements, making it easy to process multiple items of the same type. You can also use other properties like innerHTML, innerText, or specific attributes depending on your needs.
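
If you request outerHTML as in the FindLinks example, you can pull the href and link text out of each fragment on the client side. This is a minimal sketch using Python's html.parser. The sample fragments and helper names are hypothetical, and a null links field is handled defensively as an empty list.

```python
from html.parser import HTMLParser


class AnchorParser(HTMLParser):
    """Captures the href and text content of a single <a> fragment."""

    def __init__(self):
        super().__init__()
        self.href = None
        self.text_parts = []

    def handle_starttag(self, tag, attrs):
        if tag == "a" and self.href is None:
            self.href = dict(attrs).get("href")

    def handle_data(self, data):
        self.text_parts.append(data)


def parse_links(bql_response):
    """Return one {"href", "text"} dict per element under the `links` alias."""
    links = []
    for element in bql_response["data"]["links"] or []:
        parser = AnchorParser()
        parser.feed(element["outerHTML"])
        parser.close()
        links.append({"href": parser.href, "text": "".join(parser.text_parts).strip()})
    return links


# Hand-written fragments standing in for real querySelectorAll output.
sample = {
    "data": {
        "links": [
            {"outerHTML": '<a href="/pricing" class="nav-link">Pricing</a>'},
            {"outerHTML": '<a href="/docs"><span>Docs</span></a>'},
        ]
    }
}

print(parse_links(sample))
```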

Processing Data with JavaScript

When you need to manipulate or transform extracted data before returning it, use the evaluate mutation. It runs custom JavaScript in the browser context, giving you full control over data processing.

The evaluate mutation is particularly powerful when you need to:

Combine data from multiple sources on the page
Perform calculations or transformations
Apply complex filtering logic
Format data in specific ways before extraction

For detailed examples and best practices on using JavaScript evaluation, see our Multi-line JavaScript documentation.

Next Steps

Ready to explore more advanced data extraction techniques? Check out this related topic:

Multi-line JavaScript Evaluation

Learn how to use the evaluate mutation to run custom JavaScript for complex data processing and transformation.
