
Scrape and Extract Data

You have three main options for extracting data:

  • Grab the full or cleaned HTML to parse externally
  • Use the mapSelector mutation to generate JSON
  • Record the browser's network responses to grab an API response

Responding with HTML

If you already have a parser set up, you can grab the page's HTML and parse it yourself. You can include selectors such as body or use our cleaning options. For example:

mutation clean_example {
  goto(url: "https://www.browserless.io/") {
    status
  }
  html(clean: {
    removeAttributes: true,
    removeNonTextNodes: true
  }) {
    html
  }
}
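
If you parse the HTML externally, the step after the query might look like the sketch below. It is a minimal Python example using only the standard library. It assumes the result arrives in the usual GraphQL shape, with the fields under a top-level data key. The names TextCollector and extract_text and the sample result are hypothetical and only illustrate the idea.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text chunks found in an HTML string."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.chunks.append(text)


def extract_text(result):
    """Return the text chunks from the html field of a query result."""
    # Assumption: the result follows the usual GraphQL shape {"data": {...}}.
    html = result["data"]["html"]["html"]
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    return parser.chunks


# Hypothetical sample result, not real output from the query above.
sample = {
    "data": {
        "goto": {"status": 200},
        "html": {"html": "<div><h1>Title</h1><p>Some text</p></div>"},
    }
}

print(extract_text(sample))
```

Because the cleaned HTML carries fewer attributes and non-text nodes, a simple parser like this has less noise to skip over.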

Waiting for Elements

For more reliable data extraction, it's important to wait for the site and elements to be ready before scraping them. Learn more about BrowserQL's built-in wait methods in our Waiting for Things documentation.

Clean Features

The "clean" argument can remove non-text nodes, DOM attributes, and excessive whitespace and newlines. Using "clean" can reduce the payload size by a factor of nearly 1,000.

Creating a JSON with mapSelector

The mapSelector mutation is an alternative to typical parsing. It works like “map” in most functional programming languages, applied to the kind of NodeList you would get from document.querySelectorAll.

To get arbitrary DOM attributes back, specify them via the attribute(name: "data-custom-attribute") property. This returns an object with name and value properties.

To exemplify this feature, the query below does the following:

  1. Navigates to https://news.ycombinator.com.
  2. Creates a map called posts, finding all elements with the .submission .titleline > a selector.
  3. Returns an array with one object per element found. Each object uses the name given to the attribute as its key (link), and that key holds a value key with the attribute's actual value, in this case the href of each element.

mutation scraping_example {
  goto(
    url: "https://news.ycombinator.com",
    waitUntil: firstContentfulPaint
  ) {
    status
  }

  posts: mapSelector(selector: ".submission .titleline > a", wait: true) {
    link: attribute(name: "href") {
      value
    }
  }
}
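
To show how the returned structure can be consumed, here is a minimal Python sketch that flattens the posts array into a plain list of URLs. It assumes the result is wrapped in the usual GraphQL data key. The function name extract_links and the sample result are hypothetical.

```python
def extract_links(result):
    """Return the href values from a scraping_example-style result."""
    # Assumption: fields sit under the usual GraphQL "data" key.
    # posts may be null (None) when nothing matched, so fall back to [].
    posts = result["data"]["posts"] or []
    return [post["link"]["value"] for post in posts if post.get("link")]


# Hypothetical sample result, not real output.
sample = {
    "data": {
        "goto": {"status": 200},
        "posts": [
            {"link": {"value": "https://example.com/a"}},
            {"link": {"value": "https://example.com/b"}},
        ],
    }
}

print(extract_links(sample))
```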

You can also nest mapSelector calls to map further nested items. For instance, the query below gets all posts on a page, and nested mapSelector calls list each post's author and score. The results preserve the hierarchy of the data as it is modeled in the DOM.

mutation map_selector_example_with_metadata {
  goto(url: "https://news.ycombinator.com") {
    status
  }

  # Get all textual content
  posts: mapSelector(selector: ".subtext .subline") {
    # Get the author(s)
    author: mapSelector(selector: ".hnuser") {
      authorName: innerText
    }

    # Get the post score
    score: mapSelector(selector: ".score") {
      score: innerText
    }
  }
}

mapSelector always returns a list of results, whether one or many items are found, and returns null if none are found.
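
Because every level is either a list or null, client code has to guard each nested field. The Python sketch below flattens a nested result into one row per post, assuming the result is wrapped in the usual GraphQL data key. The function name flatten_posts and the sample result are hypothetical.

```python
def flatten_posts(result):
    """Turn nested mapSelector results into one flat dict per post."""
    rows = []
    # Each mapSelector field is a list, or None when nothing matched.
    for post in result["data"]["posts"] or []:
        authors = [a["authorName"] for a in post.get("author") or []]
        scores = [s["score"] for s in post.get("score") or []]
        rows.append({
            "author": authors[0] if authors else None,
            "score": scores[0] if scores else None,
        })
    return rows


# Hypothetical sample result: the second post has no score element.
sample = {
    "data": {
        "posts": [
            {"author": [{"authorName": "alice"}], "score": [{"score": "120 points"}]},
            {"author": [{"authorName": "bob"}], "score": None},
        ]
    }
}

print(flatten_posts(sample))
```

Taking the first author and score per post is a design choice for this sketch. Keep the full lists if a post can have several matches.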

Scraping Responses

BrowserQL can record responses made by the browser, filtered by URL pattern, method, or type. BQL automatically waits for the response. You can disable this with the wait option.

mutation DocumentResponses {
  goto(url: "https://example.com/", waitUntil: load) {
    status
  }
  response(type: document) {
    url
    body
    headers {
      name
      value
    }
  }
}
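
Once responses are recorded, the headers arrive as name and value pairs, which are often easier to use as a dictionary. The Python sketch below does that conversion and decodes JSON bodies. It assumes the response field comes back as a list under the usual GraphQL data key, and that body is a string. The function names and the sample result are hypothetical.

```python
import json


def headers_to_dict(headers):
    """Convert a list of {name, value} pairs into a lowercase-keyed dict."""
    return {h["name"].lower(): h["value"] for h in headers}


def parse_responses(result):
    """Return url, headers, and decoded body for each recorded response."""
    parsed = []
    # Assumption: response is a list, or None when nothing was recorded.
    for resp in result["data"]["response"] or []:
        headers = headers_to_dict(resp.get("headers") or [])
        body = resp.get("body")
        # Decode the body only when the server declared it as JSON.
        if body and "application/json" in headers.get("content-type", ""):
            body = json.loads(body)
        parsed.append({"url": resp["url"], "headers": headers, "body": body})
    return parsed


# Hypothetical sample result, not real output.
sample = {
    "data": {
        "response": [
            {
                "url": "https://example.com/api/items",
                "body": '{"items": [1, 2, 3]}',
                "headers": [{"name": "Content-Type", "value": "application/json"}],
            }
        ]
    }
}

print(parse_responses(sample))
```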