Parsing Libraries

You can use BrowserQL to retrieve HTML content from any website, and use this content with any parsing libraries, such as Scrapy, Beaultiful Soup, or Cheerio.

With BQL, you'll navigate to the desired page, do the actions you require, like verifying or solving a captcha, and finally, retrieve the HTML you need to parse.

tip

Use BQL Editor to create and test your queries before integrating it to your code.

Example

As an example, the code below accesses https://browserless.io/, clicks on the Try it Free button, and retrieves the HTML content from the pricing page.

mutation RetrieveHTML {
  goto(url: "https://browserless.io/") {
    status
  }

  click(selector: ".button-group a.button.w-button") {
    time
  }
  
  html {
    html
  }
}

Now, you can integrate this query into your code, using BQL to retrieve the HTML, and integrating it with your preffered library to parse this content:

Beautiful Soup
Scrapy
Cheerio

import requests
from bs4 import BeautifulSoup

url = 'https://browserless.io/'
token = 'YOUR_API_TOKEN_HERE'
timeout = 5 * 60 * 1000

query = '''
mutation RetrieveHTML($url: String!) {
    goto(url: $url) {
        status
    }
    click(selector: ".button-group a.button.w-button") {
        time
    }
    html {
        html
    }
}
'''

variables = {"url": url}
endpoint = f'https://production-sfo.browserless.io/chromium/bql?timeout={timeout}&token={token}'

headers = {'content-type': 'application/json'}
payload = {"query": query, "variables": variables}

response = requests.post(endpoint, json=payload)
response_data = response.json()

# Extract HTML content
html_content = response_data['data']['html']['html']

# Parse HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
plans = [tag.text.strip() for tag in soup.find_all('div', class_='tag_price margin-bottom margin-large')]

print(plans)

import requests
from scrapy.selector import Selector

url = 'https://browserless.io/'
token = 'YOUR_API_TOKEN_HERE'
timeout = 5 * 60 * 1000

query = '''
mutation RetrieveHTML($url: String!) {
    goto(url: $url) {
        status
    }
    click(selector: ".button-group a.button.w-button") {
        time
    }
    html {
        html
    }
}
'''

variables = {"url": url}
endpoint = f'https://production-sfo.browserless.io/chromium/bql?timeout={timeout}&token={token}'

headers = {'content-type': 'application/json'}
payload = {"query": query, "variables": variables}

response = requests.post(endpoint, json=payload)
response_data = response.json()

# Extract HTML content
html_content = response_data['data']['html']['html']

# Parse HTML with Scrapy
selector = Selector(text=html_content)
plans = selector.css('.tag_price.margin-bottom.margin-large::text').getall()

print(plans)

const fetch = require('node-fetch');
const cheerio = require('cheerio');

const url = 'https://browserless.io/';
const token = 'YOUR_API_TOKEN_HERE';
const timeout = 5 * 60 * 1000;

const queryParams = new URLSearchParams({
  timeout,
  token,
}).toString();

const query = `
  mutation RetrieveHTML($url: String!) {
    goto(url: $url) {
      status
    }
    click(selector: ".button-group a.button.w-button") {
      time
    }
    html {
      html
    }
  }
`;

const variables = { url };
const endpoint = `https://production-sfo.browserless.io/chromium/bql?${queryParams}`;

const options = {
  method: 'POST',
  headers: {
    'content-type': 'application/json',
  },
  body: JSON.stringify({
    query,
    variables,
  }),
};

(async () => {
  try {
    const response = await fetch(endpoint, options);
    const { data } = await response.json();

    // Extract HTML content
    const htmlContent = data.html.html;

    // Parse HTML with Cheerio
    const $ = cheerio.load(htmlContent);
    const plans = [];
    $('.tag_price.margin-bottom.margin-large').each((_, element) => {
      plans.push($(element).text().trim());
    });

    console.log(plans);
  } catch (error) {
    console.error('Error fetching or parsing HTML:', error);
  }
})();

Example​

Example