Parsing Libraries

You can use BrowserQL (BQL) to retrieve HTML content from any website and pass that content to a parsing library such as Scrapy, Beautiful Soup, or Cheerio.

With BQL, you navigate to the desired page, perform any actions you need, such as verifying or solving a captcha, and then retrieve the HTML you want to parse.

tip

Use the BQL Editor to create and test your queries before integrating them into your code.

Example

As an example, the query below opens https://browserless.io/, clicks the Try it Free button, and retrieves the HTML content of the pricing page.

mutation RetrieveHTML {
  goto(url: "https://browserless.io/") {
    status
  }

  click(selector: ".button-group a.button.w-button") {
    time
  }

  html {
    html
  }
}

Now you can integrate this query into your code: use BQL to retrieve the HTML, then pass it to your preferred library to parse the content:

import requests
from bs4 import BeautifulSoup

url = 'https://browserless.io/'
token = 'YOUR_API_TOKEN_HERE'
timeout = 5 * 60 * 1000  # BQL timeout in milliseconds (5 minutes)

query = '''
mutation RetrieveHTML($url: String!) {
  goto(url: $url) {
    status
  }
  click(selector: ".button-group a.button.w-button") {
    time
  }
  html {
    html
  }
}
'''

variables = {"url": url}
endpoint = f'https://production-sfo.browserless.io/chromium/bql?timeout={timeout}&token={token}'

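headers = {'content-type': 'application/json'}
payload = {"query": query, "variables": variables}

# Send the query and fail early on HTTP errors (4xx/5xx)
response = requests.post(endpoint, json=payload, headers=headers)
response.raise_for_status()
response_data = response.json()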

# Extract HTML content
html_content = response_data['data']['html']['html']

# Parse HTML with BeautifulSoup
soup = BeautifulSoup(html_content, 'html.parser')
plans = [tag.text.strip() for tag in soup.find_all('div', class_='tag_price margin-bottom margin-large')]

print(plans)
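
The script above assumes the request succeeds and that the token contains no characters that need URL encoding. As a minimal sketch, the two helpers below show one way to make both steps more defensive. The names build_endpoint and extract_html are hypothetical and not part of any Browserless library. The error check assumes BQL reports failures in the standard GraphQL "errors" list, which you should confirm against a real response.

```python
from urllib.parse import urlencode

# Endpoint taken from the example above; the helper names are hypothetical.
BQL_ENDPOINT = "https://production-sfo.browserless.io/chromium/bql"


def build_endpoint(token: str, timeout_ms: int) -> str:
    """Return the BQL endpoint with URL-encoded query parameters."""
    params = urlencode({"timeout": timeout_ms, "token": token})
    return f"{BQL_ENDPOINT}?{params}"


def extract_html(response_data: dict) -> str:
    """Return the HTML string from a decoded BQL response.

    Raises ValueError if the response carries GraphQL errors (assumed to
    follow the standard "errors" list shape) or lacks data.html.html.
    """
    errors = response_data.get("errors")
    if errors:
        messages = "; ".join(
            str(err.get("message", err)) if isinstance(err, dict) else str(err)
            for err in errors
        )
        raise ValueError(f"BQL query failed: {messages}")
    try:
        return response_data["data"]["html"]["html"]
    except (KeyError, TypeError) as exc:
        raise ValueError("Response does not contain data.html.html") from exc
```

With these helpers, the endpoint line becomes endpoint = build_endpoint(token, timeout) and the extraction line becomes html_content = extract_html(response_data). A failed query then raises a readable error instead of a KeyError on the missing html field.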