Version: v2

Beautiful Soup (Python)

Beautiful Soup is one of the most popular scraping libraries. It lets you parse an existing HTML string in Python and extract its data quickly and easily. You can use Beautiful Soup alongside our /content API or BrowserQL to scrape any website.

Both of these APIs render the page in a browser before returning the HTML. The difference is that BrowserQL first uses advanced stealth techniques to bypass bot detectors.

Basic Usage

Just like Cheerio, Beautiful Soup is only a parser: it does not provide any API to fetch the HTML string in the first place. Usually, to get the HTML of a website, you would use the requests library to download the page, like this:

import requests
from bs4 import BeautifulSoup

url = 'https://browserless.io/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
entries = soup.find_all('a')

for entry in entries:
    print(entry.text.strip())

Here's the main problem: since requests just downloads the HTML, it can only return the source code of a page without interacting with it. This means that any page that relies on JavaScript or user interaction to render its content will not be downloaded properly.
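To make the limitation concrete, here is a minimal offline sketch. The two HTML strings are hypothetical: `RAW_SOURCE` stands in for what a plain download would return for a page that builds its link with JavaScript, and `RENDERED_DOM` stands in for the markup a browser would hold after running that script. The helper name `link_texts` is also just for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical page source: the only link is created by JavaScript at runtime.
RAW_SOURCE = """
<html>
  <body>
    <nav id="menu"></nav>
    <script>
      const link = document.createElement("a");
      link.textContent = "Docs";
      document.getElementById("menu").appendChild(link);
    </script>
  </body>
</html>
"""

# Hypothetical markup of the same page after a browser has run the script.
RENDERED_DOM = """
<html>
  <body>
    <nav id="menu"><a>Docs</a></nav>
  </body>
</html>
"""


def link_texts(html):
    """Return the stripped text of every <a> element found in the HTML."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a")]


# Beautiful Soup parses markup only; it never executes the <script> body.
raw_links = link_texts(RAW_SOURCE)
rendered_links = link_texts(RENDERED_DOM)

print(raw_links)       # no links in the unrendered source
print(rendered_links)  # the link is present once the page is rendered
```

The parser finds the link only in the rendered markup, which is why the HTML has to come from a real browser for JavaScript-heavy pages.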

The /content API, on the other hand, ensures that the HTML is not just downloaded, but rendered and evaluated inside a browser. You can use the requests library to make an HTTP request to our API, like this:

import requests
from bs4 import BeautifulSoup

response = requests.post(
    "https://production-sfo.browserless.io/content",
    params={"token": "YOUR_API_TOKEN_HERE"},
    json={
        "url": "https://puppeteer.github.io/pptr.dev/",
        # Wait 5000 ms so client-side JavaScript can finish rendering.
        "waitForTimeout": 5000,
    },
)

soup = BeautifulSoup(response.text, 'html.parser')
entries = soup.find_all("a", class_="pptr-sidebar-item")

for entry in entries:
    print(entry.text.strip())

This example uses the old Puppeteer documentation site, which relies heavily on JavaScript to render its content. A plain requests or cURL request would only download the page's source code: the JavaScript wouldn't be executed and the content wouldn't be rendered.
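For scripts that run unattended, it can help to wrap the call in a small helper that sets a client-side timeout and fails loudly on HTTP errors. The sketch below is one possible shape, not an official client: the function names `fetch_rendered_html` and `extract_link_texts`, the default timeout values, and the sample HTML are assumptions made for illustration. Only the endpoint, the `token` query parameter, and the `url` and `waitForTimeout` body fields come from the example above.

```python
import requests
from bs4 import BeautifulSoup

CONTENT_ENDPOINT = "https://production-sfo.browserless.io/content"


def fetch_rendered_html(url, token, wait_ms=5000, timeout_s=60):
    """POST to /content and return the rendered HTML as a string.

    Raises requests.HTTPError on a 4xx/5xx response and
    requests.Timeout if no response arrives within timeout_s seconds.
    """
    response = requests.post(
        CONTENT_ENDPOINT,
        params={"token": token},
        json={"url": url, "waitForTimeout": wait_ms},
        timeout=timeout_s,  # assumed value; tune for slow pages
    )
    response.raise_for_status()
    return response.text


def extract_link_texts(html, css_class):
    """Return the stripped text of every <a> element with the given class."""
    soup = BeautifulSoup(html, "html.parser")
    return [a.get_text(strip=True) for a in soup.find_all("a", class_=css_class)]


# Offline check of the parsing step with a hypothetical HTML snippet.
sample = '<a class="pptr-sidebar-item"> Overview </a><a class="other">Skip</a>'
sample_links = extract_link_texts(sample, "pptr-sidebar-item")
print(sample_links)
```

With a valid token, the two helpers would be combined as `extract_link_texts(fetch_rendered_html("https://puppeteer.github.io/pptr.dev/", "YOUR_API_TOKEN_HERE"), "pptr-sidebar-item")`. Keeping the network call separate from the parsing step makes the parsing logic easy to test without hitting the API.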

With our API, you can use all the options available in the /content API, as well as stealth mode, our residential proxies, and more. For details, please check our OpenAPI reference.

Bypass bot-blockers using `/unblock`

When a website implements aggressive bot-detection mechanisms, you can use the /unblock API to bypass them. The /unblock API uses a variety of tools and strategies to override and hide the footprints that headless browsers leave behind, allowing you to access bot-protected websites from a remote interface.

Similar to the /content API, the /unblock API renders and evaluates the page in a browser, but with extra stealth features. This makes it ideal for scraping highly protected websites.

import json
import requests
from bs4 import BeautifulSoup

response = requests.post(
    "https://production-sfo.browserless.io/unblock",
    params={"token": "YOUR_API_TOKEN_HERE"},
    json={
        "url": "https://puppeteer.github.io/pptr.dev/",
        # Wait 5000 ms so client-side JavaScript can finish rendering.
        "waitForTimeout": 5000,
    },
)

# Here the response body is JSON; the rendered HTML is read from its "content" field.
html_content = json.loads(response.text)['content']
soup = BeautifulSoup(html_content, 'html.parser')

entries = soup.find_all("a", class_="pptr-sidebar-item")
for entry in entries:
    print(entry.text.strip())
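Because the example above reads the HTML from a `content` field inside a JSON body, a missing or empty field would otherwise surface as a confusing `KeyError` or parser error. The sketch below shows one defensive way to extract it. The helper name `html_from_unblock_body` and the sample bodies are hypothetical, and the only assumption about the response shape is the `content` key used in the example above.

```python
import json

from bs4 import BeautifulSoup


def html_from_unblock_body(body_text):
    """Parse a JSON response body and return its 'content' field as HTML.

    Raises ValueError if the body is not a JSON object or if 'content'
    is missing, empty, or not a string.
    """
    payload = json.loads(body_text)
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object in the response body")
    html = payload.get("content")
    if not isinstance(html, str) or not html:
        raise ValueError("response JSON has no usable 'content' field")
    return html


# Hypothetical response body, shaped like the one the example above reads.
sample_body = json.dumps({"content": '<a class="pptr-sidebar-item">Overview</a>'})
soup = BeautifulSoup(html_from_unblock_body(sample_body), "html.parser")
first_link = soup.find("a", class_="pptr-sidebar-item").get_text(strip=True)
print(first_link)

# A body without usable HTML is rejected with a clear error.
try:
    html_from_unblock_body(json.dumps({"content": None}))
    missing_raised = False
except ValueError:
    missing_raised = True
print(missing_raised)
```

In a real script, `response.text` from the /unblock request would be passed to the helper in place of `sample_body`.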
