Version: v2

Beautiful Soup (Python)

Beautiful Soup is one of the most popular scraping libraries for Python. It lets you parse an existing HTML string and extract data from it quickly and easily. You can use Beautiful Soup alongside our /content API to scrape any website.

Just like Cheerio, Beautiful Soup is only a parser; it does not provide any API to fetch the HTML string in the first place. Usually, to get the HTML from a website, you would use the requests library to download the page, like this:

import requests
from bs4 import BeautifulSoup

url = 'https://browserless.io/'
response = requests.get(url)
soup = BeautifulSoup(response.text, 'html.parser')
entries = soup.find_all('a')

for entry in entries:
    print(entry.text.strip())
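Beyond the link text, you will often want attributes such as href as well. find_all returns Tag objects whose attributes behave like a dict. A minimal sketch, using an illustrative HTML snippet of our own rather than a live page:

```python
from bs4 import BeautifulSoup

# A hypothetical HTML fragment, just to show attribute access
html = '<a href="/docs" class="nav">Docs</a><a href="/pricing">Pricing</a>'
soup = BeautifulSoup(html, 'html.parser')

for link in soup.find_all('a'):
    # .get() returns None instead of raising when an attribute is missing
    print(link.get('href'), link.text)
```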


Here's the main problem: since requests just downloads the HTML, it can only return the source code of a page without interacting with it. This means that any page that relies on JavaScript and user interactions to render content will not be downloaded properly.
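You can see this locally with a contrived snippet: the links below only exist after the script runs, so a parser working on the static source finds nothing. The HTML here is our own illustration, not a real page:

```python
from bs4 import BeautifulSoup

# The static source contains no <a> tags; a link is only added by JavaScript
html = """
<html><body>
  <div id="app"></div>
  <script>
    document.getElementById('app').innerHTML = '<a href="/docs">Docs</a>';
  </script>
</body></html>
"""

soup = BeautifulSoup(html, 'html.parser')
print(soup.find_all('a'))  # the script never ran, so no links were parsed
```

This is exactly what happens when you point requests at a JavaScript-rendered site: the parser is fine, but the content it needs was never produced.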

On the other hand, the /content API ensures that the HTML content is not just downloaded, but rendered and evaluated inside a browser. You can use the requests library to make an HTTP request to our API, like this:

import requests
from bs4 import BeautifulSoup

response = requests.post(
    "https://chrome.browserless.io/content",
    params={"token": "GOES-HERE"},
    json={
        "gotoOptions": {"waitUntil": "networkidle0"},
        "url": "https://puppeteer.github.io/pptr.dev/"
    }
)

soup = BeautifulSoup(response.text, 'html.parser')
entries = soup.find_all("a", class_="pptr-sidebar-item")

for entry in entries:
    print(entry.text.strip())


In this example, we are using the old Puppeteer docs site, which relies heavily on JavaScript to render its content. A plain requests or cURL request would only download the page's source code: the JavaScript wouldn't be interpreted and the content wouldn't be rendered.

Using our API, you can take advantage of all the options available in the /content API, use stealth mode, our residential proxies and more! For reference, please check our OpenAPI specification.
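Putting the pieces together, the whole flow can be wrapped in a small helper. This is only a sketch: the endpoint and gotoOptions come from the example above, while the function names and defaults are our own, so adjust them to the options your account and the OpenAPI specification support.

```python
import requests
from bs4 import BeautifulSoup

CONTENT_API = "https://chrome.browserless.io/content"  # endpoint from the example above

def build_content_payload(url, wait_until="networkidle0"):
    # Mirrors the JSON body sent in the example above
    return {"url": url, "gotoOptions": {"waitUntil": wait_until}}

def fetch_rendered_soup(url, token, wait_until="networkidle0"):
    # POST the target URL to /content and parse the fully rendered HTML
    response = requests.post(
        CONTENT_API,
        params={"token": token},
        json=build_content_payload(url, wait_until),
    )
    response.raise_for_status()
    return BeautifulSoup(response.text, "html.parser")
```

With this in place, fetch_rendered_soup("https://puppeteer.github.io/pptr.dev/", token) returns a soup you can query with find_all as usual.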