Beautiful Soup (Python)
Beautiful Soup is one of the most popular scraping libraries. It lets you parse an existing HTML string in Python and extract its data quickly and easily. You can use Beautiful Soup alongside our /content or /unblock API to scrape any website.
Both of these APIs render the page in a browser before returning the HTML. The difference is that the /unblock API first uses advanced stealth techniques to bypass bot detectors.
Just like Cheerio, Beautiful Soup is only a parser: it does not provide any API to get the HTML string in the first place. Usually, to get the HTML string from a website, you would use the requests library to download the page, like this:
import requests
from bs4 import BeautifulSoup

url = 'https://browserless.io/'
# Download the raw page source (no JavaScript is executed)
response = requests.get(url)
# Parse the HTML and print the text of every link
soup = BeautifulSoup(response.text, 'html.parser')
entries = soup.find_all('a')
for entry in entries:
    print(entry.text.strip())
Here's the main problem: since requests just downloads the HTML, it can only return the page's source code without interacting with it. This means that any page that relies on JavaScript or user interaction to render its content will not be downloaded properly.
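To make the limitation concrete, here is a minimal offline sketch. The HTML string is a made-up page (not taken from any real site) in which one link is present in the source and a second link would only be created by a script at runtime. Beautiful Soup parses the source as text and never runs the script, so it only finds the first link.

```python
from bs4 import BeautifulSoup

# Hypothetical page source, as a plain HTTP download would return it.
# The "Docs" link does not exist in the markup: the script would create it
# in a real browser.
static_html = """
<html>
  <body>
    <a href="/home">Home</a>
    <ul id="menu"></ul>
    <script>
      const link = document.createElement('a');
      link.textContent = 'Docs';
      document.getElementById('menu').appendChild(link);
    </script>
  </body>
</html>
"""

# Beautiful Soup only parses the markup; the script body is never executed.
soup = BeautifulSoup(static_html, "html.parser")
link_texts = [a.get_text(strip=True) for a in soup.find_all("a")]
print(link_texts)  # ['Home']
```

A browser would show both links, but the parsed source contains only the one written directly in the markup.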
On the other hand, the /content API ensures that the HTML content is not just downloaded, but rendered and evaluated inside a browser. You can use the requests library to make an HTTP request to our API, like this:
import requests
from bs4 import BeautifulSoup

# Ask the /content API to render the page in a browser and return the HTML
response = requests.post(
    "https://chrome.browserless.io/content",
    params={"token": "GOES-HERE"},
    json={
        # Wait until the network is idle so JS-rendered content is present
        "gotoOptions": {"waitUntil": "networkidle0"},
        "url": "https://puppeteer.github.io/pptr.dev/"
    }
)
soup = BeautifulSoup(response.text, 'html.parser')
# Print the text of every sidebar link
entries = soup.find_all("a", class_="pptr-sidebar-item")
for entry in entries:
    print(entry.text.strip())
In this example, we are using the old Puppeteer documentation site, which relies heavily on JavaScript to render its content. A plain requests or cURL request would only download the page's source code; the JavaScript would not be executed, so the content would not be rendered.
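If you scrape more than one page, it can help to separate the network call from the parsing step. The sketch below is one possible way to do that, not an official client: fetch_rendered_html and link_texts are hypothetical helper names, the timeout value is an arbitrary assumption, and the endpoint and request body are copied from the example above.

```python
import requests
from bs4 import BeautifulSoup

# Endpoint taken from the example above
CONTENT_ENDPOINT = "https://chrome.browserless.io/content"


def fetch_rendered_html(url, token, timeout=60):
    """Hypothetical helper: ask /content to render `url` and return its HTML."""
    response = requests.post(
        CONTENT_ENDPOINT,
        params={"token": token},
        json={"url": url, "gotoOptions": {"waitUntil": "networkidle0"}},
        timeout=timeout,  # assumed value; tune it for slow pages
    )
    # Raise on 4xx/5xx instead of silently parsing an error page
    response.raise_for_status()
    return response.text


def link_texts(html, css_class=None):
    """Hypothetical helper: return the stripped text of each matching link."""
    soup = BeautifulSoup(html, "html.parser")
    if css_class is None:
        anchors = soup.find_all("a")
    else:
        anchors = soup.find_all("a", class_=css_class)
    return [a.get_text(strip=True) for a in anchors]
```

With a valid token, link_texts(fetch_rendered_html("https://puppeteer.github.io/pptr.dev/", "GOES-HERE"), "pptr-sidebar-item") would return the same sidebar entries as the example above. Keeping the parser separate also lets you test it against saved HTML without making a request.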
Because you are calling our API, all the options of the /content API are available, along with stealth mode, our residential proxies, and more. For more details, please check our OpenAPI reference.