Downloading files from sites
When automating browser tasks, you may need to download files from websites. This guide demonstrates how to handle file downloads using both Puppeteer and Playwright with Browserless, including advanced techniques for remote server environments.
Puppeteer
How you handle file downloads in Puppeteer depends on where the browser runs:
- Local environment: When Puppeteer runs on your own machine, downloads are straightforward. You configure the browser's download behavior and the files land on your local disk.
- Remote server environment: When using Browserless (or any remote browser service), the files are downloaded to the remote server rather than your local machine, and Puppeteer offers no built-in way to retrieve them.
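For the local case, the download behavior setting might look like the sketch below. It is a minimal example under stated assumptions, not an official Browserless recipe: the helper name allowLocalDownloads and the example path are hypothetical, and the session is assumed to be a CDP session such as the one returned by page.createCDPSession(). Browser.setDownloadBehavior itself is a Chrome DevTools Protocol command.

```javascript
// Hypothetical helper (not a Puppeteer API): ask a locally running browser to
// save downloads into `downloadPath` using the CDP Browser.setDownloadBehavior command.
function allowLocalDownloads(cdpSession, downloadPath) {
  return cdpSession.send('Browser.setDownloadBehavior', {
    behavior: 'allow',   // accept downloads and write them to downloadPath
    downloadPath,        // directory on the machine where the browser runs
    eventsEnabled: true, // ask the browser to emit download progress events
  });
}

// Assumed usage with a local Puppeteer page:
//   const cdp = await page.createCDPSession();
//   await allowLocalDownloads(cdp, '/absolute/path/to/downloads');
```

The path refers to the machine where the browser runs, which is why this approach alone does not help with a remote browser.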
To get a file from a remote browser onto your machine, intercept the download response over the Chrome DevTools Protocol (CDP) and stream its body to a local file before the remote browser writes it to disk. The following example does this with network interception and a CDP response stream:
import puppeteer from 'puppeteer-core';
import fs from 'fs';

// Establish the remote connection
const browserWSEndpoint = `wss://production-sfo.browserless.io/chromium/stealth?token=${process.env.BROWSERLESS_TOKEN}`;
const browser = await puppeteer.connect({ browserWSEndpoint });
const [page] = await browser.pages();

// Create a CDP session for the page
const cdp = await page.createCDPSession();

// Navigate to the target site (log in first if the site requires it)
await page.goto('https://slackmojis.com/', { waitUntil: 'networkidle0' });

// Intercept every response once its headers have arrived
await cdp.send('Network.enable');
await cdp.send('Network.setRequestInterception', {
  patterns: [{ urlPattern: '*', interceptionStage: 'HeadersReceived' }],
});

// Stream an intercepted response body from the remote browser into a local file
const downloadFileFromInterceptedResponse = async (interceptionId, fileName) => {
  const { stream: streamHandle } = await cdp.send('Network.takeResponseBodyForInterceptionAsStream', {
    interceptionId,
  });
  const writer = fs.createWriteStream(fileName);
  while (true) {
    const read = await cdp.send('IO.read', { handle: streamHandle });
    // IO.read reports whether each chunk is base64-encoded; the last chunk may still carry data
    if (read.data) writer.write(Buffer.from(read.data, read.base64Encoded ? 'base64' : 'utf8'));
    if (read.eof) break;
  }
  await cdp.send('IO.close', { handle: streamHandle });
  // Flush the file to disk before treating the download as complete
  await new Promise((resolve, reject) => {
    writer.on('error', reject);
    writer.end(resolve);
  });
  // The body is saved, so abort the request rather than leave the browser waiting for it
  await cdp.send('Network.continueInterceptedRequest', {
    interceptionId,
    errorReason: 'Aborted',
  });
};

// Listen for intercepted requests
const downloadPromises = [];
cdp.on('Network.requestIntercepted', async (event) => {
  if (event.isDownload) {
    // Save downloads locally, named after the last segment of the URL
    const fileName = event.request.url.split('/').pop();
    downloadPromises.push(downloadFileFromInterceptedResponse(event.interceptionId, fileName));
  } else {
    // Let every other request continue untouched
    await cdp.send('Network.continueInterceptedRequest', {
      interceptionId: event.interceptionId,
    });
  }
});

// Trigger the download
await page.click('li.emoji.alert a.downloader');

// Give the interception events time to fire, then wait for the files to finish writing
await new Promise((r) => setTimeout(r, 3000));
await Promise.all(downloadPromises);
await browser.close();
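The example names each file with event.request.url.split('/').pop(), which keeps any query string and yields an empty name for URLs that end in a slash. A slightly more defensive variant could look like the sketch below. The helper name fileNameFromUrl and the fallback name download.bin are hypothetical, chosen only for illustration.

```javascript
// Hypothetical helper (not part of Puppeteer or Browserless): derive a local
// file name from a download URL, ignoring the query string and fragment.
function fileNameFromUrl(rawUrl, fallback = 'download.bin') {
  const { pathname } = new URL(rawUrl);
  // Take the last non-empty path segment, if there is one
  const lastSegment = pathname.split('/').filter(Boolean).pop();
  if (!lastSegment) return fallback;
  let name;
  try {
    name = decodeURIComponent(lastSegment);
  } catch {
    // Malformed percent-encoding: keep the raw segment
    name = lastSegment;
  }
  // Replace path separators so the name cannot point outside the target directory
  return name.replace(/[\\/]/g, '_');
}
```

Inside the Network.requestIntercepted handler, this helper could stand in for the split('/').pop() expression, for example fileNameFromUrl(event.request.url).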
Playwright
Playwright, on the other hand, offers a native Download.saveAs() method that makes file downloads much simpler. It also works when the browser runs on a remote server, including Browserless:
import playwright from "playwright-core";

// Establish the remote connection
const browserWSEndpoint = `wss://production-sfo.browserless.io/chromium/playwright?token=${process.env.BROWSERLESS_TOKEN}`;
const browser = await playwright.chromium.connect(browserWSEndpoint);
const page = await browser.newPage();

// Navigate to the target site (log in first if the site requires it)
await page.goto("https://slackmojis.com/", { waitUntil: "networkidle" });

// Start waiting for the download before triggering it, so the event is not missed
const downloadPromise = page.waitForEvent("download");
await page.click("li.emoji.alert a.downloader");
const download = await downloadPromise;

// Save the file locally before closing the browser
await download.saveAs(download.suggestedFilename());
await browser.close();