Maintaining Sessions with Reconnects

Scraping websites with traditional tools like Puppeteer or Playwright can be inefficient because every run restarts the browser session and re-downloads assets through your proxy. Reconnects keeps a browser session alive, preserving cookies, cache, and session data across multiple requests, which brings the following benefits:

  • Reduced proxy usage: Save up to 90% of proxy bandwidth.
  • Improved efficiency: Avoid repetitive loading of static content.
  • Lower detection risk: Maintain consistent session states to avoid bot detection mechanisms like CAPTCHAs.

Implementing Reconnects with BrowserQL

This guide walks through the following steps to maintain a session using Reconnects:

  1. Initiate a session.
  2. Scrape data using the reconnect URL.
  3. Refresh your session URL regularly to maintain stability.

Step 1: Initial Setup

Start by initiating a session with BrowserQL. This first query opens the browser, navigates to your target URL, and provides a reconnect URL to reuse the same session.

import fetch from 'node-fetch';

const API_KEY = "YOUR_API_TOKEN_HERE";
const BQL_ENDPOINT = "https://production-sfo.browserless.io/chromium/bql";

const sessionQuery = `
mutation StartSession {
  goto(url: "https://example.com", waitUntil: networkIdle) {
    status
  }
  reconnect(timeout: 60000) { # Keeps the session open for 60 seconds
    BrowserQLEndpoint
  }
}`;

async function startSession() {
  const response = await fetch(BQL_ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ query: sessionQuery }),
  });

  const data = await response.json();
  console.log("Reconnect URL:", data.data.reconnect.BrowserQLEndpoint);
  return data.data.reconnect.BrowserQLEndpoint;
}

startSession();

Step 2: Using the Reconnect URL

Use the reconnect URL provided by the initial session setup to make subsequent queries without starting a new browser instance.

const RECONNECT_BQL_ENDPOINT = "YOUR_RECONNECT_BQL_ENDPOINT";

const scrapeQuery = `
mutation FetchData {
  text(selector: ".product-title") {
    text
  }
}`;

async function fetchData() {
  const response = await fetch(RECONNECT_BQL_ENDPOINT, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: scrapeQuery }),
  });

  const data = await response.json();
  console.log("Fetched Data:", data.data.text.text);
}

fetchData();

Step 3: Refreshing Your Session

Each reconnect URL is only valid for the timeout you requested. To avoid instability, request a fresh reconnect URL before the current one expires.

const refreshQuery = `
mutation RefreshSession {
  reconnect(timeout: 60000) { # Extends the session timeout
    BrowserQLEndpoint
  }
}`;

async function refreshSession() {
  const response = await fetch(BQL_ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ query: refreshQuery }),
  });

  const data = await response.json();
  console.log("New Reconnect URL:", data.data.reconnect.BrowserQLEndpoint);
  return data.data.reconnect.BrowserQLEndpoint;
}

Full Example Code

Here's a complete example demonstrating all steps together:

import fetch from 'node-fetch';

const API_KEY = "YOUR_API_TOKEN_HERE";
const BQL_ENDPOINT = "https://production-sfo.browserless.io/chromium/bql";

const sessionQuery = `
mutation StartSession {
  goto(url: "https://example.com", waitUntil: networkIdle) {
    status
  }
  reconnect(timeout: 60000) { # Keeps the session open for 60 seconds
    BrowserQLEndpoint
  }
}`;

async function startSession() {
  const response = await fetch(BQL_ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ query: sessionQuery }),
  });

  const data = await response.json();
  console.log("Reconnect URL:", data.data.reconnect.BrowserQLEndpoint);
  return data.data.reconnect.BrowserQLEndpoint;
}

async function fetchData(reconnectUrl) {
  const scrapeQuery = `
mutation FetchData {
  text(selector: ".product-title") {
    text
  }
}`;

  const response = await fetch(reconnectUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ query: scrapeQuery }),
  });

  const data = await response.json();
  console.log("Fetched Data:", data.data.text.text);
  return data.data.text.text; // Return the scraped text so callers can use it
}

async function refreshSession() {
  const refreshQuery = `
mutation RefreshSession {
  reconnect(timeout: 60000) { # Extends the session timeout
    BrowserQLEndpoint
  }
}`;

  const response = await fetch(BQL_ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ query: refreshQuery }),
  });

  const data = await response.json();
  console.log("New Reconnect URL:", data.data.reconnect.BrowserQLEndpoint);
  return data.data.reconnect.BrowserQLEndpoint;
}

(async () => {
  let reconnectUrl = await startSession();
  let pagesScraped = 0;
  const PAGE_LIMIT = 20;

  for (let i = 0; i < 100; i++) {
    if (pagesScraped >= PAGE_LIMIT) {
      reconnectUrl = await refreshSession();
      pagesScraped = 0;
    }

    const data = await fetchData(reconnectUrl);
    console.log(`Scraped Page ${i + 1}:`, data);
    pagesScraped++;
  }
})();

Improving Efficiency with BrowserQL's Reject API

BrowserQL also lets you reject unnecessary requests (e.g., images, media) to optimize resource usage:

mutation OptimizeSession {
  setRequestInterception(enabled: true)
  reject(patterns: ["*.png", "*.jpg", "*.mp4"])
}

Use this to further streamline your scraping tasks.
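
The mutation above can be sent over the same BQL endpoint as any other query. The sketch below is one possible way to do that; the endpoint, token, and pattern list are placeholders, and `buildRejectMutation` is an illustrative helper, not part of the Browserless API:

```javascript
const API_KEY = "YOUR_API_TOKEN_HERE";
const BQL_ENDPOINT = "https://production-sfo.browserless.io/chromium/bql";

// Build the OptimizeSession mutation from a list of URL patterns to reject.
function buildRejectMutation(patterns) {
  const list = patterns.map((p) => JSON.stringify(p)).join(", ");
  return `
mutation OptimizeSession {
  setRequestInterception(enabled: true)
  reject(patterns: [${list}])
}`;
}

// Send the mutation using the global fetch available in Node 18+.
async function optimizeSession(patterns) {
  const response = await fetch(BQL_ENDPOINT, {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${API_KEY}`,
    },
    body: JSON.stringify({ query: buildRejectMutation(patterns) }),
  });
  return response.json();
}
```

Run this once at the start of a session so every subsequent page load skips the blocked asset types.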

Common Issues

CAPTCHA Challenges

If you encounter CAPTCHAs, keep your interaction patterns human-like: reduce request rates, space out queries, and reuse a stable session rather than opening a fresh browser for every request.
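
One simple way to reduce request rates is a randomized delay between queries. This is a generic pacing sketch, not a Browserless feature, and the 1-3 second bounds are illustrative defaults only:

```javascript
// Resolve after ms milliseconds.
function sleep(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Wait a random delay between minMs and maxMs, then resolve with the
// delay that was actually used.
async function throttle(minMs = 1000, maxMs = 3000) {
  const delay = minMs + Math.random() * (maxMs - minMs);
  await sleep(delay);
  return delay;
}
```

Calling `await throttle()` between scraping queries in your loop spreads requests out so traffic looks less machine-like.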

Session Timeouts

Set appropriate timeout values to ensure sessions remain active without resource leaks:

reconnect(timeout: 120000)
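
To stay ahead of the timeout, schedule refreshes with a safety margin. Refreshing at around 80% of the timeout is an illustrative rule of thumb, not a documented Browserless requirement:

```javascript
// Compute how long to wait before refreshing a reconnect URL, leaving a
// safety margin before the session timeout lapses.
function refreshIntervalMs(timeoutMs, margin = 0.8) {
  return Math.floor(timeoutMs * margin);
}

// Example: with a 120-second timeout, refresh after 96 seconds.
const interval = refreshIntervalMs(120000);
```

A timer built on this interval (for example, calling the `refreshSession()` function from Step 3) keeps the session alive without letting idle browsers leak resources.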