/export API

Currently, Browserless V2 is available in production via two domains: production-sfo.browserless.io and production-lon.browserless.io

The export API allows you to retrieve the content of any URL in its native format (HTML, PDF, images, etc.). The response format is determined by the content type of the page being accessed, with appropriate headers set to facilitate downloading or viewing the content.

For the complete request and response definitions, see the full OpenAPI schema.

Basic Usage

The export API accepts a JSON payload with the target URL and configuration options.

JSON Payload Format

{
  "url": "https://example.com/",
  "headers": {
    "User-Agent": "Custom User Agent"
  },
  "gotoOptions": {
    "waitUntil": "networkidle0",
    "timeout": 30000
  },
  "waitForSelector": {
    "selector": "#main-content",
    "timeout": 5000
  },
  "waitForTimeout": 1000,
  "bestAttempt": false
}

Parameters

Required Parameters

  • url (string) - The URL of the resource to export

Optional Parameters

  • headers (object) - Custom HTTP headers to send with the request
  • gotoOptions (object) - Navigation options
    • waitUntil (string) - When to consider navigation complete. Options: 'load', 'domcontentloaded', 'networkidle0', 'networkidle2'. Default: 'networkidle0'
    • timeout (number) - Maximum navigation time in milliseconds
    • referer (string) - Referer header value
  • waitForEvent (object) - Wait for a specific event before proceeding
  • waitForFunction (object) - Wait for a specific function to return true
  • waitForSelector (object) - Wait for a specific selector to be present
    • selector (string) - CSS selector to wait for
    • timeout (number) - Maximum time to wait in milliseconds
  • waitForTimeout (number) - Time in milliseconds to wait after page load
  • bestAttempt (boolean) - Whether to continue on errors. Default: false
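
As a rough sketch of how the parameters above fit together, the following JavaScript assembles a request for the export endpoint. The helper name buildExportRequest and the placeholder token are illustrative and not part of the API; the host and field names are taken from this page.

```javascript
// Hypothetical helper: builds the endpoint URL and fetch options for /export.
// The host and field names come from this page; everything else is illustrative.
function buildExportRequest(token, payload) {
  if (!payload || typeof payload.url !== 'string' || payload.url === '') {
    throw new TypeError('payload.url is required');
  }
  const endpoint = new URL('https://production-sfo.browserless.io/export');
  endpoint.searchParams.set('token', token);
  return {
    endpoint: endpoint.toString(),
    options: {
      method: 'POST',
      headers: { 'Content-Type': 'application/json' },
      body: JSON.stringify(payload),
    },
  };
}

const request = buildExportRequest('YOUR_API_TOKEN_HERE', {
  url: 'https://example.com/',
  gotoOptions: { waitUntil: 'networkidle0', timeout: 30000 },
  waitForSelector: { selector: '#main-content', timeout: 5000 },
  bestAttempt: false,
});
// Inside an async function:
// const response = await fetch(request.endpoint, request.options);
```

Rejecting a payload without a url before sending mirrors the only required parameter and avoids a round trip that would end in a 400 response.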

Response

The API returns a streaming response with the content of the requested URL. The behavior depends on the content type detected:

  • HTML Content: Returns the HTML with Content-Type: text/html. No attachment header is set, allowing the content to be rendered in the browser.
  • PDF Content: Returns a PDF buffer with Content-Type: application/pdf and sets a Content-Disposition: attachment header with an appropriate filename.
  • Images and Other Binary Content: Returns the binary content with the appropriate MIME type (e.g., image/jpeg, image/png) and sets a Content-Disposition: attachment header with an appropriate filename.

The streaming nature of the response means you should handle it accordingly in your code, using appropriate methods for reading streams rather than assuming all content can be processed as text.
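
A minimal sketch of stream-aware handling, assuming a runtime with the standard fetch Response object (for example Node.js 18 or later). Both function names are hypothetical. The first reads the filename from a simple Content-Disposition value; the second reads the body chunk by chunk instead of treating it as text.

```javascript
// Extracts a plain filename="..." value from a Content-Disposition header.
// This sketch does not handle the extended filename*= form.
function filenameFromDisposition(header) {
  if (!header) return null;
  const match = /filename="?([^";]+)"?/i.exec(header);
  return match ? match[1].trim() : null;
}

// Reads a fetch Response body chunk by chunk and returns the bytes as one Buffer.
async function readStreamToBuffer(response) {
  const reader = response.body.getReader();
  const chunks = [];
  for (;;) {
    const { done, value } = await reader.read();
    if (done) break;
    chunks.push(value); // each value is a Uint8Array
  }
  return Buffer.concat(chunks);
}
```

For very large exports, each chunk could be written to disk as it arrives instead of being collected in memory.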

Handling Different Content Types

The export API can return various content types depending on the URL being accessed. Here's how to properly handle the different response types:

HTML Content

When accessing a standard web page, the API returns HTML content with Content-Type: text/html:

const response = await fetch(url, options);
if (response.headers.get('content-type')?.includes('text/html')) {
  const htmlContent = await response.text();
  // Process HTML content
}

PDF Content

When accessing PDF files or when the server returns PDF content, the API returns a PDF buffer with Content-Type: application/pdf:

const response = await fetch(url, options);
if (response.headers.get('content-type')?.includes('application/pdf')) {
  const arrayBuffer = await response.arrayBuffer();
  const pdfBuffer = Buffer.from(arrayBuffer);
  // Save or process PDF buffer
}

Binary Content (Images, etc.)

For other binary content like images, the API returns the appropriate content type and sets attachment headers:

const response = await fetch(url, options);
const contentType = response.headers.get('content-type');
if (contentType?.includes('image/') || !contentType?.includes('text/')) {
  const arrayBuffer = await response.arrayBuffer();
  const binaryBuffer = Buffer.from(arrayBuffer);
  // Save or process binary buffer
}
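
The three cases above can be folded into a single helper. This is one possible arrangement rather than an official client: readExportBody and the kind labels are made-up names, and the function assumes a standard fetch Response.

```javascript
// Hypothetical helper: reads an /export response according to its Content-Type.
async function readExportBody(response) {
  const contentType = response.headers.get('content-type') ?? '';
  if (contentType.includes('text/html')) {
    // HTML is text, so decode it as a string.
    return { kind: 'html', contentType, data: await response.text() };
  }
  // PDFs, images, and anything else are kept as raw bytes.
  const data = Buffer.from(await response.arrayBuffer());
  const kind = contentType.includes('application/pdf') ? 'pdf' : 'binary';
  return { kind, contentType, data };
}
```

Checking for HTML first and treating everything else as bytes means an unexpected or missing content type is never decoded as text by accident.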

Error Handling

The API may return the following error responses:

  • 400 Bad Request - Invalid parameters, missing URL, or no content received
  • 404 Not Found - Page not found
  • 408 Request Timeout - Page load timeout
  • 500 Internal Server Error - Server-side error
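
One way to surface these statuses in client code is a small error type. The sketch below is illustrative: ExportError and assertExportOk are hypothetical names, and the retryable flags reflect an assumed policy (retry timeouts and server errors, not client errors) rather than documented API behavior.

```javascript
// Illustrative mapping; the retry decisions are assumptions, not documented behavior.
const EXPORT_ERRORS = {
  400: { reason: 'Invalid parameters, missing URL, or no content received', retryable: false },
  404: { reason: 'Page not found', retryable: false },
  408: { reason: 'Page load timeout', retryable: true },
  500: { reason: 'Server-side error', retryable: true },
};

class ExportError extends Error {
  constructor(status) {
    const known = EXPORT_ERRORS[status];
    super(`Export failed with HTTP ${status}: ${known ? known.reason : 'Unexpected status'}`);
    this.name = 'ExportError';
    this.status = status;
    this.retryable = known ? known.retryable : false;
  }
}

// Throws an ExportError for non-2xx responses and returns the response otherwise.
function assertExportOk(response) {
  if (!response.ok) throw new ExportError(response.status);
  return response;
}
```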

Examples

Basic Export Request

This example exports a web page using the most basic configuration: only the required url parameter. Because the response is streamed and its type depends on the target, check the Content-Type header before saving the output so the file gets an appropriate extension.

curl -X POST \
  'https://production-sfo.browserless.io/export?token=YOUR_API_TOKEN_HERE' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/"
  }'
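
For comparison, here is a hedged JavaScript sketch of the same request that also picks a file extension from the response's content type. The names extensionFor and exportPage, and the extension table itself, are assumptions made for illustration.

```javascript
// Illustrative mapping from MIME type to file extension; extend as needed.
const EXTENSIONS = {
  'text/html': '.html',
  'application/pdf': '.pdf',
  'image/png': '.png',
  'image/jpeg': '.jpg',
};

function extensionFor(contentType) {
  // Drop parameters such as "; charset=utf-8" before the lookup.
  const mime = (contentType ?? '').split(';')[0].trim().toLowerCase();
  return EXTENSIONS[mime] ?? '.bin';
}

// Hypothetical helper: posts the basic payload and returns a filename plus the bytes.
async function exportPage(token, targetUrl, basename) {
  const endpoint =
    `https://production-sfo.browserless.io/export?token=${encodeURIComponent(token)}`;
  const response = await fetch(endpoint, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ url: targetUrl }),
  });
  if (!response.ok) throw new Error(`Export failed with HTTP ${response.status}`);
  const filename = basename + extensionFor(response.headers.get('content-type'));
  const data = Buffer.from(await response.arrayBuffer());
  return { filename, data };
}
```

In Node.js the returned bytes can then be saved with fs.promises.writeFile(filename, data).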

Export with Custom Navigation Options

This example demonstrates how to export a web page with custom navigation options, such as waiting for specific network events or DOM elements to load. These options help ensure the page is fully rendered before capturing the content.

curl -X POST \
  'https://production-sfo.browserless.io/export?token=YOUR_API_TOKEN_HERE' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/",
    "gotoOptions": {
      "waitUntil": "networkidle0",
      "timeout": 60000
    },
    "waitForSelector": {
      "selector": "#main-content",
      "timeout": 5000
    }
  }'

Export with Custom Headers

This example demonstrates how to export a web page with custom HTTP headers. Custom headers allow you to modify the browser's behavior when accessing the page, such as changing the User-Agent or setting language preferences.

curl -X POST \
  'https://production-sfo.browserless.io/export?token=YOUR_API_TOKEN_HERE' \
  -H 'Content-Type: application/json' \
  -d '{
    "url": "https://example.com/",
    "headers": {
      "User-Agent": "Custom User Agent",
      "Accept-Language": "en-US"
    }
  }'

Best Practices

  1. Page Load Strategies

    • Use appropriate waitUntil options based on your needs:
      • load - Wait for the load event (good for static pages)
      • domcontentloaded - Wait for the DOMContentLoaded event (faster but may miss dynamic content)
      • networkidle0 - Wait until there are no network connections for at least 500ms (good for single-page applications)
      • networkidle2 - Wait until there are no more than 2 network connections for at least 500ms (good for pages with background activity)
  2. Timeout Management

    • Set reasonable timeout values based on your target page's complexity
    • Consider increasing timeouts for:
      • Pages with heavy JavaScript execution
      • Pages with large media files
      • Pages with complex animations
      • Pages with slow network conditions
  3. Content Waiting

    • Use waitForSelector when you need to ensure specific content is loaded
    • Combine with waitForTimeout for additional stability
    • Consider using multiple selectors for critical content
    • Use bestAttempt: true for more resilient scraping, but be aware it may return incomplete content
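
The guidance above could be captured as a few presets. The sketch below is only an illustration: the page categories, the timeout values, and the buildPayload helper are assumptions, while the waitUntil values and payload fields are the ones described on this page.

```javascript
// Illustrative presets; the page categories and timeout values are assumptions.
const GOTO_PRESETS = {
  static: { waitUntil: 'load', timeout: 30000 },
  fast: { waitUntil: 'domcontentloaded', timeout: 15000 },
  spa: { waitUntil: 'networkidle0', timeout: 60000 },
  background: { waitUntil: 'networkidle2', timeout: 60000 },
};

// Hypothetical helper: builds an /export payload from a preset and an optional selector.
function buildPayload(url, pageKind, selector) {
  const preset = GOTO_PRESETS[pageKind];
  if (!preset) throw new RangeError(`Unknown page kind: ${pageKind}`);
  const payload = { url, gotoOptions: { ...preset } };
  if (selector) {
    // Wait for specific content in addition to the navigation event.
    payload.waitForSelector = { selector, timeout: 5000 };
  }
  return payload;
}

const spaPayload = buildPayload('https://example.com/', 'spa', '#main-content');
```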

Resource Management

  1. Asset Handling

    • Use includeAssets wisely to control export size
    • Consider excluding unnecessary resource types:
      • Images for text-only exports
      • Stylesheets for raw content
      • Scripts for static content
    • Use rejectResourceTypes to filter specific asset types
    • Implement size limits for large resources
  2. Network Optimization

    • Use rejectRequestPattern to exclude unnecessary requests
    • Consider implementing request throttling
    • Cache frequently accessed resources
    • Monitor and optimize network usage

Error Handling and Reliability

  1. Robust Error Handling

    • Implement proper error handling for:
      • Network timeouts
      • Resource loading failures
      • Invalid URLs
      • Rate limiting
    • Branch on the HTTP status code the API returns instead of treating every failure the same way
    • Implement retry mechanisms for transient failures
  2. Content Validation

    • Verify content completeness
    • Check for expected elements
    • Validate content structure
    • Implement checksums for critical content
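
A minimal sketch of two of the ideas above, assuming Node.js: retrying with exponential backoff, and computing a checksum of exported content. The names withRetry and sha256Hex, the default attempt count, and the convention that an error with retryable set to false stops the loop are all assumptions made for this example.

```javascript
const { createHash } = require('node:crypto');

const defaultSleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

// Retries an async operation, doubling the delay after each failed attempt.
async function withRetry(operation, options = {}) {
  const { attempts = 3, baseDelayMs = 500, sleep = defaultSleep } = options;
  let lastError;
  for (let attempt = 0; attempt < attempts; attempt += 1) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      const isLastAttempt = attempt === attempts - 1;
      // Stop early when the caller marks the error as not worth retrying.
      if (error?.retryable === false || isLastAttempt) break;
      await sleep(baseDelayMs * 2 ** attempt); // 500 ms, 1000 ms, 2000 ms, ...
    }
  }
  throw lastError;
}

// Returns the SHA-256 digest of a string or Buffer as lowercase hex.
function sha256Hex(data) {
  return createHash('sha256').update(data).digest('hex');
}
```

Storing the digest next to each export makes it cheap to detect when a later export of the same page has changed.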

Security Considerations

  1. URL and Content Safety

    • Always use HTTPS URLs when possible
    • Validate URLs before making requests
    • Sanitize user-provided URLs
    • Implement content size limits
    • Be cautious when setting custom headers
  2. Authentication and Authorization

    • Use secure methods for API token storage
    • Implement proper access controls
    • Monitor and log access attempts
    • Rotate API tokens regularly
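
As an illustration of validating user-provided URLs and keeping the token out of source code, the following sketch uses the standard URL parser. The validateTargetUrl helper, its rules, and the BROWSERLESS_TOKEN variable name are assumptions; adjust them to your own policy.

```javascript
// Hypothetical pre-flight check for user-supplied URLs.
function validateTargetUrl(input, { requireHttps = true } = {}) {
  let parsed;
  try {
    parsed = new URL(input);
  } catch {
    return { ok: false, reason: 'not a valid absolute URL' };
  }
  const allowed = requireHttps ? ['https:'] : ['https:', 'http:'];
  if (!allowed.includes(parsed.protocol)) {
    return { ok: false, reason: `protocol ${parsed.protocol} is not allowed` };
  }
  if (parsed.username || parsed.password) {
    return { ok: false, reason: 'embedded credentials are not allowed' };
  }
  return { ok: true, url: parsed.toString() };
}

// Read the token from the environment rather than hard-coding it.
// BROWSERLESS_TOKEN is an assumed variable name.
const apiToken = process.env.BROWSERLESS_TOKEN;
```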

Performance Optimization

  1. Export Size Management

    • Implement compression where appropriate
    • Use appropriate export formats
    • Consider splitting large exports
    • Implement cleanup mechanisms for temporary files
  2. Concurrent Operations

    • Implement proper rate limiting
    • Use appropriate concurrency levels
    • Monitor system resources
    • Implement queue management for high-volume operations
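
A simple way to cap concurrency on the client side is a fixed pool of workers that pull from a shared list. This is a sketch under the assumption that each item is exported by an async worker function you supply; mapWithConcurrency is a hypothetical name, and a suitable limit depends on your plan.

```javascript
// Runs worker(item, index) for every item with at most `limit` calls in flight.
// Results are returned in the same order as the input items.
async function mapWithConcurrency(items, limit, worker) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results = new Array(items.length);
  let next = 0;

  async function runner() {
    // Each runner keeps claiming the next unprocessed index until none remain.
    while (next < items.length) {
      const index = next;
      next += 1;
      results[index] = await worker(items[index], index);
    }
  }

  const runners = [];
  for (let i = 0; i < Math.min(limit, items.length); i += 1) {
    runners.push(runner());
  }
  await Promise.all(runners);
  return results;
}
```

If any worker call rejects, the returned promise rejects with that error, so pair this with per-item error handling when partial results are acceptable.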

Monitoring and Maintenance

  1. Logging and Monitoring

    • Implement comprehensive logging
    • Monitor success/failure rates
    • Track export sizes and durations
    • Set up alerts for failures
    • Monitor rate limit usage
  2. Maintenance

    • Regularly review and update selectors
    • Monitor for changes in target sites
    • Update error handling as needed
    • Review and optimize timeout values
    • Maintain documentation of changes

For additional support, please refer to the Browserless documentation or contact support.