OpenAI CUA Integration

OpenAI's Computer Use Agent (CUA) analyzes screenshots and returns structured actions — click, type, scroll — that Playwright executes in a Browserless cloud browser. This enables tasks like form filling, web research, and data extraction without managing browser infrastructure yourself.

Prerequisites

Browserless API token (available in your account dashboard)
OpenAI API key (get it from OpenAI API Keys)
Node.js 18+ or Python 3.10+

How it works

Capture screenshot — take a screenshot of the current browser state
Send to CUA model — call the Responses API with a computer tool
Execute actions — parse the computer_call response and run actions via Playwright
Loop — repeat until the task is complete
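
The four steps above can be sketched as a small driver loop. This is a minimal sketch rather than the guide's implementation: `run_agent_loop`, `take_screenshot`, `ask_model`, `execute_action`, and `max_steps` are hypothetical names, and the model and browser are passed in as plain callables so the control flow can be read without any network access.

```python
from typing import Callable, Optional


def run_agent_loop(
    take_screenshot: Callable[[], str],
    ask_model: Callable[[str], Optional[dict]],
    execute_action: Callable[[dict], None],
    max_steps: int = 50,
) -> int:
    """Drive the screenshot -> model -> action cycle.

    ask_model returns the next action as a dict, or None when the task is
    complete. Returns the number of actions executed. max_steps is a
    hypothetical safety cap so a confused model cannot loop forever.
    """
    steps = 0
    while steps < max_steps:
        screenshot = take_screenshot()  # 1. capture the current state
        action = ask_model(screenshot)  # 2. send it to the model
        if action is None:              # no action requested: task is done
            break
        execute_action(action)          # 3. run the action in the browser
        steps += 1                      # 4. loop
    return steps
```

The step cap is a design choice, not something the API requires: it bounds cost if the model never stops requesting actions.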

Step-by-Step Setup

In this guide you'll build an example that navigates to Bing, searches for "Browserless.io", and returns a summary of what the company does. We use stealth mode to avoid bot detection.

1. Set your API keys

Grab your Browserless token from your account dashboard and your OpenAI key from OpenAI API Keys.

.env file
Command line

BROWSERLESS_API_KEY=your-browserless-token
OPENAI_API_KEY=your-openai-key

export BROWSERLESS_API_KEY=your-browserless-token
export OPENAI_API_KEY=your-openai-key
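
Because both keys are read from the environment later in the guide, it can help to fail early when one is missing. The check below is a minimal sketch; `missing_env_vars` is a hypothetical helper, not part of either SDK.

```python
import os
from typing import Mapping, Sequence


def missing_env_vars(
    names: Sequence[str] = ("BROWSERLESS_API_KEY", "OPENAI_API_KEY"),
    env: Mapping[str, str] = os.environ,
) -> list[str]:
    """Return the variable names that are unset or blank in env."""
    return [name for name in names if not env.get(name, "").strip()]
```

Calling `missing_env_vars()` at startup and exiting with a clear message when the list is non-empty avoids a less obvious authentication error further into the run.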

2. Install dependencies

TypeScript
Python

npm install openai playwright-core typescript ts-node @types/node

pip install openai playwright

3. Connect to Browserless

Use Playwright's CDP connection with stealth mode (recommended for avoiding bot detection):

TypeScript
Python

import { chromium, Page } from "playwright-core";
import OpenAI from "openai";

const client = new OpenAI();
const browser = await chromium.connectOverCDP(
  `wss://production-sfo.browserless.io/chromium/stealth?token=${process.env.BROWSERLESS_API_KEY}`,
  { timeout: 60000 }
);
const context = await browser.newContext({ viewport: { width: 1024, height: 768 } });
const page = await context.newPage();

import os
import base64
from openai import OpenAI
from playwright.async_api import async_playwright

client = OpenAI()

p = await async_playwright().start()
browser = await p.chromium.connect_over_cdp(
    f"wss://production-sfo.browserless.io/chromium/stealth?token={os.environ['BROWSERLESS_API_KEY']}",
    timeout=60000
)
context = await browser.new_context(viewport={"width": 1024, "height": 768})
page = await context.new_page()

Tip: All subsequent Python snippets in this guide run inside the same async function. When you're done, call await browser.close() and await p.stop() to clean up.

4. Navigate and capture the initial screenshot

Navigate to Bing and capture the initial screenshot to send to the model. Over remote WebSocket connections, standard Playwright screenshots can time out, so we include a CDP fallback:

TypeScript
Python

await page.goto("https://www.bing.com", { waitUntil: "networkidle" });

async function getScreenshot(page: Page): Promise<string> {
  try {
    const buffer = await page.screenshot({ timeout: 10000 });
    return buffer.toString("base64");
  } catch {
    // Fallback: use CDP directly
    const cdp = await page.context().newCDPSession(page);
    const result = await cdp.send("Page.captureScreenshot", { format: "png" });
    await cdp.detach();
    return result.data;
  }
}

const screenshotBase64 = await getScreenshot(page);

await page.goto("https://www.bing.com", wait_until="networkidle")

async def get_screenshot(page) -> str:
    try:
        screenshot_bytes = await page.screenshot(timeout=10000)
        return base64.b64encode(screenshot_bytes).decode("utf-8")
    except Exception:
        # Fallback: use CDP directly
        cdp = await page.context.new_cdp_session(page)
        result = await cdp.send("Page.captureScreenshot", {"format": "png"})
        await cdp.detach()
        return result["data"]

screenshot_base64 = await get_screenshot(page)

5. Send the initial request to the model

Define the task and send it along with the screenshot:

TypeScript
Python

const task = "Search for 'Browserless.io' and tell me what the company does";

let response = await client.responses.create({
  model: "computer-use-preview",
  tools: [
    {
      type: "computer_use_preview",
      display_width: 1024,
      display_height: 768,
      environment: "browser",
    },
  ],
  input: [
    {
      role: "user",
      content: [
        { type: "input_text", text: task },
        {
          type: "input_image",
          image_url: `data:image/png;base64,${screenshotBase64}`,
        },
      ],
    },
  ],
  truncation: "auto",
});

task = "Search for 'Browserless.io' and tell me what the company does"

response = client.responses.create(
    model="computer-use-preview",
    tools=[
        {
            "type": "computer_use_preview",
            "display_width": 1024,
            "display_height": 768,
            "environment": "browser",
        }
    ],
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": task},
                {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{screenshot_base64}",
                },
            ],
        }
    ],
    truncation="auto",
)

6. Process actions and loop

The model returns a computer_call item with an action to execute. Run the action, capture a new screenshot, and send it back. Repeat until no more computer_call items appear (task complete).

Note: The model may return key names like CTRL or CMD that Playwright doesn't recognize. The examples below map these to Playwright's expected format (e.g., Control, Meta).

TypeScript
Python

// Map model key names to Playwright key names
const keyMap: Record<string, string> = {
  enter: "Enter", return: "Enter",
  ctrl: "Control", cmd: "Meta",
  esc: "Escape", backspace: "Backspace",
  tab: "Tab", space: "Space",
  up: "ArrowUp", down: "ArrowDown",
  left: "ArrowLeft", right: "ArrowRight",
};

while (true) {
  const computerCalls = response.output.filter(
    (item: { type: string }) => item.type === "computer_call"
  );

  if (computerCalls.length === 0) {
    // Task complete — print result
    console.log(response.output_text);
    break;
  }

  const computerCall = computerCalls[0];
  const action = computerCall.action;

  switch (action.type) {
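    case "click":
      // Use the requested button; "wheel" maps to Playwright's "middle", anything else to "left"
      await page.mouse.click(action.x, action.y, {
        button: action.button === "right" ? "right" : action.button === "wheel" ? "middle" : "left",
      });
      break;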
    case "double_click":
      await page.mouse.dblclick(action.x, action.y);
      break;
    case "type":
      await page.keyboard.type(action.text);
      break;
    case "keypress": {
      const mappedKeys = action.keys.map(
        (key: string) => keyMap[key.toLowerCase()] || key
      );
      await page.keyboard.press(mappedKeys.join("+"));
      break;
    }
    case "scroll":
      await page.mouse.move(action.x, action.y);
      await page.evaluate(
        `window.scrollBy(${action.scroll_x}, ${action.scroll_y})`
      );
      break;
    case "screenshot":
      // Model wants a fresh screenshot — just continue
      break;
  }

  // Capture new screenshot and send back
  const newScreenshot = await getScreenshot(page);

  response = await client.responses.create({
    model: "computer-use-preview",
    previous_response_id: response.id,
    tools: [
      {
        type: "computer_use_preview",
        display_width: 1024,
        display_height: 768,
        environment: "browser",
      },
    ],
    input: [
      {
        type: "computer_call_output",
        call_id: computerCall.call_id,
        output: {
          type: "input_image",
          image_url: `data:image/png;base64,${newScreenshot}`,
        },
      },
    ],
    truncation: "auto",
  });
}

# Map model key names to Playwright key names
key_map = {
    "enter": "Enter", "return": "Enter",
    "ctrl": "Control", "cmd": "Meta",
    "esc": "Escape", "backspace": "Backspace",
    "tab": "Tab", "space": "Space",
    "up": "ArrowUp", "down": "ArrowDown",
    "left": "ArrowLeft", "right": "ArrowRight",
}

while True:
    computer_calls = [
        item for item in response.output
        if item.type == "computer_call"
    ]

    if not computer_calls:
        # Task complete — print result
        print(response.output_text)
        break

    computer_call = computer_calls[0]
    action = computer_call.action

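    if action.type == "click":
        # Use the requested button; "wheel" maps to Playwright's "middle", anything else to "left"
        button = {"right": "right", "wheel": "middle"}.get(action.button, "left")
        await page.mouse.click(action.x, action.y, button=button)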
    elif action.type == "double_click":
        await page.mouse.dblclick(action.x, action.y)
    elif action.type == "type":
        await page.keyboard.type(action.text)
    elif action.type == "keypress":
        mapped_keys = [key_map.get(key.lower(), key) for key in action.keys]
        await page.keyboard.press("+".join(mapped_keys))
    elif action.type == "scroll":
        await page.mouse.move(action.x, action.y)
        await page.evaluate(
            f"window.scrollBy({action.scroll_x}, {action.scroll_y})"
        )
    elif action.type == "screenshot":
        pass  # Model wants a fresh screenshot — just continue

    # Capture new screenshot and send back
    screenshot_base64 = await get_screenshot(page)

    response = client.responses.create(
        model="computer-use-preview",
        previous_response_id=response.id,
        tools=[
            {
                "type": "computer_use_preview",
                "display_width": 1024,
                "display_height": 768,
                "environment": "browser",
            }
        ],
        input=[
            {
                "type": "computer_call_output",
                "call_id": computer_call.call_id,
                "output": {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{screenshot_base64}",
                },
            }
        ],
        truncation="auto",
    )
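
The initial request and every follow-up request repeat the same tool definition and screenshot wrapper. One way to avoid that repetition is a pair of small builders. This is a sketch only: `computer_tool` and `screenshot_output` are hypothetical helper names, and the dictionaries they return simply mirror the payloads shown above.

```python
def computer_tool(width: int = 1024, height: int = 768) -> dict:
    """Tool definition matching the browser viewport used in this guide."""
    return {
        "type": "computer_use_preview",
        "display_width": width,
        "display_height": height,
        "environment": "browser",
    }


def screenshot_output(call_id: str, screenshot_base64: str) -> dict:
    """Wrap a base64 PNG as the computer_call_output item for call_id."""
    return {
        "type": "computer_call_output",
        "call_id": call_id,
        "output": {
            "type": "input_image",
            "image_url": f"data:image/png;base64,{screenshot_base64}",
        },
    }
```

With these, a follow-up request could pass `tools=[computer_tool()]` and `input=[screenshot_output(computer_call.call_id, screenshot_base64)]`. Keeping the width and height in one place also makes it harder for the tool definition to drift from the viewport set in Step 3.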

Supported actions

Action	Properties	Description
`click`	`x`, `y`, `button`	Click at coordinates
`double_click`	`x`, `y`	Double-click at coordinates
`type`	`text`	Type text
`keypress`	`keys[]`	Press keyboard keys
`scroll`	`x`, `y`, `scroll_x`, `scroll_y`	Scroll at position
`drag`	`path[]` (points with `x`, `y`)	Drag along a path of points
`move`	`x`, `y`	Move the pointer to coordinates
`wait`	-	Pause briefly before the next screenshot
`screenshot`	-	Request new screenshot
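
The Step 6 loop handles click, double_click, type, keypress, scroll, and screenshot, so any other action type falls through without doing anything. A minimal sketch of handlers for the remaining cases follows. It assumes, based on OpenAI's published action schema rather than on this guide, that `drag` carries a `path` list of points with `x` and `y`, and that `move` carries `x` and `y`. `execute_extra_action` and `DEFAULT_WAIT_MS` are hypothetical names, and the one-second pause is an arbitrary choice.

```python
from typing import Any

DEFAULT_WAIT_MS = 1000  # hypothetical pause length for "wait" actions


async def execute_extra_action(page: Any, action: Any) -> bool:
    """Run a move, drag, or wait action against a Playwright page.

    Returns True if the action was handled and False otherwise, so the
    caller can fall back to its own click/type/scroll handling.
    """
    if action.type == "move":
        await page.mouse.move(action.x, action.y)
    elif action.type == "drag":
        points = action.path  # assumed shape: objects with .x and .y
        if points:
            await page.mouse.move(points[0].x, points[0].y)
            await page.mouse.down()
            for point in points[1:]:
                await page.mouse.move(point.x, point.y)
            await page.mouse.up()
    elif action.type == "wait":
        await page.wait_for_timeout(DEFAULT_WAIT_MS)
    else:
        return False
    return True
```

In the Python loop this could run first, for example `handled = await execute_extra_action(page, action)`, with the existing if/elif chain used only when `handled` is false.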

Advanced configuration

Without stealth mode

If you don't need anti-detection and just want a managed cloud browser:

TypeScript
Python

const browser = await chromium.connectOverCDP(
  `wss://production-sfo.browserless.io?token=${process.env.BROWSERLESS_API_KEY}`
);

browser = await p.chromium.connect_over_cdp(
    f"wss://production-sfo.browserless.io?token={os.environ['BROWSERLESS_API_KEY']}"
)

Residential proxies

Route traffic through real residential IPs for additional anti-detection:

TypeScript
Python

const browser = await chromium.connectOverCDP(
  `wss://production-sfo.browserless.io?token=${process.env.BROWSERLESS_API_KEY}&proxy=residential&proxyCountry=us`
);

browser = await p.chromium.connect_over_cdp(
    f"wss://production-sfo.browserless.io?token={os.environ['BROWSERLESS_API_KEY']}&proxy=residential&proxyCountry=us"
)
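
The connection URLs in this guide differ only in their path and query parameters, so they can be assembled by one function instead of hand-written f-strings. The helper below is a sketch; `browserless_ws_url` is a hypothetical name, and the only parameters it knows about are the ones you pass in. It URL-encodes values, which matters if a token or parameter ever contains reserved characters.

```python
from urllib.parse import urlencode


def browserless_ws_url(
    token: str,
    host: str = "production-sfo.browserless.io",
    path: str = "",
    **params: str,
) -> str:
    """Build a Browserless WebSocket URL with URL-encoded query parameters."""
    query = urlencode({"token": token, **params})
    return f"wss://{host}{path}?{query}"
```

For example, `browserless_ws_url(token, path="/chromium/stealth")` reproduces the stealth URL from Step 3, and `browserless_ws_url(token, proxy="residential", proxyCountry="us")` reproduces the residential proxy URL above.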

Regional endpoints

Connect to the closest region for lower latency. See Connection URLs for all available endpoints.

Troubleshooting

Screenshot timeout

If the screenshot still times out after the CDP fallback in Step 4, increase the screenshot timeout or check your network connection to Browserless. You can also increase the Playwright connection timeout:

TypeScript
Python

const browser = await chromium.connectOverCDP(url, { timeout: 120000 });

browser = await p.chromium.connect_over_cdp(url, timeout=120000)
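
For intermittent timeouts, retrying the screenshot a few times with a growing delay is another option. The wrapper below is a generic sketch, not a Browserless or Playwright feature: `with_retries`, `attempts`, and `base_delay` are hypothetical names, and the defaults are arbitrary.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


async def with_retries(
    operation: Callable[[], Awaitable[T]],
    attempts: int = 3,
    base_delay: float = 0.5,
) -> T:
    """Await operation(), retrying on any exception with exponential backoff."""
    for attempt in range(attempts):
        try:
            return await operation()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Delay doubles each time: base_delay, 2x, 4x, ...
            await asyncio.sleep(base_delay * 2 ** attempt)
    raise ValueError("attempts must be at least 1")
```

Inside the async function from Step 3, this could wrap the existing helper as `screenshot_base64 = await with_retries(lambda: get_screenshot(page))`.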

Model returns unrecognized keys

The keyMap / key_map in Step 6 covers the most common mismatches. If you encounter new ones, add them to the map — the Playwright keyboard API docs list all valid key names.
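
Extending the map can be done without editing the original dictionary. In the sketch below, the right-hand values are key names Playwright accepts, while the left-hand aliases are guesses at what the model might emit and are not taken from OpenAI's documentation. `EXTRA_KEY_ALIASES` and `to_playwright_combo` are hypothetical names.

```python
# Hypothetical extra aliases; the values are Playwright key names.
EXTRA_KEY_ALIASES = {
    "alt": "Alt", "option": "Alt",
    "shift": "Shift",
    "control": "Control",
    "win": "Meta", "super": "Meta",
    "del": "Delete", "delete": "Delete",
    "pgup": "PageUp", "pageup": "PageUp",
    "pgdn": "PageDown", "pagedown": "PageDown",
    "home": "Home", "end": "End",
}


def to_playwright_combo(keys: list[str], key_map: dict[str, str]) -> str:
    """Join model key names into a Playwright combo such as "Control+a"."""
    merged = {**EXTRA_KEY_ALIASES, **key_map}  # the guide's map wins on conflict
    # Lookup is case-insensitive; unknown keys pass through unchanged.
    return "+".join(merged.get(key.lower(), key) for key in keys)
```

In the Python loop, `await page.keyboard.press(to_playwright_combo(action.keys, key_map))` would replace the two lines in the keypress branch.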
