OpenAI CUA Integration
OpenAI's Computer Use Agent (CUA) analyzes screenshots and returns structured actions — click, type, scroll — that Playwright executes in a Browserless cloud browser. This enables tasks like form filling, web research, and data extraction without managing browser infrastructure yourself.
How it works
- Capture screenshot: take a screenshot of the current browser state
- Send to CUA model: call the Responses API with a `computer` tool
- Execute actions: parse the `computer_call` response and run actions via Playwright
- Loop: repeat until the task is complete
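The four steps above can be sketched as a plain control loop before any real API is involved. This is a minimal illustration, not the guide's implementation: `run_agent_loop`, its three callables, and the `max_steps` cap are hypothetical names introduced here, and actions are modeled as simple dicts.

```python
from typing import Callable, Optional

# Hypothetical shape: an action is a plain dict such as {"type": "click", "x": 1, "y": 2}.
Action = dict


def run_agent_loop(
    take_screenshot: Callable[[], str],
    ask_model: Callable[[str], Optional[Action]],
    execute: Callable[[Action], None],
    max_steps: int = 25,
) -> int:
    """Run screenshot -> model -> action until the model returns no action.

    Returns the number of actions executed. max_steps is an added safety cap
    (an assumption of this sketch), so a confused model cannot loop forever.
    """
    for step in range(max_steps):
        screenshot = take_screenshot()  # capture the current browser state
        action = ask_model(screenshot)  # ask the model what to do next
        if action is None:              # no action means the task is complete
            return step
        execute(action)                 # run the action in the browser
    raise RuntimeError(f"task not finished after {max_steps} steps")
```

In the real integration, `take_screenshot` would wrap the Playwright page, `ask_model` would wrap the Responses API call, and `execute` would dispatch to Playwright mouse and keyboard methods.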
Prerequisites
- Browserless API token (available in your account dashboard)
- OpenAI API key (get it from OpenAI API Keys)
- Node.js 18+ or Python 3.10+
Step-by-Step Setup
In this guide you'll build an example that navigates to Bing, searches for "Browserless.io", and returns a summary of what the company does. We use stealth mode to avoid bot detection.
1. Set your API keys
Grab your Browserless token from your account dashboard and your OpenAI key from OpenAI API Keys.
.env file:

```
BROWSERLESS_API_KEY=your-browserless-token
OPENAI_API_KEY=your-openai-key
```

Command line:

```bash
export BROWSERLESS_API_KEY=your-browserless-token
export OPENAI_API_KEY=your-openai-key
```
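A missing key otherwise shows up later as a connection or authentication error, so it can help to check for both variables up front. The sketch below is an optional addition; `missing_env_vars` is a hypothetical helper name. Note that a `.env` file is only read if something loads it (for example the python-dotenv package), which this sketch does not do.

```python
import os

REQUIRED_VARS = ("BROWSERLESS_API_KEY", "OPENAI_API_KEY")


def missing_env_vars(env=os.environ, required=REQUIRED_VARS):
    """Return the names of required variables that are unset or blank."""
    return [name for name in required if not env.get(name, "").strip()]


# Typical use at the top of a script:
#   missing = missing_env_vars()
#   if missing:
#       raise SystemExit(f"Missing environment variables: {', '.join(missing)}")
```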
2. Install dependencies
TypeScript:

```bash
npm install openai playwright-core typescript ts-node @types/node
```

Python:

```bash
pip install openai playwright
```
3. Connect to Browserless
Use Playwright's CDP connection with stealth mode (recommended for avoiding bot detection):
TypeScript:

```typescript
import { chromium, Page } from "playwright-core";
import OpenAI from "openai";

const client = new OpenAI();

const browser = await chromium.connectOverCDP(
  `wss://production-sfo.browserless.io/chromium/stealth?token=${process.env.BROWSERLESS_API_KEY}`,
  { timeout: 60000 }
);
const context = await browser.newContext({ viewport: { width: 1024, height: 768 } });
const page = await context.newPage();
```

Python:

```python
import os
import base64

from openai import OpenAI
from playwright.async_api import async_playwright

client = OpenAI()

p = await async_playwright().start()
browser = await p.chromium.connect_over_cdp(
    f"wss://production-sfo.browserless.io/chromium/stealth?token={os.environ['BROWSERLESS_API_KEY']}",
    timeout=60000
)
context = await browser.new_context(viewport={"width": 1024, "height": 768})
page = await context.new_page()
```
The Python snippets in this guide, including the one above, use `await` and are meant to run inside a single async function. When you're done, call `await browser.close()` and `await p.stop()` to clean up.
4. Navigate and capture the initial screenshot
Navigate to Bing and capture the initial screenshot to send to the model. Over remote WebSocket connections, standard Playwright screenshots can time out, so we include a CDP fallback:
TypeScript:

```typescript
await page.goto("https://www.bing.com", { waitUntil: "networkidle" });

async function getScreenshot(page: Page): Promise<string> {
  try {
    const buffer = await page.screenshot({ timeout: 10000 });
    return buffer.toString("base64");
  } catch {
    // Fallback: use CDP directly
    const cdp = await page.context().newCDPSession(page);
    const result = await cdp.send("Page.captureScreenshot", { format: "png" });
    await cdp.detach();
    return result.data;
  }
}

const screenshotBase64 = await getScreenshot(page);
```

Python:

```python
await page.goto("https://www.bing.com", wait_until="networkidle")

async def get_screenshot(page) -> str:
    try:
        screenshot_bytes = await page.screenshot(timeout=10000)
        return base64.b64encode(screenshot_bytes).decode("utf-8")
    except Exception:
        # Fallback: use CDP directly
        cdp = await page.context.new_cdp_session(page)
        result = await cdp.send("Page.captureScreenshot", {"format": "png"})
        await cdp.detach()
        return result["data"]

screenshot_base64 = await get_screenshot(page)
```
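Both capture paths return a base64-encoded PNG, which the next step embeds in a `data:image/png;base64,...` URL. As an optional sanity check, the sketch below verifies the PNG signature before building that URL. `png_data_url` is a hypothetical helper, not part of Playwright or the OpenAI SDK.

```python
import base64

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"  # the first 8 bytes of every PNG file


def png_data_url(screenshot_base64: str) -> str:
    """Check that the payload decodes to a PNG, then wrap it as a data URL."""
    # validate=True rejects characters outside the base64 alphabet
    raw = base64.b64decode(screenshot_base64, validate=True)
    if not raw.startswith(PNG_SIGNATURE):
        raise ValueError("screenshot is not a PNG image")
    return f"data:image/png;base64,{screenshot_base64}"
```

Catching a truncated or empty capture here tends to be easier to debug than an image error returned by the API.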
5. Send the initial request to the model
Define the task and send it along with the screenshot:
TypeScript:

```typescript
const task = "Search for 'Browserless.io' and tell me what the company does";

let response = await client.responses.create({
  model: "computer-use-preview",
  tools: [
    {
      type: "computer_use_preview",
      display_width: 1024,
      display_height: 768,
      environment: "browser",
    },
  ],
  input: [
    {
      role: "user",
      content: [
        { type: "input_text", text: task },
        {
          type: "input_image",
          image_url: `data:image/png;base64,${screenshotBase64}`,
        },
      ],
    },
  ],
  truncation: "auto",
});
```

Python:

```python
task = "Search for 'Browserless.io' and tell me what the company does"

response = client.responses.create(
    model="computer-use-preview",
    tools=[
        {
            "type": "computer_use_preview",
            "display_width": 1024,
            "display_height": 768,
            "environment": "browser",
        }
    ],
    input=[
        {
            "role": "user",
            "content": [
                {"type": "input_text", "text": task},
                {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{screenshot_base64}",
                },
            ],
        }
    ],
    truncation="auto",
)
```
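The request uses the same 1024x768 size as the browser viewport from Step 3. Because the model's click coordinates refer to the screenshot it sees, it seems safest to derive both from one constant. The sketch below shows one way to do that; `VIEWPORT`, `computer_tool`, and `first_request` are hypothetical names, and the payload simply mirrors the request shown above.

```python
VIEWPORT = {"width": 1024, "height": 768}


def computer_tool(viewport: dict = VIEWPORT, environment: str = "browser") -> dict:
    """Build the computer-use tool definition from the viewport size."""
    return {
        "type": "computer_use_preview",
        "display_width": viewport["width"],
        "display_height": viewport["height"],
        "environment": environment,
    }


def first_request(task: str, screenshot_base64: str, viewport: dict = VIEWPORT) -> dict:
    """Keyword arguments for the initial client.responses.create(...) call."""
    return {
        "model": "computer-use-preview",
        "tools": [computer_tool(viewport)],
        "input": [
            {
                "role": "user",
                "content": [
                    {"type": "input_text", "text": task},
                    {
                        "type": "input_image",
                        "image_url": f"data:image/png;base64,{screenshot_base64}",
                    },
                ],
            }
        ],
        "truncation": "auto",
    }
```

With this in place, the call becomes `client.responses.create(**first_request(task, screenshot_base64))`, and the browser context can be created with `viewport=VIEWPORT`.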
6. Process actions and loop
The model returns a computer_call item with an action to execute. Run the action, capture a new screenshot, and send it back. Repeat until no more computer_call items appear (task complete).
The model may return key names like CTRL or CMD that Playwright doesn't recognize. The examples below map these to Playwright's expected format (e.g., Control, Meta).
TypeScript:

```typescript
// Map model key names to Playwright key names
const keyMap: Record<string, string> = {
  enter: "Enter", return: "Enter",
  ctrl: "Control", cmd: "Meta",
  esc: "Escape", backspace: "Backspace",
  tab: "Tab", space: "Space",
  up: "ArrowUp", down: "ArrowDown",
  left: "ArrowLeft", right: "ArrowRight",
};

while (true) {
  const computerCalls = response.output.filter(
    (item: { type: string }) => item.type === "computer_call"
  );

  if (computerCalls.length === 0) {
    // Task complete: print the result
    console.log(response.output_text);
    break;
  }

  const computerCall = computerCalls[0];
  const action = computerCall.action;

  switch (action.type) {
    case "click":
      await page.mouse.click(action.x, action.y);
      break;
    case "double_click":
      await page.mouse.dblclick(action.x, action.y);
      break;
    case "type":
      await page.keyboard.type(action.text);
      break;
    case "keypress": {
      const mappedKeys = action.keys.map(
        (key: string) => keyMap[key.toLowerCase()] || key
      );
      await page.keyboard.press(mappedKeys.join("+"));
      break;
    }
    case "scroll":
      await page.mouse.move(action.x, action.y);
      await page.evaluate(
        `window.scrollBy(${action.scroll_x}, ${action.scroll_y})`
      );
      break;
    case "screenshot":
      // Model wants a fresh screenshot, so just continue
      break;
  }

  // Capture new screenshot and send back
  const newScreenshot = await getScreenshot(page);

  response = await client.responses.create({
    model: "computer-use-preview",
    previous_response_id: response.id,
    tools: [
      {
        type: "computer_use_preview",
        display_width: 1024,
        display_height: 768,
        environment: "browser",
      },
    ],
    input: [
      {
        type: "computer_call_output",
        call_id: computerCall.call_id,
        output: {
          type: "input_image",
          image_url: `data:image/png;base64,${newScreenshot}`,
        },
      },
    ],
    truncation: "auto",
  });
}
```

Python:

```python
# Map model key names to Playwright key names
key_map = {
    "enter": "Enter", "return": "Enter",
    "ctrl": "Control", "cmd": "Meta",
    "esc": "Escape", "backspace": "Backspace",
    "tab": "Tab", "space": "Space",
    "up": "ArrowUp", "down": "ArrowDown",
    "left": "ArrowLeft", "right": "ArrowRight",
}

while True:
    computer_calls = [
        item for item in response.output
        if item.type == "computer_call"
    ]

    if not computer_calls:
        # Task complete: print the result
        print(response.output_text)
        break

    computer_call = computer_calls[0]
    action = computer_call.action

    if action.type == "click":
        await page.mouse.click(action.x, action.y)
    elif action.type == "double_click":
        await page.mouse.dblclick(action.x, action.y)
    elif action.type == "type":
        await page.keyboard.type(action.text)
    elif action.type == "keypress":
        mapped_keys = [key_map.get(key.lower(), key) for key in action.keys]
        await page.keyboard.press("+".join(mapped_keys))
    elif action.type == "scroll":
        await page.mouse.move(action.x, action.y)
        await page.evaluate(
            f"window.scrollBy({action.scroll_x}, {action.scroll_y})"
        )
    elif action.type == "screenshot":
        pass  # Model wants a fresh screenshot, so just continue

    # Capture new screenshot and send back
    screenshot_base64 = await get_screenshot(page)

    response = client.responses.create(
        model="computer-use-preview",
        previous_response_id=response.id,
        tools=[
            {
                "type": "computer_use_preview",
                "display_width": 1024,
                "display_height": 768,
                "environment": "browser",
            }
        ],
        input=[
            {
                "type": "computer_call_output",
                "call_id": computer_call.call_id,
                "output": {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{screenshot_base64}",
                },
            }
        ],
        truncation="auto",
    )
```
Supported actions
| Action | Properties | Description |
|---|---|---|
| click | x, y, button | Click at coordinates |
| double_click | x, y | Double-click at coordinates |
| type | text | Type text |
| keypress | keys[] | Press keyboard keys |
| scroll | x, y, scroll_x, scroll_y | Scroll at position |
| move | x, y | Move the pointer to coordinates |
| drag | path[] (list of x, y points) | Drag along a path of points |
| wait | - | Pause before the next screenshot |
| screenshot | - | Request new screenshot |
Advanced configuration
Without stealth mode
If you don't need anti-detection and just want a managed cloud browser:
TypeScript:

```typescript
const browser = await chromium.connectOverCDP(
  `wss://production-sfo.browserless.io?token=${process.env.BROWSERLESS_API_KEY}`
);
```

Python:

```python
browser = await p.chromium.connect_over_cdp(
    f"wss://production-sfo.browserless.io?token={os.environ['BROWSERLESS_API_KEY']}"
)
```
Residential proxies
Route traffic through real residential IPs for additional anti-detection:
TypeScript:

```typescript
const browser = await chromium.connectOverCDP(
  `wss://production-sfo.browserless.io?token=${process.env.BROWSERLESS_API_KEY}&proxy=residential&proxyCountry=us`
);
```

Python:

```python
browser = await p.chromium.connect_over_cdp(
    f"wss://production-sfo.browserless.io?token={os.environ['BROWSERLESS_API_KEY']}&proxy=residential&proxyCountry=us"
)
```
Regional endpoints
Connect to the closest region for lower latency. See Connection URLs for all available endpoints.
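The connection URLs in this guide differ only in host, path, and query parameters, so they can be assembled by one small function. The sketch below is an illustration under assumptions: `browserless_ws_url` is a hypothetical helper, the only host taken from this guide is `production-sfo.browserless.io`, and any other regional host must come from the Connection URLs page. Whether every option combination (for example stealth plus a residential proxy) is accepted is not verified here.

```python
from urllib.parse import urlencode


def browserless_ws_url(
    token: str,
    host: str = "production-sfo.browserless.io",
    stealth: bool = False,
    **options: str,
) -> str:
    """Build a Browserless WebSocket URL with a URL-encoded query string."""
    path = "/chromium/stealth" if stealth else ""
    # urlencode escapes characters such as "&" or spaces in the token
    query = urlencode({"token": token, **options})
    return f"wss://{host}{path}?{query}"
```

For example, `browserless_ws_url(token, proxy="residential", proxyCountry="us")` reproduces the residential proxy URL shown above.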
Troubleshooting
Screenshot timeout
If the CDP fallback in Step 4 still times out, try increasing the timeout or check your network connection to Browserless. You can also increase the Playwright connection timeout:
TypeScript:

```typescript
const browser = await chromium.connectOverCDP(url, { timeout: 120000 });
```

Python:

```python
browser = await p.chromium.connect_over_cdp(url, timeout=120000)
```
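If a longer timeout alone does not help, retrying the connection a few times with a growing delay is another option. This is a generic sketch rather than anything Browserless or Playwright prescribes; `retry_async` and its parameters are hypothetical.

```python
import asyncio


async def retry_async(make_attempt, attempts: int = 3, base_delay: float = 1.0):
    """Await make_attempt() up to `attempts` times with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return await make_attempt()
        except Exception as error:  # broad on purpose in this sketch; narrow it in real code
            last_error = error
            if attempt < attempts - 1:
                # wait base_delay, then 2x, then 4x, ... between attempts
                await asyncio.sleep(base_delay * 2 ** attempt)
    raise last_error


# Usage inside the guide's async function:
#   browser = await retry_async(
#       lambda: p.chromium.connect_over_cdp(url, timeout=120000)
#   )
```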
Model returns unrecognized keys
The keyMap / key_map in Step 6 covers the most common mismatches. If you encounter new ones, add them to the map — the Playwright keyboard API docs list all valid key names.
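To find out which names need adding, it can help to record every multi-character key that passes through the map unchanged. The sketch below reuses the mapping from Step 6; `normalize_keys` is a hypothetical helper. A passed-through name is not necessarily invalid (Playwright accepts names such as `Shift` or `F5` as-is), so treat the second return value as a list of names to check against the Playwright docs.

```python
KEY_MAP = {
    "enter": "Enter", "return": "Enter",
    "ctrl": "Control", "cmd": "Meta",
    "esc": "Escape", "backspace": "Backspace",
    "tab": "Tab", "space": "Space",
    "up": "ArrowUp", "down": "ArrowDown",
    "left": "ArrowLeft", "right": "ArrowRight",
}


def normalize_keys(keys, key_map=KEY_MAP):
    """Return (combo, passed_through) for a list of model key names.

    combo is the "+"-joined string for page.keyboard.press();
    passed_through lists multi-character names that were not in the map.
    """
    mapped, passed_through = [], []
    for key in keys:
        name = key_map.get(key.lower())
        if name is None:
            name = key
            # Single characters such as "a" or "1" need no mapping
            if len(key) > 1:
                passed_through.append(key)
        mapped.append(name)
    return "+".join(mapped), passed_through
```

Logging `passed_through` during a run gives a concrete list of candidates to add to the map.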