Drive a browser with OpenAI Computer Use

Connect OpenAI's Computer Use Assistant to a Steel browser session for autonomous web interactions.

examples/openai-computer-use-ts
Contributors: , Updated
Terminal

Scaffolds a starter project locally. Requires the Steel CLI.

OpenAI's computer-use model ships as a single tool declaration: { type: "computer" }. You hand it to the Responses API, send a screenshot, and the model returns computer_call items with actions like click, type, keypress, scroll. Your job is to execute them against a real browser, capture the next screenshot, and feed it back.

The loop

The Responses API threads conversation state server-side, so each turn carries only the new tool outputs plus a previous_response_id:

const response = await createResponse({
model: this.model,
instructions: this.systemPrompt,
input: nextInput,
tools: this.tools,
previous_response_id: previousResponseId,
reasoning: { effort: "medium" },
truncation: "auto",
});

First turn: nextInput is [{ role: "user", content: task }]. Subsequent turns: nextInput is just the array of tool outputs from the previous iteration.

The response's output array mixes three item types:

  • message: a plain text reply. The final message becomes the return value.
  • reasoning: the model's own summary; printed but not fed back.
  • computer_call: one or more actions to execute. Each call has a call_id that must be echoed back in the matching computer_call_output.

Actions in, screenshots out

executeComputerAction is the translation layer:

case "click": {
const coords = this.toCoords(actionArgs.x, actionArgs.y);
const button = this.mapButton(actionArgs.button);
const clicks = this.toNumber(actionArgs.num_clicks, 1);
body = {
action: "click_mouse",
button,
coordinates: coords,
...(clicks > 1 ? { num_clicks: clicks } : {}),
screenshot: true,
};
break;
}

The screenshot goes back as a computer_call_output, matched by call_id:

toolOutputs.push({
type: "computer_call_output",
call_id: item.call_id,
acknowledged_safety_checks: pendingChecks,
output: {
type: "computer_screenshot",
image_url: `data:image/png;base64,${screenshotBase64}`,
},
});

A few translation details:

  • keypress takes a list. normalizeKey rewrites synonyms (CTRL to Control, CMD to Meta, ENTER to Enter).
  • scroll is delta-based. OpenAI sends scroll_x/scroll_y in pixels. Steel's scroll takes delta_x/delta_y directly.
  • drag gives a path. OpenAI provides the full point list in path; Steel's drag_mouse wants the same shape.
  • Unknown actions fall through to take_screenshot.

Safety checks

A computer_call can attach pending_safety_checks when the planned action looks sensitive. The call won't take effect until you echo those check IDs back in acknowledged_safety_checks. The starter auto-acknowledges everything; for production, flip autoAcknowledgeSafety to false and gate each check on a human approval.

Run it

cd examples/openai-computer-use-ts
cp .env.example .env # set STEEL_API_KEY and OPENAI_API_KEY
npm install
npm start

Get keys from app.steel.dev and platform.openai.com. Override the task inline:

TASK="Find the current weather in New York City" npm start

Your output varies. Structure looks like this:

Steel Session created successfully!
View live session at: https://app.steel.dev/sessions/ab12cd34...
Executing task: Go to Steel.dev and find the latest news
============================================================
I'll navigate to steel.dev and scan the landing page.
click({"x":720,"y":48})
type({"text":"https://steel.dev"})
keypress({"keys":["Enter"]})
scroll({"x":720,"y":450,"scroll_y":600})
Steel's latest release adds ...
============================================================
TASK EXECUTION COMPLETED
============================================================
Duration: 71.4 seconds

Expect roughly 60-120 seconds and 15-40 turns for a simple browsing task.

Make it yours

  • Change the viewport. viewportWidth and viewportHeight in the Agent constructor set the Steel session dimensions.
  • Swap the model. The default is gpt-5.5. Update this.model in the Agent constructor.
  • Tune reasoning effort. reasoning: { effort: "medium" } trades latency for planning quality.
  • Rewrite the system prompt. BROWSER_SYSTEM_PROMPT holds the browsing conventions.
  • Persist a login. Pass sessionContext to sessions.create. See credentials and auth-context.
  • Turn off auto-ack. Flip autoAcknowledgeSafety to false to make pending safety checks raise.

Computer use guide · Python version · Anthropic equivalent

examples/openai-computer-use-py
Contributors: , Updated
Terminal

Scaffolds a starter project locally. Requires the Steel CLI.

OpenAI's Computer Use models expose one tool ({"type": "computer"}) and emit computer_call items containing an action the model wants performed on a screen. You execute the action, return a screenshot as a computer_call_output, and the next turn the model sees the result. The action vocabulary (click, type, keypress, scroll, drag, wait, screenshot) is fixed by OpenAI.

This recipe uses OpenAI's Responses API, not Chat Completions. Responses keeps conversation state on OpenAI's side via previous_response_id, so each turn only sends the new tool outputs rather than the full screenshot history.

The loop

params = {
"model": self.model,
"instructions": self.system_prompt,
"input": next_input,
"tools": self.tools,
"reasoning": {"effort": "medium"},
"truncation": "auto",
}
if previous_response_id:
params["previous_response_id"] = previous_response_id
response = create_response(**params)
previous_response_id = response.get("id")

Each response["output"] is a list of items with a type. The loop walks them:

  • reasoning: model's internal thinking, printed.
  • message: terminal prose; the agent stores the last one as the final result.
  • computer_call: one or more actions to execute.

execute_computer_action maps OpenAI's action vocabulary onto Steel's Input API. Each branch builds a Steel request body and sends it through self.steel.sessions.computer(...) with screenshot: True:

elif action_type in ("click",):
coords = self.to_coords(action_args.get("x"), action_args.get("y"))
button = self.map_button(action_args.get("button"))
num_clicks = int(self.to_number(action_args.get("num_clicks"), 1))
payload = {
"action": "click_mouse",
"button": button,
"coordinates": [coords[0], coords[1]],
"screenshot": True,
}
if num_clicks > 1:
payload["num_clicks"] = num_clicks
body = payload

keypress arrives with OpenAI names (CTRL, ENTER, ESC, UP); normalize_key rewrites them into the Steel / DOM vocabulary (Control, Enter, Escape, ArrowUp).

The screenshot goes back as a computer_call_output:

tool_outputs.append({
"type": "computer_call_output",
"call_id": item["call_id"],
"acknowledged_safety_checks": pending_checks,
"output": {
"type": "computer_screenshot",
"image_url": f"data:image/png;base64,{screenshot_base64}",
},
})

Safety checks

A computer_call can include pending_safety_checks. You must echo them back in acknowledged_safety_checks on the next turn, or the model stalls. The default here is auto_acknowledge_safety = True, which suits a starter but is not what you want in production. Flip it to False and surface the check to a human before proceeding.

Run it

cd examples/openai-computer-use-py
cp .env.example .env # set STEEL_API_KEY and OPENAI_API_KEY
uv run main.py

Get keys from app.steel.dev and platform.openai.com.

Override the task per run:

TASK="Find the current weather in New York City" python main.py

Your output varies. Structure looks like this:

Starting Steel session...
Steel Session created successfully!
View live session at: https://app.steel.dev/sessions/ab12cd34…
Executing task: Go to Steel.dev and find the latest news
============================================================
I'll open steel.dev and check the blog.
keypress({"keys": ["CTRL", "L"]})
type({"text": "https://steel.dev"})
keypress({"keys": ["ENTER"]})
wait({"ms": 1500})
Steel's latest release notes mention …
TASK EXECUTION COMPLETED
Duration: 62.8 seconds

A run typically takes 60-180 seconds and 10-30 iterations. Screenshots are cached between turns via previous_response_id, so per-turn input cost stays roughly flat even on long loops. The finally block in main() calls sessions.release().

Make it yours

  • Change the task. Edit TASK in .env or pass it inline.
  • Swap the model. The default is gpt-5.5. Update self.model in Agent.__init__.
  • Tune the viewport. viewport_width / viewport_height in Agent.__init__ flow into sessions.create(dimensions=...).
  • Turn off auto-ack. Flip auto_acknowledge_safety = False to make pending safety checks raise.
  • Persist a login. Pass session_context to sessions.create. See credentials.
  • Adjust reasoning. "effort": "medium" trades latency for deeper plans. Drop to "low" for fast lookups, raise to "high" for multi-step research.

TypeScript version · Claude version · OpenAI Computer Use guide · Responses API reference