Drive a browser with OpenAI Computer Use
Connect OpenAI's Computer Use Assistant to a Steel browser session for autonomous web interactions.
Scaffolds a starter project locally. Requires the Steel CLI.
OpenAI's computer-use model ships as a single tool declaration: { type: "computer" }. You hand it to the Responses API, send a screenshot, and the model returns computer_call items with actions like click, type, keypress, scroll. Your job is to execute them against a real browser, capture the next screenshot, and feed it back.
The loop
The Responses API threads conversation state server-side, so each turn carries only the new tool outputs plus a previous_response_id:
const response = await createResponse({model: this.model,instructions: this.systemPrompt,input: nextInput,tools: this.tools,previous_response_id: previousResponseId,reasoning: { effort: "medium" },truncation: "auto",});
First turn: nextInput is [{ role: "user", content: task }]. Subsequent turns: nextInput is just the array of tool outputs from the previous iteration.
The response's output array mixes three item types:
message: a plain text reply. The final message becomes the return value.reasoning: the model's own summary; printed but not fed back.computer_call: one or more actions to execute. Each call has acall_idthat must be echoed back in the matchingcomputer_call_output.
Actions in, screenshots out
executeComputerAction is the translation layer:
case "click": {const coords = this.toCoords(actionArgs.x, actionArgs.y);const button = this.mapButton(actionArgs.button);const clicks = this.toNumber(actionArgs.num_clicks, 1);body = {action: "click_mouse",button,coordinates: coords,...(clicks > 1 ? { num_clicks: clicks } : {}),screenshot: true,};break;}
The screenshot goes back as a computer_call_output, matched by call_id:
toolOutputs.push({type: "computer_call_output",call_id: item.call_id,acknowledged_safety_checks: pendingChecks,output: {type: "computer_screenshot",image_url: `data:image/png;base64,${screenshotBase64}`,},});
A few translation details:
keypresstakes a list.normalizeKeyrewrites synonyms (CTRLtoControl,CMDtoMeta,ENTERtoEnter).scrollis delta-based. OpenAI sendsscroll_x/scroll_yin pixels. Steel'sscrolltakesdelta_x/delta_ydirectly.draggives a path. OpenAI provides the full point list inpath; Steel'sdrag_mousewants the same shape.- Unknown actions fall through to
take_screenshot.
Safety checks
A computer_call can attach pending_safety_checks when the planned action looks sensitive. The call won't take effect until you echo those check IDs back in acknowledged_safety_checks. The starter auto-acknowledges everything; for production, flip autoAcknowledgeSafety to false and gate each check on a human approval.
Run it
cd examples/openai-computer-use-tscp .env.example .env # set STEEL_API_KEY and OPENAI_API_KEYnpm installnpm start
Get keys from app.steel.dev and platform.openai.com. Override the task inline:
TASK="Find the current weather in New York City" npm start
Your output varies. Structure looks like this:
Steel Session created successfully!View live session at: https://app.steel.dev/sessions/ab12cd34...Executing task: Go to Steel.dev and find the latest news============================================================I'll navigate to steel.dev and scan the landing page.click({"x":720,"y":48})type({"text":"https://steel.dev"})keypress({"keys":["Enter"]})scroll({"x":720,"y":450,"scroll_y":600})Steel's latest release adds ...============================================================TASK EXECUTION COMPLETED============================================================Duration: 71.4 seconds
Expect roughly 60-120 seconds and 15-40 turns for a simple browsing task.
Make it yours
- Change the viewport.
viewportWidthandviewportHeightin theAgentconstructor set the Steel sessiondimensions. - Swap the model. The default is
gpt-5.5. Updatethis.modelin theAgentconstructor. - Tune reasoning effort.
reasoning: { effort: "medium" }trades latency for planning quality. - Rewrite the system prompt.
BROWSER_SYSTEM_PROMPTholds the browsing conventions. - Persist a login. Pass
sessionContexttosessions.create. See credentials and auth-context. - Turn off auto-ack. Flip
autoAcknowledgeSafetytofalseto make pending safety checks raise.
Related
Scaffolds a starter project locally. Requires the Steel CLI.
OpenAI's Computer Use models expose one tool ({"type": "computer"}) and emit computer_call items containing an action the model wants performed on a screen. You execute the action, return a screenshot as a computer_call_output, and the next turn the model sees the result. The action vocabulary (click, type, keypress, scroll, drag, wait, screenshot) is fixed by OpenAI.
This recipe uses OpenAI's Responses API, not Chat Completions. Responses keeps conversation state on OpenAI's side via previous_response_id, so each turn only sends the new tool outputs rather than the full screenshot history.
The loop
params = {"model": self.model,"instructions": self.system_prompt,"input": next_input,"tools": self.tools,"reasoning": {"effort": "medium"},"truncation": "auto",}if previous_response_id:params["previous_response_id"] = previous_response_idresponse = create_response(**params)previous_response_id = response.get("id")
Each response["output"] is a list of items with a type. The loop walks them:
reasoning: model's internal thinking, printed.message: terminal prose; the agent stores the last one as the final result.computer_call: one or more actions to execute.
execute_computer_action maps OpenAI's action vocabulary onto Steel's Input API. Each branch builds a Steel request body and sends it through self.steel.sessions.computer(...) with screenshot: True:
elif action_type in ("click",):coords = self.to_coords(action_args.get("x"), action_args.get("y"))button = self.map_button(action_args.get("button"))num_clicks = int(self.to_number(action_args.get("num_clicks"), 1))payload = {"action": "click_mouse","button": button,"coordinates": [coords[0], coords[1]],"screenshot": True,}if num_clicks > 1:payload["num_clicks"] = num_clicksbody = payload
keypress arrives with OpenAI names (CTRL, ENTER, ESC, UP); normalize_key rewrites them into the Steel / DOM vocabulary (Control, Enter, Escape, ArrowUp).
The screenshot goes back as a computer_call_output:
tool_outputs.append({"type": "computer_call_output","call_id": item["call_id"],"acknowledged_safety_checks": pending_checks,"output": {"type": "computer_screenshot","image_url": f"data:image/png;base64,{screenshot_base64}",},})
Safety checks
A computer_call can include pending_safety_checks. You must echo them back in acknowledged_safety_checks on the next turn, or the model stalls. The default here is auto_acknowledge_safety = True, which suits a starter but is not what you want in production. Flip it to False and surface the check to a human before proceeding.
Run it
cd examples/openai-computer-use-pycp .env.example .env # set STEEL_API_KEY and OPENAI_API_KEYuv run main.py
Get keys from app.steel.dev and platform.openai.com.
Override the task per run:
TASK="Find the current weather in New York City" python main.py
Your output varies. Structure looks like this:
Starting Steel session...Steel Session created successfully!View live session at: https://app.steel.dev/sessions/ab12cd34…Executing task: Go to Steel.dev and find the latest news============================================================I'll open steel.dev and check the blog.keypress({"keys": ["CTRL", "L"]})type({"text": "https://steel.dev"})keypress({"keys": ["ENTER"]})wait({"ms": 1500})…Steel's latest release notes mention …TASK EXECUTION COMPLETEDDuration: 62.8 seconds
A run typically takes 60-180 seconds and 10-30 iterations. Screenshots are cached between turns via previous_response_id, so per-turn input cost stays roughly flat even on long loops. The finally block in main() calls sessions.release().
Make it yours
- Change the task. Edit
TASKin.envor pass it inline. - Swap the model. The default is
gpt-5.5. Updateself.modelinAgent.__init__. - Tune the viewport.
viewport_width/viewport_heightinAgent.__init__flow intosessions.create(dimensions=...). - Turn off auto-ack. Flip
auto_acknowledge_safety = Falseto make pending safety checks raise. - Persist a login. Pass
session_contexttosessions.create. See credentials. - Adjust reasoning.
"effort": "medium"trades latency for deeper plans. Drop to"low"for fast lookups, raise to"high"for multi-step research.
Related
TypeScript version · Claude version · OpenAI Computer Use guide · Responses API reference
Related recipes
Drive a browser with Gemini Computer Use
Connect Google's Gemini Computer Use to a Steel browser session for autonomous web interactions.
Drive a mobile browser with Claude Computer Use
Claude Computer Use with Steel for autonomous task execution in mobile browser environments.
Drive a browser with Claude Computer Use
Connect Claude to a Steel browser session for autonomous web interactions.