Drive a browser with Claude Computer Use
Connect Claude to a Steel browser session for autonomous web interactions.
Scaffolds a starter project locally. Requires the Steel CLI.
Claude sees the screen as an image and returns concrete actions at pixel coordinates: left_click [640, 412], type "claude 4.7 opus", scroll down 3. Something has to execute those actions against a real browser and send the next screenshot back. That "something" is the agent loop in index.ts, and the browser is a Steel session.
The loop
The whole thing fits in one while block inside Agent.executeTask. Each iteration sends the growing message history plus the computer tool definition to Claude:
const response = await this.client.beta.messages.create({model: this.model,max_tokens: 4096,messages: this.messages,tools: this.tools,betas: ["computer-use-2025-11-24"],});
The tool definition declares computer_20251124 with the viewport's display_width_px and display_height_px. Keep it consistent with the Steel session's dimensions (1280x768 here) or clicks land in the wrong place.
executeComputerAction is the translation layer. Claude emits computer-use actions (left_click, type, key, scroll, screenshot, ...); Steel's Input API speaks a parallel vocabulary (click_mouse, type_text, press_key, scroll, take_screenshot):
case "left_click":case "right_click":case "middle_click":case "double_click":case "triple_click": {body = {action: "click_mouse",button: buttonMap[action],coordinates: coords,screenshot: true,};break;}
Every action sets screenshot: true, so Steel returns a fresh base64 PNG after each interaction. That PNG becomes the content of a tool_result block in the next user message.
A few translation details:
- Keys get normalized.
normalizeKeymaps synonyms (CTRLtoControl,CMDtoMeta,ENTERtoEnter) before sending to Steel. - Scroll is delta-based. Claude says
scroll_direction: "down", scroll_amount: 3; Steel expectsdelta_x/delta_yin pixels. The code multiplies by 100 per step. - Drags default from center.
left_click_dragonly gives an end coordinate, so the start is the viewport center.
Stop conditions
- No tool calls. Claude wrote only text. Task is complete.
- Repetition.
detectRepetitioncompares the last assistant message against the previous three by word overlap (>80%). - Iteration cap. 50 iterations by default.
The finally block in main always calls agent.cleanup(), which releases the Steel session.
Run it
cd examples/claude-computer-use-tscp .env.example .env # set STEEL_API_KEY and ANTHROPIC_API_KEYnpm installnpm start
Get keys from app.steel.dev and console.anthropic.com. Override the task inline:
TASK="Find the current weather in New York City" npm start
Your output varies. Structure looks like this:
Steel Session created successfully!View live session at: https://app.steel.dev/sessions/ab12cd34...Executing task: Go to Steel.dev and find the latest news============================================================I'll navigate to Steel.dev and look for the latest news.computer({"action":"screenshot"})computer({"action":"left_click","coordinate":[640,48]})computer({"action":"type","text":"https://steel.dev"})computer({"action":"key","text":"Return"})...Task complete - no further actions requestedTASK EXECUTION COMPLETEDDuration: 84.3 secondsResult: Steel's latest news includes ...
Expect ~60-120 seconds and 15-40 iterations for a simple browsing task.
Make it yours
- Change the viewport.
viewportWidthandviewportHeightin theAgentconstructor set both the Steel session dimensions and the tool definition'sdisplay_width_px/display_height_px. Keep them in sync. - Tune the system prompt.
BROWSER_SYSTEM_PROMPTis where the browsing conventions live: date injection, screenshot-after-submit rule, black-screen recovery. - Raise the ceiling. Long tasks bump against the 50-iteration default in
executeTask. - Hand off auth. Pair this recipe with Steel's credentials or auth contexts to start the session authenticated.
Related
Scaffolds a starter project locally. Requires the Steel CLI.
Computer use is Anthropic's primitive for giving Claude direct control of a screen. You declare a computer tool with a viewport size; Claude replies with actions like left_click at (x, y), type with text, scroll, key. You execute each one and hand back a screenshot.
Steel supplies the screen. A Steel session is a headful Chromium in a VM reachable over HTTPS, and the Input API (sessions.computer) executes mouse and keyboard actions and returns a PNG in the same call.
The loop
Everything in main.py hangs off a single loop in Agent.execute_task. Seed the conversation with a system prompt and the task, then on each turn:
response = self.client.beta.messages.create(model=self.model,max_tokens=4096,messages=self.messages,tools=self.tools,betas=["computer-use-2025-11-24"],)text, has_actions = self.process_response(response)if not has_actions:break
tools declares the computer tool Claude is allowed to call:
self.tools = [{"type": "computer_20251124","name": "computer","display_width_px": self.viewport_width,"display_height_px": self.viewport_height,"display_number": 1,}]
The viewport (1280x768) has to match what Steel renders or clicks land in the wrong place.
tool_use blocks go to execute_computer_action, which maps each Anthropic action name onto a Steel Input API call:
elif action in ("left_click", "right_click", "middle_click","double_click", "triple_click"):body = {"action": "click_mouse","button": button_map[action],"coordinates": [coords[0], coords[1]],"screenshot": True,}
screenshot: True tells Steel to attach a base64 PNG to the response, so a click and the screenshot that proves it landed are one round-trip. The PNG goes back into messages as a tool_result with the matching tool_use_id.
Two normalization details: key / hold_key run names like CTRL+A through normalize_key (CTRL to Control, ESC to Escape, UP to ArrowUp), and scroll_amount is multiplied by 100 pixels per step.
Two things end the loop: Claude responds with only text (task done), or the last two assistant messages overlap 80%+ on word content (detect_repetition). A hard cap of 50 iterations catches anything that slips past both.
Run it
cd examples/claude-computer-use-pycp .env.example .env # set STEEL_API_KEY and ANTHROPIC_API_KEYuv run main.py
Get keys from app.steel.dev and console.anthropic.com. Default task lives in .env as TASK; you can override per-run:
TASK="Find the current weather in New York City" python main.py
Your output varies. Structure looks like this:
Starting Steel session...Steel Session created successfully!View live session at: https://app.steel.dev/sessions/ab12cd34…Executing task: Go to Steel.dev and find the latest news============================================================I'll navigate to Steel.dev and look for the latest news.computer({"action": "key", "text": "ctrl+l"})computer({"action": "type", "text": "https://steel.dev"})computer({"action": "key", "text": "Return"})computer({"action": "screenshot"})…Task complete - no further actions requestedTASK EXECUTION COMPLETEDDuration: 74.3 secondsResult: Steel just shipped …Releasing Steel session...
A run typically takes 60-180 seconds and 10-30 loop iterations.
Make it yours
- Change the task. Edit
TASKin.envor pass it per-run. - Tune the viewport.
viewport_width/viewport_heightinAgent.__init__. - Rework the system prompt.
BROWSER_SYSTEM_PROMPTis where site-specific knowledge lives. - Persist a login. Pass
session_contexttosessions.createto resume with cookies and local storage. See credentials. - Raise the ceiling.
max_iterations=50inexecute_taskis the safety net.
Related
Related recipes
Drive a browser with Gemini Computer Use
Connect Google's Gemini Computer Use to a Steel browser session for autonomous web interactions.
Drive a mobile browser with Claude Computer Use
Claude Computer Use with Steel for autonomous task execution in mobile browser environments.
Drive a browser with OpenAI Computer Use
Connect OpenAI's Computer Use Assistant to a Steel browser session for autonomous web interactions.