Drive a browser with Claude Computer Use

Connect Claude to a Steel browser session for autonomous web interactions.

examples/claude-computer-use-ts
Contributors: Jun Ryu

Computer use is Anthropic's primitive, not a wrapper. Claude sees the screen as an image and returns concrete actions at pixel coordinates: left_click [640, 412], type "claude 4.7 opus", scroll down 3. Something has to execute those actions against a real browser and send the next screenshot back. That "something" is the agent loop in index.ts, and the browser is a Steel session.

The loop

The whole thing fits in one while block inside Agent.executeTask. Each iteration sends the growing message history plus the computer tool definition to Claude:

const response = await this.client.beta.messages.create({
  model: this.model,
  max_tokens: 4096,
  messages: this.messages,
  tools: this.tools,
  betas: ["computer-use-2025-11-24"],
});

The tool definition declares computer_20251124 with the viewport's display_width_px and display_height_px. This is the coordinate system Claude plans in. Keep it consistent with the Steel session's dimensions (1280x768 here) or clicks land in the wrong place.

Claude's response is a list of content blocks. processResponse walks them: text blocks get printed, tool_use blocks for the computer tool get dispatched to executeComputerAction. When a turn contains no tool_use blocks, the loop exits. That's how the agent signals it's done.
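The shape of that loop is worth seeing in one place. A minimal sketch (in Python for brevity; `send_to_claude` and `run_action` are placeholders for the Anthropic and Steel client calls, not real SDK functions):

```python
# Sketch of the agent loop's control flow, not the real SDK calls.
# send_to_claude(messages) -> list of content blocks for one model turn.
# run_action(input) -> base64 screenshot from the browser.

def agent_loop(task, send_to_claude, run_action, max_iterations=50):
    messages = [{"role": "user", "content": task}]
    for _ in range(max_iterations):
        blocks = send_to_claude(messages)              # one model turn
        messages.append({"role": "assistant", "content": blocks})
        tool_uses = [b for b in blocks if b["type"] == "tool_use"]
        if not tool_uses:                              # text only: task is done
            return "".join(b["text"] for b in blocks if b["type"] == "text")
        results = []
        for b in tool_uses:                            # execute each action
            screenshot = run_action(b["input"])
            results.append({"type": "tool_result", "tool_use_id": b["id"],
                            "content": [{"type": "image", "source": screenshot}]})
        messages.append({"role": "user", "content": results})
    return None  # iteration cap hit

```

Everything else in the recipe is detail hanging off this skeleton: how actions get executed, and when the loop decides to stop.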

Actions in, screenshots out

executeComputerAction is the translation layer. Claude emits computer-use actions (left_click, type, key, scroll, screenshot, ...); Steel's Input API speaks a parallel vocabulary (click_mouse, type_text, press_key, scroll, take_screenshot). A switch statement maps one to the other and sends the result through steel.sessions.computer(sessionId, body).

case "left_click":
case "right_click":
case "middle_click":
case "double_click":
case "triple_click": {
  // ... build a click_mouse body with button, coordinates, num_clicks
  body = {
    action: "click_mouse",
    button: buttonMap[action],
    coordinates: coords,
    screenshot: true,
    // ...
  };
  break;
}

Every action sets screenshot: true, so Steel returns a fresh base64 PNG after each interaction. That PNG becomes the content of a tool_result block in the next user message:

toolResults.push({
  type: "tool_result",
  tool_use_id: block.id,
  content: [{
    type: "image",
    source: { type: "base64", media_type: "image/png", data: screenshotBase64 },
  }],
});

Claude now has a new frame to reason over. The conversation grows turn by turn (assistant actions, user screenshots) until Claude stops calling tools.

A few of the translation details are worth knowing:

  • Keys get normalized. Claude might emit CTRL+a or cmd+enter; normalizeKey maps synonyms (CTRL to Control, CMD to Meta, ENTER to Enter) before sending to Steel.
  • Scroll is delta-based. Claude says scroll_direction: "down", scroll_amount: 3; Steel expects delta_x/delta_y in pixels. The code multiplies by 100 per step.
  • Drags default from center. left_click_drag only gives an end coordinate, so the start is the viewport center. A pragmatic convention, not an Anthropic spec.
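The first two translations are mechanical enough to sketch. This is an illustrative Python version, not the example's actual code; the synonym table shows the shape of the mapping, not its full contents, and the sign convention (down and right positive) is an assumption:

```python
# Sketch of the key and scroll normalizations described above.
KEY_SYNONYMS = {"CTRL": "Control", "CMD": "Meta", "ENTER": "Enter",
                "ESC": "Escape", "UP": "ArrowUp"}

def normalize_key(combo: str) -> str:
    """Map Claude's key vocabulary ("CTRL+a") onto Steel's ("Control+a")."""
    return "+".join(KEY_SYNONYMS.get(p.upper(), p) for p in combo.split("+"))

def scroll_deltas(direction: str, amount: int, px_per_step: int = 100):
    """Convert Claude's tick-based scroll into Steel's pixel deltas."""
    dx, dy = 0, 0
    if direction == "down":
        dy = amount * px_per_step
    elif direction == "up":
        dy = -amount * px_per_step
    elif direction == "right":
        dx = amount * px_per_step
    elif direction == "left":
        dx = -amount * px_per_step
    return {"delta_x": dx, "delta_y": dy}
```

The point of both: Claude's vocabulary is fixed by Anthropic, Steel's by its Input API, and the agent owns the seam between them.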

Stop conditions

The loop has three exits:

  • No tool calls. Claude wrote only text. Task is complete.
  • Repetition. detectRepetition compares the last assistant message against the previous three by word overlap (>80%). Models occasionally loop on ambiguous states; this catches that cheaply.
  • Iteration cap. 50 iterations by default. Hitting it prints a warning and returns whatever text was last produced.
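The repetition check is simple set arithmetic on words. A sketch of the idea, assuming Jaccard-style overlap (the example's exact similarity measure may differ); the 0.8 threshold and three-message window match the description above:

```python
# Sketch of detectRepetition: flag when the newest assistant text
# mostly repeats one of the previous few.

def word_overlap(a: str, b: str) -> float:
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    return len(wa & wb) / len(wa | wb)

def detect_repetition(texts, threshold=0.8, window=3):
    if len(texts) < 2:
        return False
    latest = texts[-1]
    return any(word_overlap(latest, prev) > threshold
               for prev in texts[-1 - window:-1])
```

It is deliberately crude: false positives are rare because normal progress changes what Claude says, and a false negative just means the iteration cap catches the loop instead.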

The finally block in main always calls agent.cleanup(), which releases the Steel session. Forgetting this keeps the browser billed until the 15-minute timeout set in initialize().

Run it

cd examples/claude-computer-use-ts
cp .env.example .env # set STEEL_API_KEY and ANTHROPIC_API_KEY
npm install
npm start

Get keys from app.steel.dev and console.anthropic.com. Override the task inline:

TASK="Find the current weather in New York City" npm start

A session viewer URL prints as the script starts. Open it in another tab to watch Claude pilot the browser. Your output varies, but the structure looks like this:

Steel Session created successfully!
View live session at: https://app.steel.dev/sessions/ab12cd34...
Executing task: Go to Steel.dev and find the latest news
============================================================
I'll navigate to Steel.dev and look for the latest news.
computer({"action":"screenshot"})
computer({"action":"left_click","coordinate":[640,48]})
computer({"action":"type","text":"https://steel.dev"})
computer({"action":"key","text":"Return"})
...
Task complete - no further actions requested
TASK EXECUTION COMPLETED
Duration: 84.3 seconds
Result: Steel's latest news includes ...

Expect ~60-120 seconds and on the order of 15-40 iterations for a simple browsing task. Cost is Steel session time plus Anthropic tokens. Screenshots are the expensive part of the token bill.
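To put a rough number on that: Anthropic's vision docs approximate an image's token count as width × height / 750 (actual counts vary, since large images may be resized). A back-of-envelope:

```python
# Rough screenshot token cost, using the ~(w*h)/750 approximation
# from Anthropic's vision docs. Actual counts vary with resizing.

def screenshot_tokens(width: int, height: int) -> int:
    return (width * height) // 750

per_frame = screenshot_tokens(1280, 768)   # ~1310 tokens per frame
run_frames = 30 * per_frame                # image tokens produced by a 30-step run
```

And because each API call resends the full message history, earlier screenshots are billed again on every later turn, so total image-token spend grows roughly quadratically with iteration count unless old screenshots are pruned from the history.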

Make it yours

  • Change the viewport. viewportWidth and viewportHeight in the Agent constructor set both the Steel session dimensions and the tool definition's display_width_px/display_height_px. Keep them in sync.
  • Tune the system prompt. BROWSER_SYSTEM_PROMPT is where the browsing conventions live: date injection, screenshot-after-submit rule, black-screen recovery. Edit it to match your site or workflow.
  • Raise the ceiling. Long tasks bump against the 50-iteration default in executeTask. The third argument lets you lift it.
  • Hand off auth. Computer use can act on any page the browser is already logged into. Pair this recipe with Steel's credentials or auth contexts to start the session authenticated.

Computer use docs · Python version · Mobile variant

examples/claude-computer-use-py
Contributors: Hussien Hussien, Jun Ryu

Computer use is Anthropic's primitive for giving Claude direct control of a screen. You declare a computer tool with a viewport size; Claude replies with actions like left_click at (x, y), type with text, scroll, key. You execute each one and hand back a screenshot. The model sees what it just did and decides what to do next. That's the whole interface. No DOM, no selectors, no scaffolding.

Steel supplies the screen. A Steel session is a headful Chromium in a VM reachable over HTTPS, and the Input API (sessions.computer) executes mouse and keyboard actions and returns a PNG in the same call. One round-trip per Claude action.

The loop

Everything in main.py hangs off a single loop in Agent.execute_task. Seed the conversation with a system prompt and the task, then on each turn:

response = self.client.beta.messages.create(
    model=self.model,
    max_tokens=4096,
    messages=self.messages,
    tools=self.tools,
    betas=["computer-use-2025-11-24"],
)
text, has_actions = self.process_response(response)
if not has_actions:
    break

tools declares the computer tool Claude is allowed to call:

self.tools = [
    {
        "type": "computer_20251124",
        "name": "computer",
        "display_width_px": self.viewport_width,
        "display_height_px": self.viewport_height,
        "display_number": 1,
    }
]

The viewport (1280x768) has to match what Steel renders. Claude picks coordinates against this canvas, so if the numbers disagree the clicks land in the wrong place. The computer_20251124 type pairs with the computer-use-2025-11-24 beta; both move together when Anthropic ships a new version.
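Keeping the two in sync is the right fix; but if you ever find yourself debugging a size mismatch, the error is a pure linear scale. A hypothetical helper (not in the example) makes the failure mode concrete:

```python
# Illustrative only: map a coordinate from Claude's declared canvas onto
# the viewport Steel actually renders. The real fix is matching sizes.

def rescale(coord, model_size, actual_size):
    mx, my = model_size
    ax, ay = actual_size
    x, y = coord
    return (round(x * ax / mx), round(y * ay / my))
```

A click Claude plans at the center of a 1280x768 canvas lands at (512, 300) on a 1024x600 browser; without the rescale it would land down and to the right of the intended target by the full difference.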

process_response walks the content blocks in Claude's reply. Text blocks get printed. tool_use blocks go to execute_computer_action, which maps each Anthropic action name onto a Steel Input API call:

elif action in ("left_click", "right_click", "middle_click",
                "double_click", "triple_click"):
    body = {
        "action": "click_mouse",
        "button": button_map[action],
        "coordinates": [coords[0], coords[1]],
        "screenshot": True,
    }

screenshot: True tells Steel to attach a base64 PNG to the response, so a click and the screenshot that proves it landed are one round-trip, not two. The PNG goes back into messages as a tool_result with the matching tool_use_id, and the next iteration Claude sees exactly what changed.
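The tool_result shape is the one piece of Messages API plumbing worth having in front of you. A sketch of building that block (field names follow the Messages API's tool_result/image content blocks; the helper itself is illustrative, not the example's code):

```python
# Sketch: fold the screenshot Steel returned into the next user turn.

def screenshot_result(tool_use_id: str, png_base64: str) -> dict:
    return {
        "type": "tool_result",
        "tool_use_id": tool_use_id,       # must match the tool_use block's id
        "content": [{
            "type": "image",
            "source": {"type": "base64", "media_type": "image/png",
                       "data": png_base64},
        }],
    }

# messages.append({"role": "user", "content": [screenshot_result(block.id, png)]})
```

If the tool_use_id doesn't match, the API rejects the turn, so this is one place where a typo fails loudly rather than silently.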

Actions are thin translations with two bits of normalization worth knowing: key / hold_key run names like CTRL+A through normalize_key (Anthropic's vocabulary into Steel's: CTRL becomes Control, ESC becomes Escape, UP becomes ArrowUp), and scroll_amount is multiplied by 100 pixels per step because Claude counts in "tick" units.

Two things end the loop. has_actions == False means Claude responded with only text, meaning it thinks the task is done. Or the last two assistant messages overlap 80%+ on word content (detect_repetition), which usually means it's stuck retrying the same thing. A hard cap of 50 iterations catches anything that slips past both.

The system prompt matters

BROWSER_SYSTEM_PROMPT in main.py is tuned for this setup and worth reading before you change anything:

  • Never click the address bar. Claude reaches for it with the mouse by default and misses half the time. The prompt teaches it Ctrl+L, type URL, Enter.
  • Clear before typing. Ctrl+A then Delete, otherwise typed text appends to whatever was already there.
  • Black first screenshot? Click the center and try again. Focus sometimes lands off-window on a cold session.
  • Today's date is injected at startup so Claude doesn't browse as if it were two years ago.

These aren't decorative; they're the difference between "works most of the time" and "falls over on the first form."
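The date injection is the easiest of these to replicate in your own prompt. A sketch, assuming a `build_system_prompt` helper (the prompt text here is illustrative, not the actual BROWSER_SYSTEM_PROMPT):

```python
# Sketch of injecting today's date into the system prompt so Claude
# doesn't reason against its training cutoff.
from datetime import date

def build_system_prompt(base: str) -> str:
    today = date.today().strftime("%A, %B %d, %Y")
    return f"{base}\n\nToday's date is {today}."
```

Do this at startup, once; injecting it into every user turn just burns tokens without changing behavior.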

Run it

cd examples/claude-computer-use-py
cp .env.example .env # set STEEL_API_KEY and ANTHROPIC_API_KEY
uv run main.py

Get keys from app.steel.dev and console.anthropic.com. The script prints a session viewer URL as soon as the Steel session is up. Open it in another tab to watch Claude drive the browser live. Each tool call also prints as computer({...}) so you can follow along in the terminal.

Default task lives in .env as TASK; you can override per-run:

TASK="Find the current weather in New York City" uv run main.py

Your output varies. Structure looks like this:

Starting Steel session...
Steel Session created successfully!
View live session at: https://app.steel.dev/sessions/ab12cd34…
Executing task: Go to Steel.dev and find the latest news
============================================================
I'll navigate to Steel.dev and look for the latest news.
computer({"action": "key", "text": "ctrl+l"})
computer({"action": "type", "text": "https://steel.dev"})
computer({"action": "key", "text": "Return"})
computer({"action": "screenshot"})
Task complete - no further actions requested
TASK EXECUTION COMPLETED
Duration: 74.3 seconds
Result: Steel just shipped …
Releasing Steel session...

A run typically takes 60-180 seconds and 10-30 loop iterations, depending on the task. You pay for Steel session-minutes and for Anthropic tokens; every screenshot goes into the message history, so longer tasks cost more on both sides. Release the session in finally (the template already does) so the browser doesn't idle until the 15-minute timeout.

Make it yours

  • Change the task. Edit TASK in .env or pass it per-run. The agent takes one natural-language instruction and runs to completion; no conversation, no follow-up questions.
  • Tune the viewport. viewport_width / viewport_height in Agent.__init__. Claude picks coordinates against whatever you set, so update both the tools declaration and the sessions.create call together; they already read from the same attribute.
  • Rework the system prompt. BROWSER_SYSTEM_PROMPT is where site-specific knowledge lives. Add rules for sites you care about ("on github.com, use the / shortcut to focus search"), constraints ("never submit forms on finance.example.com"), or persona.
  • Persist a login. Pass session_context to sessions.create to resume with cookies and local storage from a previous run. Claude skips the login flow entirely. See credentials for the pattern.
  • Raise the ceiling. max_iterations=50 in execute_task is the safety net. Long research tasks may want 100+; short lookups can drop to 20.

Anthropic computer use docs · TypeScript version