Drive a browser with Claude Computer Use

Connect Claude to a Steel browser session for autonomous web interactions.

examples/claude-computer-use-ts
Contributors: Updated
Terminal

Scaffolds a starter project locally. Requires the Steel CLI.

Claude sees the screen as an image and returns concrete actions at pixel coordinates: left_click [640, 412], type "claude 4.7 opus", scroll down 3. Something has to execute those actions against a real browser and send the next screenshot back. That "something" is the agent loop in index.ts, and the browser is a Steel session.

The loop

The whole thing fits in one while block inside Agent.executeTask. Each iteration sends the growing message history plus the computer tool definition to Claude:

const response = await this.client.beta.messages.create({
model: this.model,
max_tokens: 4096,
messages: this.messages,
tools: this.tools,
betas: ["computer-use-2025-11-24"],
});

The tool definition declares computer_20251124 with the viewport's display_width_px and display_height_px. Keep it consistent with the Steel session's dimensions (1280x768 here) or clicks land in the wrong place.

executeComputerAction is the translation layer. Claude emits computer-use actions (left_click, type, key, scroll, screenshot, ...); Steel's Input API speaks a parallel vocabulary (click_mouse, type_text, press_key, scroll, take_screenshot):

case "left_click":
case "right_click":
case "middle_click":
case "double_click":
case "triple_click": {
body = {
action: "click_mouse",
button: buttonMap[action],
coordinates: coords,
screenshot: true,
};
break;
}

Every action sets screenshot: true, so Steel returns a fresh base64 PNG after each interaction. That PNG becomes the content of a tool_result block in the next user message.

A few translation details:

  • Keys get normalized. normalizeKey maps synonyms (CTRL to Control, CMD to Meta, ENTER to Enter) before sending to Steel.
  • Scroll is delta-based. Claude says scroll_direction: "down", scroll_amount: 3; Steel expects delta_x/delta_y in pixels. The code multiplies by 100 per step.
  • Drags default from center. left_click_drag only gives an end coordinate, so the start is the viewport center.

Stop conditions

  • No tool calls. Claude wrote only text. Task is complete.
  • Repetition. detectRepetition compares the last assistant message against the previous three by word overlap (>80%).
  • Iteration cap. 50 iterations by default.

The finally block in main always calls agent.cleanup(), which releases the Steel session.

Run it

cd examples/claude-computer-use-ts
cp .env.example .env # set STEEL_API_KEY and ANTHROPIC_API_KEY
npm install
npm start

Get keys from app.steel.dev and console.anthropic.com. Override the task inline:

TASK="Find the current weather in New York City" npm start

Your output varies. Structure looks like this:

Steel Session created successfully!
View live session at: https://app.steel.dev/sessions/ab12cd34...
Executing task: Go to Steel.dev and find the latest news
============================================================
I'll navigate to Steel.dev and look for the latest news.
computer({"action":"screenshot"})
computer({"action":"left_click","coordinate":[640,48]})
computer({"action":"type","text":"https://steel.dev"})
computer({"action":"key","text":"Return"})
...
Task complete - no further actions requested
TASK EXECUTION COMPLETED
Duration: 84.3 seconds
Result: Steel's latest news includes ...

Expect ~60-120 seconds and 15-40 iterations for a simple browsing task.

Make it yours

  • Change the viewport. viewportWidth and viewportHeight in the Agent constructor set both the Steel session dimensions and the tool definition's display_width_px/display_height_px. Keep them in sync.
  • Tune the system prompt. BROWSER_SYSTEM_PROMPT is where the browsing conventions live: date injection, screenshot-after-submit rule, black-screen recovery.
  • Raise the ceiling. Long tasks bump against the 50-iteration default in executeTask.
  • Hand off auth. Pair this recipe with Steel's credentials or auth contexts to start the session authenticated.

Computer use docs · Python version · Mobile variant

examples/claude-computer-use-py
Contributors: , Updated
Terminal

Scaffolds a starter project locally. Requires the Steel CLI.

Computer use is Anthropic's primitive for giving Claude direct control of a screen. You declare a computer tool with a viewport size; Claude replies with actions like left_click at (x, y), type with text, scroll, key. You execute each one and hand back a screenshot.

Steel supplies the screen. A Steel session is a headful Chromium in a VM reachable over HTTPS, and the Input API (sessions.computer) executes mouse and keyboard actions and returns a PNG in the same call.

The loop

Everything in main.py hangs off a single loop in Agent.execute_task. Seed the conversation with a system prompt and the task, then on each turn:

response = self.client.beta.messages.create(
model=self.model,
max_tokens=4096,
messages=self.messages,
tools=self.tools,
betas=["computer-use-2025-11-24"],
)
text, has_actions = self.process_response(response)
if not has_actions:
break

tools declares the computer tool Claude is allowed to call:

self.tools = [
{
"type": "computer_20251124",
"name": "computer",
"display_width_px": self.viewport_width,
"display_height_px": self.viewport_height,
"display_number": 1,
}
]

The viewport (1280x768) has to match what Steel renders or clicks land in the wrong place.

tool_use blocks go to execute_computer_action, which maps each Anthropic action name onto a Steel Input API call:

elif action in ("left_click", "right_click", "middle_click",
"double_click", "triple_click"):
body = {
"action": "click_mouse",
"button": button_map[action],
"coordinates": [coords[0], coords[1]],
"screenshot": True,
}

screenshot: True tells Steel to attach a base64 PNG to the response, so a click and the screenshot that proves it landed are one round-trip. The PNG goes back into messages as a tool_result with the matching tool_use_id.

Two normalization details: key / hold_key run names like CTRL+A through normalize_key (CTRL to Control, ESC to Escape, UP to ArrowUp), and scroll_amount is multiplied by 100 pixels per step.

Two things end the loop: Claude responds with only text (task done), or the last two assistant messages overlap 80%+ on word content (detect_repetition). A hard cap of 50 iterations catches anything that slips past both.

Run it

cd examples/claude-computer-use-py
cp .env.example .env # set STEEL_API_KEY and ANTHROPIC_API_KEY
uv run main.py

Get keys from app.steel.dev and console.anthropic.com. Default task lives in .env as TASK; you can override per-run:

TASK="Find the current weather in New York City" python main.py

Your output varies. Structure looks like this:

Starting Steel session...
Steel Session created successfully!
View live session at: https://app.steel.dev/sessions/ab12cd34…
Executing task: Go to Steel.dev and find the latest news
============================================================
I'll navigate to Steel.dev and look for the latest news.
computer({"action": "key", "text": "ctrl+l"})
computer({"action": "type", "text": "https://steel.dev"})
computer({"action": "key", "text": "Return"})
computer({"action": "screenshot"})
Task complete - no further actions requested
TASK EXECUTION COMPLETED
Duration: 74.3 seconds
Result: Steel just shipped …
Releasing Steel session...

A run typically takes 60-180 seconds and 10-30 loop iterations.

Make it yours

  • Change the task. Edit TASK in .env or pass it per-run.
  • Tune the viewport. viewport_width / viewport_height in Agent.__init__.
  • Rework the system prompt. BROWSER_SYSTEM_PROMPT is where site-specific knowledge lives.
  • Persist a login. Pass session_context to sessions.create to resume with cookies and local storage. See credentials.
  • Raise the ceiling. max_iterations=50 in execute_task is the safety net.

Anthropic computer use docs · TypeScript version