Drive a browser with Gemini Computer Use
Connect Google's Gemini Computer Use to a Steel browser session for autonomous web interactions.
Gemini exposes computer use as a built-in tool type, not a hand-written schema. You set config.tools = [{ computerUse: { environment: Environment.ENVIRONMENT_BROWSER } }] on a generateContent call and the model plans against a fixed action vocabulary (click_at, type_text_at, navigate, scroll_document, search, drag_and_drop, key_combination, ...) with coordinates in a normalized 0-1000 grid. You never declare the viewport to Gemini. You do have to denormalize its coordinates to pixels, fan a few of its compound actions into several Steel calls, and return each screenshot as inlineData. That's the translator in index.ts, wrapped around a Steel session.
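A minimal sketch of that call, assuming the @google/genai SDK with GEMINI_API_KEY set and a task string in hand:

```typescript
import { GoogleGenAI, Environment } from "@google/genai";

// Sketch of the tool declaration described above; `task` is a placeholder.
const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });
const task = "Go to steel.dev and find the latest news";

const contents = [{ role: "user", parts: [{ text: task }] }];

const response = await ai.models.generateContent({
  model: "gemini-3-flash-preview",
  contents, // the full history, replayed every turn
  config: {
    tools: [{ computerUse: { environment: Environment.ENVIRONMENT_BROWSER } }],
  },
});
```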
The model defaults to gemini-3-flash-preview. Conversation state lives entirely on your side, appended to this.contents turn by turn.
The translator
Agent.executeComputerAction is where Gemini's action names meet Steel's Input API. They overlap conceptually but rarely line up one-to-one, so the switch is where most of the design decisions sit.
Coordinates come first. Gemini plans in a 1000x1000 normalized grid regardless of the browser dimensions; denormalizeX and denormalizeY scale its coordinates back to pixels using viewportWidth/viewportHeight (1440x900 by default).
```typescript
private denormalizeX(x: number): number {
  return Math.round((x / MAX_COORDINATE) * this.viewportWidth);
}
```
A click at (500, 500) from the model lands at (720, 450) on a 1440x900 session. These two fields also set the Steel dimensions, so they're the single source of truth for the viewport. Change them in one place.
Several of Gemini's actions are compound; the starter expands them. type_text_at takes x, y, text, plus optional press_enter and clear_before_typing flags and fans into click, Ctrl+A, Backspace, type_text, Enter, wait, screenshot. navigate and search skip the URL bar hunt by doing the Chrome Ctrl+L trick: focus the address bar, type, press Enter, wait. scroll_document splits up/down into PageUp/PageDown keystrokes and left/right into pixel-delta scroll calls. key_combination arrives as a +-joined string like "Control+Enter"; splitKeys and normalizeKey break it apart and rewrite synonyms (CTRL to Control, CMD to Meta, ARROWUP to ArrowUp, F1-F12 passthrough) before it reaches Steel.
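The key rewriting is pure string work. A sketch of what splitKeys and normalizeKey do, with an illustrative synonym table that is likely a subset of the starter's:

```typescript
// Sketch of the key normalization pass; the starter's table may be larger.
const KEY_SYNONYMS: Record<string, string> = {
  CTRL: "Control",
  CMD: "Meta",
  ESC: "Escape",
  ARROWUP: "ArrowUp",
  ARROWDOWN: "ArrowDown",
};

function normalizeKey(key: string): string {
  const upper = key.trim().toUpperCase();
  if (/^F([1-9]|1[0-2])$/.test(upper)) return upper; // F1-F12 pass through
  return KEY_SYNONYMS[upper] ?? key.trim();
}

function splitKeys(combo: string): string[] {
  return combo.split("+").map(normalizeKey); // "Control+Enter" -> ["Control", "Enter"]
}
```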
open_web_browser is a no-op. Steel hands you a browser already open, so the mapping just takes a screenshot and returns.
Every other mapped action sets screenshot: true on the Steel call. The PNG comes back in the same response, avoiding a follow-up take_screenshot.
Sending frames back
This is where Gemini diverges hardest from Anthropic's tool_result and OpenAI's computer_call_output. Each completed call produces two parts in a single user-role turn: a functionResponse that names the call and echoes the current URL, then an inlineData part carrying the screenshot as raw base64.
```typescript
const functionResponse: FunctionResponse = {
  name: fc.name ?? "",
  response: { url: result.url ?? this.currentUrl },
};
parts.push({ functionResponse });
parts.push({
  inlineData: {
    mimeType: "image/png",
    data: result.screenshotBase64,
  },
});
```
No call_id to match, no data:image/png;base64, URL prefix, no separate source wrapper. The whole batch is built in buildFunctionResponseParts and pushed onto this.contents as one role: "user" entry. Get the shape wrong and the next turn either hallucinates or stalls; this is the main thing worth inspecting if the model loses the plot.
The loop itself
Agent.executeTask alternates between calling generateContent and replaying the result. Each response's first candidate is split two ways:
- extractFunctionCalls collects part.functionCall entries.
- extractText collects part.text, with a regex (/^[\s\d]*$/) that drops stray digit-only parts Gemini 3 Flash occasionally emits alongside real reasoning.
The raw candidate.content is appended to this.contents as-is, so the model sees its own plan on the next call.
Four exits, in rough order of frequency:
- Text, no function calls. The model wrote a final message. Loop breaks and that text becomes the return value.
- Empty turn. No calls, no text. consecutiveNoActions increments; three in a row stops the loop.
- MALFORMED_FUNCTION_CALL with nothing else. A known quirk of the preview model; the loop simply continues to the next iteration.
- Iteration cap. 50 turns by default (third argument to executeTask).
After the loop, a reverse walk over this.contents pulls the last role: "model" text, again filtering stray-digit parts, as the returned summary.
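A sketch of that reverse walk as a method, assuming the content shapes shown above:

```typescript
// Sketch of the summary extraction; assumes this.contents holds Content
// objects with role and parts as described in this section.
private finalSummary(): string {
  for (let i = this.contents.length - 1; i >= 0; i--) {
    const content = this.contents[i];
    if (content.role !== "model") continue;
    const text = (content.parts ?? [])
      .map((p) => p.text ?? "")
      .filter((t) => t.length > 0 && !/^[\s\d]*$/.test(t)) // drop stray digits
      .join("\n");
    if (text) return text;
  }
  return "";
}
```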
The finally in main calls agent.cleanup(), which releases the Steel session. Skipping it keeps the browser billed until the 15-minute timeout set in initialize().
Safety decisions
Gemini doesn't return a separate safety-checks array. When an action looks sensitive, the model attaches a safety_decision object inside the function call's own args. The starter inspects it during dispatch:
```typescript
const safetyDecision = actionArgs.safety_decision as
  | Record<string, unknown>
  | undefined;
if (safetyDecision?.decision === "require_confirmation") {
  console.log(`Safety confirmation required: ${safetyDecision.explanation}`);
  console.log("Auto-acknowledging safety check");
}
```
There's no check ID to echo back. Logging the explanation and proceeding is the protocol. For production, gate the executeComputerAction call behind a human when decision === "require_confirmation"; the safety block runs right before dispatch.
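One way to add that gate, with promptForApproval standing in as a hypothetical human-in-the-loop channel (CLI prompt, Slack message, review queue):

```typescript
// Hypothetical gate: promptForApproval is whatever approval channel you have.
if (safetyDecision?.decision === "require_confirmation") {
  const approved = await promptForApproval(
    `Gemini wants to run ${fc.name}: ${safetyDecision.explanation}`
  );
  if (!approved) {
    throw new Error("Action rejected by human reviewer");
  }
}
```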
Run it
```bash
cd examples/gemini-computer-use-ts
cp .env.example .env  # set STEEL_API_KEY and GEMINI_API_KEY
npm install
npm start
```
Get keys from app.steel.dev and aistudio.google.com. Override the task inline:
```bash
TASK="Find the current weather in New York City" npm start
```
A session viewer URL prints as the script starts. Open it in another tab to watch Gemini pilot the browser. Your output varies. Structure looks like this:
```
Steel Session created successfully!
View live session at: https://app.steel.dev/sessions/ab12cd34...

Executing task: Go to Steel.dev and find the latest news
============================================================
I'll navigate to steel.dev and scan the landing page for news.
navigate({"url":"https://steel.dev"})
scroll_document({"direction":"down"})
click_at({"x":520,"y":410})
Steel's latest release adds ...
============================================================
TASK EXECUTION COMPLETED
============================================================
Duration: 78.2 seconds
```
Expect roughly 60-120 seconds and 15-40 turns for a simple browsing task. Cost is Steel session time plus Gemini tokens, dominated by the per-turn inlineData screenshots.
Make it yours
- Resize the viewport. viewportWidth/viewportHeight in the Agent constructor feed both the Steel session dimensions and the denormalizeX/denormalizeY math. Nothing in the tool declaration needs to change.
- Swap the model. this.model = "gemini-3-flash-preview" is the only version string. Point it at another checkpoint that supports computer use without touching the loop.
- Tune the system prompt. BROWSER_SYSTEM_PROMPT carries the browsing conventions: today's date via formatToday(), clear-before-typing, batch-actions-when-possible, black-screen recovery. Edit it for your site or workflow.
- Gate safety decisions. Replace the auto-acknowledgement branch with a human approval before the next executeComputerAction fires.
- Hand off auth. Pair this recipe with Steel's credentials or auth contexts to start the session already logged in; computer use then just uses what the browser can already see.
Related
Computer use docs · Python version · Anthropic equivalent · OpenAI equivalent
Gemini's computer use ships through google.genai as a single built-in tool: Tool(computer_use=ComputerUse(environment=ENVIRONMENT_BROWSER)). You never write the schema. Setting ENVIRONMENT_BROWSER unlocks a fixed vocabulary of browser function calls (click_at, type_text_at, scroll_document, scroll_at, navigate, search, key_combination, drag_and_drop, hover_at, go_back, go_forward, open_web_browser, wait_5_seconds). Each one arrives as a FunctionCall with named arguments. You run it, send back a FunctionResponse plus a screenshot Blob, and the next turn sees the new frame.
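A minimal sketch of that setup with the google-genai SDK, assuming GEMINI_API_KEY is set in the environment:

```python
from google import genai
from google.genai import types

# Sketch of the single built-in tool declaration described above.
client = genai.Client()  # reads GEMINI_API_KEY from the environment

config = types.GenerateContentConfig(
    tools=[
        types.Tool(
            computer_use=types.ComputerUse(
                environment=types.Environment.ENVIRONMENT_BROWSER
            )
        )
    ]
)
```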
Steel supplies the screen. A Steel session is a headful Chromium in a VM, and sessions.computer(session_id, action=...) runs mouse and keyboard actions with a base64 PNG attached to the response. One round-trip per Gemini action.
The loop
Agent.execute_task seeds two user-role Parts (BROWSER_SYSTEM_PROMPT and the task) into self.contents, then calls generate_content in a loop:
```python
response = self.client.models.generate_content(
    model=self.model,
    contents=self.contents,
    config=self.config,
)
candidate = response.candidates[0]
if candidate.content:
    self.contents.append(candidate.content)
reasoning = self.extract_text(candidate)
function_calls = self.extract_function_calls(candidate)
```
self.config holds the single computer-use tool. The model is gemini-3-flash-preview. Gemini doesn't keep server-side conversation state like OpenAI's Responses API, so every turn resends the full contents list: task, prior candidates, every screenshot. That is the dominant cost on long runs (see "Cost shape" below).
extract_function_calls walks candidate.content.parts pulling out each part.function_call. extract_text does the same for part.text with one filter: re.fullmatch(r"[\s\d]*", part.text) drops stray digit-only and whitespace-only parts ("0", "00") that gemini-3-flash-preview occasionally emits alongside real reasoning. Swap models and you can probably drop this filter.
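A sketch of that filter in isolation:

```python
import re

def extract_text(candidate) -> str:
    """Collect reasoning text, dropping the digit/whitespace-only parts
    ("0", "00") that gemini-3-flash-preview sometimes emits."""
    texts = [
        part.text
        for part in (candidate.content.parts or [])
        if part.text and not re.fullmatch(r"[\s\d]*", part.text)
    ]
    return "\n".join(texts)
```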
Named function calls, not one computer tool
Claude emits actions inside one computer tool. OpenAI emits computer_call items with a fixed action field. Gemini emits distinct FunctionCalls by name. execute_computer_action dispatches on function_call.name:
```python
elif name == "click_at":
    x = self.denormalize_x(args.get("x", 0))
    y = self.denormalize_y(args.get("y", 0))
    resp = self.steel.sessions.computer(
        self.session.id,
        action="click_mouse",
        button="left",
        coordinates=[x, y],
        screenshot=True,
    )
```
Most branches are thin wrappers around one sessions.computer call with screenshot=True, so the action and its resulting frame come back in one round-trip. A few compose multiple Steel calls. type_text_at clicks the target, optionally runs Ctrl+A then Backspace to clear, types, optionally presses Enter, waits 1 second, then screenshots. navigate and search press Ctrl+L to focus the address bar, type the URL, and press Enter (far more reliable than asking the model to click the address bar itself). scroll_document translates directional scrolls into PageUp / PageDown for vertical and a wheel scroll for horizontal.
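A sketch of the type_text_at fan-out as a method. Only "click_mouse" appears verbatim in this section; treat "press_keys" and "type_text" as assumed Steel action names that mirror the prose:

```python
import time

def type_text_at(self, args: dict) -> None:
    # Sketch only: "press_keys" and "type_text" are assumed action names.
    x = self.denormalize_x(args.get("x", 0))
    y = self.denormalize_y(args.get("y", 0))
    self.steel.sessions.computer(
        self.session.id, action="click_mouse", button="left", coordinates=[x, y]
    )
    if args.get("clear_before_typing", True):  # default per the prose
        self.steel.sessions.computer(self.session.id, action="press_keys", keys=["Control", "A"])
        self.steel.sessions.computer(self.session.id, action="press_keys", keys=["Backspace"])
    self.steel.sessions.computer(self.session.id, action="type_text", text=args.get("text", ""))
    if args.get("press_enter"):
        self.steel.sessions.computer(self.session.id, action="press_keys", keys=["Enter"])
    time.sleep(1)  # let the page settle before the screenshot
```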
Coordinates live in a 0-1000 canvas
Gemini never emits pixel coordinates. Every spatial argument (x, y, destination_x, destination_y, magnitude) is scaled against MAX_COORDINATE = 1000 regardless of viewport. denormalize_x and denormalize_y rescale onto Steel's viewport before each action:
```python
def denormalize_x(self, x: int) -> int:
    return int(x / MAX_COORDINATE * self.viewport_width)
```
Changing viewport_width / viewport_height (default 1440x900) requires no change to the tool declaration. The model keeps pointing in 0-1000 space and the denormalizer stretches onto whatever Steel renders. Denormalize in every branch that takes spatial arguments, including scroll_at's magnitude which Gemini specifies in the same 0-1000 range as positions.
Key names get a separate pass. key_combination arrives as a keys string like "Ctrl+A". split_keys breaks on +, then normalize_key maps Gemini's vocabulary onto Steel / DOM names (CTRL to Control, ESC to Escape, UP to ArrowUp, function keys pass through as F1 through F12).
Sending screenshots back
Gemini expects each function response as two Parts in a user-role Content: a FunctionResponse with metadata, then an inline_data Blob carrying the PNG. build_function_response_parts builds the pair:
```python
function_response = FunctionResponse(
    name=fc.name or "",
    response={"url": url or self.current_url},
)
parts.append(Part(function_response=function_response))
parts.append(Part(inline_data=types.Blob(
    mime_type="image/png",
    data=screenshot_base64,
)))
```
Two details. The screenshot is raw base64, not a data: URL. The SDK wraps it in a Blob with an explicit mime_type. And the response field carries only the current URL. The visual channel (the Blob) is what Gemini actually reads; the URL helps it ground "did my navigate land where I expected" without parsing the screenshot. When multiple function calls come back in one turn, every one gets its own (FunctionResponse, Blob) pair and all of them go into one user Content before the next generate_content.
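A sketch of that batching, where each result is an assumed holder for the URL and screenshot from one Steel call:

```python
from google.genai import types

def build_function_response_parts(self, results) -> types.Content:
    """Sketch: one (FunctionResponse, Blob) pair per completed call,
    all inside a single user-role Content."""
    parts = []
    for fc, result in results:  # result.url / result.screenshot_base64 assumed
        parts.append(types.Part(function_response=types.FunctionResponse(
            name=fc.name or "",
            response={"url": result.url or self.current_url},
        )))
        parts.append(types.Part(inline_data=types.Blob(
            mime_type="image/png",
            data=result.screenshot_base64,
        )))
    return types.Content(role="user", parts=parts)
```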
Safety decisions and malformed calls
Gemini surfaces safety concerns differently from OpenAI's pending_safety_checks. The review rides inside the function-call arguments as a safety_decision dict:
```python
safety_decision = action_args.get("safety_decision")
if (
    isinstance(safety_decision, dict)
    and safety_decision.get("decision") == "require_confirmation"
):
    print(f"Safety confirmation required: {safety_decision.get('explanation')}")
    print("Auto-acknowledging safety check")
```
The starter prints and proceeds. For production, gate on a human response: if decision == "require_confirmation", skip execute_computer_action until someone approves, and send back a refusal FunctionResponse otherwise.
Separately, Gemini sometimes returns candidate.finish_reason == FinishReason.MALFORMED_FUNCTION_CALL with no callable function calls and no reasoning. The loop catches this and re-runs generate_content, which usually recovers on the next iteration.
Stopping conditions
execute_task ends one of three ways:
1. Gemini emits only text, no function calls. That's the "I'm done" signal. The latest text becomes the returned result.
2. Three consecutive iterations produce neither text nor function calls (consecutive_no_actions >= 3). A safety net for stalls.
3. max_iterations=50 caps total turns.
On exit the loop walks self.contents in reverse looking for the last role == "model" content with filtered text, and returns that as the final answer.
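Condensed, the exit logic looks like this, with step and dispatch as hypothetical stand-ins for the generate_content turn and the Steel dispatch:

```python
def run_loop(self, max_iterations: int = 50) -> None:
    """Condensed sketch of the exit logic; `step` and `dispatch` are
    hypothetical stand-ins for the real turn and action handling."""
    consecutive_no_actions = 0
    for _ in range(max_iterations):              # exit 3: iteration cap
        reasoning, function_calls = self.step()  # one generate_content turn
        if not function_calls:
            if reasoning:
                break                            # exit 1: text-only final turn
            consecutive_no_actions += 1
            if consecutive_no_actions >= 3:
                break                            # exit 2: stall safety net
            continue
        consecutive_no_actions = 0
        self.dispatch(function_calls)            # run actions, append responses
```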
The system prompt carries browser habits
BROWSER_SYSTEM_PROMPT encodes things Gemini won't discover on its own:
- Clear before typing: Ctrl+A, then Delete. Otherwise typed text concatenates with whatever is already in the field. type_text_at already does this when clear_before_typing=True (the default), but the prompt reinforces it for custom flows.
- Batch related actions. Gemini can return multiple FunctionCalls in one turn; reminding it reduces round-trips.
- Black first screenshot? Click the center and retry. Cold sessions sometimes land focus off-window.
- Today's date is injected via format_today() so Gemini doesn't browse as if it were a year ago.
Edit when you add site-specific knowledge. Don't strip the typing and black-screenshot rules.
Run it
```bash
cd examples/gemini-computer-use-py
cp .env.example .env  # set STEEL_API_KEY and GEMINI_API_KEY
uv run main.py
```
Steel keys live at app.steel.dev/settings/api-keys. Gemini keys come from aistudio.google.com/apikey. The script prints a session viewer URL as soon as the Steel session is up. Open it in another tab to watch Gemini drive the browser. Every function call prints as click_at({...}) or type_text_at({...}) so you can follow along.
Override the task per run:
```bash
TASK="Find the current weather in New York City" uv run main.py
```
Your output varies. Structure looks like this:
```
Steel Session created successfully!
View live session at: https://app.steel.dev/sessions/ab12cd34...

Executing task: Go to Steel.dev and find the latest news
============================================================
I'll open steel.dev and look for the latest news.
navigate({"url": "https://steel.dev"})
scroll_document({"direction": "down"})
click_at({"x": 512, "y": 340})
...
Task complete - model provided final response
TASK EXECUTION COMPLETED
Duration: 78.4 seconds
Result: Steel's latest release notes mention ...
```
Cost shape
A run typically takes 60-180 seconds and 10-30 iterations. You pay for Steel session-minutes and Gemini tokens. Because generate_content has no server-side state, every new turn resends the full self.contents list including every prior Blob. Image tokens dominate on long tasks; halving iterations roughly halves the tail cost. The finally block in main() calls sessions.release(). Without it the session idles until Steel's default timeout.
Make it yours
- Change the task. Edit TASK in .env or pass it inline. The agent runs one instruction to completion; no follow-up turns.
- Swap the model. self.model = "gemini-3-flash-preview" in Agent.__init__. Any Gemini computer-use-capable model works. If you move off 3 Flash, consider dropping the re.fullmatch(r"[\s\d]*", ...) filter in extract_text since the stray-digit quirk is model-specific.
- Tune the viewport. viewport_width and viewport_height in Agent.__init__ flow into sessions.create(dimensions=...). The 0-1000 coordinate space means the tool declaration never changes; denormalize_x/denormalize_y pick up the new size.
- Gate safety confirmations. Replace the auto-acknowledge branch in execute_task with a human prompt (or a refusal FunctionResponse). Gemini stops short of the action until you return a response.
- Persist a login. Pass session_context to sessions.create to resume with cookies and local storage so Gemini skips the login flow. See credentials for the pattern.
- Raise the ceiling. max_iterations=50 in execute_task bounds a single task. Long research benefits from 100+; short lookups can drop to 20.
Related
TypeScript version · Gemini Computer Use guide · google-genai SDK
Related recipes
Drive a mobile browser with Claude Computer Use
Claude Computer Use with Steel for autonomous task execution in mobile browser environments.
Drive a browser with Claude Computer Use
Connect Claude to a Steel browser session for autonomous web interactions.
Drive a browser with OpenAI Computer Use
Connect OpenAI's Computer Use Assistant to a Steel browser session for autonomous web interactions.