Drive a browser with Gemini Computer Use
Connect Google's Gemini Computer Use to a Steel browser session for autonomous web interactions.
Scaffolds a starter project locally. Requires the Steel CLI.
Gemini exposes computer use as a built-in tool type, not a hand-written schema. You set config.tools = [{ computerUse: { environment: Environment.ENVIRONMENT_BROWSER } }] on a generateContent call and the model plans against a fixed action vocabulary (click_at, type_text_at, navigate, scroll_document, search, drag_and_drop, key_combination, ...) with coordinates in a normalized 0-1000 grid.
The model defaults to gemini-3-flash-preview. Conversation state lives entirely on your side, appended to this.contents turn by turn.
Coordinates and action mapping
Gemini plans in a 1000x1000 normalized grid regardless of the browser dimensions; denormalizeX and denormalizeY scale back to pixels off viewportWidth/viewportHeight (1440x900 by default).
    private denormalizeX(x: number): number {
      return Math.round((x / MAX_COORDINATE) * this.viewportWidth);
    }
Several of Gemini's actions are compound; the starter expands them. type_text_at fans into click, Ctrl+A, Backspace, type_text, Enter, wait, screenshot. navigate and search skip the URL bar hunt by doing the Chrome Ctrl+L trick: focus the address bar, type, press Enter, wait. key_combination arrives as a +-joined string like "Control+Enter"; splitKeys and normalizeKey break it apart and rewrite synonyms (CTRL to Control, CMD to Meta, ARROWUP to ArrowUp).
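The key-splitting step can be sketched as follows, written in Python to match the Python variant of this recipe. The synonym table and the names `split_keys` and `normalize_key` are assumptions modeled on the description above; the starter's own splitKeys and normalizeKey may cover more cases.

```python
# Hypothetical synonym table: upper-cased input name -> canonical key name.
KEY_SYNONYMS = {
    "CTRL": "Control",
    "CONTROL": "Control",
    "CMD": "Meta",
    "COMMAND": "Meta",
    "ARROWUP": "ArrowUp",
    "ARROWDOWN": "ArrowDown",
}


def normalize_key(key: str) -> str:
    """Rewrite a known synonym to its canonical name; pass other keys through."""
    token = key.strip()
    return KEY_SYNONYMS.get(token.upper(), token)


def split_keys(combo: str) -> list[str]:
    """Split a '+'-joined combination and normalize each part.

    Empty parts are dropped, so a literal '+' key is not supported here.
    """
    return [normalize_key(part) for part in combo.split("+") if part.strip()]


print(split_keys("CTRL+a"))         # ['Control', 'a']
print(split_keys("cmd+ARROWUP"))    # ['Meta', 'ArrowUp']
```

Matching on the upper-cased token keeps the lookup case-insensitive while leaving unknown keys such as "Enter" exactly as the model sent them.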
Every mapped action sets screenshot: true on the Steel call. The PNG comes back in the same response.
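As a rough illustration of the type_text_at fan-out described above, the sketch below builds the step list in Python. The step dictionaries, their field names, and the one-second wait are hypothetical; they are not Steel's actual action payloads.

```python
from typing import Any


def expand_type_text_at(x: int, y: int, text: str) -> list[dict[str, Any]]:
    """Expand one type_text_at call into primitive steps.

    x and y are pixel coordinates, already scaled from the 0-1000 grid.
    """
    return [
        {"action": "click", "x": x, "y": y},               # focus the field
        {"action": "press_key", "keys": ["Control", "a"]},  # select existing text
        {"action": "press_key", "keys": ["Backspace"]},     # clear it
        {"action": "type_text", "text": text},
        {"action": "press_key", "keys": ["Enter"]},         # submit
        {"action": "wait", "seconds": 1.0},                 # assumed settle time
        {"action": "screenshot"},                           # frame for the next turn
    ]


steps = expand_type_text_at(720, 225, "steel.dev")
print([step["action"] for step in steps])
```

Returning plain data rather than executing each step makes the expansion easy to inspect before any call reaches the browser.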
Sending frames back
Each completed call produces two parts in a single user-role turn: a functionResponse that names the call and echoes the current URL, then an inlineData part carrying the screenshot as raw base64.
    const functionResponse: FunctionResponse = {
      name: fc.name ?? "",
      response: { url: result.url ?? this.currentUrl },
    };
    parts.push({ functionResponse });
    parts.push({
      inlineData: {
        mimeType: "image/png",
        data: result.screenshotBase64,
      },
    });
The loop
Four exits, in rough order of frequency:
- Text, no function calls. The model wrote a final message.
- Empty turn. No calls, no text. consecutiveNoActions increments. Three in a row stops the loop.
- MALFORMED_FUNCTION_CALL with nothing else. A known quirk of the preview model; the loop continues to the next iteration.
- Iteration cap. 50 turns by default.
The finally in main calls agent.cleanup(), which releases the Steel session.
Run it
    cd examples/gemini-computer-use-ts
    cp .env.example .env # set STEEL_API_KEY and GEMINI_API_KEY
    npm install
    npm start
Get keys from app.steel.dev and aistudio.google.com. Override the task inline:
TASK="Find the current weather in New York City" npm start
Your output varies. Structure looks like this:
    Steel Session created successfully!
    View live session at: https://app.steel.dev/sessions/ab12cd34...
    Executing task: Go to Steel.dev and find the latest news
    ============================================================
    I'll navigate to steel.dev and scan the landing page for news.
    navigate({"url":"https://steel.dev"})
    scroll_document({"direction":"down"})
    click_at({"x":520,"y":410})
    Steel's latest release adds ...
    ============================================================
    TASK EXECUTION COMPLETED
    ============================================================
    Duration: 78.2 seconds
Expect roughly 60-120 seconds and 15-40 turns for a simple browsing task.
Make it yours
- Resize the viewport. viewportWidth/viewportHeight in the Agent constructor feed both the Steel session dimensions and the denormalizeX/denormalizeY math.
- Swap the model. this.model = "gemini-3-flash-preview" is the only version string.
- Tune the system prompt. BROWSER_SYSTEM_PROMPT carries the browsing conventions: today's date via formatToday(), clear-before-typing, batch-actions-when-possible, black-screen recovery.
- Gate safety decisions. Replace the auto-acknowledgement branch with a human approval before the next executeComputerAction fires.
- Hand off auth. Pair this recipe with Steel's credentials or auth contexts to start the session already logged in.
Related
Computer use docs · Python version · Anthropic equivalent · OpenAI equivalent
Gemini's computer use ships through google.genai as a single built-in tool: Tool(computer_use=ComputerUse(environment=ENVIRONMENT_BROWSER)). Setting ENVIRONMENT_BROWSER unlocks a fixed vocabulary of browser function calls (click_at, type_text_at, scroll_document, scroll_at, navigate, search, key_combination, drag_and_drop, hover_at, go_back, go_forward, open_web_browser, wait_5_seconds).
Steel supplies the screen. A Steel session is a headful Chromium in a VM, and sessions.computer(session_id, action=...) runs mouse and keyboard actions with a base64 PNG attached to the response.
The loop
Agent.execute_task seeds two user-role Parts (BROWSER_SYSTEM_PROMPT and the task) into self.contents, then calls generate_content in a loop:
    response = self.client.models.generate_content(
        model=self.model,
        contents=self.contents,
        config=self.config,
    )
    candidate = response.candidates[0]
    if candidate.content:
        self.contents.append(candidate.content)
    reasoning = self.extract_text(candidate)
    function_calls = self.extract_function_calls(candidate)
Gemini doesn't keep server-side conversation state, so every turn resends the full contents list including every prior screenshot.
Coordinates live in a 0-1000 canvas
Gemini never emits pixel coordinates. Every spatial argument (x, y, destination_x, destination_y, magnitude) is scaled against MAX_COORDINATE = 1000 regardless of viewport. denormalize_x and denormalize_y rescale onto Steel's viewport before each action:
    def denormalize_x(self, x: int) -> int:
        return int(x / MAX_COORDINATE * self.viewport_width)
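A small standalone sketch of the same scaling may help when checking numbers by hand. It mirrors the formula above as free functions; `denormalize`, `to_pixels`, and the 1440x900 default viewport are assumptions for illustration, not names from the starter.

```python
MAX_COORDINATE = 1000  # size of Gemini's normalized grid


def denormalize(value: int, extent: int) -> int:
    """Scale a 0-1000 grid value onto an axis that is `extent` pixels long."""
    return int(value / MAX_COORDINATE * extent)


def to_pixels(x: int, y: int, width: int = 1440, height: int = 900) -> tuple[int, int]:
    """Convert a normalized (x, y) pair to viewport pixels.

    The 1440x900 default is an assumption made for this sketch.
    """
    return denormalize(x, width), denormalize(y, height)


# Halfway across and a quarter of the way down the viewport.
print(to_pixels(500, 250))  # (720, 225)
```

Because int() truncates rather than rounds, a result can land one pixel below what a rounding variant would return.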
Sending screenshots back
Gemini expects each function response as two Parts in a user-role Content: a FunctionResponse with metadata, then an inline_data Blob carrying the PNG.
    function_response = FunctionResponse(
        name=fc.name or "",
        response={"url": url or self.current_url},
    )
    parts.append(Part(function_response=function_response))
    parts.append(
        Part(
            inline_data=types.Blob(
                mime_type="image/png",
                data=screenshot_base64,
            )
        )
    )
Stopping conditions
execute_task ends one of three ways:
1. Gemini emits only text, no function calls.
2. Three consecutive iterations produce neither text nor function calls.
3. max_iterations=50 caps total turns.
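The three exits can be sketched as a small control loop with the model call stubbed out. `Turn`, `run_loop`, `next_turn`, and the returned labels are hypothetical names for this illustration; the real execute_task also executes the calls and appends their responses.

```python
from typing import Callable, NamedTuple


class Turn(NamedTuple):
    text: str             # any text the model wrote this turn
    function_calls: list  # function calls the model requested this turn


def run_loop(
    next_turn: Callable[[], Turn],
    max_iterations: int = 50,
    max_no_action: int = 3,
) -> str:
    """Return which stopping condition ended the loop."""
    consecutive_no_actions = 0
    for _ in range(max_iterations):
        turn = next_turn()
        if turn.function_calls:
            consecutive_no_actions = 0
            continue  # the real loop would run the calls and send results back
        if turn.text:
            return "final_text"  # text only: the model is done
        consecutive_no_actions += 1
        if consecutive_no_actions >= max_no_action:
            return "no_actions"  # three empty turns in a row
    return "max_iterations"


turns = iter([Turn("", ["navigate"]), Turn("All done.", [])])
print(run_loop(lambda: next(turns)))  # final_text
```

Resetting the counter whenever a turn contains function calls is what makes the second condition count consecutive empty turns rather than total ones.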
Run it
    cd examples/gemini-computer-use-py
    cp .env.example .env # set STEEL_API_KEY and GEMINI_API_KEY
    uv run main.py
Steel keys live at app.steel.dev/settings/api-keys. Gemini keys come from aistudio.google.com/apikey.
Override the task per run:
TASK="Find the current weather in New York City" python main.py
Your output varies. Structure looks like this:
    Steel Session created successfully!
    View live session at: https://app.steel.dev/sessions/ab12cd34...
    Executing task: Go to Steel.dev and find the latest news
    ============================================================
    I'll open steel.dev and look for the latest news.
    navigate({"url": "https://steel.dev"})
    scroll_document({"direction": "down"})
    click_at({"x": 512, "y": 340})
    ...
    Task complete - model provided final response
    TASK EXECUTION COMPLETED
    Duration: 78.4 seconds
    Result: Steel's latest release notes mention ...
A run typically takes 60-180 seconds and 10-30 iterations. Because generate_content has no server-side state, every new turn resends the full self.contents list including every prior Blob. The finally block in main() calls sessions.release().
Make it yours
- Change the task. Edit TASK in .env or pass it inline.
- Swap the model. self.model = "gemini-3-flash-preview" in Agent.__init__.
- Tune the viewport. viewport_width and viewport_height in Agent.__init__ flow into sessions.create(dimensions=...).
- Gate safety confirmations. Replace the auto-acknowledge branch in execute_task with a human prompt.
- Persist a login. Pass session_context to sessions.create to resume with cookies and local storage. See credentials.
- Raise the ceiling. max_iterations=50 in execute_task bounds a single task.
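For the safety-confirmation item, one possible shape for a human gate is sketched below. It assumes the function call's args carry a `safety_decision` dict with `decision` and `explanation` fields, which is how I understand the Gemini docs; check the field names against the starter's auto-acknowledge branch. `confirm_safety` and its `ask` parameter are hypothetical.

```python
from typing import Callable


def confirm_safety(args: dict, ask: Callable[[str], str] = input) -> bool:
    """Return True if the action may proceed.

    Assumed shape: args["safety_decision"] = {"decision": ..., "explanation": ...}.
    Actions without a confirmation request pass through without prompting.
    """
    decision = args.get("safety_decision")
    if not decision or decision.get("decision") != "require_confirmation":
        return True
    explanation = decision.get("explanation", "no explanation provided")
    answer = ask(f"Model requests confirmation: {explanation}\nProceed? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}


# Usage with a scripted answer instead of a live prompt:
flagged = {"safety_decision": {"decision": "require_confirmation",
                               "explanation": "About to accept a cookie banner"}}
print(confirm_safety(flagged, ask=lambda prompt: "n"))  # False
```

Injecting `ask` keeps the gate testable and lets you swap the terminal prompt for a chat message or review queue later.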
Related
TypeScript version · Gemini Computer Use guide · google-genai SDK