Drive a browser with Gemini Computer Use

Connect Google's Gemini Computer Use to a Steel browser session for autonomous web interactions.

examples/gemini-computer-use-ts

Scaffolds a starter project locally. Requires the Steel CLI.

Gemini exposes computer use as a built-in tool type, not a hand-written schema. You set config.tools = [{ computerUse: { environment: Environment.ENVIRONMENT_BROWSER } }] on a generateContent call and the model plans against a fixed action vocabulary (click_at, type_text_at, navigate, scroll_document, search, drag_and_drop, key_combination, ...) with coordinates in a normalized 0-1000 grid.

The model defaults to gemini-3-flash-preview. Conversation state lives entirely on your side, appended to this.contents turn by turn.

Coordinates and action mapping

Gemini plans in a 1000x1000 normalized grid regardless of the browser dimensions. denormalizeX and denormalizeY scale those values back to pixels using viewportWidth and viewportHeight (1440x900 by default).

private denormalizeX(x: number): number {
  return Math.round((x / MAX_COORDINATE) * this.viewportWidth);
}
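As a worked example, here is a standalone sketch of the same scaling for both axes. It assumes MAX_COORDINATE is 1000 and uses the 1440x900 default viewport; the free function denormalize and the defaultViewport object are illustrative names, not part of the starter.

```typescript
// Assumption: MAX_COORDINATE = 1000, matching the normalized grid described above.
const MAX_COORDINATE = 1000;

// Scale one normalized coordinate onto a viewport dimension, rounding to whole pixels.
function denormalize(value: number, viewportSize: number): number {
  return Math.round((value / MAX_COORDINATE) * viewportSize);
}

// Default viewport from the text; the variable name is illustrative.
const defaultViewport = { width: 1440, height: 900 };

// A click_at({"x":520,"y":410}) call lands on this pixel.
const pixel = {
  x: denormalize(520, defaultViewport.width),
  y: denormalize(410, defaultViewport.height),
};
console.log(pixel); // { x: 749, y: 369 }
```

The Python version of this recipe truncates with int() instead of rounding, so the two starters can differ by one pixel for the same normalized input.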

Several of Gemini's actions are compound, and the starter expands them into primitive steps. type_text_at becomes click, Ctrl+A, Backspace, type_text, Enter, wait, screenshot. navigate and search avoid hunting for the URL bar by using Chrome's Ctrl+L shortcut: focus the address bar, type, press Enter, wait. key_combination arrives as a string joined with +, such as "Control+Enter"; splitKeys and normalizeKey break it apart and rewrite synonyms (CTRL to Control, CMD to Meta, ARROWUP to ArrowUp).
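The key handling and the type_text_at expansion can be sketched roughly as follows. This is a hypothetical reimplementation, not the starter's code: the alias table holds only the three synonyms named above, the Step type is an illustrative shape rather than Steel's payload format, and the one-second wait is an assumed value.

```typescript
// Hypothetical alias table: only the synonyms mentioned in the text.
const KEY_ALIASES: Record<string, string> = {
  CTRL: "Control",
  CMD: "Meta",
  ARROWUP: "ArrowUp",
};

// Rewrite a known synonym to its canonical name; leave other keys untouched.
function normalizeKey(key: string): string {
  const trimmed = key.trim();
  return KEY_ALIASES[trimmed.toUpperCase()] ?? trimmed;
}

// Break a "+"-joined combination into normalized key names.
// Limitation of this sketch: it cannot express the literal "+" key.
function splitKeys(combo: string): string[] {
  return combo
    .split("+")
    .map(normalizeKey)
    .filter((key) => key.length > 0);
}

// Illustrative step shape (an assumption, not Steel's wire format).
type Step =
  | { kind: "click"; x: number; y: number }
  | { kind: "press"; keys: string[] }
  | { kind: "type"; text: string }
  | { kind: "wait"; ms: number }
  | { kind: "screenshot" };

// Expand type_text_at into the sequence listed above. x and y are already pixels.
function expandTypeTextAt(x: number, y: number, text: string): Step[] {
  return [
    { kind: "click", x, y },
    { kind: "press", keys: splitKeys("CTRL+A") }, // select existing text
    { kind: "press", keys: ["Backspace"] }, // clear it
    { kind: "type", text },
    { kind: "press", keys: ["Enter"] },
    { kind: "wait", ms: 1000 }, // assumed delay
    { kind: "screenshot" },
  ];
}

console.log(splitKeys("cmd+ARROWUP")); // [ 'Meta', 'ArrowUp' ]
```

Keeping the expansion as data makes the sequence easy to inspect or log before any step reaches the browser.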

Every mapped action sets screenshot: true on the Steel call. The PNG comes back in the same response.

Sending frames back

Each completed call produces two parts in a single user-role turn: a functionResponse that names the call and echoes the current URL, then an inlineData part carrying the screenshot as raw base64.

const functionResponse: FunctionResponse = {
  name: fc.name ?? "",
  response: { url: result.url ?? this.currentUrl },
};
parts.push({ functionResponse });
parts.push({
  inlineData: {
    mimeType: "image/png",
    data: result.screenshotBase64,
  },
});

The loop

Three exits and one retry, in rough order of frequency:

  • Text, no function calls. The model wrote a final message.
  • Empty turn. No calls, no text. consecutiveNoActions increments. Three in a row stops the loop.
  • MALFORMED_FUNCTION_CALL with nothing else. A known quirk of the preview model; the loop continues to the next iteration.
  • Iteration cap. 50 turns by default.
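The control flow in that list can be sketched as a small driver over summarized turns. The Turn shape, the runLoop name, and the nextTurn callback are hypothetical. The sketch also assumes a malformed turn does not count toward consecutiveNoActions and that any turn with function calls resets the counter; the starter may handle either differently.

```typescript
// Hypothetical summary of one model turn.
interface Turn {
  text: string;
  functionCalls: number;
  finishReason?: string;
}

type StopReason = "final_text" | "idle" | "iteration_cap";

function runLoop(
  nextTurn: (iteration: number) => Turn,
  maxIterations = 50,
  maxNoActions = 3,
): StopReason {
  let consecutiveNoActions = 0;
  for (let i = 0; i < maxIterations; i++) {
    const turn = nextTurn(i);
    if (turn.functionCalls > 0) {
      // Normal case: execute the calls and send screenshots back (omitted here).
      consecutiveNoActions = 0;
      continue;
    }
    if (turn.text.trim() !== "") {
      return "final_text"; // text with no function calls: the model is done
    }
    if (turn.finishReason === "MALFORMED_FUNCTION_CALL") {
      continue; // assumption: retry without counting the turn as idle
    }
    consecutiveNoActions += 1; // empty turn: no calls, no text
    if (consecutiveNoActions >= maxNoActions) {
      return "idle";
    }
  }
  return "iteration_cap";
}

// Two action turns, then a final message.
const reason = runLoop((i) =>
  i < 2 ? { text: "", functionCalls: 1 } : { text: "Done.", functionCalls: 0 },
);
console.log(reason); // final_text
```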

The finally block in main calls agent.cleanup(), which releases the Steel session.

Run it

cd examples/gemini-computer-use-ts
cp .env.example .env # set STEEL_API_KEY and GEMINI_API_KEY
npm install
npm start

Get keys from app.steel.dev and aistudio.google.com. Override the task inline:

TASK="Find the current weather in New York City" npm start

Your output will vary, but the structure looks like this:

Steel Session created successfully!
View live session at: https://app.steel.dev/sessions/ab12cd34...
Executing task: Go to Steel.dev and find the latest news
============================================================
I'll navigate to steel.dev and scan the landing page for news.
navigate({"url":"https://steel.dev"})
scroll_document({"direction":"down"})
click_at({"x":520,"y":410})
Steel's latest release adds ...
============================================================
TASK EXECUTION COMPLETED
============================================================
Duration: 78.2 seconds

Expect roughly 60-120 seconds and 15-40 turns for a simple browsing task.

Make it yours

  • Resize the viewport. viewportWidth / viewportHeight in the Agent constructor feed both the Steel session dimensions and the denormalizeX / denormalizeY math.
  • Swap the model. this.model = "gemini-3-flash-preview" is the only place the model version is set.
  • Tune the system prompt. BROWSER_SYSTEM_PROMPT carries the browsing conventions: today's date via formatToday(), clear-before-typing, batch-actions-when-possible, black-screen recovery.
  • Gate safety decisions. Replace the auto-acknowledgement branch with a human approval before the next executeComputerAction fires.
  • Hand off auth. Pair this recipe with Steel's credentials or auth contexts to start the session already logged in.
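For the safety gate, one possible shape is a function that inspects the call's arguments and defers to an approver callback before anything executes. The safety_decision field and the "require_confirmation" value follow my reading of Gemini's Computer Use guide and are assumptions here, as are gateAction, Approver, and GateResult; check them against the starter's auto-acknowledgement branch before relying on them.

```typescript
// Assumed argument shape; verify the field names against the starter.
interface SafetyDecision {
  decision: string;
  explanation?: string;
}

// Returns true when a human approves the action.
type Approver = (explanation: string) => boolean;

interface GateResult {
  proceed: boolean; // whether the action should run
  acknowledged: boolean; // whether a human explicitly approved a flagged action
}

function gateAction(args: Record<string, unknown>, approve: Approver): GateResult {
  const safety = args["safety_decision"] as SafetyDecision | undefined;
  if (!safety || safety.decision !== "require_confirmation") {
    return { proceed: true, acknowledged: false }; // nothing to confirm
  }
  const approved = approve(safety.explanation ?? "No explanation provided.");
  return { proceed: approved, acknowledged: approved };
}

// Example with a stand-in approver that always declines.
const gateResult = gateAction(
  { x: 520, y: 410, safety_decision: { decision: "require_confirmation", explanation: "Accepting a cookie banner" } },
  () => false,
);
console.log(gateResult); // { proceed: false, acknowledged: false }
```

In a real CLI the approver would prompt on stdin. It is kept synchronous and injectable here so the decision logic can be tested without a terminal.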

Computer use docs · Python version · Anthropic equivalent · OpenAI equivalent

examples/gemini-computer-use-py

Scaffolds a starter project locally. Requires the Steel CLI.

Gemini's computer use ships through google.genai as a single built-in tool: Tool(computer_use=ComputerUse(environment=ENVIRONMENT_BROWSER)). Setting ENVIRONMENT_BROWSER unlocks a fixed vocabulary of browser function calls (click_at, type_text_at, scroll_document, scroll_at, navigate, search, key_combination, drag_and_drop, hover_at, go_back, go_forward, open_web_browser, wait_5_seconds).

Steel supplies the screen. A Steel session is a headful Chromium in a VM, and sessions.computer(session_id, action=...) runs mouse and keyboard actions with a base64 PNG attached to the response.

The loop

Agent.execute_task seeds two user-role Parts (BROWSER_SYSTEM_PROMPT and the task) into self.contents, then calls generate_content in a loop:

response = self.client.models.generate_content(
    model=self.model,
    contents=self.contents,
    config=self.config,
)
candidate = response.candidates[0]
if candidate.content:
    self.contents.append(candidate.content)
reasoning = self.extract_text(candidate)
function_calls = self.extract_function_calls(candidate)

Gemini doesn't keep server-side conversation state, so every turn resends the full contents list including every prior screenshot.

Coordinates live in a 0-1000 canvas

Gemini never emits pixel coordinates. Every spatial argument (x, y, destination_x, destination_y, magnitude) is scaled against MAX_COORDINATE = 1000 regardless of viewport. denormalize_x and denormalize_y rescale onto Steel's viewport before each action:

def denormalize_x(self, x: int) -> int:
    return int(x / MAX_COORDINATE * self.viewport_width)

Sending screenshots back

Gemini expects each function response as two Parts in a user-role Content: a FunctionResponse with metadata, then an inline_data Blob carrying the PNG.

function_response = FunctionResponse(
    name=fc.name or "",
    response={"url": url or self.current_url},
)
parts.append(Part(function_response=function_response))
parts.append(
    Part(
        inline_data=types.Blob(
            mime_type="image/png",
            data=screenshot_base64,
        )
    )
)

Stopping conditions

execute_task ends one of three ways:

  1. Gemini emits only text, no function calls.
  2. Three consecutive iterations produce neither text nor function calls.
  3. max_iterations=50 caps total turns.

Run it

cd examples/gemini-computer-use-py
cp .env.example .env # set STEEL_API_KEY and GEMINI_API_KEY
uv run main.py

Steel keys live at app.steel.dev/settings/api-keys. Gemini keys come from aistudio.google.com/apikey.

Override the task per run:

TASK="Find the current weather in New York City" uv run main.py

Your output will vary, but the structure looks like this:

Steel Session created successfully!
View live session at: https://app.steel.dev/sessions/ab12cd34...
Executing task: Go to Steel.dev and find the latest news
============================================================
I'll open steel.dev and look for the latest news.
navigate({"url": "https://steel.dev"})
scroll_document({"direction": "down"})
click_at({"x": 512, "y": 340})
...
Task complete - model provided final response
TASK EXECUTION COMPLETED
Duration: 78.4 seconds
Result: Steel's latest release notes mention ...

A run typically takes 60-180 seconds and 10-30 iterations. Because generate_content keeps no server-side state, each turn resends the full self.contents list, including every prior Blob, so request size grows as the run goes on. The finally block in main() calls sessions.release() to release the Steel session.

Make it yours

  • Change the task. Edit TASK in .env or pass it inline.
  • Swap the model. self.model = "gemini-3-flash-preview" in Agent.__init__.
  • Tune the viewport. viewport_width and viewport_height in Agent.__init__ flow into sessions.create(dimensions=...).
  • Gate safety confirmations. Replace the auto-acknowledge branch in execute_task with a human prompt.
  • Persist a login. Pass session_context to sessions.create to resume with cookies and local storage. See credentials.
  • Raise the ceiling. max_iterations=50 in execute_task bounds a single task.

TypeScript version · Gemini Computer Use guide · google-genai SDK