# Drive a browser with Gemini Computer Use
URL: /cookbook/gemini-computer-use

---
title: Drive a browser with Gemini Computer Use
description: "Connect Google's Gemini Computer Use to a Steel browser session for autonomous web interactions."
---

<RecipeJsonLd slug="gemini-computer-use" title={"Drive a browser with Gemini Computer Use"} description={"Connect Google's Gemini Computer Use to a Steel browser session for autonomous web interactions."} authors={[{"handle":"junhsss","name":"Jun Ryu"}]} datePublished="2025-11-25" dateModified="2026-04-24" sourceUrl="https://github.com/steel-dev/steel-cookbook/tree/92f29742253e2b6c6801d109e18232768e5291a0/examples/gemini-computer-use-ts" />

<Tabs items={['TypeScript', 'Python']} groupId="lang" persist updateAnchor className="cookbook-concept-tabs">

<Tab id="typescript" className="cookbook-concept-tab">

<RecipeMeta href="https://github.com/steel-dev/steel-cookbook/tree/92f29742253e2b6c6801d109e18232768e5291a0/examples/gemini-computer-use-ts" path="examples/gemini-computer-use-ts" authors={[{"handle":"junhsss","name":"Jun Ryu","avatar":"https://github.com/junhsss.png?size=40"}]} updated="2026-04-24" />

<RecipeQuickstart slug="gemini-computer-use-ts" />

Gemini exposes computer use as a built-in tool type, not a hand-written schema. You set `config.tools = [{ computerUse: { environment: Environment.ENVIRONMENT_BROWSER } }]` on a `generateContent` call and the model plans against a fixed action vocabulary (`click_at`, `type_text_at`, `navigate`, `scroll_document`, `search`, `drag_and_drop`, `key_combination`, ...) with coordinates in a normalized 0-1000 grid.

The starter defaults to the `gemini-3-flash-preview` model. Conversation state lives entirely on your side, appended to `this.contents` turn by turn.

## Coordinates and action mapping

Gemini plans in a 1000x1000 normalized grid regardless of the browser dimensions; `denormalizeX` and `denormalizeY` scale those values back to pixels using `viewportWidth` and `viewportHeight` (1440x900 by default).

```typescript
// Map a normalized 0-1000 x value onto viewport pixels.
private denormalizeX(x: number): number {
  return Math.round((x / MAX_COORDINATE) * this.viewportWidth);
}
```

Several of Gemini's actions are compound; the starter expands them. `type_text_at` fans into click, Ctrl+A, Backspace, type_text, Enter, wait, screenshot. `navigate` and `search` skip the URL bar hunt by doing the Chrome `Ctrl+L` trick: focus the address bar, type, press Enter, wait. `key_combination` arrives as a `+`-joined string like `"Control+Enter"`; `splitKeys` and `normalizeKey` break it apart and rewrite synonyms (`CTRL` to `Control`, `CMD` to `Meta`, `ARROWUP` to `ArrowUp`).

Every mapped action sets `screenshot: true` on the Steel call. The PNG comes back in the same response.

## Sending frames back

Each completed call produces two parts in a single user-role turn: a `functionResponse` that names the call and echoes the current URL, then an `inlineData` part carrying the screenshot as raw base64.

```typescript
const functionResponse: FunctionResponse = {
  name: fc.name ?? "",
  response: { url: result.url ?? this.currentUrl },
};
parts.push({ functionResponse });

parts.push({
  inlineData: {
    mimeType: "image/png",
    data: result.screenshotBase64,
  },
});
```

## The loop

Three exits and one retry, in rough order of frequency:

- **Text, no function calls.** The model wrote a final message, so the loop ends.
- **Empty turn.** No calls, no text. `consecutiveNoActions` increments, and three in a row stops the loop.
- **`MALFORMED_FUNCTION_CALL` with nothing else.** A known quirk of the preview model. This is not an exit: the loop continues to the next iteration.
- **Iteration cap.** 50 turns by default.

The `finally` block in `main` calls `agent.cleanup()`, which releases the Steel session whether the loop finished normally or threw.

## Run it

```bash
cd examples/gemini-computer-use-ts
cp .env.example .env          # set STEEL_API_KEY and GEMINI_API_KEY
npm install
npm start
```

Get keys from [app.steel.dev](https://app.steel.dev/settings/api-keys) and [aistudio.google.com](https://aistudio.google.com/apikey). Override the task inline:

```bash
TASK="Find the current weather in New York City" npm start
```

Your output will vary, but the structure looks like this:

```text
Steel Session created successfully!
View live session at: https://app.steel.dev/sessions/ab12cd34...

Executing task: Go to Steel.dev and find the latest news
============================================================
I'll navigate to steel.dev and scan the landing page for news.
navigate({"url":"https://steel.dev"})
scroll_document({"direction":"down"})
click_at({"x":520,"y":410})
Steel's latest release adds ...

============================================================
TASK EXECUTION COMPLETED
============================================================
Duration: 78.2 seconds
```

Expect roughly 60-120 seconds and 15-40 turns for a simple browsing task.

## Make it yours

- **Resize the viewport.** `viewportWidth` / `viewportHeight` in the `Agent` constructor feed both the Steel session `dimensions` and the `denormalizeX` / `denormalizeY` math.
- **Swap the model.** `this.model = "gemini-3-flash-preview"` is the only version string.
- **Tune the system prompt.** `BROWSER_SYSTEM_PROMPT` carries the browsing conventions: today's date via `formatToday()`, clear-before-typing, batch-actions-when-possible, black-screen recovery.
- **Gate safety decisions.** Replace the auto-acknowledgement branch with a human approval before the next `executeComputerAction` fires.
- **Hand off auth.** Pair this recipe with Steel's [credentials](/cookbook/credentials) or [auth contexts](/cookbook/auth-context) to start the session already logged in.

## Related

[Computer use docs](https://ai.google.dev/gemini-api/docs/computer-use) · [Python version](/cookbook/gemini-computer-use) · [Anthropic equivalent](/cookbook/claude-computer-use) · [OpenAI equivalent](/cookbook/openai-computer-use)

</Tab>

<Tab id="python" className="cookbook-concept-tab">

<RecipeMeta href="https://github.com/steel-dev/steel-cookbook/tree/92f29742253e2b6c6801d109e18232768e5291a0/examples/gemini-computer-use-py" path="examples/gemini-computer-use-py" authors={[{"handle":"junhsss","name":"Jun Ryu","avatar":"https://github.com/junhsss.png?size=40"}]} updated="2026-04-24" />

<RecipeQuickstart slug="gemini-computer-use-py" />

Gemini's computer use ships through `google.genai` as a single built-in tool: `Tool(computer_use=ComputerUse(environment=ENVIRONMENT_BROWSER))`. Setting `ENVIRONMENT_BROWSER` unlocks a fixed vocabulary of browser function calls (`click_at`, `type_text_at`, `scroll_document`, `scroll_at`, `navigate`, `search`, `key_combination`, `drag_and_drop`, `hover_at`, `go_back`, `go_forward`, `open_web_browser`, `wait_5_seconds`).

Steel supplies the screen. A Steel session is a headful Chromium in a VM, and `sessions.computer(session_id, action=...)` runs mouse and keyboard actions with a base64 PNG attached to the response.

## The loop

`Agent.execute_task` seeds two user-role `Part`s (`BROWSER_SYSTEM_PROMPT` and the task) into `self.contents`, then calls `generate_content` in a loop:

```python
response = self.client.models.generate_content(
    model=self.model,
    contents=self.contents,
    config=self.config,
)

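# Keep the model's turn in the history so the next request includes it.
candidate = response.candidates[0]
if candidate.content:
    self.contents.append(candidate.content)

# Split the turn into free text and the function calls to execute.
reasoning = self.extract_text(candidate)
function_calls = self.extract_function_calls(candidate)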
```

Gemini doesn't keep server-side conversation state, so every turn resends the full `contents` list including every prior screenshot.
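To get a feel for what resending the history costs, here is a rough back-of-the-envelope sketch. It assumes exactly one screenshot is appended per turn and none exist before the first request; both are simplifications, and `screenshots_uploaded` is a hypothetical helper, not part of the starter.

```python
def screenshots_uploaded(turns: int) -> int:
    """Total screenshots sent across `turns` requests when history is resent each time.

    Assumption: request k (1-based) carries the k - 1 screenshots produced so far.
    """
    return sum(k - 1 for k in range(1, turns + 1))


for n in (10, 20, 30):
    # Prints 45, 190, and 435 screenshots respectively.
    print(n, screenshots_uploaded(n))
```

Under these assumptions the total grows quadratically with the number of turns, so later requests are considerably larger than early ones.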

## Coordinates live in a 0-1000 canvas

Gemini never emits pixel coordinates. Every spatial argument (`x`, `y`, `destination_x`, `destination_y`, `magnitude`) is scaled against `MAX_COORDINATE = 1000` regardless of viewport. `denormalize_x` and `denormalize_y` rescale onto Steel's viewport before each action:

```python
def denormalize_x(self, x: int) -> int:
    return int(x / MAX_COORDINATE * self.viewport_width)
```
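As a standalone illustration of the same scaling, the sketch below converts a normalized point to pixels on both axes. The name `denormalize_point`, the 1440x900 default viewport, and the clamp to the last valid pixel are assumptions made for this example; the starter's own methods may differ.

```python
MAX_COORDINATE = 1000  # size of Gemini's normalized grid, as described above


def denormalize_point(x: int, y: int, width: int = 1440, height: int = 900) -> tuple[int, int]:
    """Scale a 0-1000 point onto a width x height viewport (hypothetical helper)."""
    px = int(x / MAX_COORDINATE * width)
    py = int(y / MAX_COORDINATE * height)
    # Assumption: clamp so that x=1000 or y=1000 lands on the last pixel
    # instead of one past the edge of the viewport.
    return min(max(px, 0), width - 1), min(max(py, 0), height - 1)


print(denormalize_point(500, 500))    # (720, 450), the centre of the viewport
print(denormalize_point(1000, 1000))  # (1439, 899), bottom-right corner after clamping
```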

## Sending screenshots back

Gemini expects each function response as two `Part`s in a user-role `Content`: a `FunctionResponse` with metadata, then an `inline_data` `Blob` carrying the PNG.

```python
function_response = FunctionResponse(
    name=fc.name or "",
    response={"url": url or self.current_url},
)
parts.append(Part(function_response=function_response))

parts.append(
    Part(
        inline_data=types.Blob(
            mime_type="image/png",
            data=screenshot_base64,
        )
    )
)
```

## Stopping conditions

`execute_task` ends in one of three ways:

1. Gemini emits only text, no function calls.
2. Three consecutive iterations produce neither text nor function calls.
3. `max_iterations=50` caps total turns.
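
The three conditions above can be sketched as a small function that walks scripted turns and reports why the loop ended. This is a minimal illustration, not the starter's code: `run_loop`, its return strings, and the `(text, number_of_calls)` turn shape are all hypothetical, and it assumes any turn with function calls resets the empty-turn streak.

```python
from typing import Iterable, Tuple


def run_loop(turns: Iterable[Tuple[str, int]], max_iterations: int = 50) -> str:
    """Walk scripted (text, number_of_function_calls) turns and report the stop reason."""
    empty_streak = 0
    for iteration, (text, n_calls) in enumerate(turns):
        if iteration >= max_iterations:
            return "max_iterations reached"
        if n_calls > 0:
            # Assumption: an action breaks the run of empty turns.
            empty_streak = 0
            continue  # a real agent would execute the calls here
        if text:
            return "final text response"
        empty_streak += 1
        if empty_streak >= 3:
            return "three empty turns"
    return "script exhausted"  # only reachable in this simulation


print(run_loop([("", 1), ("Done.", 0)]))  # final text response
print(run_loop([("", 0)] * 3))            # three empty turns
print(run_loop([("", 1)] * 60))           # max_iterations reached
```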

## Run it

```bash
cd examples/gemini-computer-use-py
cp .env.example .env          # set STEEL_API_KEY and GEMINI_API_KEY
uv run main.py
```

Steel keys live at [app.steel.dev/settings/api-keys](https://app.steel.dev/settings/api-keys). Gemini keys come from [aistudio.google.com/apikey](https://aistudio.google.com/apikey).

Override the task per run:

```bash
TASK="Find the current weather in New York City" uv run main.py
```

Your output will vary, but the structure looks like this:

```text
Steel Session created successfully!
View live session at: https://app.steel.dev/sessions/ab12cd34...

Executing task: Go to Steel.dev and find the latest news
============================================================

I'll open steel.dev and look for the latest news.
navigate({"url": "https://steel.dev"})
scroll_document({"direction": "down"})
click_at({"x": 512, "y": 340})
...
Task complete - model provided final response

TASK EXECUTION COMPLETED
Duration: 78.4 seconds
Result: Steel's latest release notes mention ...
```

A run typically takes 60-180 seconds and 10-30 iterations. Requests grow as the run progresses, since each turn resends the full `self.contents` list, including every prior `Blob`. When the run ends, the `finally` block in `main()` calls `sessions.release()` to release the Steel session.

## Make it yours

- Change the task. Edit `TASK` in `.env` or pass it inline.
- Swap the model. `self.model = "gemini-3-flash-preview"` in `Agent.__init__`.
- Tune the viewport. `viewport_width` and `viewport_height` in `Agent.__init__` flow into `sessions.create(dimensions=...)`.
- Gate safety confirmations. Replace the auto-acknowledge branch in `execute_task` with a human prompt.
- Persist a login. Pass `session_context` to `sessions.create` to resume with cookies and local storage. See [auth contexts](/cookbook/auth-context) and [credentials](/cookbook/credentials).
- Raise the ceiling. `max_iterations=50` in `execute_task` bounds a single task.

## Related

[TypeScript version](/cookbook/gemini-computer-use) · [Gemini Computer Use guide](https://ai.google.dev/gemini-api/docs/computer-use) · [google-genai SDK](https://googleapis.github.io/python-genai/)

</Tab>

</Tabs>

## Related recipes

<RecipeGrid>
<RecipeCard slug="claude-computer-use-mobile" title={"Drive a mobile browser with Claude Computer Use"} description={"Claude Computer Use with Steel for autonomous task execution in mobile browser environments."} topics={['Computer use', 'Mobile']} date="2025-10-14" />
<RecipeCard slug="claude-computer-use" title={"Drive a browser with Claude Computer Use"} description={"Connect Claude to a Steel browser session for autonomous web interactions."} topics={['Computer use']} date="2025-07-16" />
<RecipeCard slug="openai-computer-use" title={"Drive a browser with OpenAI Computer Use"} description={"Connect OpenAI's Computer Use Assistant to a Steel browser session for autonomous web interactions."} topics={['Computer use']} date="2025-03-19" />
</RecipeGrid>
