Quickstart (Python)

This guide shows how to use Google's gemini-2.5-computer-use-preview model with Steel's Computer API to build AI agents that navigate the web.

Gemini's Computer Use model uses a normalized coordinate system (0-1000) and provides built-in actions for browser control, making it straightforward to integrate with Steel.

Prerequisites

Python 3.8+
A Steel API key (sign up here)
A Gemini API key (get one here)

Step 1: Setup and Helper Functions

First, set up a virtual environment and install the required packages:

Terminal

$ uv venv
$ source .venv/bin/activate
$ uv pip install steel-sdk google-genai python-dotenv

Create a .env file with your API keys:

ENV

.env

STEEL_API_KEY=your_steel_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here
TASK=Go to Steel.dev and find the latest news

Create a file with helper functions and constants:

Python

helpers.py

import os
import json
from typing import List, Optional, Tuple, Dict, Any
from datetime import datetime

from dotenv import load_dotenv
from steel import Steel
from google import genai
from google.genai import types
from google.genai.types import (
    Content,
    Part,
    FunctionCall,
    FunctionResponse,
    Candidate,
    FinishReason,
    Tool,
    GenerateContentConfig,
)

load_dotenv(override=True)

STEEL_API_KEY = os.getenv("STEEL_API_KEY") or "your-steel-api-key-here"
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") or "your-gemini-api-key-here"
TASK = os.getenv("TASK") or "Go to Steel.dev and find the latest news"

MODEL = "gemini-2.5-computer-use-preview-10-2025"
MAX_COORDINATE = 1000


def format_today() -> str:
    return datetime.now().strftime("%A, %B %d, %Y")


BROWSER_SYSTEM_PROMPT = f"""<BROWSER_ENV>
  - You control a headful Chromium browser running in a VM with internet access.
  - Chromium is already open; interact only through computer use actions (mouse, keyboard, scroll, screenshots).
  - Today's date is {format_today()}.
  </BROWSER_ENV>

  <BROWSER_CONTROL>
  - When viewing pages, zoom out or scroll so all relevant content is visible.
  - When typing into any input:
    * Clear it first with Ctrl+A, then Delete.
    * After submitting (pressing Enter or clicking a button), wait for the page to load.
  - Computer tool calls are slow; batch related actions into a single call whenever possible.
  - You may act on the user's behalf on sites where they are already authenticated.
  - Assume any required authentication/Auth Contexts are already configured before the task starts.
  - If the first screenshot is black:
    * Click near the center of the screen.
    * Take another screenshot.
  </BROWSER_CONTROL>

  <TASK_EXECUTION>
  - You receive exactly one natural-language task and no further user feedback.
  - Do not ask the user clarifying questions; instead, make reasonable assumptions and proceed.
  - For complex tasks, quickly plan a short, ordered sequence of steps before acting.
  - Prefer minimal, high-signal actions that move directly toward the goal.
  - Keep your final response concise and focused on fulfilling the task (e.g., a brief summary of findings or results).
  </TASK_EXECUTION>"""

Step 2: Create the Agent Class

Python

agent.py

import json
from typing import List, Optional, Tuple, Dict, Any

from helpers import (
    STEEL_API_KEY,
    GEMINI_API_KEY,
    MODEL,
    MAX_COORDINATE,
    BROWSER_SYSTEM_PROMPT,
)
from steel import Steel
from google import genai
from google.genai import types
from google.genai.types import (
    Content,
    Part,
    FunctionCall,
    FunctionResponse,
    Candidate,
    FinishReason,
    Tool,
    GenerateContentConfig,
)


class Agent:
    def __init__(self):
        self.client = genai.Client(api_key=GEMINI_API_KEY)
        self.steel = Steel(steel_api_key=STEEL_API_KEY)
        self.session = None
        self.contents: List[Content] = []
        self.current_url = "about:blank"
        self.viewport_width = 1280
        self.viewport_height = 768
        self.tools: List[Tool] = [
            Tool(
                computer_use=types.ComputerUse(
                    environment=types.Environment.ENVIRONMENT_BROWSER,
                )
            )
        ]
        self.config = GenerateContentConfig(tools=self.tools)

    def _denormalize_x(self, x: int) -> int:
        return int(x / MAX_COORDINATE * self.viewport_width)

    def _denormalize_y(self, y: int) -> int:
        return int(y / MAX_COORDINATE * self.viewport_height)

    def _center(self) -> Tuple[int, int]:
        return (self.viewport_width // 2, self.viewport_height // 2)

    def _normalize_key(self, key: str) -> str:
        if not isinstance(key, str) or not key:
            return key
        k = key.strip()
        upper = k.upper()
        synonyms = {
            "ENTER": "Enter",
            "RETURN": "Enter",
            "ESC": "Escape",
            "ESCAPE": "Escape",
            "TAB": "Tab",
            "BACKSPACE": "Backspace",
            "DELETE": "Delete",
            "SPACE": "Space",
            "CTRL": "Control",
            "CONTROL": "Control",
            "ALT": "Alt",
            "SHIFT": "Shift",
            "META": "Meta",
            "CMD": "Meta",
            "UP": "ArrowUp",
            "DOWN": "ArrowDown",
            "LEFT": "ArrowLeft",
            "RIGHT": "ArrowRight",
            "HOME": "Home",
            "END": "End",
            "PAGEUP": "PageUp",
            "PAGEDOWN": "PageDown",
        }
        if upper in synonyms:
            return synonyms[upper]
        if upper.startswith("F") and upper[1:].isdigit():
            return "F" + upper[1:]
        return k

    def _normalize_keys(self, keys: List[str]) -> List[str]:
        return [self._normalize_key(k) for k in keys]

    def initialize(self) -> None:
        self.session = self.steel.sessions.create(
            dimensions={"width": self.viewport_width, "height": self.viewport_height},
            block_ads=True,
            api_timeout=900000,
        )
        print("Steel Session created successfully!")
        print(f"View live session at: {self.session.session_viewer_url}")

    def cleanup(self) -> None:
        if self.session:
            print("Releasing Steel session...")
            self.steel.sessions.release(self.session.id)
            print(
                f"Session completed. View replay at {self.session.session_viewer_url}"
            )
            self.session = None

    def _take_screenshot(self) -> str:
        resp = self.steel.sessions.computer(self.session.id, action="take_screenshot")
        img = getattr(resp, "base64_image", None)
        if not img:
            raise RuntimeError("No screenshot returned from Steel")
        return img

    def _execute_computer_action(
        self, function_call: FunctionCall
    ) -> Tuple[str, Optional[str]]:
        """Execute a computer action and return (screenshot_base64, url)."""
        name = function_call.name or ""
        args: Dict[str, Any] = function_call.args or {}

        if name == "open_web_browser":
            screenshot = self._take_screenshot()
            return screenshot, self.current_url

        elif name == "click_at":
            x = self._denormalize_x(args.get("x", 0))
            y = self._denormalize_y(args.get("y", 0))
            resp = self.steel.sessions.computer(
                self.session.id,
                action="click_mouse",
                button="left",
                coordinates=[x, y],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "hover_at":
            x = self._denormalize_x(args.get("x", 0))
            y = self._denormalize_y(args.get("y", 0))
            resp = self.steel.sessions.computer(
                self.session.id,
                action="move_mouse",
                coordinates=[x, y],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "type_text_at":
            x = self._denormalize_x(args.get("x", 0))
            y = self._denormalize_y(args.get("y", 0))
            text = args.get("text", "")
            press_enter = args.get("press_enter", True)
            clear_before_typing = args.get("clear_before_typing", True)

            self.steel.sessions.computer(
                self.session.id,
                action="click_mouse",
                button="left",
                coordinates=[x, y],
            )

            if clear_before_typing:
                self.steel.sessions.computer(
                    self.session.id,
                    action="press_key",
                    keys=["Control", "a"],
                )
                self.steel.sessions.computer(
                    self.session.id,
                    action="press_key",
                    keys=["Backspace"],
                )

            self.steel.sessions.computer(
                self.session.id,
                action="type_text",
                text=text,
            )

            if press_enter:
                self.steel.sessions.computer(
                    self.session.id,
                    action="press_key",
                    keys=["Enter"],
                )

            self.steel.sessions.computer(
                self.session.id,
                action="wait",
                duration=1,
            )

            screenshot = self._take_screenshot()
            return screenshot, self.current_url

        elif name == "scroll_document":
            direction = args.get("direction", "down")

            if direction == "down":
                keys = ["PageDown"]
            elif direction == "up":
                keys = ["PageUp"]
            elif direction in ("left", "right"):
                cx, cy = self._center()
                delta = -400 if direction == "left" else 400
                resp = self.steel.sessions.computer(
                    self.session.id,
                    action="scroll",
                    coordinates=[cx, cy],
                    delta_x=delta,
                    delta_y=0,
                    screenshot=True,
                )
                img = getattr(resp, "base64_image", None)
                return img or self._take_screenshot(), self.current_url
            else:
                keys = ["PageDown"]

            resp = self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=keys,
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "scroll_at":
            x = self._denormalize_x(args.get("x", 0))
            y = self._denormalize_y(args.get("y", 0))
            direction = args.get("direction", "down")
            magnitude = self._denormalize_y(args.get("magnitude", 800))

            delta_x, delta_y = 0, 0
            if direction == "down":
                delta_y = magnitude
            elif direction == "up":
                delta_y = -magnitude
            elif direction == "right":
                delta_x = magnitude
            elif direction == "left":
                delta_x = -magnitude

            resp = self.steel.sessions.computer(
                self.session.id,
                action="scroll",
                coordinates=[x, y],
                delta_x=delta_x,
                delta_y=delta_y,
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "wait_5_seconds":
            resp = self.steel.sessions.computer(
                self.session.id,
                action="wait",
                duration=5,
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "go_back":
            resp = self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=["Alt", "ArrowLeft"],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "go_forward":
            resp = self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=["Alt", "ArrowRight"],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "navigate":
            url = args.get("url", "")
            if not url.startswith(("http://", "https://")):
                url = "https://" + url

            self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=["Control", "l"],
            )
            self.steel.sessions.computer(
                self.session.id,
                action="type_text",
                text=url,
            )
            self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=["Enter"],
            )
            self.steel.sessions.computer(
                self.session.id,
                action="wait",
                duration=2,
            )

            self.current_url = url
            screenshot = self._take_screenshot()
            return screenshot, self.current_url

        elif name == "key_combination":
            keys_str = args.get("keys", "")
            keys = [k.strip() for k in keys_str.split("+")]
            normalized_keys = self._normalize_keys(keys)

            resp = self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=normalized_keys,
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "drag_and_drop":
            start_x = self._denormalize_x(args.get("x", 0))
            start_y = self._denormalize_y(args.get("y", 0))
            end_x = self._denormalize_x(args.get("destination_x", 0))
            end_y = self._denormalize_y(args.get("destination_y", 0))

            resp = self.steel.sessions.computer(
                self.session.id,
                action="drag_mouse",
                path=[[start_x, start_y], [end_x, end_y]],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        else:
            print(f"Unknown action: {name}, taking screenshot")
            screenshot = self._take_screenshot()
            return screenshot, self.current_url

    def _extract_function_calls(self, candidate: Candidate) -> List[FunctionCall]:
        function_calls: List[FunctionCall] = []
        if not candidate.content or not candidate.content.parts:
            return function_calls

        for part in candidate.content.parts:
            if part.function_call:
                function_calls.append(part.function_call)

        return function_calls

    def _extract_text(self, candidate: Candidate) -> str:
        if not candidate.content or not candidate.content.parts:
            return ""
        texts: List[str] = []
        for part in candidate.content.parts:
            if part.text:
                texts.append(part.text)
        return " ".join(texts).strip()

    def _build_function_response_parts(
        self,
        function_calls: List[FunctionCall],
        results: List[Tuple[str, Optional[str]]],
    ) -> List[Part]:
        parts: List[Part] = []

        for i, fc in enumerate(function_calls):
            screenshot_base64, url = results[i]

            function_response = FunctionResponse(
                name=fc.name or "",
                response={"url": url or self.current_url},
            )
            parts.append(Part(function_response=function_response))
            parts.append(
                Part(
                    inline_data=types.Blob(
                        mime_type="image/png",
                        data=screenshot_base64,
                    )
                )
            )

        return parts

    def execute_task(
        self,
        task: str,
        print_steps: bool = True,
        max_iterations: int = 50,
    ) -> str:
        self.contents = [
            Content(
                role="user",
                parts=[Part(text=BROWSER_SYSTEM_PROMPT), Part(text=task)],
            )
        ]

        iterations = 0
        consecutive_no_actions = 0

        print(f"🎯 Executing task: {task}")
        print("=" * 60)

        while iterations < max_iterations:
            iterations += 1

            try:
                response = self.client.models.generate_content(
                    model=MODEL,
                    contents=self.contents,
                    config=self.config,
                )

                if not response.candidates:
                    print("❌ No candidates in response")
                    break

                candidate = response.candidates[0]

                if candidate.content:
                    self.contents.append(candidate.content)

                reasoning = self._extract_text(candidate)
                function_calls = self._extract_function_calls(candidate)

                if (
                    not function_calls
                    and not reasoning
                    and candidate.finish_reason == FinishReason.MALFORMED_FUNCTION_CALL
                ):
                    print("⚠️ Malformed function call, retrying...")
                    continue

                if not function_calls:
                    if reasoning:
                        if print_steps:
                            print(f"\n💬 {reasoning}")
                        print("✅ Task complete - model provided final response")
                        break

                    consecutive_no_actions += 1
                    if consecutive_no_actions >= 3:
                        print("⚠️ No actions for 3 consecutive iterations - stopping")
                        break
                    continue

                consecutive_no_actions = 0

                if print_steps and reasoning:
                    print(f"\n💭 {reasoning}")

                results: List[Tuple[str, Optional[str]]] = []

                for fc in function_calls:
                    action_name = fc.name or "unknown"
                    action_args = fc.args or {}

                    if print_steps:
                        print(f"🔧 {action_name}({json.dumps(action_args)})")

                    if action_args:
                        safety_decision = action_args.get("safety_decision")
                        if (
                            isinstance(safety_decision, dict)
                            and safety_decision.get("decision") == "require_confirmation"
                        ):
                            print(
                                f"⚠️ Safety confirmation required: {safety_decision.get('explanation')}"
                            )
                            print("✅ Auto-acknowledging safety check")

                    result = self._execute_computer_action(fc)
                    results.append(result)

                function_response_parts = self._build_function_response_parts(
                    function_calls, results
                )
                self.contents.append(
                    Content(role="user", parts=function_response_parts)
                )

            except Exception as e:
                print(f"❌ Error during task execution: {e}")
                raise

        if iterations >= max_iterations:
            print(f"⚠️ Task execution stopped after {max_iterations} iterations")

        for content in reversed(self.contents):
            if content.role == "model" and content.parts:
                text_parts = [p.text for p in content.parts if p.text]
                if text_parts:
                    return " ".join(text_parts).strip()

        return "Task execution completed (no final message)"
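
The key_combination branch above splits a string such as "ctrl+a" on "+" and normalizes each piece before sending it to Steel. The standalone sketch below illustrates that step in isolation. It uses a trimmed copy of the synonyms table, and parse_key_combination is a hypothetical helper name rather than part of the agent or of any SDK.

```python
# Trimmed copy of the synonyms table from Agent._normalize_key.
KEY_SYNONYMS = {
    "CTRL": "Control",
    "CONTROL": "Control",
    "CMD": "Meta",
    "ENTER": "Enter",
    "RETURN": "Enter",
    "ESC": "Escape",
    "SHIFT": "Shift",
    "ALT": "Alt",
}


def parse_key_combination(combo):
    """Split a combination such as "ctrl+a" into normalized key names."""
    keys = []
    for raw in combo.split("+"):
        key = raw.strip()
        upper = key.upper()
        if upper in KEY_SYNONYMS:
            keys.append(KEY_SYNONYMS[upper])
        elif upper.startswith("F") and upper[1:].isdigit():
            keys.append("F" + upper[1:])  # function keys: f5 -> F5
        else:
            keys.append(key)  # ordinary keys pass through unchanged
    return keys


print(parse_key_combination("ctrl+a"))      # ['Control', 'a']
print(parse_key_combination("Shift + f5"))  # ['Shift', 'F5']
```

Matching on the upper-cased name while returning the original spelling for unknown keys means single letters such as "a" keep their case.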

Step 3: Create the Main Script

Python

main.py

import sys
import time

from helpers import STEEL_API_KEY, GEMINI_API_KEY, TASK
from agent import Agent


def main():
    print("🚀 Steel + Gemini Computer Use Assistant")
    print("=" * 60)

    if STEEL_API_KEY == "your-steel-api-key-here":
        print(
            "⚠️ WARNING: Please replace 'your-steel-api-key-here' with your actual Steel API key"
        )
        print("   Get your API key at: https://app.steel.dev/settings/api-keys")
        sys.exit(1)

    if GEMINI_API_KEY == "your-gemini-api-key-here":
        print(
            "⚠️ WARNING: Please replace 'your-gemini-api-key-here' with your actual Gemini API key"
        )
        print("   Get your API key at: https://aistudio.google.com/apikey")
        sys.exit(1)

    print("\nStarting Steel session...")
    agent = Agent()

    try:
        agent.initialize()
        print("✅ Steel session started!")

        start_time = time.time()
        result = agent.execute_task(TASK, True, 50)
        duration = f"{(time.time() - start_time):.1f}"

        print("\n" + "=" * 60)
        print("🎉 TASK EXECUTION COMPLETED")
        print("=" * 60)
        print(f"⏱️  Duration: {duration} seconds")
        print(f"🎯 Task: {TASK}")
        print(f"📋 Result:\n{result}")
        print("=" * 60)

    except Exception as e:
        print(f"❌ Failed to run: {e}")
        raise

    finally:
        agent.cleanup()


if __name__ == "__main__":
    main()

Running Your Agent

Run the script to start a Steel browser session and let the agent work through the task:

Terminal

python main.py

You will see the session URL printed in the console. You can view the live browser session by opening this URL in your web browser.

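The agent runs the task defined in the TASK environment variable and falls back to the default task when it is unset. Because helpers.py calls load_dotenv(override=True), a TASK entry in .env takes precedence over an exported variable, so remove or edit that entry before overriding the task from the shell: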

Terminal

export TASK="Search for the latest news on artificial intelligence"
python main.py
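
The precedence between an exported TASK and a TASK line in .env can be modeled in a few lines. This sketch assumes python-dotenv's documented behavior for load_dotenv(override=True), which helpers.py uses: values from .env replace variables already present in the environment. It is a simplified pure-Python model, not python-dotenv's implementation, and resolve_task with its arguments is a hypothetical name.

```python
DEFAULT_TASK = "Go to Steel.dev and find the latest news"


def resolve_task(shell_env, dotenv_values, override=True):
    """Model which TASK value the agent ends up with."""
    if override:
        merged = {**shell_env, **dotenv_values}  # .env wins
    else:
        merged = {**dotenv_values, **shell_env}  # shell wins
    # Mirrors `os.getenv("TASK") or "<default>"`: empty strings fall back too.
    return merged.get("TASK") or DEFAULT_TASK


shell = {"TASK": "Search for the latest news on artificial intelligence"}
dotenv = {"TASK": "Go to Steel.dev and find the latest news"}

print(resolve_task(shell, dotenv))  # the .env task
print(resolve_task(shell, {}))      # the exported task
print(resolve_task({}, {}))         # the default task
```

Under this model, an exported TASK only takes effect when .env has no TASK line.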

Understanding Gemini's Coordinate System

Gemini's Computer Use model uses a normalized coordinate system where both X and Y coordinates range from 0 to 1000. The agent automatically converts these to actual pixel coordinates based on the viewport size (1280x768 by default).
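
As a minimal sketch of that conversion, the standalone function below mirrors the agent's _denormalize_x and _denormalize_y helpers. The clamp flag is an addition for illustration and is not part of the agent: it keeps a coordinate of 1000 from landing one pixel past the right or bottom edge of the viewport.

```python
MAX_COORDINATE = 1000  # Gemini's normalized range is 0-1000 on both axes


def denormalize(x, y, width=1280, height=768, clamp=False):
    """Convert normalized (0-1000) coordinates to viewport pixels."""
    px = int(x / MAX_COORDINATE * width)
    py = int(y / MAX_COORDINATE * height)
    if clamp:
        # Hypothetical safeguard: keep the point inside the viewport.
        px = max(0, min(px, width - 1))
        py = max(0, min(py, height - 1))
    return px, py


print(denormalize(500, 500))                # (640, 384), the viewport center
print(denormalize(1000, 1000, clamp=True))  # (1279, 767)
```

If you change viewport_width or viewport_height in the Agent constructor, the same formula applies with the new dimensions.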

Next Steps

Explore the Steel API documentation for more advanced features
Check out the Gemini Computer Use documentation for more information about the model
Add additional features like session recording or multi-session management