Quickstart (Python)

How to use Gemini Computer Use with Steel

This guide walks you through using Google's gemini-2.5-computer-use-preview model with Steel's Computer API to build AI agents that navigate the web.

Gemini's Computer Use model uses a normalized coordinate system (0-1000) and provides built-in actions for browser control, making it straightforward to integrate with Steel.

Prerequisites

Step 1: Setup and Helper Functions

First, set up a virtual environment and install the required packages:

Terminal
$ uv venv
$ source .venv/bin/activate
$ uv add steel-sdk google-genai python-dotenv

Create a .env file with your API keys:

ENV
.env
STEEL_API_KEY=your_steel_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here
TASK=Go to Steel.dev and find the latest news

Create a file with helper functions and constants:

Python
helpers.py
import os
import json
from typing import List, Optional, Tuple, Dict, Any
from datetime import datetime

from dotenv import load_dotenv
from steel import Steel
from google import genai
from google.genai import types
from google.genai.types import (
    Content,
    Part,
    FunctionCall,
    FunctionResponse,
    Candidate,
    FinishReason,
    Tool,
    GenerateContentConfig,
)

load_dotenv(override=True)

STEEL_API_KEY = os.getenv("STEEL_API_KEY") or "your-steel-api-key-here"
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") or "your-gemini-api-key-here"
TASK = os.getenv("TASK") or "Go to Steel.dev and find the latest news"

MODEL = "gemini-2.5-computer-use-preview-10-2025"
MAX_COORDINATE = 1000


def format_today() -> str:
    return datetime.now().strftime("%A, %B %d, %Y")


BROWSER_SYSTEM_PROMPT = f"""<BROWSER_ENV>
- You control a headful Chromium browser running in a VM with internet access.
- Chromium is already open; interact only through computer use actions (mouse, keyboard, scroll, screenshots).
- Today's date is {format_today()}.
</BROWSER_ENV>

<BROWSER_CONTROL>
- When viewing pages, zoom out or scroll so all relevant content is visible.
- When typing into any input:
  * Clear it first with Ctrl+A, then Delete.
  * After submitting (pressing Enter or clicking a button), wait for the page to load.
- Computer tool calls are slow; batch related actions into a single call whenever possible.
- You may act on the user's behalf on sites where they are already authenticated.
- Assume any required authentication/Auth Contexts are already configured before the task starts.
- If the first screenshot is black:
  * Click near the center of the screen.
  * Take another screenshot.
</BROWSER_CONTROL>

<TASK_EXECUTION>
- You receive exactly one natural-language task and no further user feedback.
- Do not ask the user clarifying questions; instead, make reasonable assumptions and proceed.
- For complex tasks, quickly plan a short, ordered sequence of steps before acting.
- Prefer minimal, high-signal actions that move directly toward the goal.
- Keep your final response concise and focused on fulfilling the task (e.g., a brief summary of findings or results).
</TASK_EXECUTION>"""
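
Because BROWSER_SYSTEM_PROMPT is an f-string, format_today() is evaluated once, when helpers.py is imported, and the resulting date is baked into the prompt. The sketch below shows what the strftime pattern produces. The format_date helper and the fixed date are illustrative assumptions (they are not part of helpers.py) and are used only to make the output reproducible.

```python
from datetime import datetime


def format_date(moment: datetime) -> str:
    """Apply the same strftime pattern as format_today() to a given date."""
    return moment.strftime("%A, %B %d, %Y")


# Hypothetical fixed date, chosen only so the example is reproducible.
example = datetime(2025, 10, 7)
print(f"- Today's date is {format_date(example)}.")
# prints: - Today's date is Tuesday, October 07, 2025.
```

Note that %d zero-pads the day, so single-digit days appear as "07" in the prompt.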

Step 2: Create the Agent Class

Python
agent.py
import json
from typing import List, Optional, Tuple, Dict, Any

from helpers import (
    STEEL_API_KEY,
    GEMINI_API_KEY,
    MODEL,
    MAX_COORDINATE,
    BROWSER_SYSTEM_PROMPT,
)
from steel import Steel
from google import genai
from google.genai import types
from google.genai.types import (
    Content,
    Part,
    FunctionCall,
    FunctionResponse,
    Candidate,
    FinishReason,
    Tool,
    GenerateContentConfig,
)


class Agent:
    def __init__(self):
        self.client = genai.Client(api_key=GEMINI_API_KEY)
        self.steel = Steel(steel_api_key=STEEL_API_KEY)
        self.session = None
        self.contents: List[Content] = []
        self.current_url = "about:blank"
        self.viewport_width = 1280
        self.viewport_height = 768
        self.tools: List[Tool] = [
            Tool(
                computer_use=types.ComputerUse(
                    environment=types.Environment.ENVIRONMENT_BROWSER,
                )
            )
        ]
        self.config = GenerateContentConfig(tools=self.tools)

    # Gemini returns coordinates on a 0-1000 grid; scale them to viewport pixels.
    def _denormalize_x(self, x: int) -> int:
        return int(x / MAX_COORDINATE * self.viewport_width)

    def _denormalize_y(self, y: int) -> int:
        return int(y / MAX_COORDINATE * self.viewport_height)

    def _center(self) -> Tuple[int, int]:
        return (self.viewport_width // 2, self.viewport_height // 2)

    def _normalize_key(self, key: str) -> str:
        """Map common key aliases (e.g. CTRL, ESC, UP) to canonical key names."""
        if not isinstance(key, str) or not key:
            return key
        k = key.strip()
        upper = k.upper()
        synonyms = {
            "ENTER": "Enter",
            "RETURN": "Enter",
            "ESC": "Escape",
            "ESCAPE": "Escape",
            "TAB": "Tab",
            "BACKSPACE": "Backspace",
            "DELETE": "Delete",
            "SPACE": "Space",
            "CTRL": "Control",
            "CONTROL": "Control",
            "ALT": "Alt",
            "SHIFT": "Shift",
            "META": "Meta",
            "CMD": "Meta",
            "UP": "ArrowUp",
            "DOWN": "ArrowDown",
            "LEFT": "ArrowLeft",
            "RIGHT": "ArrowRight",
            "HOME": "Home",
            "END": "End",
            "PAGEUP": "PageUp",
            "PAGEDOWN": "PageDown",
        }
        if upper in synonyms:
            return synonyms[upper]
        # Function keys: f5 -> F5
        if upper.startswith("F") and upper[1:].isdigit():
            return "F" + upper[1:]
        return k

    def _normalize_keys(self, keys: List[str]) -> List[str]:
        return [self._normalize_key(k) for k in keys]

    def initialize(self) -> None:
        self.session = self.steel.sessions.create(
            dimensions={"width": self.viewport_width, "height": self.viewport_height},
            block_ads=True,
            api_timeout=900000,
        )
        print("Steel Session created successfully!")
        print(f"View live session at: {self.session.session_viewer_url}")

    def cleanup(self) -> None:
        if self.session:
            print("Releasing Steel session...")
            self.steel.sessions.release(self.session.id)
            print(
                f"Session completed. View replay at {self.session.session_viewer_url}"
            )
            self.session = None

    def _take_screenshot(self) -> str:
        resp = self.steel.sessions.computer(self.session.id, action="take_screenshot")
        img = getattr(resp, "base64_image", None)
        if not img:
            raise RuntimeError("No screenshot returned from Steel")
        return img

    def _execute_computer_action(
        self, function_call: FunctionCall
    ) -> Tuple[str, Optional[str]]:
        """Execute a computer action and return (screenshot_base64, url)."""
        name = function_call.name or ""
        args: Dict[str, Any] = function_call.args or {}

        if name == "open_web_browser":
            # The browser is already open in the Steel session; just report its state.
            screenshot = self._take_screenshot()
            return screenshot, self.current_url

        elif name == "click_at":
            x = self._denormalize_x(args.get("x", 0))
            y = self._denormalize_y(args.get("y", 0))
            resp = self.steel.sessions.computer(
                self.session.id,
                action="click_mouse",
                button="left",
                coordinates=[x, y],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "hover_at":
            x = self._denormalize_x(args.get("x", 0))
            y = self._denormalize_y(args.get("y", 0))
            resp = self.steel.sessions.computer(
                self.session.id,
                action="move_mouse",
                coordinates=[x, y],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "type_text_at":
            x = self._denormalize_x(args.get("x", 0))
            y = self._denormalize_y(args.get("y", 0))
            text = args.get("text", "")
            press_enter = args.get("press_enter", True)
            clear_before_typing = args.get("clear_before_typing", True)

            # Click the target to focus it before typing.
            self.steel.sessions.computer(
                self.session.id,
                action="click_mouse",
                button="left",
                coordinates=[x, y],
            )

            if clear_before_typing:
                # Select all existing text, then delete it.
                self.steel.sessions.computer(
                    self.session.id,
                    action="press_key",
                    keys=["Control", "a"],
                )
                self.steel.sessions.computer(
                    self.session.id,
                    action="press_key",
                    keys=["Backspace"],
                )

            self.steel.sessions.computer(
                self.session.id,
                action="type_text",
                text=text,
            )

            if press_enter:
                self.steel.sessions.computer(
                    self.session.id,
                    action="press_key",
                    keys=["Enter"],
                )

            # Give the page a moment to react before capturing the result.
            self.steel.sessions.computer(
                self.session.id,
                action="wait",
                duration=1,
            )

            screenshot = self._take_screenshot()
            return screenshot, self.current_url

        elif name == "scroll_document":
            direction = args.get("direction", "down")

            # Vertical scrolling uses PageUp/PageDown; horizontal uses a wheel scroll.
            if direction == "down":
                keys = ["PageDown"]
            elif direction == "up":
                keys = ["PageUp"]
            elif direction in ("left", "right"):
                cx, cy = self._center()
                delta = -400 if direction == "left" else 400
                resp = self.steel.sessions.computer(
                    self.session.id,
                    action="scroll",
                    coordinates=[cx, cy],
                    delta_x=delta,
                    delta_y=0,
                    screenshot=True,
                )
                img = getattr(resp, "base64_image", None)
                return img or self._take_screenshot(), self.current_url
            else:
                keys = ["PageDown"]

            resp = self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=keys,
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "scroll_at":
            x = self._denormalize_x(args.get("x", 0))
            y = self._denormalize_y(args.get("y", 0))
            direction = args.get("direction", "down")
            # The magnitude is also normalized; it is scaled by the viewport height.
            magnitude = self._denormalize_y(args.get("magnitude", 800))

            delta_x, delta_y = 0, 0
            if direction == "down":
                delta_y = magnitude
            elif direction == "up":
                delta_y = -magnitude
            elif direction == "right":
                delta_x = magnitude
            elif direction == "left":
                delta_x = -magnitude

            resp = self.steel.sessions.computer(
                self.session.id,
                action="scroll",
                coordinates=[x, y],
                delta_x=delta_x,
                delta_y=delta_y,
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "wait_5_seconds":
            resp = self.steel.sessions.computer(
                self.session.id,
                action="wait",
                duration=5,
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "go_back":
            resp = self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=["Alt", "ArrowLeft"],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "go_forward":
            resp = self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=["Alt", "ArrowRight"],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "navigate":
            url = args.get("url", "")
            if not url.startswith(("http://", "https://")):
                url = "https://" + url

            # Focus the address bar, type the URL, and submit it.
            self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=["Control", "l"],
            )
            self.steel.sessions.computer(
                self.session.id,
                action="type_text",
                text=url,
            )
            self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=["Enter"],
            )
            self.steel.sessions.computer(
                self.session.id,
                action="wait",
                duration=2,
            )

            self.current_url = url
            screenshot = self._take_screenshot()
            return screenshot, self.current_url

        elif name == "key_combination":
            # Keys arrive as a "+"-separated string, e.g. "Control+A".
            keys_str = args.get("keys", "")
            keys = [k.strip() for k in keys_str.split("+")]
            normalized_keys = self._normalize_keys(keys)

            resp = self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=normalized_keys,
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "drag_and_drop":
            start_x = self._denormalize_x(args.get("x", 0))
            start_y = self._denormalize_y(args.get("y", 0))
            end_x = self._denormalize_x(args.get("destination_x", 0))
            end_y = self._denormalize_y(args.get("destination_y", 0))

            resp = self.steel.sessions.computer(
                self.session.id,
                action="drag_mouse",
                path=[[start_x, start_y], [end_x, end_y]],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        else:
            print(f"Unknown action: {name}, taking screenshot")
            screenshot = self._take_screenshot()
            return screenshot, self.current_url

    def _extract_function_calls(self, candidate: Candidate) -> List[FunctionCall]:
        function_calls: List[FunctionCall] = []
        if not candidate.content or not candidate.content.parts:
            return function_calls

        for part in candidate.content.parts:
            if part.function_call:
                function_calls.append(part.function_call)

        return function_calls

    def _extract_text(self, candidate: Candidate) -> str:
        if not candidate.content or not candidate.content.parts:
            return ""
        texts: List[str] = []
        for part in candidate.content.parts:
            if part.text:
                texts.append(part.text)
        return " ".join(texts).strip()

    def _build_function_response_parts(
        self,
        function_calls: List[FunctionCall],
        results: List[Tuple[str, Optional[str]]],
    ) -> List[Part]:
        parts: List[Part] = []

        # Each function call gets a response part followed by its screenshot.
        for i, fc in enumerate(function_calls):
            screenshot_base64, url = results[i]

            function_response = FunctionResponse(
                name=fc.name or "",
                response={"url": url or self.current_url},
            )
            parts.append(Part(function_response=function_response))
            parts.append(
                Part(
                    inline_data=types.Blob(
                        mime_type="image/png",
                        data=screenshot_base64,
                    )
                )
            )

        return parts

    def execute_task(
        self,
        task: str,
        print_steps: bool = True,
        max_iterations: int = 50,
    ) -> str:
        self.contents = [
            Content(
                role="user",
                parts=[Part(text=BROWSER_SYSTEM_PROMPT), Part(text=task)],
            )
        ]

        iterations = 0
        consecutive_no_actions = 0

        print(f"🎯 Executing task: {task}")
        print("=" * 60)

        while iterations < max_iterations:
            iterations += 1

            try:
                response = self.client.models.generate_content(
                    model=MODEL,
                    contents=self.contents,
                    config=self.config,
                )

                if not response.candidates:
                    print("❌ No candidates in response")
                    break

                candidate = response.candidates[0]

                if candidate.content:
                    self.contents.append(candidate.content)

                reasoning = self._extract_text(candidate)
                function_calls = self._extract_function_calls(candidate)

                if (
                    not function_calls
                    and not reasoning
                    and candidate.finish_reason == FinishReason.MALFORMED_FUNCTION_CALL
                ):
                    print("⚠️ Malformed function call, retrying...")
                    continue

                if not function_calls:
                    # Text without actions is treated as the model's final answer.
                    if reasoning:
                        if print_steps:
                            print(f"\n💬 {reasoning}")
                        print("✅ Task complete - model provided final response")
                        break

                    consecutive_no_actions += 1
                    if consecutive_no_actions >= 3:
                        print("⚠️ No actions for 3 consecutive iterations - stopping")
                        break
                    continue

                consecutive_no_actions = 0

                if print_steps and reasoning:
                    print(f"\n💭 {reasoning}")

                results: List[Tuple[str, Optional[str]]] = []

                for fc in function_calls:
                    action_name = fc.name or "unknown"
                    action_args = fc.args or {}

                    if print_steps:
                        print(f"🔧 {action_name}({json.dumps(action_args)})")

                    if action_args:
                        safety_decision = action_args.get("safety_decision")
                        if (
                            isinstance(safety_decision, dict)
                            and safety_decision.get("decision") == "require_confirmation"
                        ):
                            print(
                                f"⚠️ Safety confirmation required: {safety_decision.get('explanation')}"
                            )
                            print("✅ Auto-acknowledging safety check")

                    result = self._execute_computer_action(fc)
                    results.append(result)

                # Send the action results and screenshots back as the next user turn.
                function_response_parts = self._build_function_response_parts(
                    function_calls, results
                )
                self.contents.append(
                    Content(role="user", parts=function_response_parts)
                )

            except Exception as e:
                print(f"❌ Error during task execution: {e}")
                raise

        if iterations >= max_iterations:
            print(f"⚠️ Task execution stopped after {max_iterations} iterations")

        # Return the most recent model message that contains text.
        for content in reversed(self.contents):
            if content.role == "model" and content.parts:
                text_parts = [p.text for p in content.parts if p.text]
                if text_parts:
                    return " ".join(text_parts).strip()

        return "Task execution completed (no final message)"
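
To make the key_combination branch concrete, the following standalone sketch reproduces the parsing and normalization steps outside the Agent class. It copies only a subset of the synonyms table, and the function names normalize_key and parse_key_combination are illustrative rather than part of agent.py.

```python
from typing import List

# Subset of the synonyms table from Agent._normalize_key, copied for illustration.
SYNONYMS = {
    "CTRL": "Control",
    "CONTROL": "Control",
    "ENTER": "Enter",
    "RETURN": "Enter",
    "ESC": "Escape",
    "CMD": "Meta",
    "UP": "ArrowUp",
    "PAGEDOWN": "PageDown",
}


def normalize_key(key: str) -> str:
    """Map a key alias to its canonical name; leave unknown keys unchanged."""
    k = key.strip()
    upper = k.upper()
    if upper in SYNONYMS:
        return SYNONYMS[upper]
    # Function keys such as "f5" become "F5"; a bare "f" is left alone.
    if upper.startswith("F") and upper[1:].isdigit():
        return "F" + upper[1:]
    return k


def parse_key_combination(keys_str: str) -> List[str]:
    """Split a "+"-separated combination and normalize each key."""
    return [normalize_key(part) for part in keys_str.split("+")]


print(parse_key_combination("ctrl+a"))       # ['Control', 'a']
print(parse_key_combination("Cmd + Enter"))  # ['Meta', 'Enter']
print(parse_key_combination("f5"))           # ['F5']
```

Keys that are not in the table, such as single letters, pass through with only surrounding whitespace removed.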

Step 3: Create the Main Script

Python
main.py
import sys
import time

from helpers import STEEL_API_KEY, GEMINI_API_KEY, TASK
from agent import Agent


def main():
    print("🚀 Steel + Gemini Computer Use Assistant")
    print("=" * 60)

    if STEEL_API_KEY == "your-steel-api-key-here":
        print(
            "⚠️ WARNING: Please replace 'your-steel-api-key-here' with your actual Steel API key"
        )
        print(" Get your API key at: https://app.steel.dev/settings/api-keys")
        sys.exit(1)

    if GEMINI_API_KEY == "your-gemini-api-key-here":
        print(
            "⚠️ WARNING: Please replace 'your-gemini-api-key-here' with your actual Gemini API key"
        )
        print(" Get your API key at: https://aistudio.google.com/apikey")
        sys.exit(1)

    print("\nStarting Steel session...")
    agent = Agent()

    try:
        agent.initialize()
        print("✅ Steel session started!")

        start_time = time.time()
        result = agent.execute_task(TASK, True, 50)
        duration = f"{(time.time() - start_time):.1f}"

        print("\n" + "=" * 60)
        print("🎉 TASK EXECUTION COMPLETED")
        print("=" * 60)
        print(f"⏱️ Duration: {duration} seconds")
        print(f"🎯 Task: {TASK}")
        print(f"📋 Result:\n{result}")
        print("=" * 60)

    except Exception as e:
        print(f"❌ Failed to run: {e}")
        raise

    finally:
        agent.cleanup()


if __name__ == "__main__":
    main()
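
The try/finally structure in main.py is what guarantees the Steel session is released even when the task fails. The sketch below illustrates the pattern with a stub in place of the real Agent. StubAgent and run are hypothetical names, and the stub makes no network calls; it only records which methods were invoked.

```python
from typing import List


class StubAgent:
    """Stand-in for Agent that records calls instead of contacting Steel or Gemini."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def initialize(self) -> None:
        self.events.append("initialize")

    def execute_task(self, task: str) -> str:
        self.events.append("execute_task")
        raise RuntimeError("simulated model failure")

    def cleanup(self) -> None:
        self.events.append("cleanup")


def run(agent: StubAgent, task: str) -> None:
    try:
        agent.initialize()
        agent.execute_task(task)
    finally:
        # Runs on success and on failure, so the session is always released.
        agent.cleanup()


agent = StubAgent()
try:
    run(agent, "example task")
except RuntimeError as exc:
    print(f"Run failed: {exc}")

print(agent.events)  # ['initialize', 'execute_task', 'cleanup']
```

In the real Agent, cleanup() also checks self.session first, so it is safe to call when initialize() itself raised before a session was created.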

Running Your Agent

Run the script to start the agent's browser session:

Terminal
python main.py

The session URL is printed in the console. Open it in your web browser to watch the live browser session.

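The agent executes the task defined in the TASK environment variable, falling back to the default task if it is unset. Note that helpers.py calls load_dotenv(override=True), so a TASK entry in .env takes precedence over a value exported in the shell. To change the task from the shell, remove or edit that entry and then set the environment variable: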

Terminal
export TASK="Search for the latest news on artificial intelligence"
python main.py
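
The precedence between the shell, the .env file, and the built-in default can be modeled as follows. This is a sketch only: plain dictionaries stand in for the shell environment and the .env file, resolve_task is a hypothetical helper, and the logic assumes python-dotenv's documented override semantics rather than calling the library.

```python
from typing import Dict

DEFAULT_TASK = "Go to Steel.dev and find the latest news"


def resolve_task(
    shell_env: Dict[str, str], dotenv_values: Dict[str, str], override: bool
) -> str:
    """Model how TASK is resolved; dicts stand in for the shell and .env."""
    env = dict(shell_env)
    for name, value in dotenv_values.items():
        # With override=True, .env values replace variables that already exist.
        if override or name not in env:
            env[name] = value
    # Mirrors `os.getenv("TASK") or "..."`: an empty string also falls back.
    return env.get("TASK") or DEFAULT_TASK


shell = {"TASK": "Search for the latest news on artificial intelligence"}
dotenv = {"TASK": "Go to Steel.dev and find the latest news"}

print(resolve_task(shell, dotenv, override=True))     # the .env value wins
print(resolve_task(shell, dotenv, override=False))    # the exported value wins
print(resolve_task(shell, {}, override=True))         # no .env entry: exported value
print(resolve_task({}, {"TASK": ""}, override=True))  # empty value: default task
```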

Understanding Gemini's Coordinate System

Gemini's Computer Use model uses a normalized coordinate system where both X and Y coordinates range from 0 to 1000. The agent converts these to pixel coordinates in _denormalize_x and _denormalize_y by scaling them against the viewport size (1280x768 by default).
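
As a worked example, the conversion can be reproduced in isolation. The sketch below combines the two helper methods into a single hypothetical denormalize function with the default viewport as keyword defaults; the arithmetic is the same as in agent.py.

```python
from typing import Tuple

MAX_COORDINATE = 1000


def denormalize(
    x: int, y: int, width: int = 1280, height: int = 768
) -> Tuple[int, int]:
    """Scale normalized (0-1000) coordinates to pixel coordinates."""
    return int(x / MAX_COORDINATE * width), int(y / MAX_COORDINATE * height)


print(denormalize(0, 0))        # (0, 0)
print(denormalize(500, 500))    # (640, 384)
print(denormalize(250, 750))    # (320, 576)
print(denormalize(1000, 1000))  # (1280, 768)
```

A normalized value of 1000 maps to the viewport's far edge. Whether such edge coordinates need clamping depends on how the Computer API treats them, which this guide does not specify.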

Next Steps