Quickstart (Python)
How to use Gemini Computer Use with Steel
This guide shows how to use Google's gemini-2.5-computer-use-preview model with Steel's Computer API to build AI agents that can navigate the web.
Gemini's Computer Use model uses a normalized coordinate system (0-1000) and provides built-in actions for browser control, making it straightforward to integrate with Steel.
Prerequisites
- Python 3.8+
- A Steel API key (sign up here)
- A Gemini API key (get one here)
Step 1: Setup and Helper Functions
First, set up a virtual environment and install the required packages:
uv venv
source .venv/bin/activate
uv add steel-sdk google-genai python-dotenv
Create a .env file with your API keys:
STEEL_API_KEY=your_steel_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here
TASK=Go to Steel.dev and find the latest news
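The helper module falls back to placeholder strings when a key is missing, so it can be useful to check the environment before starting a session. The following is a minimal sketch, not part of the Steel or Gemini SDKs: `missing_keys` and `REQUIRED_KEYS` are hypothetical names, and the function takes a plain mapping so it can be exercised without touching the real environment.

```python
from typing import List, Mapping

# Keys the quickstart expects to find in .env (TASK is optional).
REQUIRED_KEYS = ("STEEL_API_KEY", "GEMINI_API_KEY")


def missing_keys(env: Mapping[str, str]) -> List[str]:
    """Return the required keys that are absent or blank in env."""
    return [key for key in REQUIRED_KEYS if not env.get(key, "").strip()]


# Both keys present: nothing is missing.
print(missing_keys({"STEEL_API_KEY": "abc", "GEMINI_API_KEY": "xyz"}))  # []

# A blank value counts as missing, as does an absent key.
print(missing_keys({"STEEL_API_KEY": " "}))  # ['STEEL_API_KEY', 'GEMINI_API_KEY']
```

In a real script you could pass `os.environ` after calling `load_dotenv()` and exit early if the returned list is not empty.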
Create a file named helpers.py with the helper functions and constants (the agent imports from this module):
import os
import json
from typing import List, Optional, Tuple, Dict, Any
from datetime import datetime

from dotenv import load_dotenv
from steel import Steel
from google import genai
from google.genai import types
from google.genai.types import (
    Content,
    Part,
    FunctionCall,
    FunctionResponse,
    Candidate,
    FinishReason,
    Tool,
    GenerateContentConfig,
)

load_dotenv(override=True)

STEEL_API_KEY = os.getenv("STEEL_API_KEY") or "your-steel-api-key-here"
GEMINI_API_KEY = os.getenv("GEMINI_API_KEY") or "your-gemini-api-key-here"
TASK = os.getenv("TASK") or "Go to Steel.dev and find the latest news"

MODEL = "gemini-2.5-computer-use-preview-10-2025"
MAX_COORDINATE = 1000


def format_today() -> str:
    return datetime.now().strftime("%A, %B %d, %Y")


BROWSER_SYSTEM_PROMPT = f"""<BROWSER_ENV>
- You control a headful Chromium browser running in a VM with internet access.
- Chromium is already open; interact only through computer use actions (mouse, keyboard, scroll, screenshots).
- Today's date is {format_today()}.
</BROWSER_ENV>

<BROWSER_CONTROL>
- When viewing pages, zoom out or scroll so all relevant content is visible.
- When typing into any input:
  * Clear it first with Ctrl+A, then Delete.
  * After submitting (pressing Enter or clicking a button), wait for the page to load.
- Computer tool calls are slow; batch related actions into a single call whenever possible.
- You may act on the user's behalf on sites where they are already authenticated.
- Assume any required authentication/Auth Contexts are already configured before the task starts.
- If the first screenshot is black:
  * Click near the center of the screen.
  * Take another screenshot.
</BROWSER_CONTROL>

<TASK_EXECUTION>
- You receive exactly one natural-language task and no further user feedback.
- Do not ask the user clarifying questions; instead, make reasonable assumptions and proceed.
- For complex tasks, quickly plan a short, ordered sequence of steps before acting.
- Prefer minimal, high-signal actions that move directly toward the goal.
- Keep your final response concise and focused on fulfilling the task (e.g., a brief summary of findings or results).
</TASK_EXECUTION>"""
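The system prompt embeds the current date through format_today(). Because the prompt is an f-string defined at module level, the date is fixed when helpers.py is imported. The sketch below shows what that date string looks like; `format_date` is a hypothetical variant that takes the date as an argument so the result is reproducible.

```python
from datetime import datetime


def format_date(moment: datetime) -> str:
    """Format a date the same way format_today() does: weekday, month, day, year."""
    return moment.strftime("%A, %B %d, %Y")


# Day and month names follow the process locale; Python's default locale gives English names.
sample = format_date(datetime(2025, 10, 7))
print(sample)  # Tuesday, October 07, 2025
```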
Step 2: Create the Agent Class
Create a file named agent.py with the agent class. It imports its constants from helpers.py:
import json
from typing import List, Optional, Tuple, Dict, Any

from helpers import (
    STEEL_API_KEY,
    GEMINI_API_KEY,
    MODEL,
    MAX_COORDINATE,
    BROWSER_SYSTEM_PROMPT,
)
from steel import Steel
from google import genai
from google.genai import types
from google.genai.types import (
    Content,
    Part,
    FunctionCall,
    FunctionResponse,
    Candidate,
    FinishReason,
    Tool,
    GenerateContentConfig,
)


class Agent:
    def __init__(self):
        self.client = genai.Client(api_key=GEMINI_API_KEY)
        self.steel = Steel(steel_api_key=STEEL_API_KEY)
        self.session = None
        self.contents: List[Content] = []
        self.current_url = "about:blank"
        self.viewport_width = 1280
        self.viewport_height = 768
        self.tools: List[Tool] = [
            Tool(
                computer_use=types.ComputerUse(
                    environment=types.Environment.ENVIRONMENT_BROWSER,
                )
            )
        ]
        self.config = GenerateContentConfig(tools=self.tools)

    # Gemini returns coordinates in the 0-1000 range; scale them to viewport pixels.
    def _denormalize_x(self, x: int) -> int:
        return int(x / MAX_COORDINATE * self.viewport_width)

    def _denormalize_y(self, y: int) -> int:
        return int(y / MAX_COORDINATE * self.viewport_height)

    def _center(self) -> Tuple[int, int]:
        return (self.viewport_width // 2, self.viewport_height // 2)

    def _normalize_key(self, key: str) -> str:
        if not isinstance(key, str) or not key:
            return key
        k = key.strip()
        upper = k.upper()
        synonyms = {
            "ENTER": "Enter",
            "RETURN": "Enter",
            "ESC": "Escape",
            "ESCAPE": "Escape",
            "TAB": "Tab",
            "BACKSPACE": "Backspace",
            "DELETE": "Delete",
            "SPACE": "Space",
            "CTRL": "Control",
            "CONTROL": "Control",
            "ALT": "Alt",
            "SHIFT": "Shift",
            "META": "Meta",
            "CMD": "Meta",
            "UP": "ArrowUp",
            "DOWN": "ArrowDown",
            "LEFT": "ArrowLeft",
            "RIGHT": "ArrowRight",
            "HOME": "Home",
            "END": "End",
            "PAGEUP": "PageUp",
            "PAGEDOWN": "PageDown",
        }
        if upper in synonyms:
            return synonyms[upper]
        if upper.startswith("F") and upper[1:].isdigit():
            return "F" + upper[1:]
        return k

    def _normalize_keys(self, keys: List[str]) -> List[str]:
        return [self._normalize_key(k) for k in keys]

    def initialize(self) -> None:
        self.session = self.steel.sessions.create(
            dimensions={"width": self.viewport_width, "height": self.viewport_height},
            block_ads=True,
            api_timeout=900000,
        )
        print("Steel Session created successfully!")
        print(f"View live session at: {self.session.session_viewer_url}")

    def cleanup(self) -> None:
        if self.session:
            print("Releasing Steel session...")
            self.steel.sessions.release(self.session.id)
            print(
                f"Session completed. View replay at {self.session.session_viewer_url}"
            )
            self.session = None

    def _take_screenshot(self) -> str:
        resp = self.steel.sessions.computer(self.session.id, action="take_screenshot")
        img = getattr(resp, "base64_image", None)
        if not img:
            raise RuntimeError("No screenshot returned from Steel")
        return img

    def _execute_computer_action(
        self, function_call: FunctionCall
    ) -> Tuple[str, Optional[str]]:
        """Execute a computer action and return (screenshot_base64, url)."""
        name = function_call.name or ""
        args: Dict[str, Any] = function_call.args or {}

        if name == "open_web_browser":
            # The browser is already open in the Steel session; just report its state.
            screenshot = self._take_screenshot()
            return screenshot, self.current_url

        elif name == "click_at":
            x = self._denormalize_x(args.get("x", 0))
            y = self._denormalize_y(args.get("y", 0))
            resp = self.steel.sessions.computer(
                self.session.id,
                action="click_mouse",
                button="left",
                coordinates=[x, y],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "hover_at":
            x = self._denormalize_x(args.get("x", 0))
            y = self._denormalize_y(args.get("y", 0))
            resp = self.steel.sessions.computer(
                self.session.id,
                action="move_mouse",
                coordinates=[x, y],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "type_text_at":
            x = self._denormalize_x(args.get("x", 0))
            y = self._denormalize_y(args.get("y", 0))
            text = args.get("text", "")
            press_enter = args.get("press_enter", True)
            clear_before_typing = args.get("clear_before_typing", True)

            # Focus the input by clicking it first.
            self.steel.sessions.computer(
                self.session.id,
                action="click_mouse",
                button="left",
                coordinates=[x, y],
            )

            if clear_before_typing:
                self.steel.sessions.computer(
                    self.session.id,
                    action="press_key",
                    keys=["Control", "a"],
                )
                self.steel.sessions.computer(
                    self.session.id,
                    action="press_key",
                    keys=["Backspace"],
                )

            self.steel.sessions.computer(
                self.session.id,
                action="type_text",
                text=text,
            )

            if press_enter:
                self.steel.sessions.computer(
                    self.session.id,
                    action="press_key",
                    keys=["Enter"],
                )

            self.steel.sessions.computer(
                self.session.id,
                action="wait",
                duration=1,
            )

            screenshot = self._take_screenshot()
            return screenshot, self.current_url

        elif name == "scroll_document":
            direction = args.get("direction", "down")

            # Vertical scrolling uses PageUp/PageDown; horizontal uses a scroll at the center.
            if direction == "down":
                keys = ["PageDown"]
            elif direction == "up":
                keys = ["PageUp"]
            elif direction in ("left", "right"):
                cx, cy = self._center()
                delta = -400 if direction == "left" else 400
                resp = self.steel.sessions.computer(
                    self.session.id,
                    action="scroll",
                    coordinates=[cx, cy],
                    delta_x=delta,
                    delta_y=0,
                    screenshot=True,
                )
                img = getattr(resp, "base64_image", None)
                return img or self._take_screenshot(), self.current_url
            else:
                keys = ["PageDown"]

            resp = self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=keys,
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "scroll_at":
            x = self._denormalize_x(args.get("x", 0))
            y = self._denormalize_y(args.get("y", 0))
            direction = args.get("direction", "down")
            magnitude = self._denormalize_y(args.get("magnitude", 800))

            delta_x, delta_y = 0, 0
            if direction == "down":
                delta_y = magnitude
            elif direction == "up":
                delta_y = -magnitude
            elif direction == "right":
                delta_x = magnitude
            elif direction == "left":
                delta_x = -magnitude

            resp = self.steel.sessions.computer(
                self.session.id,
                action="scroll",
                coordinates=[x, y],
                delta_x=delta_x,
                delta_y=delta_y,
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "wait_5_seconds":
            resp = self.steel.sessions.computer(
                self.session.id,
                action="wait",
                duration=5,
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "go_back":
            resp = self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=["Alt", "ArrowLeft"],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "go_forward":
            resp = self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=["Alt", "ArrowRight"],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "navigate":
            url = args.get("url", "")
            if not url.startswith(("http://", "https://")):
                url = "https://" + url

            # Focus the address bar, type the URL, and submit it.
            self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=["Control", "l"],
            )
            self.steel.sessions.computer(
                self.session.id,
                action="type_text",
                text=url,
            )
            self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=["Enter"],
            )
            self.steel.sessions.computer(
                self.session.id,
                action="wait",
                duration=2,
            )

            self.current_url = url
            screenshot = self._take_screenshot()
            return screenshot, self.current_url

        elif name == "key_combination":
            keys_str = args.get("keys", "")
            keys = [k.strip() for k in keys_str.split("+")]
            normalized_keys = self._normalize_keys(keys)

            resp = self.steel.sessions.computer(
                self.session.id,
                action="press_key",
                keys=normalized_keys,
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        elif name == "drag_and_drop":
            start_x = self._denormalize_x(args.get("x", 0))
            start_y = self._denormalize_y(args.get("y", 0))
            end_x = self._denormalize_x(args.get("destination_x", 0))
            end_y = self._denormalize_y(args.get("destination_y", 0))

            resp = self.steel.sessions.computer(
                self.session.id,
                action="drag_mouse",
                path=[[start_x, start_y], [end_x, end_y]],
                screenshot=True,
            )
            img = getattr(resp, "base64_image", None)
            return img or self._take_screenshot(), self.current_url

        else:
            print(f"Unknown action: {name}, taking screenshot")
            screenshot = self._take_screenshot()
            return screenshot, self.current_url

    def _extract_function_calls(self, candidate: Candidate) -> List[FunctionCall]:
        function_calls: List[FunctionCall] = []
        if not candidate.content or not candidate.content.parts:
            return function_calls

        for part in candidate.content.parts:
            if part.function_call:
                function_calls.append(part.function_call)

        return function_calls

    def _extract_text(self, candidate: Candidate) -> str:
        if not candidate.content or not candidate.content.parts:
            return ""
        texts: List[str] = []
        for part in candidate.content.parts:
            if part.text:
                texts.append(part.text)
        return " ".join(texts).strip()

    def _build_function_response_parts(
        self,
        function_calls: List[FunctionCall],
        results: List[Tuple[str, Optional[str]]],
    ) -> List[Part]:
        parts: List[Part] = []

        # Each function call gets a response part followed by the resulting screenshot.
        for i, fc in enumerate(function_calls):
            screenshot_base64, url = results[i]

            function_response = FunctionResponse(
                name=fc.name or "",
                response={"url": url or self.current_url},
            )
            parts.append(Part(function_response=function_response))
            parts.append(
                Part(
                    inline_data=types.Blob(
                        mime_type="image/png",
                        data=screenshot_base64,
                    )
                )
            )

        return parts

    def execute_task(
        self,
        task: str,
        print_steps: bool = True,
        max_iterations: int = 50,
    ) -> str:
        self.contents = [
            Content(
                role="user",
                parts=[Part(text=BROWSER_SYSTEM_PROMPT), Part(text=task)],
            )
        ]

        iterations = 0
        consecutive_no_actions = 0

        print(f"🎯 Executing task: {task}")
        print("=" * 60)

        while iterations < max_iterations:
            iterations += 1

            try:
                response = self.client.models.generate_content(
                    model=MODEL,
                    contents=self.contents,
                    config=self.config,
                )

                if not response.candidates:
                    print("❌ No candidates in response")
                    break

                candidate = response.candidates[0]

                if candidate.content:
                    self.contents.append(candidate.content)

                reasoning = self._extract_text(candidate)
                function_calls = self._extract_function_calls(candidate)

                if (
                    not function_calls
                    and not reasoning
                    and candidate.finish_reason == FinishReason.MALFORMED_FUNCTION_CALL
                ):
                    print("⚠️ Malformed function call, retrying...")
                    continue

                # Text with no function calls is treated as the model's final answer.
                if not function_calls:
                    if reasoning:
                        if print_steps:
                            print(f"\n💬 {reasoning}")
                        print("✅ Task complete - model provided final response")
                        break

                    consecutive_no_actions += 1
                    if consecutive_no_actions >= 3:
                        print("⚠️ No actions for 3 consecutive iterations - stopping")
                        break
                    continue

                consecutive_no_actions = 0

                if print_steps and reasoning:
                    print(f"\n💭 {reasoning}")

                results: List[Tuple[str, Optional[str]]] = []

                for fc in function_calls:
                    action_name = fc.name or "unknown"
                    action_args = fc.args or {}

                    if print_steps:
                        print(f"🔧 {action_name}({json.dumps(action_args)})")

                    if action_args:
                        safety_decision = action_args.get("safety_decision")
                        if (
                            isinstance(safety_decision, dict)
                            and safety_decision.get("decision") == "require_confirmation"
                        ):
                            print(
                                f"⚠️ Safety confirmation required: {safety_decision.get('explanation')}"
                            )
                            print("✅ Auto-acknowledging safety check")

                    result = self._execute_computer_action(fc)
                    results.append(result)

                function_response_parts = self._build_function_response_parts(
                    function_calls, results
                )
                self.contents.append(
                    Content(role="user", parts=function_response_parts)
                )

            except Exception as e:
                print(f"❌ Error during task execution: {e}")
                raise

        if iterations >= max_iterations:
            print(f"⚠️ Task execution stopped after {max_iterations} iterations")

        # Return the most recent model message that contains text.
        for content in reversed(self.contents):
            if content.role == "model" and content.parts:
                text_parts = [p.text for p in content.parts if p.text]
                if text_parts:
                    return " ".join(text_parts).strip()

        return "Task execution completed (no final message)"
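The key_combination branch splits a string such as "ctrl+a" on "+" and maps each part to the key name passed to Steel's press_key action. The following standalone sketch isolates that parsing step. It uses a shortened synonym table (the full one is in the class above), and `parse_key_combination` is a hypothetical name chosen for illustration.

```python
from typing import List

# Shortened version of the synonym table used by the agent.
SYNONYMS = {
    "ENTER": "Enter",
    "RETURN": "Enter",
    "ESC": "Escape",
    "CTRL": "Control",
    "CONTROL": "Control",
    "ALT": "Alt",
    "SHIFT": "Shift",
    "CMD": "Meta",
    "LEFT": "ArrowLeft",
}


def normalize_key(key: str) -> str:
    """Map a loosely written key name to its canonical form."""
    stripped = key.strip()
    upper = stripped.upper()
    if upper in SYNONYMS:
        return SYNONYMS[upper]
    # Function keys: "f5" -> "F5".
    if upper.startswith("F") and upper[1:].isdigit():
        return "F" + upper[1:]
    # Anything else (for example a letter) is passed through unchanged.
    return stripped


def parse_key_combination(keys: str) -> List[str]:
    """Split a "+"-separated combination and normalize each part."""
    return [normalize_key(part) for part in keys.split("+")]


print(parse_key_combination("ctrl+a"))      # ['Control', 'a']
print(parse_key_combination("Alt + left"))  # ['Alt', 'ArrowLeft']
print(parse_key_combination("f5"))          # ['F5']
```

Keeping the normalization separate from the Steel call makes it easy to test the mapping without a live session.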
Step 3: Create the Main Script
Create a file named main.py, the entry point that checks the API keys and runs the agent:
import sys
import time

from helpers import STEEL_API_KEY, GEMINI_API_KEY, TASK
from agent import Agent


def main():
    print("🚀 Steel + Gemini Computer Use Assistant")
    print("=" * 60)

    if STEEL_API_KEY == "your-steel-api-key-here":
        print(
            "⚠️ WARNING: Please replace 'your-steel-api-key-here' with your actual Steel API key"
        )
        print(" Get your API key at: https://app.steel.dev/settings/api-keys")
        sys.exit(1)

    if GEMINI_API_KEY == "your-gemini-api-key-here":
        print(
            "⚠️ WARNING: Please replace 'your-gemini-api-key-here' with your actual Gemini API key"
        )
        print(" Get your API key at: https://aistudio.google.com/apikey")
        sys.exit(1)

    print("\nStarting Steel session...")
    agent = Agent()

    try:
        agent.initialize()
        print("✅ Steel session started!")

        start_time = time.time()
        result = agent.execute_task(TASK, True, 50)
        duration = f"{(time.time() - start_time):.1f}"

        print("\n" + "=" * 60)
        print("🎉 TASK EXECUTION COMPLETED")
        print("=" * 60)
        print(f"⏱️ Duration: {duration} seconds")
        print(f"🎯 Task: {TASK}")
        print(f"📋 Result:\n{result}")
        print("=" * 60)

    except Exception as e:
        print(f"❌ Failed to run: {e}")
        raise

    finally:
        agent.cleanup()


if __name__ == "__main__":
    main()
Running Your Agent
Execute your script to start an interactive AI browser session:
python main.py
You will see the session URL printed in the console. You can view the live browser session by opening this URL in your web browser.
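The agent executes the task defined in the TASK environment variable, or the default task if none is set. Note that helpers.py calls load_dotenv(override=True), so a TASK entry in .env takes precedence over a variable exported in your shell. To change the task, edit .env, or remove TASK from .env and export it before running:
export TASK="Search for the latest news on artificial intelligence"
python main.py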
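The order in which the task is resolved can be easy to misread, because helpers.py calls load_dotenv(override=True): values from .env replace variables already exported in the shell, and the built-in default is used only when neither source provides a task. The sketch below models that precedence with plain dictionaries. `resolve_task` is a hypothetical function for illustration and does not call python-dotenv.

```python
from typing import Mapping

DEFAULT_TASK = "Go to Steel.dev and find the latest news"


def resolve_task(
    dotenv_file: Mapping[str, str],
    exported: Mapping[str, str],
    default: str = DEFAULT_TASK,
) -> str:
    """Model of TASK resolution: .env wins over the shell, then the default."""
    merged = dict(exported)
    merged.update(dotenv_file)  # override=True: .env values replace exported ones
    return merged.get("TASK") or default


# .env defines TASK, so the exported value is ignored.
print(resolve_task({"TASK": "from .env"}, {"TASK": "from shell"}))  # from .env

# No TASK in .env: the exported value is used.
print(resolve_task({}, {"TASK": "from shell"}))  # from shell

# Neither source sets TASK: fall back to the default.
print(resolve_task({}, {}))  # Go to Steel.dev and find the latest news
```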
Understanding Gemini's Coordinate System
Gemini's Computer Use model uses a normalized coordinate system where both X and Y coordinates range from 0 to 1000. The agent automatically converts these to actual pixel coordinates based on the viewport size (1280x768 by default).
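As a worked example of that conversion, the sketch below scales normalized coordinates to pixels the same way the agent's _denormalize_x and _denormalize_y methods do. The `denormalize` function name and the clamping step are assumptions added here for illustration; the agent in this guide does not clamp.

```python
from typing import Tuple

MAX_COORDINATE = 1000  # Gemini's normalized range is 0-1000 on both axes


def denormalize(
    x: int, y: int, width: int = 1280, height: int = 768
) -> Tuple[int, int]:
    """Convert normalized (0-1000) coordinates to viewport pixels."""
    px = int(x / MAX_COORDINATE * width)
    py = int(y / MAX_COORDINATE * height)
    # Assumption (not in the agent): clamp so 1000 maps to the last valid pixel.
    px = max(0, min(px, width - 1))
    py = max(0, min(py, height - 1))
    return px, py


print(denormalize(500, 500))    # (640, 384), the center of a 1280x768 viewport
print(denormalize(250, 750))    # (320, 576)
print(denormalize(1000, 1000))  # (1279, 767) after clamping
```

If you change the session dimensions, pass the same width and height here so clicks land where the model intended.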
Next Steps
- Explore the Steel API documentation for more advanced features
- Check out the Gemini Computer Use documentation for more information about the model
- Add additional features like session recording or multi-session management