Quickstart (Python)

How to use OpenAI Computer Use with Steel

This guide walks you through using OpenAI's computer-use-preview model with Steel's Computer API to build an AI agent that can navigate the web.

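We'll implement a simple CUA (Computer-Using Agent) loop that follows the flow described in OpenAI's "Computer use" guide: send the task to the model, execute the action it returns, capture a screenshot of the result, send that screenshot back, and repeat until the task is complete.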
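The loop itself fits in a few lines of plain Python. This is a minimal sketch, not the Agent class from Step 2: call_model and execute_action are hypothetical stand-ins for the OpenAI request and the Steel action, and the item shapes only mirror the Responses API fields this guide uses (type, call_id, action). Unlike the Agent class, the sketch stops at the first response that requests no action.

```python
from typing import Any, Callable, Dict, List


def run_cua_loop(
    task: str,
    call_model: Callable[[List[Dict[str, Any]]], List[Dict[str, Any]]],
    execute_action: Callable[[Dict[str, Any]], str],
    max_iterations: int = 10,
) -> List[Dict[str, Any]]:
    """Ask the model, run the actions it requests, send screenshots back."""
    history: List[Dict[str, Any]] = [{"role": "user", "content": task}]
    for _ in range(max_iterations):
        output = call_model(history)  # 1. send the conversation so far
        history.extend(output)
        calls = [item for item in output if item.get("type") == "computer_call"]
        if not calls:
            break  # no action requested: treat the last message as the answer
        for call in calls:
            screenshot = execute_action(call["action"])  # 2. run the action
            history.append(  # 3. return the new screenshot to the model
                {
                    "type": "computer_call_output",
                    "call_id": call["call_id"],
                    "output": {
                        "type": "input_image",
                        "image_url": f"data:image/png;base64,{screenshot}",
                    },
                }
            )
    return history


def fake_model(history: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Hypothetical model: request one click, then answer once it sees a screenshot."""
    if not any(i.get("type") == "computer_call_output" for i in history):
        return [{"type": "computer_call", "call_id": "c1",
                 "action": {"type": "click", "x": 10, "y": 20}}]
    return [{"type": "message", "role": "assistant",
             "content": [{"type": "output_text", "text": "Done"}]}]


history = run_cua_loop("demo", fake_model, lambda action: "AAAA")
print(history[-1]["content"][0]["text"])
```

Swapping fake_model for a real Responses API call and the lambda for a Steel computer action gives the shape of the agent built below.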

Prerequisites

  • Python 3.8+

  • A Steel API key (sign up at https://app.steel.dev)

  • An OpenAI API key with access to the computer-use-preview model

Step 1: Setup and Helper Functions

First, set up a virtual environment and install the required packages:

```bash
uv venv
source .venv/bin/activate
uv pip install steel-sdk requests python-dotenv
```
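To confirm the install worked, you can run a quick check inside the activated environment. This is an optional sketch; missing_modules is a hypothetical helper, not part of either SDK. Note that steel-sdk provides the steel module and python-dotenv provides dotenv, which is why the import names differ from the package names.

```python
import importlib.util
from typing import List


def missing_modules(names: List[str]) -> List[str]:
    """Return the top-level module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Import names used later in this guide (not the pip package names).
missing = missing_modules(["steel", "requests", "dotenv"])
print("Missing:", missing or "none")
```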

Create a .env file with your API keys and, optionally, the task for the agent:

```
STEEL_API_KEY=your-steel-api-key-here
OPENAI_API_KEY=your-openai-api-key-here
TASK=Go to Steel.dev and find the latest news
```
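helpers.py (below) reads each value with os.getenv(NAME) or default, so a variable that is missing or set to an empty string falls back to its placeholder or to the default task. That is also why main.py can detect an unset key by comparing it against the placeholder string. A stdlib-only sketch of the rule, using a hypothetical env_or_default helper that accepts an explicit mapping in place of os.environ so it is easy to test:

```python
import os
from typing import Mapping, Optional


def env_or_default(
    name: str, default: str, environ: Optional[Mapping[str, str]] = None
) -> str:
    """Mirror the `os.getenv(name) or default` pattern used in helpers.py."""
    env = os.environ if environ is None else environ
    # An empty string is falsy, so it falls back exactly like a missing key.
    return env.get(name) or default


print(env_or_default("TASK", "Go to Steel.dev and find the latest news", {"TASK": ""}))
```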

Create a file with helper functions and constants:

```python
# helpers.py
import os
from datetime import datetime

import requests
from dotenv import load_dotenv

# Values in .env override variables already set in the shell.
load_dotenv(override=True)

STEEL_API_KEY = os.getenv("STEEL_API_KEY") or "your-steel-api-key-here"
OPENAI_API_KEY = os.getenv("OPENAI_API_KEY") or "your-openai-api-key-here"
TASK = os.getenv("TASK") or "Go to Steel.dev and find the latest news"


def format_today() -> str:
    return datetime.now().strftime("%A, %B %d, %Y")


BROWSER_SYSTEM_PROMPT = f"""<BROWSER_ENV>
- You control a headful Chromium browser running in a VM with internet access.
- Interact only through the computer tool (mouse/keyboard/scroll/screenshots). Do not call navigation functions.
- Today's date is {format_today()}.
</BROWSER_ENV>

<BROWSER_CONTROL>
- Before acting, take a screenshot to observe state.
- When typing into any input:
  * Clear with Ctrl/⌘+A, then Delete.
  * After submitting (Enter or clicking a button), call wait(1–2s) once, then take a single screenshot and move the mouse aside.
  * Do not press Enter repeatedly. If the page state doesn't change after submit+wait+screenshot, change strategy (e.g., focus address bar with Ctrl/⌘+L, type the full URL, press Enter once).
- Computer calls are slow; batch related actions together.
- Zoom out or scroll so all relevant content is visible before reading.
- If the first screenshot is black, click near center and screenshot again.
</BROWSER_CONTROL>

<TASK_EXECUTION>
- You receive exactly one natural-language task and no further user feedback.
- Do not ask clarifying questions; make reasonable assumptions and proceed.
- Prefer minimal, high-signal actions that move directly toward the goal.
- Every assistant turn must include at least one computer action; avoid text-only turns.
- Avoid repetition: never repeat the same action sequence in consecutive turns (e.g., pressing Enter multiple times). If an action has no visible effect, pivot to a different approach.
- If two iterations produce no meaningful progress, try a different tactic (e.g., Ctrl/⌘+L → type URL → Enter) rather than repeating the prior keys, then proceed.
- Keep the final response concise and focused on fulfilling the task.
</TASK_EXECUTION>"""


def create_response(**kwargs):
    """Send kwargs as the JSON body of an OpenAI Responses API request."""
    url = "https://api.openai.com/v1/responses"
    headers = {
        "Authorization": f"Bearer {OPENAI_API_KEY}",
        "Content-Type": "application/json",
    }
    # Optional: route the request to a specific OpenAI organization.
    openai_org = os.getenv("OPENAI_ORG")
    if openai_org:
        headers["Openai-Organization"] = openai_org

    response = requests.post(url, headers=headers, json=kwargs)
    if response.status_code != 200:
        raise RuntimeError(f"OpenAI API Error: {response.status_code} {response.text}")
    return response.json()
```
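create_response forwards its keyword arguments unchanged as the JSON body of POST https://api.openai.com/v1/responses. The sketch below assembles the body that the agent's first request carries, using a hypothetical build_request_body helper and json.dumps only; it makes no network call, and the system prompt is abbreviated for readability.

```python
import json
from typing import Any, Dict, List


def build_request_body(
    model: str,
    input_items: List[Dict[str, Any]],
    tools: List[Dict[str, Any]],
) -> Dict[str, Any]:
    """Assemble the same keyword arguments the agent passes to create_response."""
    return {
        "model": model,
        "input": input_items,
        "tools": tools,
        "truncation": "auto",  # sent on every request in this guide
    }


body = build_request_body(
    model="computer-use-preview",
    input_items=[
        {"role": "system", "content": "<BROWSER_ENV>(abbreviated)</BROWSER_ENV>"},
        {"role": "user", "content": "Go to Steel.dev and find the latest news"},
    ],
    tools=[
        {
            "type": "computer-preview",
            "display_width": 1280,
            "display_height": 768,
            "environment": "browser",
        }
    ],
)
print(json.dumps(body, indent=2))
```

On later turns, the input list grows with the model's output items and the computer_call_output items carrying screenshots.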

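Step 2: Create the Agent Class

The Agent class owns the Steel session, translates each action the model requests into a Steel Computer API call, and runs the request, act, and screenshot loop until the task finishes.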

```python
# agent.py
import json
from typing import Any, Dict, List, Optional, Tuple

from steel import Steel

from helpers import (
    STEEL_API_KEY,
    BROWSER_SYSTEM_PROMPT,
    create_response,
)


class Agent:
    def __init__(self):
        self.steel = Steel(steel_api_key=STEEL_API_KEY)
        self.session = None
        self.model = "computer-use-preview"
        self.viewport_width = 1280
        self.viewport_height = 768
        self.system_prompt = BROWSER_SYSTEM_PROMPT
        self.tools = [
            {
                "type": "computer-preview",
                "display_width": self.viewport_width,
                "display_height": self.viewport_height,
                "environment": "browser",
            }
        ]
        self.print_steps = True
        self.auto_acknowledge_safety = True

    def center(self) -> Tuple[int, int]:
        return (self.viewport_width // 2, self.viewport_height // 2)

    def to_number(self, v: Any, default: float = 0.0) -> float:
        if isinstance(v, (int, float)):
            return float(v)
        if isinstance(v, str):
            try:
                return float(v)
            except ValueError:
                return default
        return default

    def to_coords(self, x: Any = None, y: Any = None) -> Tuple[int, int]:
        # Missing coordinates default to the center of the viewport.
        if x is None or y is None:
            return self.center()
        return (
            int(self.to_number(x, self.center()[0])),
            int(self.to_number(y, self.center()[1])),
        )

    def split_keys(self, k: Optional[Any]) -> List[str]:
        # Accept either a list (["CTRL", "A"]) or a combo string ("CTRL+A").
        if isinstance(k, list):
            return [str(s) for s in k if s]
        if isinstance(k, str) and k.strip():
            return [s.strip() for s in k.split("+") if s.strip()]
        return []

    def normalize_key(self, key: str) -> str:
        # Map the model's key names to the names Steel expects.
        if not isinstance(key, str) or not key:
            return key
        k = key.strip()
        upper = k.upper()
        synonyms = {
            "ENTER": "Enter",
            "RETURN": "Enter",
            "ESC": "Escape",
            "ESCAPE": "Escape",
            "TAB": "Tab",
            "BACKSPACE": "Backspace",
            "DELETE": "Delete",
            "SPACE": "Space",
            "CTRL": "Control",
            "CONTROL": "Control",
            "ALT": "Alt",
            "SHIFT": "Shift",
            "META": "Meta",
            "CMD": "Meta",
            "UP": "ArrowUp",
            "DOWN": "ArrowDown",
            "LEFT": "ArrowLeft",
            "RIGHT": "ArrowRight",
            "HOME": "Home",
            "END": "End",
            "PAGEUP": "PageUp",
            "PAGEDOWN": "PageDown",
        }
        if upper in synonyms:
            return synonyms[upper]
        if upper.startswith("F") and upper[1:].isdigit():
            return "F" + upper[1:]
        return k

    def normalize_keys(self, keys: List[str]) -> List[str]:
        return [self.normalize_key(k) for k in keys]

    def initialize(self) -> None:
        width = self.viewport_width
        height = self.viewport_height
        self.session = self.steel.sessions.create(
            dimensions={"width": width, "height": height},
            block_ads=True,
            api_timeout=900000,  # 900000 ms = 15 minutes
        )
        print("Steel Session created successfully!")
        print(f"View live session at: {self.session.session_viewer_url}")

    def cleanup(self) -> None:
        if self.session:
            print("Releasing Steel session...")
            self.steel.sessions.release(self.session.id)
            print(
                f"Session completed. View replay at {self.session.session_viewer_url}"
            )
            self.session = None

    def take_screenshot(self) -> str:
        resp = self.steel.sessions.computer(self.session.id, action="take_screenshot")
        img = getattr(resp, "base64_image", None)
        if not img:
            raise RuntimeError("No screenshot returned from Steel")
        return img

    def map_button(self, btn: Optional[str]) -> str:
        b = (btn or "left").lower()
        if b == "wheel":  # OpenAI reports the middle mouse button as "wheel"
            return "middle"
        if b in ("left", "right", "middle", "back", "forward"):
            return b
        return "left"

    def execute_computer_action(
        self, action_type: str, action_args: Dict[str, Any]
    ) -> str:
        """Run one model action through Steel and return a base64 screenshot."""
        body: Dict[str, Any]

        if action_type == "move":
            coords = self.to_coords(action_args.get("x"), action_args.get("y"))
            body = {
                "action": "move_mouse",
                "coordinates": [coords[0], coords[1]],
                "screenshot": True,
            }

        elif action_type == "click":
            coords = self.to_coords(action_args.get("x"), action_args.get("y"))
            button = self.map_button(action_args.get("button"))
            num_clicks = int(self.to_number(action_args.get("num_clicks"), 1))
            payload = {
                "action": "click_mouse",
                "button": button,
                "coordinates": [coords[0], coords[1]],
                "screenshot": True,
            }
            if num_clicks > 1:
                payload["num_clicks"] = num_clicks
            body = payload

        elif action_type in ("doubleClick", "double_click"):
            coords = self.to_coords(action_args.get("x"), action_args.get("y"))
            body = {
                "action": "click_mouse",
                "button": "left",
                "coordinates": [coords[0], coords[1]],
                "num_clicks": 2,
                "screenshot": True,
            }

        elif action_type == "drag":
            path = action_args.get("path") or []
            steel_path: List[List[int]] = []
            for p in path:
                steel_path.append(list(self.to_coords(p.get("x"), p.get("y"))))
            # A drag needs at least two points; fall back to center -> target.
            if len(steel_path) < 2:
                cx, cy = self.center()
                tx, ty = self.to_coords(action_args.get("x"), action_args.get("y"))
                steel_path = [[cx, cy], [tx, ty]]
            body = {"action": "drag_mouse", "path": steel_path, "screenshot": True}

        elif action_type == "scroll":
            scroll_at: Optional[Tuple[int, int]] = None
            if action_args.get("x") is not None or action_args.get("y") is not None:
                scroll_at = self.to_coords(action_args.get("x"), action_args.get("y"))
            delta_x = int(self.to_number(action_args.get("scroll_x"), 0))
            delta_y = int(self.to_number(action_args.get("scroll_y"), 0))
            body = {
                "action": "scroll",
                "screenshot": True,
            }
            if scroll_at:
                body["coordinates"] = [scroll_at[0], scroll_at[1]]
            if delta_x:
                body["delta_x"] = delta_x
            if delta_y:
                body["delta_y"] = delta_y

        elif action_type == "type":
            text = action_args.get("text") or ""
            body = {"action": "type_text", "text": text, "screenshot": True}

        elif action_type == "keypress":
            keys = action_args.get("keys")
            keys_list = self.split_keys(keys)
            normalized = self.normalize_keys(keys_list)
            body = {"action": "press_key", "keys": normalized, "screenshot": True}

        elif action_type == "wait":
            # Steel expects seconds; default to 1 second when no duration is given.
            ms = self.to_number(action_args.get("ms"), 1000)
            seconds = max(0.001, ms / 1000.0)
            body = {"action": "wait", "duration": seconds, "screenshot": True}

        else:
            # "screenshot" and any unrecognized action: just capture the screen.
            return self.take_screenshot()

        resp = self.steel.sessions.computer(
            self.session.id, **{k: v for k, v in body.items() if v is not None}
        )
        img = getattr(resp, "base64_image", None)
        return img if img else self.take_screenshot()

    def handle_item(self, item: Dict[str, Any]) -> List[Dict[str, Any]]:
        """Process one model output item and return the items to send back."""
        if item["type"] == "message":
            if self.print_steps and item.get("content") and len(item["content"]) > 0:
                print(item["content"][0].get("text", ""))
            return []

        if item["type"] == "function_call":
            if self.print_steps:
                print(f"{item['name']}({item['arguments']})")
            return [
                {
                    "type": "function_call_output",
                    "call_id": item["call_id"],
                    "output": "success",
                }
            ]

        if item["type"] == "computer_call":
            action = item["action"]
            action_type = action["type"]
            action_args = {k: v for k, v in action.items() if k != "type"}

            if self.print_steps:
                print(f"{action_type}({json.dumps(action_args)})")

            screenshot_base64 = self.execute_computer_action(action_type, action_args)

            pending_checks = item.get("pending_safety_checks", []) or []
            for check in pending_checks:
                if self.auto_acknowledge_safety:
                    print(f"⚠️ Auto-acknowledging safety check: {check.get('message')}")
                else:
                    raise RuntimeError(f"Safety check failed: {check.get('message')}")

            call_output = {
                "type": "computer_call_output",
                "call_id": item["call_id"],
                "acknowledged_safety_checks": pending_checks,
                "output": {
                    "type": "input_image",
                    "image_url": f"data:image/png;base64,{screenshot_base64}",
                },
            }
            return [call_output]

        return []

    def execute_task(
        self,
        task: str,
        print_steps: bool = True,
        debug: bool = False,
        max_iterations: int = 50,
    ) -> str:
        self.print_steps = print_steps

        input_items: List[Dict[str, Any]] = [
            {"role": "system", "content": self.system_prompt},
            {"role": "user", "content": task},
        ]

        new_items: List[Dict[str, Any]] = []
        iterations = 0
        consecutive_no_actions = 0
        last_assistant_texts: List[str] = []

        print(f"🎯 Executing task: {task}")
        print("=" * 60)

        def detect_repetition(text: str) -> bool:
            # True if >80% of the words overlap with a recent assistant message.
            if len(last_assistant_texts) < 2:
                return False
            words1 = text.lower().split()
            for prev in last_assistant_texts:
                words2 = prev.lower().split()
                common = [w for w in words1 if w in words2]
                # max(..., 1) avoids dividing by zero on whitespace-only messages.
                if len(common) / max(len(words1), len(words2), 1) > 0.8:
                    return True
            return False

        while iterations < max_iterations:
            iterations += 1
            has_actions = False

            # Stop early if the model keeps repeating the same message.
            if new_items and new_items[-1].get("role") == "assistant":
                content = new_items[-1].get("content", [])
                last_text = content[0].get("text") if content else None
                if isinstance(last_text, str) and last_text:
                    if detect_repetition(last_text):
                        print("🔄 Repetition detected - stopping execution")
                        last_assistant_texts.append(last_text)
                        break
                    last_assistant_texts.append(last_text)
                    if len(last_assistant_texts) > 3:
                        last_assistant_texts.pop(0)

            try:
                response = create_response(
                    model=self.model,
                    input=[*input_items, *new_items],
                    tools=self.tools,
                    truncation="auto",
                )

                if "output" not in response:
                    raise RuntimeError("No output from model")

                for item in response["output"]:
                    new_items.append(item)
                    if item.get("type") in ("computer_call", "function_call"):
                        has_actions = True
                    new_items.extend(self.handle_item(item))

                # Stop once the model has gone 3 turns in a row without acting.
                if not has_actions:
                    consecutive_no_actions += 1
                    if consecutive_no_actions >= 3:
                        print("⚠️ No actions for 3 consecutive iterations - stopping")
                        break
                else:
                    consecutive_no_actions = 0

            except Exception as error:
                print(f"❌ Error during task execution: {error}")
                raise

        if iterations >= max_iterations:
            print(f"⚠️ Task execution stopped after {max_iterations} iterations")

        # Return the text of the last assistant message, if there is one.
        assistant_messages = [i for i in new_items if i.get("role") == "assistant"]
        if assistant_messages:
            content = assistant_messages[-1].get("content") or []
            if content and content[0].get("text"):
                return content[0]["text"]

        return "Task execution completed (no final message)"
```
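Most of the Agent class is a translation layer from OpenAI computer_call actions to Steel Computer API request bodies. The sketch below isolates that mapping for a few action types as pure functions, so you can check it without a browser or API keys. to_steel_body and KEY_SYNONYMS are hypothetical names, and the field names are copied from execute_computer_action above rather than from the Steel API reference.

```python
from typing import Any, Dict, List

# A subset of the key synonyms used by Agent.normalize_key.
KEY_SYNONYMS = {"CTRL": "Control", "CMD": "Meta", "ENTER": "Enter", "ESC": "Escape"}


def normalize_keys(keys: Any) -> List[str]:
    """Accept ["CTRL", "A"] or "CTRL+A" and return Steel-style key names."""
    parts = keys if isinstance(keys, list) else str(keys or "").split("+")
    cleaned = [str(p).strip() for p in parts if str(p).strip()]
    return [KEY_SYNONYMS.get(p.upper(), p) for p in cleaned]


def to_steel_body(action: Dict[str, Any]) -> Dict[str, Any]:
    """Translate one OpenAI computer action into a Steel computer() request body."""
    kind = action["type"]
    xy = [int(action.get("x", 0)), int(action.get("y", 0))]
    if kind == "click":
        button = action.get("button") or "left"
        if button == "wheel":  # OpenAI names the middle button "wheel"
            button = "middle"
        return {"action": "click_mouse", "button": button,
                "coordinates": xy, "screenshot": True}
    if kind == "double_click":
        return {"action": "click_mouse", "button": "left", "coordinates": xy,
                "num_clicks": 2, "screenshot": True}
    if kind == "type":
        return {"action": "type_text", "text": action.get("text", ""), "screenshot": True}
    if kind == "keypress":
        return {"action": "press_key", "keys": normalize_keys(action.get("keys")),
                "screenshot": True}
    if kind == "scroll":
        return {"action": "scroll", "coordinates": xy,
                "delta_x": int(action.get("scroll_x", 0)),
                "delta_y": int(action.get("scroll_y", 0)), "screenshot": True}
    if kind == "wait":
        return {"action": "wait", "duration": 1.0, "screenshot": True}
    # Anything else (including "screenshot") just captures the screen.
    return {"action": "take_screenshot"}


print(to_steel_body({"type": "keypress", "keys": ["CTRL", "L"]}))
# → {'action': 'press_key', 'keys': ['Control', 'L'], 'screenshot': True}
```

Keeping the mapping pure like this makes it easy to unit-test new action types before wiring them into execute_computer_action.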

Step 3: Create the Main Script

```python
# main.py
import sys
import time

from helpers import STEEL_API_KEY, OPENAI_API_KEY, TASK
from agent import Agent


def main():
    print("🚀 Steel + OpenAI Computer Use Assistant")
    print("=" * 60)

    if STEEL_API_KEY == "your-steel-api-key-here":
        print(
            "⚠️ WARNING: Please replace 'your-steel-api-key-here' with your actual Steel API key"
        )
        print(" Get your API key at: https://app.steel.dev/settings/api-keys")
        sys.exit(1)

    if OPENAI_API_KEY == "your-openai-api-key-here":
        print(
            "⚠️ WARNING: Please replace 'your-openai-api-key-here' with your actual OpenAI API key"
        )
        print(" Get your API key at: https://platform.openai.com/")
        sys.exit(1)

    print("\nStarting Steel session...")
    agent = Agent()

    try:
        try:
            agent.initialize()
        except Exception as e:
            print(f"❌ Failed to start Steel session: {e}")
            print("Please check your STEEL_API_KEY and internet connection.")
            raise
        print("✅ Steel session started!")

        start_time = time.time()
        try:
            result = agent.execute_task(
                TASK, print_steps=True, debug=False, max_iterations=50
            )
        except Exception as e:
            print(f"❌ Task execution failed: {e}")
            raise
        duration = f"{(time.time() - start_time):.1f}"

        print("\n" + "=" * 60)
        print("🎉 TASK EXECUTION COMPLETED")
        print("=" * 60)
        print(f"⏱️ Duration: {duration} seconds")
        print(f"🎯 Task: {TASK}")
        print(f"📋 Result:\n{result}")
        print("=" * 60)

    finally:
        # Always release the Steel session, even if the task fails.
        agent.cleanup()


if __name__ == "__main__":
    main()
```
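main.py relies on try/finally so that agent.cleanup() releases the Steel session even when the task fails. If you prefer a with statement, the same guarantee can be expressed with contextlib.contextmanager. This is an optional sketch: steel_agent is a hypothetical helper, and FakeAgent stands in for the real Agent so the example runs without API keys.

```python
from contextlib import contextmanager
from typing import Any, Iterator, List


@contextmanager
def steel_agent(agent: Any) -> Iterator[Any]:
    """Start the agent's session and always release it on exit."""
    try:
        agent.initialize()
        yield agent
    finally:
        agent.cleanup()  # runs on success, on errors, and on KeyboardInterrupt


class FakeAgent:
    """Stand-in with the same initialize/cleanup/execute_task surface as Agent."""

    def __init__(self) -> None:
        self.events: List[str] = []

    def initialize(self) -> None:
        self.events.append("initialize")

    def cleanup(self) -> None:
        self.events.append("cleanup")

    def execute_task(self, task: str) -> str:
        raise RuntimeError(f"simulated failure while running: {task}")


fake = FakeAgent()
try:
    with steel_agent(fake) as agent:
        agent.execute_task("demo task")
except RuntimeError as err:
    print("Task failed:", err)
print(fake.events)  # ['initialize', 'cleanup']
```

Because the real Agent.cleanup() does nothing when no session exists, calling it after a failed initialize() is safe.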

Running Your Agent

Run the script to start the agent in a new Steel browser session:

```bash
python main.py
```

The script prints a session viewer URL to the console; open it in your web browser to watch the agent work live. After the session is released, the same URL shows a replay.

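The agent runs the task from the TASK environment variable, falling back to "Go to Steel.dev and find the latest news" when it is unset. Because helpers.py calls load_dotenv(override=True), a TASK value in .env takes precedence over one exported in your shell. To change the task, edit TASK in .env, or remove it from .env and set the environment variable: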

```bash
export TASK="Search for the latest news on artificial intelligence"
python main.py
```

Next Steps

  • Explore the Steel API documentation for more advanced features

  • Check out the OpenAI documentation for more information about the computer-use-preview model

  • Add additional features like session recording or multi-session management