Quickstart (TypeScript)

How to use Gemini Computer Use with Steel

This guide will walk you through how to use Google's gemini-2.5-computer-use-preview model with Steel's Computer API to create AI agents that can navigate the web.

Gemini's Computer Use model uses a normalized coordinate system (0-1000) and provides built-in actions for browser control, making it straightforward to integrate with Steel.

Prerequisites

Step 1: Setup and Helper Functions

First, create a project directory and install the required packages:

Terminal
# Create a project directory
mkdir steel-gemini-computer-use
cd steel-gemini-computer-use
# Initialize package.json
npm init -y
# Install required packages
npm install steel-sdk @google/genai dotenv
npm install -D @types/node typescript ts-node

Create a .env file with your API keys:

ENV
.env
STEEL_API_KEY=your_steel_api_key_here
GEMINI_API_KEY=your_gemini_api_key_here
TASK=Go to Steel.dev and find the latest news

Create a file with helper functions, constants, and type definitions:

Typescript
helpers.ts
import * as dotenv from "dotenv";
import { Steel } from "steel-sdk";
import {
  GoogleGenAI,
  Content,
  Part,
  FunctionCall,
  FunctionResponse,
  Tool,
  Environment,
  GenerateContentConfig,
  GenerateContentResponse,
  Candidate,
  FinishReason,
} from "@google/genai";

dotenv.config();

export const STEEL_API_KEY = process.env.STEEL_API_KEY || "your-steel-api-key-here";
export const GEMINI_API_KEY = process.env.GEMINI_API_KEY || "your-gemini-api-key-here";
export const TASK = process.env.TASK || "Go to Steel.dev and find the latest news";

export const MODEL = "gemini-2.5-computer-use-preview-10-2025";
export const MAX_COORDINATE = 1000;

export function formatToday(): string {
  return new Intl.DateTimeFormat("en-US", {
    weekday: "long",
    month: "long",
    day: "2-digit",
    year: "numeric",
  }).format(new Date());
}

export const BROWSER_SYSTEM_PROMPT = `<BROWSER_ENV>
- You control a headful Chromium browser running in a VM with internet access.
- Chromium is already open; interact only through computer use actions (mouse, keyboard, scroll, screenshots).
- Today's date is ${formatToday()}.
</BROWSER_ENV>

<BROWSER_CONTROL>
- When viewing pages, zoom out or scroll so all relevant content is visible.
- When typing into any input:
  * Clear it first with Ctrl+A, then Delete.
  * After submitting (pressing Enter or clicking a button), wait for the page to load.
- Computer tool calls are slow; batch related actions into a single call whenever possible.
- You may act on the user's behalf on sites where they are already authenticated.
- Assume any required authentication/Auth Contexts are already configured before the task starts.
- If the first screenshot is black:
  * Click near the center of the screen.
  * Take another screenshot.
</BROWSER_CONTROL>

<TASK_EXECUTION>
- You receive exactly one natural-language task and no further user feedback.
- Do not ask the user clarifying questions; instead, make reasonable assumptions and proceed.
- For complex tasks, quickly plan a short, ordered sequence of steps before acting.
- Prefer minimal, high-signal actions that move directly toward the goal.
- Keep your final response concise and focused on fulfilling the task (e.g., a brief summary of findings or results).
</TASK_EXECUTION>`;

export type Coordinates = [number, number];

export interface ActionResult {
  screenshotBase64: string;
  url?: string;
}

export {
  Steel,
  GoogleGenAI,
  Content,
  Part,
  FunctionCall,
  FunctionResponse,
  Tool,
  Environment,
  GenerateContentConfig,
  Candidate,
  FinishReason,
};

Step 2: Create the Agent Class

Typescript
agent.ts
import {
  Steel,
  GoogleGenAI,
  Content,
  Part,
  FunctionCall,
  FunctionResponse,
  Tool,
  Environment,
  GenerateContentConfig,
  Candidate,
  FinishReason,
  STEEL_API_KEY,
  GEMINI_API_KEY,
  MODEL,
  MAX_COORDINATE,
  BROWSER_SYSTEM_PROMPT,
  Coordinates,
  ActionResult,
} from "./helpers";

export class Agent {
  private client: GoogleGenAI;
  private steel: Steel;
  private session: Steel.Session | null = null;
  private contents: Content[];
  private tools: Tool[];
  private config: GenerateContentConfig;
  private viewportWidth: number;
  private viewportHeight: number;
  private currentUrl: string;

  constructor() {
    this.client = new GoogleGenAI({ apiKey: GEMINI_API_KEY });
    this.steel = new Steel({ steelAPIKey: STEEL_API_KEY });
    this.contents = [];
    this.currentUrl = "about:blank";
    this.viewportWidth = 1280;
    this.viewportHeight = 768;
    this.tools = [
      {
        computerUse: {
          environment: Environment.ENVIRONMENT_BROWSER,
        },
      },
    ];
    this.config = {
      tools: this.tools,
    };
  }

  // Gemini returns coordinates on a 0-1000 grid; scale them to viewport pixels.
  private denormalizeX(x: number): number {
    return Math.round((x / MAX_COORDINATE) * this.viewportWidth);
  }

  private denormalizeY(y: number): number {
    return Math.round((y / MAX_COORDINATE) * this.viewportHeight);
  }

  // Pixel coordinates of the viewport center.
  private center(): Coordinates {
    return [
      Math.floor(this.viewportWidth / 2),
      Math.floor(this.viewportHeight / 2),
    ];
  }

  // Map common key aliases (e.g. "ctrl", "esc") to canonical key names.
  private normalizeKey(key: string): string {
    if (!key) return key;
    const k = key.trim();
    const upper = k.toUpperCase();
    const synonyms: Record<string, string> = {
      ENTER: "Enter",
      RETURN: "Enter",
      ESC: "Escape",
      ESCAPE: "Escape",
      TAB: "Tab",
      BACKSPACE: "Backspace",
      DELETE: "Delete",
      SPACE: "Space",
      CTRL: "Control",
      CONTROL: "Control",
      ALT: "Alt",
      SHIFT: "Shift",
      META: "Meta",
      CMD: "Meta",
      UP: "ArrowUp",
      DOWN: "ArrowDown",
      LEFT: "ArrowLeft",
      RIGHT: "ArrowRight",
      HOME: "Home",
      END: "End",
      PAGEUP: "PageUp",
      PAGEDOWN: "PageDown",
    };
    if (upper in synonyms) return synonyms[upper];
    // Function keys: "f5" -> "F5".
    if (upper.startsWith("F") && /^\d+$/.test(upper.slice(1))) {
      return "F" + upper.slice(1);
    }
    return k;
  }

  private normalizeKeys(keys: string[]): string[] {
    return keys.map((k) => this.normalizeKey(k));
  }

  // Create a Steel session sized to the viewport used for coordinate scaling.
  async initialize(): Promise<void> {
    this.session = await this.steel.sessions.create({
      dimensions: { width: this.viewportWidth, height: this.viewportHeight },
      blockAds: true,
      timeout: 900000,
    });
    console.log("Steel Session created successfully!");
    console.log(`View live session at: ${this.session.sessionViewerUrl}`);
  }

  // Release the Steel session, if one is open.
  async cleanup(): Promise<void> {
    if (this.session) {
      console.log("Releasing Steel session...");
      await this.steel.sessions.release(this.session.id);
      console.log(
        `Session completed. View replay at ${this.session.sessionViewerUrl}`
      );
      this.session = null;
    }
  }

  // Capture the current screen as a base64-encoded image.
  private async takeScreenshot(): Promise<string> {
    const resp: any = await this.steel.sessions.computer(this.session!.id, {
      action: "take_screenshot",
    });
    const img = resp?.base64_image;
    if (!img) throw new Error("No screenshot returned from Steel");
    return img;
  }

  // Translate one Gemini computer-use function call into Steel computer actions.
  private async executeComputerAction(
    functionCall: FunctionCall
  ): Promise<ActionResult> {
    const name = functionCall.name ?? "";
    const args = (functionCall.args ?? {}) as Record<string, unknown>;

    switch (name) {
      case "open_web_browser": {
        // The browser is already running, so just report the current screen.
        const screenshot = await this.takeScreenshot();
        return { screenshotBase64: screenshot, url: this.currentUrl };
      }

      case "click_at": {
        const x = this.denormalizeX(args.x as number);
        const y = this.denormalizeY(args.y as number);
        const resp: any = await this.steel.sessions.computer(this.session!.id, {
          action: "click_mouse",
          button: "left",
          coordinates: [x, y],
          screenshot: true,
        });
        return {
          screenshotBase64: resp?.base64_image || (await this.takeScreenshot()),
          url: this.currentUrl,
        };
      }

      case "hover_at": {
        const x = this.denormalizeX(args.x as number);
        const y = this.denormalizeY(args.y as number);
        const resp: any = await this.steel.sessions.computer(this.session!.id, {
          action: "move_mouse",
          coordinates: [x, y],
          screenshot: true,
        });
        return {
          screenshotBase64: resp?.base64_image || (await this.takeScreenshot()),
          url: this.currentUrl,
        };
      }

      case "type_text_at": {
        const x = this.denormalizeX(args.x as number);
        const y = this.denormalizeY(args.y as number);
        const text = args.text as string;
        // Both flags default to true unless the model explicitly passes false.
        const pressEnter = args.press_enter !== false;
        const clearBeforeTyping = args.clear_before_typing !== false;

        // Focus the target input.
        await this.steel.sessions.computer(this.session!.id, {
          action: "click_mouse",
          button: "left",
          coordinates: [x, y],
        });

        if (clearBeforeTyping) {
          // Select all, then delete the selection.
          await this.steel.sessions.computer(this.session!.id, {
            action: "press_key",
            keys: ["Control", "a"],
          });
          await this.steel.sessions.computer(this.session!.id, {
            action: "press_key",
            keys: ["Backspace"],
          });
        }

        await this.steel.sessions.computer(this.session!.id, {
          action: "type_text",
          text: text,
        });

        if (pressEnter) {
          await this.steel.sessions.computer(this.session!.id, {
            action: "press_key",
            keys: ["Enter"],
          });
        }

        // Give the page a moment to react before capturing the result.
        await this.steel.sessions.computer(this.session!.id, {
          action: "wait",
          duration: 1,
        });

        const screenshot = await this.takeScreenshot();
        return { screenshotBase64: screenshot, url: this.currentUrl };
      }

      case "scroll_document": {
        const direction = args.direction as string;
        let keys: string[];

        if (direction === "down") {
          keys = ["PageDown"];
        } else if (direction === "up") {
          keys = ["PageUp"];
        } else if (direction === "left" || direction === "right") {
          // Horizontal scrolling has no paging key, so scroll at the viewport center.
          const [cx, cy] = this.center();
          const delta = direction === "left" ? -400 : 400;
          const resp: any = await this.steel.sessions.computer(this.session!.id, {
            action: "scroll",
            coordinates: [cx, cy],
            delta_x: delta,
            delta_y: 0,
            screenshot: true,
          });
          return {
            screenshotBase64: resp?.base64_image || (await this.takeScreenshot()),
            url: this.currentUrl,
          };
        } else {
          // Unrecognized direction: fall back to scrolling down.
          keys = ["PageDown"];
        }

        const resp: any = await this.steel.sessions.computer(this.session!.id, {
          action: "press_key",
          keys: keys,
          screenshot: true,
        });
        return {
          screenshotBase64: resp?.base64_image || (await this.takeScreenshot()),
          url: this.currentUrl,
        };
      }

      case "scroll_at": {
        const x = this.denormalizeX(args.x as number);
        const y = this.denormalizeY(args.y as number);
        const direction = args.direction as string;
        // The magnitude is normalized as well; scale it along the axis being scrolled
        // (width for left/right, height for up/down).
        const rawMagnitude = (args.magnitude as number) ?? 800;
        const horizontal = direction === "left" || direction === "right";
        const magnitude = horizontal
          ? this.denormalizeX(rawMagnitude)
          : this.denormalizeY(rawMagnitude);

        let deltaX = 0;
        let deltaY = 0;

        if (direction === "down") deltaY = magnitude;
        else if (direction === "up") deltaY = -magnitude;
        else if (direction === "right") deltaX = magnitude;
        else if (direction === "left") deltaX = -magnitude;

        const resp: any = await this.steel.sessions.computer(this.session!.id, {
          action: "scroll",
          coordinates: [x, y],
          delta_x: deltaX,
          delta_y: deltaY,
          screenshot: true,
        });
        return {
          screenshotBase64: resp?.base64_image || (await this.takeScreenshot()),
          url: this.currentUrl,
        };
      }

      case "wait_5_seconds": {
        const resp: any = await this.steel.sessions.computer(this.session!.id, {
          action: "wait",
          duration: 5,
          screenshot: true,
        });
        return {
          screenshotBase64: resp?.base64_image || (await this.takeScreenshot()),
          url: this.currentUrl,
        };
      }

      case "go_back": {
        // Browser history shortcut: Alt+Left.
        const resp: any = await this.steel.sessions.computer(this.session!.id, {
          action: "press_key",
          keys: ["Alt", "ArrowLeft"],
          screenshot: true,
        });
        return {
          screenshotBase64: resp?.base64_image || (await this.takeScreenshot()),
          url: this.currentUrl,
        };
      }

      case "go_forward": {
        // Browser history shortcut: Alt+Right.
        const resp: any = await this.steel.sessions.computer(this.session!.id, {
          action: "press_key",
          keys: ["Alt", "ArrowRight"],
          screenshot: true,
        });
        return {
          screenshotBase64: resp?.base64_image || (await this.takeScreenshot()),
          url: this.currentUrl,
        };
      }

      case "navigate": {
        let url = args.url as string;
        if (!url.startsWith("http://") && !url.startsWith("https://")) {
          url = "https://" + url;
        }

        // Focus the address bar (Ctrl+L), type the URL, and submit it.
        await this.steel.sessions.computer(this.session!.id, {
          action: "press_key",
          keys: ["Control", "l"],
        });
        await this.steel.sessions.computer(this.session!.id, {
          action: "type_text",
          text: url,
        });
        await this.steel.sessions.computer(this.session!.id, {
          action: "press_key",
          keys: ["Enter"],
        });
        await this.steel.sessions.computer(this.session!.id, {
          action: "wait",
          duration: 2,
        });

        this.currentUrl = url;
        const screenshot = await this.takeScreenshot();
        return { screenshotBase64: screenshot, url: this.currentUrl };
      }

      case "key_combination": {
        // Keys arrive as a "+"-separated string, e.g. "Control+A".
        const keysStr = args.keys as string;
        const keys = keysStr.split("+").map((k) => k.trim());
        const normalizedKeys = this.normalizeKeys(keys);

        const resp: any = await this.steel.sessions.computer(this.session!.id, {
          action: "press_key",
          keys: normalizedKeys,
          screenshot: true,
        });
        return {
          screenshotBase64: resp?.base64_image || (await this.takeScreenshot()),
          url: this.currentUrl,
        };
      }

      case "drag_and_drop": {
        const startX = this.denormalizeX(args.x as number);
        const startY = this.denormalizeY(args.y as number);
        const endX = this.denormalizeX(args.destination_x as number);
        const endY = this.denormalizeY(args.destination_y as number);

        const resp: any = await this.steel.sessions.computer(this.session!.id, {
          action: "drag_mouse",
          path: [
            [startX, startY],
            [endX, endY],
          ],
          screenshot: true,
        });
        return {
          screenshotBase64: resp?.base64_image || (await this.takeScreenshot()),
          url: this.currentUrl,
        };
      }

      default: {
        console.log(`Unknown action: ${name}, taking screenshot`);
        const screenshot = await this.takeScreenshot();
        return { screenshotBase64: screenshot, url: this.currentUrl };
      }
    }
  }

  // Collect every function call the model requested in this turn.
  private extractFunctionCalls(candidate: Candidate): FunctionCall[] {
    const functionCalls: FunctionCall[] = [];
    if (!candidate.content?.parts) return functionCalls;

    for (const part of candidate.content.parts) {
      if (part.functionCall) {
        functionCalls.push(part.functionCall);
      }
    }
    return functionCalls;
  }

  // Join the text parts of the model's turn into a single string.
  private extractText(candidate: Candidate): string {
    if (!candidate.content?.parts) return "";
    const texts: string[] = [];
    for (const part of candidate.content.parts) {
      if (part.text) {
        texts.push(part.text);
      }
    }
    return texts.join(" ").trim();
  }

  // Pair each function call with its result: a functionResponse part carrying
  // the URL, followed by the screenshot as inline image data.
  private buildFunctionResponseParts(
    functionCalls: FunctionCall[],
    results: ActionResult[]
  ): Part[] {
    const parts: Part[] = [];

    for (let i = 0; i < functionCalls.length; i++) {
      const fc = functionCalls[i];
      const result = results[i];

      const functionResponse: FunctionResponse = {
        name: fc.name ?? "",
        response: { url: result.url ?? this.currentUrl },
      };

      parts.push({ functionResponse });
      parts.push({
        inlineData: {
          mimeType: "image/png",
          data: result.screenshotBase64,
        },
      });
    }

    return parts;
  }

  async executeTask(
    task: string,
    printSteps: boolean = true,
    maxIterations: number = 50
  ): Promise<string> {
    this.contents = [
      {
        role: "user",
        parts: [{ text: BROWSER_SYSTEM_PROMPT }, { text: task }],
      },
    ];

    let iterations = 0;
    let consecutiveNoActions = 0;

    console.log(`🎯 Executing task: ${task}`);
    console.log("=".repeat(60));

    while (iterations < maxIterations) {
      iterations++;

      try {
        const response = await this.client.models.generateContent({
          model: MODEL,
          contents: this.contents,
          config: this.config,
        });

        if (!response.candidates || response.candidates.length === 0) {
          console.log("❌ No candidates in response");
          break;
        }

        const candidate = response.candidates[0];

        if (candidate.content) {
          this.contents.push(candidate.content);
        }

        const reasoning = this.extractText(candidate);
        const functionCalls = this.extractFunctionCalls(candidate);

        if (
          !functionCalls.length &&
          !reasoning &&
          candidate.finishReason === FinishReason.MALFORMED_FUNCTION_CALL
        ) {
          console.log("⚠️ Malformed function call, retrying...");
          continue;
        }

        if (!functionCalls.length) {
          if (reasoning) {
            if (printSteps) {
              console.log(`\n💬 ${reasoning}`);
            }
            console.log("✅ Task complete - model provided final response");
            break;
          }

          consecutiveNoActions++;
          if (consecutiveNoActions >= 3) {
            console.log(
              "⚠️ No actions for 3 consecutive iterations - stopping"
            );
            break;
          }
          continue;
        }

        consecutiveNoActions = 0;

        if (printSteps && reasoning) {
          console.log(`\n💭 ${reasoning}`);
        }

        const results: ActionResult[] = [];

        for (const fc of functionCalls) {
          const actionName = fc.name ?? "unknown";
          const actionArgs = fc.args ?? {};

          if (printSteps) {
            console.log(`🔧 ${actionName}(${JSON.stringify(actionArgs)})`);
          }

          const result = await this.executeComputerAction(fc);
          results.push(result);
        }

        const functionResponseParts = this.buildFunctionResponseParts(
          functionCalls,
          results
        );

        this.contents.push({
          role: "user",
          parts: functionResponseParts,
        });
      } catch (error) {
        console.error(`❌ Error during task execution: ${error}`);
        throw error;
      }
    }

    if (iterations >= maxIterations) {
      console.warn(
        `⚠️ Task execution stopped after ${maxIterations} iterations`
      );
    }

    for (let i = this.contents.length - 1; i >= 0; i--) {
      const content = this.contents[i];
      if (content.role === "model") {
        const text = content.parts
          ?.filter((p) => p.text)
          .map((p) => p.text)
          .join(" ")
          .trim();
        if (text) {
          return text;
        }
      }
    }

    return "Task execution completed (no final message)";
  }
}
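
The key_combination branch of the agent splits a string such as "ctrl+a" on "+" and maps each part through a synonym table. The standalone sketch below condenses that idea so it can be tried in isolation. The helper name parseKeyCombination and the constant KEY_SYNONYMS are hypothetical, and the table is abbreviated, so treat this as an illustration rather than the agent's exact code.

```typescript
// Abbreviated synonym table (illustrative); the Agent class carries the full list.
const KEY_SYNONYMS: Record<string, string> = {
  CTRL: "Control",
  CONTROL: "Control",
  CMD: "Meta",
  ESC: "Escape",
  RETURN: "Enter",
  ENTER: "Enter",
  LEFT: "ArrowLeft",
  RIGHT: "ArrowRight",
};

// Hypothetical helper: turn "ctrl+a" into ["Control", "a"].
function parseKeyCombination(combo: string): string[] {
  return combo
    .split("+")
    .map((part) => part.trim())
    .filter((part) => part.length > 0)
    .map((part) => KEY_SYNONYMS[part.toUpperCase()] ?? part);
}

console.log(parseKeyCombination("ctrl+a")); // ["Control", "a"]
console.log(parseKeyCombination("Alt + Left")); // ["Alt", "ArrowLeft"]
```

Keys without an entry in the table, such as single letters, pass through unchanged.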

Step 3: Create the Main Script

Typescript
main.ts
import { Agent } from "./agent";
import { STEEL_API_KEY, GEMINI_API_KEY, TASK } from "./helpers";

async function main(): Promise<void> {
  console.log("🚀 Steel + Gemini Computer Use Assistant");
  console.log("=".repeat(60));

  if (STEEL_API_KEY === "your-steel-api-key-here") {
    console.warn(
      "⚠️ WARNING: Please replace 'your-steel-api-key-here' with your actual Steel API key"
    );
    console.warn(
      "   Get your API key at: https://app.steel.dev/settings/api-keys"
    );
    throw new Error("Set STEEL_API_KEY");
  }

  if (GEMINI_API_KEY === "your-gemini-api-key-here") {
    console.warn(
      "⚠️ WARNING: Please replace 'your-gemini-api-key-here' with your actual Gemini API key"
    );
    console.warn("   Get your API key at: https://aistudio.google.com/apikey");
    throw new Error("Set GEMINI_API_KEY");
  }

  console.log("\nStarting Steel session...");
  const agent = new Agent();

  try {
    await agent.initialize();
    console.log("✅ Steel session started!");

    const startTime = Date.now();
    const result = await agent.executeTask(TASK, true, 50);
    const duration = ((Date.now() - startTime) / 1000).toFixed(1);

    console.log("\n" + "=".repeat(60));
    console.log("🎉 TASK EXECUTION COMPLETED");
    console.log("=".repeat(60));
    console.log(`⏱️ Duration: ${duration} seconds`);
    console.log(`🎯 Task: ${TASK}`);
    console.log(`📋 Result:\n${result}`);
    console.log("=".repeat(60));
  } catch (error) {
    console.log(`❌ Failed to run: ${error}`);
    throw error;
  } finally {
    await agent.cleanup();
  }
}

main()
  .then(() => {
    process.exit(0);
  })
  .catch((error) => {
    console.error("Task execution failed:", error);
    process.exit(1);
  });

Running Your Agent

Execute your script to start an interactive AI browser session:

Terminal
npx ts-node main.ts

The agent will execute the task defined in the TASK environment variable or the default task. You can modify the task by setting the environment variable:

Terminal
export TASK="Research the latest developments in AI"
npx ts-node main.ts
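
The shell export takes effect because helpers.ts reads process.env.TASK and only falls back to the built-in default when it is unset or empty. By default, dotenv also leaves variables that are already set in the environment untouched, so an exported value wins over the entry in .env. A small sketch of that fallback, where resolveTask is a hypothetical helper name used only for illustration:

```typescript
const DEFAULT_TASK = "Go to Steel.dev and find the latest news";

// Hypothetical helper mirroring `process.env.TASK || <default>` in helpers.ts.
function resolveTask(env: Record<string, string | undefined>): string {
  return env.TASK || DEFAULT_TASK;
}

// An explicit value is used as-is.
console.log(resolveTask({ TASK: "Research the latest developments in AI" }));
// A missing or empty value falls back to the default task.
console.log(resolveTask({}));
console.log(resolveTask({ TASK: "" }));
```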

You'll see each action the agent takes displayed in the console, and you can view the live browser session by opening the session URL in your web browser.
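
That console output comes from the agent's request, act, respond loop. The sketch below reproduces the shape of the loop with a stubbed model and a stubbed executor, so it runs without API keys. The names Turn, StubModel, Executor, runLoop, and cannedModel are hypothetical stand-ins rather than Steel or Gemini SDK types, and the sketch omits the retry handling for empty or malformed turns that the real agent performs.

```typescript
// Hypothetical stand-in for one model turn; not an SDK type.
interface Turn {
  text?: string;
  calls: { name: string; args: Record<string, unknown> }[];
}

type StubModel = (history: string[]) => Turn;
type Executor = (name: string, args: Record<string, unknown>) => string;

// Run until the model answers without requesting actions, or the cap is hit.
function runLoop(
  model: StubModel,
  execute: Executor,
  maxIterations = 50
): string {
  const history: string[] = [];
  for (let i = 0; i < maxIterations; i++) {
    const turn = model(history);
    if (turn.calls.length === 0) {
      // No actions requested: treat the text as the final answer.
      return turn.text ?? "Task execution completed (no final message)";
    }
    for (const call of turn.calls) {
      console.log(`${call.name}(${JSON.stringify(call.args)})`);
      // Feed each action's result back so the next turn can see it.
      history.push(execute(call.name, call.args));
    }
  }
  return "Stopped after reaching the iteration limit";
}

// Canned model: navigate once, then report.
const cannedModel: StubModel = (history) =>
  history.length === 0
    ? { calls: [{ name: "navigate", args: { url: "https://steel.dev" } }] }
    : { text: `Done after ${history.length} action(s)`, calls: [] };

const finalAnswer = runLoop(cannedModel, (name) => `${name}: ok`);
console.log(finalAnswer); // "Done after 1 action(s)"
```

In the real agent, the history holds Gemini Content entries, and each action result carries a screenshot instead of a plain string.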

Understanding Gemini's Coordinate System

Gemini's Computer Use model uses a normalized coordinate system where both X and Y coordinates range from 0 to 1000. The agent automatically converts these to actual pixel coordinates based on the viewport size (1280x768 by default).
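
As a rough illustration of that conversion, the sketch below mirrors the agent's denormalizeX and denormalizeY helpers as one standalone function. The Viewport type and the denormalize function are names introduced here for illustration only; the 1280x768 viewport is the default used in this guide.

```typescript
// Illustrative only: `Viewport` and `denormalize` are hypothetical names that
// mirror denormalizeX/denormalizeY in agent.ts.
const MAX_COORDINATE = 1000;

interface Viewport {
  width: number;
  height: number;
}

// Map a normalized (0-1000) point onto pixel coordinates for a viewport.
function denormalize(
  x: number,
  y: number,
  viewport: Viewport
): [number, number] {
  return [
    Math.round((x / MAX_COORDINATE) * viewport.width),
    Math.round((y / MAX_COORDINATE) * viewport.height),
  ];
}

const viewport: Viewport = { width: 1280, height: 768 };
console.log(denormalize(500, 500, viewport)); // screen center: [640, 384]
console.log(denormalize(250, 1000, viewport)); // [320, 768]
```

Because X scales by the viewport width and Y by the height, the same normalized value maps to different pixel distances on each axis.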

Next Steps