Overview
Gemini's Computer Use is an agent that combines vision capabilities with advanced reasoning to control computer interfaces and perform tasks on behalf of users through a continuous action loop.
The Gemini Computer Use integration allows you to connect Gemini 2.5's vision and reasoning capabilities with Steel's reliable browser infrastructure. This integration enables AI agents to:
- Control Steel browser sessions via the Gemini API
- Execute real browser actions like clicking, typing, and scrolling
- Perform complex web tasks such as form filling, searching, and navigation
- Process visual feedback from screenshots to determine next actions
- Handle normalized coordinate systems automatically
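As a rough illustration of the coordinate handling mentioned in the last item, the sketch below maps a normalized point to pixel coordinates in the browser viewport. It assumes the model reports coordinates on a 1000 x 1000 grid (values 0 to 999), which should be checked against Google's current documentation; the `denormalize` helper and the viewport size are hypothetical and not part of either SDK.

```python
def denormalize(x_norm: int, y_norm: int, width: int, height: int,
                grid: int = 1000) -> tuple[int, int]:
    """Map a point on an assumed grid x grid canvas to viewport pixels.

    `grid=1000` is an assumption about the model's output range,
    not a value taken from this page.
    """
    if not (0 <= x_norm < grid and 0 <= y_norm < grid):
        raise ValueError(f"normalized point ({x_norm}, {y_norm}) is outside 0..{grid - 1}")
    # Integer arithmetic avoids floating-point rounding surprises.
    return x_norm * width // grid, y_norm * height // grid


# Example: the centre of the normalized canvas on a 1280x768 viewport.
center = denormalize(500, 500, 1280, 768)
```

The pixel pair returned here is what you would pass to the browser's click or hover call.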
By combining Gemini's Computer Use with Steel's cloud browser infrastructure, you can build robust, scalable web automation solutions that leverage Steel's anti-bot capabilities, proxy management, and sandboxed environments.
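The screenshot-driven loop behind these capabilities can be sketched as follows. This is a minimal outline under stated assumptions, not the actual integration code: `Action`, `run_agent_loop`, and the three callables are hypothetical stand-ins for the Gemini request, the Steel browser action, and the screenshot capture, and the action names are illustrative.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Action:
    name: str                      # illustrative, e.g. "click_at"
    args: dict = field(default_factory=dict)


def run_agent_loop(
    next_action: Callable[[bytes], Optional[Action]],  # stand-in for the model call
    execute: Callable[[Action], None],                 # stand-in for the browser action
    take_screenshot: Callable[[], bytes],              # stand-in for screenshot capture
    max_steps: int = 20,
) -> int:
    """Run screenshot -> decide -> act until the model stops or max_steps is hit.

    Returns the number of actions executed.
    """
    steps = 0
    for _ in range(max_steps):
        screenshot = take_screenshot()
        action = next_action(screenshot)
        if action is None:          # model signals the task is finished
            break
        execute(action)
        steps += 1
    return steps


# Dry run with scripted actions instead of real API calls.
scripted = iter([
    Action("click_at", {"x": 500, "y": 500}),
    Action("type_text_at", {"text": "hello"}),
    None,
])
executed: list[Action] = []
steps = run_agent_loop(lambda shot: next(scripted), executed.append, lambda: b"fake-png")
```

Capping the loop with `max_steps` keeps a confused agent from running a paid browser session indefinitely.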
Requirements & Limitations
- Gemini API Key: Access to the Gemini API with the gemini-2.5-computer-use-preview model
- Steel API Key: Active subscription to Steel
- Python/Node Environment: Support for API clients for both services
- Supported Environments: Works best with Steel's browser environment
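Before starting a session, it can help to confirm that both keys are available. The check below is a small sketch that assumes the keys are supplied through environment variables named `GEMINI_API_KEY` and `STEEL_API_KEY`; those names and the `missing_keys` helper are assumptions, so adjust them to match your setup.

```python
import os
from typing import Mapping

# Assumed variable names; this page does not specify them.
REQUIRED_KEYS = ("GEMINI_API_KEY", "STEEL_API_KEY")


def missing_keys(env: Mapping[str, str] = os.environ) -> list[str]:
    """Return the required key names that are unset or empty in `env`."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]


# Example with an explicit mapping instead of the real environment.
example_missing = missing_keys({"GEMINI_API_KEY": "placeholder"})
```

Calling `missing_keys()` with no arguments checks the real process environment, so a script can exit early with a clear message when the list is non-empty.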
Documentation
Quickstart Guide (Python) → Step-by-step guide to building a simple CUA agent with Steel browser sessions in Python.
Quickstart Guide (Node) → Step-by-step guide to building a simple CUA agent with Steel browser sessions in TypeScript & Node.
Additional Resources
- Gemini Computer Use Documentation: Official documentation from Google
- Steel Sessions API Reference: Technical details for managing Steel browser sessions
- Cookbook Recipe (Python): Working, forkable examples of the integration in Python
- Cookbook Recipe (TS/Node): Working, forkable examples of the integration in TypeScript
- Community Discord: Get help and share your implementations