Drive a mobile browser with Claude Computer Use
Claude Computer Use with Steel for autonomous task execution in mobile browser environments.
Mobile-emulated Chrome, driven by Claude's computer-use tool. The agent loop is the same as the Desktop TS recipe (screenshot in, coordinate out, next screenshot back), but three things change when the surface is a phone: Steel allocates the viewport instead of you, Playwright drives the page over CDP instead of Steel's Input API, and every coordinate is clamped before it touches the browser. Those differences are what this recipe walks through. If you want the loop mechanics and tool-definition details, read the desktop README first.
Mobile session, Playwright driver
SteelBrowser.initialize asks Steel for a mobile device and connects Playwright to the returned CDP socket:
this.session = await this.client.sessions.create({apiTimeout: 900000,solveCaptcha: false,deviceConfig: { device: "mobile" },});const cdpUrl = `${this.session.websocketUrl}&apiKey=${STEEL_API_KEY}`;this.browser = await chromium.connectOverCDP(cdpUrl, { timeout: 60000 });this.page = this.browser.contexts()[0].pages()[0];
deviceConfig.device: "mobile" tells Steel to spin up Chrome with a mobile user agent, mobile viewport, and touch-capable device metrics. The page that comes back renders mobile layouts, serves mobile-tuned assets, and fires touch events when you click.
Choosing Playwright over Steel's Input API buys two things. context.route intercepts every request against a BLOCKED_DOMAINS list and aborts matches before they leave the browser. And SteelBrowser.screenshot falls back to Page.captureScreenshot over a raw CDP session when Playwright's capture trips, which happens occasionally on backgrounded tabs in mobile emulation.
The action switch in executeComputerAction maps Claude's computer-use vocabulary onto Playwright calls: left_click becomes page.mouse.click(x, y), type chunks through page.keyboard.type with a 12 ms delay per keystroke (50-char groups), scroll multiplies scroll_amount by 100 and calls page.mouse.wheel, and key splits on + so ctrl+a routes through CUA_KEY_TO_PLAYWRIGHT_KEY into real modifier-plus-key presses. Taps stay as left_click: Playwright synthesizes a mouse event, mobile Chrome turns it into a touch downstream. No real multitouch, but tap and swipe flows work transparently.
Dimensions come from the session
Desktop recipes hard-code the viewport in the constructor. Mobile inverts that: the constructor holds a placeholder until Steel says what device it allocated.
constructor(startUrl: string = "https://amazon.com") {this.dimensions = [1920, 1080]; // placeholder// ...}async initialize() {this.session = await this.client.sessions.create(sessionParams);this.dimensions = [this.session.dimensions.width,this.session.dimensions.height,];await this.page.setViewportSize({ width, height });}
ClaudeAgent reads those dimensions back through computer.getDimensions() and threads them into both the system prompt and the computer_20251124 tool definition:
const [width, height] = computer.getDimensions();this.tools = [{type: TOOL_TYPE,name: "computer",display_width_px: width,display_height_px: height,display_number: 1,}];
Order matters. Instantiate ClaudeAgent before computer.initialize() completes and the agent captures the 1920x1080 placeholder while the real page is rendered at the mobile size Steel picked. Claude then plans pixels against a canvas that doesn't exist, and every click lands wrong.
Clamping and coordinates
Phone targets are small. Buttons sit at the edge of the viewport, tap regions overlap, and Claude sometimes overshoots the rendered width by a pixel when it aims for a trailing icon. clampCoordinates pins every incoming coordinate to [0, width - 1] x [0, height - 1] and logs when it has to:
private clampCoordinates(x: number, y: number): [number, number] {const clampedX = Math.max(0, Math.min(x, width - 1));const clampedY = Math.max(0, Math.min(y, height - 1));if (x !== clampedX || y !== clampedY) {console.log(`Coordinate clamped: (${x}, ${y}) → (${clampedX}, ${clampedY})`);}return [clampedX, clampedY];}
validateAndGetCoordinates wraps it for click, move, drag, and scroll targets. Occasional clamps are fine; frequent clamps mean the tool definition and the session's real dimensions have drifted. Check that the values you inject into this.tools match session.dimensions exactly.
validateScreenshotDimensions is the paranoid companion. After a screenshot action it re-reads the viewport and compares it to what the agent thinks it's looking at. It currently logs rather than rejects, so treat it as a debug signal, not a guard.
Completion markers
The system prompt in SYSTEM_PROMPT instructs Claude to end every run with TASK_COMPLETED:, TASK_FAILED:, or TASK_ABANDONED:. isTaskComplete inside executeTask scans each assistant turn for those tokens first, then falls back to natural-language patterns (task completed, unable to proceed, in conclusion, etc.). The loop also exits on detectRepetition (>80% word overlap across the last three assistant messages) and a 50-iteration cap, same as the desktop version.
Run it
cd examples/claude-computer-use-mobilecp .env.example .env # set STEEL_API_KEY and ANTHROPIC_API_KEYnpm installnpm start
Keys from app.steel.dev and console.anthropic.com. The session viewer URL prints at startup, open it in another tab to watch the mobile Chrome navigate live. Override the task inline:
TASK="Open amazon.com and find the price of an iPhone 16 Pro Max" npm start
Your output varies. Structure looks like this:
Steel Session created successfully!View live session at: https://app.steel.dev/sessions/ab12cd34...Executing task: Go to amazon.com, search for 'iPhone 16 Pro Max'...============================================================I'll start by taking a screenshot to see the current state.computer({"action":"screenshot"})Taking screenshot with dimensions: 390x844computer({"action":"left_click","coordinate":[195,120]})computer({"action":"type","text":"iPhone 16 Pro Max"})...TASK_COMPLETED: iPhone 16 Pro Max is $1,199 and in stock.TASK EXECUTION COMPLETEDDuration: 96.4 seconds
Expect ~60-180 seconds and 15-40 iterations for a typical mobile browse. Every screenshot rides in the message history as a base64 PNG, so token cost scales with iteration count. The finally block calls computer.cleanup(), which releases the session; the 15-minute apiTimeout in sessions.create is the safety net if that slips.
Make it yours
- Change the start URL.
SteelBrowserdefaults tohttps://amazon.com. Pass a different URL to the constructor inmain, or editstartUrlto match your target. - Tighten or loosen the blocklist.
BLOCKED_DOMAINSflows throughcontext.route. Add hosts to block (subdomains match via.endsWith), remove them to open access. - Tune the system prompt.
SYSTEM_PROMPTteaches Claude the mobile conventions: respect coordinate bounds, never tap browser UI, declareTASK_COMPLETED:when done. The<COORDINATE_SYSTEM>block is rewritten at runtime with the live viewport numbers, so your edits merge with the injected dimensions. - Persist a login. Pass
sessionContextintosessions.createto resume with cookies and local storage. Useful when the mobile site is gated behind auth. See credentials. - Raise the iteration cap.
maxIterations = 50inexecuteTaskis conservative for long mobile flows with lots of scrolling. The fourth argument toagent.executeTasklifts it.
Related
Desktop TS · Desktop Python · Computer use docs · Playwright docs
Related recipes
Drive a browser with Gemini Computer Use
Connect Google's Gemini Computer Use to a Steel browser session for autonomous web interactions.
Drive a browser with Claude Computer Use
Connect Claude to a Steel browser session for autonomous web interactions.
Drive a browser with OpenAI Computer Use
Connect OpenAI's Computer Use Assistant to a Steel browser session for autonomous web interactions.