Build a typed browser agent with the OpenAI Agents SDK
Use Steel with the OpenAI Agents SDK for TypeScript to build typed, tool-using browser agents.
The OpenAI Agents SDK (@openai/agents) is a small runtime for agentic loops. You define an Agent with instructions, a model, tools, and an outputType. You call run(agent, prompt). The SDK handles "pick a tool, call it, feed the result back, repeat" and validates the final message against your schema. It also ships handoffs (one agent calls another), guardrails (pre/post checks), and traces out of the box.
This recipe turns the tool layer into a Steel cloud browser. Four tool() wrappers in index.ts (openSession, navigate, snapshot, extract) shuttle CDP calls between the agent and Playwright. The agent only sees typed arguments in and structured JSON out. Demo task: scan github.com/trending/python and return the top 3 AI/ML repos as a validated FinalReport.
This is a different primitive from OpenAI's computer-use API. Computer-use streams screenshots and mouse coordinates through responses.create and you write the loop yourself. The Agents SDK sits a layer above: typed function tools, no pixels, and the SDK owns the loop.
The four tools
Every tool() call pairs a Zod parameters schema with an async execute. The SDK compiles Zod to OpenAI's strict JSON Schema at registration, which tightens a couple of rules: no .url() format (pass a plain z.string()), and .optional() is rejected. Use .nullable() instead. The attr: z.string().nullable() field inside extract exists for that reason.
openSession creates the Steel session, attaches Playwright over CDP, and stashes session, browser, and page in module scope so the other three tools share them:
```ts
const openSession = tool({
  name: "open_session",
  description:
    "Open a Steel cloud browser session. Call exactly once, before anything else.",
  parameters: z.object({}),
  execute: async () => {
    session = await steel.sessions.create({});
    browser = await chromium.connectOverCDP(
      `${session.websocketUrl}&apiKey=${STEEL_API_KEY}`,
    );
    const ctx = browser.contexts()[0];
    page = ctx.pages()[0] ?? (await ctx.newPage());
    return { sessionId: session.id, liveViewUrl: session.sessionViewerUrl };
  },
});
```

Module-level state is fine here because run() fires once per script. For a long-lived server, move the session into a per-run context object and pass it to each tool's closure.
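A hypothetical sketch of that refactor, SDK-free so it runs anywhere: hold the per-run state in a closure returned by a factory, so two concurrent runs can never clobber each other's session. All names here are illustrative, not part of the starter.

```typescript
// Hypothetical per-run state, replacing module scope with a closure.
type BrowserState = { sessionId: string | null; pageUrl: string | null };

function makeBrowserState() {
  const state: BrowserState = { sessionId: null, pageUrl: null };
  return {
    // Each "tool" closes over this run's state only.
    openSession: (id: string) => { state.sessionId = id; },
    navigate: (url: string) => { state.pageUrl = url; },
    current: () => ({ ...state }),
  };
}

const runA = makeBrowserState();
const runB = makeBrowserState();
runA.openSession("sess-a");
runB.openSession("sess-b");
console.log(runA.current().sessionId, runB.current().sessionId); // sess-a sess-b
```

In a real server you would build the four `tool()` wrappers inside the factory so each run's tools share one state object.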
snapshot is the cheap look-around. It returns title, url, a capped innerText, and the first N anchor tags, all inside one page.evaluate. The agent is told to call it before extract so it can pick a real CSS selector instead of hallucinating one.
extract takes a rowSelector and a fields[] spec (name + selector + optional attr) and returns Record<string, string>[]. The whole query runs inside a single page.evaluate. Serial CDP round-trips against a cloud browser cost roughly 200-300ms each, so N rows by M fields in sequence burns real seconds. Batching collapses it to one round trip.
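A sketch of the kind of function that would run inside that single `page.evaluate`, written here against a minimal structural stand-in for DOM elements so it is runnable outside a browser (the `DomNode` interface and stub nodes are illustrative, not the starter's actual code):

```typescript
// Minimal structural stand-in for a browser Element.
interface DomNode {
  textContent: string;
  getAttribute(name: string): string | null;
  querySelector(sel: string): DomNode | null;
}

type FieldSpec = { name: string; selector: string; attr: string | null };

// One pass over N rows x M fields -- no per-cell round trips.
function extractRows(rows: DomNode[], fields: FieldSpec[]): Record<string, string>[] {
  return rows.map((row) => {
    const out: Record<string, string> = {};
    for (const f of fields) {
      const el = row.querySelector(f.selector);
      out[f.name] = el
        ? f.attr
          ? el.getAttribute(f.attr) ?? ""
          : el.textContent.trim()
        : "";
    }
    return out;
  });
}

// Demo against stub nodes (no browser needed):
const link: DomNode = {
  textContent: "owner/repo",
  getAttribute: (n: string) => (n === "href" ? "/owner/repo" : null),
  querySelector: () => null,
};
const row: DomNode = {
  textContent: "",
  getAttribute: () => null,
  querySelector: (sel: string) => (sel === "h2 a" ? link : null),
};
const rows = extractRows(
  [row],
  [
    { name: "name", selector: "h2 a", attr: null },
    { name: "url", selector: "h2 a", attr: "href" },
  ],
);
console.log(rows); // [{ name: "owner/repo", url: "/owner/repo" }]
```

Inside the real `page.evaluate` the rows would come from `document.querySelectorAll(rowSelector)`; the per-field logic is the same.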
navigate is a thin page.goto wrapper with a 45-second timeout and waitUntil: "domcontentloaded".
The output contract
outputType accepts a Zod schema. The SDK attaches it to the final assistant message and fails the run if the model drifts off-shape:
```ts
const FinalReport = z.object({
  summary: z.string(),
  repos: z
    .array(
      z.object({
        name: z.string(),
        url: z.string(),
        stars: z.string().nullable(),
        description: z.string().nullable(),
      }),
    )
    .min(1)
    .max(5),
});

const agent = new Agent({
  name: "SteelResearch",
  instructions: "You operate a Steel cloud browser via tools. Workflow: ...",
  model: "gpt-5-mini",
  tools: [openSession, navigate, snapshot, extract],
  outputType: FinalReport,
});

const result = await run(agent, "Go to https://github.com/trending/python ...", {
  maxTurns: 15,
});
console.log(result.finalOutput); // typed as z.infer<typeof FinalReport>
```
Some providers force a JSON-only mode when you ask for structured output, which kills tool use. OpenAI does not. The agent calls tools freely throughout the loop and only formats against the schema on the final message.
maxTurns: 15 caps the loop. One turn is one model response, which may contain any number of tool calls. The demo finishes in 4-6 turns.
Run it
```bash
cd examples/openai-agents-ts
cp .env.example .env   # set STEEL_API_KEY and OPENAI_API_KEY
npm install
npx playwright install chromium
npm start
```
Get keys at app.steel.dev/settings/api-keys and platform.openai.com/api-keys. A viewer URL prints as openSession runs. Open it in another tab to watch the browser live.
Your output varies. Structure looks like this:
```text
Steel + OpenAI Agents SDK (TypeScript) Starter
============================================================
open_session: 1432ms
navigate: 2180ms
snapshot: 412ms (3921 chars, 48 links)
extract: 380ms (3 rows)
Agent finished.
{
  "summary": "All three repos focus on LLM tooling written in Python...",
  "repos": [
    { "name": "owner/repo", "url": "https://github.com/...", "stars": "1,204", "description": "..." },
    ...
  ]
}
Releasing Steel session...
Session released. Replay: https://app.steel.dev/sessions/ab12cd34...
```
A full run is ~20-40 seconds. Cost is a few cents of Steel session time plus OpenAI tokens per turn. gpt-5-mini keeps the bill small; the snapshot text and link list dominate each turn's prompt.
The finally block calls steel.sessions.release(). Skip it and the session runs until the default 5-minute timeout while you pay for idle browser time.
Make it yours
- Swap the task and schema. Change the prompt passed to `run()` and rewrite `FinalReport`. The four tools are task-agnostic: any page that yields text plus anchors plus repeating rows works. Pricing pages, job boards, and dashboards fit the same shape.
- Add handoffs. Pass `handoffs: [writerAgent]` on the `Agent`. The SDK routes between agents based on each one's description. Useful when "browse" and "synthesize" want different models or prompts.
- Add a guardrail. Wire `inputGuardrails` or `outputGuardrails` on the `Agent` to vet the user's prompt or the final message. See the guardrails guide.
- Use a stronger model. `model: "gpt-5"` plans better on ambiguous pages at the cost of tokens and latency. For well-structured targets like GitHub trending, `gpt-5-mini` is enough.
- Turn on stealth. Pass `useProxy`, `solveCaptcha`, or a longer `sessionTimeout` to `sessions.create()` for sites with anti-bot.
Related
Python version · OpenAI Agents SDK docs · Computer Use version
The OpenAI Agents SDK runs the tool-call loop so you don't have to. You declare an Agent with tools, a model, and (optionally) a Pydantic output_type. You call Runner.run(agent, input=...) once. The SDK handles every model turn, every tool dispatch, and every schema check until the agent returns a typed final answer. That shape is distinct from OpenAI's Computer Use, where you own the message loop.
This starter wraps a Steel browser as four tools and points the agent at GitHub Trending.
```python
from agents import Agent, Runner, function_tool

agent = Agent(
    name="SteelResearch",
    instructions="You operate a Steel cloud browser via tools. ...",
    model="gpt-5-mini",
    tools=[open_session, navigate, snapshot, extract],
    output_type=FinalReport,
)

result = await Runner.run(agent, input="...", max_turns=15)
final: FinalReport = result.final_output
```
Each tool is a plain async function wrapped with @function_tool. The SDK reads the signature and docstring to build the JSON schema the model sees. output_type=FinalReport forces the last turn to produce a Pydantic-validated object, so result.final_output is typed. max_turns=15 is a hard cap on the loop.
The four tools
open_session creates a Steel session, starts Playwright, connects over CDP, and stashes the browser in module globals:
```python
@function_tool
async def open_session() -> dict:
    """Open a Steel cloud browser session. Call exactly once, before anything else."""
    global _session, _browser, _page, _playwright
    _session = steel.sessions.create()
    _playwright = await async_playwright().start()
    _browser = await _playwright.chromium.connect_over_cdp(
        f"{_session.websocket_url}&apiKey={STEEL_API_KEY}"
    )
    ctx = _browser.contexts[0]
    _page = ctx.pages[0] if ctx.pages else await ctx.new_page()
    return {"session_id": _session.id, "live_view_url": _session.session_viewer_url}
```
Module globals keep the demo readable. For concurrent runs, swap them for the SDK's RunContextWrapper (pass a context object into Runner.run, read it via the wrapper inside each tool). The session viewer URL is returned to the agent so it can mention it; you can also watch the run live at app.steel.dev/sessions/<id>.
navigate is a thin wrapper around page.goto. snapshot returns the page's title, URL, visible text (capped at 4k chars), and the first 50 links. The docstring instructs the agent to call it before extract:
"""Return a readable snapshot of the current page: title, URL, visibletext (capped), and a list of links. Call BEFORE extract so the agentnever has to guess CSS selectors."""
This matters. With only navigate plus extract, the model invents selectors like .trending-repo that don't exist on real pages, calls extract, gets zero rows, retries. snapshot hands it the real DOM signals (visible text, href list) so it picks a selector that actually matches.
extract runs one page.evaluate to pull N rows with M fields each. The inline comment explains why:
```python
# Serial CDP round-trips to Steel's cloud browser are ~200-300ms each,
# so N*M round-trips burns seconds. One evaluate call is <500ms total.
```
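A back-of-envelope check on that comment, assuming a ~250ms midpoint per round trip (the function names are illustrative):

```python
ROUND_TRIP_MS = 250  # assumed midpoint of the 200-300ms range

def serial_cost_ms(rows: int, fields: int) -> int:
    """Latency if every cell is fetched with its own CDP round trip."""
    return rows * fields * ROUND_TRIP_MS

def batched_cost_ms() -> int:
    """Latency when one page.evaluate does all the work in the browser."""
    return ROUND_TRIP_MS

print(serial_cost_ms(25, 4))  # 25000 -> 25 s of pure latency
print(batched_cost_ms())      # 250
```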
Each field is a FieldSpec Pydantic model (name, selector, optional attr). The SDK generates a JSON schema for FieldSpec automatically, so the model sees a typed argument instead of a loose dict.
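A sketch of what `FieldSpec` plausibly looks like, assuming Pydantic v2 (the exact definition lives in the starter's source):

```python
from typing import Optional
from pydantic import BaseModel

class FieldSpec(BaseModel):
    name: str                   # key in the returned row dict
    selector: str               # CSS selector, relative to the row element
    attr: Optional[str] = None  # attribute to read; None means textContent

spec = FieldSpec(name="url", selector="h2 a", attr="href")
print(spec.model_dump())  # {'name': 'url', 'selector': 'h2 a', 'attr': 'href'}
```

Because `attr` defaults to `None`, the model can ask for text content simply by leaving it out.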
The typed output
```python
class Repo(BaseModel):
    name: str
    url: str
    stars: Optional[str] = None
    description: Optional[str] = None

class FinalReport(BaseModel):
    summary: str
    repos: list[Repo] = Field(min_length=1, max_length=5)
```
output_type=FinalReport means the SDK steers the final turn to match the schema and validates the response against FinalReport. On failure, the SDK feeds the error back and the agent corrects. result.final_output in main() is a FinalReport, not a string you have to parse.
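To see what that validation failure looks like, here is a standalone sketch (re-stating the models so the snippet runs on its own, assuming Pydantic v2): a report with zero repos violates `min_length=1`, and the resulting error text is what gets fed back to the model.

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class Repo(BaseModel):
    name: str
    url: str
    stars: Optional[str] = None
    description: Optional[str] = None

class FinalReport(BaseModel):
    summary: str
    repos: list[Repo] = Field(min_length=1, max_length=5)

try:
    FinalReport(summary="nothing found", repos=[])
except ValidationError as exc:
    # The SDK would surface this error so the agent can correct itself.
    print("rejected:", exc.errors()[0]["loc"])  # rejected: ('repos',)
```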
Some providers force JSON-only mode when you request structured output. OpenAI lets you combine output_type with tool calls freely: the agent keeps using snapshot / extract through the loop, then emits a validated FinalReport at the end.
Run it
```bash
cd examples/openai-agents-py
cp .env.example .env   # set STEEL_API_KEY and OPENAI_API_KEY
uv run playwright install chromium
uv run main.py
```
Get keys from app.steel.dev and platform.openai.com. Each tool call prints its latency so you can see where time is going.
Your output varies. Structure looks like this:
```text
Steel + OpenAI Agents SDK (Python) Starter
============================================================
open_session: 2843ms
navigate: 1612ms
snapshot: 487ms (3821 chars, 48 links)
extract: 394ms (3 rows)
Agent finished.
{
  "summary": "Three trending Python repos focused on agentic workflows...",
  "repos": [
    {
      "name": "owner/repo",
      "url": "https://github.com/owner/repo",
      "stars": "1,240",
      "description": "..."
    },
    ...
  ]
}
Releasing Steel session...
Session released. Replay: https://app.steel.dev/sessions/ab12cd34...
```
A run takes ~20 to 40 seconds and 5 to 10 agent turns on GitHub Trending. Cost is a few cents of Steel session time plus OpenAI tokens (gpt-5-mini keeps tokens cheap; the full gpt-5 reasons harder but costs more).
The finally block in main closes Playwright and calls steel.sessions.release(). Steel bills per session-minute, so don't skip it. A released session keeps its replay URL, printed on shutdown.
Tracing and sessions
The Agents SDK ships tracing on by default. Each Runner.run produces a trace with every model call, tool invocation, and token count, viewable at platform.openai.com/traces. No extra code; it's useful when a run goes sideways and you want to see which turn confused the agent.
If you want to chain runs without replaying history, pass a session (e.g. SQLiteSession("my-convo")) to Runner.run. The SDK persists the turn list so the next Runner.run picks up the conversation. That's orthogonal to Steel's browser session: one is agent memory, the other is the Chrome instance.
Make it yours
- Swap the task. Change the `input=` string in `main()` and the `FinalReport` schema. Tools stay the same; the agent re-plans.
- Add a tool. Write an async function, decorate with `@function_tool`, add it to `tools=[...]`. A useful fifth tool is `click(selector: str)` that calls `page.click` and waits for navigation.
- Hand off to a specialist. The SDK supports handoffs: define a second `Agent` (say, a `Summarizer` with no tools) and list it in `handoffs=[...]` on the research agent. The browser agent transfers control once it has data.
- Add a guardrail. Attach an input or output guardrail to reject off-topic requests or validate the `FinalReport` before it returns.
- Swap the model. `model="gpt-5"` for harder reasoning, `"gpt-5-mini"` (default) for speed and cost.
- Raise `max_turns`. 15 is plenty for single-page extraction. Multi-page flows (login, then extract, then paginate) want 25 to 40.
- Use `context`. Replace module globals with a dataclass passed to `Runner.run(agent, input=..., context=my_ctx)`. Each tool reads it via `RunContextWrapper`. Needed for concurrent runs.
Related
TypeScript version · OpenAI Computer Use (Python) for the raw message loop · OpenAI Agents SDK docs
Related recipes
Build a typed browser agent with Pydantic AI
Use Steel with Pydantic AI to build typed, provider-agnostic browser agents with dependency injection.
Build a typed browser agent with LangGraph
Use Steel with LangGraph to build a typed browser agent with an explicit state-machine loop and a structured-output formatter node.
Build a typed browser agent with Mastra
Use Steel with Mastra to build a typed browser agent with the Mastra Model Router and Studio playground.