Build a typed browser agent with the OpenAI Agents SDK
Use Steel with the OpenAI Agents SDK for TypeScript to build typed, tool-using browser agents.
The OpenAI Agents SDK (@openai/agents) is a small runtime for agentic loops. You define an Agent with instructions, a model, tools, and an outputType. You call run(agent, prompt). The SDK handles "pick a tool, call it, feed the result back, repeat" and validates the final message against your schema. It also ships handoffs (one agent calls another), guardrails (pre/post checks), and traces out of the box.
This recipe turns the tool layer into a Steel cloud browser. Four tool() wrappers in index.ts (openSession, navigate, snapshot, extract) shuttle CDP calls between the agent and Playwright. The agent only sees typed arguments in and structured JSON out. Demo task: scan github.com/trending/python and return the top 3 AI/ML repos as a validated FinalReport.
This is a different primitive from OpenAI's computer-use API. Computer-use streams screenshots and mouse coordinates through responses.create and you write the loop yourself. The Agents SDK sits a layer above: typed function tools, no pixels, and the SDK owns the loop.
The four tools
Every tool() call pairs a Zod parameters schema with an async execute. The SDK compiles Zod to OpenAI's strict JSON Schema at registration, which tightens a couple of rules: no .url() format (pass a plain z.string()), and .optional() is rejected. Use .nullable() instead. The attr: z.string().nullable() field inside extract exists for that reason.
openSession creates the Steel session, attaches Playwright over CDP, and stashes session, browser, and page in module scope so the other three tools share them:
```ts
const openSession = tool({
  name: "open_session",
  description:
    "Open a Steel cloud browser session. Call exactly once, before anything else.",
  parameters: z.object({}),
  execute: async () => {
    session = await steel.sessions.create({});
    browser = await chromium.connectOverCDP(
      `${session.websocketUrl}&apiKey=${STEEL_API_KEY}`,
    );
    const ctx = browser.contexts()[0];
    page = ctx.pages()[0] ?? (await ctx.newPage());
    return { sessionId: session.id, liveViewUrl: session.sessionViewerUrl };
  },
});
```

Module-level state is fine here because run() fires once per script. For a long-lived server, move the session into a per-run context object and pass it to each tool's closure.
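A hypothetical sketch of that refactor, SDK-free so it runs anywhere: hold the per-run state in a closure returned by a factory, so two concurrent runs can never clobber each other's session. All names here are illustrative, not part of the starter.

```typescript
// Hypothetical per-run state, replacing module scope with a closure.
type BrowserState = { sessionId: string | null; pageUrl: string | null };

function makeBrowserState() {
  const state: BrowserState = { sessionId: null, pageUrl: null };
  return {
    // Each "tool" closes over this run's state only.
    openSession: (id: string) => { state.sessionId = id; },
    navigate: (url: string) => { state.pageUrl = url; },
    current: () => ({ ...state }),
  };
}

const runA = makeBrowserState();
const runB = makeBrowserState();
runA.openSession("sess-a");
runB.openSession("sess-b");
console.log(runA.current().sessionId, runB.current().sessionId); // sess-a sess-b
```

In a real server you would build the four `tool()` wrappers inside the factory so each run's tools share one state object.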
snapshot is the cheap look-around. It returns title, url, a capped innerText, and the first N anchor tags, all inside one page.evaluate. The agent is told to call it before extract so it can pick a real CSS selector instead of hallucinating one.
extract takes a rowSelector and a fields[] spec (name + selector + optional attr) and returns Record<string, string>[]. The whole query runs inside a single page.evaluate. Serial CDP round-trips against a cloud browser cost roughly 200-300ms each, so N rows by M fields in sequence burns real seconds. Batching collapses it to one round trip.
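A sketch of the kind of function that would run inside that single `page.evaluate`, written here against a minimal structural stand-in for DOM elements so it is runnable outside a browser (the `DomNode` interface and stub nodes are illustrative, not the starter's actual code):

```typescript
// Minimal structural stand-in for a browser Element.
interface DomNode {
  textContent: string;
  getAttribute(name: string): string | null;
  querySelector(sel: string): DomNode | null;
}

type FieldSpec = { name: string; selector: string; attr: string | null };

// One pass over N rows x M fields -- no per-cell round trips.
function extractRows(rows: DomNode[], fields: FieldSpec[]): Record<string, string>[] {
  return rows.map((row) => {
    const out: Record<string, string> = {};
    for (const f of fields) {
      const el = row.querySelector(f.selector);
      out[f.name] = el
        ? f.attr
          ? el.getAttribute(f.attr) ?? ""
          : el.textContent.trim()
        : "";
    }
    return out;
  });
}

// Demo against stub nodes (no browser needed):
const link: DomNode = {
  textContent: "owner/repo",
  getAttribute: (n: string) => (n === "href" ? "/owner/repo" : null),
  querySelector: () => null,
};
const row: DomNode = {
  textContent: "",
  getAttribute: () => null,
  querySelector: (sel: string) => (sel === "h2 a" ? link : null),
};
const rows = extractRows(
  [row],
  [
    { name: "name", selector: "h2 a", attr: null },
    { name: "url", selector: "h2 a", attr: "href" },
  ],
);
console.log(rows); // [{ name: "owner/repo", url: "/owner/repo" }]
```

Inside the real `page.evaluate` the rows would come from `document.querySelectorAll(rowSelector)`; the per-field logic is the same.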
navigate is a thin page.goto wrapper with a 45-second timeout and waitUntil: "domcontentloaded".
The output contract
outputType accepts a Zod schema. The SDK attaches it to the final assistant message and fails the run if the model drifts off-shape:
```ts
const FinalReport = z.object({
  summary: z.string(),
  repos: z
    .array(
      z.object({
        name: z.string(),
        url: z.string(),
        stars: z.string().nullable(),
        description: z.string().nullable(),
      }),
    )
    .min(1)
    .max(5),
});

const agent = new Agent({
  name: "SteelResearch",
  instructions: "You operate a Steel cloud browser via tools. Workflow: ...",
  model: "gpt-5-mini",
  tools: [openSession, navigate, snapshot, extract],
  outputType: FinalReport,
});

const result = await run(agent, "Go to https://github.com/trending/python ...", {
  maxTurns: 15,
});
console.log(result.finalOutput); // typed as z.infer<typeof FinalReport>
```
Some providers force a JSON-only mode when you ask for structured output, which kills tool use. OpenAI does not. The agent calls tools freely throughout the loop and only formats against the schema on the final message.
maxTurns: 15 caps the loop. One turn is one model response, which may contain any number of tool calls. The demo finishes in 4-6 turns.
Run it
```bash
cd examples/openai-agents-ts
cp .env.example .env   # set STEEL_API_KEY and OPENAI_API_KEY
npm install
npx playwright install chromium
npm start
```
Get keys at app.steel.dev/settings/api-keys and platform.openai.com/api-keys. A viewer URL prints as openSession runs. Open it in another tab to watch the browser live.
Your output varies. Structure looks like this:
```text
Steel + OpenAI Agents SDK (TypeScript) Starter
============================================================
open_session: 1432ms
navigate: 2180ms
snapshot: 412ms (3921 chars, 48 links)
extract: 380ms (3 rows)
Agent finished.
{
  "summary": "All three repos focus on LLM tooling written in Python...",
  "repos": [
    { "name": "owner/repo", "url": "https://github.com/...", "stars": "1,204", "description": "..." },
    ...
  ]
}
Releasing Steel session...
Session released. Replay: https://app.steel.dev/sessions/ab12cd34...
```
A full run is ~20-40 seconds. Cost is a few cents of Steel session time plus OpenAI tokens per turn. gpt-5-mini keeps the bill small; the snapshot text and link list dominate each turn's prompt.
The finally block calls steel.sessions.release(). Skip it and the session runs until the default 5-minute timeout while you pay for idle browser time.
Make it yours
- Swap the task and schema. Change the prompt passed to `run()` and rewrite `FinalReport`. The four tools are task-agnostic: any page that yields text plus anchors plus repeating rows works. Pricing pages, job boards, and dashboards fit the same shape.
- Add handoffs. Pass `handoffs: [writerAgent]` on the `Agent`. The SDK routes between agents based on each one's description. Useful when "browse" and "synthesize" want different models or prompts.
- Add a guardrail. Wire `inputGuardrails` or `outputGuardrails` on the `Agent` to vet the user's prompt or the final message. See the guardrails guide.
- Use a stronger model. `model: "gpt-5"` plans better on ambiguous pages at the cost of tokens and latency. For well-structured targets like GitHub trending, `gpt-5-mini` is enough.
- Turn on stealth. Pass `useProxy`, `solveCaptcha`, or a longer `sessionTimeout` to `sessions.create()` for sites with anti-bot.
Related
Python version · OpenAI Agents SDK docs · Computer Use version
The OpenAI Agents SDK runs the tool-call loop so you don't have to. You declare an Agent with tools, a model, and (optionally) a Pydantic output_type. You call Runner.run(agent, input=...) once. The SDK handles every model turn, every tool dispatch, and every schema check until the agent returns a typed final answer. That shape is distinct from OpenAI's Computer Use, where you own the message loop.
This starter wraps a Steel browser as four tools and points the agent at GitHub Trending.
```python
from agents import Agent, Runner, function_tool

agent = Agent(
    name="SteelResearch",
    instructions="You operate a Steel cloud browser via tools. ...",
    model="gpt-5-mini",
    tools=[open_session, navigate, snapshot, extract],
    output_type=FinalReport,
)

result = await Runner.run(agent, input="...", max_turns=15)
final: FinalReport = result.final_output
```
Each tool is a plain async function wrapped with @function_tool. The SDK reads the signature and docstring to build the JSON schema the model sees. output_type=FinalReport forces the last turn to produce a Pydantic-validated object, so result.final_output is typed. max_turns=15 is a hard cap on the loop.
The four tools
open_session creates a Steel session, starts Playwright, connects over CDP, and stashes the browser in module globals:
```python
@function_tool
async def open_session() -> dict:
    """Open a Steel cloud browser session. Call exactly once, before anything else."""
    global _session, _browser, _page, _playwright
    _session = steel.sessions.create()
    _playwright = await async_playwright().start()
    _browser = await _playwright.chromium.connect_over_cdp(
        f"{_session.websocket_url}&apiKey={STEEL_API_KEY}"
    )
    ctx = _browser.contexts[0]
    _page = ctx.pages[0] if ctx.pages else await ctx.new_page()
    return {"session_id": _session.id, "live_view_url": _session.session_viewer_url}
```
Module globals keep the demo readable. For concurrent runs, swap them for the SDK's RunContextWrapper (pass a context object into Runner.run, read it via the wrapper inside each tool). The session viewer URL is returned to the agent so it can mention it; you can also watch the run live at app.steel.dev/sessions/<id>.
navigate is a thin wrapper around page.goto. snapshot returns the page's title, URL, visible text (capped at 4k chars), and the first 50 links. The docstring instructs the agent to call it before extract:
"""Return a readable snapshot of the current page: title, URL, visibletext (capped), and a list of links. Call BEFORE extract so the agentnever has to guess CSS selectors."""
This matters. With only navigate plus extract, the model invents selectors like .trending-repo that don't exist on real pages, calls extract, gets zero rows, retries. snapshot hands it the real DOM signals (visible text, href list) so it picks a selector that actually matches.
extract runs one page.evaluate to pull N rows with M fields each. The inline comment explains why:
```python
# Serial CDP round-trips to Steel's cloud browser are ~200-300ms each,
# so N*M round-trips burns seconds. One evaluate call is <500ms total.
```
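A back-of-envelope check on that comment, assuming a ~250ms midpoint per round trip (the function names are illustrative):

```python
ROUND_TRIP_MS = 250  # assumed midpoint of the 200-300ms range

def serial_cost_ms(rows: int, fields: int) -> int:
    """Latency if every cell is fetched with its own CDP round trip."""
    return rows * fields * ROUND_TRIP_MS

def batched_cost_ms() -> int:
    """Latency when one page.evaluate does all the work in the browser."""
    return ROUND_TRIP_MS

print(serial_cost_ms(25, 4))  # 25000 -> 25 s of pure latency
print(batched_cost_ms())      # 250
```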
Each field is a FieldSpec Pydantic model (name, selector, optional attr). The SDK generates a JSON schema for FieldSpec automatically, so the model sees a typed argument instead of a loose dict.
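A sketch of what `FieldSpec` plausibly looks like, assuming Pydantic v2 (the exact definition lives in the starter's source):

```python
from typing import Optional
from pydantic import BaseModel

class FieldSpec(BaseModel):
    name: str                   # key in the returned row dict
    selector: str               # CSS selector, relative to the row element
    attr: Optional[str] = None  # attribute to read; None means textContent

spec = FieldSpec(name="url", selector="h2 a", attr="href")
print(spec.model_dump())  # {'name': 'url', 'selector': 'h2 a', 'attr': 'href'}
```

Because `attr` defaults to `None`, the model can ask for text content simply by leaving it out.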
The typed output
```python
class Repo(BaseModel):
    name: str
    url: str
    stars: Optional[str] = None
    description: Optional[str] = None

class FinalReport(BaseModel):
    summary: str
    repos: list[Repo] = Field(min_length=1, max_length=5)
```
output_type=FinalReport means the SDK steers the final turn to match the schema and validates the response against FinalReport. On failure, the SDK feeds the error back and the agent corrects. result.final_output in main() is a FinalReport, not a string you have to parse.
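To see what that validation failure looks like, here is a standalone sketch (re-stating the models so the snippet runs on its own, assuming Pydantic v2): a report with zero repos violates `min_length=1`, and the resulting error text is what gets fed back to the model.

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError

class Repo(BaseModel):
    name: str
    url: str
    stars: Optional[str] = None
    description: Optional[str] = None

class FinalReport(BaseModel):
    summary: str
    repos: list[Repo] = Field(min_length=1, max_length=5)

try:
    FinalReport(summary="nothing found", repos=[])
except ValidationError as exc:
    # The SDK would surface this error so the agent can correct itself.
    print("rejected:", exc.errors()[0]["loc"])  # rejected: ('repos',)
```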
Some providers force JSON-only mode when you request structured output. OpenAI lets you combine output_type with tool calls freely: the agent keeps using snapshot / extract through the loop, then emits a validated FinalReport at the end.
Run it
```bash
cd examples/openai-agents-py
cp .env.example .env   # set STEEL_API_KEY and OPENAI_API_KEY
uv run playwright install chromium
uv run main.py
```
Get keys from app.steel.dev and platform.openai.com. Each tool call prints its latency so you can see where time is going.
Your output varies. Structure looks like this:
```text
Steel + OpenAI Agents SDK (Python) Starter
============================================================
open_session: 2843ms
navigate: 1612ms
snapshot: 487ms (3821 chars, 48 links)
extract: 394ms (3 rows)
Agent finished.
{
  "summary": "Three trending Python repos focused on agentic workflows...",
  "repos": [
    {
      "name": "owner/repo",
      "url": "https://github.com/owner/repo",
      "stars": "1,240",
      "description": "..."
    },
    ...
  ]
}
Releasing Steel session...
Session released. Replay: https://app.steel.dev/sessions/ab12cd34...
```
A run takes ~20 to 40 seconds and 5 to 10 agent turns on GitHub Trending. Cost is a few cents of Steel session time plus OpenAI tokens (gpt-5-mini keeps tokens cheap; the full gpt-5 reasons harder but costs more).
The finally block in main closes Playwright and calls steel.sessions.release(). Steel bills per session-minute, so don't skip it. A released session keeps its replay URL, printed on shutdown.
Tracing and sessions
The Agents SDK ships tracing on by default. Each Runner.run produces a trace with every model call, tool invocation, and token count, viewable at platform.openai.com/traces. No extra code; it's useful when a run goes sideways and you want to see which turn confused the agent.
If you want to chain runs without replaying history, pass a session (e.g. SQLiteSession("my-convo")) to Runner.run. The SDK persists the turn list so the next Runner.run picks up the conversation. That's orthogonal to Steel's browser session: one is agent memory, the other is the Chrome instance.
Make it yours
- Swap the task. Change the `input=` string in `main()` and the `FinalReport` schema. Tools stay the same; the agent re-plans.
- Add a tool. Write an async function, decorate with `@function_tool`, add it to `tools=[...]`. A useful fifth tool is `click(selector: str)` that calls `page.click` and waits for navigation.
- Hand off to a specialist. The SDK supports handoffs: define a second `Agent` (say, a `Summarizer` with no tools) and list it in `handoffs=[...]` on the research agent. The browser agent transfers control once it has data.
- Add a guardrail. Attach an input or output guardrail to reject off-topic requests or validate the `FinalReport` before it returns.
- Swap the model. `model="gpt-5"` for harder reasoning, `"gpt-5-mini"` (default) for speed and cost.
- Raise `max_turns`. 15 is plenty for single-page extraction. Multi-page flows (login, then extract, then paginate) want 25 to 40.
- Use `context`. Replace module globals with a dataclass passed to `Runner.run(agent, input=..., context=my_ctx)`. Each tool reads it via `RunContextWrapper`. Needed for concurrent runs.
Related
TypeScript version · OpenAI Computer Use (Python) for the raw message loop · OpenAI Agents SDK docs
Related recipes
Build a typed browser agent with Pydantic AI
Use Steel with Pydantic AI to build typed, provider-agnostic browser agents with dependency injection.
Build a typed browser agent with LangGraph
Use Steel with LangGraph to build a typed browser agent with an explicit state-machine loop and a structured-output formatter node.
Build a typed browser agent with Mastra
Use Steel with Mastra to build a typed browser agent with the Mastra Model Router and Studio playground.