# Scrape a page to Markdown, screenshot, and PDF
URL: /cookbook/scrape

---
title: Scrape a page to Markdown, screenshot, and PDF
description: "Use the Steel TypeScript SDK's direct API to scrape a page to clean Markdown for LLM context, plus screenshot and PDF, with no browser library."
---

<RecipeJsonLd slug="scrape" title={"Scrape a page to Markdown, screenshot, and PDF"} description={"Use the Steel TypeScript SDK's direct API to scrape a page to clean Markdown for LLM context, plus screenshot and PDF, with no browser library."} authors={[{"handle":"junhsss","name":"Jun Ryu"}]} datePublished="2026-06-23" dateModified="2026-06-23" sourceUrl="https://github.com/steel-dev/steel-cookbook/tree/3d4db4fa997d1895d84d9d8106eaf25d97a60192/examples/scrape-ts" />

<Tabs items={['TypeScript', 'Python', 'Rust', 'Go']} groupId="lang" persist updateAnchor className="cookbook-concept-tabs">

<Tab id="typescript" className="cookbook-concept-tab">

<RecipeMeta href="https://github.com/steel-dev/steel-cookbook/tree/3d4db4fa997d1895d84d9d8106eaf25d97a60192/examples/scrape-ts" path="examples/scrape-ts" authors={[{"handle":"junhsss","name":"Jun Ryu","avatar":"https://github.com/junhsss.png?size=40"}]} updated="2026-06-23" />

<RecipeQuickstart slug="scrape-ts" />

`client.scrape()` takes a URL and returns the page already converted to Markdown. That matters because Markdown is the format large language models read best: headings, lists, and links survive, while the script tags, tracking pixels, and nav chrome that bloat a raw HTML dump are gone. You get a string you can drop straight into a prompt, with no headless Chrome on your machine and no DOM parsing in your code.

```typescript
const scraped = await client.scrape({
  url: TARGET_URL,
  format: ["markdown"],
});

const markdown = scraped.content.markdown ?? "";
```

`scrape()` runs the fetch and the cleanup on Steel's side, so there is no session to create, connect to, or release. One HTTP call in, structured content out. The sibling `client.screenshot()` and `client.pdf()` calls render the same page two other ways.

## Markdown for model context

The reason to reach for `scrape()` over a browser library is the format. A raw page is mostly markup a model has to wade through: a single news article can be tens of thousands of tokens of `<div>` soup before the first sentence. Markdown collapses that to the text, the structure, and the links, so you spend tokens on content instead of tags. The wiring is small once you have the string:

```typescript
const { content, metadata } = await client.scrape({
  url: TARGET_URL,
  format: ["markdown"],
});

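// `llm` is a placeholder for whichever chat client you use.
const answer = await llm.chat({
  messages: [
    { role: "system", content: "Answer using only the page below." },
    // Fall back to empty strings so a missing field never prints "undefined".
    { role: "user", content: `# ${metadata.title ?? ""}\n\n${content.markdown ?? ""}` },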
  ],
});
```

That is the whole integration: scrape to Markdown, prepend the title, hand it to a model. No selectors, no `page.evaluate`, no waiting on a DOM you do not control.
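Long pages can still overrun a context window even as Markdown. The sketch below shows one way to cap the page text before it goes into the user message. Both helpers are hypothetical (not part of the SDK or the recipe), and the 20,000-character default is an arbitrary assumption standing in for whatever budget your model allows.

```typescript
// Trim Markdown to a character budget, cutting at a paragraph break when possible.
// The budget applies to the page text; the "[truncated]" marker is added on top.
function fitToBudget(markdown: string, maxChars: number): string {
  if (markdown.length <= maxChars) return markdown;

  const slice = markdown.slice(0, maxChars);
  const lastBreak = slice.lastIndexOf("\n\n");
  // Use a hard cut when no paragraph break falls in the second half of the slice.
  const cut = lastBreak > maxChars / 2 ? lastBreak : maxChars;
  return slice.slice(0, cut).trimEnd() + "\n\n[truncated]";
}

// Build the user message from the scraped title and Markdown.
// maxChars is an assumed default, not an SDK setting.
function buildUserMessage(
  title: string | undefined,
  markdown: string | undefined,
  maxChars = 20000,
): string {
  const body = fitToBudget(markdown ?? "", maxChars);
  return title ? `# ${title}\n\n${body}` : body;
}
```

Calling `buildUserMessage(metadata.title, content.markdown)` yields the same `# title` plus body layout as the template string above, with the cap applied.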

One failure mode to plan for: a heavily client-rendered page can return near-empty Markdown if the content paints after the initial load. When `content.markdown` comes back short for a site you know is rich, add `delay` (in milliseconds) to the `scrape()` call so the page settles before capture. Check `metadata.statusCode` too: when the target answers with a 403 or a soft block, the `scrape()` call itself still succeeds, but it hands you the block page's text, not the content you wanted.
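Those two checks can be folded into a small guard that runs before the Markdown reaches a prompt. This is a minimal sketch, not code from the recipe: `ScrapeLike` is a hypothetical type that mirrors only the `metadata.statusCode` and `content.markdown` fields described here, and the 200-character floor is an arbitrary assumption to tune per site.

```typescript
// Hypothetical shape: only the two response fields this check reads.
interface ScrapeLike {
  metadata?: { statusCode?: number };
  content?: { markdown?: string };
}

type Verdict =
  | { ok: true; markdown: string }
  | { ok: false; reason: "blocked" | "thin"; detail: string };

// minChars is an assumed threshold, not an SDK constant.
function assessScrape(scraped: ScrapeLike, minChars = 200): Verdict {
  const status = scraped.metadata?.statusCode;
  const markdown = (scraped.content?.markdown ?? "").trim();

  // A 4xx/5xx from the target means the body is an error or block page.
  if (status !== undefined && status >= 400) {
    return { ok: false, reason: "blocked", detail: `target returned HTTP ${status}` };
  }
  // Very short Markdown suggests the content painted after capture.
  if (markdown.length < minChars) {
    return {
      ok: false,
      reason: "thin",
      detail: `only ${markdown.length} characters; consider adding delay`,
    };
  }
  return { ok: true, markdown };
}
```

A `thin` verdict is the cue to retry with `delay`; a `blocked` verdict is the cue to stop before the block page's text lands in a prompt. A soft block that answers with HTTP 200 and a long body will still pass this guard, so treat it as a first filter only.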

## What you get back

`format` is an array, so you can ask for more than one representation in a single call: `["markdown", "html", "cleaned_html", "readability"]`. Each lands under `content` on the response (`content.markdown`, `content.html`, and so on), and the field is undefined when you did not request that format, which is why the example reads `content.markdown ?? ""`.

The response carries more than the body. `scraped.metadata` holds the page `title`, `description`, `statusCode`, Open Graph tags, and the canonical URL. `scraped.links` is a flat array of `{ text, url }` for every link on the page, handy when you want an LLM to pick a next page to visit. The example prints the status code, title, link count, and the first 500 characters of Markdown so you can see the shape without dumping a whole article to the terminal.
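To hand `links` to a model as a "pick the next page" menu, it usually helps to resolve, filter, and deduplicate them first. The sketch below assumes each entry has the `{ text, url }` shape described above; whether the SDK returns relative or absolute URLs is not stated here, so the code handles both. `PageLink`, `candidateLinks`, and `linksToPrompt` are hypothetical names.

```typescript
// Hypothetical shape mirroring the `{ text, url }` entries described above.
interface PageLink {
  text: string;
  url: string;
}

// Resolve each link against the page URL, keep http(s) only, drop duplicates.
function candidateLinks(links: PageLink[], pageUrl: string, limit = 20): PageLink[] {
  const seen = new Set<string>();
  const out: PageLink[] = [];

  for (const link of links) {
    let resolved: URL;
    try {
      resolved = new URL(link.url, pageUrl); // handles relative and absolute hrefs
    } catch {
      continue; // skip hrefs that cannot be parsed
    }
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") continue;

    resolved.hash = ""; // same document, different anchor
    const key = resolved.toString();
    const text = link.text.trim();
    if (seen.has(key) || text === "") continue;

    seen.add(key);
    out.push({ text, url: key });
    if (out.length >= limit) break;
  }
  return out;
}

// Render the candidates as a numbered list a model can choose from.
function linksToPrompt(links: PageLink[]): string {
  return links.map((l, i) => `${i + 1}. ${l.text} (${l.url})`).join("\n");
}
```

Passing `linksToPrompt(candidateLinks(scraped.links, TARGET_URL))` alongside the Markdown keeps the menu short and free of `mailto:` or fragment-only entries.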

`screenshot()` and `pdf()` differ from `scrape()` in one way worth knowing up front: they return a hosted URL, not bytes. `shot.url` and `pdf.url` point at the rendered artifact on Steel's storage, so the example logs the links rather than writing files. If you want the bytes on disk, fetch the URL yourself. The Python sibling does exactly that.
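Fetching the bytes yourself takes only a few lines. The helper below is a sketch, not code from the recipe: `download` and `FetchLike` are hypothetical names, and it assumes Node 18 or later for the global `fetch`. The fetch function is injectable so the helper can be exercised without touching the network.

```typescript
import { writeFile } from "node:fs/promises";

// Minimal slice of the fetch contract this helper relies on.
type FetchLike = (url: string) => Promise<{
  ok: boolean;
  status: number;
  arrayBuffer(): Promise<ArrayBuffer>;
}>;

// Fetch a hosted artifact and write its bytes to disk.
// Returns the number of bytes written.
async function download(
  url: string,
  path: string,
  fetchImpl: FetchLike = fetch,
): Promise<number> {
  const res = await fetchImpl(url);
  if (!res.ok) {
    throw new Error(`download failed: HTTP ${res.status} for ${url}`);
  }
  const bytes = new Uint8Array(await res.arrayBuffer());
  await writeFile(path, bytes);
  return bytes.byteLength;
}
```

With the variables from the example, `await download(shot.url, "screenshot.png")` and `await download(pdf.url, "page.pdf")` would save both artifacts next to the script.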

## Run it

```bash
cd examples/scrape-ts
cp .env.example .env          # set STEEL_API_KEY
npm install
npm start
```

Get a key at [app.steel.dev/settings/api-keys](https://app.steel.dev/settings/api-keys). `TARGET_URL` in `.env` is optional and defaults to Hacker News.

Your output varies. Structure looks like this:

```text
Steel Scrape API (TypeScript)
============================================================

Scraping https://news.ycombinator.com to markdown...
HTTP 200 | Hacker News
Links found: 174
Markdown length: 6841 characters

--- Markdown preview (first 500 chars) ---
# Hacker News

* [new](newest)
* [past](front)
* [comments](newcomments)
* [ask](ask)
* [show](show)
...
--- end preview ---

Capturing a full-page screenshot...
Screenshot hosted at: https://steel-screenshots.s3.amazonaws.com/...

Rendering the page to PDF...
PDF hosted at: https://steel-screenshots.s3.amazonaws.com/...

Done. Feed the markdown straight into an LLM prompt.
```

Each of the three calls is one billed request against Steel, so a full run costs a few cents of browser time. There is no session left open to leak: `scrape()`, `screenshot()`, and `pdf()` each return when the work is finished, so unlike the browser-driving recipes there is no `release()` to forget.

## Make it yours

- **Pipe Markdown into a model.** Pass `markdown` as the user message to your LLM of choice and ask it to summarize the page or pull out structured fields. This is the whole reason to scrape to Markdown instead of HTML.
- **Ask for several formats at once.** Set `format: ["markdown", "html"]` when you want the clean text for the model and the raw HTML for a fallback parser, both from a single request.
- **Bundle artifacts into the scrape.** Instead of separate `screenshot()` and `pdf()` calls, pass `screenshot: true` and `pdf: true` to `scrape()`. The URLs come back on `scraped.screenshot` and `scraped.pdf`, which is one billed request instead of three.
- **Get past anti-bot pages.** Add `useProxy: true` to route through Steel's residential proxies, or `delay: 3000` to wait for client-side rendering before the capture.
- **Pick a region.** `region` accepts values like `"iad"` or `"fra"` to run the fetch closer to the target or to your users.
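The `delay` and `useProxy` options combine naturally into an escalation ladder: try the plain request first and add options only when the result looks thin or blocked. The sketch below is one hedged way to wire that up. `ScrapeOptions`, `ScrapeResult`, and `scrapeWithEscalation` are hypothetical and cover only the options and fields named in this recipe, the 200-character floor is an assumption, and the scrape function is passed in so the logic does not depend on the SDK's exact types.

```typescript
// Hypothetical types: just the options and fields named in this recipe.
interface ScrapeOptions {
  url: string;
  format: string[];
  delay?: number;
  useProxy?: boolean;
}
interface ScrapeResult {
  metadata: { statusCode?: number };
  content: { markdown?: string };
}
type ScrapeFn = (options: ScrapeOptions) => Promise<ScrapeResult>;

// Try the plain request first, then add delay, then add the proxy.
async function scrapeWithEscalation(
  scrape: ScrapeFn,
  url: string,
  minChars = 200, // assumed threshold for "looks like real content"
): Promise<{ result: ScrapeResult; attempts: number }> {
  const base: ScrapeOptions = { url, format: ["markdown"] };
  const ladder: ScrapeOptions[] = [
    base,
    { ...base, delay: 3000 },
    { ...base, delay: 3000, useProxy: true },
  ];

  let attempts = 0;
  let last: ScrapeResult | undefined;
  for (const options of ladder) {
    attempts += 1;
    last = await scrape(options);
    const status = last.metadata.statusCode ?? 0;
    const length = (last.content.markdown ?? "").trim().length;
    if (status >= 200 && status < 400 && length >= minChars) {
      return { result: last, attempts };
    }
  }
  // Every rung failed; return the last response so the caller can inspect it.
  return { result: last as ScrapeResult, attempts };
}
```

Something like `scrapeWithEscalation((o) => client.scrape(o), TARGET_URL)` would plug in the real client, though the `format` type may need tightening to match the SDK. Each rung is its own billed request, so the ladder costs more only when the plain call falls short.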

## Related

The [Python version](/cookbook/scrape) calls the same endpoint and writes the screenshot and PDF to disk as files. The [Rust version](/cookbook/scrape) is the lowest-friction way into the Rust SDK. For a recipe that drives a real browser instead of the direct API, see [playwright-ts](/cookbook/playwright). The full method and parameter reference lives in the [steel-sdk package](https://www.npmjs.com/package/steel-sdk).

</Tab>

<Tab id="python" className="cookbook-concept-tab">

<RecipeMeta href="https://github.com/steel-dev/steel-cookbook/tree/3d4db4fa997d1895d84d9d8106eaf25d97a60192/examples/scrape-py" path="examples/scrape-py" authors={[{"handle":"junhsss","name":"Jun Ryu","avatar":"https://github.com/junhsss.png?size=40"}]} updated="2026-06-23" />

<RecipeQuickstart slug="scrape-py" />

Steel's `/v1/scrape` endpoint runs a browser server-side and hands back the rendered page. There is no session to create, no CDP socket to attach to, and no browser library on your machine. You call one method, and you get the page content, plus an optional screenshot and PDF. This recipe turns that single call into three files on disk: `page.md`, `screenshot.png`, and `page.pdf`.

```python
result = client.scrape(
    url=TARGET_URL,
    format=["markdown"],
    screenshot=True,
    pdf=True,
)
```

The one detail worth internalizing: the response mixes inline data and hosted artifacts. `result.content.markdown` is a string you can write straight to a file. But `result.screenshot.url` and `result.pdf.url` are **hosted URLs**, not bytes. Steel renders the image and PDF, stores them, and returns links. So the recipe writes the markdown directly, then fetches the two URLs with `urllib` and saves the bytes. The `download` helper does the fetch; `main` wires the three writes.

Because there is no session object, there is no teardown. `client.sessions.release(...)` does not apply here. You pay for the render, the response comes back, and you are done. That makes scrape the lowest-friction way to pull a page into an agent's context: one call, structured output, no lifecycle to manage.

## Run it

```bash
cd examples/scrape-py
cp .env.example .env          # set STEEL_API_KEY
uv run main.py
```

Grab a key at [app.steel.dev/settings/api-keys](https://app.steel.dev/settings/api-keys). `uv sync` runs automatically on first `uv run`, so there is no separate install step.

Your output varies. Structure looks like this:

```text
Steel Scrape API (Python)
============================================================
Scraping https://news.ycombinator.com ...
Fetched "Hacker News" (HTTP 200)
Markdown: 8421 chars, 147 links
Saved page.md (8421 chars)
Saved screenshot.png (184320 bytes)
Saved page.pdf (96774 bytes)

Artifacts written to /path/to/examples/scrape-py/output
Done!
```

The three files land in `output/` next to `main.py`. Open `page.md` to see the markdown an LLM would read, `screenshot.png` for the rendered viewport, and `page.pdf` for a print-layout capture.

A scrape costs a few cents of browser time. You are billed per render, not per minute, so a one-shot scrape is cheaper than spinning up a full session for the same page. If you only need text, drop `screenshot=True` and `pdf=True` and you skip the render-and-host work for the artifacts you are not using.

## Make it yours

- **Change the target.** Set `TARGET_URL` in `.env`, or edit the default in `main.py`. Everything downstream is the same.
- **Pick your formats.** `format` accepts any of `markdown`, `html`, `cleaned_html`, and `readability`. Pass a list to get several at once, then read them off `result.content` (`result.content.html`, `result.content.cleaned_html`, and so on). `cleaned_html` strips scripts and boilerplate; `readability` returns article-extracted structure.
- **Mine the metadata.** `result.metadata` carries `title`, `description`, `status_code`, Open Graph fields (`og_title`, `og_image`), `canonical`, `author`, and `json_ld`. `result.links` is a list of `{text, url}` for every link on the page, which is a ready-made frontier for a crawler.
- **Get the artifacts without the markdown.** `client.screenshot(url=..., full_page=True)` and `client.pdf(url=...)` are standalone calls that each return a single hosted URL. Use them when you want a capture and nothing else. `full_page=True` captures past the fold.
- **Reach difficult sites.** Pass `use_proxy=True` to route the render through Steel's residential proxy network for pages that block datacenter traffic.

## How scrape differs from a browser session

The other recipes in the cookbook connect a browser library (Playwright, Selenium) to a live Steel session over CDP, then drive clicks and reads themselves. That is the right tool when you need to log in, fill forms, or step through an app. Scrape is the right tool when you just want the page as it renders: one request in, content out, nothing to keep alive. If your agent's job is "read this URL," reach for scrape first and graduate to a session only when you need interaction.

## Related

The [TypeScript version](/cookbook/scrape) covers the same endpoint with the clean-markdown-for-LLM angle. The [Rust version](/cookbook/scrape) walks through the three calls separately. For a live, interactive browser instead, see [playwright-py](/cookbook/playwright).

</Tab>

<Tab id="rust" className="cookbook-concept-tab">

<RecipeMeta href="https://github.com/steel-dev/steel-cookbook/tree/3d4db4fa997d1895d84d9d8106eaf25d97a60192/examples/scrape-rs" path="examples/scrape-rs" authors={[{"handle":"junhsss","name":"Jun Ryu","avatar":"https://github.com/junhsss.png?size=40"}]} updated="2026-06-23" />

<RecipeQuickstart slug="scrape-rs" />

Steel's REST API turns a URL into structured content without a browser on your side. The `steel-rs` crate wraps three of those endpoints as plain async methods: `client.scrape()` returns parsed content plus typed metadata, while `client.screenshot()` and `client.pdf()` render the page and each hand back a hosted file URL. There is no session to create, connect to, or release. Each call is one stateless request that runs a browser on Steel's side and returns when the page is done.

That makes this the shortest path into Steel from Rust, and it leans on the SDK's typed structs rather than raw JSON. `scrape()` deserializes into a `ScrapeResponse`, so the fields are real Rust types you can pattern-match on:

```rust
let scraped = client
    .scrape(ClientScrapeParams {
        url: TARGET_URL.to_string(),
        format: Some(vec![ScrapeRequestFormatItem::Markdown]),
        // remaining options set to None; see main.rs
    })
    .await?;

let meta = &scraped.metadata;       // ScrapeResponseMetadata
meta.status_code;                   // i64
meta.title.as_deref();              // Option<&str>
meta.language.as_deref();           // Option<&str>
scraped.links.len();                // Vec<ScrapeResponseLink>
scraped.content.markdown;           // Option<String>
```

`metadata` carries about twenty parsed fields (Open Graph tags, canonical URL, author, published time, the HTTP status code), so you get the document's shape without writing a single selector. `content` holds whichever formats you asked for in `format`: `Markdown`, `HTML`, `CleanedHTML`, or `Readability`. Request only what you need; markdown alone keeps the payload small for LLM context.

`main` runs all three calls against Hacker News, prints the typed metadata, and writes `page.md`, `screenshot.png`, and `page.pdf` to the working directory. The screenshot and PDF responses each carry a hosted URL, not bytes, so the `download` helper fetches each URL with `reqwest` and writes the file. The artifacts live on Steel for a while after the call, which is handy if you would rather hand the URL to another service than store the bytes yourself.

## Run it

```bash
cd examples/scrape-rs
cp .env.example .env          # set STEEL_API_KEY
cargo run
```

Get a key at [app.steel.dev/settings/api-keys](https://app.steel.dev/settings/api-keys). The first build pulls `steel-rs`, `tokio`, and `reqwest`, so it takes a moment; later runs are fast.

Your output varies. Structure looks like this:

```text
Scraping https://news.ycombinator.com ...
  status     200
  title      Hacker News
  language   en
  links      183
  markdown   14217 chars
  wrote      page.md
Capturing screenshot ...
  wrote      screenshot.png
Rendering PDF ...
  wrote      page.pdf
Done.
```

Three calls cost a few cents of browser time total. Steel bills sessions by the minute, but these one-shot endpoints spin up and tear down their own browser, so there is nothing to leak: no cleanup call, no session left running against the default 5-minute timeout. The trade-off is that each call is independent, so you cannot log in once and scrape five pages behind the auth. For that, open a session and drive a real browser (see Related).

## Make it yours

- **Change the target.** Edit the `TARGET_URL` constant. Every call reads from it.
- **Pick formats.** Pass more variants in `format`, for example `vec![ScrapeRequestFormatItem::Markdown, ScrapeRequestFormatItem::HTML]`, then read `scraped.content.html`. Each requested format comes back as its own `Option` field on `content`.
- **Get the screenshot and PDF in one call.** `scrape()` takes `pdf: Some(true)` and `screenshot: Some(true)`; the URLs come back on `scraped.pdf` and `scraped.screenshot` instead of making three round trips.
- **Handle anti-bot pages.** Set `use_proxy: Some(true)` on any of the params to route through a Steel residential proxy. Add `delay: Some(2000)` to wait for late-loading content before capture.
- **Match on the status.** `meta.status_code` is an `i64`, so branch on it before trusting the content (a soft 404 still returns markdown).

## Related

The [TypeScript version](/cookbook/scrape) and [Python version](/cookbook/scrape) cover the same three endpoints. For a full browser session you connect to and drive over CDP, see [chromiumoxide](/cookbook/chromiumoxide). For the HTTP client and async runtime this recipe builds on, see the [reqwest docs](https://docs.rs/reqwest) and [Tokio docs](https://tokio.rs).

</Tab>

<Tab id="go" className="cookbook-concept-tab">

<RecipeMeta href="https://github.com/steel-dev/steel-cookbook/tree/3d4db4fa997d1895d84d9d8106eaf25d97a60192/examples/scrape-go" path="examples/scrape-go" authors={[{"handle":"junhsss","name":"Jun Ryu","avatar":"https://github.com/junhsss.png?size=40"}]} updated="2026-06-23" />

<RecipeQuickstart slug="scrape-go" />

Steel's direct API turns a URL into clean content with no browser library and no session to manage. One `client.Scrape` call runs a browser server-side and returns the page as Markdown (or HTML, readability, or cleaned HTML) inline, while `client.Screenshot` and `client.Pdf` render the same page to hosted files. This recipe scrapes a page to Markdown, prints a preview, then captures a full-page screenshot and a PDF. It is the lowest-friction way to reach a page from Go: no CDP, no chromedp, no `defer release`.

The scrape call comes first:

```go
scraped, err := client.Scrape(ctx, steel.ClientScrapeParams{
    URL:    targetURL,
    Format: &[]steel.ScrapeRequestFormatItem{steel.ScrapeRequestFormatItemMarkdown},
})
if err != nil {
    // Stop here: scraped is not safe to read when the call failed.
    log.Fatalf("scrape failed: %v", err)
}
markdown := deref(scraped.Content.Markdown, "")
title := deref(scraped.Metadata.Title, "(no title)")
```

A few Go specifics show up here. Optional request fields are pointers (`Format` is a `*[]ScrapeRequestFormatItem`, `FullPage` is a `*bool`), and steel-go ships no pointer constructors, so the recipe defines a one-line `ptr[T]` generic. Response fields like `Content.Markdown` and `Metadata.Title` are `*string`, so a small `deref` helper supplies a fallback. The format is a typed constant (`steel.ScrapeRequestFormatItemMarkdown`), not a bare string.

Screenshot and PDF come back as hosted URLs, not bytes:

```go
shot, _ := client.Screenshot(ctx, steel.ClientScreenshotParams{URL: targetURL, FullPage: ptr(true)})
fmt.Println(shot.URL) // https://...

pdf, _ := client.Pdf(ctx, steel.ClientPdfParams{URL: targetURL})
fmt.Println(pdf.URL)
```

To keep the files, fetch each URL with `net/http` and write the bytes to disk.

## Run it

```bash
cd examples/scrape-go
cp .env.example .env          # set STEEL_API_KEY
go run .
```

Get a Steel key at [app.steel.dev/settings/api-keys](https://app.steel.dev/settings/api-keys). Point the recipe at any page with `TARGET_URL` in `.env`.

Your output varies. Structure looks like this:

```text
Steel Scrape API (Go)
============================================================

Scraping https://news.ycombinator.com to markdown...
HTTP 200 | Hacker News
Links found: 184
Markdown length: 8423 characters

--- Markdown preview (first 500 chars) ---
[ clean Markdown for the page ]
--- end preview ---

Capturing a full-page screenshot...
Screenshot hosted at: https://...
Rendering the page to PDF...
PDF hosted at: https://...

Done. Feed the markdown straight into an LLM prompt.
```

A scrape call costs a few cents of browser time. Steel starts and tears down the browser per call, so there is no session to release.

## Make it yours

- **Change the page.** Set `TARGET_URL` in `.env`, or pass a different URL to `client.Scrape`.
- **Ask for several formats.** `Format` takes a slice, so request more than one at once (`ScrapeRequestFormatItemMarkdown`, `...HTML`, `...Readability`, `...CleanedHTML`). Each lands under its own field on `Content`.
- **Save the artifacts.** Fetch `shot.URL` and `pdf.URL` with `net/http`, then use `os.WriteFile` to write `screenshot.png` and `page.pdf`, the way the Python recipe does.
- **Scrape behind a proxy.** Set `UseProxy: ptr(true)` to route through a Steel residential proxy for geofenced or bot-sensitive pages.

## Related

[scrape-ts](/cookbook/scrape) and [scrape-py](/cookbook/scrape) are the same direct API in TypeScript and Python, where the Python recipe writes the screenshot and PDF to disk. [scrape-rs](/cookbook/scrape) is the Rust version. For a full browser you drive yourself, [chromedp](/cookbook/chromedp) and [Rod](/cookbook/rod) connect over CDP instead.

</Tab>

</Tabs>

## Related recipes

<RecipeGrid>
<RecipeCard slug="convex-price-watch" title={"Watch Claude pricing for divergent A/B variants"} description={"Convex cron plus two parallel Steel proxy probes against claude.com/pricing. Stores per-tier per-region snapshots and surfaces tiers where the probes disagree."} topics={['Steel APIs', 'Convex']} languages={['TypeScript']} date="2026-05-04" />
<RecipeCard slug="profiles" title={"Persist authenticated sessions with Profiles"} description={"Maintain authenticated sessions across Steel browser instances using profiles."} topics={['Steel APIs', 'Authentication']} languages={['TypeScript', 'Python', 'Rust', 'Go']} date="2025-10-13" />
<RecipeCard slug="auth-context" title={"Reuse authenticated sessions across browsers"} description={"Maintain authenticated sessions across Steel browser instances by capturing and reusing cookies and local storage."} topics={['Steel APIs', 'Authentication']} languages={['TypeScript', 'Python', 'Rust', 'Go']} date="2025-03-11" />
</RecipeGrid>
