Scrape API /scrape
Low-level endpoint that reuses the same browser + proxy infra as the agent, but returns raw page text and accessibility trees. No planner, no tools—just data for your own models and pipelines.
Infra-Only Credits
No model/tool credits—just browser + proxy costs for maximum efficiency.
Raw Page Data
Get extracted text, accessibility trees, and element link records.
Composable Output
Feed results directly into your own LLM/RAG pipelines.
Scrape API Playground
POST /scrape
Low-level endpoint for raw page text + accessibility tree.
Base URL: https://api.rtrvr.ai
Use /scrape for raw page data and /agent for full agent runs.
Use your API key in the Authorization header:
Authorization: Bearer rtrvr_your_api_key
Endpoint: https://api.rtrvr.ai/scrape
Use /agent when you want the full planner + tools engine, and /scrape when you just need raw page text + structure for your own models.
Open one or more URLs in our browser cluster and get back extracted text, the accessibility tree, and link metadata. The endpoint is designed to be:
- Cheap – infra-only credits (browser + proxy), no model usage.
- Predictable – stable schema for tab content + usage metrics.
- Composable – plug the result into your own LLM/RAG pipeline.
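The same request can be made from code; a minimal TypeScript sketch, assuming the global fetch of Node 18+ (the helper name and API key are placeholders, not part of the API):

```typescript
// Build the fetch options for a /scrape call. Helper name is ours, for illustration.
function buildScrapeRequest(apiKey: string, urls: string[]) {
  return {
    method: "POST",
    headers: {
      "Authorization": `Bearer ${apiKey}`,
      "Content-Type": "application/json",
    },
    body: JSON.stringify({ urls }),
  };
}

// Usage:
// const res = await fetch("https://api.rtrvr.ai/scrape",
//   buildScrapeRequest("rtrvr_your_api_key", ["https://example.com/blog/ai-trends-2025"]));
// const data = await res.json();
```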
curl -X POST https://api.rtrvr.ai/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com/blog/ai-trends-2025"]
}'
Each scrape uses a unified UserSettings profile stored in the cloud.
interface UserSettings {
extractionConfig: {
maxParallelTabs?: number;
pageLoadDelay?: number;
makeNewTabsActive?: boolean;
writeRowProcessingTime?: boolean;
disableAutoScroll?: boolean;
/**
* When true, only text content is returned from scrapes.
* The accessibility tree + elementLinkRecord are omitted.
*/
onlyTextContent?: boolean;
};
// Proxy Configuration
proxyConfig: {
mode: 'none' | 'custom' | 'default' | 'device';
customProxies: ProxySettings[];
selectedProxyId?: string;
selectedDeviceId?: string;
};
}

Two ways to control behavior:
1. Cloud profile: configure defaults in Cloud → Settings.
2. Per-request overrides: send settings in your request body.
The request body is a ScrapeApiRequest:
interface ScrapeApiRequest {
/**
* Optional stable id if you want to tie multiple scrapes together.
* Mostly useful for analytics/observability on your side.
*/
trajectoryId?: string;
/**
* One or more absolute URLs to load in the browser.
* Must be a non-empty array of non-empty strings.
*/
urls: string[];
/**
* Optional per-request settings override.
* Merged on top of the stored UserSettings profile (proxyConfig, extraction, etc.).
*
* Use extraction-related settings if you only want text content and don't need
* the accessibility tree + elementLinkRecord.
*/
settings?: Partial<UserSettings>;
/**
* Response size control for API callers.
*/
response?: {
/**
* Max bytes allowed for the inline JSON response.
* If the full response exceeds this, tabs remain inline as preview content,
* and a StorageReference is returned under metadata.responseRef for full payload download.
* Default: 1MB (1048576 bytes)
*/
inlineOutputMaxBytes?: number;
};
/**
* Optional execution options.
* Set options.ui.emitEvents=true to write progress events for SSE/polling clients.
* If omitted/false, no execution event stream is written.
*/
options?: {
ui?: {
emitEvents?: boolean;
};
};
/**
* Webhooks to call when the scrape completes, fails, or is cancelled.
*/
webhooks?: WebhookSubscription[];
}
interface WebhookSubscription {
/** The URL to POST to */
url: string;
/** Events to subscribe to. Defaults to all scrape events. */
events?: WebhookEvent[];
/** Optional custom headers */
headers?: Record<string, string>;
/** Optional auth (bearer or basic) */
auth?: { type: "bearer"; token: string } | { type: "basic"; username: string; password: string };
/** Optional secret for HMAC signing (X-Rtrvr-Signature header) */
secret?: string;
/** Timeout for webhook delivery (default: 8000ms) */
timeoutMs?: number;
/** Retry policy (default: { mode: "default" }) */
retry?: { mode: "default" | "none" };
}
type WebhookEvent =
| "rtrvr.scrape.succeeded"
| "rtrvr.scrape.failed"
| "rtrvr.scrape.cancelled";

Parameters
- urls (string[], required): One or more absolute URLs to scrape. Must be a non-empty array.
- trajectoryId (string): Optional stable id for grouping scrapes together (analytics, observability).
- settings (Partial<UserSettings>): Optional per-request override merged on top of your cloud UserSettings profile.
- response.inlineOutputMaxBytes (number, default: 1048576): Maximum inline response size in bytes (default 1 MB).
- options.ui.emitEvents (boolean, default: false): Opt-in execution progress events for SSE/polling consumers.
- webhooks (WebhookSubscription[]): Optional array of webhook endpoints to notify when the scrape completes, fails, or is cancelled.
Receive HTTP callbacks when your scrape completes, fails, or is cancelled. Webhooks are delivered asynchronously after the scrape finishes.
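On the receiving side, a handler can branch on the event name in the delivery envelope (documented under Webhook Payload); a minimal dispatch sketch, with the handler name ours:

```typescript
// Shape of the delivery envelope, per the Webhook Payload section.
interface ScrapeWebhookEnvelope {
  id: string;
  event: string;
  createdAt: string;
  data: { trajectoryId: string; success: boolean };
}

// Parse the raw request body and dispatch on the event name.
function handleScrapeWebhook(rawBody: string): string {
  const envelope: ScrapeWebhookEnvelope = JSON.parse(rawBody);
  switch (envelope.event) {
    case "rtrvr.scrape.succeeded":
      return `scrape ${envelope.data.trajectoryId} succeeded`;
    case "rtrvr.scrape.failed":
      return `scrape ${envelope.data.trajectoryId} failed`;
    case "rtrvr.scrape.cancelled":
      return `scrape ${envelope.data.trajectoryId} cancelled`;
    default:
      return `ignored ${envelope.event}`;
  }
}
```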
Webhook Subscription
- url (string, required): The HTTPS endpoint to POST the webhook payload to.
- events (WebhookEvent[]): Which events to subscribe to. Defaults to all scrape events: "rtrvr.scrape.succeeded", "rtrvr.scrape.failed", "rtrvr.scrape.cancelled".
- headers (Record<string, string>): Custom headers to include with each webhook request.
- auth (object): Authentication for the webhook endpoint. Supports bearer token or basic auth.
- auth.type ("bearer" | "basic"): The authentication type.
- auth.token (string): Bearer token (when type is "bearer").
- auth.username (string): Username (when type is "basic").
- auth.password (string): Password (when type is "basic").
- secret (string): HMAC secret for signing. When provided, requests include an X-Rtrvr-Signature header.
- timeoutMs (number, default: 8000): Timeout for webhook delivery in milliseconds.
- retry (object): Retry policy. { mode: "default" } retries with backoff; { mode: "none" } delivers once.
Example with Webhook
curl -X POST https://api.rtrvr.ai/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com/page1", "https://example.com/page2"],
"webhooks": [
{
"url": "https://your-server.com/webhooks/scrape",
"events": ["rtrvr.scrape.succeeded", "rtrvr.scrape.failed"],
"secret": "whsec_your_signing_secret",
"headers": { "X-Custom-Header": "my-value" }
}
]
}'

Webhook Payload
Each webhook delivery is a POST request with a JSON envelope:
{
"id": "whd_abc123...", // unique delivery id
"event": "rtrvr.scrape.succeeded",
"createdAt": "2025-01-15T10:30:00.000Z",
"data": {
"trajectoryId": "traj_xyz...",
"success": true,
"tabs": [...],
"usageData": {...}
}
}

Signature Verification
When you provide a secret, each request includes an X-Rtrvr-Signature header:
X-Rtrvr-Signature: t=1705312200,v1=5257a869e7ecebeda32affa62cdca3fa51cad7e77a0e56ff536d0ce8e108d8bd

import crypto from 'crypto';
function verifyWebhookSignature(payload, signature, secret) {
  const [tPart, vPart] = signature.split(',');
  const timestamp = tPart.split('=')[1];
  const receivedSig = vPart.split('=')[1];
  // Recreate the signed payload
  const signedPayload = `${timestamp}.${JSON.stringify(payload)}`;
  const expectedSig = crypto
    .createHmac('sha256', secret)
    .update(signedPayload)
    .digest('hex');
  // Timing-safe comparison; timingSafeEqual throws on length mismatch, so guard first
  const received = Buffer.from(receivedSig, 'hex');
  const expected = Buffer.from(expectedSig, 'hex');
  if (received.length !== expected.length) return false;
  return crypto.timingSafeEqual(received, expected);
}

Save your webhook endpoints in Cloud → Webhooks to quickly attach them to any execution without re-entering the URL, secret, and events each time.
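For testing your endpoint locally, the signing side of the same scheme can be reproduced; a sketch with a helper name of our own, mirroring the t=<unix seconds>,v1=<hex hmac> format:

```typescript
import { createHmac } from "crypto";

// Produce an X-Rtrvr-Signature value for a JSON payload, matching the
// timestamp-dot-body HMAC-SHA256 scheme used in verification.
function signWebhookPayload(payload: unknown, secret: string, timestampSec: number): string {
  const signedPayload = `${timestampSec}.${JSON.stringify(payload)}`;
  const v1 = createHmac("sha256", secret).update(signedPayload).digest("hex");
  return `t=${timestampSec},v1=${v1}`;
}
```

Send the result as the X-Rtrvr-Signature header of a simulated delivery to exercise your verification path end to end.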
The API response is a ScrapeApiResponse:
interface ScrapedTab {
tabId: number;
url: string;
title: string;
contentType: string;
status: "success" | "error";
error?: string;
/**
* Full extracted visible text (when available).
*/
content?: string;
/**
* JSON-encoded accessibility tree (stringified).
* Use this if you want a rich, structured view of the page for your own models.
* Every link node in the tree has a numeric 'id' field which is used as the key
* in elementLinkRecord.
*/
tree?: string;
/**
* Map of accessibility-tree element id -> href/URL for link elements.
* Only present when 'tree' is present.
*/
elementLinkRecord?: Record<number, string>;
}
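Since every link node in the tree carries a numeric id keyed into elementLinkRecord, the two can be joined with a simple walk. A sketch; the node shape (id, name, children) is an assumption for illustration, so inspect tab.tree from a real response for the exact schema:

```typescript
// Assumed node shape for illustration only.
interface AxNode {
  id?: number;
  name?: string;
  children?: AxNode[];
}

// Walk the parsed accessibility tree and pair each link node's id
// with its URL from elementLinkRecord.
function collectLinks(
  tree: string,
  elementLinkRecord: Record<number, string>
): Array<{ name?: string; href: string }> {
  const out: Array<{ name?: string; href: string }> = [];
  const walk = (node: AxNode): void => {
    if (node.id !== undefined && elementLinkRecord[node.id]) {
      out.push({ name: node.name, href: elementLinkRecord[node.id] });
    }
    for (const child of node.children ?? []) walk(child);
  };
  walk(JSON.parse(tree));
  return out;
}
```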
interface ScrapeUsageData {
totalCredits: number;
browserCredits: number;
proxyCredits: number;
totalUsd: number;
requestDurationMs: number;
proxyPageLoads: number;
proxyTabsDataFetches: number;
usingBillableProxy: boolean;
}
interface ScrapeApiResponse {
success: boolean;
status: "success" | "error";
trajectoryId: string;
tabs?: ScrapedTab[];
usageData: ScrapeUsageData;
metadata?: {
taskRef?: string;
inlineOutputMaxBytes: number;
durationMs: number;
outputTooLarge?: boolean;
responseRef?: StorageReference;
};
error?: string;
}

Tabs & content
- tabs (ScrapedTab[]): One tab per URL, in the same order as the input urls.
- tabs[].content (string): Full extracted visible text when available.
- tabs[].tree (string): JSON-encoded accessibility tree (stringified). Omitted when onlyTextContent=true.
- tabs[].elementLinkRecord (Record<number, string>): Lookup table mapping accessibility-tree element id → href/URL.
Infra usage
- usageData.totalCredits (number): Total infra credits consumed by this scrape.
- usageData.browserCredits (number): Credits attributable to browser usage.
- usageData.proxyCredits (number): Credits attributable to proxy usage.
- usageData.requestDurationMs (number): End-to-end latency for the scrape request in ms.
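Tying the response fields together, a small post-processing sketch; field names follow the interfaces above, while the StorageReference download step is left abstract since its shape is not shown here, and the helper name is ours:

```typescript
// Summarize a ScrapeApiResponse: flag oversized responses, count successful
// tabs, and report infra usage.
function summarizeResponse(res: {
  success: boolean;
  tabs?: Array<{ url: string; status: string; content?: string }>;
  usageData: { totalCredits: number; requestDurationMs: number };
  metadata?: { outputTooLarge?: boolean };
}): string {
  if (!res.success) return "scrape failed";
  if (res.metadata?.outputTooLarge) {
    // Inline tabs are previews; fetch the full payload via metadata.responseRef.
    return "output too large: download full payload from metadata.responseRef";
  }
  const ok = (res.tabs ?? []).filter((t) => t.status === "success").length;
  return `${ok}/${res.tabs?.length ?? 0} tabs ok, ` +
    `${res.usageData.totalCredits} credits in ${res.usageData.requestDurationMs}ms`;
}
```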
# Basic scrape using profile defaults
curl -X POST https://api.rtrvr.ai/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": ["https://example.com/blog/ai-trends-2025"],
"response": { "inlineOutputMaxBytes": 1048576 }
}'
# With per-request settings override
curl -X POST https://api.rtrvr.ai/scrape \
-H "Authorization: Bearer YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": [
"https://example.com/blog/ai-trends-2025",
"https://example.com/pricing"
],
"settings": {
"extractionConfig": {
"onlyTextContent": true
},
"proxyConfig": {
"mode": "default"
}
},
"response": {
"inlineOutputMaxBytes": 1048576
}
}'