Assistant Crawler

Turn any website into agent-ready data

Scrape one page, search the web, map a domain, or crawl a whole site. Clean markdown, structured JSON, screenshots — straight into Assistant pipelines.

Scrape

One URL → clean markdown

JS rendering, anti-bot bypass, main-content extraction. Output as markdown, HTML, links, screenshot, branding tokens, or schema-validated JSON.
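A single-URL scrape boils down to one request with the output formats you want back. A minimal sketch of assembling such a request — the endpoint shape, field names (`formats`, `onlyMainContent`), and format keys are illustrative assumptions, not the actual Crawler API schema:

```python
# Hypothetical scrape-request builder. Field names ("formats",
# "onlyMainContent") are assumptions, not the real API contract.
def build_scrape_request(url: str, formats: list[str]) -> dict:
    """Assemble a single-URL scrape request with the desired outputs."""
    allowed = {"markdown", "html", "links", "screenshot", "branding", "json"}
    unknown = set(formats) - allowed
    if unknown:
        raise ValueError(f"unsupported formats: {sorted(unknown)}")
    return {
        "url": url,
        "formats": formats,
        "onlyMainContent": True,  # strip nav/footer boilerplate
    }

payload = build_scrape_request("https://example.com/docs", ["markdown", "links"])
```

Validating the format list client-side gives a fast failure before the request ever hits the server.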

Search

Query the web, get pages back

Run a search, optionally scrape every result in the same call. Filter by language, country, time window (hour / day / week / month / year).
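The same pattern applies to search: a query plus optional filters, with a flag to scrape every hit in the same call. A hedged sketch — parameter names (`timeWindow`, `scrapeResults`) are assumptions for illustration:

```python
# Hypothetical search-request builder; parameter names are assumptions.
VALID_WINDOWS = {"hour", "day", "week", "month", "year"}

def build_search_request(query: str, *, lang: str = "", country: str = "",
                         time_window: str = "", scrape_results: bool = False) -> dict:
    """Build a search request with optional language/country/time filters."""
    if time_window and time_window not in VALID_WINDOWS:
        raise ValueError(f"time_window must be one of {sorted(VALID_WINDOWS)}")
    req: dict = {"query": query, "scrapeResults": scrape_results}
    if lang:
        req["lang"] = lang
    if country:
        req["country"] = country
    if time_window:
        req["timeWindow"] = time_window
    return req

req = build_search_request("incident postmortem", country="us",
                           time_window="week", scrape_results=True)
```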

Map

Discover every URL

Fast sitemap generation up to 5,000 URLs per site. Filter by keyword, optionally include subdomains, results ranked by relevance — no full crawl required.
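The map options above (keyword filter, subdomain inclusion, the 5,000-URL cap) can be mirrored client-side. A sketch of that filtering logic — the function and its parameters are illustrative, not part of the Crawler API:

```python
from urllib.parse import urlparse

def filter_mapped_urls(urls, *, keyword="", include_subdomains=False,
                       root="example.com", limit=5000):
    """Filter mapped URLs: keyword substring match, optional subdomains,
    capped at the 5,000-URL map limit. Illustrative only."""
    out = []
    for u in urls:
        host = urlparse(u).hostname or ""
        if include_subdomains:
            if host != root and not host.endswith("." + root):
                continue
        elif host != root:
            continue
        if keyword and keyword not in u:
            continue
        out.append(u)
    return out[:limit]

urls = ["https://example.com/docs/a",
        "https://blog.example.com/post",
        "https://other.com/x"]
docs_only = filter_mapped_urls(urls, keyword="docs")
with_subs = filter_mapped_urls(urls, include_subdomains=True)
```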

Crawl

Whole-site ingestion

Recursive crawl with depth limits, include/exclude path patterns, sitemap-only mode, async jobs with live progress and pagination.
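Include/exclude path patterns are the main lever for scoping a recursive crawl. A minimal sketch of how such a scope check could work, using shell-style globs via the standard-library `fnmatch` — the function itself is a stand-in, not the Crawler's actual matcher:

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

def path_allowed(url, include=None, exclude=None):
    """Crawl-scope check: the URL's path must match no exclude pattern,
    and at least one include pattern if any are given."""
    path = urlparse(url).path or "/"
    if exclude and any(fnmatch(path, p) for p in exclude):
        return False
    if include:
        return any(fnmatch(path, p) for p in include)
    return True
```

Exclude patterns win over include patterns, so `/blog/*` plus an exclude of `/blog/drafts/*` crawls published posts only.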

Request lifecycle
  1. URL or query
  2. Render (JS / anti-bot)
  3. Extract & clean
  4. Format (md / json / image)
  5. Hand off to agent or store
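The lifecycle above can be sketched as a small pipeline. Every function here is a toy stand-in for the corresponding stage, not the real Crawler internals:

```python
# Toy pipeline mirroring the request lifecycle; each stage is a stand-in.
def render(target):
    # Stage 2 stand-in: pretend we rendered the page with JS.
    return f"<html><main>{target} body</main></html>"

def extract(html):
    # Stage 3 stand-in: naive main-content extraction.
    return html.split("<main>")[1].split("</main>")[0]

def to_format(text, fmt):
    # Stage 4 stand-in: wrap content in the requested format.
    return {"format": fmt, "content": text}

def hand_off(doc):
    # Stage 5 stand-in: hand to an agent or persist.
    return {"status": "stored", **doc}

def run_pipeline(target: str) -> dict:
    return hand_off(to_format(extract(render(target)), "md"))

result = run_pipeline("https://example.com")
```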

Where teams use it

Knowledge ingestion

Crawl docs, FAQs, policy pages — pipe markdown into the Support Agent's RAG store. Auto-refresh on a schedule.

Lead enrichment

Map a prospect's site, scrape /pricing and /about, hand structured JSON to the Sales Agent before the first outreach.

Competitive monitoring

Scheduled scrape of competitor pages with diff detection. Trigger Slack alerts on copy, pricing, or feature changes.
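One cheap way to implement diff detection between scheduled scrapes is to store a hash of each page's extracted text and compare on the next run. A sketch under that assumption — the alert wiring is left out:

```python
import hashlib

def content_changed(previous_hash, page_text):
    """Compare a stored SHA-256 of the last scrape with the current one.
    Returns (changed, current_hash); changed is False on the first run."""
    current = hashlib.sha256(page_text.encode("utf-8")).hexdigest()
    changed = previous_hash is not None and current != previous_hash
    return changed, current

# First run seeds the hash; second run detects the pricing change.
changed1, h1 = content_changed(None, "Pro plan: $49/mo")
changed2, h2 = content_changed(h1, "Pro plan: $59/mo")
```

Hashing the cleaned markdown rather than the raw HTML avoids false alerts from rotating ad markup or cache-busting asset URLs.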

Content extraction

Schema-driven JSON extraction for product catalogs, job boards, news sites. Drop straight into Postgres or BigQuery.
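Before extracted records land in Postgres or BigQuery, a schema gate catches malformed rows. A minimal sketch of such a check — the schema format (field name to Python type) is an illustrative simplification, not the Crawler's JSON-schema support:

```python
def validate_record(record: dict, schema: dict) -> dict:
    """Require every schema field to be present with the declared type.
    Illustrative stand-in for proper JSON-schema validation."""
    for field, expected_type in schema.items():
        if field not in record:
            raise KeyError(f"missing field: {field}")
        if not isinstance(record[field], expected_type):
            raise TypeError(f"{field}: expected {expected_type.__name__}")
    return record

product_schema = {"name": str, "price": float, "in_stock": bool}
row = validate_record({"name": "Widget", "price": 9.99, "in_stock": True},
                      product_schema)
```

In production you would reach for a real JSON-schema validator instead; the point is to reject bad rows before the warehouse load, not after.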

Under the hood

Engine: Firecrawl v2 (managed) or self-hosted Playwright workers
Auth: Server-side proxy — API key never touches the browser
Output: Markdown, raw HTML, links, screenshot (base64), branding tokens, schema JSON, AI summary
Reliability: Retries, polling, pagination, cancellable async jobs
Scale: Batch scrape, parallel workers, rate-limit aware
Compliance: robots.txt aware, configurable user agent, geo-targeting

All scraping runs server-side. Credentials, target URLs, and extracted content stay inside your tenant.

Crawler request flow

Crawler contract
Modes: scrape (1 URL) · search (query) · map (sitemap) · crawl (recursive)
Limits: map ≤ 5,000 URLs · crawl depth + path filters · async jobs
Cost model: per-page billing · 24-hour cache · batch discount
Output: markdown · raw HTML · links · screenshot (b64) · schema JSON