← TheCrawler

API Docs

A REST API to turn any public URL into clean markdown or structured JSON. AI extraction runs on our own GPU— it's included on every page, with no per-call surcharge. You pay per page; AI is free.

POST /v1/scrape
URL → markdown + metadata
POST /v1/extract
URL → structured JSON (managed GPU)
GET /v1/extract/result/{id}
Poll an extraction job
POST /v1/search
Query → top Google results, scraped
POST /v1/batch
Up to 200 URLs async + signed webhooks
POST /v1/monitoring
Watch a URL; webhook on change

Authentication

Every request needs your API key as a bearer token. Buy any plan on the pricing page — the key is emailed to you and shown on the success screen. No account or login required; one key draws on your page balance.

Authorization: Bearer mai_live_xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx

Base URL: https://www.miaibot.ai/api/v1

Scrape a page

POST /v1/scrape — fetch one URL and return its content. Costs 1 page. Set markdown, metadata, and/or brand to choose what comes back.

curl -X POST https://www.miaibot.ai/api/v1/scrape \
  -H "Authorization: Bearer $CRAWLER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "markdown": true,
    "metadata": true
  }'

Response:

{
  "ok": true,
  "sourceUrl": "https://example.com/",
  "item": {
    "url": "https://example.com/",
    "status": "success",
    "statusCode": 200,
    "engine": "cheerio",
    "metadata": { "title": "Example Domain", "language": "en", ... },
    "markdown": "# Example Domain\n..."
  }
}

Extract structured data (managed GPU)

POST /v1/extract — pass one or more URLs plus either an extractPrompt or an extractJsonSchema, and get structured JSON back. Extraction runs on our on-prem GPU, so it costs the same as a plain page fetch: 1 page per URL, AI included. Up to 25 URLs per call.

Because it runs on our hardware, this path is asynchronous: the call returns 202 with a job id per URL, which you then poll.

curl -X POST https://www.miaibot.ai/api/v1/extract \
  -H "Authorization: Bearer $CRAWLER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html"],
    "extractPrompt": "Extract the title and price"
  }'

Response (202 Accepted):

{
  "ok": true,
  "mode": "async",
  "engine": "thecrawler-managed-qwen",
  "creditsHeld": 1,
  "jobs": [
    { "url": "https://books.toscrape.com/...", "id": "job_abc123",
      "poll": "/api/v1/extract/result/job_abc123" }
  ]
}

Then poll GET /v1/extract/result/{id} until it returns 200:

curl https://www.miaibot.ai/api/v1/extract/result/job_abc123 \
  -H "Authorization: Bearer $CRAWLER_KEY"

# 202 while running:  { "ok": true, "status": "processing", "pending": true }
# 200 when finished:
{
  "ok": true,
  "status": "success",
  "data": { "title": "A Light in the Attic", "price": "£51.77" }
}

Bring your own LLM (synchronous)

Prefer your own model? Add llmBaseUrl + llmModel (any public OpenAI-compatible endpoint) and /v1/extract runs synchronously, returning results in the same response instead of a job to poll.

curl -X POST https://www.miaibot.ai/api/v1/extract \
  -H "Authorization: Bearer $CRAWLER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": ["https://example.com"],
    "extractPrompt": "Extract the title",
    "llmBaseUrl": "https://api.openai.com/v1",
    "llmModel": "gpt-4o-mini"
  }'

Monitor a page for changes

POST /v1/monitoring — register a watchlist: we re-check the URL on a schedule, and when its content (or extracted data) changes we POST a watch.changed event to your webhookUrl. Each scheduled check costs 1 page. Add a prompt or jsonSchema to diff structured fields instead of raw content.

Watch count + minimum interval are set by your plan — Starter 5 (daily), Pro 25 (6-hourly), Scale 100 (hourly). List with GET /v1/monitoring; pause/resume/delete with /v1/monitoring/{id}.

curl -X POST https://www.miaibot.ai/api/v1/monitoring \
  -H "Authorization: Bearer $CRAWLER_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://competitor.com/pricing",
    "intervalSecs": 86400,
    "webhookUrl": "https://your-app.com/hooks/crawler"
  }'

Response (201 Created):

{ "ok": true,
  "watch": { "id": "w_abc123", "url": "https://competitor.com/pricing",
             "intervalSecs": 86400, "active": true } }

What you pay for

One page = one credit. AI extraction is included on every page — there is no separate extraction charge. Search bills 1 page per scraped result and monitoring bills 1 page per scheduled check — same page model, no add-on SKU. We do notcharge for fetch failures (DNS / timeout / 4xx / 5xx / blocked), empty pages, documents we can't parse, or requests rejected by our guards — those credits are returned to your balance. Credits don't expire while your account is active. Full terms on the Terms page.

Errors

statuscodemeaning
400ssrf-blockedTarget URL resolves to a private/internal host
401Missing or invalid API key
402insufficient-creditsNot enough pages on your balance
403Polling a job that belongs to a different key
404job-not-foundUnknown or expired job id
400Bad body: missing urls[], extractPrompt, or schema

More

Crawl (multi-page link following), sitemap discovery, the validated extraction contracts, and the MCP server for Claude / Cursor are documented in the open-source repo: github.com/manchittlab/TheCrawler. Questions: hello@miaibot.ai.