How AI Crawlers Read Websites

What GPTBot, ClaudeBot, PerplexityBot, Google-Extended, and Applebot-Extended actually do when they hit your site, and the robots.txt strategy that keeps you visible.

By Annette Thompson · Updated May 9, 2026 · 14 min read

The crawlers that feed AI search engines are technically simpler than Googlebot but politically more complicated. Googlebot, after twenty-plus years of public engagement, has well-documented behavior, a public Search Central documentation site, and a stable identity that webmasters know how to handle. The 2026 AI crawler landscape involves about 15 distinct user agents from roughly ten organizations, with overlapping but non-identical purposes, varying degrees of robots.txt compliance, and behavior that’s still being reverse-engineered through real server log analysis.

The practical implication for any business that wants AI search visibility is that handling AI crawlers correctly has become a real technical SEO discipline. Block too aggressively and you disappear from AI answers. Allow too loosely and you may be feeding training corpora you’d rather not feed. The middle path requires understanding what each crawler does, what it doesn’t do, and how to write a robots.txt that gets the policy right.

The 2026 AI crawler landscape

The user agents that matter, organized by organization:

OpenAI (3 distinct bots)

  • GPTBot — The training crawler. Fetches content for inclusion in GPT model training corpora.
  • ChatGPT-User — User-initiated fetcher. Triggered when a ChatGPT user asks a question that requires fetching a specific URL.
  • OAI-SearchBot — Web search crawler. Indexes content for ChatGPT’s web search capability.

Documented at platform.openai.com/docs/bots. All three respect robots.txt directives addressed to their specific user agent strings.

Anthropic (3 distinct bots)

  • ClaudeBot — Training crawler. Fetches content for Claude model training.
  • Claude-User — User-initiated fetcher when a Claude user asks the model to access a specific URL.
  • Claude-SearchBot — Search index crawler for Claude’s web search capability.

Anthropic publishes the policy at anthropic.com/contact-sales/for-publishers and documents that the three bots have different purposes (ALM Corp). Blocking ClaudeBot blocks training; blocking Claude-SearchBot blocks Claude’s ability to retrieve and cite your content during user conversations.

Google (1 AI-specific bot)

  • Google-Extended — Opt-out for AI training and Bard/Gemini grounding. Separate from Googlebot, which continues to crawl for classic search.

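Google-Extended is a robots.txt control token rather than a separately identified crawler, and Bard has since been renamed Gemini. Blocking Google-Extended does not affect classic Google search rankings. It does affect whether Google uses your content for training and grounding Gemini models. AI Overviews are a Google Search feature, so they are governed by Googlebot access and snippet controls rather than by Google-Extended.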

Apple (1 AI-specific bot)

  • Applebot-Extended — Opt-out for Apple Intelligence training. Separate from Applebot which crawls for Siri and Spotlight.

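Apple introduced Applebot-Extended in 2024 as part of the Apple Intelligence rollout. Per Apple’s documentation, it does not crawl pages itself: it is a robots.txt token that controls how content already fetched by Applebot may be used. Blocking it does not affect Siri or Spotlight.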

Perplexity (2 documented bots, with controversy)

  • PerplexityBot — Stated training/index crawler. Documented to respect robots.txt.
  • Perplexity-User — User-initiated fetcher.

In 2024 and 2025, Wired, Cloudflare, and others documented Perplexity using stealth crawlers that don’t identify as PerplexityBot and that have been observed ignoring robots.txt. The company’s public position is that PerplexityBot respects robots.txt; the on-the-ground evidence is mixed.

ByteDance / Other

  • Bytespider — TikTok/ByteDance training crawler. Documented as ignoring robots.txt in many cases.
  • Meta-ExternalAgent — Meta’s AI training crawler.
  • Amazonbot — Used for Alexa Q&A and AI features.
  • Diffbot — General-purpose AI extraction crawler.
  • CCBot — Common Crawl, which feeds many open-source training datasets.

That’s 15 distinct user agents from 10 organizations, all worth handling explicitly in a 2026 robots.txt.

What AI crawlers actually do (and don’t do)

The most important technical fact about AI crawlers in 2026: most of them don’t render JavaScript.

Vercel’s analysis of AI crawler behavior found that GPTBot, ClaudeBot, PerplexityBot, and similar bots fetch the initial HTML response and extract text from the raw HTML. They don’t run a headless Chrome to execute JavaScript and wait for client-rendered content to populate. ChatGPT and Claude crawlers do fetch JavaScript files (ChatGPT: 11.50%, Claude: 23.84% of requests), but they don’t execute them. They see the JavaScript files as text content, not as code that produces content.

The implication is severe and often missed:

A React or Vue SPA without server-side rendering shows up blank to AI crawlers. The HTML they fetch is essentially <div id="root"></div> plus some script tags. Whatever content the application renders client-side is invisible to the crawler, even though it’s perfectly visible to a human user with a modern browser.

We’ve audited Front Range businesses in 2026 that ranked in Google’s top 5 for their primary commercial keywords (because Googlebot does render JavaScript) and were entirely invisible to ChatGPT, Claude, and Perplexity. The fix isn’t a content rewrite. It’s a rendering-architecture change: server-side rendering, static generation, or a pre-rendering service like Prerender.io that serves pre-rendered HTML to recognized AI crawlers while serving the SPA to humans.
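One way to check this on your own pages is to measure how much visible text the raw HTML contains before any JavaScript runs. The sketch below is a minimal illustration, not a description of how any crawler works internally. The function names, the 50-word threshold, and the audit user-agent string are all assumptions chosen for the example.

```python
import urllib.request
from html.parser import HTMLParser


class VisibleText(HTMLParser):
    """Collect text that sits outside script, style, noscript, and template."""

    SKIP = {"script", "style", "noscript", "template"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_word_count(raw_html: str) -> int:
    """Count words a non-rendering client could read from the raw HTML."""
    parser = VisibleText()
    parser.feed(raw_html)
    parser.close()
    return len(" ".join(parser.chunks).split())


def looks_like_empty_shell(raw_html: str, min_words: int = 50) -> bool:
    """Heuristic: very little visible text suggests a client-rendered shell.

    The 50-word default is an arbitrary assumption; tune it for your pages.
    """
    return visible_word_count(raw_html) < min_words


def fetch_raw_html(url: str, user_agent: str = "example-audit/0.1") -> str:
    """Fetch the initial HTML response without executing any JavaScript.

    The user-agent value is a hypothetical placeholder for your own tool.
    """
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

Running `looks_like_empty_shell(fetch_raw_html(url))` against a key landing page gives a quick first signal. A page that passes here can still hide individual sections behind client-side rendering, so spot-check the specific sentences you want cited.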

What gets parsed when a crawler hits a page

A simplified view of what GPTBot, ClaudeBot, and PerplexityBot do when they fetch a URL:

  1. Issue an HTTP request with the appropriate user-agent string.
  2. Receive the raw HTML response.
  3. Extract text from common content elements: <p>, <h1> through <h6>, <li>, <blockquote>, <table>, and similar.
  4. Extract structured data from JSON-LD blocks in the document head.
  5. Extract metadata from <title>, meta description, Open Graph tags, and similar.
  6. Tokenize the text content for indexing.
  7. Move on.
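The extraction steps above can be approximated with Python’s standard-library HTML parser. This is a sketch for inspecting your own pages, not the crawlers’ actual code. The class name and tag list are ours, it assumes explicit closing tags on text elements, it collects table cells rather than whole tables, and it leaves out tokenization (step 6).

```python
import json
from html.parser import HTMLParser

# Elements whose text we treat as content (an assumption for this sketch).
TEXT_TAGS = {"p", "h1", "h2", "h3", "h4", "h5", "h6",
             "li", "blockquote", "td", "th"}


class PageExtractor(HTMLParser):
    """Pull title, meta tags, JSON-LD, and text blocks out of raw HTML."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}        # name/property -> content
        self.json_ld = []     # parsed JSON-LD objects
        self.blocks = []      # text of each content element, in source order
        self._raw = None      # "title", "jsonld", "skip", or None
        self._raw_buf = []
        self._depth = 0       # nesting depth inside TEXT_TAGS
        self._text_buf = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "title":
            self._raw, self._raw_buf = "title", []
        elif tag == "script":
            is_ld = a.get("type") == "application/ld+json"
            self._raw, self._raw_buf = ("jsonld" if is_ld else "skip"), []
        elif tag == "style":
            self._raw, self._raw_buf = "skip", []
        elif tag == "meta":
            key = a.get("name") or a.get("property")
            if key and a.get("content"):
                self.meta[key] = a["content"]
        elif tag in TEXT_TAGS:
            self._depth += 1

    def handle_endtag(self, tag):
        if tag in ("title", "script", "style") and self._raw:
            raw = "".join(self._raw_buf).strip()
            if self._raw == "title":
                self.title = raw
            elif self._raw == "jsonld":
                try:
                    self.json_ld.append(json.loads(raw))
                except json.JSONDecodeError:
                    pass  # malformed JSON-LD is skipped, not repaired
            self._raw = None
        elif tag in TEXT_TAGS and self._depth > 0:
            self._depth -= 1
            if self._depth == 0:
                text = " ".join("".join(self._text_buf).split())
                if text:
                    self.blocks.append(text)
                self._text_buf = []

    def handle_data(self, data):
        if self._raw:
            self._raw_buf.append(data)
        elif self._depth > 0:
            self._text_buf.append(data)


def extract(raw_html: str) -> PageExtractor:
    parser = PageExtractor()
    parser.feed(raw_html)
    parser.close()
    return parser
```

Feeding it the raw response for a page and reading `blocks` in order shows roughly what a text-only client has to work with. Anything missing from that list is worth investigating before touching the copy.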

What they don’t do:

  • Execute JavaScript that produces content.
  • Render the page visually.
  • Compute layout to determine what’s “above the fold.”
  • Follow infinite-scroll patterns.
  • Submit forms or click buttons.
  • Wait for images to load.

This is why the Princeton GEO paper’s finding about “visual placement” mattering for citation is somewhat misleading without context. The “above the fold” weighting in AI search isn’t computed visually by the crawler. It’s computed by the re-ranking layer based on the position of content within the raw HTML source. If your direct answer is in the first <p> tag inside the <main> element, it’s “above the fold” to the AI engine. If it’s in the eighth <p> tag, it’s “below the fold.”
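To see where a direct answer sits in source order, you can count paragraphs inside the main element of the raw HTML. The sketch below is a hypothetical measuring aid: `answer_position` and its behavior are our own invention, and it only reports a paragraph index. It does not model how any re-ranking layer scores position.

```python
from html.parser import HTMLParser


class MainParagraphs(HTMLParser):
    """Collect the text of each non-empty <p> found inside <main>."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_main = 0
        self._in_p = False
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "main":
            self._in_main += 1
        elif tag == "p" and self._in_main:
            self._flush()  # a new <p> implicitly closes an open one
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "main" and self._in_main:
            self._flush()
            self._in_main -= 1
        elif tag == "p":
            self._flush()

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)

    def _flush(self):
        if self._in_p:
            text = " ".join("".join(self._buf).split())
            if text:
                self.paragraphs.append(text)
        self._in_p = False
        self._buf = []


def answer_position(raw_html: str, phrase: str):
    """Return the 1-based index of the first <main> paragraph containing
    the phrase (case-insensitive), or None if it never appears there."""
    parser = MainParagraphs()
    parser.feed(raw_html)
    parser.close()
    needle = phrase.lower()
    for index, text in enumerate(parser.paragraphs, start=1):
        if needle in text.lower():
            return index
    return None
```

A result of 1 or 2 means the answer leads the content in source order. A `None` result means the phrase is either outside the main element or not in the raw HTML at all.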

The 2026 robots.txt strategy

Robots.txt is no longer a file you write once and ignore. It’s a living policy that needs to handle a dozen-plus AI bots across multiple organizations. There are three legitimate strategies, depending on your goals:

Strategy 1: Allow everything (maximum AI visibility)

For a small business that wants every AI search citation it can get, the right policy is to allow all known AI bots and not even mention them in robots.txt. The default behavior is “allow,” so any bot you don’t mention is allowed.

Sample minimal robots.txt:

User-agent: *
Disallow: /admin/
Disallow: /private/

Sitemap: https://yoursite.com/sitemap.xml

This allows all bots, AI or otherwise, except in specific paths you don’t want indexed. Maximum visibility upside, no opt-out from training corpora.

Strategy 2: Allow retrieval, block training

This strategy suits a publisher that wants to be cited in AI answers but doesn’t want its content used to train future models. It requires explicitly allowing the search/retrieval bots and blocking the training bots.

# OpenAI
User-agent: OAI-SearchBot
Allow: /
User-agent: ChatGPT-User
Allow: /
User-agent: GPTBot
Disallow: /

# Anthropic
User-agent: Claude-SearchBot
Allow: /
User-agent: Claude-User
Allow: /
User-agent: ClaudeBot
Disallow: /

# Google
User-agent: Googlebot
Allow: /
User-agent: Google-Extended
Disallow: /

# Apple
User-agent: Applebot
Allow: /
User-agent: Applebot-Extended
Disallow: /

# Perplexity
User-agent: PerplexityBot
Allow: /
User-agent: Perplexity-User
Allow: /

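# Default
# Note: a bot that matches a named group above follows only that group,
# so repeat this Disallow in those groups if it must apply to them too.
User-agent: *
Disallow: /admin/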

Sitemap: https://yoursite.com/sitemap.xml

This is the policy major publishers like the New York Times, Reuters, and the Wall Street Journal have adopted. It signals “we want to be cited but not used as training data.” It’s also the policy we recommend most often for established content businesses.
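Before deploying a policy like this, it helps to test it against the user agents you care about. The sketch below uses Python’s standard-library `urllib.robotparser` on a shortened version of the policy. `check_policy` is a hypothetical helper of ours, and the sample URLs are placeholders. Python’s parser appears to match user-agent names by substring in file order, which can differ from how real crawlers pick a group (for example, Applebot versus Applebot-Extended), so treat the output as a first check rather than proof.

```python
from urllib.robotparser import RobotFileParser

# A shortened version of the Strategy 2 policy, for illustration only.
ROBOTS_TXT = """\
User-agent: OAI-SearchBot
Allow: /

User-agent: GPTBot
Disallow: /

User-agent: Claude-SearchBot
Allow: /

User-agent: ClaudeBot
Disallow: /

User-agent: *
Disallow: /admin/
"""


def check_policy(robots_txt: str, agents: list, url: str) -> dict:
    """Return {agent: True/False} for whether each agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["OAI-SearchBot", "GPTBot", "Claude-SearchBot", "ClaudeBot"]
    for agent, allowed in check_policy(
        ROBOTS_TXT, agents, "https://yoursite.com/blog/post"
    ).items():
        print(f"{agent}: {'allowed' if allowed else 'blocked'}")
```

The same helper exposes a subtle side effect of named groups: checking `https://yoursite.com/admin/` for OAI-SearchBot reports it as allowed, because a bot with its own group does not inherit the `Disallow: /admin/` rule from the default group.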

Strategy 3: Block aggressive crawlers, allow legitimate ones

Some sites in 2026 are being hit hard by aggressive AI crawlers (Bytespider, Meta-ExternalAgent, and unknown stealth crawlers) that consume meaningful bandwidth. The block-and-allowlist approach:

# Block specific aggressive bots
User-agent: Bytespider
Disallow: /
User-agent: Meta-ExternalAgent
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: Diffbot
Disallow: /

# Allow named retrieval bots
User-agent: OAI-SearchBot
Allow: /
User-agent: PerplexityBot
Allow: /
User-agent: Claude-SearchBot
Allow: /
# [etc. Repeat for each retrieval bot you want to allow.]

The honest caveat: robots.txt is a polite request, not an enforcement mechanism. Bots that don’t respect it (Bytespider has been documented ignoring robots.txt repeatedly) require server-level enforcement through firewall rules, Cloudflare bot management, or similar tooling.

Server-level enforcement for crawlers that ignore robots.txt

For high-traffic sites, robots.txt isn’t enough. Cloudflare runs a Verified Bots program and introduced AI bot blocking features in 2024. Similar capabilities exist in Akamai, Fastly, and most modern WAF products. The minimum viable enforcement stack:

  1. Robots.txt as the polite policy layer.
  2. Cloudflare or equivalent CDN with AI bot management enabled.
  3. Server log analysis to spot unrecognized user agents, with rate limiting for the ones that misbehave.
  4. Periodic audits of who’s actually crawling your site versus who claims to be.
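The log-analysis step can start as a short script. The sketch below counts hits per known AI user agent and separately tallies other self-declared bots for review. It assumes the common Apache/Nginx combined log format, and the function names and the bot/crawler/spider keyword heuristic are our own choices. Google-Extended and Applebot-Extended are left out because, per Google’s and Apple’s documentation, they are robots.txt tokens and are not expected to appear as user-agent strings in logs.

```python
import re
from collections import Counter

# User-agent tokens from the landscape list that can show up in logs.
KNOWN_AI_AGENTS = [
    "GPTBot", "ChatGPT-User", "OAI-SearchBot",
    "ClaudeBot", "Claude-User", "Claude-SearchBot",
    "PerplexityBot", "Perplexity-User",
    "Bytespider", "Meta-ExternalAgent", "Amazonbot", "Diffbot", "CCBot",
]

# Combined Log Format:
# ip - - [time] "request" status bytes "referer" "user-agent"
LINE_RE = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ "[^"]*" "(?P<ua>[^"]*)"'
)


def classify(user_agent: str):
    """Return the known AI agent token found in the string, or None."""
    lowered = user_agent.lower()
    for token in KNOWN_AI_AGENTS:
        if token.lower() in lowered:
            return token
    return None


def count_ai_hits(log_lines):
    """Return (hits per known AI agent, hits per other self-declared bot)."""
    hits = Counter()
    other_bots = Counter()
    for line in log_lines:
        match = LINE_RE.match(line)
        if not match:
            continue  # skip lines that are not in the assumed format
        user_agent = match.group("ua")
        name = classify(user_agent)
        if name:
            hits[name] += 1
        elif re.search(r"bot|crawler|spider", user_agent, re.IGNORECASE):
            other_bots[user_agent] += 1
    return hits, other_bots
```

Pass it an open file handle, for example `count_ai_hits(open("access.log"))`. These counts reflect claimed identities only, since a user-agent string can be spoofed, so pair them with IP verification before acting on them.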

For most small businesses on the Front Range, robots.txt alone is enough. The aggressive-crawler problem is mostly a publisher and large-site problem.

Verifying crawler identity

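A common mistake: taking a user-agent string at face value. A request that claims to be GPTBot may come from a spoofed user agent, so blocking or allowing it on that basis acts on the wrong identity. Real crawlers can be verified through published IP ranges or reverse DNS. The verification sources: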

  • OpenAI: Publishes IP ranges at platform.openai.com/docs/bots.
  • Anthropic: Publishes IP ranges in its publisher documentation.
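  • Google-Extended: A robots.txt token only. The crawling is done by Google’s existing crawlers, which are verifiable against Google’s published IP ranges.
  • Applebot-Extended: Also a robots.txt token. The crawling is done by Applebot, which is verifiable against Apple’s published IP ranges and through reverse DNS.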
  • PerplexityBot: Publishes IP ranges, though stealth crawlers from Perplexity have been observed using residential IP space.

A user-agent string can be spoofed trivially. Checking the source IP against published ranges, or against reverse DNS where the provider supports it, is the only reliable identity check. Server log analysis tools like Cloudflare Logpush, GoAccess, or Splunk can automate this.
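The IP-range check itself is simple once you have each provider’s published list. The sketch below uses Python’s standard-library `ipaddress` module. The CIDR blocks shown are placeholders from the reserved documentation ranges, not real crawler ranges, and `is_verified` is a hypothetical helper: load the actual ranges from each provider’s documentation before relying on it.

```python
import ipaddress

# PLACEHOLDER ranges from the reserved documentation blocks (RFC 5737).
# These are NOT real crawler ranges; replace them with each provider's
# published list.
PUBLISHED_RANGES = {
    "GPTBot": ["192.0.2.0/24"],
    "ClaudeBot": ["198.51.100.0/24"],
}


def is_verified(claimed_agent: str, ip: str, ranges=PUBLISHED_RANGES) -> bool:
    """True if ip falls inside a range published for the claimed agent."""
    try:
        address = ipaddress.ip_address(ip)
    except ValueError:
        return False  # not a parseable IP address
    for cidr in ranges.get(claimed_agent, []):
        if address in ipaddress.ip_network(cidr):
            return True
    return False
```

A request whose user agent says GPTBot but whose IP fails this check is a candidate for rate limiting or blocking at the firewall. An agent with no entry in the table is treated as unverified rather than trusted.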

What to do when a crawler isn’t visiting

A common diagnostic situation: you’ve optimized your site, but you don’t see AI search citations and your server logs don’t show the AI crawlers visiting at the expected rate. The diagnostic checklist:

  1. Check robots.txt for accidental blocks. Many sites inherit a 2018-era robots.txt that blocks aggressive crawlers wholesale and accidentally catches modern AI bots.
  2. Check CDN/WAF rules. Cloudflare’s AI bot blocking features can be turned on globally and block more aggressively than expected.
  3. Check sitemap submission. AI crawlers often discover URLs through sitemaps. Submit sitemaps to Bing Webmaster Tools (which feeds OAI-SearchBot and Copilot) and to Google Search Console.
  4. Check IndexNow protocol. Implementing IndexNow speeds up Bing-based crawler discovery, which lifts ChatGPT and Copilot retrieval.
  5. Check internal linking. AI crawlers follow links. Pages with no internal links pointing to them get crawled less.
  6. Check rendering. Re-verify that the content is in the raw HTML response, not in a JavaScript-populated DOM.
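For the IndexNow step in the checklist above, a submission is a single JSON POST. The sketch below builds that request with the standard library, based on our reading of the public IndexNow protocol: the endpoint URL, the JSON field names, and the convention of hosting the key as a text file on your site are assumptions to confirm against the current IndexNow documentation. `build_indexnow_request` is a hypothetical helper and the key value is a placeholder.

```python
import json
import urllib.request
from urllib.parse import urlparse

# Assumed shared endpoint; confirm against the IndexNow documentation.
INDEXNOW_ENDPOINT = "https://api.indexnow.org/indexnow"


def build_indexnow_request(urls, key, key_location=None):
    """Build (but do not send) an IndexNow submission for one host."""
    hosts = {urlparse(url).netloc for url in urls}
    if len(hosts) != 1:
        raise ValueError("all URLs in one submission must share a host")
    payload = {"host": hosts.pop(), "key": key, "urlList": list(urls)}
    if key_location:
        payload["keyLocation"] = key_location
    return urllib.request.Request(
        INDEXNOW_ENDPOINT,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json; charset=utf-8"},
        method="POST",
    )


if __name__ == "__main__":
    request = build_indexnow_request(
        ["https://yoursite.com/", "https://yoursite.com/services"],
        key="your-indexnow-key",  # placeholder, not a real key
    )
    # Sending is left commented out so the sketch has no side effects:
    # with urllib.request.urlopen(request, timeout=10) as response:
    #     print(response.status)
    print(request.full_url, request.data.decode("utf-8"))
```

Building the request separately from sending it makes the payload easy to inspect and log before anything leaves your server.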

If the crawlers are visiting but the citations aren’t appearing, the issue is with the re-ranking layer, not the retrieval layer. The fix is content engineering, not technical work.

A small case study: A Front Range professional services firm

We audited a Boulder firm in early 2026 that was getting strong Google rankings and zero AI search visibility. The technical findings:

  • Server logs showed Googlebot at expected volume and PerplexityBot at near-zero volume.
  • robots.txt had been generated by a 2019 WordPress plugin and explicitly blocked anything matching *Bot* in user agent. PerplexityBot was caught by that wildcard.
  • GPTBot was allowed, but Cloudflare’s AI bot management feature was set to “block AI crawlers” by default.
  • The site rendered server-side, so rendering wasn’t the issue.

The fix took 30 minutes: rewrote robots.txt to explicitly handle each AI bot, turned off Cloudflare’s blanket AI block while keeping bot-management for unknown crawlers, and submitted updated sitemaps to both Bing Webmaster Tools and the IndexNow API. Within 8 days PerplexityBot was crawling the site at expected volume, and the first Perplexity citations appeared in week 3.

The lesson: a meaningful percentage of “we’re invisible to AI search” complaints in 2026 are technical access issues, not content issues. Always check the access layer first.

The training opt-out question

Whether to allow AI training on your content is a real strategic question, not a default. Three positions we’ve seen clients take:

Position 1: Allow training (maximum visibility). The reasoning is that training-corpus inclusion is one of the highest-leverage long-term visibility signals. A brand that ends up in GPT-6’s parametric knowledge gets recommended for years without retrieval. This is the default for most small businesses we work with.

Position 2: Block training, allow retrieval (publisher position). The reasoning is that allowing training without compensation is undervalued long-term and that AI engines should pay or partner before using content for training. Allowing retrieval (search) but not training preserves citation potential while limiting training-corpus contribution. This is the position most large publishers have taken.

Position 3: Block everything (legal/competitive moat). The reasoning is that the content itself is a competitive moat and that any AI system surfacing it reduces commercial value. Used by some legal-research, financial-data, and proprietary-research businesses. Comes with a near-total absence from AI search results.

Position 1 is the right answer for almost every small and mid-sized business we work with. Positions 2 and 3 are appropriate for specific publisher and proprietary-data cases.

Frequently asked questions

What user agent does ChatGPT use to crawl websites?

ChatGPT uses three distinct user agents: GPTBot for training data collection, ChatGPT-User for user-initiated URL fetches, and OAI-SearchBot for ChatGPT’s web search retrieval. Each can be controlled independently in robots.txt. Blocking GPTBot prevents training-corpus inclusion. Blocking OAI-SearchBot prevents ChatGPT from finding and citing your content during user conversations.

How do I tell if AI crawlers are visiting my site?

Check server logs for the documented user agent strings: GPTBot, ChatGPT-User, OAI-SearchBot, ClaudeBot, Claude-User, Claude-SearchBot, PerplexityBot, and Perplexity-User. Google-Extended and Applebot-Extended are robots.txt tokens rather than separate crawlers, so look for Googlebot and Applebot instead. For verification, cross-reference visitor IPs against the published IP ranges from each provider. Tools like Cloudflare Logpush, GoAccess, or simple grep against access logs work well for small sites.

Should I block AI crawlers in robots.txt?

For most small businesses, no. Blocking AI crawlers blocks AI search visibility, which is one of the fastest-growing organic discovery surfaces. The exception is publishers who want to monetize content licensing rather than feed it free to training corpora. The middle path is allowing search-specific bots (OAI-SearchBot, Claude-SearchBot, PerplexityBot) while blocking training-specific bots (GPTBot, ClaudeBot, Google-Extended).

Do AI crawlers render JavaScript?

Most AI crawlers do not execute JavaScript as of 2026. GPTBot, ClaudeBot, PerplexityBot, OAI-SearchBot, and Claude-SearchBot fetch the initial HTML response and extract text from the raw HTML. Client-side-rendered SPAs are essentially invisible to these crawlers. Server-side rendering, static generation, or pre-rendering services are necessary for sites that want AI search visibility.

What’s the difference between Google-Extended and Googlebot?

Googlebot crawls for classic Google Search and respects standard robots.txt directives. Google-Extended is a separate robots.txt token introduced in 2023 that controls whether your content is used for training and grounding Gemini (formerly Bard). Blocking Google-Extended does not affect classic Google search rankings. It does affect whether your content gets used in Gemini training and grounding. AI Overviews are a Search feature and follow Googlebot access and snippet controls instead.

Does Perplexity respect robots.txt?

PerplexityBot respects robots.txt according to Perplexity’s published policy. However, Perplexity has been documented (by Wired, Cloudflare, and others) using stealth crawlers that don’t identify as PerplexityBot and have been observed ignoring robots.txt. For sites concerned about unauthorized crawling, server-level bot management is necessary alongside robots.txt.

How do I get AI crawlers to visit my site more frequently?

Submit your sitemap to Bing Webmaster Tools (feeds OAI-SearchBot and Copilot). Implement the IndexNow protocol for fast notification of content updates. Maintain strong internal linking from frequently crawled pages. Publish on a regular cadence so crawlers learn to return more often. Verify that no robots.txt or CDN rules accidentally block the AI bots you want.

Can I see what content AI crawlers extracted from my site?

Indirectly yes, through testing. Ask ChatGPT, Claude, and Perplexity questions that should pull from your site’s content and observe whether they cite or quote it. For technical inspection, services like Diffbot and Mozilla’s Readability approximate the content extraction patterns AI crawlers use. The output of a Readability run on your page is a reasonable approximation of what an AI crawler “sees.”

Final thought

AI crawlers are a new layer of technical SEO discipline that didn’t exist in 2022 and now consumes a meaningful share of every audit we run. The work is approachable. Read the user-agent docs. Audit your robots.txt. Verify server-rendered HTML. Check crawler hits in server logs monthly. Submit your sitemap to the discovery surfaces that matter.

The businesses that handle this well in 2026 are the ones that treat crawler access as a basic technical hygiene step rather than an afterthought. The ones that don’t handle it well are often invisible in AI search for reasons that take 30 minutes to fix.


