Here's a problem that affects over a third of major websites: they're accidentally blocking the crawlers that power ChatGPT, Claude, and Perplexity — and they have no idea.
Cloudflare published data in 2024 showing that 35.7% of the top 1 million websites block GPTBot, OpenAI's web crawler. Most of these blocks are not intentional. They're the result of a misconfigured Cloudflare setting, a copied robots.txt template, or a CDN configuration that went in years ago and that nobody has touched since.
The consequence: if you block these crawlers, AI systems cannot read your content. You're invisible to ChatGPT for research queries. Perplexity can't cite you. Claude can't reference you in comparisons. This is a zero-one problem — either the crawler can access your site or it can't. There's no partial credit.
What Are AI Crawlers?
AI crawlers are automated bots run by AI companies to collect web content. The most important user agents to know:
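- GPTBot: OpenAI's crawler. OpenAI documents it as collecting content that may be used for model training; ChatGPT search and user-initiated browsing use separate agents (OAI-SearchBot and ChatGPT-User).
- ClaudeBot: Anthropic's crawler, which collects web content that may be used to train Claude.
- PerplexityBot: Perplexity AI's crawler. Indexes pages so they can be surfaced and cited in Perplexity answers.
- Google-Extended: Google's AI-specific robots.txt token (separate from Googlebot). It is a control token rather than a separate crawler: Google's existing crawlers do the fetching, and the token controls whether your content may be used for Gemini.
- anthropic-ai: a second Anthropic user-agent token associated with training.
- CCBot: Common Crawl's crawler, whose dataset is used as training data by multiple LLMs.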
These crawlers read your robots.txt file before accessing any page on your site. If your robots.txt tells them to stay out — explicitly or via a wildcard rule — they won't read your content. It's that simple.
How to Check If You're Blocking AI Crawlers
Method 1: Quick manual check
Run this command in your terminal (replace yourdomain.com with your actual domain):
```
curl https://yourdomain.com/robots.txt
```
Look for any of these patterns:
```
# Block (explicit)
User-agent: GPTBot
Disallow: /

# Block (via wildcard)
User-agent: *
Disallow: /
```
If you see Disallow: / under User-agent: * and there is no separate User-agent group with Allow: / for GPTBot, ClaudeBot, and PerplexityBot, those crawlers are blocked.
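The manual check can also be scripted. The sketch below is a minimal example, assuming Python's standard-library `urllib.robotparser`; the function name `crawler_status` and the sample robots.txt are illustrative and not part of any tool mentioned in this article. It only evaluates robots.txt rules for one path, so it will not detect blocks applied by a CDN or firewall.

```python
from urllib.robotparser import RobotFileParser

# User-agent tokens taken from the list of AI crawlers in this article.
AI_BOTS = ["GPTBot", "ClaudeBot", "PerplexityBot",
           "Google-Extended", "anthropic-ai", "CCBot"]


def crawler_status(robots_txt: str, bots=AI_BOTS, path: str = "/") -> dict:
    """Return {bot: "Allowed" | "Blocked"} for `path` under the given robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {
        bot: "Allowed" if parser.can_fetch(bot, path) else "Blocked"
        for bot in bots
    }


# Hypothetical robots.txt: GPTBot is blocked explicitly, everyone else may crawl.
SAMPLE = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

if __name__ == "__main__":
    for bot, status in crawler_status(SAMPLE).items():
        print(f"{bot}: {status}")
```

To check a live site, save the output of the curl command to a file and pass its contents to `crawler_status`.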
Method 2: Use ConduitScore's Blocklist Alarm
Go to conduitscore.com/blocklist-alarm, enter your domain, and get an instant per-bot status: Allowed, Blocked, or Unclear. The tool parses your robots.txt and shows you exactly which AI crawlers can access your site and which are blocked.
You can also check any domain at conduitscore.com/blocklist/[domain] to see its current AI crawler status.
Method 3: Run a full AI visibility scan
A full ConduitScore scan at conduitscore.com checks crawler access as one of its 7 categories. It surfaces any blocked crawlers as critical issues and gives you the exact fix.
Why Crawlers Get Blocked Accidentally
The most common causes:
Cloudflare Bot Fight Mode
Cloudflare's "Bot Fight Mode" setting, when enabled on certain configurations, blocks non-whitelisted bots. GPTBot, ClaudeBot, and PerplexityBot are not on Cloudflare's whitelist by default. The setting looks like it's protecting you from malicious bots, but it also blocks legitimate AI crawlers.
Fix: In Cloudflare, go to Security → Bots → Bot Fight Mode. If it's enabled, either disable it or create custom rules that explicitly allow GPTBot, ClaudeBot, and PerplexityBot.
Copied robots.txt templates
Many developers copy robots.txt templates from GitHub, Stack Overflow, or CMSs. Some of these templates include aggressive wildcard blocks like:
```
User-agent: *
Disallow: /
Allow: /public/
```
This pattern blocks every bot except for the specific paths listed in Allow. If your publicly accessible content isn't in those paths, AI crawlers can't read it.
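To see how Allow and Disallow interact in this template, here is a minimal sketch of the longest-match rule described in RFC 9309 (the robots.txt standard): the matching rule with the longest path wins, and Allow wins a tie. The helper `is_allowed` is hypothetical, handles plain path prefixes only (no `*` or `$` wildcards), and assumes the crawler follows the standard. Some parsers, including Python's built-in `urllib.robotparser`, apply the first matching rule instead, so results can differ between tools.

```python
def is_allowed(rules, path):
    """Evaluate (directive, prefix) rules for one user-agent group.

    Longest matching prefix wins; on a tie, "allow" wins.
    A path that matches no rule is allowed.
    """
    best_len = -1
    allowed = True
    for directive, prefix in rules:
        if not prefix or not path.startswith(prefix):
            continue  # an empty rule value matches nothing
        is_allow = directive.lower() == "allow"
        if len(prefix) > best_len or (len(prefix) == best_len and is_allow):
            best_len = len(prefix)
            allowed = is_allow
    return allowed


# The wildcard group from the template above.
TEMPLATE_RULES = [("Disallow", "/"), ("Allow", "/public/")]

if __name__ == "__main__":
    # Example paths are made up for illustration.
    for path in ["/public/pricing.html", "/blog/launch-post", "/"]:
        verdict = "allowed" if is_allowed(TEMPLATE_RULES, path) else "blocked"
        print(f"{path}: {verdict}")
```

Under these rules only paths beneath /public/ are readable; a blog or documentation section elsewhere on the site stays blocked.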
CDN and WAF configurations
Web Application Firewalls (WAFs) and some CDN configurations block bots by user-agent string. GPTBot and ClaudeBot are relatively new; many WAF blocklists were written before they existed and don't distinguish between malicious scrapers and legitimate AI crawlers.
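Because these blocks happen at the network edge, they never show up in robots.txt. One rough way to spot them is to request the same page with a browser-like and a crawler-like User-Agent header and compare the status codes. The sketch below assumes Python's standard `urllib.request`; the function names and the status-code heuristics are illustrative assumptions. Treat the result as a hint only: many firewalls verify crawlers by IP address, so a spoofed request can be blocked even when the genuine crawler is allowed.

```python
import urllib.error
import urllib.request


def fetch_status(url: str, user_agent: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request sent with `user_agent`."""
    request = urllib.request.Request(url, headers={"User-Agent": user_agent})
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses arrive as exceptions


def compare_statuses(browser_status: int, bot_status: int) -> str:
    """Classify a pair of status codes from the two requests (heuristic)."""
    if browser_status >= 400:
        return "inconclusive: the page fails for a normal browser too"
    if bot_status in (401, 403, 429, 503):
        return "likely blocked by user-agent"
    if bot_status >= 400:
        return f"unclear: bot request returned {bot_status}"
    return "no user-agent block detected"


if __name__ == "__main__":
    # Status codes a firewall rule might produce; no network call is made here.
    print(compare_statuses(200, 403))
    print(compare_statuses(200, 200))
```

For a live check, call `fetch_status` twice on the same URL, once with a browser-style string and once with a string containing the bot's token (for example `"GPTBot"`, a shortened stand-in for the real header), then pass both codes to `compare_statuses`.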
WordPress plugins
Several WordPress security plugins (Wordfence, iThemes Security, All in One WP Security) include settings that block bot traffic. If you installed one of these and didn't configure the exceptions manually, AI crawlers may be blocked.
How to Fix Blocked AI Crawlers
Add these rules to your robots.txt file. If you already have a User-agent: * block, add bot-specific Allow rules before the wildcard block:
```
# AI crawlers: explicitly allow
User-agent: GPTBot
Allow: /

User-agent: ClaudeBot
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Google-Extended
Allow: /

User-agent: anthropic-ai
Allow: /

# Your existing rules below
User-agent: *
Disallow: /private/
```
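Bot-specific rules take precedence over wildcard rules in robots.txt: a crawler follows the most specific User-agent group that names it and ignores the User-agent: * group. Adding an explicit Allow: / group for each AI crawler therefore overrides any wildcard Disallow for that crawler.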
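The precedence rule can be checked before you deploy. The following sketch, assuming Python's standard `urllib.robotparser`, builds the bot-specific groups, places them ahead of an existing wildcard block, and confirms that a named crawler is allowed while an unnamed one is still covered by the wildcard. `build_allow_groups`, `can_crawl`, and the sample rules are hypothetical.

```python
from urllib.robotparser import RobotFileParser


def build_allow_groups(bots) -> str:
    """Return one 'User-agent / Allow: /' group per bot, separated by blank lines."""
    return "\n\n".join(f"User-agent: {bot}\nAllow: /" for bot in bots)


def can_crawl(robots_txt: str, bot: str, path: str = "/") -> bool:
    """Check `path` for `bot` against robots.txt text."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(bot, path)


# Hypothetical existing file that blocks everything for every bot.
EXISTING = "User-agent: *\nDisallow: /"

UPDATED = (
    build_allow_groups(["GPTBot", "ClaudeBot", "PerplexityBot"])
    + "\n\n"
    + EXISTING
)

if __name__ == "__main__":
    print(UPDATED)
    print("GPTBot allowed:", can_crawl(UPDATED, "GPTBot"))
    print("SomeOtherBot allowed:", can_crawl(UPDATED, "SomeOtherBot"))
```

Running the same check against your real robots.txt text before and after the change shows whether the new groups have the intended effect.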
Verifying the Fix
After updating your robots.txt:
1. Wait 5–10 minutes for server caches to clear.
2. Re-run the curl command: curl https://yourdomain.com/robots.txt to confirm the new rules are live.
3. Use the ConduitScore Blocklist Alarm at conduitscore.com/blocklist/yourdomain.com to verify each bot shows "Allowed."
4. Run a full AI visibility scan to confirm the Crawler Access category score improved.
A Note on Privacy and Security
Allowing AI crawlers does not expose any private data. The crawlers only access publicly available pages — exactly the same pages any human visitor can see. If you have private content behind authentication, it's protected regardless of your robots.txt configuration.
The only content at risk would be content that is currently "security through obscurity" — publicly accessible but not easily discoverable. If you have content you don't want indexed, put it behind proper authentication, not robots.txt.
Frequently Asked Questions
Will allowing AI crawlers slow down my site?
Minimal impact. AI crawlers typically crawl at low rates (far lower than Googlebot). If you want to rate-limit them, you can add Crawl-delay: 10 under each User-agent block. Keep in mind that Crawl-delay is a non-standard extension to robots.txt, so support varies by crawler (Googlebot, for example, ignores it).
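As an illustration of the directive, the sketch below parses a hypothetical robots.txt and reads the delay back with Python's standard `urllib.robotparser`. It only shows what a parser that supports `Crawl-delay` would see; whether a particular crawler acts on the value is up to its operator.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical rules: allow GPTBot everywhere but ask for 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: GPTBot
Allow: /
Crawl-delay: 10

User-agent: *
Disallow: /private/
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

gptbot_delay = parser.crawl_delay("GPTBot")     # delay requested for GPTBot
default_delay = parser.crawl_delay("OtherBot")  # wildcard group sets no delay

if __name__ == "__main__":
    print("GPTBot delay:", gptbot_delay)
    print("Default delay:", default_delay)
```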
Should I also allow Google-Extended?
Yes, if you want Google to be able to use your content for Gemini. Google-Extended is separate from Googlebot: blocking it won't affect your Google search rankings, but it tells Google not to use your content for its Gemini models.
What if I don't want my content in AI training data?
That's a separate concern from crawler access for AI answers. If you want to opt out of training data but still appear in AI answers, check each vendor's crawler documentation: OpenAI and Anthropic publish separate user agents for training and for search or user-initiated browsing, along with their training data opt-out processes.
How long until AI systems index my site after I allow crawlers?
It varies by AI system. Perplexity tends to be faster (days to weeks). For ChatGPT training data, the timeline is longer and depends on when OpenAI next updates its model. For real-time browsing, where pages are fetched live in response to a query, the impact is more immediate.
The fastest way to see your current crawler status and get a prioritized fix list is to run a free scan at conduitscore.com. It takes 15 seconds and shows you exactly which crawlers can access your site right now.