Should You Block AI Crawlers: Pros and Cons


Ishtiaque Ahmed

Blocking AI training crawlers protects your content, cuts server costs by up to 75%, and has zero measurable impact on Google search rankings, a finding confirmed by Google's own documentation and validated across 6,000+ publisher sites. The real risk? Blocking the wrong type of AI crawler and losing visibility in AI search platforms, where referral traffic grew 25x in a single year.

The recommended approach is selective blocking: block training crawlers, allow search and assistant bots.

| Factor | Pro (Blocking Training Crawlers) | Con (Blocking All AI Crawlers) |
| --- | --- | --- |
| Content protection | Prevents unauthorized AI model training | No additional benefit vs. selective blocking |
| Server costs | Up to 75% bandwidth reduction (documented savings) | Same savings achievable with selective blocking |
| Google search rankings | No impact confirmed by Google | No impact on organic rankings either way |
| AI search visibility | Preserved: NYT still gets 240,600 ChatGPT visits despite blocking GPTBot | Lost: removes your content from AI search results |
| AI referral traffic | Maintained through search/assistant bots | Eliminated: cuts off a channel growing 25x YoY |
| Analytics integrity | Reduces 16% bot-caused invalid traffic | Same benefit with selective blocking |
| Legal positioning | Strengthens licensing/compensation claims | Same benefit |
| Maintenance burden | Moderate: robots.txt + quarterly review | High: must track all bot categories continuously |

That table covers the decision at a glance. The rest of this article gives you the framework, evidence, and implementation details to execute it correctly, starting with a question most site owners haven’t thought to ask.

27% of Websites Block AI Crawlers Without Knowing It

Before you decide whether to block, confirm whether your site is already blocking. Research by ParseAI analyzing ~3,000 websites (mostly US/UK B2B SaaS and eCommerce) found that 27% block at least one major LLM crawler, and most of that blocking happens at the CDN or WAF layer, not via robots.txt. Marketing teams often have no idea it’s occurring.

This blind spot widened in mid-2025. Cloudflare began blocking AI crawlers by default for all new domains starting July 1, 2025. Roughly 20% of global internet traffic passes through Cloudflare’s network, meaning millions of sites may be blocking AI crawlers through a default setting no one on the content team consciously chose.

One SEO practitioner on r/TechSEO confirmed this is already in effect:

“If you go into your Cloudflare settings, you’ll see that the ai bots are currently being blocked by default. It’s already happening. So if you want to allow them to crawl, you need to change it to allow them to crawl.” — u/billhartzer (2 upvotes)

Your content team could be investing hours in AI search optimization while your infrastructure silently prevents AI platforms from seeing any of it.

Audit Your Current AI Crawler Access First

Three places to check before making any blocking decisions:

  1. robots.txt: Navigate to yourdomain.com/robots.txt and look for User-agent lines referencing GPTBot, ClaudeBot, PerplexityBot, or Google-Extended. Any followed by Disallow: / means that crawler is blocked. Free tools like the Growtika Robots.txt Checker audit your file against 16+ AI crawlers simultaneously.
  2. CDN/WAF settings: In Cloudflare, check AI Crawl Control and Bot Fight Mode. For Akamai or AWS WAF, review bot management rules for AI bot signatures. This is where most invisible blocking happens.
  3. Server access logs: Search for AI user-agent strings (GPTBot, ClaudeBot, ChatGPT-User). A 200 response means access was granted; 403 or 429 means blocked or rate-limited. If you see zero AI crawler activity despite having publicly accessible content, something upstream is blocking before requests reach your server.
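The robots.txt check in the first item can also be scripted. The sketch below is one possible approach, assuming Python's standard `urllib.robotparser` module; the `audit_robots` helper and the sample file are hypothetical names used for illustration, and the crawler list is taken from this article. It only inspects robots.txt rules, so it will not reveal blocking done at the CDN or WAF layer.

```python
from urllib.robotparser import RobotFileParser

# Crawler names taken from this article; extend the list as needed.
AI_CRAWLERS = ["GPTBot", "ClaudeBot", "PerplexityBot", "Google-Extended"]


def audit_robots(robots_txt, agents=AI_CRAWLERS):
    """Return {agent: True} when the site root is disallowed for that agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {agent: not parser.can_fetch(agent, "/") for agent in agents}


if __name__ == "__main__":
    # Hypothetical robots.txt content; paste your own file here.
    sample = (
        "User-agent: GPTBot\n"
        "Disallow: /\n"
        "\n"
        "User-agent: *\n"
        "Allow: /\n"
    )
    for agent, blocked in audit_robots(sample).items():
        print(f"{agent}: {'blocked' if blocked else 'allowed'}")
```

To audit a live site, fetch the file yourself (or use the parser's `set_url` and `read` methods) and pass the text to the same function.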

Complete this audit first. Everything else in this article assumes you know your starting position.

Three Categories of AI Bots — and Why the Distinction Changes Everything

Most guidance treats “AI crawlers” as one thing. That’s the root cause of almost every blocking mistake. AI bots fall into three functionally distinct categories, each with different implications for your content, traffic, and revenue.

I call this the Crawl-to-Value Taxonomy: a framework for mapping every AI bot to its actual impact on your site.

| Category | Purpose | Key Bots | Traffic to Your Site | Block? |
| --- | --- | --- | --- | --- |
| 1. Training Crawlers | Collect content to build/update AI models | GPTBot, ClaudeBot, Google-Extended, CCBot, meta-externalagent, Bytespider | Zero referral benefit | ✅ Yes |
| 2. Search Index Crawlers | Build indexes powering AI search results | OAI-SearchBot, Claude-SearchBot | Medium: drives AI search citations | ❌ No |
| 3. Assistant/User Bots | Fetch content in real time for live user queries | ChatGPT-User, PerplexityBot, Claude-User | Highest: directly tied to click-throughs | ❌ No |

According to Vercel’s analysis, corroborated by r/TechSEO practitioners, this three-category distinction is the foundation of every sound blocking decision.

Here’s the number that makes the case on its own: training crawlers drive ~80% of all AI crawling activity but provide 0% referral traffic, according to Cloudflare. Four-fifths of the AI bot load on your servers costs you everything and gives you nothing.

Category 1: Training Crawlers — Pure Cost, Zero Return

Training crawlers collect your content to build or update language models. Your words enter a training dataset. Users of the resulting AI model are never directed back to your site. There is no link, no citation, no referral.

Major training crawlers: GPTBot (OpenAI), ClaudeBot (Anthropic), Google-Extended (Google), CCBot (Common Crawl), meta-externalagent (Meta), Bytespider (ByteDance).

Volume context: GPTBot alone generated 569 million requests in a single month. ClaudeBot generated 370 million in the same period. Training crawlers drive up to 8x the volume of traditional search crawling and 32x that of AI search crawling.

Category 2: Search Index Crawlers — Your AI Search Visibility Layer

Search index crawlers build the indexes that power AI search results, functionally similar to how Googlebot builds Google’s search index, but for AI platforms. Blocking them removes your content from AI search results.

Major search index crawlers: OAI-SearchBot (OpenAI), Claude-SearchBot (Anthropic).

ALM Corp’s analysis of 66.7 billion bot requests shows OAI-SearchBot reached 55% web coverage, while GPTBot dropped to just 12% coverage as more sites blocked it. The search crawler is expanding precisely because the training crawler is getting blocked, and sites accessible to OAI-SearchBot gain disproportionate representation in ChatGPT’s search results.

Category 3: Assistant/User Bots — The Highest-Value AI Traffic

These bots fetch content in real time when a human asks an AI tool a question. They have the highest referral potential because the bot activity is directly triggered by a user who may click through.

Major assistant bots: ChatGPT-User (OpenAI), PerplexityBot (Perplexity), Claude-User (Anthropic).

User-driven AI bot crawling grew 15x in 2025 alone, according to Cloudflare’s year-in-review data. This is the fastest-growing category and the one most often accidentally blocked by blanket rules.

How to Identify Bot Categories in Your Server Logs

Training crawlers identify with names like “GPTBot” or “ClaudeBot.” Search index crawlers contain “SearchBot” (e.g., OAI-SearchBot). Assistant bots usually include “User” in their identifier (e.g., ChatGPT-User), though the pattern is not universal: PerplexityBot, grouped with assistant bots above, carries no such marker. Cross-reference any unknown user-agent string with the owning company’s published documentation for reliable classification.
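A lookup table is more dependable than name patterns alone. The sketch below maps a raw user-agent string to the three categories used in this article; the `classify` function and the tuple names are hypothetical, and the bot lists simply mirror the taxonomy table above, so treat the category assignments as this article's framework rather than vendor-confirmed labels.

```python
# Bot tokens grouped as in this article's three-category taxonomy.
TRAINING = ("GPTBot", "ClaudeBot", "Google-Extended", "CCBot",
            "meta-externalagent", "Bytespider")
SEARCH = ("OAI-SearchBot", "Claude-SearchBot")
ASSISTANT = ("ChatGPT-User", "PerplexityBot", "Claude-User")

CATEGORIES = (("training", TRAINING), ("search", SEARCH), ("assistant", ASSISTANT))


def classify(user_agent):
    """Return 'training', 'search', 'assistant', or 'unknown' for a UA string."""
    ua = user_agent.lower()
    for label, tokens in CATEGORIES:
        # Case-insensitive substring match against each known bot token.
        if any(token.lower() in ua for token in tokens):
            return label
    return "unknown"


if __name__ == "__main__":
    # Illustrative user-agent strings, not copied from real logs.
    print(classify("Mozilla/5.0 (compatible; GPTBot/1.1)"))
    print(classify("Mozilla/5.0 (compatible; OAI-SearchBot/1.0)"))
    print(classify("Mozilla/5.0 (compatible; ChatGPT-User/1.0)"))
```

Anything that comes back as `unknown` is a candidate for the manual cross-referencing step described above.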

The Pros of Blocking AI Training Crawlers

Protect Your Intellectual Property and Licensing Leverage

79% of top news websites in the UK and US now block at least one AI training crawler, up from 57.5% in February 2024, according to BuzzStream research. Among all news sites analyzed, 49.4% specifically block GPTBot, the single most-blocked AI crawler, while 47.8% block CCBot and 44% block Google-Extended.

Blocking strengthens negotiating leverage. Publishers who allow free crawling effectively signal consent to scraping, undermining compensation claims. Playwire reports that licensing deals between AI companies and major publishers range from $1 million to over $250 million annually. Active lawsuits from The New York Times, Condé Nast, Forbes, Vox, and Reddit reflect a legal environment growing hostile to AI “fair use” arguments. The Thomson Reuters case rejected an AI fair use defense in early 2025.

For smaller publishers who won’t negotiate billion-dollar deals, blocking is still a principled stance that preserves future optionality. Your robots.txt file is becoming a de facto licensing statement.

Cut Server Costs by Up to 75%

The Read the Docs project documented a 75% reduction in bandwidth, from 800GB to 200GB of daily traffic, after blocking AI crawlers, saving approximately $1,500/month. A Clutch survey found 42% of small businesses reported bandwidth strain from bot traffic in the last 12 months. The Wikimedia Foundation reported a 50% bandwidth surge since January 2024, driven by AI crawler bulk downloads.

The financial impact hits real site owners hard. One developer shared a cautionary tale on r/CloudFlare:

“I am hosting about 600MB of files on a domain for people to download. Just this morning, I received a bill from DigitalOcean that I have $150 in bandwidth cost last month. Turns out that, just starting last month, OpenAI’s GPTBot’s crawling cost me 30TB of bandwidth, which equals downloading the entire directory 50,000 times.” — u/Isocrates_Noviomagi (46 upvotes)
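The figures quoted in this section are easy to sanity-check, and the same arithmetic works for your own hosting bill. This is a minimal sketch assuming decimal units (1 TB = 10^12 bytes); the helper names are hypothetical.

```python
MB = 10**6
GB = 10**9
TB = 10**12


def percent_reduction(before, after):
    """Percentage drop from `before` to `after`."""
    return (before - after) * 100 / before


def full_downloads(total_bytes, directory_bytes):
    """How many complete copies of a directory a given transfer volume represents."""
    return total_bytes // directory_bytes


if __name__ == "__main__":
    # Read the Docs: 800GB of daily traffic down to 200GB.
    print(percent_reduction(800 * GB, 200 * GB))   # 75.0
    # The r/CloudFlare example: 30TB of crawling against 600MB of files.
    print(full_downloads(30 * TB, 600 * MB))       # 50000
```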

Across Cloudflare’s network, AI crawlers generate more than 50 billion requests per day. The cost burden is asymmetric: training crawlers drive 80% of the volume but deliver zero direct value. Blocking them eliminates the costliest bot traffic with no upside sacrifice.

Fix Corrupted Analytics

AI scrapers caused 16% of known-bot invalid traffic in 2024, up 86% in H2 2024 alone, according to DoubleVerify. This means artificial page views, distorted conversion rates, and skewed engagement metrics, all corrupting the data you use for every other business decision.

There’s a circular reasoning trap here: you see “high traffic” that includes bot-generated page views, conclude blocking would hurt you, and continue allowing the bots that are inflating the numbers that justified their presence. Filter AI bot traffic from your analytics before evaluating whether blocking makes sense.

The Cons of Blocking AI Crawlers (and Where the Risk Actually Lives)

AI Referral Traffic Is Small but Growing Fast

ChatGPT referral traffic grew 25-fold between 2024 and 2025. ChatGPT drove nearly 21% of Walmart’s referral traffic in one documented period. AI platforms still account for less than 0.15% of global internet traffic, but 25x growth rates don’t stay at 0.15% for long.

This risk is category-specific. Blocking training crawlers doesn’t cut you off from AI referral traffic. The New York Times blocked GPTBot but still received 240,600 visits from ChatGPT in January 2025. The visibility risk is concentrated in blocking search index crawlers (Category 2) and assistant bots (Category 3), not training crawlers.

Practitioner data from r/TechSEO (46,600+ members) adds nuance: webmasters reported AI referral traffic at 1-2% of total referrals, but AI citation rates were notably higher. One webmaster observed ~50-100 chat appearances per day on niche sites, with roughly 7% converting to a referral within 5 minutes. The traffic is small today. The visibility footprint is already larger than most analytics dashboards show.

A TechSEO practitioner who implemented the selective approach described their real-world results on r/TechSEO:

“I’m sitting at 2% of genAI referral traffic (growing) with crawlers blocked (those for training the underlying model without reference or citation) and AI Search Bots whitelisted (those for grounding and web searches). So not seeing big difference for now but will monitor more closely now. Site Size: Dynamic – endless pages – 40k core pages at minimum. A lot of proprietary data, hence the block.” — u/shooting_star_s (2 upvotes)

Robots.txt Is Voluntary and Some Crawlers Don’t Comply

OpenAI, Anthropic, Google, and Perplexity officially state compliance with robots.txt. OpenAI has respected the directive since August 2023. But robots.txt has no legal enforcement mechanism, doesn’t apply retroactively, and is invisible to rogue scrapers.

Perplexity faced accusations of operating its fetching bot under a different user-agent than its declared PerplexityBot. Meta’s crawler has shown inconsistent adherence despite official claims. As Fastly noted in August 2025: “Many AI crawlers aren’t following the rules, and robots.txt can’t stop them.”

Webmasters dealing with non-compliant crawlers are finding this out the hard way. As one practitioner shared on r/TechSEO:

“I don’t know about pay per crawl, but blocking these bots for now is the only way to prevent these bots from overloading my infrastructure. Meta’s bot doesn’t even check the robots.txt nor does it obey any ‘nofollow’ signals. It just crawls anything anywhere. It’s just a waste of resources at this point.” — u/ByFrasasfo (4 upvotes)

This doesn’t make robots.txt useless; it blocks the major, compliant crawlers responsible for the vast majority of training traffic. But it’s not a complete solution.

Maintenance Is Ongoing and Misconfiguration Is Easy

The CrowdSec AI Crawlers Blocklist contains ~25,000 IP addresses. New crawlers emerge constantly. Even advanced “tarpit” countermeasures have limits: UNU C3 documented GPTBot escaping Nepenthes-style infinite content traps.

The most dangerous misconfiguration: blocking Googlebot instead of Google-Extended. One powers your Google Search rankings. The other trains Google’s AI models. Confuse them and your organic traffic disappears overnight.

Why Studies Show Both “No Impact” and “23% Traffic Drops”

This is the contradiction that paralyzes decision-making. Both findings are valid; they measure different things.

The “No Impact” Data

Raptive and Playwire tracked 6,000+ publisher sites from June 2024 to May 2025. Blocking GPTBot, CCBot, ClaudeBot, and other training crawlers produced no statistically significant change in organic search traffic, with average variation within 1%. Raptive actively recommends blocking.

These studies measured organic search rankings and Google Search traffic. The finding is unambiguous: blocking training crawlers (Category 1) does not hurt Google rankings.

The “23% Drop” Data

Kim et al. (2025) analyzed the top 500 news publishers using SimilarWeb and Comscore data. Large publishers who blocked saw total monthly traffic drop 23.1%, with human traffic declining 13.9%.

Critical context: this study measured total traffic including AI referrals, not organic search rankings. The top 30 publishers account for 69% of traffic in the dataset, skewing the aggregate. Some publishers reversed their blocking after seeing these results.

The Resolution

The contradiction maps directly to the Crawl-to-Value Taxonomy:

  • Blocking training crawlers (Category 1) = no impact on organic search. Confirmed by Google, Raptive, and Playwire.
  • Blocking all AI crawlers (Categories 1 + 2 + 3) = traffic drops from lost AI referrals. This is what Kim et al. measured.

If you block only training crawlers while allowing search and assistant bots, the data indicates no meaningful traffic decline. Block everything, and you cut off a growing channel, with larger sites losing more in absolute terms.

Most articles presenting these studies don’t explain why they disagree. Now you know: they asked different questions and measured different things.

Does Blocking AI Crawlers Stop AI Overviews and Zero-Click Searches?

No. Google AI Overviews are powered by Googlebot (the traditional search crawler), not Google-Extended (the AI training crawler). You cannot opt out of AI Overviews without removing your site from Google Search entirely.

The zero-click problem is substantial, and blocking training crawlers doesn’t touch it:

  • AI-powered search summaries increased zero-click searches from 56% to 69% between May 2024 and May 2025, per Playwire’s analysis of Digital Content Next data
  • Click-through rates dropped 34-46% for pages appearing in AI Overview positions
  • IAB Tech Lab estimates ~$2 billion in annual publisher ad revenue losses from AI summaries, with niche sites losing up to 90% of traffic

These losses happen whether you block AI crawlers or not. Google referrals to news sites fell ~9% between January and March 2025, per Cloudflare data.

Here’s the irony: as traditional search CTRs decline from AI Overviews, AI search referral traffic becomes relatively more valuable, making it even more costly to over-block at the exact moment zero-click anxiety is highest. Blocking training crawlers is a content protection measure. It is not a defense against AI Overview traffic erosion.

Selective Blocking Strategy: How to Implement It

5-Step Implementation Process

  1. Audit current blocking status: check robots.txt, CDN/WAF settings, and server logs (see audit checklist above)
  2. Edit robots.txt to block training crawlers and explicitly allow search/assistant bots (code below)
  3. Review CDN/WAF settings for conflicting rules, particularly Cloudflare’s AI Crawl Control and Bot Fight Mode
  4. Verify implementation in your server logs: allowed bots should receive 200 responses, while blocked bots should either stop requesting pages (robots.txt compliance) or receive 403 responses (CDN/WAF enforcement)
  5. Monitor AI search visibility quarterly to confirm no unintended impact
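The log verification in step 4 can be sketched as a small script that tallies HTTP status codes per AI bot. This is an illustration, not a drop-in tool: it assumes your server writes the common "combined" access log format (status code after the quoted request, user-agent as the last quoted field), and the `status_by_bot` helper and sample lines are hypothetical.

```python
import re
from collections import Counter, defaultdict

# Combined Log Format tail: "REQUEST" STATUS SIZE "REFERER" "USER-AGENT"
LINE_RE = re.compile(r'"[^"]*" (?P<status>\d{3}) \S+ "[^"]*" "(?P<ua>[^"]*)"')

# Bot tokens from this article that send an identifiable user-agent.
BOTS = ("GPTBot", "ClaudeBot", "CCBot", "OAI-SearchBot",
        "Claude-SearchBot", "ChatGPT-User", "PerplexityBot", "Claude-User")


def status_by_bot(lines):
    """Return {bot: Counter({status_code: hits})} for known AI bots."""
    counts = defaultdict(Counter)
    for line in lines:
        match = LINE_RE.search(line)
        if not match:
            continue  # skip lines that do not look like combined-format entries
        ua = match.group("ua").lower()
        for bot in BOTS:
            if bot.lower() in ua:
                counts[bot][int(match.group("status"))] += 1
                break
    return counts


if __name__ == "__main__":
    # Illustrative log lines using documentation IP addresses.
    sample = [
        '203.0.113.5 - - [01/Jul/2025:10:00:00 +0000] "GET / HTTP/1.1" 403 153 "-" "Mozilla/5.0 (compatible; GPTBot/1.1)"',
        '203.0.113.9 - - [01/Jul/2025:10:00:02 +0000] "GET /post HTTP/1.1" 200 5120 "-" "Mozilla/5.0 (compatible; ChatGPT-User/1.0)"',
    ]
    for bot, codes in status_by_bot(sample).items():
        print(bot, dict(codes))
```

A blocked training crawler that still shows a steady stream of 200 responses suggests the rule is not being applied; an allowed search or assistant bot that shows only 403s suggests a CDN or WAF rule is overriding your robots.txt.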

Robots.txt Code for Selective Blocking

# Block AI training crawlers (Category 1)
User-agent: GPTBot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: meta-externalagent
Disallow: /

User-agent: Bytespider
Disallow: /

# Allow AI search index crawlers (Category 2)
User-agent: OAI-SearchBot
Allow: /

User-agent: Claude-SearchBot
Allow: /

# Allow AI assistant/user bots (Category 3)
User-agent: ChatGPT-User
Allow: /

User-agent: PerplexityBot
Allow: /

User-agent: Claude-User
Allow: /
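If you maintain the bot lists in one place, the file can be generated and checked before deployment. The sketch below is one way to do that, assuming Python's standard `urllib.robotparser` as the checker; `build_robots_txt` and `can_crawl` are hypothetical helper names, and the lists mirror this article's categories. Real crawlers use their own parsers, so a passing check here is a sanity test rather than a guarantee.

```python
from urllib.robotparser import RobotFileParser

# Category 1 (block) and Categories 2-3 (allow), as listed in this article.
BLOCK = ["GPTBot", "ClaudeBot", "Google-Extended", "CCBot",
         "meta-externalagent", "Bytespider"]
ALLOW = ["OAI-SearchBot", "Claude-SearchBot", "ChatGPT-User",
         "PerplexityBot", "Claude-User"]


def build_robots_txt(block=BLOCK, allow=ALLOW):
    """Emit one group per bot, with groups separated by a single blank line."""
    groups = [f"User-agent: {agent}\nDisallow: /" for agent in block]
    groups += [f"User-agent: {agent}\nAllow: /" for agent in allow]
    return "\n\n".join(groups) + "\n"


def can_crawl(robots_txt, agent, path="/"):
    """Check a path for one agent using the standard-library parser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    text = build_robots_txt()
    print(text)
    # Googlebot has no group in the file, so it should remain allowed.
    for agent in ["GPTBot", "OAI-SearchBot", "Googlebot"]:
        print(agent, "allowed" if can_crawl(text, agent) else "blocked")
```

Checking Googlebot explicitly guards against the most expensive mistake: a rule that accidentally blocks the crawler behind your organic search traffic.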

Mistakes That Break Your Implementation

  • Blocking Googlebot instead of Google-Extended — Googlebot powers Google Search. Blocking it destroys your SEO overnight.
  • Blanket User-agent: * / Disallow: / rules — This blocks everything, including beneficial AI search and assistant bots.
  • Missing Allow: / directives — Without explicit allows, search/assistant bots may inherit broader Disallow rules.
  • Empty Disallow paths — A Disallow line without a path (e.g., just Disallow:) is ignored by most parsers and blocks nothing.

This is one of the most reversible technical decisions you can make. A robots.txt change takes effect as soon as crawlers re-fetch the file, typically within about a day, and can be undone in minutes. If you’re hesitant, know that the downside of trying selective blocking is near zero, while the downside of continuing to serve 80% of your AI bot traffic for free compounds every quarter.

Business Model Recommendations

| Business Model | Training Crawlers (Cat. 1) | Search Crawlers (Cat. 2) | Assistant Bots (Cat. 3) | Rationale |
| --- | --- | --- | --- | --- |
| Ad-supported publishers | Block | Allow | Allow | Protect content, reduce bandwidth, preserve AI referral revenue |
| E-commerce | Block | Allow | Allow | AI search drives product discovery (21% of Walmart referrals from ChatGPT) |
| B2B / thought leadership | Block | Allow | Allow | AI citations directly serve brand authority goals |
| Independent creators | Block | Allow | Allow | Protect original work; AI search = audience channel not dependent on SEO competition |

The recommendation converges: block training, allow search and assistant, regardless of business model. The difference is emphasis. E-commerce and B2B sites should prioritize monitoring AI referral traffic growth. Ad-supported publishers should prioritize bandwidth savings and analytics cleanup.

Layered Defense Beyond robots.txt

Robots.txt blocks compliant crawlers. For enforcement against non-compliant bots, layer these additional measures as needed:

  • CDN/WAF rules: Cloudflare’s AI Crawl Control provides auto-updating managed blocking. Other providers (Fastly, Akamai) offer similar WAF configurations. Check these first, because they may already be blocking or allowing bots independently of your robots.txt.
  • IP blocking: OpenAI publishes verifiable IP ranges for GPTBot. The CrowdSec blocklist covers ~25,000 AI bot IPs. Maintenance is ongoing, since fingerprints change and IPs rotate.
  • Rate limiting: Throttles request volume rather than blocking outright. Useful for Category 2/3 bots you want to allow at a controlled pace.

Start with robots.txt (5 minutes, handles compliant bots). Add CDN/WAF rules next (handles infrastructure-level enforcement). Implement IP blocking only if you have developer support to maintain it. Rate limiting is a middle-ground option for specific situations.
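The IP-based and rate-limiting layers can be illustrated with two small building blocks: a check that a request's source IP falls inside a published range, and a token bucket that throttles an allowed bot. This is a sketch under stated assumptions, not production middleware. The CIDR block below is a placeholder from the RFC 5737 documentation range, not a real crawler range, and `ip_in_ranges` and `TokenBucket` are hypothetical names.

```python
import ipaddress
import time

# Placeholder range (RFC 5737 documentation block), NOT a real crawler range.
# Replace with the ranges the bot operator actually publishes.
PUBLISHED_RANGES = ["192.0.2.0/24"]


def ip_in_ranges(ip, cidrs=PUBLISHED_RANGES):
    """True if `ip` belongs to any of the given CIDR blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr) for cidr in cidrs)


class TokenBucket:
    """Allow up to `capacity` burst requests, refilled at `rate` tokens per second."""

    def __init__(self, rate, capacity, now=None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = time.monotonic() if now is None else now

    def allow(self, now=None):
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        # Refill according to elapsed time, never exceeding capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


if __name__ == "__main__":
    print(ip_in_ranges("192.0.2.10"))    # inside the placeholder range
    print(ip_in_ranges("198.51.100.7"))  # outside it
    bucket = TokenBucket(rate=1, capacity=2)
    print([bucket.allow() for _ in range(3)])
```

A request that claims a bot's user-agent but arrives from outside the operator's published ranges is a reasonable candidate for blocking, while the bucket gives verified Category 2/3 bots a bounded request rate instead of an outright block. Both pieces need ongoing maintenance, which is why the article suggests them only when developer support is available.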

Why Selective Blocking Creates a Competitive Advantage

Most blocking advice frames the decision defensively: protect your content. That framing misses the strategic opportunity.

As more sites block all AI crawlers, the pool of content available to AI search platforms shrinks. Sites that selectively allow search and assistant bots become a larger share of the AI search index, gaining disproportionate visibility while competitors go dark.

The data shows this happening already. ALM Corp’s analysis of 66.7 billion requests found OAI-SearchBot at 55% web coverage while GPTBot dropped to 12% as blocking increased. The search crawler expands precisely because the training crawler gets blocked. Sites accessible to search bots inherit a growing share of AI citations.

With 79% of top news sites blocking at least one training crawler and 33.2% planning to block AI Overviews when controls become available, the competitive landscape is shifting. Every competitor that over-blocks is one fewer source competing for AI citations in your niche.

This isn’t speculation. It’s the mathematical consequence of a shrinking denominator.

Monitor Your Blocking Strategy — or Fly Blind

Implementing selective blocking is step one. Confirming it works is what separates strategy from guesswork.

Key Metrics to Track After Implementation

  • AI search citations: Is your content being mentioned or recommended in ChatGPT, Perplexity, and Google AI Overview responses?
  • AI referral traffic: Track visits from chatgpt.com, perplexity.ai, and related domains in GA4
  • Organic search traffic: Baseline comparison to confirm no unintended SEO impact
  • Server logs: Watch for new or reclassified AI crawler user-agents that need categorization
  • Bandwidth consumption: Measure reduction after blocking training crawlers
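For the AI referral traffic metric, the same classification can be run on exported referrer data. The sketch below assumes you have a list of referrer URLs (for example, from an analytics export or the referer field in server logs); the `ai_source` and `ai_referral_share` helpers are hypothetical, and the domain map contains only the two domains named above, so extend it with the platforms you care about.

```python
from urllib.parse import urlparse

# Referrer domains named in this article; add others as needed.
AI_REFERRERS = {"chatgpt.com": "ChatGPT", "perplexity.ai": "Perplexity"}


def ai_source(referrer):
    """Return the AI platform name for a referrer URL, or None."""
    host = (urlparse(referrer).hostname or "").lower()
    for domain, name in AI_REFERRERS.items():
        # Match the bare domain or any subdomain of it.
        if host == domain or host.endswith("." + domain):
            return name
    return None


def ai_referral_share(referrers):
    """Fraction of referrers that come from a known AI platform."""
    if not referrers:
        return 0.0
    hits = sum(1 for ref in referrers if ai_source(ref) is not None)
    return hits / len(referrers)


if __name__ == "__main__":
    sample = [
        "https://chatgpt.com/",
        "https://www.perplexity.ai/search?q=example",
        "https://www.google.com/",
        "https://example.org/blog",
    ]
    print(ai_referral_share(sample))
```

Tracking this share over time, rather than a single snapshot, is what surfaces the sudden drops and meaningful shifts that should prompt a strategy review.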

Signals That Trigger Strategy Review

  • Sudden drop in AI referral traffic after a blocking change
  • New AI crawlers in your logs that you haven’t classified
  • CDN or hosting provider updating default bot policies (Cloudflare does this periodically)
  • Competitor content appearing more frequently in AI responses for your target queries
  • Meaningful shift in the proportion of traffic from AI platforms

Review cadence: Quarterly for most sites. Monthly if you’re in a fast-moving niche or tracking high-value AI search visibility.

The Monitoring Gap Most Sites Can’t Close Manually

Here’s the problem: you can verify your robots.txt syntax and check server logs, but you can’t easily confirm whether AI platforms are actually citing your content, whether competitors who over-blocked are losing visibility you could capture, or whether new crawlers need categorization, at least not from log files alone.

ZipTie.dev tracks how your brand, products, and content appear in AI-generated search results across Google AI Overviews, ChatGPT, and Perplexity. Its competitive intelligence reveals which competitor content is being cited by AI engines so you can see if competitors’ over-blocking is creating opportunity you’re positioned to capture. The platform’s AI-driven query generator analyzes your actual content URLs to produce relevant monitoring queries, and its contextual sentiment analysis goes beyond basic positive/negative scoring to show how AI platforms characterize your brand.

Without monitoring, you won’t know if a CDN update silently changed your settings, if a new crawler is consuming bandwidth unchecked, or if the selective blocking strategy that worked last quarter still holds. The blocking decision is the starting point. Monitoring transforms it from a one-time guess into an ongoing optimization loop.

Frequently Asked Questions

Does blocking AI crawlers hurt my Google search rankings?

No. Google’s documentation confirms blocking Google-Extended has no impact on search inclusion or rankings. Raptive and Playwire validated this across 6,000+ sites; organic traffic variation was within 1%.

  • Google-Extended (AI training) and Googlebot (search) are independent systems
  • Blocking one does not affect the other
  • This applies to all Category 1 training crawlers, not just Google-Extended

What’s the difference between GPTBot and OAI-SearchBot?

GPTBot collects content for AI model training; it sends zero referral traffic. OAI-SearchBot builds the search index that powers ChatGPT’s search feature; it directly enables AI citations and click-throughs. Block the first, allow the second.

Will blocking GPTBot stop my content from appearing in ChatGPT?

No. The New York Times blocked GPTBot but still received 240,600 ChatGPT visits in January 2025. ChatGPT’s search results are powered by OAI-SearchBot and ChatGPT-User, which are separate bots from GPTBot. As long as you allow those, your content remains visible.

How do I check if my site is already blocking AI crawlers?

Check three layers: robots.txt, CDN/WAF settings, and server logs.

  • robots.txt: Visit yourdomain.com/robots.txt and search for AI crawler names
  • CDN/WAF: Review Cloudflare’s AI Crawl Control, Bot Fight Mode, or equivalent settings in your provider
  • Server logs: Look for AI user-agents and check HTTP response codes (200 = allowed, 403 = blocked)

Which AI crawlers should I block and which should I allow?

Block training crawlers: GPTBot, ClaudeBot, Google-Extended, CCBot, meta-externalagent, Bytespider.

Allow search and assistant bots: OAI-SearchBot, Claude-SearchBot, ChatGPT-User, PerplexityBot, Claude-User.

Can blocking AI crawlers stop my content from appearing in Google AI Overviews?

No. AI Overviews are powered by Googlebot, the same crawler that indexes you for regular search. You can’t opt out of AI Overviews without removing yourself from Google Search entirely. Blocking Google-Extended (AI training) has no effect on AI Overviews.

Is robots.txt enough, or do I need additional blocking measures?

Robots.txt handles compliant bots, which includes the major AI crawlers from OpenAI, Anthropic, and Google. For enforcement against non-compliant crawlers, add CDN/WAF rules as a second layer and IP blocking as a third. Start with robots.txt (5-minute implementation, addresses ~80% of the problem), then layer up based on your risk tolerance and technical resources.


Ishtiaque Ahmed

Author

Ishtiaque's career tells the story of digital marketing's own evolution. Starting in CPA marketing in 2012, he spent five years learning the fundamentals before diving into SEO — a field he dedicated seven years to perfecting. As search began shifting toward AI-driven answers, he was already researching AEO and GEO, staying ahead of the curve. Today, as an AI Automation Engineer, he brings together over twelve years of marketing insight and a forward-thinking approach to help businesses navigate the future of search and automation. Connect with him on LinkedIn.
