Background
Many service providers wish to restrict certain Web activities, including spambot registration and web crawling.
Crawler Identification Resources
- Dark Visitors – A List of Known AI Agents on the Internet
- udger crawler UA list
- Eight tips about consent for fediverse developers – background on consent in the Fediverse as it relates to web scraping
- https://mastodon.bentasker.co.uk/@scrapersnitch – automated OSINT account that posts notifications about possible Fediverse scrapers
Anthropic AI
Anthropic is a U.S.-based AI public-benefit company that researches AI systems in order to “study their safety properties at the technological frontier” and uses this research to deploy safe, reliable models for the public. Anthropic has developed a family of large language models (LLMs) named Claude, a competitor to OpenAI’s ChatGPT and Google’s Gemini.
To disallow ClaudeBot using robots.txt:
User-agent: ClaudeBot
Disallow: /
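Note that robots.txt is advisory: crawlers must choose to honor it. Requests from crawlers that ignore it can also be filtered server-side by the User-Agent header. A minimal sketch in Python; the token list and plain substring matching are illustrative assumptions, not a vetted blocklist:

```python
# Sketch: match a request's User-Agent header against a list of crawler
# tokens. Substring matching is an assumption; crawlers can and do vary
# (or spoof) their full user-agent strings.
BLOCKED_AGENTS = ["ClaudeBot", "GPTBot", "CCBot", "PerplexityBot"]

def is_blocked(user_agent: str) -> bool:
    """Return True if the User-Agent contains any blocked crawler token."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in BLOCKED_AGENTS)
```

A function like this could be wired into whatever middleware or request hook your web framework provides, returning a 403 when it matches.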
Applebot-Extended
With Applebot-Extended, web publishers can choose to opt out of their website content being used to train Apple’s foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools. Applebot-Extended does not crawl webpages. Webpages that disallow Applebot-Extended can still be included in search results. Applebot-Extended is only used to determine how to use the data crawled by the Applebot user agent.
To disallow Applebot-Extended using robots.txt:
User-agent: Applebot-Extended
Disallow: /
Additional information, including information about Apple’s web crawler “Applebot”, is available at https://support.apple.com/en-us/119829
CCBot
Common Crawl is a non-profit foundation whose goal is to democratize access to web information by producing and maintaining an open repository of web crawl data that anyone can access and analyze. Its user agent is CCBot/2.0.
To disallow CCBot using robots.txt:
User-agent: CCBot
Disallow: /
FacebookBot
FacebookBot crawls public web pages to improve language models for speech recognition technology.
User agent: FacebookBot
Full user-agent string: Mozilla/5.0 (compatible; FacebookBot/1.0; +https://developers.facebook.com/docs/sharing/webmasters/facebookbot/)
To disallow FacebookBot using robots.txt:
User-agent: FacebookBot
Disallow: /
Google-Extended
Google-Extended is Google’s opt-out token for Gemini (formerly Bard) and Vertex AI: it does not crawl pages itself, but controls whether content fetched by Googlebot may be used for AI training. To disallow it:
User-agent: Google-Extended
Disallow: /
This does not stop all Google generative AI crawls. Google also scrapes content for AI-powered search results. To stop this, you will need to block the main Googlebot, which will also remove your site from Google Search.
User-agent: Googlebot
Disallow: /
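For example, to opt out of Gemini/Vertex AI training while remaining in Google Search, the two stanzas can be combined so that only Google-Extended is disallowed (a sketch; verify the tokens against Google’s crawler documentation):

```
# Opt out of generative AI training only
User-agent: Google-Extended
Disallow: /

# Leave ordinary Search crawling alone (an empty Disallow permits everything)
User-agent: Googlebot
Disallow:
```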
To disallow all Google bot traffic by IP address, see this json file
OpenAI
GPTBot is OpenAI’s web crawler and can be identified by the following user-agent token and full string:
User agent: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
To disallow GPTBot using robots.txt:
User-agent: GPTBot
Disallow: /
There are two user agents: GPTBot and GPT-User (for human-initiated browsing via ChatGPT). Opting out of either will block both.
To disallow GPTBot traffic by source IP:
52.230.152.0/24
52.233.106.0/24
To disallow GPT-User traffic by IP:
23.98.142.176/28
40.84.180.224/28
13.65.240.240/28
20.97.189.96/28
(reference: https://openai.com/gptbot.json and https://platform.openai.com/docs/plugins/bot)
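Published IP ranges like these can be checked programmatically. A minimal sketch using Python’s standard ipaddress module, seeded with the GPTBot ranges listed above; the ranges change over time, so production use should fetch the live JSON instead of hard-coding them:

```python
import ipaddress

# CIDR ranges for GPTBot, copied from the list above; these are a
# snapshot and should be refreshed from https://openai.com/gptbot.json.
GPTBOT_RANGES = [
    "52.230.152.0/24",
    "52.233.106.0/24",
]

def is_gptbot_ip(addr: str) -> bool:
    """Return True if addr falls inside any published GPTBot range."""
    ip = ipaddress.ip_address(addr)
    return any(ip in ipaddress.ip_network(cidr) for cidr in GPTBOT_RANGES)
```

The same check works for the GPT-User and PerplexityBot ranges below by swapping in their CIDR lists.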
PerplexityBot
Perplexity AI is an AI chatbot-powered research and conversational search engine that answers queries using natural language predictive text.
To disallow PerplexityBot using robots.txt:
User-agent: PerplexityBot
Disallow: /
To disallow PerplexityBot traffic by IP:
54.90.207.250/32
23.22.208.105/32
54.242.1.13/32
18.208.251.246/32
34.230.5.59/32
18.207.114.171/32
54.221.7.250/32
(reference: https://docs.perplexity.ai/docs/perplexitybot and https://www.perplexity.ai/perplexitybot.json)