Web Crawlers and Scrapers

From the IFTAS Trust & Safety Library - supporting volunteer moderators in the Fediverse

Background

Many service providers wish to restrict certain web activities, including spambot registration and automated crawling or scraping of their content.

Crawler Identification Resources

Anthropic AI

Anthropic is a U.S.-based public-benefit AI company that develops AI systems in order to “study their safety properties at the technological frontier” and uses this research to deploy safe, reliable models for the public. Anthropic has developed a family of large language models (LLMs) named Claude that competes with OpenAI’s ChatGPT and Google’s Gemini.

To disallow ClaudeBot using robots.txt:

User-agent: ClaudeBot 
Disallow: /
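
robots.txt is advisory: compliant crawlers honor it, but nothing enforces it. To confirm a rule is written the way you intend, Python’s standard-library parser can evaluate a live robots.txt for a given user agent. A minimal sketch, using example.social as a placeholder hostname:

# Verify that a deployed robots.txt disallows ClaudeBot.
# "example.social" is a placeholder hostname, not from this article.
from urllib.robotparser import RobotFileParser

parser = RobotFileParser("https://example.social/robots.txt")
parser.read()  # fetch and parse the live robots.txt

# can_fetch() returns False when the named user agent is disallowed
if parser.can_fetch("ClaudeBot", "https://example.social/"):
    print("ClaudeBot is still allowed; check your robots.txt")
else:
    print("ClaudeBot is disallowed")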

Applebot-Extended

With Applebot-Extended, web publishers can choose to opt out of their website content being used to train Apple’s foundation models powering generative AI features across Apple products, including Apple Intelligence, Services, and Developer Tools. Applebot-Extended does not crawl webpages. Webpages that disallow Applebot-Extended can still be included in search results. Applebot-Extended is only used to determine how to use the data crawled by the Applebot user agent.

To disallow Applebot-Extended using robots.txt:

User-agent: Applebot-Extended
Disallow: /

Additional information, including details about Apple’s web crawler Applebot, is available at https://support.apple.com/en-us/119829

CCBot

Common Crawl is a non-profit foundation whose goal is to democratize access to web information by producing and maintaining an open repository of web crawl data that is universally accessible and analyzable by anyone. The user agent is CCBot/2.0.

To disallow CCBot using robots.txt:

User-agent: CCBot
Disallow: /

FacebookBot

FacebookBot crawls public web pages to improve language models for speech recognition technology.

User agent: FacebookBot
Full user-agent string: Mozilla/5.0 (compatible; FacebookBot/1.0; +https://developers.facebook.com/docs/sharing/webmasters/facebookbot/)

To disallow FacebookBot using robots.txt:

User-agent: FacebookBot
Disallow: /
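
Because the token FacebookBot appears in the full user-agent string shown above, requests can also be refused at the application layer rather than relying on robots.txt alone. A minimal Python sketch; the deny logic here is an illustrative assumption, not part of Meta’s documentation:

# Application-level user-agent matching. The token below is taken from the
# full user-agent string documented above; the blocking behavior is an
# illustrative assumption.
BLOCKED_UA_TOKENS = ("FacebookBot",)

def is_blocked(user_agent: str | None) -> bool:
    """Return True if the request's User-Agent contains a blocked token."""
    if not user_agent:
        return False
    return any(token in user_agent for token in BLOCKED_UA_TOKENS)

# Example with the documented full user-agent string:
ua = ("Mozilla/5.0 (compatible; FacebookBot/1.0; "
      "+https://developers.facebook.com/docs/sharing/webmasters/facebookbot/)")
print(is_blocked(ua))  # True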

Google

Google-Extended is the robots.txt token that controls whether site content may be used to train Google’s Gemini (formerly Bard) and Vertex AI generative models. It is not a separate crawler; pages are still fetched by Google’s existing crawlers, and the token only governs how the crawled data may be used. To disallow Google-Extended:

User-agent: Google-Extended
Disallow: /

Disallowing Google-Extended does not stop all of Google’s generative AI uses. Google also uses crawled content for AI-powered search results; to stop this, you will need to block the main Googlebot, which will also remove your site from Google Search.

User-agent: Googlebot
Disallow: /

To disallow all Googlebot traffic by IP address, see the JSON file of Googlebot IP ranges that Google publishes at https://developers.google.com/static/search/apis/ipranges/googlebot.json
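
A minimal Python sketch that downloads that file and prints the CIDR blocks. The JSON layout assumed here (a “prefixes” list with “ipv4Prefix”/“ipv6Prefix” keys) matches Google’s published format at the time of writing, but verify it against the live file:

# Download Google's published Googlebot IP ranges and print the CIDR blocks.
# The "prefixes"/"ipv4Prefix"/"ipv6Prefix" layout is assumed from the
# published file; check the live JSON before relying on it.
import json
from urllib.request import urlopen

URL = "https://developers.google.com/static/search/apis/ipranges/googlebot.json"

with urlopen(URL) as response:
    data = json.load(response)

for prefix in data.get("prefixes", []):
    cidr = prefix.get("ipv4Prefix") or prefix.get("ipv6Prefix")
    if cidr:
        print(cidr)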

OpenAI

GPTBot is OpenAI’s web crawler and can be identified by the following user agent and full user-agent string:

User agent: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)

To disallow GPTBot using robots.txt:

User-agent: GPTBot
Disallow: /

There are two user agents, GPTBot and ChatGPT-User (the latter for user-initiated browsing via ChatGPT); opting out of either will block both.

To disallow GPTBot traffic by source IP:

52.230.152.0/24
52.233.106.0/24

To disallow ChatGPT-User traffic by IP:

23.98.142.176/28
40.84.180.224/28
13.65.240.240/28
20.97.189.96/28

(reference: https://openai.com/gptbot.json and https://platform.openai.com/docs/plugins/bot)
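
For application-level filtering, Python’s ipaddress module can test a client address against the ranges listed above. A minimal sketch; these ranges are a snapshot and should be refreshed from OpenAI’s published JSON:

# Check whether a client IP falls inside the OpenAI crawler ranges above.
# These ranges change over time; treat this list as a snapshot.
from ipaddress import ip_address, ip_network

OPENAI_RANGES = [ip_network(cidr) for cidr in (
    # GPTBot
    "52.230.152.0/24",
    "52.233.106.0/24",
    # ChatGPT-User
    "23.98.142.176/28",
    "40.84.180.224/28",
    "13.65.240.240/28",
    "20.97.189.96/28",
)]

def is_openai_crawler(client_ip: str) -> bool:
    """Return True if client_ip is inside any of the published ranges."""
    addr = ip_address(client_ip)
    return any(addr in net for net in OPENAI_RANGES)

print(is_openai_crawler("52.230.152.10"))  # True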

PerplexityBot

Perplexity AI is an AI-powered research and conversational search engine that answers queries using natural language predictive text.

To disallow PerplexityBot using robots.txt:

User-agent: PerplexityBot
Disallow: /

To disallow PerplexityBot traffic by IP:

54.90.207.250/32
23.22.208.105/32
54.242.1.13/32
18.208.251.246/32
34.230.5.59/32
18.207.114.171/32
54.221.7.250/32

(reference: https://docs.perplexity.ai/docs/perplexitybot and https://www.perplexity.ai/perplexitybot.json)
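
Taken together, the opt-outs on this page can be combined into a single robots.txt. Googlebot is omitted because disallowing it also removes a site from Google Search (see above):

User-agent: ClaudeBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: PerplexityBot
Disallow: /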

Updated on 2024-06-14