OpenAI has publicly launched a website crawler called GPTBot that “may potentially be used to improve future models” and will “help AI models become more accurate and improve their general capabilities and safety”. There is now also a documentation page with more details.
The GPTBot crawler identifies itself with the following user-agent token and full user-agent string:
User agent token: GPTBot
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.0; +https://openai.com/gptbot)
The IP address ranges used by the crawler are published by OpenAI at: https://openai.com/gptbot-ranges.txt.
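To verify that a request claiming to be GPTBot actually originates from OpenAI, you can check the client IP against the published ranges. Here is a minimal sketch using only the Python standard library; the CIDR blocks below are placeholder documentation ranges, not OpenAI's real ones — fetch the current list from the URL above.

```python
import ipaddress

def load_ranges(text):
    """Parse a newline-separated list of CIDR blocks into network objects."""
    return [ipaddress.ip_network(line.strip())
            for line in text.splitlines() if line.strip()]

def ip_in_ranges(ip, networks):
    """Return True if the given IP address falls inside any of the networks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in networks)

# Hypothetical sample contents of gptbot-ranges.txt (RFC 5737 test ranges);
# replace with the live file from https://openai.com/gptbot-ranges.txt.
sample = "192.0.2.0/24\n198.51.100.0/28\n"
nets = load_ranges(sample)
print(ip_in_ranges("192.0.2.55", nets))   # True
print(ip_in_ranges("203.0.113.9", nets))  # False
```

Checking the source IP matters because the user-agent string alone is trivial to spoof.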
You can prevent GPTBot from crawling your website via a standard Disallow entry in your robots.txt file:
User-agent: GPTBot
Disallow: /
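You can sanity-check that the rules above do what you intend with Python's standard-library robots.txt parser. A small sketch (the example URLs are placeholders):

```python
from urllib import robotparser

# Feed the robots.txt rules from above directly to the parser;
# parse() takes the file's lines, so no network fetch is needed.
rules = [
    "User-agent: GPTBot",
    "Disallow: /",
]
rp = robotparser.RobotFileParser()
rp.parse(rules)

# GPTBot is blocked everywhere; bots without a matching (or *) entry
# are unaffected.
print(rp.can_fetch("GPTBot", "https://example.com/page"))        # False
print(rp.can_fetch("SomeOtherBot", "https://example.com/page"))  # True
```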
The other known OpenAI bot is ChatGPT-User, used by plugins in ChatGPT. It identifies itself with the following user-agent token and full user-agent string:
User agent token: ChatGPT-User
Full user-agent string: Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot
You can prevent the ChatGPT-User bot from crawling your website via a standard Disallow entry in the robots.txt file:
User-agent: ChatGPT-User
Disallow: /
If you are concerned about other research bots crawling your website content, you can also disallow access for CCBot, the Common Crawl bot that “provides a copy of the internet to internet researchers, companies and individuals at no cost for the purpose of research and analysis”. It identifies itself with the following user-agent token and full user-agent string:
User agent token: CCBot
Full user-agent string: CCBot/2.0 (http://commoncrawl.org/faq/)
You can prevent CCBot from crawling your website via a standard Disallow entry in the robots.txt file:
User-agent: CCBot
Disallow: /
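If you want to block all three bots, the entries above can be combined into a single robots.txt file, one block per user agent:

```
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: CCBot
Disallow: /
```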
Of course, you also have more granular control over what you disallow for each of the bots mentioned above. You can allow or disallow individual paths with something like this:
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
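The granular rules above can also be verified with the standard-library parser. Note that Python's `robotparser` applies rules in file order, whereas some crawlers use longest-path matching, so keep rules unambiguous. The example URLs are placeholders:

```python
from urllib import robotparser

# The partial Allow/Disallow rules from above.
rules = [
    "User-agent: GPTBot",
    "Allow: /directory-1/",
    "Disallow: /directory-2/",
]
rp = robotparser.RobotFileParser()
rp.parse(rules)

print(rp.can_fetch("GPTBot", "https://example.com/directory-1/page"))  # True
print(rp.can_fetch("GPTBot", "https://example.com/directory-2/page"))  # False
print(rp.can_fetch("GPTBot", "https://example.com/other/"))            # True, no rule matches
```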