OpenAI launches web crawler “GPTBot” – How to stop it from entering your website?

ChatGPT-maker OpenAI has launched a new web crawler called “GPTBot” that will gather information from websites to feed its AI models.

OpenAI built ChatGPT, one of the most influential AI models to date, which has made everyday work easier for businesses and individuals. The arrival of ChatGPT pushed Google to show off its own capabilities, which eventually led to a healthy AI race. OpenAI develops its own large language models (LLMs), such as GPT-3.5 and GPT-4, that power ChatGPT. LLMs are the brains of such AI systems.

These LLMs are trained on vast amounts of internet data, which they draw on when generating content. OpenAI recently revealed its new web crawler, GPTBot, which collects data from the internet for its LLMs.

GPTBot Crawler by OpenAI

Web crawlers are bots used by search engines such as Google and Bing to read, scan and index websites so that relevant results can be returned when a user searches later. GPTBot similarly crawls websites and scans their data, but feeds it to OpenAI’s LLMs for training.

“Allowing GPTBot to access your site can help AI models become more accurate and improve their general capabilities and safety,” OpenAI notes in its GPTBot documentation. The company says it filters out web pages that require paywall access, are known to gather personally identifiable information, or contain text that violates OpenAI’s policies.

Allowing GPTBot to crawl your website helps make ChatGPT better, but it might raise privacy and security concerns. Publishers and website owners who are not comfortable with GPTBot crawling their sites can opt out in a simple way that OpenAI provides.

How to stop GPTBot from crawling your website?

To block GPTBot from accessing your website, all you have to do is add two lines to the site’s ‘robots.txt’ file. The lines are:

User-agent: GPTBot
Disallow: /
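Before relying on the rule, it can help to check how a robots.txt parser reads it. The following is a minimal sketch using Python's standard-library `urllib.robotparser`; the `example.com` URL and the `SomeOtherBot` name are placeholders invented for illustration, and the result only shows how this one parser interprets the rule, not how GPTBot itself behaves.

```python
from urllib.robotparser import RobotFileParser

# The two-line rule from the article, kept on consecutive lines.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if this parser would let `user_agent` fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network access
    return parser.can_fetch(user_agent, url)


# Hypothetical URL and second bot name, used only for the check.
gptbot_ok = is_allowed(ROBOTS_TXT, "GPTBot", "https://example.com/any-page")
other_ok = is_allowed(ROBOTS_TXT, "SomeOtherBot", "https://example.com/any-page")

print("GPTBot allowed:", gptbot_ok)
print("Other bots allowed:", other_ok)
```

The rule targets GPTBot only, so other crawlers remain unaffected. Some parsers, including Python's, treat a blank line as the end of a rule group, so it is safest to keep the two directives on consecutive lines in the actual file.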

You can also customize GPTBot’s access, allowing it to crawl some parts of your site while blocking others. To do this, copy and paste the following lines into your ‘robots.txt’ file and modify the directory paths to suit your site.

User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
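The partial-access rule can be checked the same way. This is a rough sketch with Python's standard-library `urllib.robotparser`; the `example.com` host, the page names and the `/blog/` path are hypothetical, and the output reflects this parser's reading of the rules rather than a guarantee about GPTBot.

```python
from urllib.robotparser import RobotFileParser

# The customized rule from the article, kept on consecutive lines.
PARTIAL_RULES = """\
User-agent: GPTBot
Allow: /directory-1/
Disallow: /directory-2/
"""


def check_paths(robots_txt: str, user_agent: str, urls: list[str]) -> dict[str, bool]:
    """Map each URL to whether this parser would let `user_agent` fetch it."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())  # parse text directly, no network access
    return {url: parser.can_fetch(user_agent, url) for url in urls}


# Hypothetical URLs: one per listed directory, plus one path the rules do not mention.
results = check_paths(
    PARTIAL_RULES,
    "GPTBot",
    [
        "https://example.com/directory-1/page.html",
        "https://example.com/directory-2/page.html",
        "https://example.com/blog/post.html",
    ],
)

for url, allowed in results.items():
    print(url, "->", "allowed" if allowed else "blocked")
```

Paths that the rules do not mention, such as the `/blog/` example here, stay crawlable under this parser, so every section you want to keep private needs its own `Disallow` line.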

Though it is not known whether GPTBot’s data will be used to train OpenAI’s current LLMs, it is anticipated that OpenAI might use the crawler to train GPT-5. OpenAI has not announced a launch date for GPT-5, the next version of GPT, but it is expected to be more powerful than current models. For now, the launch of GPT-5 is a long way off and will not even happen in 2024, according to Sam Altman.

(For more interesting articles on information, technology and innovation, keep reading The Inner Detail.)
