Build a robots.txt file with rules for search engines and AI crawlers.
robots.txt is a plain text file placed at the root of a website (e.g. https://example.com/robots.txt) that tells web crawlers which parts of the site they may and may not access. It follows the Robots Exclusion Protocol, first proposed in 1994 and now formalised as an internet standard (RFC 9309). Every major search engine — Google, Bing, Yandex, Baidu — reads this file before crawling a site.
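A minimal file might look like this (the blocked path is a hypothetical example):

    # Applies to every crawler
    User-agent: *
    # Keep bots out of the admin area (hypothetical path)
    Disallow: /admin/

This tells all crawlers they may visit anything on the site except URLs under /admin/.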
Each block in a robots.txt file starts with a User-agent line specifying which crawler the rules apply to. An asterisk (*) matches all crawlers. Below the user-agent, Disallow lines list paths the crawler should not visit, while Allow lines grant exceptions within disallowed areas. Rules are matched by longest prefix, so a more specific Allow can override a broader Disallow. The Crawl-delay directive requests that a bot wait a specified number of seconds between requests, though not all crawlers honour it.
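As a sketch of how precedence works (all paths hypothetical):

    User-agent: *
    # Block the whole /private/ directory...
    Disallow: /private/
    # ...but the longer prefix wins, so this one page stays crawlable
    Allow: /private/press-kit.html
    # Ask bots to wait 10 seconds between requests (advisory; some crawlers, including Googlebot, ignore it)
    Crawl-delay: 10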
AI training datasets are built by specialised crawlers that scrape the open web. Common user agents include GPTBot (OpenAI), ClaudeBot (Anthropic), CCBot (Common Crawl), Google-Extended (a Google token that governs AI training use rather than a separate crawler), and Bytespider (ByteDance). Adding Disallow rules for these user agents opts your content out of model training while keeping your site fully visible to search engines. Note that robots.txt is advisory: compliant bots will respect it, but it offers no technical enforcement.
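Putting that together, an opt-out block for the crawlers named above might look like the following. Bot lists change over time, so treat this as a starting point and check each vendor's documentation:

    # Opt out of AI training crawlers; search bots are unaffected
    User-agent: GPTBot
    Disallow: /

    User-agent: ClaudeBot
    Disallow: /

    User-agent: CCBot
    Disallow: /

    User-agent: Google-Extended
    Disallow: /

    User-agent: Bytespider
    Disallow: /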
The Sitemap line tells crawlers where to find your XML sitemap, which lists all the pages you want indexed along with metadata like last-modified dates and update frequency. You can include multiple Sitemap lines if you have more than one. Sitemap directives are not tied to a specific user-agent block — they apply globally and are typically placed at the end of the file.
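For example, a site with a main sitemap and a separate blog sitemap (both URLs hypothetical) would end its file with:

    Sitemap: https://example.com/sitemap.xml
    Sitemap: https://example.com/blog/sitemap.xml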
Accidentally disallowing your entire site (Disallow: /) to all crawlers stops search engines from crawling it, and your pages will gradually drop out of search results. Blocking CSS and JavaScript files can prevent search engines from rendering pages correctly, harming your rankings. Using robots.txt to hide sensitive content is unreliable: the file publicly lists the URL while merely asking crawlers not to visit it. For truly private content, use authentication or a noindex meta tag instead (a crawler can only see a noindex tag on a page it is allowed to fetch).
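The difference between locking everyone out and allowing everything is a single character:

    # Blocks every crawler from every page - do not ship this by accident
    User-agent: *
    Disallow: /

    # Allows everything: an empty Disallow matches no URLs
    User-agent: *
    Disallow: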
This tool runs entirely in your browser. No data is sent to any server. Your input stays on your machine.
ectoplasma.org · free tools