The Complete robots.txt Guide: Syntax, Directives, and Best Practices

Everything you need to know about robots.txt. Covers syntax, directives, testing, AI crawlers, platform guides, and common mistakes.

What Is robots.txt?

A robots.txt file is a plain text file placed at the root of a website that tells web crawlers which pages or sections they can and cannot access. It is the primary mechanism of the robots exclusion protocol, a standard that has governed crawler behavior on the web since 1994.

Every major search engine, from Google and Bing to newer AI crawlers, checks for a robots.txt file before crawling a site. The file acts as a set of instructions: "You can go here. Stay out of there." It does not force anything. Crawlers choose to respect it, and well-behaved ones do.

The concept is simple. You create a file called robots.txt, place it at https://example.com/robots.txt, and fill it with directives that specify which user agents (crawlers) should avoid which paths. That is the entire idea. The power comes from understanding the syntax, knowing what each directive does, and avoiding the mistakes that can accidentally hide your site from search engines.

If you are brand new to the topic, start with What is robots.txt? for a quick primer. This guide goes deeper, covering every directive, every edge case, and every platform.

A Brief History of the Robots Exclusion Protocol

The robots exclusion protocol was proposed in 1994 by Martijn Koster after early web crawlers caused problems by aggressively crawling sites and overloading servers [1]. There was no formal standard at first, just a community agreement. Crawlers would look for /robots.txt and follow the rules inside it.

For nearly 30 years, the protocol lived as an informal convention. That changed in September 2022 when the IETF published RFC 9309, formalizing the robots exclusion protocol as an internet standard [2]. RFC 9309 codified the rules that Google, Bing, and other crawlers had already been following, while also clarifying ambiguities that had led to inconsistent behavior across different crawlers.

Today, the robots exclusion protocol is supported by every major crawler. Google publishes its own detailed specification [3], which closely follows RFC 9309 while documenting Google-specific details such as how different HTTP status codes and caching are handled.

How robots.txt Works

Where It Lives

The robots.txt file must be placed at the root of your domain:

https://example.com/robots.txt

It applies only to the host where it is found. A robots.txt at https://example.com/robots.txt has no effect on https://blog.example.com or https://example.com:8080. Each subdomain and each port needs its own file [2].

The file must be served as a plain text file with a Content-Type of text/plain. It must be accessible via HTTP or HTTPS at the exact path /robots.txt. You cannot put it in a subdirectory, rename it, or serve it dynamically under a different URL.

For a step-by-step walkthrough, see How to create a robots.txt file.

How Crawlers Find and Process It

When a crawler arrives at your site, it follows this sequence:

  1. It requests https://yourdomain.com/robots.txt
  2. If the file exists (HTTP 200), the crawler parses the directives and follows them
  3. If the file returns a 404 (or another 4xx status), the crawler assumes no restrictions and crawls everything
  4. If the server returns a 5xx error, most crawlers will treat the site as fully restricted (Google temporarily pauses crawling) [3]
  5. If the response is a redirect, the crawler follows it (up to a limit) to find the final robots.txt [4]
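The branching above can be sketched as a small decision function. This is an illustration of the sequence, not any crawler's actual code, and it assumes redirects have already been followed by the HTTP client so only the final status code matters:

```python
def crawl_policy(status: int) -> str:
    """Map the final HTTP status of /robots.txt to a crawl decision,
    following the sequence described above."""
    if status == 200:
        return "parse"            # read the file and apply its rules
    if 400 <= status < 500:
        return "allow_all"        # missing file: no restrictions
    if 500 <= status < 600:
        return "assume_disallow"  # server error: treat the site as restricted
    return "allow_all"            # other statuses: most crawlers fail open
```

For example, `crawl_policy(503)` returns `"assume_disallow"`, matching the behavior described in step 4.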

Understanding how crawlers handle redirects is relevant here. If your robots.txt triggers a redirect chain, the crawler might give up before reaching the actual file. Keep the path clean and direct. For more on how redirect chains affect crawlers, see how redirect chains work.

The Advisory Nature of robots.txt

This is critical to understand: robots.txt is advisory, not enforceable. Well-behaved crawlers (Googlebot, Bingbot, most commercial crawlers) respect the directives. Malicious bots, scrapers, and some less scrupulous crawlers ignore robots.txt entirely.

robots.txt is not a security mechanism. If you need to prevent access to sensitive content, use server-side authentication, password protection, or keep the content off the public internet. Putting a Disallow rule in robots.txt tells polite bots to stay away, but it does nothing to stop a determined scraper.

For a deeper look at what robots.txt actually does (and does not do), read What does robots.txt do?.

Complete Syntax Reference

The robots.txt file uses a simple line-based syntax. Each line is either a directive, a comment, or blank. Let's break down every element.

Basic Structure

A robots.txt file is organized into groups. Each group starts with one or more User-agent lines followed by one or more rules (Disallow, Allow). Here is a minimal example:

User-agent: Googlebot
Disallow: /private/
Allow: /private/public-page.html

User-agent: *
Disallow: /admin/

Groups are separated by blank lines. Each group targets specific crawlers and defines rules for them.

For a full breakdown of how to read these files, see How to read robots.txt.

Comments

Lines starting with # are comments. You can also add comments at the end of a directive line:

# This is a comment
User-agent: * # This applies to all crawlers
Disallow: /tmp/ # Temporary files

Comments are ignored by parsers. Use them to document why rules exist, especially if your robots.txt is complex.

User-agent

The User-agent directive specifies which crawler a group of rules applies to. Every rule group must begin with at least one User-agent line.

User-agent: Googlebot

The wildcard * matches all crawlers that are not matched by a more specific group:

User-agent: *
Disallow: /private/

You can stack multiple User-agent lines to apply the same rules to several crawlers:

User-agent: Googlebot
User-agent: Bingbot
Disallow: /staging/

User-agent matching is case-insensitive: RFC 9309 specifies case-insensitive matching of the user-agent token [2], and Google's implementation follows suit [3]. (Path matching, by contrast, is case-sensitive.)

For a deep dive into user-agent strings and which bots use which names, read robots.txt User-agent explained.

Disallow

The Disallow directive tells a crawler not to access a specific path or path prefix.

User-agent: *
Disallow: /admin/
Disallow: /private/
Disallow: /tmp/secret.html

A Disallow: /admin/ rule blocks any URL that starts with /admin/, including /admin/login, /admin/settings/security, and so on.

An empty Disallow means "disallow nothing" (allow everything):

User-agent: *
Disallow:

This is a common way to explicitly state that all crawling is permitted.

For more on how Disallow works with real-world examples, see robots.txt Disallow explained.

Allow

The Allow directive permits access to a path that would otherwise be blocked by a Disallow rule. It is most useful for creating exceptions within broader blocks.

User-agent: *
Disallow: /private/
Allow: /private/about.html

This blocks everything under /private/ except /private/about.html.

When both Allow and Disallow match a URL, the more specific (longer) rule wins. If they are the same length, Allow takes precedence per Google's implementation [3].
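That precedence rule is easy to sketch in Python. This is a simplified model for plain prefix rules only (wildcards ignored; the function name is mine):

```python
def is_allowed(path: str, rules: list[tuple[str, str]]) -> bool:
    """rules is a list of ("allow" | "disallow", path_prefix) pairs.
    The longest matching prefix wins; on a tie, Allow beats Disallow."""
    matches = [(len(prefix), kind) for kind, prefix in rules if path.startswith(prefix)]
    if not matches:
        return True  # no rule matches: crawling is allowed
    # Sort by length first; on equal length, prefer "allow"
    _, kind = max(matches, key=lambda m: (m[0], m[1] == "allow"))
    return kind == "allow"
```

With the rules from the example above, `is_allowed("/private/about.html", [("disallow", "/private/"), ("allow", "/private/about.html")])` returns `True` because the Allow rule is longer.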

Sitemap

The Sitemap directive tells crawlers where to find your XML sitemap. Unlike other directives, it is not tied to any User-agent group. It can appear anywhere in the file.

Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-posts.xml

You can list multiple sitemaps. Each Sitemap line must contain a full, absolute URL.

Including your sitemap in robots.txt is a simple way to help search engines discover your content. For instructions on adding one, see How to add a sitemap to robots.txt. If you need to generate a sitemap first, learn what a sitemap is and how to create one.

Crawl-delay

The Crawl-delay directive tells a crawler to wait a specified number of seconds between requests. Not all crawlers support it. Notably, Google ignores Crawl-delay entirely [5]; if you need Googlebot to slow down, Google's documented approach is to temporarily return 429 or 503 responses. Bing, Yandex, and some others do respect it.

User-agent: Bingbot
Crawl-delay: 10

This asks Bingbot to wait 10 seconds between each request to your server.

Use Crawl-delay sparingly. Setting it too high can significantly slow down how quickly your site gets crawled and indexed. For details on when and how to use it, read robots.txt Crawl-delay.

Wildcards: * and $

Wildcards were not part of the original 1994 protocol, but RFC 9309 standardized the * and $ special characters, and Google and Bing both support them [2] [3]. They add powerful pattern-matching capabilities.

The * (asterisk) matches any sequence of characters:

User-agent: *
Disallow: /*.pdf$
Disallow: /directory/*/private/

The $ (dollar sign) anchors a match to the end of the URL:

User-agent: *
Disallow: /*.php$

This blocks all URLs ending in .php but would not block /page.php?id=1 (because the URL does not end at .php).
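One way to reason about what these patterns match is to translate them into regular expressions, roughly mirroring the matching behavior Google describes. A sketch (the function name is mine, and real crawlers do not necessarily implement matching this way):

```python
import re

def pattern_to_regex(pattern: str) -> re.Pattern:
    """Translate a robots.txt path pattern into a regex:
    * matches any character sequence; a trailing $ anchors the end."""
    anchored = pattern.endswith("$")
    if anchored:
        pattern = pattern[:-1]
    # Escape literal pieces, then rejoin with ".*" where the * wildcards were
    body = ".*".join(re.escape(piece) for piece in pattern.split("*"))
    return re.compile("^" + body + ("$" if anchored else ""))
```

For example, `pattern_to_regex("/*.php$")` matches `/page.php` but not `/page.php?id=1`, mirroring the example above.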

Some practical wildcard patterns:

# Block all PDF files
Disallow: /*.pdf$

# Block all URLs with query parameters
Disallow: /*?

# Block a pattern in any subdirectory
Disallow: /*/feed/

For a complete reference with more examples, see robots.txt wildcards.

Wildcard support varies by crawler. Google and Bing support * and $ patterns. Other crawlers may not. Always test your patterns to make sure they match what you expect. You can test your robots.txt to verify wildcard behavior.

Detailed Syntax Reference

For the full syntax specification with additional edge cases and formal grammar, see our dedicated robots.txt syntax reference.

Here is a quick-reference summary of all directives:

| Directive | Scope | Purpose |
|---|---|---|
| User-agent | Group | Specifies which crawler(s) the following rules apply to |
| Disallow | Group | Blocks access to a path or path prefix |
| Allow | Group | Permits access to a path (overrides Disallow) |
| Sitemap | Global | Points to an XML sitemap URL |
| Crawl-delay | Group | Requests a delay between successive requests |
| * (wildcard) | In path | Matches any sequence of characters |
| $ (anchor) | In path | Matches end of URL |

Common robots.txt Examples

Let's walk through the most common configurations. For an extended collection, visit robots.txt examples.

Allow All Crawlers

The simplest robots.txt. Either an empty file or:

User-agent: *
Disallow:

Both have the same effect: no restrictions for any crawler.

Block All Crawlers

User-agent: *
Disallow: /

This blocks every crawler from every page. Use this on staging environments, development servers, or sites that are not ready for indexing.

Block Specific Directories

User-agent: *
Disallow: /admin/
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
Allow: /private/team.html

This keeps crawlers out of administrative and temporary directories while making an exception for one specific page.

Block Specific Crawlers

User-agent: BadBot
Disallow: /

User-agent: AnotherBot
Disallow: /

User-agent: *
Disallow:

This blocks two specific bots from the entire site while allowing everyone else full access.

Block URL Parameters

User-agent: *
Disallow: /*?sessionid=
Disallow: /*?sort=
Disallow: /*?ref=

This prevents crawlers from indexing URLs with specific query parameters that often create duplicate content.

WordPress-Typical Configuration

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php
Disallow: /wp-includes/
Disallow: /wp-content/plugins/
Disallow: /*?s=
Disallow: /*?p=
Disallow: /tag/
Disallow: /author/

Sitemap: https://example.com/sitemap.xml

This is a common starting point for WordPress sites. It blocks admin areas while allowing the AJAX handler that some themes and plugins need, and prevents crawling of internal search results, shortlink URLs (?p=), and low-value archive pages.

Blocking AI Crawlers

The rise of large language models has brought a new wave of web crawlers. Companies like OpenAI, Anthropic, and others send bots to crawl the web for training data and retrieval-augmented generation. Many site owners now want to control which AI crawlers can access their content.

Here are the most common AI crawler user agents as of early 2026:

| Bot | Company | User-agent |
|---|---|---|
| GPTBot | OpenAI | GPTBot |
| ChatGPT-User | OpenAI | ChatGPT-User |
| ClaudeBot | Anthropic | ClaudeBot |
| Claude-Web | Anthropic | Claude-Web |
| PerplexityBot | Perplexity | PerplexityBot |
| Bytespider | ByteDance | Bytespider |
| Google-Extended | Google (AI training) | Google-Extended |
| FacebookBot | Meta | FacebookBot |
| Applebot-Extended | Apple | Applebot-Extended |
| cohere-ai | Cohere | cohere-ai |

To block all of these:

# Block AI crawlers
User-agent: GPTBot
Disallow: /

User-agent: ChatGPT-User
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: Claude-Web
Disallow: /

User-agent: PerplexityBot
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: FacebookBot
Disallow: /

User-agent: Applebot-Extended
Disallow: /

User-agent: cohere-ai
Disallow: /

You can also allow some AI crawlers while blocking others. For example, if you want to appear in Perplexity's answers but block training crawlers:

User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

User-agent: PerplexityBot
Disallow:

This space is evolving quickly. New AI crawlers appear regularly, and some do not always identify themselves clearly. For a current, maintained list and copy-paste configurations, see How to block AI crawlers with robots.txt.

Blocking an AI crawler with robots.txt only works if that crawler respects the protocol. While major companies like OpenAI and Anthropic have publicly committed to honoring robots.txt [6] [7], smaller or less reputable crawlers may not. For stronger protection, consider server-side blocking by user-agent string or IP range in addition to robots.txt rules.

robots.txt vs Meta Robots vs X-Robots-Tag

Three mechanisms exist for controlling how search engines interact with your content. They serve different purposes and work at different levels.

robots.txt

  • Scope: Controls crawling at the URL/path level
  • How it works: Tells crawlers which URLs they can request
  • Key limitation: Cannot control indexing. A blocked URL can still appear in search results if other pages link to it.

Meta Robots Tag

  • Scope: Per-page control, set in the HTML <head>
  • How it works: Directives like noindex, nofollow, noarchive control indexing and link-following behavior
  • Key advantage: Can prevent a page from appearing in search results
<meta name="robots" content="noindex, nofollow">

X-Robots-Tag

  • Scope: Per-response control, set as an HTTP header
  • How it works: Same directives as meta robots, but applied via HTTP headers
  • Key advantage: Works on non-HTML resources (PDFs, images, videos)
X-Robots-Tag: noindex, nofollow

When to Use Each

| Goal | Use |
|---|---|
| Block crawling of entire directories | robots.txt |
| Prevent a specific page from appearing in search results | Meta robots noindex |
| Prevent indexing of a PDF or image | X-Robots-Tag |
| Save crawl budget | robots.txt |
| Block and deindex a page | Meta robots (the page must be crawlable for the bot to see the tag) |

A common mistake is using robots.txt to try to deindex pages. If you block a URL with Disallow, crawlers cannot see the noindex tag on that page, so the page might remain in search results indefinitely. To deindex a page, you must allow crawling so the bot can find and process the noindex directive.

For a full comparison, read robots.txt vs meta robots. For details on the noindex interaction, see robots.txt and noindex.

Platform-Specific Guides

Different CMS platforms handle robots.txt differently. Some generate it automatically, some let you edit it directly, and some require workarounds.

WordPress

WordPress generates a virtual robots.txt by default. You can customize it in several ways:

  1. Plugin-based editing: Plugins like Yoast SEO or Rank Math provide a robots.txt editor in the dashboard
  2. Physical file: Create an actual robots.txt file in your root directory (this overrides the virtual one)
  3. Code-based: Use the robots_txt filter in your theme's functions.php

A typical WordPress robots.txt:

User-agent: *
Disallow: /wp-admin/
Allow: /wp-admin/admin-ajax.php

Sitemap: https://example.com/sitemap_index.xml

For detailed instructions, see How to edit robots.txt in WordPress.

Shopify

Shopify generates a default robots.txt that you cannot directly edit through the file system. Since 2021, Shopify allows customization through the robots.txt.liquid theme template [8].

To customize:

  1. Go to Online Store > Themes > Actions > Edit code
  2. Under Templates, add a new template called robots.txt.liquid
  3. Add your custom directives

Shopify's default robots.txt already blocks checkout pages, cart pages, internal search, and other paths that should not be indexed.

For a step-by-step guide, read How to edit robots.txt in Shopify.

Wix

Wix automatically generates a robots.txt file. Direct editing is limited, but you can:

  1. Go to Settings > SEO > SEO Tools > Robots.txt Editor
  2. Add custom rules through the interface

Wix's auto-generated file typically handles the basics well, blocking admin paths and internal pages.

Squarespace

Squarespace does not provide a built-in robots.txt editor. The platform generates a default file that blocks certain system paths, and there is no supported way to edit it directly.

For Squarespace, the meta robots approach (setting pages to "Hide from search engines" in page settings) is therefore more practical than trying to modify robots.txt.

Static Sites and Custom Servers

If you control your own server, just create a robots.txt file in your web root directory. For static site generators like Next.js, Gatsby, Hugo, or Astro, place the file in your public or static directory so it gets served at the root path.

Testing and Validating Your robots.txt

A misconfigured robots.txt can block search engines from your entire site. Testing before deploying is not optional.

Online Testing Tools

The fastest way to validate your robots.txt is with an online tester. Enter your URL, check specific paths against specific user agents, and see exactly what is allowed and what is blocked.

You can test your robots.txt using our free tool, which parses your file, checks syntax, and lets you test individual URLs against your rules.
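You can also check rules locally with Python's standard library, which ships a robots.txt parser. One caveat: urllib.robotparser uses simple prefix matching with first-match semantics, so it will not agree with Google on wildcard rules or overlapping Allow/Disallow pairs; use it for straightforward prefix rules only.

```python
from urllib.robotparser import RobotFileParser

# Parse an in-memory robots.txt (example.com is a placeholder domain)
rules = """\
User-agent: *
Disallow: /admin/
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Ask whether a given user agent may fetch a given URL
print(parser.can_fetch("*", "https://example.com/admin/login"))  # False
print(parser.can_fetch("*", "https://example.com/about"))        # True
```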

Google Search Console

Google Search Console includes a robots.txt report (under Settings). It shows which versions of your file Google has fetched, exactly how Googlebot interprets them, and any parsing errors [5]. (The old standalone robots.txt Tester tool has been retired.)

Manual Verification

You can also verify your robots.txt by:

  1. Fetching it directly: visit https://yoursite.com/robots.txt in a browser
  2. Checking the HTTP status code (should be 200)
  3. Verifying the Content-Type header is text/plain
  4. Walking through each rule manually to ensure it matches your intent
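Steps 2 and 3 are easy to script. A minimal sketch of the check (the function name is mine):

```python
def robots_txt_servable(status: int, headers: dict) -> bool:
    """True if a fetched /robots.txt response looks correctly served:
    HTTP 200 with a text/plain Content-Type (charset parameters allowed)."""
    content_type = headers.get("Content-Type", "")
    return status == 200 and content_type.split(";")[0].strip() == "text/plain"
```

For example, a response with status 200 and `Content-Type: text/plain; charset=utf-8` passes, while a 200 served as `text/html` fails.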

Ongoing Monitoring

robots.txt files can change unexpectedly, especially on platforms that auto-generate them or when deployment processes overwrite the file. Set up monitoring to alert you if your robots.txt changes in ways you did not expect. For approaches to this, see robots.txt monitoring.

For a full guide on checking and verifying your file, read How to check robots.txt.

Common Mistakes

Years of helping people debug their robots.txt files have revealed the same mistakes appearing over and over. Here are the ones that cause the most damage.

Blocking Your Entire Site Accidentally

User-agent: *
Disallow: /

This is correct if you want to block all crawling. It is a disaster if you left it in place after migrating from a staging environment. Always check your robots.txt after a site launch or migration.

If your site has disappeared from search results and you suspect robots.txt, see How to fix blocked by robots.txt.

Using robots.txt to Deindex Pages

As covered above, Disallow prevents crawling, not indexing. If external links point to a disallowed URL, search engines may still index and display it (typically with limited information). Use meta robots noindex to actually remove pages from search results.

Blocking CSS and JavaScript

Modern search engines render pages to understand their content. Blocking /css/ or /js/ directories prevents Googlebot from rendering your pages properly, which can hurt your rankings [9].

# Don't do this
User-agent: *
Disallow: /css/
Disallow: /js/
Disallow: /images/

Incorrect Path Syntax

Paths in robots.txt are case-sensitive. /Admin/ and /admin/ are different paths. Also, a trailing slash matters: Disallow: /private matches /private, /private/, /private-page, and /privately-held. To match only the directory, use Disallow: /private/.
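Since a plain Disallow rule is just a string-prefix test, the trailing-slash behavior is easy to verify directly:

```python
# Disallow rules without wildcards are prefix matches.
rule = "/private"
for path in ("/private", "/private/", "/private-page", "/privately-held"):
    print(path, path.startswith(rule))  # True for every path above

# Adding the trailing slash narrows the match to the directory:
print("/privately-held".startswith("/private/"))  # False
```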

Forgetting the Leading Slash

Every path must start with /. This is wrong:

Disallow: admin/

This is correct:

Disallow: /admin/

Too Many or Conflicting Rules

Complex robots.txt files with dozens of overlapping Allow and Disallow rules become difficult to maintain and debug. When rules conflict, the result depends on which crawler is interpreting them. Keep your rules as simple as possible.

Wrong File Location

The file must be at the exact root URL: https://example.com/robots.txt. Putting it at https://example.com/pages/robots.txt or any other path has no effect.

Not Updating After Site Changes

Redesigns, URL structure changes, CMS migrations, and new sections all require a robots.txt review. A robots.txt that was correct for your old URL structure can silently break things after a migration. Add a robots.txt audit to your launch checklist.

Encoding and BOM Issues

The file must be UTF-8 encoded without a byte order mark (BOM). Some text editors on Windows add a BOM by default, which can cause parsing failures in certain crawlers. If your robots.txt looks correct in a browser but crawlers are ignoring it, check for hidden BOM characters at the start of the file.
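A quick way to check is to inspect the first three bytes of the file. A sketch (the filename in the commented example is an assumption):

```python
# A UTF-8 BOM is the byte sequence EF BB BF at the very start of the file.
BOM = b"\xef\xbb\xbf"

def starts_with_bom(raw: bytes) -> bool:
    return raw.startswith(BOM)

# Example usage:
# with open("robots.txt", "rb") as f:
#     print(starts_with_bom(f.read(3)))
```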

For more guidance on avoiding these pitfalls, read robots.txt best practices.

robots.txt and SEO

The relationship between robots.txt and search engine optimization is nuanced. Understanding it can help you manage your crawl budget and avoid common misconceptions.

Crawl Budget

Crawl budget is the number of pages a search engine will crawl on your site within a given time period. For large sites (tens of thousands of pages or more), crawl budget is a real concern. Using robots.txt to block low-value pages (internal search results, paginated archives, faceted navigation URLs) directs crawlers toward your most valuable content.

For smaller sites (under a few thousand pages), crawl budget is rarely a practical issue. Googlebot can easily crawl a small site in its entirety.

Indexing Misconceptions

The most persistent misconception: Disallow does not remove pages from Google's index. Repeat that to yourself. A disallowed page can still appear in search results. Google may show the URL with a note like "No information is available for this page" or display anchor text from incoming links.

If you need to remove a URL from search results:

  1. Allow crawling (remove the Disallow rule)
  2. Add a noindex meta tag or X-Robots-Tag
  3. Wait for the crawler to process the page
  4. Optionally use Google Search Console's URL Removal tool for faster action

Link Equity and Disallowed Pages

When you block a page with robots.txt, search engines cannot crawl it to find outbound links on that page. Any link equity (PageRank) that would flow through those links is effectively lost. If a blocked page receives backlinks from external sites, the equity from those backlinks cannot be distributed further.

Should You Have a robots.txt?

Not every site strictly needs one. If you want all your content crawled and indexed, an empty robots.txt or no robots.txt at all is fine. But most sites benefit from at least a basic configuration that blocks admin areas and includes a sitemap reference.

Staging and Pre-launch SEO

One of the most common SEO disasters is launching a site with a staging robots.txt still in place. During development, Disallow: / makes sense. In production, it means zero organic traffic. Automate the swap as part of your deployment pipeline, and verify the live robots.txt immediately after every launch.

For a full discussion on whether your site needs one, read Do you need a robots.txt file?. For SEO-specific strategies, see robots.txt and SEO.

International Sites and Crawling

If your site serves content in multiple languages or targets multiple regions, robots.txt applies equally to all localized versions on the same host. You do not need separate rules for different languages unless they live on different subdomains (in which case each subdomain has its own robots.txt).

For multilingual sites using hreflang tags, make sure your localized pages are crawlable so search engines can discover and process the hreflang annotations. Blocking localized pages with robots.txt would prevent search engines from understanding your international targeting. To learn more about hreflang implementation, see what hreflang is and how it works.

Also ensure that your XML sitemaps for all language versions are accessible and referenced in your robots.txt Sitemap directive. Following sitemap best practices helps search engines discover all versions of your content efficiently.

Advanced Patterns and Techniques

Combining Wildcards for Precision

You can chain wildcards to create precise matching patterns:

# Block all paginated URLs in any section
User-agent: *
Disallow: /*/page/*

# Block specific file types across the entire site
Disallow: /*.json$
Disallow: /*.xml$

# Block URLs containing specific parameters
Disallow: /*utm_*

Handling Multiple Environments

Use a restrictive robots.txt on staging and development environments:

# staging.example.com/robots.txt
User-agent: *
Disallow: /

And a permissive one in production. Automate this as part of your deployment process so the staging block never accidentally goes live.

Selective AI Crawler Access

Some sites want to allow AI-powered search (like Perplexity or Google AI Overviews) while blocking training crawlers. This requires knowing which user-agent string each crawler uses and what it is used for:

# Allow traditional search crawlers
User-agent: Googlebot
Disallow:

User-agent: Bingbot
Disallow:

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

# Allow AI search (retrieval) crawlers
User-agent: PerplexityBot
Disallow:

Protecting Sensitive Content

While robots.txt is not a security tool, you can use it as one layer in a defense-in-depth approach. Combine robots.txt Disallow rules with:

  • Server-side authentication for truly private content
  • noindex meta tags to prevent search engine indexing
  • IP-based access controls for admin areas
  • Firewall rules to block known bad crawlers by IP range

No single mechanism is sufficient on its own. robots.txt handles well-behaved crawlers. Server-side controls handle everything else.

Rate Limiting with Crawl-delay

For servers with limited resources, combine Crawl-delay with selective access:

User-agent: Bingbot
Crawl-delay: 5
Disallow: /api/

User-agent: *
Crawl-delay: 10
Disallow: /api/

Remember that Google ignores Crawl-delay. If Googlebot is overloading your server, temporarily returning 429 or 503 responses is Google's documented way to slow it down.

Quick Reference Cheat Sheet

Here is a condensed reference for copy-pasting common configurations:

Allow everything:

User-agent: *
Disallow:

Block everything:

User-agent: *
Disallow: /

Block one bot:

User-agent: BadBot
Disallow: /

Block a directory:

User-agent: *
Disallow: /secret/

Allow exception within a block:

User-agent: *
Disallow: /private/
Allow: /private/public.html

Include sitemap:

Sitemap: https://example.com/sitemap.xml

Putting It All Together

A well-structured robots.txt for a typical business website might look like this:

# Robots.txt for example.com
# Last updated: 2026-04-14

# Default rules for all crawlers
User-agent: *
Disallow: /admin/
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /*?sessionid=
Disallow: /*?sort=
Disallow: /internal/
Allow: /internal/careers/

# Block AI training crawlers
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /

User-agent: Bytespider
Disallow: /

# Sitemaps
Sitemap: https://example.com/sitemap.xml
Sitemap: https://example.com/sitemap-news.xml

This configuration:

  • Blocks admin and temporary directories from all crawlers
  • Prevents session ID and sort parameter URLs from being crawled
  • Creates an exception for the careers section within a blocked directory
  • Blocks AI training crawlers
  • Points to two sitemaps

Test every change before deploying. Validate with a robots.txt testing tool. Monitor your file for unexpected changes. And remember: robots.txt controls crawling, not indexing.

References

  1. Koster, M. "A Standard for Robot Exclusion." 1994. https://www.robotstxt.org/orig.html
  2. Koster, M., Illyes, G., Zeller, H., Sassman, L. "RFC 9309: Robots Exclusion Protocol." IETF, September 2022. https://www.rfc-editor.org/rfc/rfc9309
  3. Google. "Google's robots.txt specification." Google Search Central. https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt
  4. Google. "How Google interprets the robots.txt specification." Google Search Central. https://developers.google.com/search/docs/crawling-indexing/robots/robots_txt#handling-http-result-codes
  5. Google. "Robots.txt report." Google Search Central. https://developers.google.com/search/docs/crawling-indexing/robots/robots-report
  6. OpenAI. "GPTBot." OpenAI Platform Documentation. https://platform.openai.com/docs/gptbot
  7. Anthropic. "ClaudeBot." Anthropic Documentation. https://docs.anthropic.com/en/docs/claude-web-crawling
  8. Shopify. "Customize robots.txt." Shopify Help Center. https://help.shopify.com/en/manual/promoting-marketing/seo/editing-robots-txt
  9. Google. "Understand the JavaScript SEO basics." Google Search Central. https://developers.google.com/search/docs/crawling-indexing/javascript/javascript-seo-basics
