A robots.txt file contains directives for search engines, which you can use to prevent search engines from crawling specific parts of your website.
When implementing robots.txt, keep the following best practices in mind:
A robots.txt file tells search engines your website’s rules of engagement.
Search engines regularly check a website's robots.txt file to see if there are any instructions for crawling the website. We call these instructions ‘directives’.
If there’s no robots.txt file present or if there are no applicable directives, search engines will crawl the entire website.
Although all major search engines respect the robots.txt file, search engines may choose to ignore (parts of) your robots.txt file. While directives in the robots.txt file are a strong signal to search engines, it’s important to remember the robots.txt file is a set of optional directives to search engines rather than a mandate.
The robots.txt is the most sensitive file in the SEO universe. A single character can break a whole site.
The robots.txt file is the implementation of the robots exclusion standard, also called the robots exclusion protocol.
The robots.txt file plays an essential role from a search engine optimization (SEO) point of view. It tells search engines how they can best crawl your website.
Using the robots.txt file you can prevent search engines from accessing certain parts of your website, prevent duplicate content and give search engines helpful tips on how they can crawl your website more efficiently.
Be careful when making changes to your robots.txt though: this file has the potential to make big parts of your website inaccessible for search engines.
Robots.txt is often overused to reduce duplicate content, thereby killing internal linking, so be really careful with it. My advice is to only ever use it for files or pages that search engines should never see, or that can significantly impact crawling if search engines are allowed into them. Common examples: log-in areas that generate many different URLs, test areas, or areas where multiple faceted navigation can exist. And make sure to monitor your robots.txt file for any issues or changes.
The vast majority of issues I see with robots.txt files fall into four buckets: 1) The mishandling of wildcards. It's fairly common to see parts of the site blocked off that weren't intended to be blocked off. Sometimes, if you aren't careful, directives can also conflict with one another. 2) Someone, such as a developer, has made a change out of the blue (often when pushing new code) and has inadvertently altered the robots.txt without your knowledge. 3) The inclusion of directives that don't belong in a robots.txt file. Robots.txt is a web standard, and is somewhat limited. I oftentimes see developers making directives up that simply won't work (at least for the vast majority of crawlers). Sometimes that's harmless, sometimes not so much.
Let’s look at an example to illustrate this:
You're running an e-commerce website and visitors can use a filter to quickly search through your products. This filter generates pages which basically show the same content as other pages do. This works great for users, but confuses search engines because it creates duplicate content. You don't want search engines to index these filtered pages and waste their valuable time on these URLs with filtered content. Therefore, you should set up Disallow rules so search engines don't access these filtered product pages.
Preventing duplicate content can also be done using the canonical URL or the meta robots tag. However, these don't address letting search engines only crawl pages that matter. Using a canonical URL or meta robots tag will not prevent search engines from crawling these pages; it will only prevent search engines from showing these pages in the search results. Since search engines have limited time to crawl a website, this time should be spent on pages that you want to appear in search engines.
An incorrectly set up robots.txt file may be holding back your SEO performance. Check if this is the case for your website right away!
It's a very simple tool, but a robots.txt file can cause a lot of problems if it's not configured correctly, particularly for larger websites. It's very easy to make mistakes such as blocking an entire site after a new design or CMS is rolled out, or not blocking sections of a site that should be private. For larger websites, ensuring Google crawls efficiently is very important and a well-structured robots.txt file is an essential tool in that process. You need to take time to understand which sections of your site are best kept away from Google so that they spend as much of their resource as possible crawling the pages that you really care about.
An example of what a simple robots.txt file for a WordPress website may look like:
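A minimal sketch of such a file, pieced together from the explanation that follows (a wildcard user-agent and a single disallowed /wp-admin/ path). A real WordPress site will likely need more rules than this:

```text
User-agent: *
Disallow: /wp-admin/
```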
Let’s explain the anatomy of a robots.txt file based on the example above:
User-agent: this indicates for which search engines the directives that follow are meant.
*: this indicates that the directives are meant for all search engines.
Disallow: this is a directive indicating what content is not accessible to the user-agent.
/wp-admin/: this is the path which is inaccessible for the user-agent.
In summary: this robots.txt file tells all search engines to stay out of the /wp-admin/ directory.
Each search engine should identify itself with a user-agent. Google's robots identify as Googlebot, for example, Yahoo's robots as Slurp, Bing's robot as BingBot, and so on.
The user-agent record defines the start of a group of directives. All directives in between the first user-agent and the next user-agent record are treated as directives for the first user-agent record.
Directives can apply to specific user-agents, but they can also be applicable to all user-agents. In that case, a wildcard is used:
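As a sketch, a file with one group for all user-agents and one group for a specific user-agent could look like the following. The paths are hypothetical and only serve to show where each group starts and ends:

```text
# Group for all user-agents (wildcard)
User-agent: *
Disallow: /example-private/

# Group for Google's robots only
User-agent: Googlebot
Disallow: /example-drafts/
```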
You can tell search engines not to access certain files, pages or sections of your website. This is done using the Disallow directive. The Disallow directive is followed by the path that should not be accessed. If no path is defined, the directive is ignored.
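For illustration, a Disallow rule for all search engines might be written as below. Reusing the /wp-admin/ path from the WordPress example earlier in this article is an assumption:

```text
User-agent: *
Disallow: /wp-admin/
```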
In this example all search engines are told not to access the
The Allow directive is used to counteract a Disallow directive. The Allow directive is supported by Google and Bing. Using the Allow and Disallow directives together you can tell search engines they can access a specific file or page within a directory that's otherwise disallowed. The Allow directive is followed by the path that can be accessed. If no path is defined, the directive is ignored.
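A sketch of Allow and Disallow used together. The /media/ directory comes from the explanation that follows; the file name is hypothetical:

```text
User-agent: *
# Hypothetical file that should stay accessible
Allow: /media/example-document.pdf
# Everything else in the directory is off limits
Disallow: /media/
```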
In the example above all search engines are not allowed to access the /media/ directory, except for the one file that is explicitly allowed.
Important: when using Allow and Disallow directives together, be sure not to use wildcards since this may lead to conflicting directives.
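A sketch of the kind of conflict meant here. The rules are assumptions, chosen so that both match the URL discussed next:

```text
User-agent: *
Allow: /directory
Disallow: /*.html
```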
Search engines will not know what to do with the URL http://www.domain.com/directory.html. It's unclear to them whether they're allowed to access it.
Disallow rules in a site's robots.txt file are incredibly powerful, so should be handled with care. For some sites, preventing search engines from crawling specific URL patterns is crucial to enable the right pages to be crawled and indexed - but improper use of disallow rules can severely damage a site's SEO.
Each directive should be on a separate line, otherwise search engines may get confused when parsing the robots.txt file.
Example of incorrect robots.txt file
Avoid a robots.txt file like this:
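A sketch of the mistake, with hypothetical directory names. Everything is crammed onto one line:

```text
User-agent: * Disallow: /directory-1/ Disallow: /directory-2/ Disallow: /directory-3/
```

The same rules with each directive on its own line:

```text
User-agent: *
Disallow: /directory-1/
Disallow: /directory-2/
Disallow: /directory-3/
```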
Robots.txt is one of the features I most commonly see implemented incorrectly, so that it's not blocking what the site owner wanted to block, or it's blocking more than expected and has a negative impact on the website. Robots.txt is a very powerful tool, but too often it's incorrectly set up.
Not only can the wildcard be used for defining the user-agent, it can also be used to match URLs. The wildcard is supported by Google, Bing, Yahoo and Ask.
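A sketch of a wildcard used to match URLs. The leading slash is an assumption about the safest way to write the pattern:

```text
User-agent: *
# Matches any URL that contains a question mark
Disallow: /*?
```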
In the example above all search engines aren’t allowed access to URLs which include a question mark (?).
Developers or site-owners often seem to think they can utilise all manner of regular expression in a robots.txt file whereas only a very limited amount of pattern matching is actually valid - for example wildcards ("*"). There seems to be a confusion between .htaccess files and robots.txt files from time to time.
To indicate the end of a URL, you can use the dollar sign ($) at the end of the path.
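A sketch combining the wildcard with the end-of-URL marker, using the .php extension discussed next:

```text
User-agent: *
# Matches URLs that end in .php, and nothing after it
Disallow: /*.php$
```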
In the example above search engines aren't allowed to access any URL that ends with .php. URLs with parameters, e.g. https://example.com/page.php?lang=en, would not be disallowed, as the URL doesn't end after .php.
Even though the robots.txt file was invented to tell search engines what pages not to crawl, the robots.txt file can also be used to point search engines to the XML sitemap. This is supported by Google, Bing, Yahoo and Ask.
The XML sitemap should be referenced as an absolute URL. The URL does not have to be on the same host as the robots.txt file. Referencing the XML sitemap in the robots.txt file is one of the best practices we advise you to always do, even though you may have already submitted your XML sitemap in Google Search Console or Bing Webmaster Tools. Remember, there are more search engines out there.
Please note that it’s possible to reference multiple XML sitemaps in a robots.txt file.
Multiple XML sitemaps:
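A sketch with two sitemap references. The sitemap URLs are hypothetical:

```text
User-agent: *
Disallow: /wp-admin/

Sitemap: https://www.example.com/sitemap1.xml
Sitemap: https://www.example.com/sitemap2.xml
```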
The example above tells all search engines not to access the directory /wp-admin/ and that there are two XML sitemaps, each referenced by its absolute URL.
A single XML sitemap:
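A sketch with one sitemap reference. The sitemap URL is hypothetical:

```text
User-agent: *
Disallow: /wp-admin/

Sitemap: https://www.example.com/sitemap_index.xml
```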
The example above tells all search engines not to access the directory /wp-admin/ and that the XML sitemap can be found at the absolute URL given in the Sitemap line.
Comments are preceded by a # and can either be placed at the start of a line or after a directive on the same line. Everything after the # will be ignored. These comments are meant for humans only.
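Two sketches of the same rule, one with the comment on its own line and one with comments after the directives. The comment wording is illustrative:

```text
# Don't allow access to the /wp-admin/ directory for all robots.
User-agent: *
Disallow: /wp-admin/
```

```text
User-agent: * # Applies to all robots
Disallow: /wp-admin/ # Don't allow access to the /wp-admin/ directory.
```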
The examples above communicate the same.
The Crawl-delay directive is an unofficial directive used to prevent overloading servers with too many requests. If search engines are able to overload a server, adding Crawl-delay to your robots.txt file is only a temporary fix. The fact of the matter is, your website is running on a poor hosting environment and you should fix that as soon as possible.
The way search engines handle the Crawl-delay directive differs. Below we explain how major search engines handle it.
Google does not support the Crawl-delay directive. However, Google does support defining a crawl rate in Google Search Console. Follow the steps below to set it:
Bing, Yahoo and Yandex
Bing, Yahoo and Yandex all support the Crawl-delay directive to throttle crawling of a website. Their interpretation of the crawl-delay is different though, so be sure to check their documentation:
Crawl-delay directive should be placed right after the
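A sketch of a Crawl-delay rule. The value 10, the /example-private/ path and the placement after the Disallow line are all assumptions, and each search engine interprets the number in its own way:

```text
User-agent: BingBot
Disallow: /example-private/
Crawl-delay: 10
```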
Baidu does not support the crawl-delay directive. However, it's possible to register a Baidu Webmaster Tools account in which you can control the crawl frequency, similar to Google Search Console.
We recommend always using a robots.txt file. There's absolutely no harm in having one, and it's a great place to hand search engines directives on how they can best crawl your website.
The robots.txt can be useful to keep certain areas or documents on your site from being crawled and indexed. Examples are for instance the staging site or PDFs. Plan carefully what needs to be indexed by search engines and be mindful that content that's been made inaccessible through robots.txt may still be found by search engine crawlers if it's linked to from other areas of the website.
The best practices for robots.txt files are categorized as follows:
The robots.txt file should always be placed in the root of a website (in the top-level directory of the host) and carry the filename robots.txt, for example: https://www.example.com/robots.txt. Note that the URL for the robots.txt file is, like any other URL, case-sensitive.
If the robots.txt file cannot be found in the default location, search engines will assume there are no directives and crawl away on your website.
It’s important to note that search engines handle robots.txt files differently. By default, the first matching directive always wins.
However, with Google and Bing specificity wins. For example: an Allow directive wins over a Disallow directive if its character length is longer.
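A sketch in which the Allow directive comes first. The /about/ directory comes from the explanation that follows; the sub-directory name is hypothetical:

```text
User-agent: *
Allow: /about/company/
Disallow: /about/
```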
In the example above all search engines, including Google and Bing, are not allowed to access the /about/ directory, except for the sub-directory named in the Allow directive.
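A sketch of the same two rules in the opposite order, so that the Disallow directive is the first match. The sub-directory name is again hypothetical:

```text
User-agent: *
Disallow: /about/
Allow: /about/company/
```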
In the example above all search engines except for Google and Bing aren't allowed access to the /about/ directory, including the sub-directory named in the Allow directive. Google and Bing are allowed access to that sub-directory because the Allow directive is longer than the Disallow directive.
You can only define one group of directives per search engine. Having multiple groups of directives for one search engine confuses them.
The Disallow directive triggers on partial matches as well. Be as specific as possible when defining the Disallow directive to prevent unintentionally disallowing access to files.
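A sketch of a rule that matches more than intended. All paths are hypothetical; the comments list a few URLs the rule would also catch because they start with the same characters:

```text
User-agent: *
Disallow: /directory
# Also matches, among others:
#   /directory/
#   /directory-name-1
#   /directory-name.html
#   /directory.php
```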
The example above doesn’t allow search engines access to:
For a robot only one group of directives is valid. In case directives meant for all robots are followed by directives for a specific robot, only these specific directives will be taken into consideration. For the specific robot to also follow the directives for all robots, you need to repeat these directives for the specific robot.
Let’s look at an example which will make this clear:
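A sketch of a general group followed by a group for googlebot. The /not-launched-yet/ path comes from the explanation that follows; the other two paths are hypothetical:

```text
User-agent: *
Disallow: /secret/
Disallow: /test/
Disallow: /not-launched-yet/

User-agent: googlebot
Disallow: /not-launched-yet/
```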
In the example above all search engines except for Google are not allowed to access any of the paths disallowed in the general group, including /not-launched-yet/. Google is only disallowed from accessing /not-launched-yet/, and is allowed access to the other paths.
If you don't want googlebot to access the other disallowed paths in addition to /not-launched-yet/, you need to repeat these directives for googlebot specifically.
Please note that your robots.txt file is publicly available. Disallowing website sections in there can be used as an attack vector by people with malicious intent.
Robots.txt can be dangerous. You're not only telling search engines where you don't want them to look, you're telling people where you hide your dirty secrets.
Robots.txt file directives only apply to the host where the file is hosted. http://example.com/robots.txt is valid for http://example.com, but not for any other host, such as a subdomain of example.com.
It's a best practice to only have one robots.txt file available on your (sub)domain; that's why over at ContentKing we audit your website for this. If you have multiple robots.txt files available, be sure to 301-redirect them to the canonical robots.txt file.
In case your robots.txt file is conflicting with settings defined in Google Search Console, Google often chooses to use the settings defined in Google Search Console over the directives defined in the robots.txt file.
It's important to monitor your robots.txt file for changes. At ContentKing, we see lots of issues where incorrect directives and sudden changes to the robots.txt file cause major SEO issues. This holds true especially when launching new features or a new website that has been prepared on a test environment, as these often contain the following robots.txt file:
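A file that blocks everything, which is what test environments commonly ship with, would presumably look like this sketch:

```text
User-agent: *
Disallow: /
```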
We built robots.txt change tracking and alerting for this reason.
We see it all the time: robots.txt files changing without the knowledge of the digital marketing team. Don't be that person. Start monitoring your robots.txt file now and receive alerts when it changes!
Although some say it's a good idea to use a noindex directive in your robots.txt file, it's not an official standard and Google openly recommends against using it. Google hasn't made it clear exactly why, but we believe we should take their recommendation (in this case) seriously. It makes sense, because:
The noindex directive isn't foolproof, as it's not an official standard. Assume it's not going to be followed 100% by Google.
Other search engines won't use the noindex directive to noindex pages.
The best way to signal to search engines that pages should not be indexed is using the meta robots tag or X-Robots-Tag. If you're unable to use these, and the robots.txt noindex directive is your last resort, then you can try it, but assume it's not going to fully work; then you won't be disappointed.
In this chapter we’ll cover a wide range of robots.txt file examples.
There are multiple ways to tell search engines they can access all files:
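Two sketches that should both leave the whole site open to crawling. The first relies on a Disallow directive without a path being ignored, the second on an explicit Allow for the root path:

```text
User-agent: *
Disallow:
```

```text
User-agent: *
Allow: /
```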
Alternatively, have an empty robots.txt file, or no robots.txt file at all.
Please note: one extra character can make all the difference.
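A sketch of that one-character difference. Adding a single slash after Disallow turns a file that allows everything into one that blocks the entire site for all robots:

```text
User-agent: *
Disallow: /
```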
Please note that when disallowing Googlebot, this goes for all Googlebots. That includes Google robots which are searching, for instance, for news (googlebot-news) and for images.
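Disallowing Googlebot as described above might look like this sketch. Blocking the whole site with a single slash is an assumption about the intended scope:

```text
User-agent: googlebot
Disallow: /
```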
The robots.txt file below is specifically optimized for WordPress, assuming:
Please note that this robots.txt file will work in most cases, but you should always adjust it and test it to make sure it applies to your exact situation.
I'd still always look to block internal search results in robots.txt on any site because these types of search URLs are infinite and endless spaces. There's a lot of potential for Googlebot getting into a crawler trap.
Even though the robots.txt is well respected by search engines, it’s still a directive and not a mandate.
Pages that are inaccessible for search engines due to the robots.txt, but do have links to them can still appear in search results if they are linked from a page that is crawled. An example of what this looks like:
Protip: it's possible to remove these URLs from Google using Google Search Console's URL removal tool. Please note that these URLs will only be temporarily removed. In order for them to stay out of Google's result pages you need to remove the URLs every 90 days.
Use robots.txt to block out undesirable and likely harmful affiliate backlinks. Do not use robots.txt in an attempt to prevent content from being indexed by search engines, as this will inevitably fail. Instead apply robots directive noindex when necessary.
Google has indicated that a robots.txt file is generally cached for up to 24 hours. It’s important to take this into consideration when you make changes in your robots.txt file.
It’s unclear how other search engines deal with caching of robots.txt, but in general it's best to avoid caching your robots.txt file to avoid search engines taking longer than necessary to be able to pick up on changes.
For robots.txt files Google currently supports a file size limit of 500 kibibytes (KiB). Any content after this maximum file size may be ignored.
It’s unclear whether other search engines have a maximum filesize for robots.txt files.
No, take this example:
Also: if a page is disallowed using robots.txt and the page itself contains a <meta name="robots" content="noindex,nofollow">, then search engine robots will still keep the page in the index, because they'll never find out about the <meta name="robots" content="noindex,nofollow"> since they are not allowed access.
Yes, you should be careful. But don’t be afraid to use it. It’s a great tool to help search engines better crawl your website.
From a technical point of view, no. The robots.txt file is an optional directive. We can't say anything about it from a legal point of view.
Yes. When search engines don't encounter a robots.txt file in the root (in the top-level directory of the host), they'll assume there are no directives for them and they will try to crawl your entire website.
No, this is not advisable. Google specifically recommends against using the noindex directive in the robots.txt file.
We know that all major search engines below respect the robots.txt file:
Including the following directives in your robots.txt prevents all search engines from crawling search result pages on your WordPress website, assuming no changes were made to the functioning of the search result pages.
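A sketch of such directives, assuming WordPress's default search URLs (the ?s= query parameter and the /search/ path). Check which URL patterns your own site actually generates before relying on it:

```text
User-agent: *
Disallow: /?s=
Disallow: /search/
```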