Last week Google made headlines with its proposal to standardize the quarter-century-old robots.txt “non-standard”, its updated robots.txt documentation, and the open-sourcing of its robots.txt parser.
As the monitoring, parsing, and validation of robots.txt files is a core component of ContentKing, we followed these developments with great interest and quickly wrote about the RFC’s highlights and the updated Google documentation on how Google crawls robots.txt files.
Today we spent some time browsing the robots.txt repository and playing with the open-sourced robots.txt validator. We found some interesting things to share, which we’ve collected in this article.
But before diving in, let us state one important caveat: the open-sourced code is not the code that Google actually runs in production.
How do we know that?
Easy: Google announced on July 2nd that support for the Noindex directive will only be dropped as of September 1, 2019, yet the open-sourced code already doesn’t support this directive.
Alright, let’s dive right in!
Overall we were surprised by the flexibility Google shows when dealing with external input (robots.txt files are usually handcrafted by webmasters) and how they genuinely seem to want to do the right thing when it comes to staying out of restricted areas of websites.
This mostly comes down to accepting (gross) misspellings of the directives.
A great example of Google being very lenient towards spelling mistakes is how they interpret the Disallow directive.
Take a look at lines 691 through 699:
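Line numbers drift between revisions of the repository, but the gist of that check, paraphrased in Python, looks something like the following. (The real parser is C++, and the typo list here is our reading of its “allow frequent typos” branch, so treat this as a sketch rather than Google’s actual code.)

```python
# Our Python paraphrase of the parser's typo-tolerant Disallow matching.
# The list of accepted misspellings is our transcription from the repo.
DISALLOW_SPELLINGS = ("disallow", "dissallow", "dissalow",
                      "disalow", "diasllow", "disallaw")

def key_is_disallow(key: str) -> bool:
    """Return True if the key is 'disallow' or a recognized misspelling."""
    return key.lower().startswith(DISALLOW_SPELLINGS)
```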
Yup, the following directive will keep Googlebot out of /state-secrets/ without any problem:
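For instance (our own illustration, using one of the misspellings the parser accepts):

```
Dissalow: /state-secrets/
```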
Google's flexible attitude to disallow rules is quite ironic. While they'll bend over backwards to accept all kinds of misspellings and grammatical errors, Google nonetheless doesn't accept that a wildcard user-agent declaration applies to its AdsBot crawler. That's right, a "user-agent: *" disallow rule will NOT apply to Google's AdsBot; you need a separate disallow rule specifically for that crawler. I'm sure it makes sense to some Google engineer, but personally I find it yet another example of Google's pervasive hypocrisy. "Do as we say, not as we do."
Although Google is very easy-going when it comes to the Disallow directive, ensuring that common (and perhaps even not-so-common) misspellings don’t cause Googlebot to eagerly index the state secrets hosted on your website, the same can’t be said about the Allow directive.
Check out lines 687 through 689:
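Paraphrased in Python (again our sketch of the C++, not Google’s actual code), the contrast with the Disallow handling is stark:

```python
def key_is_allow(key: str) -> bool:
    """Only the correct spelling is accepted -- no typo list here."""
    return key.lower().startswith("allow")
```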
Only “allow” is accepted. Alas, no “Allaw: /state-secrets/public/” for us!
This makes sense though: Google is a well-behaved crawler and wants to be sure it doesn’t go where you don’t want it to go, even when you mess up. That means being flexible when it comes to disallows and strict when it comes to allows: Google chooses to err on the side of being more restricted rather than less restricted.
Let’s move up a bit, to the group declaration: the user-agent line. Google’s flexible here too.
Check out lines 680 through 685:
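Sketched in Python (our paraphrase of the C++; the three accepted spellings are the ones the repository lists):

```python
def key_is_user_agent(key: str) -> bool:
    """Accept 'user-agent' plus the two common misspellings."""
    return key.lower().startswith(("user-agent", "useragent", "user agent"))
```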
Yup, you can write “user-agent”, “useragent”, or even “user agent”; it’s all the same to Google.
For XML sitemap references, you’re similarly free to write either “sitemap” or “site-map” (who in the world does that?!).
See the code at lines 701 through 704:
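In the same Python paraphrase style (our sketch, not Google’s actual C++):

```python
def key_is_sitemap(key: str) -> bool:
    """Accept both 'sitemap' and the curious 'site-map' variant."""
    return key.lower().startswith(("sitemap", "site-map"))
```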
I'm not surprised at how lenient Google is with its parsing. People make the same mistakes over and over, and in Google's mind it's better to spend a few minutes adding handling for common, predictable mistakes to the code base than to spend years doing PR and hoping people fix them. If it makes the results better or the information easier to understand, I think in these cases they will make allowances for the errors.
We’re all used to separating keys and values with a colon, right? But that turns out to be totally optional as well, as per lines 328 through 333:
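The fallback behaviour, sketched in Python (our paraphrase; the real parser is C++ and the function name is ours):

```python
def split_key_value(line: str):
    """Split a robots.txt line into (key, value), colon optional.

    Our sketch of the lenient behaviour: when no colon is present,
    fall back to splitting on the first run of whitespace.
    """
    if ":" in line:
        key, _, value = line.partition(":")
    else:
        parts = line.strip().split(None, 1)
        key = parts[0] if parts else ""
        value = parts[1] if len(parts) > 1 else ""
    return key.strip(), value.strip()
```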
So the following works perfectly fine:
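For instance, a colon-less file like this one (our own illustration):

```
user-agent googlebot
disallow /state-secrets/
```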
Who knows, maybe one day this trick will come in handy when minimizing the size of your robots.txt file 😉
Now there’s a caveat though: since there’s no explicit separator anymore, some of Google’s leniency gets lost. For example, the following will not work to keep Googlebot out of /state-secrets/:
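Something like this, for instance (our own illustration of the failure case):

```
User agent googlebot
Disallow: /state-secrets/
```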
This is because in this case Google is unable to separate the key (“User agent”) from the value (“googlebot”).
Google has one more “Google-specific optimization” trick up its sleeve, which is to allow extraneous characters in the global (“wildcard user-agent”) group.
On lines 575 through 577 we see the following:
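In our Python paraphrase (a sketch of the C++; per the repository’s comment, a “*” followed by whitespace and more characters is still treated as the wildcard):

```python
def is_global_group(user_agent_value: str) -> bool:
    """True for '*' on its own, or '*' followed by whitespace and
    extra characters -- our sketch of the 'Google-specific
    optimization' in the parser."""
    value = user_agent_value.strip()
    if not value.startswith("*"):
        return False
    return len(value) == 1 or value[1].isspace()
```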
Which means that the following robots.txt file will keep robots perfectly clear of our well-protected state secrets:
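For instance (again our own concoction):

```
user-agent: * oh hi there
disallow: /state-secrets/
```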
…but only when you separate the user-agent correctly, using the colon separator. Google’s more than reasonable, but there’s a limit to it.
We’ve learned two important things:

1. Google is remarkably lenient towards misspellings and omissions in robots.txt files.
2. That leniency only ever errs on the side of crawling less: disallows are interpreted loosely, allows strictly.
We’ve also learned that the following Frankenstein of a robots.txt file will work perfectly fine:
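Something like this, for instance (entirely our own concoction; the sitemap URL is a placeholder):

```
user agent: googlebot
dissalow /state-secrets/

user-agent: * and some extra words
disalow: /state-secrets/

site-map: https://example.com/sitemap.xml
```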
And yes, it’s perfectly OK if that example makes you throw up in your mouth a little!
So does any of this change how we handle robots.txt? Not much. It confirms our assumptions here and there (at the scale at which Google operates, they can’t afford to be too strict) and browsing through their code provided some entertainment, but it doesn’t mean we will change our robots.txt parsing or validation.
There are two reasons for that:

1. As noted above, the open-sourced code is not necessarily what Google runs in production, so it can’t serve as a reliable specification.
2. Our job is to monitor and validate robots.txt files, which means flagging mistakes like these rather than silently forgiving them.
We won’t be adjusting our robots.txt best practices either. They’re best practices for a reason: so that it’s unequivocally clear what you mean. What if your future colleagues have to deal with a robots.txt file riddled with typos? Yeah, not great.
We want to wrap up by extending a big thank you to Gary Illyes for taking these steps, and in true “give them an inch, and they’ll take a mile” fashion, we would like to put in a request as well: please do the same with your XML sitemap parser.
We know that Google is waaaay more flexible than the XML sitemap standard dictates, and it would be awesome to know exactly how. 🙂