
Google’s robots.txt interpretation is as flexible as a ballerina

  • July 8, 2019
  • Vincent van Scherpenseel

Last week Google made headlines with its proposal to standardize the quarter-century-old robots.txt "non-standard", its updated robots.txt documentation, and the open-sourcing of its robots.txt parser.

As the monitoring, parsing, and validation of robots.txt files is a core component of ContentKing, we followed these developments with great interest and quickly wrote about the RFC's highlights and the updated Google documentation on how Google crawls robots.txt files.

Today we spent some time browsing the robots.txt repository and playing with the open-sourced robots.txt validator. We found some interesting things to share, which we've collected in this article.

But before diving in, let us state one important caveat: the open-sourced code is not the code that Google actually runs in production.

How do we know that?

Easy: Google announced on July 2nd that support for the Noindex directive will only be dropped as of September 1, 2019, yet the open-sourced code already doesn't support this directive.

Alright, let's dive right in!

Disallow, dissalow, disallaw, diasllow. It's all the same to Google

Overall we were surprised by the flexibility Google shows when dealing with external input (robots.txt files are usually handcrafted by webmasters) and by how hard they try to do the right thing when it comes to staying out of restricted areas of websites.

This mostly comes down to accepting (gross) misspellings of the directives.

A great example of Google being very lenient towards spelling mistakes is how they interpret Disallow directives.

Take a look at lines 691 through 699:

bool ParsedRobotsKey::KeyIsDisallow(absl::string_view key) {
	return (
		absl::StartsWithIgnoreCase(key, "disallow") ||
		(kAllowFrequentTypos && ((absl::StartsWithIgnoreCase(key, "dissallow")) ||
		(absl::StartsWithIgnoreCase(key, "dissalow")) ||
		(absl::StartsWithIgnoreCase(key, "disalow")) ||
		(absl::StartsWithIgnoreCase(key, "diasllow")) ||
		(absl::StartsWithIgnoreCase(key, "disallaw")))));
}

Yup, the following directive will keep Googlebot out of /state-secrets/ without any problem:

User-agent: googlebot
diasllow: /state-secrets/
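You can even check this against the open-sourced parser yourself. Below is a rough sketch (not production code), assuming you've built Google's robotstxt repository so that its robots.h header is available; as far as we can tell, RobotsMatcher::OneAgentAllowedByRobots is the simplest entry point for this kind of check, and the URL is just a placeholder:

#include <iostream>
#include <string>

#include "robots.h"  // googlebot::RobotsMatcher from the open-sourced repository

int main() {
	// The same misspelled directive as above.
	const std::string robots_txt =
		"User-agent: googlebot\n"
		"diasllow: /state-secrets/\n";

	googlebot::RobotsMatcher matcher;
	const bool allowed = matcher.OneAgentAllowedByRobots(
		robots_txt, "googlebot", "https://www.example.com/state-secrets/");

	// Expected result: "disallowed" -- the typo'd directive is still honored.
	std::cout << (allowed ? "allowed" : "disallowed") << std::endl;
	return 0;
}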

Google's flexible attitude towards Disallow rules is quite ironic, though. While they'll bend over backwards to accept all kinds of misspellings and syntax slips, Google doesn't accept that a wildcard user-agent declaration applies to its AdsBot crawler. That's right: a User-agent: * disallow rule will NOT apply to Google's AdsBot; you need a separate group specifically for that crawler. I'm sure the reasoning makes sense to some Google engineer, but personally I find it yet another example of Google's pervasive "do as we say, not as we do" attitude.
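So to actually keep AdsBot out as well, you have to address it by name (AdsBot-Google being the user-agent token), with something like this:

User-agent: *
Disallow: /state-secrets/

# AdsBot-Google ignores the wildcard group above, so it needs its own group:
User-agent: AdsBot-Google
Disallow: /state-secrets/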

Allow is a different beast though

Although Google is very easy-going when it comes to the Disallow directive, ensuring that common (and perhaps even not-so-common) misspellings don't cause Googlebot to eagerly index the state secrets hosted on your website, the same can't be said about Allow directives.

Check out lines 687 through 689:

bool ParsedRobotsKey::KeyIsAllow(absl::string_view key) {
	return absl::StartsWithIgnoreCase(key, "allow");
}

Only "allow" is accepted. Alas, no "Allaw: /state-secrets/public/" for us!

This makes sense though: Google is a well-behaved crawler and wants to be sure that it doesn't go where you don't want it to go. Even when you mess up.

This means being flexible when it comes to disallows and strict when it comes to allows. It chooses to err on the side of being more restricted rather than less restricted.
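To make that concrete: in the file below, the misspelled Allow line is simply ignored, so /state-secrets/public/ stays blocked along with everything else under /state-secrets/:

User-agent: googlebot
Disallow: /state-secrets/
Allaw: /state-secrets/public/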

User Agents: a hyphen or a space is totally fine

Let's move up a bit, to the group declaration: the user-agent string. Google's flexible here too.

Check out lines 680 through 685:

bool ParsedRobotsKey::KeyIsUserAgent(absl::string_view key) {
	return ( absl::StartsWithIgnoreCase(key, "user-agent") ||
	(kAllowFrequentTypos && (absl::StartsWithIgnoreCase(key, "useragent") ||
	absl::StartsWithIgnoreCase(key, "user agent"))));
}

Yup, you can write "user-agent", "useragent", or even "user agent"; it's all the same to Google.

For XML sitemap references, you're similarly free to write either "sitemap" or "site-map" (who in the world does that?!).

See the code at lines 701 through 704:

bool ParsedRobotsKey::KeyIsSitemap(absl::string_view key) {
	return ((absl::StartsWithIgnoreCase(key, "sitemap")) ||
			(absl::StartsWithIgnoreCase(key, "site-map")));
}

I'm not surprised at how lenient Google is with its parsing. People make the same mistakes over and over, and from Google's perspective the choice is between doing PR for years in the hope that people fix their mistakes, or spending a few minutes adding code that tolerates the common mistakes that are going to happen anyway.

If it makes the results better or the information easier to understand, they'll make allowances for the errors.

Colon separator needed? Not at all!

We're all used to separating keys and values with a colon, right? But that turns out to be totally optional as well, as per lines 328 through 333:

char* sep = strchr(line, ':');
if (nullptr == sep) {
	// Google-specific optimization: some people forget the colon, so we need to
	// accept whitespace in its stead.
	static const char * const kWhite = " \t";
	sep = strpbrk(line, kWhite); [..]
}

So the following works perfectly fine:

Useragent googlebot
Disallow /state-secrets/

Who knows, maybe one day this trick will come in handy when minimizing the size of your robots.txt file 😉

There's a caveat though: since there's no longer an explicit separator, some of Google's leniency gets lost. For example, the following will not work to keep Googlebot out of /state-secrets/:

User agent googlebot
Disallow /state-secrets/

This is because, with only whitespace to go on, Google can no longer tell where the key ends and the value begins: is the key "User" or "User agent"? Rather than guess, it ignores the line.
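Here's a quick sketch of our own (standard C string functions only, not Google's actual parser) showing what that whitespace fallback does to such lines:

#include <cstdio>
#include <cstring>

// Splits a robots.txt line the way the excerpt above does: prefer ':',
// otherwise fall back to the first space or tab.
static void split(char* line) {
	char* sep = strchr(line, ':');
	if (sep == nullptr) {
		sep = strpbrk(line, " \t");
	}
	if (sep == nullptr) return;
	*sep = '\0';
	printf("key='%s'  value='%s'\n", line, sep + 1);
}

int main() {
	char fine[] = "Useragent googlebot";        // splits into "Useragent" / "googlebot"
	char ambiguous[] = "User agent googlebot";  // splits into "User" / "agent googlebot"
	split(fine);
	split(ambiguous);
	// In the second case the parser can't tell whether the key was meant to be
	// "User" or "User agent", so Google plays it safe and ignores the line.
	return 0;
}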

More user-agent craziness

Google has one more "Google-specific optimization" trick up its sleeve, which is to allow extraneous characters in the global ("wildcard user-agent") group.

On lines 575 through 577 we see the following:

if (user_agent.length() >= 1 && user_agent[0] == '*' &&
	(user_agent.length() == 1 || isspace(user_agent[1]))) {
	seen_global_agent_ = true;

This means that the following robots.txt file will still keep robots perfectly clear of our well-protected state secrets:

User-agent: * all-government-stay-out
Disallow: /state-secrets/

…but only when you separate the user-agent correctly, using the colon separator. Google's more than reasonable, but there's a limit to it.

What we've learned

We've learned two important things:

  1. Google's very lenient when it comes to accepting spelling errors.
  2. Google errs on the safe side: it assumes that it's restricted rather than unrestricted.

We've also learned that the following Frankenstein of a robots.txt file will work perfectly fine:

Useragent googlebot
Diasllow /state-secrets/
Site-map: https://www.example.com/sitemap.xml

And yes, it's perfectly OK if that example makes you throw up in your mouth a little!

What this means for ContentKing

Not much. It confirms our assumptions here and there (that at the scale at which Google operates, they can't afford to be too strict) and provides us with some entertainment from browsing through their code, but it doesn't mean we will change our robots.txt parsing or validation.

There are two reasons for that:

  1. Google's spelling leniency is simply that: Google's. This is not part of the RFC (which is itself still just a draft), so Google could theoretically change this behavior at any moment.
  2. Again: it's Google's. Other search engines may or may not have similar rules, and as a vendor operating on a global scale, we can't afford to just follow Google's way of doing things.

We won't be adjusting our robots.txt best practices either. They're best practices for a reason: so that it's unequivocally clear what you mean. What if your future colleagues have to deal with a robots.txt file riddled with typos? Yeah, not great.

A thank you and a request to Gary Illyes

We want to wrap up by extending a big thank you to Gary Illyes for taking these steps, and in true "give them an inch and they'll take a mile" fashion, we would like to put in a request as well: please do the same with your XML sitemap parser.

We know that Google is waaaay more flexible than the XML sitemap standard dictates, and it would be awesome to know exactly how. 🙂

Vincent is ContentKing's Chief Executive Officer. He's passionate about product management and loves to work at the intersection of design, development and business. Which makes ContentKing the perfect challenge for him.