Welcome to the first edition of SEO in Focus, a monthly blog series in which we interview SEO experts on all things SEO.
In this first edition, we’re discussing crawler traps with Dawn Anderson!
If you know Dawn Anderson, you know she loves to write and talk about technical SEO.
For those who don’t know her: Dawn is the founder of Move It Marketing, a Manchester-based digital marketing agency.
Before we get started, let’s first define what crawler traps are.
In SEO, “crawler traps” are structural issues within a website that cause crawlers to find a virtually infinite number of irrelevant URLs. That’s a bad thing, because they waste crawl budget and can cause duplicate-content issues.
Crawler traps can seriously damage a site, but it depends on the type of trap the crawler is in. Whilst infinite spaces such as calendars, which can have no end, and dynamically generated parameters such as those on eCommerce sites can be very problematic types of crawler traps, the worst kind I’ve ever seen are pages that pull in logical, but incorrect, parameters.
When I talk about these types of pages, I mean pages with content that looks fine at first sight and that changes based on the parameters it pulls in.
For example, say you have an eCommerce platform with shoes and subcategories of heels, flats, kitten heels, boots, slippers, wellington boots and sandals. An infinite loop might pull in heels and flats together because one of the subcategory variables, which pulls in content dynamically and changes the URL, is programmed incorrectly in the template.
Depending on the content these dynamic variables generate, the resulting page can either make total sense or be complete nonsense. Either way, the pages are topically related and semantically strong (shoes, heels, kitten heels, boots, slippers).
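To get a feel for how quickly this kind of flaw multiplies URLs, here is a rough sketch. It assumes a hypothetical template bug that lets distinct subcategories stack in the path in any order; the `/shoes/` URL layout and the `stacked_paths` helper are made up for illustration and are not taken from a real site.

```python
from itertools import permutations

# Subcategories from the example above; the URL layout is hypothetical.
SUBCATEGORIES = [
    "heels", "flats", "kitten-heels", "boots",
    "slippers", "wellington-boots", "sandals",
]


def stacked_paths(max_depth):
    """Yield every path reachable if the template wrongly lets up to
    max_depth distinct subcategories stack in a single URL."""
    for depth in range(1, max_depth + 1):
        for combo in permutations(SUBCATEGORIES, depth):
            yield "/shoes/" + "/".join(combo) + "/"


intended = len(SUBCATEGORIES)                  # one page per subcategory
reachable = sum(1 for _ in stacked_paths(3))   # what a crawler could find

print(intended, reachable)  # 7 259
```

Seven intended pages become 259 crawlable ones at a depth of only three, and that is before sizes, colours or sort orders are multiplied in.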
Examples of imaginary URLs:
Yes, these types of crawler traps can really tank a site over time. It’s that serious.
The reason is: Google tends to recognize a standard crawler trap reasonably quickly based on the more known crawler trap patterns and will reduce the revisit times to the rogue paths created. The exception here is that they don’t do this very quickly with logical, but incorrect, parameters.
Sometimes they even begin to visit those logical, but incorrect, parameters more than the content you wanted them to visit, and they might index them at scale.
You can distinguish two phases in the detection of crawler traps:
The widely known types of parameters tend to get crawled for a while, and then the crawl drops sharply once the parameter and the URLs it generates start to form patterns that are presumably recognised by Googlebot (or other parts of the crawl scheduling system).
Then the parameter appears in Google Search Console under URL Parameters so we can tell Google whether these are representative parameters (for tracking purposes) or active parameters (that change the content or the order of the content). Representative parameters usually contain patterns in their strings like ?utm_ and so forth. The active parameters might include identifiers such as for example subcategories, sizes, colours and so forth. All these parameters change the content, or change the order of the content. Think: sorting based on prices, best reviewed, descending, ascending etcetera.
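As a rough illustration of that split, the sketch below sorts a URL's query parameters into tracking and content-changing groups. The `utm_` prefix comes from the text above; the list of active parameter names and the `classify_params` helper are assumptions for a hypothetical shop, so real names will differ per site.

```python
from urllib.parse import parse_qsl, urlsplit

# Hypothetical rules for one site; real prefixes and names will differ.
TRACKING_PREFIXES = ("utm_",)
ACTIVE_PARAMS = {"subcategory", "size", "colour", "sort"}


def classify_params(url):
    """Group a URL's query parameters by whether they only track the
    visit or actually change (or reorder) the page content."""
    groups = {"tracking": [], "active": [], "unknown": []}
    query = urlsplit(url).query
    for name, _value in parse_qsl(query, keep_blank_values=True):
        if name.startswith(TRACKING_PREFIXES):
            groups["tracking"].append(name)
        elif name in ACTIVE_PARAMS:
            groups["active"].append(name)
        else:
            # Anything unrecognised deserves a manual look.
            groups["unknown"].append(name)
    return groups


result = classify_params(
    "https://example.com/shoes?colour=red&sort=price_asc&utm_source=news&ref=x"
)
print(result)
# {'tracking': ['utm_source'], 'active': ['colour', 'sort'], 'unknown': ['ref']}
```

The "unknown" bucket is the interesting one here: parameters nobody on the team recognises are a good place to start looking for problems.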
URL parameter handling is basically Google saying: “Hey, we found this path a number of times. Are you sure this is the route you wanted us to take?” We can give Google hints on different directions, particularly on sites with many permutations on the same item.
But with logical, but incorrect, parameters that’s not the case. They often don’t show up in Google Search Console because they’re not recognized as a crawler trap.
Googlebot, being the non-judgemental crawler it is at first, will just keep looping through the URLs and the indexer will just keep indexing these ‘logical (but incorrect) parameter’ driven pages, because often the content in the pages is also created on the fly and depends on the same variables pulled as those in the URLs.
Headings, subheadings, calls to action, and so forth all get variable output to build the pages in parts.
How is Googlebot meant to know that it’s highly unlikely people would have boots with kitten heels?
So, it’s likely you’d end up indexing pages for boots with kitten heels across potentially every other variant (size/colour), and so forth. If you thought that the normal parameters for eCommerce sites were bad for index bloat, then multiply this by 10,000.
Eventually (this can take a long time), Google realises that the pages which the illogical parameters are creating are of really low value and the download rate (crawl rate) of these URLs starts to drop away.
Who is looking in search for boots with kitten heels after all?
But, some of these spewed variable-driven pages also make sense. You might expect to see kitten heels and heels together for example, but they’re still not what you intended to index. Programmatic template flaws have caused it.
By the way, these likely won’t even appear in the URL parameter handling in Google Search Console at all. You’ll see these in strange analytics visits, server log files and Google Search Console.
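For the server log side, a minimal sketch might look like the following. It assumes access logs in the common "combined" format and reuses the shoe subcategories from the earlier example; the sample lines, the IP addresses and the `suspicious_googlebot_hits` helper are all made up for illustration. It also matches on the user-agent string only, which can be spoofed, so treat the result as a starting point rather than proof.

```python
import re
from collections import Counter

SUBCATEGORIES = {
    "heels", "flats", "kitten-heels", "boots",
    "slippers", "wellington-boots", "sandals",
}

# Pulls the request path and user agent out of a combined-format log line.
LINE_RE = re.compile(
    r'"(?:GET|HEAD) (?P<path>\S+) [^"]*" \d{3} \S+ "[^"]*" "(?P<agent>[^"]*)"'
)


def suspicious_googlebot_hits(log_lines):
    """Count Googlebot requests for paths naming more than one subcategory."""
    hits = Counter()
    for line in log_lines:
        match = LINE_RE.search(line)
        if not match or "Googlebot" not in match["agent"]:
            continue
        path = match["path"].split("?", 1)[0]  # ignore the query string
        stacked = [seg for seg in path.split("/") if seg in SUBCATEGORIES]
        if len(stacked) > 1:
            hits[path] += 1
    return hits


GOOGLEBOT = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
SAMPLE_LOG = [
    f'66.249.66.1 - - [10/Oct/2018:13:55:36 +0000] "GET /shoes/boots/kitten-heels/ HTTP/1.1" 200 5123 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/Oct/2018:13:56:02 +0000] "GET /shoes/boots/kitten-heels/?size=5 HTTP/1.1" 200 5123 "-" "{GOOGLEBOT}"',
    f'66.249.66.1 - - [10/Oct/2018:13:57:10 +0000] "GET /shoes/boots/ HTTP/1.1" 200 4801 "-" "{GOOGLEBOT}"',
    '203.0.113.9 - - [10/Oct/2018:13:58:44 +0000] "GET /shoes/heels/flats/ HTTP/1.1" 200 4990 "-" "Mozilla/5.0"',
]

print(suspicious_googlebot_hits(SAMPLE_LOG))
# Counter({'/shoes/boots/kitten-heels/': 2})
```

If paths like these account for a growing share of Googlebot's requests, that is a hint the crawler is spending its time on pages you never meant to publish.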
You might not notice it either for quite a while, and you might also receive quite a bit of additional traffic from it at first, because you have more indexed pages meeting longer-tailed queries.
But eventually the crawl rate drops further and further as Google picks up on the pattern and concludes that these pages have no value at all.
You literally unravelled your site (or parts of it), and it’s a huge job to sort out. You diluted strength all over the place, and you need to put your building back together. Good luck with it, because it can take a while.
Submitting XML sitemaps at scale on incorrectly pulled (but logical) parameters exacerbates the issue further.
They usually come about as a programmatic issue, where a template dynamically pulls in the wrong variables. It gets worse when there are a ton of internal links to these pages in the navigation or XML sitemaps. If this is the case, crawlers just keep looping round while adding every possible variant to the paths (and the page output). Essentially, they find an infinite number of URLs with what looks like logical content.
Always check URL parameters and always, always check which pages are being pulled by your programmatic variables in templates.
Always keep a keen eye out for anomalies and double check anything programmatically implemented, particularly when this impacts dynamic elements.
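One way to make that double-checking routine is to audit generated XML sitemaps before they are submitted. The sketch below is only an illustration: the rule in `is_intended` (a page may name at most one subcategory) is a hypothetical stand-in for whatever your own site considers a legitimate URL, and the sample sitemap is made up.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
SUBCATEGORIES = {
    "heels", "flats", "kitten-heels", "boots",
    "slippers", "wellington-boots", "sandals",
}


def is_intended(url):
    """Hypothetical rule: a page may name at most one subcategory."""
    segments = urlsplit(url).path.strip("/").split("/")
    return sum(seg in SUBCATEGORIES for seg in segments) <= 1


def unintended_sitemap_urls(sitemap_xml):
    """Return the sitemap URLs that break the is_intended rule."""
    root = ET.fromstring(sitemap_xml)
    urls = [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text
    ]
    return [url for url in urls if not is_intended(url)]


SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/shoes/boots/</loc></url>
  <url><loc>https://example.com/shoes/boots/kitten-heels/</loc></url>
</urlset>"""

print(unintended_sitemap_urls(SAMPLE_SITEMAP))
# ['https://example.com/shoes/boots/kitten-heels/']
```

Running a check like this as part of a release process means a template flaw shows up as a failed check rather than as thousands of submitted URLs.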