How to Improve Crawling & Indexing for Large Sites
Hi John! Can you tell us a little about yourself?
My name is John Morabito, and I am a director at a full-service marketing and media agency. We offer everything from research to execution for brands across beauty, CPG, retail, healthcare, and B2B.
I’m always looking to share and learn. If your brand is struggling with SEO, please reach out; we offer a free search opportunity analysis.
What’s the biggest challenge when it comes to getting large sites crawled and indexed properly, and how do you tackle that?
In short, it’s large numbers of low-quality pages that a site generates for one reason or another. I have seen a lot of cases where faceted search is the culprit, however, things like user profiles or tag pages on publishing sites can also cause issues.
During our crawling and indexing audit, we look at the number of indexable pages that are crawlable on a website, and then compare that to the number of pages in Google Search Console, in a site: query, and in our XML sitemaps. This gives us four different data points to better understand any misalignment between what we are putting forth as “the website” and what Google is actually grabbing onto.
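As an illustrative sketch of that comparison (the URL lists, file contents, and function names here are my own assumptions, not a tool the interviewee describes), the four inventories can be treated as sets and diffed:

```python
# Compare URL inventories to spot crawl/index misalignment.
# All URLs below are hypothetical examples.

def load_urls(lines):
    """Normalize a list of URL strings into a set (strip whitespace, drop blanks)."""
    return {line.strip() for line in lines if line.strip()}

def compare_inventories(crawl, gsc, site_query, sitemap):
    """Return URLs Google has that we don't advertise, and vice versa."""
    advertised = crawl | sitemap        # what we put forth as "the website"
    google_sees = gsc | site_query      # what Google is actually grabbing onto
    return {
        "indexed_but_not_advertised": google_sees - advertised,
        "advertised_but_not_indexed": advertised - google_sees,
    }

crawl = load_urls(["https://example.com/", "https://example.com/a", ""])
gsc = load_urls(["https://example.com/", "https://example.com/a?color=red"])
sitemap = load_urls(["https://example.com/a"])
diff = compare_inventories(crawl, gsc, set(), sitemap)
print(sorted(diff["indexed_but_not_advertised"]))
```

URLs that Google indexes but the site never advertises (often dynamic facet URLs) are the usual first place to dig.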
Very often we see things like the canonical being completely ignored, and issues around the way canonicals are implemented for paginated pages.
In the case of ignored canonicals, this can cause pages to be crawled less frequently, as Google has a nearly infinite number of possible URLs to encounter.
Many websites that use faceted search rely solely on the canonical to point back to the start page for that category search. The issue there is very often that these pages are not at all similar, so the canonical tag is no longer appropriate. Instead, these pages should have a noindex tag. One problem with this is that if anyone links to these pages, equity would eventually be dropped, with or without the “follow” link attribute. For this reason, we recommend periodically scanning your backlink profile for dynamic URLs and recreating those as static pages, should enough people link to them.
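For a facet URL that is not a near-duplicate of its category page, the noindex approach described above looks roughly like this (the URL and markup are a hypothetical sketch):

```html
<!-- Hypothetical facet URL: /dresses?color=red&size=m -->
<head>
  <!-- Not a near-duplicate of /dresses, so a canonical back to /dresses
       is inappropriate here. noindex keeps the page out of the index,
       while "follow" lets crawlers pass through its internal links. -->
  <meta name="robots" content="noindex, follow">
</head>
```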
In one recent example, we found a real estate site that had two major issues relating to pagination. First, the links to pages past page 1 were not accessible to crawlers. Beyond that, they had a canonical on each page past page 1 which pointed to the first page in the series. Canonicals on paginated sets should always be self-referential. When we made that fix, and exposed the pagination in a way crawlers could access, we saw a massive increase in the number of indexed pages, which was our ultimate goal.
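Sketched for page 3 of a hypothetical series, the self-referential canonical plus crawlable pagination links look like this:

```html
<!-- Hypothetical page 3 of a paginated category -->
<head>
  <!-- Self-referential: points at page 3 itself, not back to page 1 -->
  <link rel="canonical" href="https://example.com/listings?page=3">
</head>

<!-- Pagination rendered as plain anchors crawlers can follow -->
<a href="/listings?page=2">2</a>
<a href="/listings?page=4">4</a>
```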
Communicating hierarchical relationships between pages is another one of the more challenging things we deal with on large websites.
Often, we’re looking at things like the click depth required to reach a given property detail page (using the real estate example again). In this niche, the traffic really comes from the landing pages for a given area, but the people selling the home, on both the agency and seller side, always want to see that listing page ranking at the top of search results.
The problem is that these singular listings are usually one of hundreds of thousands of listings on the site. Usually, real estate sites’ area landing pages are sorted either by price or newness on the market. So, it may be challenging to determine where an individual listing sits within the site architecture. This could be true of products in a large eCommerce site as well, or blog posts on a large publishing site.
Our solution is often to create additional crawlable “inverted” landing pages, where we sort by the opposite of the regular sorting; i.e., we may sort by price from low to high instead of high to low.
Additionally, we’ll add more internal links to pagination past the “next page.” Usually, we will recommend adding links to four to five pages on either side of the page that the bot is currently on. This greatly flattens the site architecture and provides a greater number of crawl paths to each listing.
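A sketch of that windowed pagination (the function name and parameters are my own; the article only specifies linking to four to five pages on either side):

```python
def pagination_window(current, last, radius=4):
    """Pages to link from `current`, covering `radius` pages on either side.

    Linking beyond just "next page" flattens the architecture and adds
    crawl paths deep into the paginated series.
    """
    start = max(1, current - radius)
    end = min(last, current + radius)
    return list(range(start, end + 1))

print(pagination_window(10, 50))  # pages 6 through 14
print(pagination_window(2, 50))   # window clipped at page 1
```

With a radius of 4, a bot on page 10 of 50 can jump directly to any of pages 6–14, rather than stepping one page at a time.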
How did Google’s announcement about not using the pagination attributes in 2019 impact you and the recommendations you make?
In some small ways, yes, this changed how we prioritize recommendations around this attribute, but usually only if it’s missing or wrong.
Because other search engines still use these tags, we do still recommend using them for most of our clients. Oftentimes, sites we work with already have them implemented, so we continue using them to reinforce how the site is currently being crawled and indexed.
With that said, there is more going on with pagination than just link attributes. Looking at things like reducing click-depth through flattening pagination tunnels is a more productive use of time than obsessing over pagination link attributes.
Are you a fan of using the robots.txt to prevent search engines from accessing certain website sections? If so, why?
There are lots of great applications for a disallow in robots.txt, and I often use it as a solution. Sometimes, however, it’s best to look at how the bots are even getting into those dark corners of your site in the first place, and then work to resolve that issue at its source.
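A hedged example of such a disallow (the paths and parameters are hypothetical; real facet parameters vary by platform):

```
# robots.txt — keep crawlers out of low-value filter combinations
User-agent: *
Disallow: /search
Disallow: /*?sort=
Disallow: /*?color=
```

Note that disallowed URLs can still be indexed if they are linked externally, which is one reason to also trace how bots reach those sections in the first place.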
In what situations do you use the nofollow link attribute on internal links?
This is getting increasingly complex. Google has basically said they may or may not still crawl links with this attribute. However, it is still likely to be a handy tool. The nofollow attribute, in my opinion, can be useful for controlling facets or spider traps. The best answer to both scenarios is to not create low-value pages in the first place, but that’s not always a reality, is it?
In Shopify, there is no control over the robots.txt, so nofollow is a tool that can be used on shop filter pages to prevent them from being crawled in most cases.
In addition to the nofollow on the link to the page itself, a noindex, follow directive on the page is also ideal. Link equity will be lost after a period of time, but it’s worth it to keep the follow. As I said earlier, you should be scanning your backlinks for links to blocked pages and either unblock them or recreate them as static pages. This is fairly rare, though, so generally I do not worry about link equity from pages that are created with filters.
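Putting both pieces together for a hypothetical shop filter link (the URL is illustrative, not a real Shopify path):

```html
<!-- On the category page: discourage crawling of the filter URL -->
<a href="/collections/shoes?color=red" rel="nofollow">Red</a>

<!-- On the filter page itself: stay out of the index,
     but let crawlers that do arrive follow links onward -->
<meta name="robots" content="noindex, follow">
```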
How do you deal with discontinued product or listing pages, at scale?
It depends, but I have two answers:
- In general, if the URL has no traffic, backlinks, or ranking keywords, a 404 is fine and a 410 is even better. The 410 says “this page is really gone.”
- If the page did have value, what we do depends on the vertical:
In the world of electronics, for example, product lines come back each year with new SKUs and model numbers. These are great candidates for a 1:1 redirect, where the old product is redirected to the new one. I would recommend displaying a message letting users know they hit a legacy URL and are now on a replacement SKU, but most sites can get away with not doing that.
For apparel retailers that have seasonal collections and lots of products that continually go out of stock, we try to find a similar product to redirect to, but oftentimes we’re left with the choice of redirecting to a category page or leaving the product page up with a notification. Our approach depends on the client and our ability to action a solution. In some cases, we may recommend taking customer emails from the PDP (“Product Detail Page”).
For real estate sites, we generally recommend leaving all of the listings live at all times, even when a particular home is off the market. For a period of time, we link to them in a “sold” section. Then we unlink these pages so they are not eating up crawl budget, but allow them to remain as indexable pages. This makes bringing them back to life a bit easier when they eventually go back on the market. It also results in some low levels of traffic from exact-address searches.
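The decision rules above could be sketched as follows (the thresholds, field names, and function name are assumptions for illustration, not the agency’s actual tooling):

```python
def discontinued_page_action(traffic, backlinks, keywords, replacement_url=None):
    """Pick an action for a discontinued product/listing URL.

    Mirrors the rules above: no value -> 410 Gone ("page really gone");
    a direct successor -> 301 redirect to the new SKU; otherwise it's a
    judgment call between a category redirect and keeping the page live
    with a notification.
    """
    if traffic == 0 and backlinks == 0 and keywords == 0:
        return ("410", None)
    if replacement_url:
        return ("301", replacement_url)
    return ("keep_or_category", None)

print(discontinued_page_action(0, 0, 0))
print(discontinued_page_action(120, 3, 8, "/products/widget-2024"))
```

Running this at scale against an analytics export is what makes the approach workable on sites with hundreds of thousands of URLs.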
How do you see the future of crawling & indexing evolve over time?
In many ways, the future is already here for some sites. Currently, Google offers an Indexing API, which can only be used to submit pages containing either JobPosting or BroadcastEvent structured data.
I do expect that they will allow some expanded use of this API, but I’m not certain they will roll it out across all verticals. If they do, the API is pretty easy to use, and I can see more advanced SEOs adopting it in favor of XML sitemaps. Using the API does require a bit of coding.
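A minimal sketch of what a call to the Indexing API involves. Authentication is omitted; the endpoint and payload shape follow Google’s published quickstart, but treat them as assumptions to verify against the current documentation:

```python
import json

# Endpoint per Google's Indexing API quickstart (verify before use).
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"

def build_notification(url, gone=False):
    """Build the JSON body: URL_UPDATED for new/changed pages,
    URL_DELETED for removed ones."""
    return {
        "url": url,
        "type": "URL_DELETED" if gone else "URL_UPDATED",
    }

body = build_notification("https://example.com/jobs/123")
print(json.dumps(body))
# An authenticated POST of this body to ENDPOINT submits the page;
# a service-account OAuth token is required (not shown here).
```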
Last question: what’s your number one tip about improving the crawling and indexing processes for large sites?
Think deeply about site architecture and click depth for each section of a website.
Generally speaking, we can make a great deal of headway by introducing a greater number of category, landing, and search pages that link off to other pages on the site, such as products, posts, and property detail pages.
Oftentimes, issues arise around indexing/ranking content that gets pushed many hundreds of clicks deep in these sites. Think about ways to flatten website architecture, without making it overly flat.
Like most things in life, this is about balance!