How to Improve Crawling & Indexing for Large Sites
Hi John! Can you tell us a little about yourself?
My name is John Morabito, and I am a director at a full-service marketing and media agency. We offer everything from research to execution for brands across beauty, CPG, retail, healthcare, and B2B.
I’m always looking to share and learn. If your brand is struggling with SEO, please reach out; we offer a free search opportunity analysis.
What’s the biggest challenge when it comes to getting large sites crawled and indexed properly, and how do you tackle that?
In short, it’s large numbers of low-quality pages that a site generates for one reason or another. I have seen a lot of cases where faceted search is the culprit, however, things like user profiles or tag pages on publishing sites can also cause issues.
During our crawling and indexing audit, we look at the number of indexable pages that are crawlable on a website, and then compare that to the number of pages in Google Search Console, in a site: query, and in our XML sitemaps. This gives us four different data points to better understand any misalignment between what we are putting forth as “the website” and what Google is actually grabbing onto.
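As an illustrative sketch of that comparison (the URL lists, file contents, and function names here are my own assumptions, not a tool the interviewee describes), the four inventories can be treated as sets and diffed:

```python
# Compare URL inventories to spot crawl/index misalignment.
# All URLs below are hypothetical examples.

def load_urls(lines):
    """Normalize a list of URL strings into a set (strip whitespace, drop blanks)."""
    return {line.strip() for line in lines if line.strip()}

def compare_inventories(crawl, gsc, site_query, sitemap):
    """Return URLs Google has that we don't advertise, and vice versa."""
    advertised = crawl | sitemap        # what we put forth as "the website"
    google_sees = gsc | site_query      # what Google is actually grabbing onto
    return {
        "indexed_but_not_advertised": google_sees - advertised,
        "advertised_but_not_indexed": advertised - google_sees,
    }

crawl = load_urls(["https://example.com/", "https://example.com/a", ""])
gsc = load_urls(["https://example.com/", "https://example.com/a?color=red"])
sitemap = load_urls(["https://example.com/a"])
diff = compare_inventories(crawl, gsc, set(), sitemap)
print(sorted(diff["indexed_but_not_advertised"]))
```

URLs that Google indexes but the site never advertises (often dynamic facet URLs) are the usual first place to dig.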
Very often we see things like the canonical being completely ignored, and issues around the way canonicals are implemented for paginated pages.
In the case of ignored canonicals, this can cause pages to be crawled less frequently, as Google has a nearly infinite number of possible URLs to encounter.
Many websites that use faceted search rely solely on the canonical to point back to the start page for that category search. The issue there is very often that these pages are not at all similar, so the canonical tag is no longer appropriate. Instead, these pages should have a noindex tag. One problem with this is that if anyone links to these pages, equity would eventually be dropped, with or without the “follow” link attribute. For this reason, we recommend periodically scanning your backlink profile for dynamic URLs and recreating those as static pages, should enough people link to them.
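For a facet URL that is not a near-duplicate of its category page, the noindex approach described above looks roughly like this (the URL and markup are a hypothetical sketch):

```html
<!-- Hypothetical facet URL: /dresses?color=red&size=m -->
<head>
  <!-- Not a near-duplicate of /dresses, so a canonical back to /dresses
       is inappropriate here. noindex keeps the page out of the index,
       while "follow" lets crawlers pass through its internal links. -->
  <meta name="robots" content="noindex, follow">
</head>
```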
In one recent example, we found a real estate site that had two major issues relating to pagination. First, the links to pages past page 1 were not accessible to crawlers. Beyond that, they had a canonical on each page past page 1 which pointed to the first page in the series. Canonicals on paginated sets should always be self-referential. When we made that fix, and exposed the pagination in a way crawlers could access, we saw a massive increase in the number of indexed pages, which was our ultimate goal.
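Sketched for page 3 of a hypothetical series, the self-referential canonical plus crawlable pagination links look like this:

```html
<!-- Hypothetical page 3 of a paginated category -->
<head>
  <!-- Self-referential: points at page 3 itself, not back to page 1 -->
  <link rel="canonical" href="https://example.com/listings?page=3">
</head>

<!-- Pagination rendered as plain anchors crawlers can follow -->
<a href="/listings?page=2">2</a>
<a href="/listings?page=4">4</a>
```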
Communicating hierarchical relationships between pages is another one of the more challenging things we deal with on large websites.
Often, we’re looking at things like the click depth required to reach a given property detail page (using the real estate example again). In this niche, the traffic really comes from the landing pages for a given area, but the people selling the home, on both the agency and seller side, always want to see that listing page ranking at the top of search results.
The problem is that these singular listings are usually one of hundreds of thousands of listings on the site. Usually, real estate sites’ area landing pages are sorted either by price or newness on the market. So, it may be challenging to determine where an individual listing sits within the site architecture. This could be true of products in a large eCommerce site as well, or blog posts on a large publishing site.
Our solution is often to create additional crawlable “inverted” landing pages, where we sort by the opposite of the regular sorting; i.e., we may sort by price from low to high instead of high to low.
Additionally, we’ll add more internal links to pagination past the “next page.” Usually, we will recommend adding links to four to five pages on either side of the page that the bot is currently on. This greatly flattens the site architecture and provides a greater number of crawl paths to each listing.
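A sketch of that windowed pagination (the function name and parameters are my own; the article only specifies linking to four to five pages on either side):

```python
def pagination_window(current, last, radius=4):
    """Pages to link from `current`, covering `radius` pages on either side.

    Linking beyond just "next page" flattens the architecture and adds
    crawl paths deep into the paginated series.
    """
    start = max(1, current - radius)
    end = min(last, current + radius)
    return list(range(start, end + 1))

print(pagination_window(10, 50))  # pages 6 through 14
print(pagination_window(2, 50))   # window clipped at page 1
```

With a radius of 4, a bot on page 10 of 50 can jump directly to any of pages 6–14, rather than stepping one page at a time.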
How did Google’s announcement about not using the pagination attributes in 2019 impact you and the recommendations you make?
In some small ways, yes, this changed how we prioritize recommendations around this attribute, but usually only if it’s missing or wrong.
Because other search engines still use these tags, we do still recommend using them for most of our clients. Oftentimes, sites we work with already have them implemented, so we continue using them to reinforce how the site is currently being crawled and indexed.
With that said, there is more going on with pagination than just link attributes. Looking at things like reducing click-depth through flattening pagination tunnels is a more productive use of time than obsessing over pagination link attributes.
Are you a fan of using the robots.txt to prevent search engines from accessing certain website sections? If so, why?
There are lots of great applications for a disallow in robots.txt, and I often use it as a solution. Sometimes, however, it’s best to look at how the bots are even getting into those dark corners of your site in the first place, and then work to resolve that issue at its source.
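A hedged example of such a disallow (the paths and parameters are hypothetical; real facet parameters vary by platform):

```
# robots.txt — keep crawlers out of low-value filter combinations
User-agent: *
Disallow: /search
Disallow: /*?sort=
Disallow: /*?color=
```

Note that disallowed URLs can still be indexed if they are linked externally, which is one reason to also trace how bots reach those sections in the first place.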
In what situations do you use the nofollow link attribute on internal links?
This is getting increasingly complex. Google has basically said they may or may not still crawl links with this attribute. However, it is still likely to be a handy tool. The nofollow attribute, in my opinion, can be useful for controlling facets or spider traps. The best answer to both scenarios is to not create low-value pages in the first place, but that’s not always a reality, is it?
In Shopify, there is no control over the robots.txt, so nofollow is a tool that can be used on shop filter pages to prevent them from being crawled in most cases.
In addition to the nofollow on the link to the page itself, a noindex, follow directive on the page is also ideal. Link equity will be lost after a period of time, but it’s worth it to keep the follow. As I said earlier, you should be scanning your backlinks for links to blocked pages and either unblock them or recreate them as static pages. This is fairly rare, though, so generally I do not worry about link equity from pages that are created with filters.
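Putting both pieces together for a hypothetical shop filter link (the URL is illustrative, not a real Shopify path):

```html
<!-- On the category page: discourage crawling of the filter URL -->
<a href="/collections/shoes?color=red" rel="nofollow">Red</a>

<!-- On the filter page itself: stay out of the index,
     but let crawlers that do arrive follow links onward -->
<meta name="robots" content="noindex, follow">
```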
How do you deal with discontinued product or listing pages, at scale?
It depends, but I have two answers:
- In general, if the URL has no traffic, backlinks, or ranking keywords, a 404 is fine and a 410 is even better. The 410 says “this page is really gone.”
- If the page did have value, what we do depends on the vertical:
In the world of electronics, for example, product lines come back each year with new SKUs and model numbers. These are great candidates for a 1:1 redirect, where the old product is redirected to the new one. I would recommend displaying a message letting users know they hit a legacy URL and are now on a replacement SKU, but most sites can get away with not doing that.
For apparel retailers that have seasonal collections and lots of products that continually go out of stock, we try to find a similar product to redirect to, but oftentimes we’re left with the choice of redirecting to a category page or leaving the product page up with a notification. Our approach depends on the client and our ability to action a solution. In some cases, we may recommend taking customer emails from the PDP (“Product Detail Page”).
For real estate sites, we generally recommend leaving all of the listings live at all times, even when a particular home is off the market. For a period of time, we link to them in a “sold” section. Then we unlink these pages so they are not eating up crawl budget, but allow them to remain as indexable pages. This makes bringing them back to life a bit easier when they eventually go back on the market. It also results in some low levels of traffic from exact-address searches.
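The decision rules above could be sketched as follows (the thresholds, field names, and function name are assumptions for illustration, not the agency’s actual tooling):

```python
def discontinued_page_action(traffic, backlinks, keywords, replacement_url=None):
    """Pick an action for a discontinued product/listing URL.

    Mirrors the rules above: no value -> 410 Gone ("page really gone");
    a direct successor -> 301 redirect to the new SKU; otherwise it's a
    judgment call between a category redirect and keeping the page live
    with a notification.
    """
    if traffic == 0 and backlinks == 0 and keywords == 0:
        return ("410", None)
    if replacement_url:
        return ("301", replacement_url)
    return ("keep_or_category", None)

print(discontinued_page_action(0, 0, 0))
print(discontinued_page_action(120, 3, 8, "/products/widget-2024"))
```

Running this at scale against an analytics export is what makes the approach workable on sites with hundreds of thousands of URLs.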
How do you see the future of crawling & indexing evolve over time?
In many ways, the future is already here for some sites. Currently, Google offers an Indexing API, which can only be used to submit pages containing either JobPosting or BroadcastEvent structured data.
I do expect that they will allow some expanded use of this API, but I’m not certain they will roll it out across all verticals. If they do, the API is pretty easy to use, and I can see more advanced SEOs adopting it in favor of XML sitemaps. Using the API does require a bit of coding.
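A minimal sketch of what a call to the Indexing API involves. Authentication is omitted; the endpoint and payload shape follow Google’s published quickstart, but treat them as assumptions to verify against the current documentation:

```python
import json

# Endpoint per Google's Indexing API quickstart (verify before use).
ENDPOINT = "https://indexing.googleapis.com/v3/urlNotifications:publish"

def build_notification(url, gone=False):
    """Build the JSON body: URL_UPDATED for new/changed pages,
    URL_DELETED for removed ones."""
    return {
        "url": url,
        "type": "URL_DELETED" if gone else "URL_UPDATED",
    }

body = build_notification("https://example.com/jobs/123")
print(json.dumps(body))
# An authenticated POST of this body to ENDPOINT submits the page;
# a service-account OAuth token is required (not shown here).
```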
Last question: what’s your number one tip about improving the crawling and indexing processes for large sites?
Think deeply about site architecture and click depth for each section of a website.
Generally speaking, we can make a great deal of headway by introducing a greater number of category, landing, and search pages that link off to other pages on the site, such as products, posts, and property detail pages.
Oftentimes, issues arise around indexing/ranking content that gets pushed many hundreds of clicks deep in these sites. Think about ways to flatten website architecture, without making it overly flat.
Like most things in life, this is about balance!