What’s the definition of crawl budget?

Crawl budget is the amount of time, or the number of pages, a search engine spends crawling your website on a daily basis. Because of the enormous size of the internet, search engines have to divide their attention across all websites and prioritize their crawling effort. Assigning each website a crawl budget helps them do this.

Crawl budget is a common term within the SEO field. It is sometimes also referred to as crawl space or crawl time.

Why should you care about crawl budget?

You want search engines to find as many of your indexable pages as possible, and you want them to do that as quickly as possible.

If you’re wasting crawl budget, it’s unlikely search engines will be able to do that. Part of your website is then left undiscovered by search engines, and therefore inaccessible to potential visitors searching in search engines. Naturally, this hurts your success, so crawl budget optimization should be high on your to-do list.

What is the crawl budget for my website?

Google is the most transparent about the crawl budget it assigns to your website. If your website is verified in Google Search Console, you can get some insight into your website’s crawl budget (only within Google, of course).

Log in to Google Search Console, choose a website and go to Crawl > Crawl Stats. There you’ll see the number of pages Google crawls per day.

Crawl budget in Google Search Console

In the example above, the average daily crawl budget is 27 pages. So in theory, if this average stays the same, you’d have a monthly crawl budget of 27 pages x 30 days = 810 pages.

Alternatively, you can analyze your web server’s log files to get insight into how search engines crawl your website.
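If you want a quick, hands-on way to look at this yourself, a small script can tally how often search engine crawlers request pages from your site per day. Below is a minimal sketch, assuming an Apache/Nginx-style access log stored at access.log (the path and date format are assumptions; adjust them to your server’s setup). Keep in mind that user agents can be spoofed, so a rigorous analysis would also verify crawler IPs via reverse DNS.

```python
import re
from collections import Counter

# Minimal sketch: count daily Googlebot/Bingbot requests in a common-format access log.
# Assumes the log is at "access.log" and timestamps look like [10/Oct/2023:13:55:36 +0000].
BOT_PATTERN = re.compile(r"Googlebot|bingbot", re.IGNORECASE)
DATE_PATTERN = re.compile(r"\[(\d{2}/\w{3}/\d{4})")

hits_per_day = Counter()
with open("access.log", encoding="utf-8", errors="ignore") as log_file:
    for line in log_file:
        if BOT_PATTERN.search(line):
            match = DATE_PATTERN.search(line)
            if match:
                hits_per_day[match.group(1)] += 1

for day, hits in sorted(hits_per_day.items()):
    print(f"{day}: {hits} crawler requests")
```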

How do you optimize your crawl budget?

You first need to identify where crawl budget is being wasted. Below we cover some of the most common reasons for wasted crawl budget.

Identify and fix wasted crawl budget

Common reasons for wasted crawl budget are:

  • Links which are broken or redirecting
  • Pages with high load time and time-outs
  • Incorrect URLs in XML sitemaps
  • High amounts of non-indexable pages

Broken links are a dead end for search engines. The same goes for redirect chains that are too long: search engines may break off their crawl process, as there’s a limit to how many chained redirects search engines and browsers will follow.

Similar to browsers, Google appears to follow a maximum of five chained redirects. It’s unclear how well other search engines deal with subsequent redirects, but we strongly advise you to avoid chained redirects entirely and keep your use of redirects to a minimum.

It’s clear that by fixing broken and redirecting links you can quickly regain some of your wasted crawl budget. Besides regaining crawl budget, you’re also significantly improving your visitors’ user experience. Redirects, and chains of redirects in particular, increase page load time and thereby hurt user experience.
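If you want to spot-check redirect chains yourself, a small script can follow each hop and report how long the chain is. Below is a minimal sketch using Python’s requests library; the URLs in the list are placeholders for links found on your own website.

```python
import requests

# Minimal sketch: report the redirect chain length for a list of URLs.
# Replace the example URLs with links from your own website.
urls = [
    "https://www.example.com/old-page",
    "https://www.example.com/category/product",
]

for url in urls:
    try:
        response = requests.get(url, allow_redirects=True, timeout=10)
    except requests.RequestException as error:
        print(f"{url}: request failed ({error})")
        continue

    hops = len(response.history)  # every intermediate redirect response
    if hops == 0:
        print(f"{url}: no redirect (status {response.status_code})")
    else:
        print(f"{url}: {hops} redirect hop(s) -> {response.url}")
```

Any URL reporting more than one hop is a redirect chain worth cleaning up; ideally, links point straight to the final, indexable URL.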

To make finding broken and redirecting links easy, ContentKing has a dedicated issue for this. Go to Issues > Links to find out whether you’re wasting crawl budget because of faulty links, and update all links so they point to an indexable URL.

Broken and redirecting links in ContentKing

Pages with high load times and time-outs

When pages have high load times, or when they time out, search engines can visit fewer pages within the allotted crawl budget for your website. On top of that, high page load times and time-outs significantly hurt your visitors’ user experience, resulting in a lower conversion rate.

Page load times above 2 seconds are an issue. Ideally though, your page loads in under 1 second. Regularly check your page load times with tools such as Pingdom, WebPagetest or GTmetrix.
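For a rough, scriptable check, the sketch below times the server response for a handful of important URLs. Note that this measures the time to receive the HTML response, not the full render time that tools like WebPagetest or GTmetrix report, and the URLs are placeholders for your own pages.

```python
import requests

# Minimal sketch: print server response times for a few important pages.
urls = [
    "https://www.example.com/",
    "https://www.example.com/category/",
]

for url in urls:
    response = requests.get(url, timeout=10)
    seconds = response.elapsed.total_seconds()  # time from sending the request to receiving the response
    verdict = "OK" if seconds < 1 else "SLOW"
    print(f"{verdict}  {url}  {seconds:.2f}s  (status {response.status_code})")
```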

Google reports on page load time in both Google Analytics (under Behaviour > Site Speed) and Google Search Console under Crawl > Crawl Stats.

Google Search Console and Bing Webmaster Tools both report on page time-outs. In Google Search Console this can be found under Crawl > Crawl Errors and in Bing Webmaster Tools under Reports & Data > Crawl Information.

Check regularly if your pages are loading fast enough and take action immediately if they don’t. Fast loading pages are vital to your online success.

Incorrect URLs in XML sitemaps

XML sitemap errors in Google Search Console

All URLs included in XML sitemaps should be indexable. Search engines rely heavily on XML sitemaps to find all of your pages, especially on large websites. If your XML sitemaps are cluttered with pages that, for instance, no longer exist or redirect, you’re wasting crawl budget. Regularly check your XML sitemaps for non-indexable URLs that don’t belong there. Google Search Console reports on XML sitemap issues under Crawl > Sitemaps, and Bing Webmaster Tools does the same under Configure My Site > Sitemaps.

One crawl budget optimization best practice is to split up your XML sitemaps into smaller sitemaps, for instance one per website section. If you’ve done that, you can quickly determine whether there are issues in certain sections of your website. Say the XML sitemap for section A contains 500 links and 480 of them are indexed: you’re doing pretty well. But if the XML sitemap for section B contains 500 links and only 120 are indexed, that’s something to look into; you may have included a lot of non-indexable URLs in that sitemap.
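To spot-check a sitemap yourself, a small script can fetch it, extract the URLs and flag anything that doesn’t return a clean 200 response. Below is a minimal sketch, assuming a regular (non-index) XML sitemap at the hypothetical URL https://www.example.com/sitemap.xml; it won’t catch pages that are noindexed via a meta robots tag, which would require parsing the HTML as well.

```python
import requests
import xml.etree.ElementTree as ET

SITEMAP_URL = "https://www.example.com/sitemap.xml"  # placeholder URL
NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# Fetch the sitemap and collect all <loc> entries.
sitemap = requests.get(SITEMAP_URL, timeout=10)
root = ET.fromstring(sitemap.content)
urls = [loc.text.strip() for loc in root.findall(".//sm:loc", NAMESPACE)]

# Flag URLs that redirect, 404 or otherwise don't return a plain 200.
for url in urls:
    response = requests.get(url, allow_redirects=False, timeout=10)
    if response.status_code != 200:
        print(f"{response.status_code}  {url}")
```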

High amounts of non-indexable pages

If your website contains a large number of non-indexable pages that are accessible to search engines, you’re basically keeping search engines busy sifting through irrelevant pages.

To find out whether you have a large number of non-indexable pages, look up the total number of pages a crawler finds on your website. You can do this using Screaming Frog or ContentKing.

In ContentKing, the total number of pages crawled is shown at the top of the Pages overview.

Amount of pages crawled in ContentKing

Compare this number to the number of pages indexed, which you can look up with a site: query:

Amount of pages indexed in Bing

In this example, the crawler found over 200,000 pages, while only 30,000 of them were actually indexed by Bing. There are two possible explanations for this:

  1. Search engines are still indexing your website, and it’s only a matter of time before they’ve indexed it entirely. The 30,000 indexed pages will go up soon.
  2. There’s a crawl budget issue you should look into. Find out which sections are incorrectly accessible to search engines and instruct them not to crawl those sections using the robots.txt file.

How pages within your website link to one another plays a big role in crawl budget optimization; we call this your website’s internal link structure. Backlinks aside, pages with few internal links get much less attention from search engines than pages that are linked to from a lot of other pages.

Prevent a very hierarchical link structure in which pages in the middle of the hierarchy have few links pointing to them. Those pages won’t be crawled very often. It’s even worse for pages at the bottom of the hierarchy: because of their limited number of links, they may very well be neglected by search engines.

Make sure your most important pages have plenty of internal links. Pages that have recently been crawled typically rank better in search engines. Keep this in mind, and adjust your internal link structure accordingly.

For example, if you have a blog article from 2011 that drives a lot of organic traffic, make sure to keep linking to it from other content. Because you’ve produced many more blog articles over the years, that article from 2011 automatically gets pushed down in your website’s internal link structure.

What are common reasons for crawl budget waste?

There are a few common ways to waste crawl budget that we come across very often. We cover them below, including ways to regain your crawl budget.

  1. Product filters
  2. Indexable internal search result pages
  3. Tag pages

The first two are also called ‘crawl traps’: issues within websites that create a virtually endless number of URLs accessible to search engines. These in particular pose a big threat to your crawl budget.

Product filters

Within a product filter, each filter criterion has at least two values, and combining criteria lets visitors drill down into the product offering. This is very useful for visitors, but if the product filter is crawlable, it generates a virtually endless number of URLs and forms a crawl trap for search engines.

Solution:

  1. Make sure search engines are instructed through the robots.txt file not to access URLs generated by the product filter (see the sketch after this list). If this is not possible for you, consider using the URL parameter handling settings in Google Search Console and Bing Webmaster Tools to instruct Google and Bing which pages not to crawl.
  2. Add rel="nofollow" to the links pointing to filtered URLs.
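As referenced in the list above, here’s a minimal robots.txt sketch for blocking filter URLs. The color and size parameters are made-up examples; use the parameters your product filter actually appends to the URL.

```
User-agent: *
# Keep crawlers out of filtered product listings (illustrative parameter names)
Disallow: /*?color=
Disallow: /*&color=
Disallow: /*?size=
Disallow: /*&size=
```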

Indexable internal search result pages

Generally, you don’t want search result pages from your website’s internal search functionality to be crawled and subsequently indexed by search engines. This easily causes duplicate content and should in most cases be avoided. Assuming you don’t want your internal search result pages crawled and indexed, be sure to inform search engines that these pages shouldn’t be accessed.

Solution: Instructing search engines not to access these internal search result pages can easily be done using your robots.txt file; an example for a WordPress website is included below. If this is not possible for you, consider using the URL parameter handling settings in Google Search Console and Bing Webmaster Tools to instruct Google and Bing which pages not to crawl.
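A minimal sketch of what such a rule could look like for a default WordPress setup, where internal search results live on URLs containing the s query parameter:

```
User-agent: *
# Block WordPress internal search result pages (the "s" query parameter)
Disallow: /?s=
Disallow: /*?s=
```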

Tag pages

A less technical reason for having a large number of non-indexable pages could be that you used to like working with tags for your blog articles back in 2010. You’re an avid writer, so over time you’ve accumulated thousands of tags. When you read about Google Panda, you decided to put a noindex on these tag pages, but that only fixed the possible indexing issues.

While this prevented possible indexing issues, it created crawl issues: search engines keep crawling these pages, only to find time after time that they can’t index them, so they ignore them. Meanwhile, they’re still spending your valuable crawl budget on those tag pages.

Solution: Instructing search engines not to access these tag pages anymore can easily be done with your robots.txt file.
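For example, assuming WordPress-style tag archives living under /tag/ (adjust the path to your own URL structure), the rule could look like this:

```
User-agent: *
# Keep crawlers out of tag archive pages (assumes /tag/ URLs)
Disallow: /tag/
```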

How do you increase your website’s crawl budget?

During an interview between Eric Enge and Google’s former head of the webspam team, Matt Cutts, the relation between authority and crawl budget was brought up.

Matt Cutts said:

The best way to think about it is that the number of pages that we crawl is roughly proportional to your PageRank. So if you have a lot of incoming links on your root page, we’ll definitely crawl that. Then your root page may link to other pages, and those will get PageRank and we’ll crawl those as well. As you get deeper and deeper in your site, however, PageRank tends to decline.

Even though Google has stopped publicly updating the PageRank values of pages, we think (a form of) PageRank is still a part of its algorithm. Since PageRank is a misunderstood and confusing term, let’s call it page authority. The take-away here is that Matt Cutts is basically saying there’s a pretty strong relation between page authority and crawl budget.

So, in order to increase your website’s crawl budget you need to increase the authority of your website. A big part of this is done by earning more links from external websites.

Frequently asked questions about crawl budget

  1. How do I increase my crawl budget?
  2. Should I be using canonical URL and meta robots at all?

1. How do I increase my crawl budget?

Google has indicated there’s a strong relation between page authority and crawl budget: the more authority a page has, the more crawl budget it gets.

2. Should I be using canonical URL and meta robots at all?

Yes, and it’s important to understand the differences between indexing issues and crawl issues.

The canonical URL and meta robots tags send a clear signal to search engines about which page they should show in their index, but they don’t prevent search engines from crawling those pages.

You can use the robots.txt file and the nofollow link relation for dealing with crawl issues.
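To make the difference concrete, here’s a minimal illustration with made-up URLs. The tags below go into a page’s HTML head and only steer indexing; the page can still be crawled:

```
<!-- Indexing signals: search engines may still crawl this page,
     but are told which URL to index, or not to index it at all -->
<link rel="canonical" href="https://www.example.com/preferred-page/">
<meta name="robots" content="noindex, follow">
```

A robots.txt rule, on the other hand, prevents the crawling itself:

```
User-agent: *
# Crawling directive: matching URLs won't be requested at all
Disallow: /internal-search/
```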
