Crawl budget is a term invented by the SEO industry to describe the related concepts and systems that search engines use when deciding how many pages, and which pages, to crawl. In essence, it's the amount of attention a search engine gives your website.
Search engines don't have unlimited resources, and they divide their attention across millions of websites, so they need a way to prioritize their crawling effort. Assigning a crawl budget to each website helps them do this.
That crawl budget is based on two factors: crawl limit and crawl demand.
Crawl budget is a common term within SEO; it's sometimes also referred to as crawl space or crawl time.
Crawl limit, or host load if you will, is an important part of crawl budget. Search engine crawlers are designed to prevent overloading a web server with requests, so they're careful about this. How do search engines determine the crawl limit of a website? There are a variety of factors influencing the crawl limit. To name a few:
Also keep in mind that separate mobile and desktop sites running on the same host share a crawl limit.
Crawl demand, or crawl scheduling, is about determining the worth of re-crawling URLs. Again, many factors influence crawl demand, among them:
Forcing Google's crawlers to come back to your site when there is nothing more important to find (i.e. meaningful change) is not a good strategy, and they're pretty smart at working out whether the frequency of these pages changing actually adds value. The best advice I could give is to concentrate on making the pages more important: adding more useful information and making the pages content-rich (they will naturally trigger more queries by default, as long as the focus of a topic is maintained). By naturally triggering more queries as part of 'recall' (impressions), you make your pages more important, and lo and behold: you'll likely get crawled more frequently.
While search engine crawling systems have massive crawl capacity, at the end of the day it's limited. So in a scenario where 80% of Google's data centers go offline at the same time, their crawl capacity decreases massively, and in turn, so does every website's crawl budget.
Massive thanks to Dawn Anderson for providing us with details on crawl limit, crawl demand and crawl capacity!
You want search engines to find and understand as many of your indexable pages as possible, and you want them to do that as quickly as possible. When you add new pages and update existing ones, you want search engines to pick these up as soon as possible. The sooner they've indexed the pages, the sooner you can benefit from them.
If you’re wasting crawl budget, search engines won't be able to crawl your website efficiently. They'll spend time on parts of your site that don't matter, which can result in important parts of your website being left undiscovered. If they don't know about pages, they won't crawl and index them, and you won't be able to bring visitors in through search engines to them.
You can see where this is leading to: wasting crawl budget hurts your SEO performance.
Please note that crawl budget is generally only something to worry about if you've got a large website, let's say 10,000 pages and up.
One of the more under-appreciated aspects of crawl budget is load speed. A faster loading website means Google can crawl more URLs in the same amount of time. Recently I was involved with a site upgrade where load speed was a major focus. The new site loaded twice as fast as the old one. When it was pushed live, the number of URLs Google crawled per day went up from 150,000 to 600,000 - and stayed there. For a site of this size and scope, the improved crawl rate means that new and changed content is crawled a lot faster, and we see a much quicker impact of our SEO efforts in SERPs.
A very wise SEO (okay, it was AJ Kohn) once famously said "You are what Googlebot eats.". Your rankings and search visibility are directly related to not only what Google crawls on your site, but frequently, how often they crawl it. If Google misses content on your site, or doesn't crawl important URLs frequently enough because of limited/unoptimized crawl budget, then you are going to have a very hard time ranking indeed. For larger sites, optimizing crawl budget can greatly raise the profile of previously invisible pages. While smaller sites need to worry less about crawl budget, the same principles of optimization (speed, prioritization, link structure, de-duplication, etc.) can still help you to rank.
I mostly agree with Google: for the most part, many websites don't have to worry about crawl budget. But for websites that are large, and especially ones that are updated frequently, such as publishers, optimizing crawl budget can make a significant difference.
Out of all the search engines, Google is the most transparent about their crawl budget for your website.
If you have your website verified in Google Search Console, you can get some insight into your website’s crawl budget for Google.
Follow these steps: log in to Google Search Console, choose your website, and go to Crawl > Crawl Stats. There you can see the number of pages that Google crawls per day.
During the summer of 2016, our crawl budget looked like this:
We see here that the average crawl budget is 27 pages / day. So in theory, if this average crawl budget stays the same, you would have a monthly crawl budget of 27 pages x 30 days = 810 pages.
Fast forward 2 years, and look at what our crawl budget is right now:
Our average crawl budget is 253 pages / day, so you could say that our crawl budget went up nearly 10X in two years' time.
It's very interesting to check your server logs to see how often Google's crawlers are hitting your website, and to compare those statistics to the ones reported in Google Search Console. It's always better to rely on multiple sources.
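As a rough illustration of what such a log check can look like, here's a minimal Python sketch that counts Googlebot requests per day in an Apache/Nginx combined-format access log. The sample log lines and the regex are assumptions you'd adapt to your own server setup, and for robust results you'd verify Googlebot via reverse DNS rather than trusting the user-agent string alone:

```python
import re
from collections import Counter

# Assumption: combined log format with the timestamp in [10/Jun/2018:06:25:01 +0000] style.
LINE_RE = re.compile(r'\[(?P<day>\d{2}/\w{3}/\d{4}):')

def googlebot_hits_per_day(lines):
    """Count log lines per day whose user agent mentions Googlebot."""
    counts = Counter()
    for line in lines:
        if "Googlebot" not in line:
            continue
        m = LINE_RE.search(line)
        if m:
            counts[m.group("day")] += 1
    return counts

# Illustrative log lines (IPs and paths are made up):
sample = [
    '66.249.66.1 - - [10/Jun/2018:06:25:01 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '66.249.66.1 - - [10/Jun/2018:06:25:07 +0000] "GET /toys/cars HTTP/1.1" 200 812 "-" "Mozilla/5.0 (compatible; Googlebot/2.1)"',
    '203.0.113.5 - - [10/Jun/2018:06:26:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]
print(googlebot_hits_per_day(sample))  # Counter({'10/Jun/2018': 2})
```

In practice you'd feed this a real log file line by line and chart the daily counts next to the Crawl Stats report.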
Optimizing your crawl budget comes down to making sure no crawl budget is wasted. Essentially, fixing the reasons for wasted crawl budget. We monitor thousands of websites; if you were to check each one of them for crawl budget issues, you'd quickly see a pattern: most websites are suffering from the same kind of issues.
Common reasons for wasted crawl budget that we encounter:
Accessible URLs with parameters, for example https://www.example.com/toys/cars?color=black. In this case, the parameter is used to store a visitor's selection in a product filter.
I’ve often said that Google is like your boss. You wouldn’t go into a meeting with your boss unless you knew what you were going to talk about, the highlights of your work, the goals of your meeting. In short, you’ll have an agenda. When you walk into Google’s “office”, you need the same thing. A clear site hierarchy without a lot of cruft, a helpful XML sitemap, and quick response times are all going to help Google get to what’s important. Don’t overlook this often misunderstood element of SEO.
To me, the concept of crawl budget is one of THE key points of technical SEO. When you optimize for crawl budget, everything else falls into place: internal linking, fixing errors, page speed, URL optimization, low-quality content, and more. People should dig into their log files more often to monitor crawl budget for specific URLs, subdomains, directories, etc. Monitoring crawl frequency is very related to crawl budget and super powerful.
In most cases, URLs with parameters shouldn't be accessible to search engines, because they can generate a virtually infinite number of URLs. We've written extensively about this type of issue in our article about crawler traps.
URLs with parameters are commonly used when implementing product filters on eCommerce sites. It's fine to use them; just make sure they aren't accessible to search engines.
How can you make them inaccessible to search engines?
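One common approach is disallowing parameterized URLs in your robots.txt file. The pattern below is a sketch, not a drop-in rule: make sure it doesn't block parameters you actually want crawled. Wildcard matching like this is honored by Google and Bing:

```
User-agent: *
# Block any URL containing a query string, e.g. /toys/cars?color=black
Disallow: /*?
```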
You don't want search engines to spend their time on duplicate content pages, so it's important to prevent, or at the very least minimize, duplicate content on your site.
How do you do this? By...
Check out some more technical reasons for duplicate content and how to fix them.
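For example, one way to consolidate duplicate variants is pointing them at the preferred URL with a canonical tag. The URL below is illustrative, a sketch rather than a recommendation for your specific setup:

```html
<!-- Placed in the <head> of each duplicate variant, pointing at the preferred URL -->
<link rel="canonical" href="https://www.example.com/toys/cars" />
```

Note that, as covered at the end of this article, the canonical tag steers indexing rather than crawling.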
Pages with very little content aren't interesting to search engines. Keep them to a minimum, or avoid them completely if possible. One example of low-quality content is a FAQ section with links to show the questions and answers, where each question and answer is served over a separate URL.
Broken links and long chains of redirects are dead ends for search engines. Similar to browsers, Google seems to follow a maximum of five chained redirects in one crawl (they may resume crawling it later). It's unclear how well other search engines deal with subsequent redirects, but we strongly advise that you avoid chained redirects entirely and keep the usage of redirects to a minimum.
It's clear that by fixing broken links and redirecting links, you can quickly recover wasted crawl budget. Besides recovering crawl budget, you’re also significantly improving a visitor's user experience. Redirects, and chains of redirects in particular, cause longer page load time and thereby hurt the user experience.
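To make the redirect-chain fix concrete, here's a minimal Python sketch (the URLs are made up) of flattening a redirect map so each old URL points straight at its final destination instead of passing through a chain:

```python
def collapse_redirects(redirects, max_hops=5):
    """Given a {source: target} redirect map, point every source
    directly at its final destination, following at most max_hops."""
    collapsed = {}
    for source in redirects:
        target, hops = redirects[source], 0
        # Keep following while the target itself redirects somewhere else.
        while target in redirects and hops < max_hops:
            target, hops = redirects[target], hops + 1
        collapsed[source] = target
    return collapsed

# Illustrative chain: /old-page -> /newer-page -> /newest-page
chain = {
    "/old-page": "/newer-page",
    "/newer-page": "/newest-page",
}
print(collapse_redirects(chain))
# {'/old-page': '/newest-page', '/newer-page': '/newest-page'}
```

The same idea applies on the server side: update your redirect rules (and your internal links) so every hop goes straight to the final URL in a single step.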
To make finding broken and redirecting links easy, we've dedicated special Issues to this within ContentKing.
Check these Issues to find out if you're wasting crawl budget because of faulty links. Update each link so that it links to an indexable page, or remove the link if it's no longer needed.
All URLs included in XML sitemaps should be for indexable pages. Especially with large websites, search engines heavily rely on XML sitemaps to find all your pages. If your XML sitemaps are cluttered with pages that, for instance, don't exist anymore or are redirecting, you're wasting crawl budget. Regularly check your XML sitemap for non-indexable URLs that don't belong in there. Check for the opposite as well: look for pages that are incorrectly excluded from the XML sitemap. The XML sitemap is a great way to help search engines spend crawl budget wisely.
Google Search Console
Google Search Console reports on XML sitemap issues under Crawl > Sitemaps.
Bing Webmaster Tools
Bing Webmaster Tools does the same under Configure My Site > Sitemaps.
In ContentKing, we report on this as well, under XML Sitemap > Page is incorrectly included in XML sitemap:
One best practice for crawl-budget optimization is to split your XML sitemaps up into smaller sitemaps. You can for instance create XML sitemaps for each of your website's sections. If you’ve done this, you can quickly determine if there are any issues going on in certain sections of your website.
Say your XML sitemap for section A contains 500 links, and 480 are indexed: then you're doing pretty well. But if your XML sitemap for section B contains 500 links and only 120 are indexed, that's something to look into. You may have included a lot of non-indexable URLs in the XML sitemap for section B.
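Per-section sitemaps are typically tied together with a sitemap index file. A minimal sketch (example.com and the file names are assumptions) looks like this:

```xml
<?xml version="1.0" encoding="UTF-8"?>
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- One child sitemap per website section -->
  <sitemap><loc>https://www.example.com/sitemap-section-a.xml</loc></sitemap>
  <sitemap><loc>https://www.example.com/sitemap-section-b.xml</loc></sitemap>
</sitemapindex>
```

You then submit the index file, and Search Console reports indexation counts per child sitemap, which is exactly what makes the per-section comparison above possible.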
When pages have high load times or they time out, search engines can visit fewer pages within their allotted crawl budget for your website. Besides that downside, high page load times and timeouts significantly hurt your visitor's user experience, resulting in a lower conversion rate.
Google reports on page load time in both Google Analytics (under Behavior > Site Speed) and Google Search Console (under Crawl > Crawl Stats).
Google Search Console and Bing Webmaster Tools both report on page timeouts. In Google Search Console, this can be found under Crawl > Crawl Errors, and in Bing Webmaster Tools, it's under Reports & Data > Crawl Information.
Check regularly to see if your pages are loading fast enough, and take action immediately if they aren't. Fast-loading pages are vital to your online success.
If your website contains a high number of non-indexable pages that are accessible to search engines, you're basically keeping search engines busy sifting through irrelevant pages.
We consider the following types to be non-indexable pages:
In order to find out if you have a high number of non-indexable pages, look up the total number of pages that crawlers have found within your website and how they break down. You can easily do this using ContentKing:
In this example, there are 63,137 URLs found, of which only 20,528 are pages.
And out of these pages, only 4,663 are indexable for search engines: only 7.4% of the URLs found by ContentKing can be indexed by search engines. That's not a good ratio, and this website definitely needs to work on it by cleaning up all unnecessary references to non-indexable pages, including:
How pages within your website link to one another plays a big role in crawl budget optimization. We call this the internal link structure of your website. Backlinks aside, pages that have few internal links get much less attention from search engines than pages that are linked to by a lot of pages.
Avoid a very hierarchical link structure, where pages in the middle have few links pointing at them. In many cases, these pages will not be crawled frequently. It's even worse for pages at the bottom of the hierarchy: because of their limited number of links, they may very well be neglected by search engines.
Make sure that your most important pages have plenty of internal links. Pages that have recently been crawled typically rank better in search engines. Keep this in mind, and adjust your internal link structure for this.
For example, if you have a blog article dating from 2011 that drives a lot of organic traffic, make sure to keep linking to it from other content. Because you've produced many other blog articles over the years, that article from 2011 is automatically being pushed down in your website's internal link structure.
You usually don't have to worry about the crawl-rate of your important pages. It's usually pages that are new, that you didn't link to, and that people aren't going to that may not be crawled often.
The best way to think about it is that the number of pages that we crawl is roughly proportional to your PageRank. So if you have a lot of incoming links on your root page, we’ll definitely crawl that. Then your root page may link to other pages, and those will get PageRank and we’ll crawl those as well. As you get deeper and deeper in your site, however, PageRank tends to decline.
Even though Google has abandoned updating PageRank values of pages publicly, we think (a form of) PageRank is still used in their algorithms. Since PageRank is a misunderstood and confusing term, let's call it page authority. The take-away here is that Matt Cutts basically says: there's a pretty strong relation between page authority and crawl budget.
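Matt Cutts' point can be illustrated with a toy PageRank calculation. The three-page site below and the damping factor are assumptions, and real systems are far more involved, but the sketch shows how authority thins out as you go deeper into a strictly hierarchical structure:

```python
def pagerank(links, damping=0.85, iterations=50):
    """Classic PageRank power iteration over a {page: [outlinks]} graph.
    Assumes every page has at least one outlink (no dangling nodes)."""
    pages = list(links)
    rank = {page: 1 / len(pages) for page in pages}
    for _ in range(iterations):
        new_rank = {page: (1 - damping) / len(pages) for page in pages}
        for page, outlinks in links.items():
            for target in outlinks:
                new_rank[target] += damping * rank[page] / len(outlinks)
        rank = new_rank
    return rank

# Hypothetical three-level site: the homepage links down one level at a
# time, and every page links back home.
links = {
    "home": ["category", "home"],
    "category": ["deep-page", "home"],
    "deep-page": ["home"],
}
ranks = pagerank(links)
# home ends up with the highest rank, deep-page with the lowest.
```

Since crawl frequency is roughly proportional to this kind of authority score, the deep page in this sketch would also be crawled least often, which is why flattening your link structure and linking to important deep pages helps.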
So, in order to increase your website's crawl budget, you need to increase the authority of your website. A big part of this is done by earning more links from external websites. More information about this can be found in our link building guide.
When I hear the industry talking about crawl budget, we usually talk about the on-page and technical changes we can make in order to increase the crawl budget over time. However, coming from a link building background, the largest spikes in crawled pages we see in Google Search Console directly relate to when we win big links for our clients.
Google has indicated there's a strong relation between page authority and crawl budget. The more authority a page has, the more crawl budget it has. Simply put, to increase your crawl budget, build your page's authority.
Yes, and it’s important to understand the differences between indexing issues and crawl issues.
The canonical URL and meta robots tags send a clear signal to search engines about which page they should show in their index, but they don't prevent search engines from crawling the other pages.
You can use the robots.txt file and the nofollow link relation for dealing with crawl issues.