Search engines crawl billions of pages every day. But they index fewer pages than this, and they show even fewer pages in their results. You want your pages to be among them. So, how do you take control of this whole process and improve your rankings?
To answer that question, we first need to look at how the crawling and indexing process works. Then we’ll discuss all the methods you can put to work to control this process.
Search engines’ crawlers are tasked with finding and crawling as many URLs as possible. They do this to see if there’s any new content out there. These URLs can be both new ones and URLs they already knew about. New URLs are found by crawling pages they already knew. After crawling, they pass on their results to the indexer.
The indexers receive the contents of URLs from the crawlers. Indexers then try to make sense of this content by analyzing it (including the links, if any). The indexer processes canonicalized URLs and determines the authority of each URL.
Take control of the crawling and indexing process by making your preferences clear to search engines. By doing so, you help them understand what sections of your website are most important to you.
In this chapter we’ll cover all the methods and which to use when. We’ve also put together a table to illustrate what they can and cannot do.
First let’s explain some concepts:
Furthermore, it’s important to understand what crawl budget is. Crawl budget is the amount of time search engines’ crawlers spend on your website. You want them to spend it wisely, and you can give them instructions for that.
The robots.txt file is a central location that provides basic ground rules for crawlers. We call these ground rules directives. If you want to keep crawlers from crawling certain URLs, your robots.txt is the best way to do that.
If crawlers aren’t allowed to crawl a URL and request its content, the indexer will never be able to analyse its content and links. This can prevent duplicate content, and it also means that the URL in question will never be able to rank. Also, search engines will not be able to consolidate topical relevance and authority signals when they don’t know what’s on the page. Those signals will therefore be lost.
An example for using robots.txt
A site’s admin section is a good example of where you want to apply the robots.txt file to keep crawlers from accessing it. Let’s say the admin section resides on: https://www.example.com/admin/.
Block crawlers from accessing this section by adding a Disallow directive for the /admin/ path to your robots.txt.
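As a rough sketch, the rule for the admin example can be written down and checked with Python’s standard-library robots.txt parser. The two-line robots.txt and the sample URLs below are assumptions made for illustration; a real file will usually contain more rules.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt for the admin example: block every crawler from /admin/.
ROBOTS_TXT = """\
User-agent: *
Disallow: /admin/
"""


def is_crawlable(url: str, user_agent: str = "*") -> bool:
    """Return True if the rules in ROBOTS_TXT allow user_agent to crawl url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_crawlable("https://www.example.com/admin/settings"))  # False
    print(is_crawlable("https://www.example.com/blog/"))  # True
```

A check like this only tells you what a well-behaved crawler is allowed to request. It says nothing about whether the URL is already indexed.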
Please note that URLs that are disallowed from being crawled by search engines can still appear in search results. This happens when the URLs are linked to from other pages, or were already known to search engines before they were made inaccessible through robots.txt. Search engines will then typically list the URL without a description, because they weren’t allowed to read the page.
Robots.txt cannot resolve existing duplicate content issues. Search engines will not forget about a URL simply because they can’t access it.
Adding a canonical URL or a meta robots noindex attribute to a URL that’s been blocked through robots.txt will not get it deindexed. Search engines will never know about your request for deindexing, because your robots.txt file is keeping them from finding out.
The robots.txt file is an essential tool in optimizing crawl budget on your website. Using the robots.txt file, you can tell search engines not to crawl the parts of your website that are irrelevant for them.
What the robots.txt file will do:
What the robots.txt file will not do:
The meta robots tag instructs search engines on how to index pages, while keeping the page accessible for visitors. Often it’s used to instruct search engines not to index certain pages. When it comes to indexing, it’s a stronger signal than the canonical URL.
Implementing the meta robots tag for pages is generally done by including it in the source. For other documents such as PDFs or images, it’s done through the X-Robots-Tag HTTP header.
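To make the two delivery mechanisms concrete, here is a minimal sketch that collects robots directives from both places: a meta tag in the HTML source and an X-Robots-Tag response header. The function and class names are made up for this example, and the sketch deliberately ignores crawler-specific variants (such as a meta tag aimed at one named crawler).

```python
from __future__ import annotations

from html.parser import HTMLParser


class MetaRobotsParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self) -> None:
        super().__init__()
        self.directives: set[str] = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        if (attributes.get("name") or "").lower() == "robots":
            content = attributes.get("content") or ""
            self.directives.update(_split_directives(content))


def _split_directives(value: str) -> set[str]:
    """Split a comma-separated directive list into lowercase tokens."""
    return {part.strip().lower() for part in value.split(",") if part.strip()}


def robots_directives(html: str, headers: dict[str, str] | None = None) -> set[str]:
    """Return the directives found in the HTML and in the response headers."""
    parser = MetaRobotsParser()
    parser.feed(html)
    directives = set(parser.directives)
    for name, value in (headers or {}).items():
        if name.lower() == "x-robots-tag":
            directives |= _split_directives(value)
    return directives


if __name__ == "__main__":
    page = '<html><head><meta name="robots" content="noindex, follow"></head></html>'
    print(robots_directives(page))
    # A PDF has no HTML source, so the directive travels in the header instead.
    print(robots_directives("", {"X-Robots-Tag": "noindex"}))
```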
An example for the use of the meta robots tag
Say you have ten landing pages for Google AdWords traffic. You copied the content from other pages and then slightly adjusted it. You don’t want these landing pages to be indexed, because that would cause duplicate content issues, so you include the meta robots tag with the noindex directive.
The meta robots tag helps you prevent duplicate content, but it doesn’t attribute topical relevance and authority to another URL. That’s just lost.
Besides instructing search engines not to index a page, the meta robots noindex directive also discourages search engines from crawling the page. Some crawl budget is preserved because of this.
Contrary to its name, the meta robots nofollow tag will not influence crawling. Search engine crawlers will still crawl pages that have a nofollow tag, but they won’t pass on authority to other pages.
What the meta robots tag will do:
What the meta robots tag will not do:
A canonical URL communicates the canonical version of a page to search engines, encouraging search engines to index the canonical version. The canonical URL can reference itself or other pages. If it’s useful for visitors to be able to access multiple versions of a page and you want search engines to treat them as one version, the canonical URL is the way to go. When one page references another page using the canonical URL, most of its topical relevance and authority is attributed to the target URL.
An example for the use of a canonical URL
Say you have an eCommerce website with a product in three categories. The product is accessible via three different URLs. This is fine for visitors, but search engines should only focus on crawling and indexing one URL. Choose one of the categories as the primary one, and canonicalize the other two to it.
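The three-category example could look roughly like this in practice. The product URLs below are hypothetical, and the helper names are invented for this sketch. Each of the three pages gets the same canonical link element in its head, including the primary page, which then references itself.

```python
from html import escape

# Hypothetical URLs for one product that lives in three categories.
PRODUCT_URLS = [
    "https://www.example.com/shoes/blue-sneaker/",
    "https://www.example.com/sale/blue-sneaker/",
    "https://www.example.com/new/blue-sneaker/",
]
PRIMARY_URL = PRODUCT_URLS[0]  # the category chosen as the primary one


def canonical_link(canonical_url: str) -> str:
    """Build the link element that belongs in the page's <head>."""
    return f'<link rel="canonical" href="{escape(canonical_url, quote=True)}">'


def canonical_tags(urls: list, primary: str) -> dict:
    """Map every version of the page to the canonical tag it should carry."""
    return {url: canonical_link(primary) for url in urls}


if __name__ == "__main__":
    for url, tag in canonical_tags(PRODUCT_URLS, PRIMARY_URL).items():
        print(url, "->", tag)
```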
Make sure to 301 redirect URLs that don’t serve a purpose for visitors anymore to the canonical version. This enables you to attribute all their topical relevance and authority to the canonical version. This also helps to get other websites to link to the canonical version.
A canonical URL is a guideline, rather than a directive. Search engines can choose to ignore it.
Applying a canonical URL will not preserve any crawl budget, as it doesn’t prevent search engines from crawling pages. It prevents them from indexing other versions of pages.
What a canonical URL will do:
What a canonical URL will not do:
The rel="alternate" hreflang="x" link attribute, or hreflang attribute for short, is used to communicate to search engines what language your content is in and what geographical region your content is meant for. If you’re using the same content to target multiple regions, hreflang is the way to go. It enables you to rank with the same content in each market and prevent duplicate content in the process.
An example of using hreflang
You’re targeting several English-speaking markets using subdomains for each market. Each subdomain contains the same content.
Within each market you want to rank with the same content and prevent duplicate content. Here’s where hreflang comes in.
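As an illustration only, the hreflang annotations for such a setup could be generated as follows. The subdomains and language-region codes are assumptions, not taken from a real site. The one rule the sketch relies on is that every version of a page lists all versions, itself included.

```python
# Assumed markets: language-region code -> subdomain serving that market.
MARKETS = {
    "en-us": "https://us.example.com",
    "en-gb": "https://uk.example.com",
    "en-au": "https://au.example.com",
}


def hreflang_links(path: str, markets: dict = MARKETS) -> list:
    """Return the link elements to place in the <head> of every version of path."""
    return [
        f'<link rel="alternate" hreflang="{code}" href="{base}{path}">'
        for code, base in markets.items()
    ]


if __name__ == "__main__":
    # The same three lines go on the US, UK and AU version of /pricing/.
    for link in hreflang_links("/pricing/"):
        print(link)
```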
What the hreflang attribute will do:
What the hreflang attribute will not do:
The rel="prev" and rel="next" link attributes are used to communicate the relationships among a series of pages to search engines. For series of similar pages, such as paginated blog archive pages or paginated product category pages, it’s highly advisable to use the rel="prev" and rel="next" link attributes. Search engines will understand that the pages are very similar, which helps prevent duplicate content issues.
In most cases, search engines will not rank other pages than the first one in the paginated series.
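A small sketch of how these attributes might be generated for a paginated archive follows. The ?page=N URL scheme is an assumption for the example. The first page gets only a next link, the last page only a prev link, and every page in between gets both.

```python
def page_url(base_url: str, page: int) -> str:
    """Assumed URL scheme: page 1 is the bare URL, later pages use ?page=N."""
    return base_url if page == 1 else f"{base_url}?page={page}"


def pagination_links(base_url: str, page: int, total_pages: int) -> list:
    """Return the rel="prev"/rel="next" link elements for one page in a series."""
    if not 1 <= page <= total_pages:
        raise ValueError("page must be between 1 and total_pages")
    links = []
    if page > 1:
        links.append(f'<link rel="prev" href="{page_url(base_url, page - 1)}">')
    if page < total_pages:
        links.append(f'<link rel="next" href="{page_url(base_url, page + 1)}">')
    return links


if __name__ == "__main__":
    for number in (1, 2, 3):
        print(number, pagination_links("https://www.example.com/blog/", number, 3))
```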
What the rel="prev" and rel="next" link attributes will do:
What the rel="prev" and rel="next" link attributes will not do:
The rel="alternate" mobile attribute, or mobile attribute for short, communicates the relationship between a website’s desktop and mobile versions to search engines. It helps search engines show the right website for the right device and prevents duplicate content issues in the process.
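For a site with separate desktop and mobile URLs, the annotation is usually a pair: the desktop page points to its mobile counterpart with rel="alternate" plus a media query, and the mobile page points back with a canonical link. The sketch below assumes an m. subdomain and a 640px breakpoint, both of which are illustrative choices.

```python
# Assumed breakpoint at which the mobile version should be served.
MOBILE_MEDIA_QUERY = "only screen and (max-width: 640px)"


def desktop_head_link(mobile_url: str) -> str:
    """Link element for the desktop page, pointing at the mobile version."""
    return f'<link rel="alternate" media="{MOBILE_MEDIA_QUERY}" href="{mobile_url}">'


def mobile_head_link(desktop_url: str) -> str:
    """Link element for the mobile page, pointing back at the desktop version."""
    return f'<link rel="canonical" href="{desktop_url}">'


if __name__ == "__main__":
    print(desktop_head_link("https://m.example.com/pricing/"))
    print(mobile_head_link("https://www.example.com/pricing/"))
```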
What the mobile attribute will do:
What the mobile attribute will not do:
If you’re unable to make changes (quickly) to your website, you can set up parameter handling in Google Search Console and Bing Webmaster Tools. Parameter handling defines how search engines should deal with URLs that contain a parameter. Using this, you can tell Google and Bing not to crawl and/or index certain URLs.
In order to set up parameter handling, you need URLs that are identifiable by a pattern. Parameter handling should only be used in certain situations, for example sorting, filtering, translating, and saving session data.
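Before configuring anything, it can help to check which of your URLs actually match a parameter pattern. The following sketch is not the search engines’ own tooling, only a local audit helper. The parameter names in it are hypothetical stand-ins for sorting, filtering, translating and session parameters.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Hypothetical parameter names that don't change the page's main content.
IGNORED_PARAMETERS = {"sort", "filter", "lang", "sessionid"}


def has_ignored_parameter(url: str) -> bool:
    """Return True if the URL carries at least one of the ignored parameters."""
    query = urlsplit(url).query
    return any(
        key.lower() in IGNORED_PARAMETERS
        for key, _ in parse_qsl(query, keep_blank_values=True)
    )


def strip_ignored_parameters(url: str) -> str:
    """Return the URL without the ignored parameters, keeping all others."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in IGNORED_PARAMETERS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


if __name__ == "__main__":
    url = "https://www.example.com/shoes/?sort=price&color=blue"
    print(has_ignored_parameter(url))
    print(strip_ignored_parameters(url))
```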
Keep in mind that configuring this for Google and Bing will not affect how other search engines crawl your website.
What parameter handling will do:
What parameter handling will not do:
HTTP authentication requires users or machines to log in to gain access to a website, or to a section of it.
Without a username and password, you (or a robot) won’t get past the login screen, and you won’t be able to access anything. HTTP authentication is a great way to keep unwanted visitors (both humans and search engine crawlers) out of, for instance, a test environment. Google recommends using HTTP authentication to prevent search engine crawlers from accessing test environments:
If you have confidential or private content that you don’t want to appear in Google Search results, the simplest and most effective way to block private URLs from appearing is to store them in a password-protected directory on your site server. Googlebot and all other web crawlers are unable to access content in password-protected directories.
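To show what happens on the server side, here is a minimal sketch of an HTTP Basic authentication check. The credentials and realm name are placeholders, and in practice this is normally configured in the web server rather than written by hand. A request without valid credentials receives a 401 response with a WWW-Authenticate challenge, which is exactly where a crawler stops.

```python
from __future__ import annotations

import base64
import hmac

# Placeholder credentials for a hypothetical test environment.
USERNAME = "staging"
PASSWORD = "change-me"
CHALLENGE = {"WWW-Authenticate": 'Basic realm="Test environment"'}


def check_basic_auth(authorization: str | None) -> tuple[int, dict[str, str]]:
    """Return (status code, extra response headers) for an Authorization header."""
    if not authorization:
        return 401, dict(CHALLENGE)
    scheme, _, encoded = authorization.partition(" ")
    if scheme.lower() != "basic":
        return 401, dict(CHALLENGE)
    try:
        decoded = base64.b64decode(encoded.strip(), validate=True).decode("utf-8")
    except ValueError:  # covers malformed base64 and undecodable bytes
        return 401, dict(CHALLENGE)
    user, separator, password = decoded.partition(":")
    # compare_digest avoids leaking information through comparison timing.
    user_ok = hmac.compare_digest(user.encode(), USERNAME.encode())
    password_ok = hmac.compare_digest(password.encode(), PASSWORD.encode())
    if separator and user_ok and password_ok:
        return 200, {}
    return 401, dict(CHALLENGE)


if __name__ == "__main__":
    token = base64.b64encode(b"staging:change-me").decode()
    print(check_basic_auth(None))  # what a crawler without credentials gets
    print(check_basic_auth(f"Basic {token}"))
```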
What HTTP authentication will do:
What HTTP authentication will not do:
Google Search Console shows you how Google crawls your website. To check it out:
If you’re fairly tech savvy, you can find out how often Google crawls your website by analyzing your website’s log files.
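A log file analysis can start as simply as counting requests per day whose user agent mentions Googlebot. The sketch below assumes the common "combined" access log format and uses made-up sample lines. Since user agents can be spoofed, treat the counts as an estimate unless you also verify the requesting IP addresses.

```python
import re
from collections import Counter

# Matches the date, request line, status, size, referrer and user agent
# of a log line in the (assumed) combined access log format.
LOG_PATTERN = re.compile(
    r'\[(?P<day>\d{2}/\w{3}/\d{4}):[^\]]+\] '
    r'"(?P<method>\S+) (?P<path>\S+)[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)

# Made-up sample lines for illustration.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Oct/2023:13:55:36 +0000] "GET /blog/ HTTP/1.1" 200 2326 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '66.249.66.1 - - [10/Oct/2023:14:02:11 +0000] "GET /pricing/ HTTP/1.1" 200 1804 "-" '
    '"Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '203.0.113.7 - - [10/Oct/2023:14:05:40 +0000] "GET /blog/ HTTP/1.1" 200 2326 "-" '
    '"Mozilla/5.0 (Windows NT 10.0; Win64; x64)"',
]


def googlebot_hits_per_day(lines) -> Counter:
    """Count log lines per day whose user agent contains 'googlebot'."""
    hits = Counter()
    for line in lines:
        match = LOG_PATTERN.search(line)
        if match and "googlebot" in match.group("agent").lower():
            hits[match.group("day")] += 1
    return hits


if __name__ == "__main__":
    print(googlebot_hits_per_day(SAMPLE_LOG))
```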
It’s worth noting that Google determines how often they should crawl your website using the crawl budget for your website.
To slow crawlers down, you can use the crawl-delay directive in robots.txt. We don’t recommend setting this up for Google and Bing, though, because their crawlers are smart enough to know when your website is having a hard time, and they’ll check back later in that case.
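For crawlers that do honor it, a crawl-delay rule might look like the one in this sketch, which also reads the value back with Python’s standard-library robots.txt parser. The crawler name examplebot and the 10-second delay are invented for the example.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt: ask a hypothetical crawler to wait 10 seconds between requests.
ROBOTS_TXT = """\
User-agent: examplebot
Crawl-delay: 10
"""


def crawl_delay_for(user_agent: str):
    """Return the crawl delay in seconds that ROBOTS_TXT sets for user_agent."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.crawl_delay(user_agent)


if __name__ == "__main__":
    print(crawl_delay_for("examplebot"))  # 10
```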
There are a few ways to go about preventing search engines from crawling parts of your website, or just specific pages:
Indexing a website means that a search engine tries to make sense of the website, in order to make it findable through the search engine.
The best way to answer this is to create an account with ContentKing to evaluate how indexable your website is for search engines. As you’ve been able to read above, there are many ways to influence how search engines index your website.
Google indexes your website as often as it crawls it. Its crawlers will pass on whatever they have found to the indexer, which takes care of indexing websites.
There’s no single answer to how long it takes to index a new website, as it depends on the promotion of the new website. Promoting it speeds up the crawling and indexing process. If you do this well, a small website can be indexed within an hour. Alternatively, it can also take months to index an entirely new website.
Please note that having your website indexed by search engines doesn’t mean your pages will start ranking high right off the bat. Achieving high rankings takes a lot more time.
Search engines can be prevented from indexing a website or page via these methods: