Log File Analysis for SEO: An Introduction
Log file analysis plays an important role in SEO because log files show search engine crawlers' true behavior on your site.
In this article, we’ll describe what log file analysis is, why it’s important, how to read log files, where to find them, how to get them ready for analysis and we'll run through the most common use cases!
What is log file analysis in SEO?
Through log file analysis, SEOs aim to get a better understanding of what search engines are actually doing on their websites, in order to improve their SEO performance.
Analyzing your log files is like analyzing Google Analytics data – if you don’t know what you’re looking at and what to look for, you’re going to waste a lot of time without learning anything. You need to have a goal in mind.
So before you dive into your log files, make a list of questions and hypotheses you want to answer or validate. For example:
- Are search engines spending their resources crawling your most important pages, or are they wasting your precious crawl budget on useless URLs?
- How long does it take Google to crawl a newly launched product category and all the product pages it contains?
- Are search engines crawling URLs that aren’t part of your site structure (“orphaned pages”)?
Why is log file analysis important in SEO?
Since only log files show crawlers' true behavior, these files are essential to understanding how search engines crawl your site. This point is often overlooked:
Legacy crawlers, or even a monitoring platform like ContentKing, only simulate what search engines see; they do not provide a true reflection of how search engines crawl. And to be clear, Google Search Console doesn’t tell you how they crawl either.
Using log file analysis, you can uncover important issues such as:
- Unfortunate search engine crawl priorities: your logs will show you what pages (and sections) get crawled most frequently. And – especially with large sites – you’ll often see that search engines are spending a lot of time crawling pages that carry little to no value. You can then take action and adjust things like your robots.txt file, internal link structure, and faceted navigation.
- 5xx errors: your log files help to identify 5xx error response codes, which you can then use as a starting point for follow-up investigations.
- Orphaned pages: orphaned pages are pages that live outside of your site structure – they have no internal links from other pages. Because of this, most crawl simulations will not be able to discover these pages, so they’re easy to forget about. If they’re getting crawled by search engines, your log files will reflect this. And boy do search engines have a good memory – they rarely “forget” about URLs. You can then take action: for example include the orphaned pages in the site structure, redirect them, or remove them entirely.
Wondering if that's it? It isn't: further down in this article we'll describe the most common log file analysis use cases in great detail. No sweat. Now, let's carry on!
What is a log file?
A log file is a text file containing records of all the requests a server has received, from both humans and crawlers, and its responses to the requesters.
Throughout this article, when we talk about “the request”, we’re referring to the request a client makes to a server. The response the server sends back is what we’ll refer to as “the response”.
The kinds of requests that are logged
How requesting works
Before we continue, we need to talk about how these requests work.
When your browser wants to access a web page on a server, it sends a request. Among other things, this request consists of these elements:
- HTTP Method: for example GET.
- URL Path: the path to the resource that's requested, for example / for the homepage.
- HTTP Protocol version: for example HTTP/1.1.
- HTTP Headers: for example the user-agent string, preferred languages, and the referring URL.
Next, the server sends back a response. This response consists of three elements:
- HTTP Status Code: the three-digit response to the client’s request.
- HTTP Headers: headers containing for example the content-type that was returned and instructions on how long the client should cache the response.
- HTTP Body: the body (e.g. HTML, CSS, etc.) is used to render and display the page in your browser. The body payload may not always be included – for example when a server returns a 301 status code.
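If you want to see these pieces in practice, here's a minimal sketch using Python's third-party requests library (assuming it's installed); the URL is just a placeholder:

```python
# Inspect the three response elements for a single request.
# "https://www.example.com/" is a placeholder URL.
import requests

response = requests.get("https://www.example.com/")

print(response.status_code)                   # HTTP Status Code, e.g. 200
print(response.headers.get("content-type"))   # one of the HTTP Headers
print(response.text[:200])                    # start of the HTTP Body (HTML)
```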
When logging is active on a web server, all the requests it receives are logged in a so-called access log file. These records typically contain information about each request received, such as the HTTP Status Code that the server returned and the size of the requested file. These access log files are typically saved in standardized text file formats, such as the Common Log Format or the Combined Log Format.
These access logs come straight from the source – the web server that received the request. Gathering logs becomes trickier if you’re running a large website with a complex setup that uses for example:
- Load balancers
- Separate servers to serve assets (e.g. images, CSS, and JavaScript)
- A Content Delivery Network (CDN)
In practice, you’ll find that you need to pull logs from different places and combine them to get a complete picture of all of the requests that were made. You may also need to reformat some of the log files to make sure they’re in the same format.
Because of their decentralized nature and massive scale, it’s no small feat for CDNs to provide access to the access logs across all of their machines. However, some of the bigger CDN providers offer solutions:
- Cloudflare offers log access as part of their enterprise plan.
- Akamai offers log streaming as part of their DataStream product.
- AWS CloudFront offers access logs as part of their standard platform.
If you're not on Cloudflare's enterprise plan, but you do want to gain access to log files, you can generate them on the fly using Cloudflare Workers. Cloudflare Workers are scripts that run on the edge (a server on the "edge of the CDN": the data center where the CDN connects with the internet, typically closest to the visitor), allowing you to intercept requests destined for your server. You can modify these requests, redirect them, or even respond directly.
Moving forward we’ll refer to the general concept of running scripts on a CDN’s edge as “edge workers”.
The possibilities for edge workers are endless. Besides generating log files on the fly, here are a few abilities that will help to illustrate their power:
- Adjust your robots.txt
- Implement redirects
- Implement X-Robots-Tag headers
- Change titles and meta descriptions
- Implement Schema markup
And the list goes on. It’s important to note, though, that using edge workers adds a lot of complexity, as they’re yet another place where something can go wrong. Read more about mitigating these risks.
If you’re using the Cloudflare CDN, you can use their Cloudflare Workers to construct the logs. If you don’t want to write the script yourself, there are apps available that can do the heavy lifting for you. It’s entirely up to you what you want to log and in what format you want it, depending on where and how you’re going to process it. You could go for a standardized text file format like the Common Log Format, or a different one. Whatever works best for you.
After constructing the CDN logs, you can save them to, for example, BigQuery or an ELK stack, after which you can start analyzing them.
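As an illustration, here's a minimal Python sketch of what writing worker-generated log records into a BigQuery table could look like, assuming the google-cloud-bigquery client library and a hypothetical table name; adapt the fields to whatever you decided to log:

```python
# Stream a batch of log records into a BigQuery table.
# The project/dataset/table name below is hypothetical.
from google.cloud import bigquery

client = bigquery.Client()
table_id = "my-project.logs.cdn_access_logs"

rows = [
    {
        "timestamp": "2021-05-25T07:50:39+02:00",
        "ip": "22.214.171.124",
        "method": "GET",
        "path": "/",
        "status": 200,
        "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)",
    },
]

errors = client.insert_rows_json(table_id, rows)  # streaming insert
if errors:
    print("Insert failed:", errors)
```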
Here are several guides on how to save them:
The anatomy of a log file record
Now that we’ve got that out of the way, let’s get our hands dirty and view a sample record from an nginx access log. To give you some context, this record describes one request out of tens of thousands my personal website has received over the last month.
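```
22.214.171.124 - - [25/May/2021:07:50:39 +0200] "GET / HTTP/1.1" 200 12179 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
```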
Let’s dissect this record and see what we’re looking at:
- 22.214.171.124 – the requester’s IP address.
- The first dash (-) – hard-coded by default on nginx (which the record above comes from); back in the day it was used to identify the client making the HTTP request, but nowadays it is no longer used.
- The second dash (-) – an optional field for user identification. For example, when requests to access HTTPAuth-protected URLs are made, you’ll see the username here. If no user identification is sent, the record will contain a dash.
- [25/May/2021:07:50:39 +0200] – the date and time of the request.
- "GET / HTTP/1.1" – the request, consisting of the HTTP Method (“GET”), the requested resource (“/” – the homepage), and the HTTP version used (“HTTP/1.1”).
- 200 – the HTTP response code.
- 12179 – the size in bytes of the resource that was requested. When redirected resources are requested, you’ll either see zero or a very low value here, as there is no body payload to be returned.
- "-" – had there been a referrer, it would have been shown here instead of the dash.
- "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)" – the requester’s user-agent string, which can be used to identify the requester.
Logging configurations can vary significantly. You may also encounter the request time (how long the server spent processing the request) – and some go all-out and even log the full body response.
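If you want to work with these records programmatically, a few lines of Python are enough to split a combined-format record like the one above into named fields. A minimal sketch (the regex assumes the exact format shown, so adjust it if your configuration logs extra fields):

```python
# Split a combined-format access log record into named fields.
# The regex matches the format of the sample record above.
import re

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) (?P<identity>\S+) (?P<user>\S+) \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referrer>[^"]*)" "(?P<user_agent>[^"]*)"'
)

record = (
    '22.214.171.124 - - [25/May/2021:07:50:39 +0200] "GET / HTTP/1.1" 200 12179 '
    '"-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"'
)

match = LOG_PATTERN.match(record)
if match:
    fields = match.groupdict()
    print(fields["status"], fields["request"], fields["user_agent"])
```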
A record contains the requester’s user-agent string, which can be used to help identify the requester. In the example above, we see the request was made by Google’s main crawler, called Googlebot.
Google has different crawlers for different purposes, and the same holds true for other search engines like Bing, DuckDuckGo, and Yandex.
The user-agent strings that you find in your log files are not the same thing as the user agent you target in your robots.txt or the user agent used in robots directives to influence crawling, indexing, and serving behavior.
However, well-behaving crawlers will have their unique identifier (e.g. “Googlebot”) present in all three to allow for easy identification.
Common misconceptions around log file analysis
Log file analysis is unimportant for small websites
Those of you with small websites may be wondering by now whether there’s any value in log file analysis.
There definitely is, because it’s important to understand how search engines crawl your site – and how this results in indexed pages. Without log files, you’d just keep guessing. And if your website is important to your business, that’s too risky.
Log file analysis is a one-time thing
How often should you be performing log file analysis? Like many things in SEO, log file analysis is not a one-off task. It’s an ongoing process. Your website is constantly changing, and search engine crawlers are adapting to these changes. As an SEO, it’s your responsibility to monitor their behavior to make sure the crawling and indexing process runs smoothly.
Google Search Console’s Crawl Stats report is a replacement for log files
In late 2020, Google revamped their “Crawl Stats” report in Google Search Console. But it’s still not a replacement for log files! While it is a big improvement compared to the previous Crawl Stats report, the new Crawl Stats report only contains information about Google’s crawlers – it only provides a high-level digest of Google’s crawl behavior. You can drill down into the dataset, but you’ll soon find you’re looking at sampled data.
If you can’t get your hands on log files, then the Crawl Stats report is of course useful, but it’s not meant to be a replacement for log files.
Google Analytics shows you search engines’ crawl behavior
In case you’re wondering why we haven’t brought up Google Analytics yet, that’s because Google Analytics does not track how search engine crawlers behave on your site.
Google Analytics aims to track what your visitors do on your site. Tracking search engine behavior, meanwhile, is a whole different ball game. And besides, search engine crawlers don’t execute the Google Analytics tracking code (or tracking code for other analytics platforms).
Where to find your log files
We now know what log files are, what different types there are, and why they’re important. So let’s move on to the next step and describe where to find them!
As we mentioned in the Access logs section, if you’ve got a complex hosting setup, you’ll need to take action to gather the log files. So, before you go looking for your log files, make sure you have a firm understanding of your hosting setup.
The most popular web servers are Apache, nginx, and IIS. You’ll often find the access logs in their default location, but keep in mind that the web server can be configured to save them in a different location. Or, access logging may be disabled altogether.
See below for links to documentation explaining web server access log configuration, and where to find the logs:
Log file history
Keep in mind that log files may only be kept for a short time before they’re rotated or discarded.
If you’re doing log file analysis to get an idea of how search engine behavior has changed over time, you’re going to need a lot of data – maybe even 12–18 months’ worth of log files.
For most log file analysis use cases, we’d recommend analyzing at least 3 months’ worth of log files.
Filter out non-search-engine crawler records
When doing log file analysis for SEO, you’re only interested in seeing what search engine crawlers are doing, so go ahead and filter out all other records.
You can do this by removing all records made by clients that don’t identify as a search engine in their user-agent string. To get you started with a list of user-agent strings you’ll be interested in, Google and Bing both publish overviews of the user agents their crawlers use.
Be sure to also verify that you’re really dealing with search engine crawlers rather than other crawlers posing as search engine crawlers. We recommend checking out Google’s documentation on verifying Googlebot, but you can apply the same approach to other search engines as well.
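To make this concrete, here's a minimal Python sketch of both steps – filtering on the user-agent string and verifying Googlebot records with a reverse-plus-forward DNS lookup (the method Google documents for Googlebot). The sample records and IP addresses are placeholders:

```python
# Keep only records whose user-agent identifies a search engine crawler,
# and verify Googlebot claims with a reverse + forward DNS lookup.
import socket

SEARCH_ENGINE_TOKENS = ("googlebot", "bingbot", "duckduckbot", "yandex")

records = [  # placeholder parsed records (see the parsing sketch earlier)
    {"ip": "66.249.66.1",
     "user_agent": "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"},
    {"ip": "203.0.113.7",
     "user_agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"},
]

def looks_like_search_engine(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(token in ua for token in SEARCH_ENGINE_TOKENS)

def is_verified_googlebot(ip: str) -> bool:
    try:
        hostname = socket.gethostbyaddr(ip)[0]              # reverse DNS lookup
    except socket.herror:
        return False
    if not hostname.endswith((".googlebot.com", ".google.com")):
        return False
    try:
        return ip in socket.gethostbyname_ex(hostname)[2]   # forward-confirm the hostname
    except socket.gaierror:
        return False

crawler_records = [r for r in records if looks_like_search_engine(r["user_agent"])]
verified = [r for r in crawler_records
            if "googlebot" not in r["user_agent"].lower() or is_verified_googlebot(r["ip"])]
print(len(verified), "verified crawler records")
```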
This process should also filter out any records containing personally identifiable information (PII), such as IP addresses, usernames, phone numbers, and email addresses.
Use cases for log file analysis
Now, it’s time to go through the most common use cases for log file analysis to better understand how search engines behave on your site, and what you can do to improve your SEO performance. Even though we’ll primarily mention Google, all of the use cases can be applied to other search engines too.
Here are all of the use cases we'll cover:
- 1. Understand crawl behavior
- 2. Verify alignment on what’s important to your business
- 3. Discover crawl budget waste
- 4. Discover sections with most crawl errors
- 5. Discover indexable pages that Google isn’t crawling
- 6. Discover orphan pages
- 7. Keep tabs during a migration project
1. Understand crawl behavior
The best starting point for you is to first understand how Google is currently crawling your site. Enter the wondrous land of log files.
The goals here are as follows:
- Build foundational knowledge that’s required to make the most of the use cases we’ll cover below.
- Improve SEO forecasts by getting better at predicting how long it’ll take for new and updated content to start ranking.
1a. Build foundational knowledge
1b. Improve SEO forecasts
2. Verify alignment on what’s important to your business
Perhaps Google's spending a lot of crawl budget on URLs that are irrelevant to you, neglecting pages that should be spearheading your SEO strategy.
The goal here is to find out if this is the case and, if so, to fix it.
Now, you may think to leverage the changefreq field in your XML sitemap to realign their focus, but this field is (largely) ignored. Your best bet for fixing the alignment issue is to go through the steps we just discussed above.
3. Discover crawl budget waste
You want Google to use your site's crawl budget on crawling your most important pages. While crawl budget issues mostly apply to large sites, analyzing crawl budget waste will help you improve your internal link structure and fix crawl inefficiencies. And there's tremendous value in that.
Let's find out if Google's wasting crawl budget on URLs that are completely irrelevant, and fix it!
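As a starting point, a small Python sketch like the one below can aggregate crawler hits per top-level section so you can see where the crawl budget actually goes; the paths are placeholders for what you'd extract from your own filtered logs:

```python
# Count crawler hits per top-level site section.
from collections import Counter

paths = [  # placeholder: request paths from your filtered log records
    "/", "/blog/log-file-analysis/", "/search?q=shoes",
    "/search?q=boots", "/products/red-sneaker/",
]

def section(path: str) -> str:
    first_segment = path.split("?")[0].strip("/").split("/")[0]
    return "/" + first_segment if first_segment else "/"

hits_per_section = Counter(section(p) for p in paths)
for sec, hits in hits_per_section.most_common():
    print(f"{sec}: {hits} crawls")
```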
4. Discover sections with most crawl errors
When Google is hitting lots of crawl errors (5xx HTTP status codes), they're having a poor crawl experience. Not only is this a waste of crawl budget, but Google can choose to stop their crawl too. And it's likely visitors would have a similarly poor experience.
Let's find out where most crawl errors are happening on your site, and get them fixed.
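A sketch of the same idea in Python, this time keeping only the 5xx responses and surfacing the worst-offending URLs (the records are placeholders):

```python
# Find the URLs that most often returned a 5xx to search engine crawlers.
from collections import Counter

records = [  # placeholder parsed + filtered log records
    {"path": "/search", "status": 503},
    {"path": "/search", "status": 503},
    {"path": "/", "status": 200},
    {"path": "/checkout", "status": 500},
]

errors_5xx = Counter(r["path"] for r in records if 500 <= r["status"] <= 599)
for path, count in errors_5xx.most_common(10):
    print(f"{path}: {count} crawl errors")
```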
5. Discover indexable pages that Google isn’t crawling
Ideally, all of your indexable pages are frequently crawled by Google. The general consensus is that pages that get crawled frequently are more likely to perform well than those that are crawled infrequently.
So, let's find out which of your indexable pages are crawled infrequently — and fix that.
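One way to approach this is to compare the indexable URLs you know about (here taken from an XML sitemap) against the URLs that show up in your logs. A minimal Python sketch, with placeholder file names and crawled URLs:

```python
# List sitemap URLs that never appear among the crawled URLs from your logs.
import xml.etree.ElementTree as ET

NAMESPACE = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}
tree = ET.parse("sitemap.xml")  # placeholder: a local copy of your XML sitemap
sitemap_urls = {loc.text.strip() for loc in tree.findall(".//sm:loc", NAMESPACE)}

crawled_urls = {  # placeholder: URLs extracted from your filtered log records
    "https://www.example.com/",
    "https://www.example.com/blog/",
}

for url in sorted(sitemap_urls - crawled_urls):
    print("Not crawled:", url)
```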
6. Discover orphan pages
Orphan pages are pages that have no internal links, and hence live outside of your site structure. Your SEO monitoring platform doesn’t find them, because it likely only relies on link discovery and your XML sitemap. Making these pages part of your site structure (again) can really help improve your site's SEO performance. Finding authoritative orphan pages is like finding the lost key to your wallet containing ten Bitcoin you bought in 2013.
Let's find out whether your website has orphan pages, and adopt them into your site structure.
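The reverse of the comparison from the previous use case surfaces orphan candidates: URLs that search engines request but that don't appear anywhere in your known site structure. A minimal Python sketch, with placeholder sets:

```python
# Orphan-page candidates: crawled URLs that aren't part of the known site structure.
crawled_by_search_engines = {"/", "/blog/", "/old-campaign-2017/"}  # paths seen in your logs
known_site_structure = {"/", "/blog/"}                              # from your site crawl + sitemaps

orphan_candidates = crawled_by_search_engines - known_site_structure
print(orphan_candidates)  # {'/old-campaign-2017/'}
```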
7. Keep tabs during a migration project
During migrations, keeping tabs on Google's crawling behavior is important for two reasons:
- In preparation for the migration, you need to create a redirect plan that includes your most important URLs and where they should redirect to.
- After you've launched your changes, you need to know whether Google's crawling your new URLs and how they're progressing. If something is holding them back, you need to know pronto.
7a. Complete redirect plan
7b. Monitor Google's crawl behavior
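For the monitoring part, a small Python sketch like this can track how much of the new URL set Google has already requested since launch (the launch date and URL sets are placeholders):

```python
# Track what share of the new URLs has been crawled since the migration launch.
from datetime import datetime

launch = datetime(2021, 6, 1)  # placeholder launch date
new_urls = {"/new-category/", "/new-category/product-a/", "/new-category/product-b/"}

crawl_hits = [  # placeholder: (path, timestamp) pairs from your filtered logs
    ("/new-category/", datetime(2021, 6, 2, 9, 15)),
    ("/old-page/", datetime(2021, 6, 3, 11, 40)),
]

crawled_after_launch = {path for path, ts in crawl_hits if ts >= launch}
progress = len(new_urls & crawled_after_launch) / len(new_urls)
print(f"{progress:.0%} of new URLs crawled since launch")
```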
The importance of ongoing monitoring
Your website is never done, and SEO is never done either. If you want to win at SEO, you'll always be tweaking your website.
With all of the changes you continuously make, it's important to make sure log file analysis is part of your ongoing SEO monitoring efforts. Go through the use cases we've covered above, and set up alerts in case your log files show abnormal behavior from Google.
Your log files are the only way to learn about Google's true behavior on your site. Don't leave money on the table: leverage the insights from log file analysis.