An XML Sitemap is a special document which lists all pages on a website to provide search engines with an overview of all the available content. It’s strongly recommended to implement an XML Sitemap, especially on larger websites (500+ pages).
Stick to the following best practices when implementing an XML Sitemap:
An XML Sitemap is a special document which lists all pages on a website and is meant for search engines. Compare it to a telephone book: it tells the search engine what content is available and how to reach it. Furthermore, some extra information can be provided, such as when the content was last updated and what the relative importance of the content is.
XML Sitemaps are very useful for search engines, as they provide a single overview of all the available content at once. This serves both as a starting point the first time they go through your website and as a way to quickly discover newly added content.
What’s important to note is the distinction between XML sitemaps and “regular” sitemaps (also called “HTML sitemaps”). Those sitemaps are meant for your visitors to find content on your website, while XML sitemaps are meant for search engines.
XML Sitemaps help search engines assess your website’s content, and are a mechanism for notifying them of new or updated content. It’s therefore recommended to implement them whenever feasible, and for larger websites (500+ pages) they become a real must-have.
XML Sitemaps are meant for search engines, and thus they are formatted in a language that’s easy for computers to understand: XML. Fortunately, XML is also quite readable for humans, so let’s take a look at an example:
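A minimal XML Sitemap might look like the sketch below. The www.example.com URLs, the date and the property values are placeholders chosen purely for illustration; the first entry carries only the required tag, while the second also sets the optional properties.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch: www.example.com and all values below are placeholders. -->
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <!-- A minimal entry: only the required loc tag. -->
  <url>
    <loc>https://www.example.com/</loc>
  </url>
  <!-- An entry that also sets the optional properties. -->
  <url>
    <loc>https://www.example.com/blog/</loc>
    <lastmod>2024-01-15</lastmod>
    <changefreq>weekly</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>
```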
Now, to understand what’s going on let’s dissect the individual parts!
<?xml version="1.0" encoding="UTF-8"?>
This header denotes that the content is structured according to version 1.0 of the XML standard and describes the character encoding. It basically informs search engines what they can expect from the file.
The urlset definition, <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">, encapsulates all the URLs contained in the sitemap and describes which version of the XML Sitemap standard is used. Note that the urlset gets closed at the bottom of the document with </urlset>.
Finally we get to the most important part: the definition of the individual URLs through the url-tag. Every URL definition needs to contain at least the loc-tag (short for location). The value of this tag should be the full URL of the page, including the protocol (e.g. “http://”).
On top of that every URL definition may contain the following optional properties:
lastmod: the date of when the content on that URL was last modified. The date is in “W3C datetime” format.
priority: the priority of the URL, relative to other URLs on your own website, on a scale between 0.0 and 1.0.
changefreq: how often the content on the URL is expected to change. Possible values are always, hourly, daily, weekly, monthly, yearly and never.
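To make the “W3C datetime” format more concrete, the fragment below sketches the two most common forms of lastmod: a date only, and a date and time with a timezone offset. The URLs and timestamps are made up for illustration, and the entries are assumed to sit inside a urlset element.

```xml
<!-- Fragment: these url entries belong inside a urlset element. -->
<url>
  <loc>https://www.example.com/about/</loc>
  <!-- Date only: YYYY-MM-DD -->
  <lastmod>2024-01-15</lastmod>
</url>
<url>
  <loc>https://www.example.com/news/</loc>
  <!-- Date and time with a timezone offset -->
  <lastmod>2024-01-15T09:30:00+00:00</lastmod>
  <changefreq>daily</changefreq>
  <priority>1.0</priority>
</url>
```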
Just like your website’s pages, the XML Sitemap resides on its own URL. Usually the URL for an XML Sitemap is /sitemap.xml, and it’s recommended to follow this convention to make it easy for search engines to discover it.
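However, if for any reason this is not possible, you can choose a different location or filename, as long as you reference it in your robots.txt file through the Sitemap-directive. This directive takes the full URL of the sitemap, for example (with a hypothetical URL) Sitemap: https://www.example.com/my-sitemap.xml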
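XML Sitemaps have a couple of limitations to keep in mind: a single XML Sitemap may contain no more than 50,000 URLs, and its uncompressed file size may not exceed 50MB.
If your XML Sitemap exceeds these limits you need to split it across multiple XML Sitemaps and use an XML Sitemap Index.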
Whenever you exceed the limits for a single XML Sitemap you need to split it up into separate XML Sitemaps and bundle them together with an XML Sitemap Index. This index is a separate XML file which references the various XML Sitemaps.
Let’s take a look at an example:
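An XML Sitemap Index referencing two compressed XML Sitemaps might look like the sketch below. The www.example.com domain, the sitemap1.xml.gz filename and both timestamps are assumptions made for illustration.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Illustrative sketch: the domain, filenames and dates are placeholders. -->
<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap>
    <loc>https://www.example.com/sitemap1.xml.gz</loc>
    <lastmod>2024-01-15T09:30:00+00:00</lastmod>
  </sitemap>
  <sitemap>
    <!-- lastmod is optional; a date-only value is also valid. -->
    <loc>https://www.example.com/sitemap2.xml.gz</loc>
    <lastmod>2024-01-10</lastmod>
  </sitemap>
</sitemapindex>
```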
This XML Sitemap Index references two XML Sitemaps, one of which is sitemap2.xml.gz. Let’s dissect this file as well!
<?xml version="1.0" encoding="UTF-8"?>
Nothing new here: just like with the XML Sitemap file, we first define that the file is in XML format and which character encoding is used.
Now, instead of a urlset definition we see a sitemapindex definition: <sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">. This definition encapsulates all the sitemaps contained in the sitemap index and again describes which version of the XML Sitemap standard is used. Just like the urlset definition, the sitemapindex definition is closed at the bottom of the document with </sitemapindex>.
And then on to the meat: the actual definition of the individual sitemaps. Just like for URLs, every sitemap definition needs to contain at least the loc-tag, containing the full URL of the individual XML Sitemap.
On top of that the sitemap definition may optionally contain a lastmod definition: the date when the referenced XML Sitemap was last updated, again in “W3C datetime” format.
Similar to XML Sitemaps there is a convention for the location and filename of the XML Sitemap Index: /sitemap_index.xml. But again you’re free to deviate from this, as long as you reference it in your robots.txt file through the Sitemap-directive.
When implementing XML Sitemaps it’s essential to follow these best practices.
Make sure that your XML Sitemap provides an up-to-date picture of your website. Whenever a page is removed it should also be delisted from your XML Sitemap. If you’re using the optional lastmod-tag, make sure to update the timestamp whenever the page changes.
Your XML Sitemap should only describe indexable pages. This means that you should leave out all URLs pointing to redirects (e.g. 301 status code) and missing pages (e.g. 404 status code).
Furthermore, these pages need to be indexable, which means they are accessible for search engines (no exclusion in robots.txt) and there are no signals telling search engines not to index the page (such as a meta robots noindex, an x-robots-tag noindex header, or a canonical link pointing to a different URL).
Whenever possible stick to the default location and filename for your XML Sitemap (/sitemap.xml) and XML Sitemap Index (/sitemap_index.xml). This makes it easiest for search engines to find them.
When you’re deviating from the convention for the URL of your XML Sitemap or XML Sitemap Index you should reference it in your robots.txt file. However, even if you’re sticking to the standard URL it’s recommended to include a reference to it in your robots.txt to ensure discoverability by search engines.
Although for every URL you can define the lastmod, priority and changefreq properties, this is fully optional. Defining them won’t hurt, and there may be a slight chance search engines will use this information, but it’s generally understood that search engines don’t pay (much) attention to them.
Make sure that your XML Sitemaps don’t contain more than 50,000 URLs and that the uncompressed file size stays within 50MB. Whenever you exceed either limit you should split the XML Sitemap up and use an XML Sitemap Index.
The .gz extension is added to the filename when the XML Sitemap is compressed (via gzip compression). XML Sitemaps containing many URLs usually grow to significant file sizes, and through the use of compression the impact of this on disk storage and network transfer time can be reduced.