Duplicate content refers to very similar, or the exact same, content being on multiple pages. Keep this in mind:
Taken narrowly, duplicate content refers to very similar, or the exact same, content being on multiple pages within your own website or on other websites.
Taken broadly, duplicate content is content that adds little to no value for your visitors. Therefore, pages with little to no body content are also considered to be duplicate content.
You should avoid having duplicate content, as it confuses search engines and may harm your SEO performance. Having a dozen duplicate content pages on a website of 100 pages is something to look into and fix, but duplicate content will really weigh down your SEO performance when there is an excessive amount of it (a ratio of more than 3 duplicate content pages for every normal page).
Duplicate content is bad for two reasons:
"Duplicate content can cause serious SEO issues and send conflicting signals to search engines. Put the right measures in place to ensure your content has unique URLs, so every page gets the best chance to rank well and drive traffic to your site."
"Duplicate content is the most pervasive and misunderstood SEO issue. There are so many forms of duplication that you have to watch out for, and one small technical error can lead to literally thousands of duplicate pages. Canonical is not always the right solution, and this article from ContentKing does an amazing job of identifying the problem and solution to dozens of common issues with duplicate content.
I have seen very successful websites stymied by duplicate content. In these cases, fixing the issues that lead to duplicate content alone can often result in a 20% or higher increase in organic traffic. When you have millions of visitors, that can be hundreds of thousands in additional revenue."
"Every time you create 3 or 4 versions of one of your pages you are competing against yourself 3 or 4 times before this page even starts competing with other pages in the SERPs."
"People often have misperceptions about duplicate content. If I had a quarter every time I heard an SEO say that duplicate content would earn you a Panda penalty, I'd have at least $50. That's a joke. Small industry.
Anyway, if you have one or two, less significant pages with duplicate content, it's really nothing to worry about. The real issues come along when your own website is generating multitudes of duplicate content due to poor web development and technical SEO issues. These may lead to crawling complications and traffic issues. Duplicate content may also be concerning if another domain is scraping your content and those pages are outranking your own, which is rarely the case, but it does happen!
Lastly, probably the biggest concern with duplicate content is in regards to the dilution of backlinks that happens as a result of it. If I have two versions of the same page, and users don't know which one is the 'main' one, then one may receive backlinks and the other may not. This way, instead of one page with all the backlinks, they are split between two or more pages. No bueno."
It’s easy to fall into the duplicate content trap, mostly because most organisations never really think about their content strategy in the right way.
You need to understand what you are doing and how to ‘control’ it. Otherwise you might be going 90 on a highway where you are supposed to go 50, but no-one told you. It hurts even though you didn’t know.
Pay attention to this!
Did you know that 25-30% of the web is duplicate content, and that's okay! It's not going to get you penalized and while I firmly believe you should specify how you handle the duplicates, if you don't do anything then Google has many ways they try to solve the duplication issues for you. I wouldn't stress over it too much unless you're doing something that could cause major problems like scraping content from other websites.
Consolidating duplicate content is not about avoiding Google penalties. It is about building links. Links are valuable for SEO performance, but if links end up in duplicate pages they don’t help you. They go to waste.
Duplicate content is a huge issue for many legacy platforms that are set up to rely heavily on parameters for internal page structure. Duplicate content is also an issue for newer platforms such as WordPress with /tag/ pages, which are often best noindexed from the start.
Duplicate content can also easily happen with a poor setup of hosting infrastructure that makes "caSe sensitive" URLs possible, creating literally millions of duplicate content pages, which can be compounded by mixed-case usage on internal links. Google Search Console offers URL parameter handling to reduce duplicate content created through parameters. Duplicate content is also very common for eCommerce sites that have the same product in multiple categories or very similar products with only slight variation such as "blue socks" and "dark blue socks".
Big sites often have a large number of templated pages. The issue is that these pages often don't receive traffic because Google is smart enough to understand that it's basically the same content. The biggest issue around duplicate content is that Google misunderstands the context and you get visitors landing on the wrong page. This happened in the past with a client where Google couldn't understand the difference between London, UK and London, Ontario, Canada because the content was 85-90% similar.
One problem for many SEOs is that they don't use the website's analytics data to understand how much traffic is going to this duplicate content. You want to ensure you don't cull duplicate content too aggressively unless you understand, based on web analytics data, that there will be little traffic impact.
Having duplicate content can hurt your SEO performance, but it won’t get you a penalty from Google as long as you didn’t intentionally copy someone else’s website. If you’re an honest website owner with some technical website challenges, and you’re not trying to trick Google, you don’t have to worry about getting a penalty from Google.
If you’ve copied large amounts of other people’s content, then you’re walking a fine line. This is what Google says about it:
“Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. If your site suffers from duplicate content issues, and you don’t follow the advice listed above, we do a good job of choosing a version of the content to show in our search results.”
Duplicate content is often due to an incorrectly set up web server or website. These occurrences are technical in nature and will likely never result in a Google penalty. They can seriously harm your rankings though, so it’s important to make it a priority to fix them.
But besides technical causes, there are also human-driven causes: content that’s purposely being copied and published elsewhere. As we’ve said, these can bring penalties if they have a malicious intent.
Non-www vs www and HTTP vs HTTPS
Say you’re using the www subdomain and HTTPS. Then your preferred way of serving your content is via https://www.example.com. This is your canonical domain.
If your web server is badly configured, your content may also be accessible through the other combinations of scheme and subdomain (HTTP instead of HTTPS, non-www instead of www).
Choose a preferred way of serving your content, and implement 301 redirects from the non-preferred ways to the preferred version.
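The redirect rule itself lives in your web server or CDN configuration, but the mapping it has to implement can be sketched in a few lines of Python. This is a minimal sketch, assuming `https://www.example.com` is the preferred version; the function name `redirect_target` is hypothetical and only here for illustration.

```python
from typing import Optional
from urllib.parse import urlsplit, urlunsplit

# Assumption: the preferred (canonical) scheme and host for this site.
PREFERRED_SCHEME = "https"
PREFERRED_HOST = "www.example.com"


def redirect_target(url: str) -> Optional[str]:
    """Return the URL a 301 redirect should point to, or None if no redirect is needed."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Treat the bare domain and the www subdomain as the same site.
    bare_host = host[4:] if host.startswith("www.") else host
    if bare_host != PREFERRED_HOST[4:]:
        return None  # A different site altogether: nothing to redirect.
    if parts.scheme == PREFERRED_SCHEME and host == PREFERRED_HOST:
        return None  # Already the preferred version.
    return urlunsplit(
        (PREFERRED_SCHEME, PREFERRED_HOST, parts.path, parts.query, parts.fragment)
    )
```

Calling `redirect_target("http://example.com/page?a=1")` returns `https://www.example.com/page?a=1`, while a URL that is already on the preferred version returns `None`. Note that the path and query string are carried over, so each non-preferred URL redirects to its exact counterpart rather than to the homepage.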
URL structure: casing and trailing slashes
For Google, URLs are case-sensitive. This means that https://example.com/url-A/ and the same URL with different casing are seen as different URLs. When you’re creating links, it’s easy to make a typo, causing both versions of the URL to get indexed. Please note that URLs aren't case-sensitive for Bing.
A forward slash (/) at the end of a URL is called a trailing slash. Often URLs are accessible both with and without the trailing slash.
Choose a preferred structure for your URLs, and for non-preferred URL versions, implement a 301 redirect to the preferred URL version.
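As an illustration, here is a minimal sketch of such a normalization in Python. It assumes the preferred structure is "lowercase paths with a trailing slash", which is only one possible policy, and the function name `preferred_path_url` is hypothetical. Lowercasing is only safe if your site doesn't serve different content on differently cased paths.

```python
from urllib.parse import urlsplit, urlunsplit


def preferred_path_url(url: str) -> str:
    """Lowercase the host and path, and add a trailing slash to page-like paths."""
    parts = urlsplit(url)
    path = parts.path.lower() or "/"
    # Leave file-like paths (for example /logo.png) without a trailing slash.
    last_segment = path.rsplit("/", 1)[-1]
    if not path.endswith("/") and "." not in last_segment:
        path += "/"
    return urlunsplit(
        (parts.scheme, parts.netloc.lower(), path, parts.query, parts.fragment)
    )
```

If the result differs from the requested URL, that is the URL to 301 redirect to. For example, `preferred_path_url("https://example.com/url-A")` returns `https://example.com/url-a/`.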
Index pages (index.html, index.php)
Without your knowledge, your homepage may be accessible via multiple URLs because your web server is misconfigured. Besides https://www.example.com, your homepage may also be accessible through index file URLs such as index.html or index.php.
Choose a preferred way to serve your homepage, and implement 301 redirects from non-preferred versions to the preferred version.
In case your website is using any of these URLs to serve content, make sure to canonicalize these pages because redirecting them would break the pages.
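To find the preferred version for an index file URL, you can strip the index file name from the path. The sketch below assumes a fixed set of index file names (`index.htm` is an added assumption, the other two come from this section), and `strip_index_file` is a hypothetical helper name. The result can be used either as the redirect target or as the canonical URL, depending on which of the two approaches above applies to your site.

```python
from urllib.parse import urlsplit, urlunsplit

# Assumption: the file names your web server treats as directory index documents.
INDEX_FILES = {"index.html", "index.htm", "index.php"}


def strip_index_file(url: str) -> str:
    """Return the directory URL for an index file URL; other URLs are returned unchanged."""
    parts = urlsplit(url)
    directory, _, filename = parts.path.rpartition("/")
    if filename.lower() in INDEX_FILES:
        return urlunsplit(
            (parts.scheme, parts.netloc, directory + "/", parts.query, parts.fragment)
        )
    return url
```

For example, `strip_index_file("https://www.example.com/index.html")` returns `https://www.example.com/`.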
Parameters for filtering
Websites often use parameters in URLs so they can offer filtering functionality. For example, a category URL with filter parameters could show only the black toy cars.
While this is fine for visitors, it may cause major issues for search engines. Filter options often generate a virtually infinite number of combinations when there is more than one filter option available. All the more so because the parameters can be rearranged as well.
Two URLs with the same parameters in a different order would show the exact same content.
Implement canonical URLs—one for each main, unfiltered page—to prevent duplicate content and consolidate the filter-delivered page’s authority. Please note that this doesn't prevent crawl budget issues. Alternatively, you could use parameter handling functionality in Google Search Console and Bing Webmaster Tools to instruct their crawlers how to deal with parameters.
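The following Python sketch illustrates both points: that rearranged parameters lead to the same page, and what the canonical tag for a filtered page could look like. The URLs and parameter names (`color`, `size`) are hypothetical, as are the function names. It assumes the unfiltered page is simply the URL without its query string, which may not hold for every site.

```python
from html import escape
from urllib.parse import parse_qsl, urlsplit, urlunsplit


def same_filter_page(url_a: str, url_b: str) -> bool:
    """True if two URLs differ only in the order of their query parameters."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    same_page = (a.scheme, a.netloc, a.path) == (b.scheme, b.netloc, b.path)
    params_a = sorted(parse_qsl(a.query, keep_blank_values=True))
    params_b = sorted(parse_qsl(b.query, keep_blank_values=True))
    return same_page and params_a == params_b


def canonical_link_tag(url: str) -> str:
    """Build a canonical tag that points to the unfiltered page (query string dropped)."""
    parts = urlsplit(url)
    unfiltered = urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))
    return '<link rel="canonical" href="{}">'.format(escape(unfiltered))
```

With these helpers, `https://www.example.com/toys/cars?color=black&size=small` and `https://www.example.com/toys/cars?size=small&color=black` are recognized as the same page, and both get a canonical tag pointing to `https://www.example.com/toys/cars`.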
A taxonomy is a grouping mechanism to classify content. Taxonomies are often used in Content Management Systems to support categories and tags.
Let’s say you have a blog post that is in three categories. The blog post may be accessible through all three category URLs.
Be sure to choose one of these categories as the primary one, and make the others canonicalize to that one using the canonical URL.
Dedicated pages for images
Some Content Management Systems create a separate page for each image. This page often just shows the image on an otherwise empty page. Since this page has no other content, it’s very similar to all the other image pages and thus amounts to duplicate content.
If possible, disable the feature to give images dedicated pages. If that's not possible, the next best thing is to add a meta robots noindex attribute to the page.
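After adding the noindex directive, it's worth verifying that it actually ends up in the page's HTML. Below is a minimal sketch using Python's standard `html.parser` module. The class and function names are hypothetical, and the check only looks at the meta robots tag, not at the `X-Robots-Tag` HTTP header.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = {name.lower(): (value or "") for name, value in attrs}
        if attr.get("name", "").lower() == "robots":
            for directive in attr.get("content", "").split(","):
                self.directives.add(directive.strip().lower())


def is_noindexed(html: str) -> bool:
    """True if the HTML contains a meta robots tag with the noindex directive."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives
```

You would pass the HTML source of an image page to `is_noindexed` and expect `True`.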
If you have comments enabled on your website, you may be automatically paginating them after a certain number. The paginated comment pages will show the original content; only the comments at the bottom will be different.
For example, the article URL itself could show comments 1-20, with https://www.example.com/category/topic/comments-2/ for comments 21-40, and https://www.example.com/category/topic/comments-3/ for comments 41-60.
Use the pagination link relationships to signal that these are a series of paginated pages.
When it comes to localization, duplicate content issues can arise when you’re using the exact same content to target people in different regions who speak the same language. For example, when you have a dedicated website for the Canadian market and also one for the US market (both in English), chances are there’s a lot of duplication in the content. Google is good at detecting this, and usually folds these results together. The hreflang attribute helps prevent duplicate content. So if you're using the same content for different audiences, be sure to implement hreflang.
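As a rough illustration, the hreflang annotations for the Canadian and US example could be generated like this. The URLs and the function name `hreflang_tags` are hypothetical. The sketch assumes you implement hreflang through link elements in the HTML head, where each language version lists all versions, including itself.

```python
from html import escape


def hreflang_tags(alternates):
    """Build one <link rel="alternate"> tag per language/region version of a page."""
    return [
        '<link rel="alternate" hreflang="{}" href="{}">'.format(escape(code), escape(url))
        for code, url in sorted(alternates.items())
    ]


# Hypothetical URLs for the English-language Canadian and US versions of one page.
tags = hreflang_tags(
    {
        "en-us": "https://www.example.com/us/",
        "en-ca": "https://www.example.com/ca/",
    }
)
```

The same set of tags would then be placed on both the Canadian and the US version of the page.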
Indexable search result pages
Many websites allow searching within the website. The pages on which the search results are displayed are all very similar, and in most cases don’t provide any value to search engines. That’s why you don’t want them to be indexable for search engines.
Prevent search engines from indexing the search result pages by utilizing the meta robots noindex,follow attribute. In general, it’s also a best practice not to link to your search result pages.
If a large number of search result pages are getting crawled by search engines, it's recommended to stop search engines from accessing them in the first place using the robots.txt file.
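You can test a robots.txt rule before deploying it with Python's standard `urllib.robotparser` module. The sketch below assumes your search result pages live under `/search`, which is a hypothetical path, and `can_crawl` is a hypothetical helper name.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that blocks internal search result pages under /search.
ROBOTS_TXT = """\
User-agent: *
Disallow: /search
"""


def can_crawl(robots_txt: str, url: str, user_agent: str = "*") -> bool:
    """Check whether the given robots.txt content allows crawling of a URL."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

Here, `can_crawl(ROBOTS_TXT, "https://www.example.com/search?q=socks")` returns `False`, while regular pages remain crawlable. Keep in mind that a crawler that can't access a page also can't see a noindex directive on it, so the two measures don't stack on the same URL.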
Indexable staging/testing environment
It’s likewise a best practice to use staging environments for rolling out and testing new features on websites. But these are often incorrectly left accessible and indexable for search engines.
Use HTTP authentication to prevent access to staging/testing environments. An additional benefit is that you’re preventing the wrong people from accessing them too.
Avoid publishing work-in-progress content
When you create a new page that contains little content, save it without publishing it yet—often it will provide little to no value.
Save unfinished pages as drafts. If you do need to publish pages with limited content, prevent search engines from indexing them: use the meta robots noindex attribute.
Parameters used for tracking
Parameters are commonly used for tracking purposes too. For instance when sharing URLs on Twitter, the source is added to the URL. This is another source of duplicate content. A URL that was tweeted using Buffer, for example, gets such tracking parameters appended.
It’s a best practice to implement self-referencing canonical URLs on pages. If you’ve already done that, this solves the issue. All URLs with these tracking parameters are canonicalized by default to the version without the parameters.
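To illustrate what such a canonical URL looks like, here is a minimal Python sketch that removes tracking parameters from a URL. The list of parameter names is an assumption and not exhaustive, and `strip_tracking` is a hypothetical helper name. The sketch also drops the fragment, since that part is never sent to the server.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Assumption: parameters that are used for tracking only; adjust to your own setup.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"_ga", "gclid", "fbclid"}


def strip_tracking(url: str) -> str:
    """Return the URL without tracking parameters, keeping all other parameters."""
    parts = urlsplit(url)
    kept = [
        (name, value)
        for name, value in parse_qsl(parts.query, keep_blank_values=True)
        if not name.lower().startswith(TRACKING_PREFIXES)
        and name.lower() not in TRACKING_NAMES
    ]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(kept), ""))
```

For example, `strip_tracking("https://www.example.com/page/?utm_source=twitter&utm_medium=social&id=7")` returns `https://www.example.com/page/?id=7`.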
Sessions may store visitor information for web analytics. If each URL a visitor requests gets a session ID appended, this creates a lot of duplicate content, because the content at these URLs is exactly the same.
For example, when you click through to a localized version of our website, we add a Google Analytics session variable like https://www.contentking.nl/?_ga=2.41368868.703611965.1506241071-1067501800.1494424269. It shows the homepage with the exact same content, just on a different URL.
Once again—it’s a best practice to implement self-referencing canonical URLs on pages. If you’ve already done that, this solves the issue. All URLs with these tracking parameters are canonicalized by default to the version without the parameters.
When pages have a print-friendly version at a separate URL, there are essentially two versions of the same content.
Implement a canonical URL leading from the print friendly version to the normal version of the page.
Landing pages for paid search
Paid search requires dedicated landing pages that target specific keywords. The landing pages are often copies of original pages, which are then adjusted to target these specific keywords. Since these pages are very similar, they produce duplicate content if they are indexed by search engines.
Prevent search engines from indexing the landing pages by implementing the meta robots noindex attribute. In general, it’s a best practice to neither link to your landing pages nor include them in your XML sitemap.
Other parties copying your content
Duplicate content can also originate from others copying your content and publishing it elsewhere. This is a particular problem if your website has a low domain authority and the one copying your content has a higher domain authority. Websites with a higher domain authority often get crawled more frequently, resulting in the copied content being crawled first on the website of the one that copied it. They may now be perceived as the original author and rank above you.
Make sure that other websites credit you by both implementing a canonical URL leading to your page and linking to your page. If they’re not willing to do so, you can send a DMCA request to Google and/or take legal action.
Copying content from other websites
Copying content from other websites is a form of duplicate content too. Google has documented how to best handle this from an SEO point of view: linking to the original source, combined with either a canonical URL or a meta robots noindex tag. Keep in mind that not all website owners are happy with you syndicating their content, so it's recommended to ask for permission to use their content.
Using ContentKing, you can easily find duplicate content by checking whether your pages have a unique page title, meta description, and H1 heading. You can do this by going to the Issues section and opening the “Meta information” and “Content Headings” cards. See if there are any open issues regarding:
If you’ve got a small website, you can try searching in Google for phrases between quotes. For instance, if I want to see if there are any other versions of this article, I may search for “Using ContentKing, you can easily find duplicate content by checking whether your pages have a unique page title, meta description, and H1 heading.”
Alternatively, for larger websites you can use a service such as Copyscape. Copyscape crawls the web looking for multiple occurrences of the same or nearly the same content.
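To get a feel for what "nearly the same content" means, you can compare two texts yourself. The sketch below uses word shingles and Jaccard similarity, a common textbook approach to near-duplicate detection. It is not a description of how Copyscape or any search engine works, and the function names and the shingle size of 3 are assumptions.

```python
import re


def shingles(text, size=3):
    """Split text into a set of overlapping word sequences ("shingles")."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard_similarity(text_a, text_b, size=3):
    """Return a score from 0.0 (nothing shared) to 1.0 (identical shingle sets)."""
    a, b = shingles(text_a, size), shingles(text_b, size)
    if not a and not b:
        return 1.0  # Two empty texts are treated as identical.
    return len(a & b) / len(a | b)
```

Two product descriptions that only differ by one word, such as "blue socks for cold winter days" and "dark blue socks for cold winter days", share four of their five distinct shingles and score 0.8. Which threshold counts as duplicate content is a judgment call for your own site.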
If you didn’t intentionally copy someone’s website, then it’s very unlikely for you to get a duplicate content penalty. If you did copy large amounts of other people’s content, then you’re walking a fine line. This is what Google says about it:
“Duplicate content on a site is not grounds for action on that site unless it appears that the intent of the duplicate content is to be deceptive and manipulate search engine results. If your site suffers from duplicate content issues, and you don't follow the advice listed above, we do a good job of choosing a version of the content to show in our search results.”
Yes, because by fixing the duplicate content issues you’re telling search engines what pages they should really be crawling, indexing, and ranking.
You’ll also be preventing search engines from spending their crawl budget for your website on irrelevant duplicate pages. They can focus on the unique content on your website that you want to rank for.
There’s no one good answer to this question. However:
If you want to rank with a page, it needs to be valuable to your visitors and have unique content.