If there are multiple instances of the same document on the web, the highest authority URL becomes the canonical version. The rest are considered duplicates.
- Inbound links pointing towards duplicates are inverted towards the canonical URL
- This is called “link inversion”
Example
Let’s say you publish a document on your website today, and a few days later a few small scrapers copy your page. You’re the higher authority, so you still count as the canonical document. Your URL shows up in Google’s search, all other URLs are considered duplicates, and their links are counted towards your URL. So far so good. Now imagine that a week after that, Google picks up the exact same document on a website with higher authority than yours. What happens? You’re the duplicate now, and your inbound links count towards the new canonical URL of that document.
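To make the mechanics concrete, here’s a minimal Python sketch of the behaviour described above. The URLs, authority scores and link counts are made up for illustration, and “authority” stands in for whatever site-level signals Google actually uses; this is a toy model of the rule of thumb from the example, not Google’s algorithm.

```python
# Toy model of canonical selection and "link inversion".
# "authority" is a stand-in for whatever site-level signal Google really uses.

def pick_canonical(copies):
    """Return the URL of the highest-authority copy of a document."""
    return max(copies, key=lambda c: c["authority"])["url"]

def invert_links(copies, inbound_links):
    """Credit every copy's inbound links to the canonical URL."""
    canonical = pick_canonical(copies)
    total = sum(inbound_links.get(c["url"], 0) for c in copies)
    return canonical, total

# Day 1: you publish, then a couple of small scrapers copy the page.
copies = [
    {"url": "https://yoursite.com/post", "authority": 60},
    {"url": "https://scraper-a.com/post", "authority": 10},
    {"url": "https://scraper-b.com/post", "authority": 5},
]
links = {"https://yoursite.com/post": 40, "https://scraper-a.com/post": 3}
print(invert_links(copies, links))  # ('https://yoursite.com/post', 43)

# A week later: a higher-authority site publishes the exact same document.
copies.append({"url": "https://bigsite.com/post", "authority": 95})
print(invert_links(copies, links))  # ('https://bigsite.com/post', 43) -- your links now count towards it
```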
Background
Almost a decade ago, Google realised they were no longer meeting their users’ expectations. Their results were stale and lagging behind what was really happening on the web. So, in 2010, Frank Dabek and Daniel Peng tackled the problem of speed and freshness by retiring the traditional batch-based indexing system powered by MapReduce. They introduced a completely new concept which allowed them to transform large datasets progressively, through numerous small, independent mutations. They called it Percolator. Full paper: Large-scale Incremental Processing Using Distributed Transactions and Notifications.

Percolator & Caffeine
- Percolator is an incremental processing system which prepares web pages for inclusion in the live index
- Caffeine is a Percolator-based indexing system
Google officially announced it in June that year.

“We have built and deployed Percolator and it has been used to produce Google’s websearch index since April, 2010.”
Page 13, 5. Conclusion and Future Work
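As a rough illustration of what “incremental processing” means in practice, here’s a toy Python sketch of the observer-and-notification idea from the paper: a change to a single document triggers a small update to the index, instead of re-running a batch job over the whole corpus. The data structures and function names are invented for this sketch and bear no relation to Percolator’s actual implementation, which runs on Bigtable with distributed transactions.

```python
# Toy sketch of notification-driven incremental indexing: an observer runs
# whenever one document changes, so only that document is reprocessed.

index = {}      # term -> set of URLs; stands in for the live search index
observers = []  # callbacks fired on every document write

def register_observer(fn):
    observers.append(fn)

def write_document(url, text):
    """One small, independent mutation: a single page is (re)written."""
    for fn in observers:
        fn(url, text)

def update_index(url, text):
    """Observer: fold the changed page into the index immediately."""
    for term in set(text.lower().split()):
        index.setdefault(term, set()).add(url)

register_observer(update_index)

# Only the new page is processed; nothing else in the corpus is touched.
write_document("https://example.com/new-page", "fresh page about incremental indexing")
print(index["incremental"])  # {'https://example.com/new-page'}
```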
Ranking Complexity
Of course, things get a bit more complex when dealing with partial content, and there’s a multitude of other signals which may influence rankings, including content visibility, personalisation, location, device, timing, search context and intent. That said, Google works at scale, and ultimately their search quality team and engineers care about the end user first. Even if the publisher is somewhat disadvantaged in the process, that’s not as bad as it would be the other way around.

Nothing New
Two years after Caffeine was released, I demonstrated this feature in a controlled set of experiments, including Rand’s blog (with the permission of all included parties). As a reward, Google penalised me.

Scepticism
Whenever I run an experiment, there will always be people who tell me that it’s impossible to test Google because there are just too many variables. These are the people who would also have a hard time accepting they’re wet if I poured a bucket of water over their heads. One such bucket is the fact that “link inversion” isn’t some concept SEO people read about in a research paper which Google may or may not use in practice. When triggered, inverted links from other domains actually show up in your Search Console.

Follow-Up Tests
Before publishing this article I conducted a few quick tests and successfully took over as the canonical result every time. I used Search Console to submit the new page to the index and take over from the original content publisher. It took me 30 seconds. A week after the test, the links from the other domain showed up in my Search Console as if they were mine.

I have a client that has a PDF on their site. They are not the original business to feature it, many people are distributors for this product line. I noticed in GSC that they are credited with incoming links, b/c the PDF exists on other sites.
— John Locke (@Lockedown_) October 11, 2018
Other Factors
I recently ran another test to see whether content behind tabs and accordions would rank as well as content that’s visible. Google later stated otherwise and mocked me a little by calling me the “authority”. So, in the spirit of friendly banter, I took one of their “authority” assets (Google Scholar) that uses tabs and outranked its content while citing the source URL on my page.

Query: https://www.google.com.au/search?q=%22because+documents+with+the+same+title+are+often+considered+duplicates%22

Results
Note: One of my colleagues from Europe reports that he doesn’t see the Deyan link; instead, it’s filtered out as a duplicate. My guess is that geo-location may play a role here. I’m hoping Google won’t slap me with a penalty for this, because they’re literally asking for examples and I just made one to illustrate a situation where an original publisher may not rank as well as the one using their content. But sure, if they say it’s not the tabs on that page that enabled my little demonstration, I believe them. It would be nice to know exactly what did, though, so I can do my job and advise clients with confidence.