It has been about 16 years since Larry Page and Sergey Brin wrote “The Anatomy of a Large-Scale Hypertextual Web Search Engine”, the famous Backrub paper. After all these years I still find SEO bloggers writing complete nonsense about the paper. And there is really no excuse for that. Anyone can read the paper. You shouldn’t be relying on what other people have written about it over the years. So why is it so hard for people to get their facts straight?
Well, here is yet another blog post about PageRank, but I’ll see if I can’t keep the facts straight and provide some reliable sources.
The Backrub Paper is NOT About PageRank
Let’s see if we can set the record straight on this point. Only a relatively small portion of the paper actually deals with PageRank. Furthermore, the paper never associates PageRank with topic, topicality, or relevance. Here is an outline of the paper’s sections.
- 1. Introduction
- 1.1 Web Search Engines — Scaling Up: 1994 – 2000
- 1.2. Google: Scaling with the Web
- 1.3 Design Goals
- 1.3.1 Improved Search Quality
- 1.3.2 Academic Search Engine Research
- 2. System Features
- 2.1 PageRank: Bringing Order to the Web
- 2.1.1 Description of PageRank Calculation
- 2.1.2 Intuitive Justification
- 2.2 Anchor Text
- 2.3 Other Features
- 3 Related Work
- 3.1 Information Retrieval
- 3.2 Differences Between the Web and Well Controlled Collections
- 4 System Anatomy
- 4.1 Google Architecture Overview
- 4.2 Major Data Structures
- 4.2.1 BigFiles
- 4.2.2 Repository
- 4.2.3 Document Index
- 4.2.4 Lexicon
- 4.2.5 Hit Lists
- 4.2.6 Forward Index
- 4.2.7 Inverted Index
- 4.3 Crawling the Web
- 4.4 Indexing the Web (Parsing, Indexing Documents into Barrels, Sorting)
- 4.5 Searching
- 4.5.1 The Ranking System
- 4.5.2 Feedback
- 5 Results and Performance
- 5.1 Storage Requirements
- 5.2 System Performance
- 5.3 Search Performance
- 6 Conclusions
- 6.1 Future Work
- 6.2 High Quality Search
- 6.3 Scalable Architecture
- 6.4 A Research Tool
- 7 Acknowledgments
- References
- Vitae
- 8 Appendix A: Advertising and Mixed Motives
- 9 Appendix B: Scalability
- 9.1 Scalability of Google
- 9.2 Scalability of Centralized Indexing Architectures
Now here are some (almost) interesting factoids about this paper.
The term “PageRank” appears 35 times in a total of 10,471 words; 19 of those uses occur in sections 2.1, 2.1.1, and 2.1.2.
Location was a factor even in this earliest of Google algorithms: every hit list records the position of each word within the document.
Font Size was important: “Words in a larger or bolder font are weighted higher than other words”, according to section 2.3.
Relevance is only mentioned 3 times in the document, the first of which reads: “In particular, link structure [Page 98] and link text provide a lot of information for making relevance judgments and quality filtering. Google makes use of both link structure and anchor text (see Sections 2.1 and 2.2).” The third use reads: “Finally, the use of proximity information helps increase relevance a great deal for many queries.” I leave it to you to find out what the middle use actually relates to (it AIN’T PageRank).
Ranking is used a total of 12 times in the document, one of which is found in the “References” section. PageRank is described as a “quality ranking”. The paper describes the ranking system in two paragraphs:
Google maintains much more information about web documents than typical search engines. Every hitlist includes position, font, and capitalization information. Additionally, we factor in hits from anchor text and the PageRank of the document. Combining all of this information into a rank is difficult. We designed our ranking function so that no particular factor can have too much influence. First, consider the simplest case — a single word query. In order to rank a document with a single word query, Google looks at that document’s hit list for that word. Google considers each hit to be one of several different types (title, anchor, URL, plain text large font, plain text small font, …), each of which has its own type-weight. The type-weights make up a vector indexed by type. Google counts the number of hits of each type in the hit list. Then every count is converted into a count-weight. Count-weights increase linearly with counts at first but quickly taper off so that more than a certain count will not help. We take the dot product of the vector of count-weights with the vector of type-weights to compute an IR score for the document. Finally, the IR score is combined with PageRank to give a final rank to the document.
For a multi-word search, the situation is more complicated. Now multiple hit lists must be scanned through at once so that hits occurring close together in a document are weighted higher than hits occurring far apart. The hits from the multiple hit lists are matched up so that nearby hits are matched together. For every matched set of hits, a proximity is computed. The proximity is based on how far apart the hits are in the document (or anchor) but is classified into 10 different value “bins” ranging from a phrase match to “not even close”. Counts are computed not only for every type of hit but for every type and proximity. Every type and proximity pair has a type-prox-weight. The counts are converted into count-weights and we take the dot product of the count-weights and the type-prox-weights to compute an IR score. All of these numbers and matrices can all be displayed with the search results using a special debug mode. These displays have been very helpful in developing the ranking system.
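To make the first of those two paragraphs concrete, here is a minimal Python sketch of the single-word scoring it describes: count the hits of each type, convert each count into a tapering count-weight, take the dot product with a vector of type-weights, and combine the resulting IR score with PageRank. The hit types come from the paper; the specific weights, the taper, and the way the IR score and PageRank are combined are assumptions on my part, because the paper never publishes those numbers.

```python
# Hit types named in the paper; the weights here are illustrative guesses,
# not values Brin and Page ever published.
TYPE_WEIGHTS = {
    "title": 10.0,
    "anchor": 8.0,
    "url": 6.0,
    "plain_large_font": 3.0,
    "plain_small_font": 1.0,
}

def count_weight(count, cap=8):
    """Grow linearly with the count, then stop helping past a cap --
    a crude stand-in for the tapering the paper describes."""
    return min(count, cap) / cap

def ir_score(hits):
    """Dot product of count-weights and type-weights for one document
    and a single-word query. `hits` is a list of hit-type strings."""
    counts = {hit_type: 0 for hit_type in TYPE_WEIGHTS}
    for hit_type in hits:
        counts[hit_type] += 1
    return sum(count_weight(counts[t]) * weight for t, weight in TYPE_WEIGHTS.items())

def final_rank(hits, pagerank, ir_share=0.7):
    """The paper only says the IR score 'is combined with PageRank';
    the weighted mix used here is purely an assumption."""
    return ir_share * ir_score(hits) + (1.0 - ir_share) * pagerank

# A page with the query word in its title, in two anchors, and in small body text:
print(final_rank(["title", "anchor", "anchor", "plain_small_font"], pagerank=0.4))
```

The second paragraph extends the same idea to multi-word queries: the counts are kept per (type, proximity bin) pair rather than per type, and the dot product runs against type-prox-weights instead of type-weights.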
The Backrub Algorithm is NOT the Current Google Algorithm
Google completely redesigned the algorithm in 2001; that work is credited to Amit Singhal. Update Fritz came out in the summer of 2003, allowing Google to update the index continuously and ending the monthly “Google Dance”.
In December 2005 Google rebuilt its infrastructure with Bigdaddy, which allowed for faster and more extensive indexing of the Web. Although Matt Cutts implied that Bigdaddy was not aimed at improving the relevance and ranking algorithms, he solicited input from Webmasters on the quality of Bigdaddy’s results (presumably because it was crunching a lot more data than earlier lives of Doctor Google).
In 2009 Google rolled out the Searchology Update, which wasn’t an update so much as enhancements to the user interface and extensions to what could be searched. Perhaps the most lasting impact of Searchology 2009 was its introduction of “Rich Snippets”.
Starting in late 2009 and extending well into 2010, Google openly discussed its new Caffeine infrastructure. Caffeine went live in June 2010, providing “50 percent fresher results for web searches … and … the largest collection of web content” that Google had offered up to that time.
In February 2011 Google rolled out the Panda algorithm, which added a new quality score to the mix, one that for the first time since 1998 had a PageRank-like impact on the quality of Google’s search index.
In April 2012 Google rolled out the Penguin algorithm, which automated the process of downgrading sites for using on-page keyword stuffing and spammy links to influence Google’s rankings.
Finally, in August 2013 Google gave us “Hummingbird”, which rewrote much of what came before it.
It’s important to keep these game-changing algorithm updates in mind when looking at Google’s original design paper, because we’re not living with the same Google that the original paper described. That Google no longer exists, although it has many descendants.
It’s really not even appropriate to speak of “the algorithm”, because what we call the algorithm is truly a collection of many algorithms.
Domain Authority Is No Substitute for PageRank
One clear sign that the SEO community has lost its way in search engine algorithm analysis is the frequent mention of “Domain Authority” on blogs and forums. Domain Authority is a Moz thing, not a Google thing. It reflects a belief among many SEOs that Google somehow attaches importance to a domain, a value that can be quantified in some way. I used to write about “child inheritance” myself because of the way you could watch the Toolbar PageRank value decrease as you stepped down into a Website, folder-by-folder.
To me, child inheritance was more about how quickly new content on a previously established domain would rank well in Google versus how long it took new content on a new site to rank. What was being inherited, in my view, was the flow of PageRank, not some ghostly valuation attached to the domain name.
But there is the whole “trust” issue that Google has alluded to. In August 2006 Matt Cutts addressed a well-documented deindexing on his blog by noting that
By the way, it looks like the primary issue with the Windows Live Writer blog was the large-scale migration from spaces.msn.com to spaces.live.com about a month ago. We saw so many urls suddenly showing up on spaces.live.com that it triggered a flag in our system which requires more trust in individual urls in order for them to rank (this is despite the crawl guys trying to increase our hostload thresholds and taking similar measures to make the migration go smoothly for Spaces). We cleared that flag, and things look much better now.
So there it is, a funny expression: hostload thresholds. What does that mean? Is it “Domain Authority”? I don’t think so, but it’s something similar in concept.
In addition to Domain Authority you’ll see people talk about Trust Flow and some other non-Googley measurements. These metrics provide no value or insight into the workings of Google’s algorithms. They are, however, more reliable as secondary opinions of a Website’s link-based value than, say, what you or I could determine by scrolling through search results and lists of backlinks.
But you should understand that Domain Authority and Trust Flow and other non-Googley metrics are being manipulated by link spammers (just as PageRank is). What you don’t hear is how much anti-spam effort the companies behind these external metrics are putting into keeping the metrics reliable. Spammers boast some pretty high inflation rates on these metrics, and as independent metrics become more popular they will only be gamed more. I am sure the metrics publishers are aware of that, but they should probably pay more attention to what is going on.
Subdomains Do Not Hurt Your PageRank
PageRank does not stop flowing at the boundaries of your root domain. Nor does it stop flowing at the boundaries of your subdomain. And just as Google has had to debunk the idea that 301 redirects diminish PageRank, we are now seeing more SEO bloggers warn their readers against using subdomains.
This is dangerous, ill-informed advice. Subdomains are a perfectly fine option for most Websites. The fact that you’re publishing some content on a subdomain and some content on your root domain doesn’t mean you are splitting your PageRank. It makes no difference where you publish your content, because PageRank flows through your links. I have written extensively about subdomains for SEO through the years, and subdomains are NOT bad for SEO.
A subdomain can be penalized independently of a root domain. So what? So can a deep folder in your Website. Search engines have acknowledged that they can apply precisely targeted penalties to only small portions of otherwise well-behaved Websites. By the same token, if you have a site that has been penalized or downgraded you can use a subdomain to build new content that isn’t affected by whatever the search engine did to your main site. I don’t mean move your site to a subdomain; I mean develop a whole new site on your subdomain and work with that.
The next time you see an SEO blogger advise readers NOT to use subdomains for blogs or whatever, roll your eyes and move on.
PageRank Does Not Stop Flowing
You cannot hoard PageRank. Spammers still try to do that. Their blatant refusal to link out to other sites (or their insistence on using rel="nofollow" on such links) is a sure sign they don’t understand PageRank.
Unfortunately many large companies also refuse to link out to other sites. Their reasons may not be about PageRank. Many merchants to this day feel that they will lose business if they link to other sites. That’s a silly notion, but it’s one that will probably never die.
You cannot force people to remain on your Websites any longer than they wish, although I have seen some pretty blatant attempts to trap people on Websites with popups and other funky tricks. If your Web marketing is focused on preventing people or PageRank from leaving your site, then you must have no confidence in your ability to persuade them to do business with you.
People come and go. That is a fact of life. PageRank comes and goes. That is a fact of links.
No one should be thinking about how to keep the PageRank on their site. You can’t keep it, you don’t keep it, and you won’t keep it no matter how clever you think you are.
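Since the whole point of this post is that anyone can read the paper, here is a minimal Python sketch of the iterative PageRank calculation that section 2.1.1 describes, run on a made-up handful of URLs. The toy graph, the labels, and the iteration count are mine; d = 0.85 is simply the damping factor the paper suggests. Notice what the calculation actually reads: pages and the links between them. A page’s score is built from the links pointing to it, host names never enter into it (which also bears on the subdomain question above), and whatever flows out through your links was never yours to keep.

```python
# A minimal sketch of the iterative PageRank calculation from section 2.1.1:
#   PR(A) = (1 - d) + d * sum(PR(T) / C(T))
# summed over every page T that links to A, where C(T) is the number of links
# going out of T. The toy graph and iteration count are mine; d = 0.85 is the
# damping factor the paper suggests.

def pagerank(links, d=0.85, iterations=50):
    """`links` maps each URL to the list of URLs it links to."""
    pages = list(links)
    scores = {page: 1.0 for page in pages}
    for _ in range(iterations):
        new_scores = {}
        for page in pages:
            inbound = sum(scores[src] / len(links[src])
                          for src in pages if page in links[src])
            new_scores[page] = (1 - d) + d * inbound
        scores = new_scores
    return scores

# The calculation sees only nodes and edges. Whether a URL sits on a subdomain
# or in a folder changes nothing; only who links to whom matters.
toy_graph = {
    "example.com/":          ["example.com/about", "other-site.com/"],
    "example.com/about":     ["example.com/"],
    "blog.example.com/post": ["example.com/"],
    "other-site.com/":       [],
}

for url, score in sorted(pagerank(toy_graph).items()):
    print(f"{url}: {score:.3f}")
```

Run it and the scores settle after a few dozen iterations, determined entirely by the link graph.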
We Should Not Have to Write Articles Like This
Here in 2014 there should be so much good information on the Web about PageRank that no one should be confused about it. Unfortunately the search engines will rank nonsense just as quickly and easily as they will rank useful information. That’s a fact of search.
The marketer who ignores the nonsense has an advantage over the “great post” readers who don’t stop to check the statements of fact that SEO bloggers make.
It’s not hard to find the real facts about PageRank. You owe it to yourselves to do the research and read the legitimate sources OR to ignore all these discussions in the first place. Many a Website has built great success without thinking about PageRank. That is probably the most valuable lesson in SEO of all.