A Statistical and Experimental Analysis of Google's Florida Update
by Leslie Rohde, December 10, 2003
Since the new Google index took effect on November 15th there has
been a running dialog among webmasters and SEOs chronicling the changes and
theorizing what the new algorithm is or is intended to accomplish. In
all cases I have seen and every case forwarded to me by customers the
theories espoused have been either anecdotal or completely speculative.
But search engine ranking is a science, not an art,
and metaphor, speculation and anecdote just don't cut it.
Understanding SEO
demands a scientific approach where a hypothesis is advanced,
experiments are conducted, and the results analyzed. OptiLink was born of
precisely this process and has as its singular purpose the study of
modern search engine positioning.
This document summarizes my testing of the Florida
update using some four dozen example pages provided either by customers
or gleaned directly
from search results. I have also conducted a number of broad
statistical studies using my own tools augmented by scroogle.org, the
fine work of Daniel Brandt. The document ends with some final
thoughts and recommendations.
There are three things that made Google unique among search engines.
First, they have done a great job of system design to create an appliance that is truly scalable. Their success at providing superior results would be of no import without their first rate use of computer technology.
Second, the Page Rank (PR) algorithm, when rationally incorporated into the overall positioning mix, is an innovation that has very positively impacted the results they produce. No technique is perfect, but PR allows Google to avoid problems that still plague other engines.
Finally, the extensive use of citation analysis, or what I have
named Link Reputation, set Google apart in a major way. Without the
extent of link analysis that Google performs, it would not have
produced the superior results that made it a whirlwind success.
But no technique is perfect, and even the mighty Google can be
"manipulated".
There have been a number of published complaints recently about
people
manipulating Google results. This should not be surprising since Google
controls something more than 75% of all website traffic. Such a
concentration of power will
obviously be the target of study, reverse engineering, and ultimately
manipulation.
The pre-Florida Google is distinguished as a search engine that uses a small number of very powerful concepts. Inktomi, by way of contrast, uses a plethora of relatively weak ideas. This difference makes Google both easier to optimize for, and better at providing relevant results. Positioning at Google depends primarily upon:
Little else significantly impacts position, so if Google needed to "crack down" on some sort of abuse, the algorithm change must affect one or more of these criteria to be effective. The most likely of these by far is link text. Google is founded on the concept of "citation analysis," which in Web terms is just linking. Citation analysis, however, never considered the notion of "reciprocal citation" or webs of "papers" engaged in "mutual citation" - practices that are impossible in print, but common on the web. This type of linking is open to certain kinds of abuse which Google might indeed want to curb. This "bad linking" consists of these types:
My own view is that these are not problems that were so rampant as
to require a 30%+ change in the index - Google's opinion may indeed
differ from my own. If so, there are some relatively simple ways Google
could steer their existing (well, pre-Florida) algorithm to account for
these (so-called) "abuses":
Only the first effect can be supported from
current data as none of my examples show any loss of PR and none have
been banned. If they have modified the handling of links, it is only
the Link Reputation measure that has been discounted.
Of course, they could just layer a new algorithm on top of the old
one as a post-process or filter, leaving the original unchanged.
Anecdotally, the Florida update stinks (IMHO). For example, the search for "buy ambien" returns an internal page from linuxhq.com at position 16. This page has no connection with the search except for a paid link at the bottom of the page. The same page also ranks at 26 for "buy soma" for the same reason. These terms do not appear anywhere else on the page, nor do they appear in any link text referring to the page. This effect is new with Florida. The pre-Florida results do not list linuxhq.com for either term in the first 1000 results. Incidentally, Google returns the online pharmacy pillsfast.com as "similar to" this internal page at linuxhq. Now that's just silly.
Another example is fruit basket. The results make no sense at all
until you understand who the anime "fruits basket" characters are.
These several pages, some of them non-English, have no instances
of the search terms anywhere on the page, have almost no link text
matching the
search, and have lower PR than some lower positioned pages that are
actually on topic. The current top ranked page was listed at 751
pre-Florida.
But as bad as these examples are, they do not begin to show just how bad and how widespread the Florida change really is. Thanks to scroogle.org and a truly inexplicable "feature" introduced along with Florida, we can see what appears to be the pre-Florida results alongside the new Florida index. And oh what an ugly picture it makes.
Using the 4856 searches in the scroogle.org hitlist we find that the
average change in results measures 54. What this means is that 54 of the
top 100 results in Florida differ from pre-Florida. That is a 54% change in
Google's top 100 results!!
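As a rough illustration of how a figure like this can be computed, the sketch below counts, for each search, how many of the new top results were absent from the old top results, then averages across searches. This is a minimal sketch and not scroogle.org's actual method, which is not described here. The function names and the toy result lists are hypothetical.

```python
def percent_changed(old_results, new_results, depth=100):
    """Percentage of the new top `depth` results that were not in the old top `depth`."""
    old_top = set(old_results[:depth])
    new_top = new_results[:depth]
    if not new_top:
        return 0.0
    missing = sum(1 for url in new_top if url not in old_top)
    return 100.0 * missing / len(new_top)


def average_change(searches, depth=100):
    """Average the per-search change over (old_results, new_results) pairs."""
    changes = [percent_changed(old, new, depth) for old, new in searches]
    return sum(changes) / len(changes)


# Toy data (hypothetical): four results per search instead of one hundred.
old = ["a.example", "b.example", "c.example", "d.example"]
new = ["a.example", "x.example", "c.example", "y.example"]

print(percent_changed(old, new, depth=4))                  # 50.0
print(average_change([(old, new), (old, old)], depth=4))   # 25.0
```

Note that this measure ignores reordering within the top results; a page that moves from #1 to #99 counts as unchanged.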
Are we to believe that 54% of prior listings were spam? I use Google
extensively to research both work and hobby topics and my experience of
the pre-Florida results was very good. Also, as part of what I do in
the
design of software for webmasters, I do a lot of searches conceived
solely to
better understand the Google ranking algorithms. I cannot think of a
single case where half the results were spam, or lacked relevance. But
that's just me. Maybe Google's new notion of good listings is 54%
different than their prior
notion? How could this be? The results were already the best on the
planet, hands down, why would they choose to make such a large change
all at once?
Now, the scroogle hitlist is not a random sample since it is only us
webmasters entering our pet terms, but using the wordtracker 500 word
list paints a picture that isn't much better. The precise numbers I
computed for the
wordtracker lists are these:
Granted, these numbers are somewhat smaller than 54%, but is that
much
of an improvement? So the results are only a third changed instead of
half? That's a win??
By the way, keep in mind three significant problems with these
numbers:
The net effect of these source problems is probably a wash, leaving
our real Florida change at something between 30 and 50 percent. This is
completely without precedent. You could change the search engine
completely, by buying AltaVista let's say, and have a change in results
only slightly larger than this. To call this an "update" is
euphemistic.
If intentional, which I am still not convinced of, this is a complete
overhaul.
But wait, it gets even worse. The percentage change, 22, 54, or
whatever, only considers the first 100 results. The carnage goes way
deeper than that. In any rational "update" we should see a more or less
Gaussian distribution of position changes for the pages we look at. In
English, some pages should move a little, others should move a lot, and
others still somewhere in between. But what characterizes this update
is something almost binary in its effect. Almost every single page I
have under study moved from a top 10 position to somewhere deeper than
the 1000 result limit Google will let you see. These sites went from
top to gone literally overnight. In one case, the webmaster records
1000 unique visitors from Google on Friday and 70 (!) on Saturday.
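To make the "nearly binary" observation concrete, here is a minimal sketch that buckets each page's movement and tallies the buckets. The bucket thresholds (10 and 100 positions) and the sample positions are hypothetical, chosen only for illustration. A gradual update would spread pages across the buckets, whereas the pattern described above piles almost everything into "gone".

```python
from collections import Counter

VISIBLE_LIMIT = 1000  # deepest result position Google will display


def classify_move(before, after):
    """Bucket one page's movement; `after` is None when the page is no longer visible."""
    if after is None or after > VISIBLE_LIMIT:
        return "gone"
    shift = abs(after - before)
    if shift <= 10:        # hypothetical threshold
        return "small"
    if shift <= 100:       # hypothetical threshold
        return "medium"
    return "large"


def move_histogram(moves):
    """Tally movement buckets for a list of (position_before, position_after) pairs."""
    return Counter(classify_move(before, after) for before, after in moves)


# Hypothetical sample of (pre-Florida, post-Florida) positions.
moves = [(2, None), (5, None), (1, 3), (7, 404), (4, 60), (9, None)]
print(move_histogram(moves))
```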
In every single case these dropped pages have the following key attributes in common:
These pages run the gamut in
terms of size, style, how they are optimized, and how much they are
optimized. Given the wide range of pages impacted, together with the
huge impact Florida has had, it should be straightforward to find the
key feature or combination of features that Florida introduced. As it
happens, this is not the case.
It has been proposed that the new algorithm is applied only to
"commercial" searches or that the extent of change in the index is
related to the monetary value of the search. Sadly, there is some
considerable evidence to support this idea, but the correlation between
"cost" and "carnage" is far from perfect. Furthermore, this still raises
the question: in those cases where the new index has been substantially
overhauled, what algorithm is being used to rank the results that have
survived or appeared? The point of this work has been to try to answer
this question through direct measurements of the Google results and the
pages indexed therein.
There are two characteristics of search engines that guarantee that
it
is always (theoretically) possible to discover the positioning
algorithm.
First, there are way more searches and way more pages than there could be factors that affect positioning. This is a version of the many variable problem. So long as you have enough data points, you can solve for the large number of unknowns.
But more important than that, the search engine is not an arbitrary function, it has a well known purpose that guides us to look for the right sorts of relationships and skip other theoretically possible but non-purposeful relationships. Clearly, simpler algorithms are easier to reverse-engineer than are more involved algorithms. Google's pre-Florida algorithm was actually elegantly simple - post-Florida appears neither simple nor elegant.
Discovering the algorithm involves about 1% technique and 99% good ol' hard work.
We will measure whatever we can measure about top ranked pages, lower ranked pages, and the newly unranked pages and try to find features that differentiate one page from another. This is a discriminant, a function that discriminates a top ranking page from one that is not. For this "update" we are looking for one or more functions that are nearly binary in their influence on positioning. Small changes in ranking are of no interest in light of the large number of pages that have simply dropped from the results altogether.
The nature of search engines and their ultimate purpose will guide
us in what to look for, but it also allows us to make a useful
simplifying assumption. The core ranking algorithms that search engines
use have all been, and for practical reasons will probably continue to
be, composed as a linear combination of independent factors. Lapsing
once more into English, this means that positioning is computed where
each factor is multiplied by a "weight" and added together. This final
result is the "score" that determines where the page gets placed in the
results. If true,
this characteristic allows us to view each feature or factor as an
independent
influence.
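The linear-combination assumption can be sketched in a few lines. The factor names, weights, and page values below are entirely hypothetical, since the real factors and weights are not public. The point is only that a change in one factor moves the score by that factor's weight, whatever the other factors are.

```python
# Hypothetical weights; the real factors and their weights are not known.
WEIGHTS = {"page_rank": 3, "link_reputation": 2, "title_match": 1}


def score(factors, weights=WEIGHTS):
    """Linear combination: each factor multiplied by its weight, then summed."""
    return sum(weights[name] * value for name, value in factors.items())


# Hypothetical pages with made-up factor values.
page_a = {"page_rank": 6, "link_reputation": 40, "title_match": 1}
page_b = {"page_rank": 5, "link_reputation": 50, "title_match": 0}

print(score(page_a))  # 99
print(score(page_b))  # 115

# Independence: raising one factor by 1 adds exactly that factor's weight.
boosted = dict(page_a, page_rank=page_a["page_rank"] + 1)
print(score(boosted) - score(page_a))  # 3
```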
Owing to the nearly binary nature of the change Florida wrought, we
can focus our attention on finding one of four conditions as shown in
the table. If we can show that a proposed penalty feature is present
in pages dropped from the results while simultaneously being absent
from top ranked pages, we will have convincing evidence that the
penalty is indeed a factor. This does not prove the matter, but with a
sufficiently large number of pages that match this behavior, our
confidence approaches certainty. As is usually the case, disproving a
point is easier than its proof. We can readily discount most proposed
penalty features by simply finding a single case of a top ranked page
that exemplifies the feature. If the feature is absent from dropped
pages, this is likewise a counter argument to the hypothetical penalty.
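| | Dropped page | Top ranked page |
| Feature present | Supports the penalty | Contradicts the penalty |
| Feature absent | Contradicts the penalty | Supports the penalty |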
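The elimination logic described above can be sketched as a simple check. This is a minimal sketch with hypothetical page records and a hypothetical feature flag. A proposed penalty is discarded as soon as a top ranked page shows the feature or a dropped page lacks it, and a hypothesis that survives is merely consistent with the data, not proven.

```python
def check_penalty(has_feature, dropped_pages, top_pages):
    """Look for counterexamples to a proposed penalty feature."""
    top_with = [p["url"] for p in top_pages if has_feature(p)]
    dropped_without = [p["url"] for p in dropped_pages if not has_feature(p)]
    return {
        # Consistent only when no counterexample of either kind exists.
        "consistent": not top_with and not dropped_without,
        "top_pages_with_feature": top_with,
        "dropped_pages_without_feature": dropped_without,
    }


# Hypothetical pages; `regular_link_text` stands in for any proposed feature.
dropped = [
    {"url": "dropped-1.example", "regular_link_text": True},
    {"url": "dropped-2.example", "regular_link_text": False},
]
top = [{"url": "top-1.example", "regular_link_text": True}]

result = check_penalty(lambda p: p["regular_link_text"], dropped, top)
print(result["consistent"])                     # False
print(result["top_pages_with_feature"])         # ['top-1.example']
print(result["dropped_pages_without_feature"])  # ['dropped-2.example']
```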
Even in the best of cases this type of work is painfully tedious, but automated software tools can make it at least bearable. Doing the research presented here by hand would take many weeks. With OptiLink and some additional custom-built tools, it took "merely" days. If you are already an OptiLink licensee, then you will recognize many of the techniques presented below, and can reproduce a number of these results using your own copy of the software. Some of the more involved proposals did require modification of the existing software and construction of additional tools. Many of the new features added to OptiTools for these analyses may have continuing value, so they will be rolled into the production copy of the software later this month.
Summary: The use of a large number of highly similar text links that use text contained in the title tag activates the penalty.
Example search: search engine optimization. Changed 41% by Florida.
Procedure: OptiLink
was run on each of the top 10 results for the
search and the results tabulated below. These are compared to the page
at #404 that was ranked at #2 pre-Florida. Some of the results
currently shown moved substantially during the three weeks immediately
following Florida. This particular snapshot was done December 8, 2003.
Notice how well the link text and page title of the top ranked page are
optimized. See also the relatively predictable arrangement of Page
Rank, Topic, Reputation, and Title in these results. This is what we
expect to see when using OptiLink.
But then there's that last row. This page clearly does not belong in
that position when only the data shown are considered. Moreover, the
link text and title for this dropped page are far less optimized than
those of the current top results. This example directly contradicts the
proposed penalty.
| Position | URL | Popularity | Page Rank | Topic | Reputation | Title |
| 1 | www.seoinc.com | 556 | 7 | 7, 7, 5 | 93, 93, 92 | Search Engine Optimization Search Engine Placement |
| 2 | www.bruceclay.com/web_rank.htm | 160 | 7 | 2, 2, 1 | 69, 69, 60 | bruceclay.com - Search Engine Optimization, Ranking, Web Site Promotion, Free Advice and Placement Services |
| 3 | hotwired.lycos.com/webmonkey/01/23/index1a.html | 93 | 6 | 2, 1, 1 | 46, 39, 39 | |
| 4 | www.submit-it.com/subopt.htm | 171 | 6 | 5, 4, 4 | 26, 23, 14 | |
| 5 | www.submittoday.com | 289 | 6 | 7, 6, 5 | 43, 43, 43 | Search Engine Optimization and Search Engine Submission Services - Submit Today |
| 6 | www.scrubtheweb.com/abs | 170 | 6 | 4, 2, 2 | 1, 0, 0 | Search Engine Optimization and Site Promotional Tools |
| 7 | www.websitepromotionsoft.com | 129 | 6 | 3, 2, 2 | 40, 40, 40 | Web Site Promotion - Meta Tag Keyword Generator - Search Engine Optimization Tool |
| 8 | www.topseo.com | 210 | 5 | 9, 7, 3 | 35, 35, 35 | Search Engine Optimization Search Engine Submission Ranking - Placement - Positioning - Registration |
| 9 | searchenginewatch.com | 11900 | 8 | 7, 3, 0 | 16, 16, 0 | Search Engine Watch: Tips About Internet Search Engines & Search Engine Submission |
| 10 | www.positionresearch.com | 101 | 5 | 6, 4, 2 | 66, 66, 65 | Search engine optimization company for top search engine placement - Position Research |
| 404 | www.positioned1.com | 14500 | 8 | 3, 3, 2 | 58, 58, 23 | Search Engine Optimization Company - Search Engine Ranking |
Conclusion: There is no penalty imposed for highly regular link text that matches the title text of the target page.
Summary: The Link Reputation effect of having multiple links from the
same Class-C IP block as the target page is discounted. A PageRank
effect could also be imposed, but no PR decrease has been observed in
any of the dropped pages, so only a Link Reputation effect is
considered here. This penalty would also discount multiple links from
the same page or same domain. The simple form of this penalty would
discount links from the same IP, that is, internal linking, but simple
cursory searches show that this is not the case, so the somewhat more
advanced version of this penalty is the only workable hypothesis.
Internal links are counted fully, and external links from a different
IP block are counted fully, with only the external links from within
the same Class-C IP block being discounted.
Example search: cell phones. Changed 75% by Florida.
Procedure: For this experiment we can start by reusing the results
from the test of hypothesis 1. Employ the Domains view in OptiLink to
inspect the IP addresses of the linking pages for the top ranked page
for cell phones. The table summarizes these numbers. This page ranks
almost entirely on the basis of same Class-C linking. The one possible
exception we can suggest here is the use of sub-domains. It might be that the use of
sub-domains will qualify these external Class-C links as internal
links, and therefore not trigger the penalty.
| Link Source | Link Count |
| Internal Links | 1026 |
| Same Class-C | 696 |
| Different Class-C | 120 |
| Total | 1842 |
A second round of tests was conducted using a modified copy of
OptiLink that adds
the ability to recompute the Compare view excluding
linking from within the same Class-C block of IP's. A variety of
dropped pages were then used as examples. Many of the dropped pages do
have significant same Class C linking, but just as many do not. This is
what we should expect of a non-correlated feature.
This procedure was also applied to the cell phones #1 site. The Link
Reputation from all sources is 99%, whereas the Link Reputation
discounting same Class C links is only 1%. If the same Class C links
were discounted, this page could not possibly compete with the other
pages in the top 10.
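The link-source breakdown used in this test can be approximated with a short script. This is a minimal sketch, not OptiLink's implementation. It assumes a "Class-C block" means the first three octets of an IPv4 address (a /24) and that "internal" means the linking page is on the same host name as the target, so sub-domains count as external here. The host names and addresses are hypothetical documentation values.

```python
import ipaddress
from collections import Counter


def same_class_c(ip_a, ip_b):
    """True when both IPv4 addresses share the same first three octets (/24)."""
    block = ipaddress.ip_network(f"{ip_a}/24", strict=False)
    return ipaddress.ip_address(ip_b) in block


def classify_link(link, target_host, target_ip):
    """Label one inbound link as internal, same Class-C, or different Class-C."""
    if link["host"] == target_host:   # assumption: same host name means internal
        return "internal"
    if same_class_c(target_ip, link["ip"]):
        return "same_class_c"
    return "different_class_c"


def tally(links, target_host, target_ip):
    """Count inbound links per source category."""
    return Counter(classify_link(link, target_host, target_ip) for link in links)


# Hypothetical linking pages.
links = [
    {"host": "www.example.com", "ip": "192.0.2.10"},
    {"host": "shop.example.net", "ip": "192.0.2.77"},
    {"host": "news.example.org", "ip": "198.51.100.5"},
]
counts = tally(links, "www.example.com", "192.0.2.10")
print(counts["internal"], counts["same_class_c"], counts["different_class_c"])  # 1 1 1
```

Recomputing a reputation measure with the "same_class_c" links removed, and comparing it with the unfiltered value, mirrors the comparison made for the cell phones #1 site.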
Conclusion: Linking within the same Class-C block of IP's is not
being discounted.
Summary: Sites that link to one another will have the links between their pages discounted. Only a Link Reputation discount is considered since no PR reduction is observed in any of the dropped pages.
Example search: baby furniture. Changed 72% by Florida.
Example page: www.baby-furniture-4-u.com
Procedure: Pre-release copies of OptiLink and OptiSpider were
used in
this analysis. OptiSpider was used to spider the example page and report
the domains for all the pages the site links to. A Link Reputation
analysis was then run on the example page using OptiLink with the
domain list from OptiSpider used to exclude links from these domains. The
resulting Link Reputation was then compared to the Link Reputation
without this filtering of linking partners. There are more than 1000
links to the example page and only 100 of them come from linking
partners so this is a relatively minor effect. Moreover, the Link
Reputation with and without these partners is nearly unchanged. This is
not nearly a big enough effect to explain the movement of this page
from #1 to nowhere in sight.
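The with-and-without comparison can be sketched as follows. This is a minimal sketch under a loud assumption: Link Reputation is approximated here as the percentage of inbound links whose link text contains the search phrase, which is a stand-in and not OptiLink's actual formula. The link records and partner domains are hypothetical.

```python
def link_reputation(links, phrase):
    """Assumed stand-in: percent of inbound links whose text contains the phrase."""
    if not links:
        return 0.0
    hits = sum(1 for link in links if phrase.lower() in link["text"].lower())
    return 100.0 * hits / len(links)


def exclude_partners(links, partner_domains):
    """Drop inbound links that come from domains the target site links out to."""
    return [link for link in links if link["domain"] not in partner_domains]


# Hypothetical inbound links to the example page.
inbound = [
    {"domain": "a.example", "text": "Baby Furniture"},
    {"domain": "b.example", "text": "baby furniture store"},
    {"domain": "c.example", "text": "click here"},
    {"domain": "d.example", "text": "discount baby furniture"},
    {"domain": "partner1.example", "text": "baby furniture"},
    {"domain": "partner2.example", "text": "Baby Furniture 4 U"},
    {"domain": "partner3.example", "text": "cribs"},
    {"domain": "partner4.example", "text": "best baby furniture"},
]
# Hypothetical list of domains the example site links to (its linking partners).
partners = {"partner1.example", "partner2.example",
            "partner3.example", "partner4.example"}

print(link_reputation(inbound, "baby furniture"))                               # 75.0
print(link_reputation(exclude_partners(inbound, partners), "baby furniture"))   # 75.0
```

When the two numbers are nearly the same, as in this toy data, discounting reciprocal links cannot account for a large drop in position.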
Conclusion: The example used here is but one of numerous examples which likewise illustrate the lack of correlation between reciprocal linking and Florida position.
Summary: Google is using the Hilltop algorithm for high-frequency queries.
Example search: free cell phone. Changed 83% by Florida.
Procedure: A set of "expert pages" is required for this analysis. The members of this set cannot be known for certain, but they should theoretically rank relatively well for the example search and/or closely related searches. To gather this set some prerelease software was modified. This list was then used as the set of candidate linking pages in a production copy of OptiLink. The "expert set" used for this experiment consists of the 1700 unique pages that are returned for the three queries free cell phone, cell phone, and cell phones.
The current top page, www.longdistanceworld.com/cellular-phones/, is not linked at all from within the chosen expert set, whereas one of the pages dropped from the results is. This is not very encouraging. It is possible that the expert set was simply chosen incorrectly, but both the Hilltop and LocalRank algorithms should be placing some of the expert pages in the results selected.
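The expert-set check can be sketched like this. It is a minimal sketch of the procedure as described, not of Hilltop itself: the expert set is taken to be the union of unique result pages for the related queries, and we then ask which of those pages link to a given target. The URLs and the outlink map are hypothetical.

```python
def build_expert_set(results_by_query):
    """Union of the unique pages returned for each related query."""
    experts = set()
    for urls in results_by_query.values():
        experts.update(urls)
    return experts


def experts_linking_to(target, experts, outlinks):
    """Expert pages whose outbound links include the target URL."""
    return sorted(page for page in experts if target in outlinks.get(page, ()))


# Hypothetical result lists for the three queries used in the experiment.
results_by_query = {
    "free cell phone": ["e1.example/a", "e2.example/b"],
    "cell phone": ["e2.example/b", "e3.example/c"],
    "cell phones": ["e3.example/c", "e4.example/d"],
}
# Hypothetical map from expert page to the URLs it links to.
outlinks = {
    "e1.example/a": {"dropped.example/"},
    "e3.example/c": {"other.example/"},
}

experts = build_expert_set(results_by_query)
print(len(experts))                                               # 4
print(experts_linking_to("top.example/", experts, outlinks))      # []
print(experts_linking_to("dropped.example/", experts, outlinks))  # ['e1.example/a']
```

In this toy data the top page has no expert links while a dropped page has one, which is the same pattern that counts against the hypothesis in the measured results.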
Moreover, there are some other hints that the proposed algorithms are not responsible for these results. As an example, result #519 is the page http://www.dailynorthwestern.com/vnews/display.v/SEC/CITY. This is an internal page for a university newspaper that is not about cell phones, free or otherwise. The only relationship it has with the search terms is a sponsored link with the link text "Free cell phones : Cell phone" (view the cache, as the link has since changed). The target of this link is a page that is about free cell phone plans, but it is one of the pages dropped from the results.
Conclusion: Local re-ranking of results either within the result set or via a fixed set of expert pages does not appear to explain Florida results.
There are a number of experiments conceived, but not yet done. These
involve either extensive changes to existing software, or the wholesale
creation of new programs. This research will be on-going until the
Florida algorithm is understood.
There is some evidence that Florida is doing some form of topic comparison between linking pages. Previously, only the link text was considered in the association between linking page and target page. This may have changed, so I am currently building software to study it.
It may also be that Florida has dramatically increased the
importance of on-page analysis. The evidence of this is not very
solid, but it is worth investigating. Doing this would allow Google to
"back-door" the notion of a "hub" by just increasing the importance of
the link text that points to other pages - the conceptual inverse of
Link Reputation. I have seen a number of results that could be
accounted
for in this way.
Finally, there are several more complex explanations of Florida
changes that involve the analysis of a page's linking topology. Such
measurements could certainly be made during the
computation of Page Rank and they could indeed have far reaching
ranking impacts that
would be difficult to explain with existing tools.
When I first saw Florida, I decided it was a bug, or just a really bad decision, and would be rolled back before any of us could figure it out. Ah, but if only that were true. It appears instead that Google is sticking with this one, despite its many warts. So, here are my best recommendations on how to live with Florida.
Complain to Google - ask them "what you did wrong" and why you are
no longer a favored result. This might be about like shouting into a
well, but you certainly won't get an answer if you don't at least ask.
If you can, take an early holiday and call me next year. This is what I told one of my long time customers and clients on November 18th, and I don't yet have any cause to say different. If your sites have done as well as his, then you should consider doing the same. If you just can't stand that much vacation, then by all means keep changes to your existing sites modest. Making big changes with no real idea about where to go is thrashing, not optimizing.
Optimization is not dead. The Florida "update" is indeed an extreme
change, and we will all be flying a bit blind for a few weeks, but this
is just a storm, not the Apocalypse. We will eventually (sooner than
later, I think) discover what the new algorithm expects and build tools
and methods to help us to again consistently achieve top ranking.
SEO is ultimately an "arms race." The most that any search engine company can succeed in doing is weeding out the lesser players and closing up the simple loopholes. For those that are willing and able to make the investment in time and tools, SEO will not die until five minutes after the last search engine expires. The grand irony of making SEO more difficult, is that while the number of successful SEOs may indeed diminish, those that survive are stronger and exponentially more difficult to defeat.
Link text is neither dead nor turned off. Link Reputation analysis
is still valuable, but there do appear to be other factors at work as
well. If
there really is an over-optimization penalty (OOP), then precise analysis of link text is even more
important now than it was before. Pre-Florida, you could not have "too
much" Link Reputation. Post-Florida, I'm not 100% sure, and I will not
be sure until I have a set of factors that yield a high numerical
correlation with ranking. This is precisely the approach I took with
the initial design of OptiLink,
and is the same plan I have embarked on
to incorporate the changes introduced by Florida. This will obviously
involve additions and modifications to the OptiTools products.
I am developing a number of interim recommendations for how to use
the OptiLink and OptiSpider
outputs until Florida is better understood.
These guidelines will be posted to the FAQ just as soon as possible.
Making use of the ranking benefits of Link Reputation still requires the
same tools and techniques, but the rules for how to use the analysis
may have changed.
All sizes and styles of website design have been more or less
equally hit, so you can stay with whatever approach you favor. If you
are starting from scratch today, I would recommend a network of smaller
sites rather than a single larger site. Having your one single large
site
plummet in ranking is devastating to the bottom line. With a network of
smaller sites however, some may survive major algorithm changes while
others do not, doing less total damage. This assumes of course that the
sites are registered, hosted, and
interlinked correctly. The best practices known are described in
Michael Campbell's Revenge
of the Mininet ebook.
Finally, stay the course. Don't throw away everything you learned about optimization just because of the Florida "overhaul". Be assured that Google did not throw away everything they know about ranking pages, so the majority of good SEO practice has likely remained unchanged. Once we understand Florida better, we will probably see additions to our SEO techniques, but not big modifications.