Wednesday, October 14, 2009

Google indexing and how I think it works

These ideas came about after a lot of research into data replication for a client and from listening to search experts from Google and Microsoft. Obviously there is a lot of guesswork in there as well.

I could be wrong, hey I could be right, we will probably never know.


I started trying to answer this question about duplicate content at the Loving Tech forums - http://www.lovingtech.net/forums/thread-duplicate-content-on-article-sites-and-your-site - and my answer slowly ballooned into more of a tech article than a quick reply.


There are hundreds (maybe thousands) of Googlebots out there trawling pages to generate a massive list of URLs. For example, one of my sites has had 400 hits from Googlebot in 12 hours today, from about 8 different IP addresses, which I take to mean 8 different datacenters.
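If you want to check the same thing on your own site, here is a minimal Python sketch that counts Googlebot hits per IP address. It assumes your access log has the client IP as the first field and the user agent somewhere on the line; the sample log lines, IPs and paths are made up for illustration.

```python
from collections import Counter

# Hypothetical access-log lines; the IPs are documentation addresses,
# not real Googlebot addresses.
SAMPLE_LOG = """\
192.0.2.1 - - [14/Oct/2009:08:01:02 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.2 - - [14/Oct/2009:08:05:10 +0000] "GET /about HTTP/1.1" 200 734 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
192.0.2.1 - - [14/Oct/2009:09:12:45 +0000] "GET /blog HTTP/1.1" 200 901 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"
198.51.100.7 - - [14/Oct/2009:09:13:00 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (Windows NT 6.0)"
"""

def googlebot_hits_by_ip(log_text):
    """Count log lines whose user agent mentions Googlebot, grouped by client IP."""
    hits = Counter()
    for line in log_text.splitlines():
        if "Googlebot" not in line:
            continue  # skip ordinary visitors
        ip = line.split(" ", 1)[0]  # first field is the client IP
        hits[ip] += 1
    return hits

hits = googlebot_hits_by_ip(SAMPLE_LOG)
for ip, count in hits.most_common():
    print(ip, count)
```

Bear in mind that a user agent can be faked, and that a different IP address does not prove a different datacenter; that part is my guess.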

Google has hundreds of datacenters around the globe, so when we do a search we get the 'best' DC for us (the nearest, the least busy, etc.), and these DCs can each have a DIFFERENT set of results to look into for us.

Don’t believe me? Go to http://www.seochat.com/seo-tools/multiple-datacenter-google-search/ and do a search for [google datacenter] (remove the []), select 35 from the dropdown, and on the results page you will see each datacenter's results for the search term. Scroll down and you will see that the results are roughly the same but the number of results is different:

Results 1 - 10 of about 7,890,000

Results 1 - 10 of about 7,900,000 etc

Why the difference? – now the fun stuff

Each Googlebot stores its list of found URLs in its own datacenter. From time to time this massive list is taken offline and the pages are cached locally so they can be crunched by a load of servers inside this DC. This builds a new search index that the DC will use to show us the results for our search. Make sense?

Doing it this way makes real sense from an IT/hardware point of view: it means you can do this massive amount of data analysis at any time without affecting the user who is trying to search for a free-delivery pizza in their area.
Because the pages are cached locally in the DC, they can be compared to older versions, instantly be seen as new pages, flagged as duplicates, etc.
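To make the "compare against the cached copy" idea concrete, here is a rough Python sketch. It is only my guess at the general shape, not how Google does it: it fingerprints each page with a hash after normalising case and whitespace, and the function names (`fingerprint`, `classify`) and URLs are made up for the example.

```python
import hashlib

def fingerprint(html):
    """Hash of the page text, ignoring case and whitespace differences."""
    normalised = " ".join(html.lower().split())
    return hashlib.sha1(normalised.encode("utf-8")).hexdigest()

def classify(url, html, previous, seen):
    """Label a freshly cached page.

    previous: url -> fingerprint stored from the last crawl
    seen:     fingerprint -> first url with that content in this batch
    """
    fp = fingerprint(html)
    if fp in seen and seen[fp] != url:
        return "duplicate of " + seen[fp]
    seen.setdefault(fp, url)
    if url not in previous:
        return "new"
    return "unchanged" if previous[url] == fp else "changed"

# Hypothetical data: one page we already know, then a small batch to classify.
previous = {"http://a.example/": fingerprint("Hello world")}
seen = {}
r1 = classify("http://a.example/", "Hello   World", previous, seen)
r2 = classify("http://b.example/copy", "hello world", previous, seen)
r3 = classify("http://c.example/", "Something else", previous, seen)
print(r1, "|", r2, "|", r3)
```

An exact hash only catches identical text, so a real system would need something fuzzier for near-duplicates, but it shows why having the pages sitting locally makes this kind of check cheap.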

It also means that the new index can be tested with

Obviously this is where the magic happens – throw in some Google algorithms, hop on one foot, do the secret handshake, close one eye, do a little dance, make a little love – you get the picture. Out pops an update to the Google index, still steaming and ready to be pushed to the live index.

All of this newly indexed data is made available to the DC and is mixed in with the old data, hence the number of results can be different across the DCs.

The big data replication bit comes next. This is a data nightmare when you think about the volume: I mean millions or billions of rows of data that have to be checked, etc.

On a scheduled basis every DC will slowly replicate its data and propagate its changes to every other DC. But because this happens completely independently of the offline data crunching, the DCs will never be 100% the same unless everyone on the internet stops updating their sites for X amount of time to give Google a chance to catch up, and I doubt this will ever happen.
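Here is a toy Python simulation of that catch-up problem. Everything in it is an assumption made for illustration: four DCs in a ring, each pushing only to its next neighbour once per round, and newest-version-wins merging. It says nothing about how Google really replicates data.

```python
def merge(target, source):
    """Copy over any URL whose version in source is newer than in target."""
    for url, version in source.items():
        if version > target.get(url, 0):
            target[url] = version

def simulate(num_dcs, rounds_with_updates, quiet_rounds):
    """Return one True/False per round: were all DCs identical at the end of it?"""
    dcs = [dict() for _ in range(num_dcs)]
    in_sync = []
    for r in range(rounds_with_updates + quiet_rounds):
        if r < rounds_with_updates:
            # One DC finishes an offline crunch and publishes a fresh page.
            dcs[r % num_dcs]["page-%d" % r] = r + 1
        # Slow replication: each DC pushes its state to the next DC only.
        snapshot = [dict(d) for d in dcs]
        for i, data in enumerate(snapshot):
            merge(dcs[(i + 1) % num_dcs], data)
        in_sync.append(all(d == dcs[0] for d in dcs))
    return in_sync

# Six rounds where new pages keep arriving, then four rounds of silence.
in_sync = simulate(4, 6, 4)
print(in_sync)
```

In this toy model the DCs only line up a couple of rounds after the updates stop, which is the "everyone stops updating for X amount of time" case.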


So my answer to the question:

One thing that makes me think duplicate content is tricky to track is that there is no way the ‘real first’ copy of any content can be identified: every Googlebot talks to a different DC and spiders pages at different times, and those pages are stored locally and indexed at different times too.
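A tiny Python illustration of why "who was first" depends on which DC you ask. The DC names, URLs and crawl times are all invented for the example.

```python
# Hypothetical first-crawl times (hours after the original was published),
# as recorded by two different datacenters.
first_seen = {
    "dc-east": {"original.example/post": 30, "copycat.example/post": 12},
    "dc-west": {"original.example/post": 5, "copycat.example/post": 20},
}

def apparent_original(seen_times):
    """The URL this DC crawled first looks like the 'original' to that DC."""
    return min(seen_times, key=seen_times.get)

verdicts = {dc: apparent_original(times) for dc, times in first_seen.items()}
for dc, url in verdicts.items():
    print(dc, "thinks the original is", url)
```

Here dc-east crawled the copy before the original, so going purely by crawl time it would credit the wrong site.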

The problem with duplicate content is that EVERY page fights against EVERY other page for ranking, so if you are posting the same content across multiple sites (not a good idea), the site that Google sees as the most appropriate will probably get the higher rankings.

This is not an exact science so anything could happen, but I think the best bet is to play it safe and try to avoid duplicate content.

More stuff to get the brain working:

Floating DCs http://news.cnet.com/8301-11128_3-10034753-54.html

DC in a box http://perspectives.mvdirona.com/2009/04/01/RoughNotesDataCenterEfficiencySummitPosting3.aspx
