Tuesday, October 27, 2009

PayPal Buy Now button on an asp.net page

This problem has annoyed me for a very long time. You can add a PayPal Buy Now button to any website: you log in to your PayPal account, fill in a few boxes and presto, out spits a few lines of HTML to get you going.


<form action="https://www.paypal.com/cgi-bin/webscr" method="post">
<input type="hidden" name="cmd" value="_s-xclick">
<input type="hidden" name="hosted_button_id" value="9225183">
<input type="image" src="https://www.paypal.com/en_US/GB/i/btn/btn_buynowCC_LG.gif" border="0" name="submit" alt="PayPal - The safer, easier way to pay online.">
<img alt="" border="0" src="https://www.paypal.com/en_GB/i/scr/pixel.gif" width="1" height="1">
</form>

Which, when added to my page directly (or via my CMS), just doesn't work. I won't go into all the reasons right now, but the short version is sketched below.
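My guess at the usual culprit: a WebForms page already wraps all of its content in a single <form runat="server">, so pasting PayPal's snippet in gives you a form nested inside a form. That isn't valid HTML, the browser generally ignores the inner form tag, and the Buy Now button ends up posting back to my own page rather than to PayPal. A rough sketch of the situation, with placeholder page content:

<%@ Page Language="C#" %>
<html>
<body>
    <!-- ASP.NET WebForms puts the whole page inside its one server-side form -->
    <form id="form1" runat="server">

        ...rest of the page / CMS content...

        <!-- PayPal's snippet pasted here becomes a nested <form>, which browsers won't honour -->
        <form action="https://www.paypal.com/cgi-bin/webscr" method="post">
            ...
        </form>

    </form>
</body>
</html>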

After many hours of trying to find a simple solution so my clients are able to add these ad hoc buttons via the CMS, I realised I was trying to overcomplicate things by writing loads of code and generally wasting precious development time.
Taking a few steps back and going through the whole process again made me realise that when you create these buttons on PayPal there is a second tab above the HTML window labelled 'Email'. When you view that tab you get this:

https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&hosted_button_id=123456789

which you can email out to people so they can pay for an item. Can you see what it is yet?


SOLUTION:

Simply add an anchor that uses the 'Email' link as its href, with the PayPal button image inside...

<a href="https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&amp;hosted_button_id=9225183" target="_blank"> <img border="0" alt="" src="https://www.paypal.com/en_US/GB/i/btn/btn_buynowCC_LG.gif" /></a>
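If you would rather drop the button in from the markup than hand-edit HTML, the same idea works as a standard HyperLink control on a WebForms page. A minimal sketch (the control ID is just a name I picked; the URL and hosted_button_id are the example ones from above):

<asp:HyperLink ID="lnkBuyNow" runat="server"
    NavigateUrl="https://www.paypal.com/cgi-bin/webscr?cmd=_s-xclick&amp;hosted_button_id=9225183"
    Target="_blank"
    ImageUrl="https://www.paypal.com/en_US/GB/i/btn/btn_buynowCC_LG.gif"
    ToolTip="PayPal - The safer, easier way to pay online." />

Because this renders as a plain <a> with an <img> inside it, it sits happily inside the page's server-side form with no nested-form problem.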

Sometimes the answer is staring you right in the face.

Wednesday, October 14, 2009

Google indexing and how I think it works

These ideas came about after a lot of research into data replication for a client and from listening to search experts from Google and Microsoft; obviously there is a lot of guesswork in there as well.

I could be wrong, hey I could be right, we will probably never know.


I started trying to answer this question at the Loving Tech forums - http://www.lovingtech.net/forums/thread-duplicate-content-on-article-sites-and-your-site - about duplicate content, and my answer slowly ballooned into more of a tech article than a quick reply.


There are hundreds (maybe thousands) of Google bots out there trawling pages to generate a massive list of URLs. Example: one of my sites has had 400 hits from Googlebot today in the space of 12 hours, from about 8 different IP addresses, so 8 different datacenters.

Google has hundreds of datacenters around the globe, so when we do a search we get the 'best' DC for where we are, the least busy one, etc., and these DCs can have a DIFFERENT set of results to look into for us.

Don't believe me? Go to http://www.seochat.com/seo-tools/multiple-datacenter-google-search/ and do a search for [google datacenter] (remove the []), select 35 from the dropdown, and on the results page you will see each datacenter's results for the search term. Scroll down and you will see that the results are sort of the same but the number of results is different:

Results 1 - 10 of about 7,890,000

Results 1 - 10 of about 7,900,000 etc

Why the difference? – now the fun stuff

Each Googlebot stores its list of found URLs in its own datacenter, and from time to time this massive list is taken offline and the pages are cached locally so they can be crunched by a load of servers inside that DC. This builds a new search index that the DC will use to show us the results for our search. Make sense?

Doing it this way makes real sense from an IT/hardware point of view: it means you can do this massive amount of data analysis at any time without affecting the user who is trying to search for a free-delivery pizza in their area.
Because the pages are cached locally in the DC, they can be compared to older versions, instantly spotted as new pages, flagged as duplicates, etc.

It also means that the new index can be tested before it gets anywhere near the live one.

Obviously this is where the magic happens – throw in some Google algorithms, hop on one foot, do the secret handshake, close one eye, do a little dance, make a little love – you get the picture. Out pops an update to the Google index, still steaming, ready to be pushed to the live index.

All of this newly indexed data is made available to the DC and mixed in with the old data – hence the number of results can be different across the DCs.

The big data replication bit comes next. This is a data nightmare when you think about the volume… I mean millions or billions of rows of data that have to be checked, etc.

On a scheduled basis every DC will slowly replicate its data and propagate its changes to every other DC, but because this happens completely independently of the offline data crunching, the DCs will never ever be 100% the same unless everyone on the internet stops updating their sites for long enough to give Google a chance to catch up, and I doubt that will ever happen.


So, my answer to the question:

One thing that makes me think duplicate content is tricky to track is that there is no way the 'real first' copy of any piece of content can be identified: every Googlebot talks to a different DC, spiders pages at different times, stores them locally at different times, indexes them at different times, etc.

The problem with duplicate content is that because EVERY page fights against EVERY other page for ranking, if you are posting the same content across multiple sites (not a good idea) then the site that Google 'sees' as the most appropriate will probably get the higher rankings.

This is not an exact science so anything could happen, but I think the best bet is to play it safe and try to avoid duplicate content.

More stuff to get the brain working:

Floating DCs: http://news.cnet.com/8301-11128_3-10034753-54.html

DC in a box: http://perspectives.mvdirona.com/2009/04/01/RoughNotesDataCenterEfficiencySummitPosting3.aspx