Wednesday, October 14, 2009
Google indexing and how I think it works
I could be wrong, hey I could be right, we will probably never know.
I started trying to answer this question at the Loving Tech forums - http://www.lovingtech.net/forums/thread-duplicate-content-on-article-sites-and-your-site - about duplicate content, and my answer slowly ballooned into more of a tech article than a quick reply.
There are hundreds (maybe thousands) of Googlebots out there trawling pages to generate a massive list of URLs. For example, one of my sites has had 400 hits from Googlebot in the last 12 hours, from about 8 different IP addresses, so 8 different datacenters.
Google has hundreds of datacenters around the globe, so when we do a search we get the 'best' DC for where we are, whichever is least busy, etc., and these DCs can have a DIFFERENT set of results to look into for us.
Don’t believe me? Go to http://www.seochat.com/seo-tools/multiple-datacenter-google-search/ and do a search for [google datacenter] (remove the []), select 35 from the dropdown, and on the results page you will see each datacenter's results for the search term. Scroll down and you will see that the results are sort of the same but the number of results is different:
Results 1 - 10 of about 7,890,000
Results 1 - 10 of about 7,900,000 etc
Why the difference? – now the fun stuff
Each Googlebot stores its list of found URLs in its own datacenter, and from time to time this massive list is taken offline and the pages are cached locally so they can be crunched by a load of servers inside that DC. This builds a new search index that will be used to show us the results for our search. Make sense?
Doing it this way makes real sense from an IT/hardware point of view: it means you can do this massive amount of data analysis at any time without affecting the user who is trying to search for free-delivery pizza in their area.
Because the pages are cached locally in the DC they can be compared to older versions, instantly recognised as new pages, flagged as duplicates, etc.
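To make that idea concrete, here's a toy sketch of how locally cached pages could be compared: hash each cached copy, diff it against the last crunch, and flag content already seen at another URL. This is purely my own illustration of the concept, not anything Google has published:

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

class CacheCruncher
{
    // Hash of each page as it looked at the last crunch, keyed by URL.
    static Dictionary<string, string> lastCrunch = new Dictionary<string, string>();
    // First URL each piece of content was seen on, keyed by content hash.
    static Dictionary<string, string> seenContent = new Dictionary<string, string>();

    static string Fingerprint(string pageHtml)
    {
        using (MD5 md5 = MD5.Create())
            return BitConverter.ToString(md5.ComputeHash(Encoding.UTF8.GetBytes(pageHtml)));
    }

    static void Crunch(string url, string cachedHtml)
    {
        string print = Fingerprint(cachedHtml);

        string oldPrint;
        if (!lastCrunch.TryGetValue(url, out oldPrint))
            Console.WriteLine(url + " -> new page");
        else if (oldPrint != print)
            Console.WriteLine(url + " -> changed since last crunch");

        string firstUrl;
        if (seenContent.TryGetValue(print, out firstUrl) && firstUrl != url)
            Console.WriteLine(url + " -> duplicate of " + firstUrl);
        else if (firstUrl == null)
            seenContent[print] = url;

        lastCrunch[url] = print;
    }

    static void Main()
    {
        Crunch("site-a.com/article", "<p>some great content</p>");
        Crunch("site-b.com/copy", "<p>some great content</p>");      // duplicate
        Crunch("site-a.com/article", "<p>some revised content</p>"); // changed
    }
}
```

The real thing obviously has to cope with near-duplicates and boilerplate, not just exact copies, but the cache-then-compare shape is the point.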
The offline approach also means that the new index can be tested before it goes anywhere near the live one.
Obviously this is where the magic happens – throw in some Google algorithms, hop on one foot, do the secret handshake, close one eye, do a little dance, make a little love – you get the picture. Out pops an update to the Google index, still steaming, ready to be pushed to the live index.
All of this newly indexed data is made available to the DC and mixed in with the old data – hence the number of results can be different across the DCs.
The big data replication bit comes next. This is a data nightmare when you think about the volume… I mean millions or billions of rows of data that have to be checked, etc.
On a scheduled basis every DC will slowly replicate its data and propagate its changes to every other DC. But because this happens completely independently of the offline data crunching, the DCs will never ever be 100% the same – unless everyone on the internet stopped updating their sites for long enough to give Google a chance to catch up, and I doubt that will ever happen.
So my answer to the question:
One thing that makes me think the duplicate content thing is tricky to track is that there is no way the 'real first' copy of any content can be identified: every Googlebot talks to a different DC, spiders pages at different times, stores them locally at different times, indexes them at different times, etc.
The problem with duplicate content is that because EVERY page fights against EVERY page for ranking, if you are posting the same content across multiple sites (not a good idea) then the site Google 'sees' as the most appropriate will probably get the higher result rankings.
This is not an exact science so anything could happen, but I think the best bet is to play it safe and try to avoid duplicate content.
More stuff to get the brain working:
Floating DCs: http://news.cnet.com/8301-11128_3-10034753-54.html
DC in a box: http://perspectives.mvdirona.com/2009/04/01/RoughNotesDataCenterEfficiencySummitPosting3.aspx
Monday, September 14, 2009
Live Writer
Well so far (only a few characters in though) so good.
Live Writer (part of the Live Essentials download) is pretty simple to use. The install/download takes about 3-5 minutes (no restart needed), and once I remembered my password it all loaded nice and fast.
Obligatory screenshot:
I am impressed by the way it works. There are enough functions to keep someone like me happy: you can set dates to future-publish blogs, copy/paste images straight in, preview/edit the blog, etc.
The interface doesn’t look like it has been touched with the Office 2007 brush – very basic.
I wonder why they can’t have this functionality directly in Word/OneNote etc.?
Score 8/10
UPDATE:
Take off at least 4 from the 8/10 score, as you can't upload images directly to blogger.com – guessing if it was an MS bloghost it would work though.
Thursday, August 20, 2009
short URLs made easy
With many of us on Twitter, tweeting about other websites can be difficult as you are limited in the number of characters you can use. So I went on a search for a good, fast, and free URL-shortening site and found http://shor7.net/, where you simply type in the link you want to ‘go to’ and the short (or shor7) title for the page. This gets turned into something like http://shor7.net/seo, which will send you to http://www.chillfire.co.uk/seo/default.aspx. Best bit is that it is SEO safe, so all the strength of the site it’s on gets passed on!
Shor7.net will also email you a link to view the stats for your shor7s!
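The 'SEO safe' part just means the short URL answers with a 301 (permanent) redirect, which is what tells the spiders to pass the link strength on to the target page. I have no idea how shor7.net is actually built, but a minimal ASP.NET handler doing the same job might look like this (the ShortUrlStore dictionary is a made-up stand-in for a proper lookup):

```csharp
using System.Collections.Generic;
using System.Web;

// Minimal sketch of an SEO-safe short URL endpoint - hypothetical names,
// not shor7.net's actual code. Wire it up to all requests in web.config.
public class ShortUrlHandler : IHttpHandler
{
    // Made-up stand-in for a real lookup (database, file, etc).
    static readonly Dictionary<string, string> ShortUrlStore =
        new Dictionary<string, string>
        {
            { "seo", "http://www.chillfire.co.uk/seo/default.aspx" }
        };

    public void ProcessRequest(HttpContext context)
    {
        string key = context.Request.Path.TrimStart('/');
        string target;
        if (ShortUrlStore.TryGetValue(key, out target))
        {
            // The 301 is the 'SEO safe' bit: a permanent redirect tells
            // spiders to pass the link strength on to the target page.
            context.Response.StatusCode = 301;
            context.Response.AddHeader("Location", target);
        }
        else
        {
            context.Response.StatusCode = 404;
        }
    }

    public bool IsReusable { get { return true; } }
}
```

The key design point is the 301 rather than a temporary 302 – a 302 would keep the strength on the short URL instead of the page you are pointing at.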
Good luck with the tweets!
Saturday, July 11, 2009
Hot August fringe festival London
The first annual Fringe Festival at the Royal Vauxhall Tavern (www.theroyalvauxhalltavern.co.uk) is taking place from 27 July - 28 August 2009 featuring the best of London's alternative performance scene.
Performers include Rosie Wilby, Sarah-Louise Young, Dickie Beau, and David Hoyle, to name just a few.
See up to four shows each night including cabaret, magic, music, theatre, sketch comedy and more.
£7 entry per show.
Shows start at 6:30pm and run through to 12am.
Check out www.hotaugustfringe.com for more details.
Monday, March 30, 2009
wemoot - it's alive!

When asked what wemoot is I normally say "It's like Facebook with brains". Don't get me wrong, old Facebook does a great job, but it doesn't have any 'information' in it; it's great for keeping in touch with friends, but I can't learn anything from it.
And that's why I thought Jorge's idea was a fantastic one and one I know will have great success and possibly change the way a lot of people interact with content.
I suppose the tech name for wemoot would be a content distribution platform, with related discussions and personal messaging. More wemoot tech stuff here.
The test site went live on the 9th of February 2009 (929) and, thanks to all the people testing, the full version went live last week!
If you have ever wanted to share some information you think others would be interested in, or just want to throw some ideas out to a community of people, register and become part of the internet's coolest new cultural community:
www.wemoot.com
Sunday, March 29, 2009
Open Space Code - a good day out
As a sole developer it can be really hard to 'see' what is going on in the industry, especially when all of my projects are based loosely around the same technology and 'patterns & practices'.
The day started with a short group planning session, where everyone is given the opportunity to put forward topics to discuss.
Following a brief democratic process we had our morning topics in place - I was off to learn about BDD - Behaviour Driven Development.
Thanks to Ian Cooper for filling us in on BDD, the topic was covered well and I got 2 main things out of the 2 hours or so.
1- I need to spend a fair chunk of time learning more about testing
2- I should probably take some time out to move some of what I do to C#
We then all headed off to lunch at Nandos!
The afternoon started with a quick planning session to decide what was going to be discussed over the afternoon session - I went along to the DSL - Domain Specific Language - talk.
Although it's probably not something I will be doing in the near future, it is definitely a part of software development that I will have to use at some point.
The day finished off with everyone back together to discuss the day, pros and cons etc.
Then something I have been looking to do for quite a while - I got to play on one of the few 'Microsoft Surface' devices in the world!
This is a cool piece of technology. If I had a spare £12k I would really like one – not sure what I would use it for, but there would never be a shortage of conversation starters (for those who haven't seen/heard about Surface, this one was an interactive coffee table with brains).
Special thanks to Alan Dean for organising a great event and to Conchango for letting us use their really nice offices and allowing some of us to play on the Surface!
If you are a software developer living in London and keen to see how other developers do what they do, these events are what you are looking for!
Friday, January 16, 2009
be careful of Skype when editing your website!
This was easily fixed by turning the Skype browser button OFF and removing the Skype-injected code from the editor before I saved the HTML again.
It's interesting to note that when this Skype bar is turned on it has rendering control over every page you view; that is, it can inject its own HTML/JavaScript into the page as it's being rendered by your faithful browser. Is this some kind of loophole in the browser security?
So if you do use a web-based WYSIWYG editor, be careful of toolbar injectors – they can create a LOT of code that you will never need to use.
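If you've already saved a page full of injected markup, here's a rough clean-up sketch. Note the skype-prefixed class names are my assumption from what I saw in my own pages – check what the toolbar actually injected into yours:

```csharp
using System;
using System.Text.RegularExpressions;

class SkypeScrubber
{
    // Strips span tags whose class starts with "skype" (assumed pattern),
    // keeping whatever text was wrapped inside them. The loop handles simple
    // nesting; a proper HTML parser would be safer for anything complicated.
    static string StripSkypeSpans(string html)
    {
        string previous;
        do
        {
            previous = html;
            html = Regex.Replace(html,
                "<span[^>]*class=\"skype[^\"]*\"[^>]*>(.*?)</span>",
                "$1",
                RegexOptions.IgnoreCase | RegexOptions.Singleline);
        } while (html != previous);
        return html;
    }

    static void Main()
    {
        string dirty = "Call <span class=\"skype_pnh_container\">0123 456 789</span> now";
        Console.WriteLine(StripSkypeSpans(dirty)); // Call 0123 456 789 now
    }
}
```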
Tuesday, December 02, 2008
Dundas is reborn as MSChart
And they look really good, oh did I mention they are free......
http://reddevnews.com/news/article.aspx?editorialsid=10419
cheers,
Monday, November 24, 2008
moving a database from SQL 2005 to SQL 2000.....
The only structural changes were via the aspnet provider script plus the odd new table/stored proc, etc.
However, when I came to move the DB back to the live server all hell broke loose. Well, I might be exaggerating a bit there, but it felt like it.
To cut a long story short, I followed these steps to get it all to work:
1- Script Database as .... Create to ...
2- Run the installProviders.sql for aspnet
3- Task ... Generate scripts ... (then manually edit this to remove any references already created by the installProviders.sql above; I also had to manually remove a number of the unusual SQL 2005 extra bits)
4- Add required sql users
5- I used Redgate SQL Compare (trial edition) to move the actual data across (the SQL 2005 export tool failed) - see the sketch below for a code-based alternative
6- crossed fingers and it seems to have worked...
It seems annoying that even if you set the compatibility level to SQL 2000 it still won't work, but I know now. Ha ha.
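If you'd rather not lean on a third-party tool for step 5, a rough sketch of moving one table's rows with .NET's SqlBulkCopy (server names and the table are placeholders, and I haven't tested this against a SQL 2000 destination; the destination table must already exist from steps 1-3):

```csharp
using System.Data.SqlClient;

class TableMover
{
    static void Main()
    {
        // Placeholder connection strings - point these at your own servers.
        string sourceConn = "Data Source=dev2005;Initial Catalog=MyDb;Integrated Security=True";
        string destConn = "Data Source=live2000;Initial Catalog=MyDb;Integrated Security=True";

        using (SqlConnection source = new SqlConnection(sourceConn))
        using (SqlConnection dest = new SqlConnection(destConn))
        {
            source.Open();
            dest.Open();

            // Read everything from the source table and pump it into the
            // destination table (which must already exist - see steps 1-3).
            SqlCommand cmd = new SqlCommand("SELECT * FROM dbo.MyTable", source);
            using (SqlDataReader reader = cmd.ExecuteReader())
            using (SqlBulkCopy bulk = new SqlBulkCopy(dest))
            {
                bulk.DestinationTableName = "dbo.MyTable";
                bulk.WriteToServer(reader);
            }
        }
    }
}
```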
As soon as possible I think the live server will get an upgrade....
Thursday, November 06, 2008
Startup Zone
If you are an IT service provider like us, you can add your profile for others to find you, or if you are after some IT services there are loads of company profiles to look through.
Best of all, if you are an IT type person and use Microsoft products, you can get them free if you register!!!
http://www.microsoftstartupzone.com/pages/home.aspx
Saturday, November 01, 2008
New Google Analytics Feature: Event Tracking
Now it seems that is about to change, at least for some sites they have selected in the pilot. Keep an eye out, as the launch of this to every other account may not be advertised and the functionality may just 'appear' – so to make sure you can access this as soon as it's available, it might be worth updating your sites to use the new ga.js JavaScript!
cheers,
craig
Greetings from Google Analytics,
We are happy to let you know that a new feature called Event Tracking is now available in the following Google Analytics profiles: www.visalogic.net. Please note that you are receiving this email update since you are an 'Admin' for the profiles listed above.
When you log in to these profiles, you will see a new set of reports called "Event Tracking" under the Content section. As posted on our blog, this is a limited release currently available only to select profiles.
Event Tracking allows you to track interactions with Web 2.0 style content such as Flash, AJAX, Adobe Air, Silverlight, social networking apps, etc. It essentially allows you to track interactions beyond just pageviews.
To use Event Tracking, you will need to upgrade your site to use the new ga.js javascript. Detailed instructions on how to set up Event Tracking on your site are available on our newly launched CodeSite. To find your ga.js code snippet, edit the settings for your profile and click the "Check Status" link on the upper right corner of the page.
Sincerely,
The Google Analytics Team
Wednesday, October 29, 2008
ciao to microsoft
Apparently the wheels had only just started moving, so he had no idea what the merger/acquisition would mean for us in the real world, but they had been informed MS would be looking after the datacenter side of things and investing in the infrastructure. Interesting for MS to go into paid, user-based reviews?
This will be interesting, as they will now know (before anyone else does, I suppose) what the average person thinks about just about every product around – nice info to have if you are looking to launch your own.....?????
Monday, October 27, 2008
even the big boys can have problems
The link redirects to http://www.microsoft.com/en/gb/ and not http://www.microsoft.com/en/gb/default.aspx
oh dear...

cms woes
I have been watching with interest how my two main websites have been performing in Google. My little blog, with its page rank of 0/10, continually pops up very high in searches and has decent traffic, but the main site, with its (much larger!) page rank of 1/10, doesn't feature much in searches yet still gets a fair bit of traffic.
I don't really like the Joomla look of the main site, but it is simple to update and refresh. Is that important to Google & co?
My question really is: would I be better converting the main site into a WordPress one like the blog? Would it help it perform better in Google & co?
Would I be better to combine the two somehow to economize on effort?
I have always thought the way my WordPress blog and the Joomla site name the pages was helpful to the search engines. Is this true, or would I be better off just using numbers for posts like lots of blogs seem to?
Is this the kind of stuff an SEO expert helps with – never having known one, or asked one, until now? I have just relied on reading bits and pieces on things like this and not got very far.
And my response:
Hi,
My 2 pence (as a CMS developer/SEO guy): one thing I have noticed with Google (and other SEs) is that although the CMS you use doesn't have much effect in itself, there are things it does that can.
- Speed of rendering: the page has to be delivered to the spider quickly. Remember, web spiders don't 'see' the images, so they ignore them (that's a usability issue), but some CMSs build their pages slowly, or the database they connect to is slow, etc. I have used this tool before and found it good for helping tune my CMSs: http://www.websiteoptimization.com/services/analyze/
- Page errors: these can really harm your site. Every CMS should display some kind of user message if an error occurs, but what if the error only happens when a spider hits the page? I would advise using Google Webmaster Tools and Yahoo Site Explorer with your site; at least this way you can see any errors the spiders are turning up (there's also a small error-logging sketch after this list).
- Fancy controls/Ajax etc: my CMS uses very little Ajax to display content, and there are no fancy JavaScripted menus/controls anywhere to be seen – it's all plain old HTML. Remember, the spiders only read text, so how the site looks is irrelevant; the best-looking site can get 0 search results. Ajax and JavaScript, if used incorrectly, can actually hide your content from the spiders (used well, they can help your content too).
- HTML structure: this is the biggest problem with any web page (not just CMS-based ones) – the page must follow the basic HTML structure. So meta tags, CSS and JavaScript in the HEAD area (except Google Analytics – move that to the bottom of the page), content in the BODY area, header tags in the correct order on the page (H1 at the top, then H2, etc.), P tags for each section of relevant content, UL/LI for lists, and so on.
A spider will index the structured content faster if it's in the right place.
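On the page errors point: you can't watch a spider browse your site, but you can make sure every server-side error gets logged no matter who triggers it. A minimal ASP.NET sketch (the log file location is just a placeholder – use whatever logging you already have):

```csharp
// Global.asax.cs - catch-all error logging, so errors that only a spider
// ever triggers still get recorded somewhere you can see them.
using System;
using System.IO;
using System.Web;

public class Global : HttpApplication
{
    protected void Application_Error(object sender, EventArgs e)
    {
        Exception ex = Server.GetLastError();
        if (ex == null) return;

        // Placeholder log location. The user agent column tells you
        // whether the request came from a person or a bot.
        string line = DateTime.Now + "\t" + Request.RawUrl + "\t" +
                      Request.UserAgent + "\t" + ex.Message;
        File.AppendAllText(Server.MapPath("~/App_Data/errors.log"),
                           line + Environment.NewLine);
    }
}
```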
Lots of info, and I am sure some will disagree; however, these are all things I have had to look into over the last 2 years of developing OCRE, and the best bit is that every site that uses it has been indexed by Google pretty quickly and all have relatively good search results.
hope that helps
craig
Friday, October 24, 2008
ah the joy of cross browser CSS...
Now don't get me wrong, IE7 is better than IE6 for consistency, and it does a lot of things the same as Firefox, but someone really needs to work out an industry standard.
Anyway, my latest problem was caused by a padding issue that only affected IE7. The fix was easy, but as with all searches on the internet, it takes a few tries to actually find the real answer.
I found it here - IE7 css padding problem?
And it works. Although it won't validate, I don't care, as long as it looks the same on every browser and is accessible!
cheers,
craig
Tuesday, October 07, 2008
asp.net code snippet management for multiple development machines
So I have started using Mark Manella's code snippet add-in on all of my machines – this is an amazing extra for VS2008.
As the code snippets are all stored in the same relative folder, I can just update this folder on each machine.
For this I use SyncToy 2.0 – it's a cool syncing app, very easy to use, and will sync folders across the network/PCs.
Now there are probably other ways to do all of this, like running a bat file to copy the snippets on login (a rough code version of that idea is below), but I like to know when the files are being updated.
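For anyone who does fancy the copy-on-login route, here's a rough C# equivalent of "copy anything newer" – the folder paths are placeholders for your own snippet locations:

```csharp
using System;
using System.IO;

class SnippetSync
{
    // Copies any snippet that is newer in source than in target.
    // Paths are placeholders - point them at your own snippet folders.
    static void Main()
    {
        string source = @"C:\Users\me\Documents\Visual Studio 2008\Code Snippets";
        string target = @"\\otherpc\snippets";

        foreach (string file in Directory.GetFiles(source, "*.snippet", SearchOption.AllDirectories))
        {
            string dest = Path.Combine(target, file.Substring(source.Length + 1));
            Directory.CreateDirectory(Path.GetDirectoryName(dest));

            if (!File.Exists(dest) || File.GetLastWriteTime(file) > File.GetLastWriteTime(dest))
            {
                File.Copy(file, dest, true);
                Console.WriteLine("Updated " + dest);
            }
        }
    }
}
```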
cheers,
craig
Sunday, October 05, 2008
server 2008 64 bit - install 1st go
Platform install was around 30 mins, server roles 15 mins, and Active Directory, including all the AD prep stuff on the current DC, took about 20 mins.
Now installing Exchange 2007 (plus SP2 for Exchange 2003 on my old mail server first); so far it's taken 45 mins and looks like it will be another 15 mins or so.
It's been a while since I have installed a mail server, and there's been no time for a dozen cups of coffee. Fingers crossed that when I move the mailboxes over (with the inbuilt 'move mailboxes' tool) it'll all go well...
importing Excel to SQL2005
I had 200 rows in an Excel 2007 worksheet (this will work for most Excel versions) that I needed to insert into a SQL table I was building.
You used to be able to use DTS in SQL 2000, but there doesn't seem to be any easy way to bring in this type of data now.
First stop: http://forums.microsoft.com/MSDN/ShowPost.aspx?PostID=1381738&SiteID=1 – no luck, just odd SQL errors, and I had no time to find the solution until...
This is where I found an answer that I could work with: http://www.mssqltips.com/tip.asp?tip=1430
Now this won't suit everybody, as it won't work 100% of the time with tables that already have a primary key. I had to delete all the other table columns except the title column I was after; once the data was pasted in, I re-added the columns again.
I was lucky I had only just started using this table... it's always annoying to have a linked table used in and around the database that you then have to go and wipe/re-fill with new rows.
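If pasting rows around doesn't appeal, the code route isn't too bad either. A rough sketch that reads the worksheet over OLE DB and bulk-loads it into SQL – the sheet name, column, table and connection strings are placeholders for your own, and 2007 files need the ACE OLE DB provider installed:

```csharp
using System.Data.OleDb;
using System.Data.SqlClient;

class ExcelImporter
{
    static void Main()
    {
        // Placeholders - adjust the file path, sheet name, column name and
        // connection string to suit your own setup.
        string excelConn = @"Provider=Microsoft.ACE.OLEDB.12.0;" +
                           @"Data Source=C:\data\titles.xlsx;" +
                           "Extended Properties=\"Excel 12.0 Xml;HDR=YES\"";
        string sqlConn = "Data Source=.;Initial Catalog=MyDb;Integrated Security=True";

        using (OleDbConnection excel = new OleDbConnection(excelConn))
        using (SqlConnection sql = new SqlConnection(sqlConn))
        {
            excel.Open();
            sql.Open();

            // Read the worksheet like a table, then bulk-load it.
            OleDbCommand cmd = new OleDbCommand("SELECT Title FROM [Sheet1$]", excel);
            using (OleDbDataReader reader = cmd.ExecuteReader())
            using (SqlBulkCopy bulk = new SqlBulkCopy(sql))
            {
                bulk.DestinationTableName = "dbo.Titles";
                bulk.ColumnMappings.Add("Title", "Title");
                bulk.WriteToServer(reader);
            }
        }
    }
}
```

And no primary key gymnastics needed, as the rows go straight into the existing table.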
cheers,
craig
Wednesday, October 01, 2008
seasick!!!!
Wicked blues/rock music – foot-stomping stuff.
seasicksteve.com
He is playing the Albert Hall tonight in London!