In this article, RTC Magazine reports that the high availability (HA) systems market is moving rapidly from an in-house proprietary systems approach to a commercial-off-the-shelf (COTS) direction, making such solutions available to a wider range of applications. The article compares the two alternatives and is driving quite a lot of comment.

One comment is from my colleague Dara Ambrose. Dara points out that another important factor to consider when implementing an HA solution is the cost of validating that it can withstand faults and continue to perform as expected.

In his comment he explains why pushing as much of the HA solution into COTS components as possible can greatly reduce this investment.


There was a great article in Availability Digest last month on whether the cost of converting to Active Active is worth it. Well worth reading.

Active Active is an architecture where two nodes transact and keep themselves up-to-date. Well, that’s the theory anyway. A device or user sends a “transaction” to one of the nodes and if it doesn’t get a response then the request is sent again to the other node, which, ideally, is situated somewhere else, geographically.
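The resend behaviour described above can be sketched in a few lines of Python. This is a minimal illustration only, not any vendor's implementation: the `Node` type, the `send_with_failover` function and the two toy nodes are hypothetical names, and a missing response is modelled as a `TimeoutError`.

```python
from typing import Callable, Optional, Sequence

# Hypothetical type: a node is any callable that takes a request and returns
# a response, or raises TimeoutError when no response arrives in time.
Node = Callable[[str], str]


def send_with_failover(request: str, nodes: Sequence[Node]) -> str:
    """Try each node in turn and return the first response received."""
    last_error: Optional[Exception] = None
    for node in nodes:
        try:
            return node(request)
        except TimeoutError as err:
            # No response from this node: resend the same request to the next.
            last_error = err
    raise RuntimeError("no node responded") from last_error


def node_1(request: str) -> str:
    # Toy node that never answers.
    raise TimeoutError("NODE-1 did not respond")


def node_2(request: str) -> str:
    # Toy node that always answers.
    return f"NODE-2 handled {request}"


result = send_with_failover("withdraw 100", [node_1, node_2])
print(result)
```

Note that the sender cannot tell whether the first node processed the request before going quiet. It only knows that no response came back.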

The author has done a good job of discussing the costs associated with altering code etc to make it all work. The interesting phrase, of course, is “……..virtually eliminating……..downtime”.

If the nodes contain any form of database (e.g. an account balance in a credit card or ATM type system), then it is possible that the balance on one node is different from the other. This brings the challenge of trying to work out which one is correct. This is called “split brain syndrome”. It’s a ‘mare’ to put right.

Imagine you go to a cash machine to withdraw some money. The request fires off to NODE-1. NODE-1 updates your account balance then the instruction goes back to the ATM to dispense your money. On the way there is a failure. Now, the ATM knows it hasn’t got a response so requests the same (after a suitable timeout) of NODE-2.

At this stage no one knows whether NODE-2 has the old or the new balance. If you originally requested £100 and you had a balance of £1000, then the new balance should be £900. The ATM requests £100 again so you now have a balance of £800 but you have only got £100 out! Think what happens if your balance was only £100 in the first place. No money!

The alternative scenario is NODE-1 gives you the money then dies before NODE-2 is updated. Then if you go to the ATM again, the ATM fires a request off to NODE-2 and the original account balance of £1000 is still there. Then you can get a second lot of £100 out (so now you have £200) but the account balance has only been debited by £100. Bingo!
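The two scenarios can be replayed with a toy model. This is a deliberately simplified sketch: the `Node` class and both functions are hypothetical, replication is modelled as copying a single number, and none of the checks a real payment system performs are included.

```python
from typing import Tuple


class Node:
    """A toy node holding one account balance (hypothetical model)."""

    def __init__(self, balance: int) -> None:
        self.balance = balance

    def debit(self, amount: int) -> None:
        if amount > self.balance:
            raise ValueError("insufficient funds")
        self.balance -= amount


def lost_response_scenario(opening: int = 1000, amount: int = 100) -> Tuple[int, int]:
    """NODE-1 debits and replicates, but its reply never reaches the ATM."""
    node_1, node_2 = Node(opening), Node(opening)
    cash = 0
    node_1.debit(amount)
    node_2.balance = node_1.balance  # assume replication reached NODE-2
    # The reply is lost, nothing is dispensed, and the ATM retries on NODE-2.
    try:
        node_2.debit(amount)
    except ValueError:
        return node_2.balance, cash  # retry refused: balance gone, no cash
    cash += amount  # this time the reply arrives and the ATM pays out
    return node_2.balance, cash


def lost_replication_scenario(opening: int = 1000, amount: int = 100) -> Tuple[int, int]:
    """NODE-1 pays out, then dies before NODE-2 is updated."""
    node_1, node_2 = Node(opening), Node(opening)
    cash = 0
    node_1.debit(amount)
    cash += amount  # first withdrawal is dispensed
    # NODE-1 dies here, so NODE-2 still holds the opening balance.
    node_2.debit(amount)
    cash += amount  # second withdrawal is dispensed
    return node_2.balance, cash


print(lost_response_scenario())       # (balance on NODE-2, cash in hand)
print(lost_response_scenario(100))
print(lost_replication_scenario())
```

With the figures used above, the first scenario ends with £800 on the account and £100 in hand (or £0 and no cash when the opening balance was £100), and the second ends with £900 on the surviving node and £200 in hand.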

In the example above, I concede that I am oversimplifying what happens and the checks and balances that go on between the devices – but if you probe various vendors who have architected active/active type solutions, there is evidence to suggest that the above can happen. My verdict?

Active/Active is great for planned downtime (shift all transactions to one of the nodes whilst the other goes into maintenance mode) and as an ‘in-place’ automatic data recovery facility. For ultra high availability in the first instance though, use Fault Tolerant technology!


Having made my predictions for the year to come before the recent inclement weather, I’ve been doing a bit of thinking. I wanted to join other pundits in making my forecast for the decade. Yet, since most people can’t even decide what to call the decade, and since I myself keep failing to cross the first hurdle (believing that the infrastructure to make anything positive happen will be in place) I’ve decided not to.

Instead this post is a friendly reminder of what seems to be becoming my mantra: only an ultra high availability infrastructure will ensure that the expected service levels of the new decade are met.

I guess the inclement weather has taught all of us a thing or two over the last few weeks. It may have been the worst for the last 30 years, but it still amazes me how much disruption a natural event is causing.

I remember family stories (the good old days when grandfathers had to walk to work) where the snow was so high, you found yourself walking over hedges, ending up in the middle of a field and not realising where you were (or perhaps the beer was stronger back then).

I admit it has been pretty cold – the car thermometer said -11 where I live yesterday, but I remember -18 as short a while ago as 1995.

Since making my predictions last year, these allegedly extreme conditions have affected me in many ways: flights delayed/cancelled; roads closed; a power cut; my bank deciding to shut all my accounts in error. Connecting all of this is a common theme: the fragility of our infrastructure.

Link this fragility to aggressive “efficiencies” that have been put in place by various organisations due to extreme conditions and it is beyond fragile. It’s broken.

Technology can help people during times of extreme weather. We can fall back on “technology” to, for example, research information on the weather/flights/roads and to work, bank and shop from home. Often, however, service providers’ IT infrastructures simply can’t take the strain.

Visitors to the web sites of National Rail Enquiries, the Highways Agency and the RAC, for example, found this out last week, according to the Daily Telegraph.

If, in this day and age, snow can still cause havoc for travel websites, then I fear for the government’s plans for 2014 becoming a reality and am certainly not prepared to commit my predictions for the new decade to media, just in case they come true.

I am, however, interested in hearing yours, so come on, what do you want the new decade’s availability story to be? Let’s get some discussion going…


This week’s anti-Brown activities got me to thinking about Gordon’s announcement at the end of last year regarding the need to ensure all public services are online by 2014. Putting the Frontline First, as the report which launched this initiative is called, makes sense, but only if an infrastructure is in place that can guarantee excellent customer satisfaction.

The Putting the Frontline First report explains how, if all services are put online, the government hopes service will be improved and staff will be freer to deal with individuals.

When launching the report, Gordon said that within the next five years the government will shift the great majority of its “large transactional services to become online only”. Doing so, he hopes, will bring £3 billion additional savings to the £9 billion savings outlined in this year’s budget.

I look forward to finding out more from the Digital Britain Roadmap due at the end of 2010. The roadmap, apparently, will explain how services such as job seekers allowance, student loans and child tax credits will be making the move online.

With such important public services preparing to move online by 2014, high availability is of course put at the frontline. The knock-on effect of delivering less than 99.999% availability will be catastrophic, as the Department of Labour and Industry in the US discovered last month when its unemployment web site for the Pennsylvanian district went down, causing difficulties for people filing unemployment claims.

According to WNEP, those visiting the site were greeted with a message telling them that the online filing system “is operating intermittently”. The purpose of the site is to help claimants file for unemployment compensation services in Pennsylvania. The technical problem with the online system meant that many who were trying to file claims were understandably concerned and tried to get in contact via telephone to make sure they received their benefits.

As a result, further problems were created for the phone system because of the high numbers of calls being received from people who were worried their unemployment cheques would not be sent. Those who could not get through due to the high volume of calls had to wait until after 5pm and call again. Such downtime could result in benefits not being paid on time and vulnerable people suffering severe hardships, their lives even potentially put at risk, due to circumstances beyond their control.

It’s easy to imagine similar scenarios for other governmental departments: imagine going to buy your car tax and the system is down, meaning you’re driving or keeping a car on the roadside illegally and it’s not your fault.

Of course, we don’t need to imagine what would happen in the student loan scenario: we’ve already seen some people having to give up university this year due to the current chaos.

Putting The Frontline First makes sense — but only if the infrastructure is in place.


Well here we are at that time of the year when pundits begin to review the year gone and make their predictions for the year to come. So what’s the story in the world of Availability? Here’s my view:

At the risk of sounding bah humbug, what sticks out most in my mind about 2009 was the number of high profile outages for companies that really can afford (in all senses of the word) to do better:

Salesforce.com; Barclays; O2; Blackberry; Google; Virgin Media; YouTube; BBC; Vodafone; eBay; Microsoft’s Bing; London Stock Exchange; British Airways and, during the busiest shopping day of the year, ToysRus (hopefully Santa has the fallout covered…)

Also worthy of mention are the various reports showing that not as many companies are deploying virtualisation in production as the pundits predicted last year. Perhaps they are frightened of the always on world and the associated increase of risk?

Then on the good news front, there was the announcement about Avance, the breakthrough software solution set to make ultra high availability ultra highly available within the next two years.

And so to 2010, what’s on the horizon?

Clouds, clouds and more clouds. Bringing with them growth in the virtualisation and ultra high availability markets.

I believe that this will be the year when cloud computing finally passes the tipping point. Whether implemented in a private way (as predicted by Gartner) or not, the best way to deploy in the cloud is via virtualisation. So we will see this market growing.

This, of course, will put pressure on the availability story – so the need for some damn good platforms to host it all on will grow too.

That’s it for this year from this availability advisor.

See you in 2010!


Last Friday was a bad day for airlines. British Airways lost its website and didn’t know when it was coming back. Another airline, Flybe, lost its call centre too.

BA’s site crashed offline at about 6am UK time and technical staff were still working out what the problem was at 11.30am, according to The Register, which reported that:

“The airline apologised to customers and suggested they hit the phone and use the call centre (0844 493 0 787) rather than trying to book tickets online.”

Unrelated, I happened to have had a flight altered by Flybe so tried to contact their call centre to sort out the mess. Usual automated response that got cut off automatically after 2 minutes.

Eventually I managed to get through to reception who refused to take a message but admitted that they were having “severe problems with phones”.

Losing a website would not have mattered a couple of years ago, but with most passengers now checking in and reserving seats online, this is likely to cause BA customers – and its airport staff – some serious hassle.

Call centres that do not work are not the answer either. The answer is to ensure all systems are available 24/7 and not to allow customer satisfaction to take a swan dive.


I came across this paper earlier this week, which shares one proven way to position business continuity management issues so that board members will listen.

It offers valuable advice and insights and a practical four step approach to reaching key decision makers quickly and with a better chance of success. I would add one more thing:

Ensure that you clearly communicate the role of a rock solid, fault tolerant, compute platform in mitigating the need to trigger disaster recovery and business continuity processes.

Doing so requires clarifying the differences in the scope, methodologies and outcomes of ultra high availability (99.999% uptime), high availability (99.99% uptime), disaster recovery and business continuity.

This paper explains each and examines how availability management, disaster recovery and business continuity support one another.
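To make the gap between the two uptime figures concrete, here is a short back-of-the-envelope calculation of the yearly downtime each one allows. It is a sketch only: the function name is my own, and it assumes a 365-day year measured around the clock.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: 365-day year, measured 24/7


def downtime_budget_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year allowed at a given uptime."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for label, percent in [
    ("ultra high availability", 99.999),
    ("high availability", 99.99),
]:
    minutes = downtime_budget_minutes(percent)
    print(f"{label} ({percent}%): about {minutes:.1f} minutes a year")
```

That works out at roughly 5.3 minutes a year for 99.999% and roughly 52.6 minutes a year for 99.99%, a tenfold difference.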

Enjoy!


I’ve long been aware that misuse of the term “Fault Tolerant” (FT) is laying companies open to business risk and financial losses from system downtime. Now, research conducted by TheInfoPro shows the true impact of such misuse of terminology on the sector.

“Stop this now! Stick to clear definitions,” I cry. “Or else…”



Fault tolerance in virtualised environments doesn’t get more exciting than this, according to a recent vox pop in The Register.

I beg to differ. Rather than borrowing Trevor’s axe-based approach, I shall attempt to employ one of focus.

Fault tolerance can indeed be built into a virtualised environment such that availability is ensured. All it requires is an Ultra High Availability Server at the service layer with total duplication built in above.

It’s really that easy. Here’s a picture:

Adam suggests a cheaper alternative to fault tolerant hardware might be to invest in a blade frame and, as an even cheaper option, fault tolerant software.

Before going either of these routes, please consider this:

  • Some fault tolerant hardware does not require coding so there are no related costs. Such hardware happens to be of the ultra high availability delivering species too. It also happens to reduce the cost of management and downtime.
  • What about the costs of management, downtime and coding related to blade frames and software solutions?
  • At what price do you value your reputation? Any outage requiring services to customers to be restarted on alternative servers, no matter how short lived, costs your customers time, money and in some cases lives. Downtime is simply not an option.

Trevor paints a great picture of Servers B and C coming to the rescue of Server A post an axe attack. Hang on though. Aren’t we forgetting who the victims are here?

Aren’t we forgetting about the customers who are unable to access the services on server A for the split moment it went down and whose customers’ livelihoods and, in many cases, quite literally lives may have been at risk? If Server A had been truly fault tolerant then it wouldn’t have needed rescuing and there would be no victims.

These things aside, this vox pop does celebrate what we all know to be the best thing about virtualisation: flexibility. It’s great that recovery times are improving in the case of failure in a virtualised environment. For the greatest flexibility however, you need to ensure an environment where restarting is not necessary in the first place. Full stop.


A technical fault in a London Stock Exchange (LSE) server caused trading to falter on nearly 300 stocks for an hour and a half on November 9th, reported ITPro. Technical difficulties hit once again last Thursday, causing trading to halt for three and a half hours, according to the BBC.

 

That’s five hours downtime in one month.
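For a sense of scale, the two outages can be turned into an availability percentage. The result depends heavily on what you measure against, so the sketch below shows two readings. Both measurement periods are my own assumptions for illustration (a 30-day month around the clock, and 21 trading days of 8.5 hours), not figures published by LSE.

```python
def observed_availability(downtime_hours: float, period_hours: float) -> float:
    """Return availability as a percentage of the measured period."""
    return 100 * (period_hours - downtime_hours) / period_hours


# Figures reported above: 1.5 hours on November 9th plus 3.5 hours last Thursday.
downtime = 1.5 + 3.5

# Assumption: a 30-day month measured around the clock.
around_the_clock = observed_availability(downtime, 30 * 24)

# Assumption: 21 trading days of 8.5 hours each (illustrative, not LSE data).
trading_hours_only = observed_availability(downtime, 21 * 8.5)

print(f"{around_the_clock:.2f}% around the clock")
print(f"{trading_hours_only:.2f}% of assumed trading hours")
```

Under those assumptions the month comes out at roughly 99.3% around the clock, or roughly 97.2% of trading hours. Either way it is a long way short of five nines.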

Though LSE promised a full report after the first failure, I have not seen any evidence of it, so cannot comment on the causes. I can, however, take an educated guess at its effects, and at those of the second outage.

Or can I? Maybe this is one of those rare occurrences where downtime actually led to saving money rather than losing it. In fact, in light of the Dubai situation, maybe someone at LSE took it down on purpose?

Joking aside, this is not the first time Europe’s oldest exchange has suffered from technical problems. Just last month, a problem in a market data feed suspended some trading and a year ago a systems failure caused both the UK and South African markets to cease all trading for almost seven hours.

With a switch to Millennium IT’s trading platform planned for next year, LSE needs to fully resolve and clearly communicate these issues before “trading as normal” takes on a whole new meaning…