Availability Digest ran a great article last month on whether the cost of converting to Active Active is worth it. Well worth reading.

Active Active is an architecture where two nodes both process transactions and keep each other up to date. Well, that’s the theory anyway. A device or user sends a “transaction” to one of the nodes and, if it doesn’t get a response, sends the request again to the other node, which is ideally situated at a different geographic location.

The author has done a good job of discussing the costs associated with altering code and so on to make it all work. The interesting phrase, of course, is “…virtually eliminating…downtime”.

If the nodes contain any form of database (e.g. an account balance in a credit card or ATM type system), then it is possible that the balance on one node is different from the other. This brings the challenge of trying to work out which one is correct. This is called “split brain syndrome”. It’s a ‘mare’ to put right.

Imagine you go to a cash machine to withdraw some money. The request fires off to NODE-1. NODE-1 updates your account balance, then the instruction goes back to the ATM to dispense your money. On the way there is a failure. Now, the ATM knows it hasn’t got a response, so (after a suitable timeout) it sends the same request to NODE-2.

At this stage no one knows whether NODE-2 has the old or the new balance. If you originally requested £100 and you had a balance of £1000, then the new balance should be £900. If NODE-2 already has that new balance, the ATM’s repeated request for £100 leaves you with a balance of £800, but you have only got £100 out! Think what happens if your balance was only £100 in the first place. No money!
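To make the arithmetic concrete, here is a toy sketch of that retry in Python. It is an illustration only, not how any real ATM switch works: the `Node` class, `atm_withdraw` and `run_scenario` are hypothetical names, and it assumes NODE-1 replicated the new balance to NODE-2 before its reply was lost.

```python
class Node:
    """Toy node holding a single account balance (hypothetical model)."""

    def __init__(self, balance):
        self.balance = balance
        self.peer = None  # the other node in the active/active pair

    def withdraw(self, amount):
        """Debit the account and copy the new balance to the peer."""
        if amount > self.balance:
            return False  # insufficient funds, request declined
        self.balance -= amount
        if self.peer is not None:
            self.peer.balance = self.balance  # assumed instant replication
        return True


def atm_withdraw(amount, first, second, reply_lost):
    """Return the cash dispensed for one customer request."""
    approved = first.withdraw(amount)
    if not reply_lost:
        return amount if approved else 0
    # The reply never arrived, so the ATM times out and retries elsewhere.
    # It has no way of knowing the first node already debited the account.
    return amount if second.withdraw(amount) else 0


def run_scenario(opening_balance):
    """Replay the lost-reply story; return (cash dispensed, final balance)."""
    node1, node2 = Node(opening_balance), Node(opening_balance)
    node1.peer, node2.peer = node2, node1
    cash = atm_withdraw(100, node1, node2, reply_lost=True)
    return cash, node2.balance


print(run_scenario(1000))  # opening balance of £1000
print(run_scenario(100))   # opening balance of only £100
```

With an opening balance of £1000 the sketch ends with £100 dispensed and £800 on the account; with an opening balance of £100 the retry is declined, so nothing is dispensed and the account is empty.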

The alternative scenario is that NODE-1 gives you the money then dies before NODE-2 is updated. Then if you go to the ATM again, the ATM fires a request off to NODE-2 and the original account balance of £1000 is still there. Then you can get a second lot of £100 out (so now you have £200) but the account balance has only been debited by £100. Bingo!
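This second scenario can be sketched the same way. Again it is a toy model under stated assumptions: the `Node` class, its `outbox` of unsent updates, `replicate` and `run_scenario` are all hypothetical, standing in for asynchronous replication where an update can still be queued when the node dies.

```python
class Node:
    """Toy node with an outbox of updates not yet sent to its peer."""

    def __init__(self, balance):
        self.balance = balance
        self.alive = True
        self.outbox = []  # balance changes awaiting replication

    def withdraw(self, amount):
        if not self.alive or amount > self.balance:
            return False
        self.balance -= amount
        self.outbox.append(-amount)  # queued, not yet on the other node
        return True


def replicate(source, target):
    """Apply every queued update from source to target."""
    while source.outbox:
        target.balance += source.outbox.pop(0)


def run_scenario(replicated_before_failure):
    """Two visits to the ATM; return (cash dispensed, NODE-2 balance)."""
    node1, node2 = Node(1000), Node(1000)
    cash = 0

    # First visit: NODE-1 debits the account and the ATM pays out.
    if node1.withdraw(100):
        cash += 100
    if replicated_before_failure:
        replicate(node1, node2)
    node1.alive = False  # NODE-1 dies; anything left in its outbox is lost

    # Second visit: the ATM tries NODE-1, gets nothing, falls back to NODE-2.
    for node in (node1, node2):
        if node.withdraw(100):
            cash += 100
            break
    return cash, node2.balance


print(run_scenario(replicated_before_failure=False))
print(run_scenario(replicated_before_failure=True))
```

When the update never reaches NODE-2, the customer walks away with £200 while the surviving balance shows £900. If replication completes before the failure, the same two withdrawals leave the expected £800.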

In the examples above, I concede that I am oversimplifying what happens and the checks and balances that go on between the devices – but if you probe various vendors who have architected active/active type solutions, there is evidence to suggest that the above can happen. My verdict?

Active/Active is great for planned downtime (shift all transactions to one of the nodes whilst the other goes into maintenance mode) and as an ‘in-place’ automatic data recovery facility. For ultra high availability in the first instance though, use Fault Tolerant technology!



One Response to “Is Active Active Worth it?”  

  1. Andy -

    I’m just now getting a little time to catch up on your blog. I read your review of my article with great interest.

    Your points are quite right. However, nothing is perfect. The problem that active/active solves is the failure of a site. If a site fails, and there is no backup, all is lost (the press is full of such stories – check out my Never Again articles in the Availability Digest). Active/active provides two or more geographically-separated sites so that you keep going no matter what. True, there is some data loss if asynchronous replication is used – typically tens to hundreds of milliseconds, but at least you are up-and-running. A solution using just co-located fault-tolerant systems is not disaster tolerant. A site disaster will take down the system. The best you can do is to replicate to a remote site. Assuming that you are using asynchronous replication, you will also lose some data. And recovery could be minutes or hours rather than seconds or subseconds as is being achieved with active/active systems.

    By the way, if you use synchronous replication, no data is lost and there is no split-brain mode. You are limited in the distance between nodes by performance concerns (typically tens of kilometers), but that is better than having all of your eggs in one room. Of course, you’ve got to be pretty clever about how you remove the failed system from the scope of further transactions so that processing can proceed, but there are several solutions to this problem.

    Why not add an active/active option to ftServer to make it disaster tolerant and fully fault tolerant?

    - Bill

