anat0l
Enthusiast
- Joined
- Dec 30, 2006
- Posts
- 11,669
Yes, I have designed systems with even lower RPO (1 hour for some systems), but RTO still generally remains much higher. I think the lowest RTO I have seen for a true off-site DR solution is 12 hours, and that was just for the core application, with other apps being 24 to 48 hours in a prioritised list.
But in the case of the recent Navitaire system failure, how long was the system outage? Unless they have a hot-standby processing centre, it's unlikely the recent event would have pulled the trigger on activation of the plan to move processing to an alternate location.
Well I certainly hope that Joe is not taking his coffee into the data centre computer room environment ...
Component-level failure should be covered by local redundancy. So an outage that takes out an entire application/system for several hours should not be caused by a spilled coffee or even a single hardware failure. But we all know that no matter how well you design a system, there will always be unexpected failure modes that result in unexpected outages of some form.
So the end result is that it most likely is possible to design a high-availability solution that would have provided (almost) continuous operations for this system. However, the cost to implement and operate the HA solution may not be justified against the cost and risk of such a failure. It all comes down to risk management and ultimately that is a $$$ judgement call.
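To put that $$$ judgement call in concrete terms, here is a rough sketch of the comparison: the expected outage cost avoided by HA versus the yearly cost of running HA. Every name and figure here is hypothetical and purely for illustration; none of it comes from Virgin or Navitaire, and a real assessment would also weigh reputational damage and probabilities rather than flat averages.

```python
def expected_annual_loss(outage_hours_per_year, cost_per_outage_hour):
    """Expected yearly loss from outages (hypothetical flat-rate model)."""
    return outage_hours_per_year * cost_per_outage_hour


def ha_is_justified(outage_hours_without_ha, outage_hours_with_ha,
                    cost_per_outage_hour, ha_annual_cost):
    """True if the loss avoided by HA exceeds what HA costs per year."""
    loss_without = expected_annual_loss(outage_hours_without_ha, cost_per_outage_hour)
    loss_with = expected_annual_loss(outage_hours_with_ha, cost_per_outage_hour)
    return (loss_without - loss_with) > ha_annual_cost


# Assumed figures only: 10 h/year of outage without HA, 1 h/year with HA,
# and an HA solution costing $500,000 per year to implement and operate.
print(ha_is_justified(10, 1, 100_000, 500_000))  # outage costs $100k/hour
print(ha_is_justified(10, 1, 10_000, 500_000))   # outage costs $10k/hour
```

With these made-up numbers the HA spend pays for itself at $100k per outage hour but not at $10k, which is the whole risk management argument in miniature.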
It's great to talk about all of this, to a degree. Reminds me a lot of all the theory they taught us in my BInfTech about redundancy, architecture and risk management. Of course, in the days that I did my degree, management of IT infrastructure was predominantly biased towards lowest cost without much regard to reliability or minimisation of failure (e.g. single point of failure / accountability). Don't know if the culture has changed in recent years.
Back on topic, we'll probably never really know whether the failure could've been prevented by adequate infrastructure (i.e. it came down to inadequate management of risks), or whether this was just a "Swiss cheese holes" incident where everything happened to line up. On top of that, there is the contractual obligation between Virgin and Navitaire for uptime - if Navitaire had written in anything less than 100% uptime in their service contract, that does give them a "leeway" for failure, even though we all know that failure is quite unpleasant, as just witnessed.
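For a sense of how much "leeway" an uptime figure actually buys, here is a quick sketch that converts an uptime percentage into permitted downtime per year. The percentages are examples only; the actual terms of the Virgin/Navitaire contract are not known, and real SLAs often measure per month and exclude planned maintenance.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def allowed_downtime_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Hours of downtime permitted over the period at a given uptime percentage."""
    if not 0 <= uptime_percent <= 100:
        raise ValueError("uptime_percent must be between 0 and 100")
    return period_hours * (1 - uptime_percent / 100)


# Example uptime levels (assumed, not taken from any real contract)
for pct in (99.0, 99.9, 99.99, 100.0):
    print(f"{pct}% uptime allows {allowed_downtime_hours(pct):.2f} hours down per year")
```

So even a "three nines" commitment leaves room for the better part of a working day of outage each year without breaching the contract.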