So lately we’ve been talking about two of my favorite words, stability and resilience. And mostly how my stance is that you can’t have one without the other – because that’s been my experience, without exception. When you take away stability, the resilience goes away because multiple components fail in parallel. When you take away resilience, a single fault takes the whole thing out.
But the whole point of stability and resilience is availability. Because the fact is that availability is paramount above all, period, no exceptions. It doesn’t matter if you’re public facing like Netflix or it’s an internal application. If people can’t use it, it’s doing nothing more than burning cash.
The idea that cloud somehow changes the equation is, in fact, completely false. Cloud doesn’t change these basic concepts – it changes how you achieve them. That’s all. Doesn’t matter if your solution is hosted or in house; if it’s down, it’s worthless. If it’s down regularly, it’s worthless. But to really understand things, first we need to understand availability and start debunking a lot of the traditional “enterprise” cruft that’s been wrong for as long as I’ve been in IT. (That’s a long time.)
RAS is not what you think it is.
This acronym has been giving me fits for ages because people continue to get it wrong at every level. (I’ve been equally guilty in the past, I’ll note.) They’re convinced that RAS is an acronym for: Reliability, Availability, Serviceability.
No. Just no. Take that definition, and burn it. Please. Burn every copy of it ever. It’s just plain wrong. First of all, we’re not in the 1950s – can any of you honestly remember the last time you worried about a vacuum tube going out and causing a system fault? How about the last time you had a real concern about a modern SOI-process processor simply ceasing to work in a 36-month timeframe? Reliability is so outmoded in many regards, it’s not even funny.
Availability? It’s less of a bad one, but it’s still not what it gets pushed as any more. RAS is applied to hardware. Modern hardware doesn’t perform most availability functions, and hasn’t for a long time. Dual-pathing and redundant cards, sure – but that goes in the R. (I’ll get to that.) Availability is primarily a function of software and not hardware. That distinction is beyond important. Even an IBM POWER 795 can’t keep your Oracle database running through a power cut unless you’ve also got PowerHA or RAC.
Serviceability is still relevant though, and accurate as well. But we need to clearly define serviceability. Specifically, it’s your ability to perform repairs and corrective actions without disrupting normal operations. That last part is important. And it’s exactly why absolutely no x86 box can pass even basic RAS tests. To be able to pass basic RAS, you need to be able to perform significant corrective actions such as replacing network cards without disrupting normal operations.
Reliability is Retrograde
As I said, when’s the last time a CPU failing out of the blue was a big enough concern that you actually worried about its MTBF? We don’t even bother with doing MTBF calculations on modern processors because it’s completely unnecessary and pointless. Either they’re dead out of the box or they’re likely to keep working indefinitely.
R as Reliability is a throwback to the bygone era where we had to worry about things like bad hand solder joints on the 32 ICs making up a single processor, or the bus failing because a wire wrap came loose. Excepting hard disks, the manufacturing process for a modern system long ago applied engineering to send the MTBF through the roof for most modern silicon and solder work.
I mean, let’s get right down to the hard numbers. Intel rates the MTBF of the complicated RMM2 on the S5000 family motherboards at 72 years. That’s for a single component which is in fact a complete solid state system on its own. Micron rates the P300 SSD at 2M hours MTBF – that’s 228 years before component level failure occurs (barring all other faults, obviously). Modern DRAM ICs have MTBFs so high that many manufacturers don’t even bother publishing them. A whitepaper from Smart Modular Technologies (a large OEM DIMM manufacturer) calculated the MTBF on a DDR 1GB DIMM at 63.4 years.
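For anyone who wants to check the arithmetic, here is a minimal sketch of the hours-to-years conversion behind those figures. The function names are mine, not anything from the vendors, and the annual failure probability assumes a constant failure rate (the usual exponential model), which is a simplification rather than something the datasheets promise.

```python
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours; leap years ignored


def mtbf_hours_to_years(mtbf_hours: float) -> float:
    """Convert an MTBF rating in hours to years of continuous operation."""
    return mtbf_hours / HOURS_PER_YEAR


def annual_failure_probability(mtbf_hours: float) -> float:
    """Chance of at least one failure in a year of 24x7 operation.

    Assumption: constant failure rate (exponential model).
    """
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)


# The 2M hour SSD rating quoted above
print(f"{mtbf_hours_to_years(2_000_000):.0f} years")
print(f"{annual_failure_probability(2_000_000):.2%} chance of failure per year")
```

Under that assumption a 2M hour rating works out to well under a 1% chance of failure in any given year, which is the point: the individual part is not what you should be losing sleep over.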
Reliability in the context of RAS just doesn’t matter. The parts don’t fail like they used to. So what does the R stand for? Resilience.
In a modern system, parts tend to be either good indefinitely or they fail abruptly. There’s no gray area. Resilience is the ability of the hardware to sustain through an abrupt failure, be it a faulty card or CPU or DIMM. As the aforementioned whitepaper points out, the more DIMMs you have, the greater your chance of encountering a failure, because every added module is one more part that can take the system down. With modern systems, you’re talking about 16+ DIMMs – which takes your MTBF from a base >15000 to a base <1500 – a 10x greater chance of encountering a system impacting fault. (Remember that we’re talking about multi-bit uncorrectable errors. You know, the ones that even ECC can’t prevent from taking down entire systems.)
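To see why part count matters so much, here is a rough sketch using the textbook series model: if any one of N identical, independent parts failing counts as a system fault, and each has a constant failure rate, the failure rates add up and the combined MTBF is the single-part MTBF divided by N. That model is my assumption for illustration. It is not necessarily how the whitepaper arrived at its figures, and `series_mtbf` is a made-up helper name.

```python
def series_mtbf(component_mtbf: float, count: int) -> float:
    """Combined MTBF of `count` identical parts where any single failure counts.

    Assumptions: independent parts, constant failure rates, no redundancy.
    """
    if count < 1:
        raise ValueError("count must be at least 1")
    return component_mtbf / count


dimm_mtbf_years = 63.4  # the per-DIMM figure quoted earlier in this post

for dimms in (1, 4, 16, 32):
    print(f"{dimms:>2} DIMMs: {series_mtbf(dimm_mtbf_years, dimms):.1f} years")
```

Under these assumptions, sixteen of those DIMMs bring the combined figure down to roughly four years. No single part got worse. There are just more of them, and that is exactly the gap resilience has to cover.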
So R is for Resilience. And Resilience in the simplest terms is the ability of the system to handle those faults without coming crashing down around your ears.
Availability Isn’t In Your Hardware
I want you to repeat that over and over and over and over and over again. And then keep repeating it.
Look, I don’t care what hardware you’re running on. How much of it actually does any sort of software availability on its own? The answer is “none.” Redundant modules, processors, DIMMs, disks, and so on go under the heading of Resilience. Those are there to ensure operation despite faults – not to be confused with failure modes.
Availability is about doing useful things with the hardware. That means running applications, like it or not. Go ahead, try and provide services for a business with absolutely nothing but the bare OS and no add-ons. Those applications rely on more applications to make them ‘highly available’ through failing over to other hardware, automatically restarting in the event of a crash, and so on. Presuming your HA software doesn’t break. Hint: it does. Because it’s a Single Point of Failure. (Get used to hearing me rail about those. I’m just getting warmed up.) Hardware availability is “can I order it today and have it tomorrow?”
The A is for Accessibility. And no, I don’t mean Section 508. I mean being able to get at the problem to do diagnostics and effect corrective actions without disrupting normal operations any more than they’ve already been disrupted. Think about things like front mounted swappable disks, rear accessible power supplies, rapid disassembly and assembly, and you get the basics of it. The more repairs you can do without having to disconnect anything, the higher the accessibility.
By the same token, a lack of Accessibility is very bad. Even if your system is resilient, if you have to take it down to effect repairs, the benefit of the resiliency ends up virtually wiped out. If you can’t get at the problem, how are you going to fix it? Answer: you can’t without basically ripping the whole thing out and then reinstalling it.
Serviceability is not the same as Accessibility
The biggest mistake I see being made on a regular basis is people presuming that Accessibility = Serviceability. This couldn’t be more false if it tried. They are, in fact, two distinct areas. Accessibility is being able to get at the problem to diagnose it and repair it. Serviceability is being able to fix the problem with a minimum of actual disruption.
There are a fair number of parallels, but just as many differences. Does it matter that I can get at a PCIe card if I still have to shut everything down to replace it? No – since I have to shut it down anyway, it doesn’t matter that I can get at it. If I have an N+0 multiple power supply arrangement, does hot-swap give me any benefit? None whatsoever. If my disks can be swapped, but not while the system is running, what’s the point of it other than slightly faster repairs?
So to have high Serviceability, you need things like hot plug PCI-Express, hot swap disks, N+1 redundant power supplies, sliding rails with cable management and so on.
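The power supply case is easy to put into code. This is a toy sketch with hypothetical names (`PowerConfig`, `can_replace_live`), not a model of any vendor's hardware. It just encodes the rule from above: hot-swap only buys you something when there is at least one supply beyond what the load needs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PowerConfig:
    installed: int   # supplies physically present
    required: int    # supplies needed to carry the load (the "N")
    hot_swap: bool   # can a supply be pulled while the system runs?

    @property
    def spares(self) -> int:
        """The "+X" in N+X."""
        return self.installed - self.required

    def can_replace_live(self) -> bool:
        """True only if a supply can be swapped without dropping the load."""
        return self.hot_swap and self.spares >= 1


print(PowerConfig(installed=2, required=2, hot_swap=True).can_replace_live())   # N+0
print(PowerConfig(installed=3, required=2, hot_swap=True).can_replace_live())   # N+1
print(PowerConfig(installed=3, required=2, hot_swap=False).can_replace_live())  # N+1, no hot-swap
```

Only the middle configuration can be serviced live. The N+0 box has hot-swap bays that do nothing for it, and the last one has a spare it cannot use without a shutdown.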
So what’s the summary of this concept exactly?
Think of it this way: R+A enables S. Resilience keeps it running, Accessibility lets you figure out what’s wrong, Serviceability makes it possible to fix without disrupting Resilience. Sort of “R+A enables S, S+R improves R.”
So what’s this have to do with Availability? Well, we’ll get to that after I first explain (in part 2) why this RAS applies to cloud and much more importantly, how to apply it to cloud. Because it really does apply to all systems one way or another, like it or not.
Stay tuned for part 2…