Stability + Resilience, not Stability|Resilience

So @AndiMann @jamesurquhart @f3ew and I have been having a bit of a discussion on Twitter regarding resilience, change control, deployment cycles and such and I really can’t fit my thoughts on the matter into 140 characters. It is a bit of a complicated topic, but not as complicated as folks keep saying in various outlets.

First of all, the most important thing to recognize is that almost any company pushing it as an OR rather than an AND is trying to sell you hardware, software or services. Period. That’s just a truism and it’s unavoidable and something they don’t like me calling them out on. Especially when I point out that they’re trying to sell you these things whether or not you actually need them. But come on – we all already know they’re going to do that.

But that’s not what we’re here to look at as much as the AND versus OR argument. There are a lot of folks who have gone completely overboard with this idea that if you don’t do continuous deployment, you’re doing it wrong. And the simple fact of the matter is that they’re wrong. IT is not a zero-sum game, nor is it strictly a matter of OR operations. Most organizations don’t want or need continuous deployment. And many organizations (e.g. Google, which likes to break its infrastructure at the expense of paying customers and products) are doing it completely wrong.

So let’s start with my definition of these two terms, which arguably skews a lot closer to the dictionary than what’s being pushed:

  • Stability
    The ability of a process or system to continue to operate normally despite potential error conditions or external changes; efforts to reduce the number of shocks the process or system is subjected to; and the capability of an organization to leave something in place without having detrimental impacts.
  • Resilience
    The ability of a process or system to survive deliberate or accidental internal or external changes (e.g. power outages, network interruption, etcetera); the capability of an organization to adapt to changes without detrimental impacts; and the capability of a process or system to adapt to unforeseen shocks.

Sounds a lot simpler that way, doesn’t it? That’s because it really can be expressed that simply. Too many people have gotten onboard with this ridiculous idea that stability is analogous to technical debt. That somehow every organization must jump into [cloud,virtualization,buzzword] with both feet or they’re basically banging two rocks together over tin cans and string. And this is where stability comes into the picture.

Stability is not a dirty word. It is not a bad thing. It does not mean ‘stale’ or ‘obsolete.’ It is, in fact, absolutely critical to running a successful business, period. Companies do not succeed because they push code to production every hour or because they sling buzzwords around like they were going out of style. (Oh, how I wish they would go out of style…) Companies succeed and fail on the basis of stability – which is not aversion to change or adaptation. It’s recognizing the simple fact that downtime is not free and that change is not always good. Customers quickly abandon unreliable services unless they have no other choice.

The other bad argument is that continuous deployment somehow contributes to resilience. It does not. Resilience is a matter of building a resilient infrastructure in all regards. That means both the technical side and the personnel side. As an example: using SNA over dry pair in 2012 is not technical debt and it is not bad. However, it becomes bad when you have only 2 people in the whole company who can fix problems with it.

By comparison, if you had converted to TCP/IP in, say, 2003, you would have been reducing stability. That conversion meant a total rewrite of the processes involved, the introduction of untested external factors, and putting yourself at the mercy of the Internet, with hundreds to thousands of developer hours spent writing, debugging, and testing it. So the wiser choice was always to stick with SNA – it was tested and proven reliable, and the only argument for changing it was the perception that it’s old technology.

Now fast forward, and you’ve done a from-scratch rewrite to handle TCP/IP. But you’re sending the same data – does it really make sense to change the entire user interface and entry process? There are a lot of folks who argue that ‘resilience’ means “YES! You have to! Otherwise you aren’t resilient!!” Actually, the opposite is true: if you actually have a resilient process, changing the underlying networking shouldn’t affect it.

So now that we’ve got a basic understanding of my take on it, let’s dig in.

There is an idea that resilience and stability require trade-offs in one form or another. But when we look at the example above, you should notice something. No “trade-off,” no loss of stability or resilience, and no detrimental impacts. The users in the SNA example doing data entry do not need to be retrained, since the process is the same. The developers maintaining the code simply recreated the existing business process (the data entry portion) and completely discarded the SNA aspects, instead of reinventing the process.
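To make the point about swapping the network without touching the business process concrete, here is a minimal sketch in Python. Everything in it is hypothetical and invented for illustration: the `Transport` interface, the two stand-in transport classes, the `submit_entry` function, and the record layout. Neither transport speaks real SNA or opens a real socket; they only record what they were asked to send. The one thing it is meant to show is that the data entry step produces the same payload regardless of which link is plugged in underneath it.

```python
from typing import Protocol


class Transport(Protocol):
    """Anything that can carry an encoded record to the remote system."""

    def send(self, payload: bytes) -> None: ...


class LegacyLinkTransport:
    """Stand-in for the old link (SNA in the example above); records what was sent."""

    def __init__(self) -> None:
        self.sent: list[bytes] = []

    def send(self, payload: bytes) -> None:
        self.sent.append(payload)


class TcpIpTransport:
    """Stand-in for the replacement link; also records instead of opening a socket."""

    def __init__(self) -> None:
        self.sent: list[bytes] = []

    def send(self, payload: bytes) -> None:
        self.sent.append(payload)


def submit_entry(transport: Transport, account: str, amount_cents: int) -> bytes:
    """The data entry step: validate, encode, hand off. It knows nothing about the wire."""
    if not account or amount_cents < 0:
        raise ValueError("invalid entry")
    payload = f"{account}|{amount_cents}".encode("ascii")
    transport.send(payload)
    return payload


if __name__ == "__main__":
    old_link = LegacyLinkTransport()
    new_link = TcpIpTransport()
    submit_entry(old_link, "ACME-001", 1250)
    submit_entry(new_link, "ACME-001", 1250)
    # The users' process and the data on the wire are identical on both links.
    print(old_link.sent == new_link.sent)
```

The design choice is simply that `submit_entry` depends on the narrow `send` interface rather than on any particular network, so replacing the link is a change in one place and the people doing data entry never see it.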

And this illustrates exactly why change is not always a good thing. If they had thrown everything out including the data entry portion, that would have meant creating a new interface and retraining all of the users on it. Any change to the interface is inherently disruptive. Think about what happens when you add a field to a commonly used form. What invariably occurs is that for days or weeks, people neglect to fill out that new field, unless you disrupt the business to spend half a day going over what is essentially a minor change. Not only that, but you’ve made it much more difficult to compare to previous copies of that form.

The same thing applies in IT. If you go from logging X, Y and Z to logging X, Y, Z, and A, you’ve made it more difficult and disruptive to do data comparison. “But I know when and where I made that change,” you cry, “so I can just assume…” and somebody breaks out the adage about ‘assume.’ Not only that, but even when you can clearly explain why you added A, it has a ripple effect.
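A small Python sketch of that ripple effect, under loudly stated assumptions: the field names `x`, `y`, `z` and `a`, the sample values, and the helper names `comparable_fields` and `compare_means` are all made up for illustration, and the records are plain dictionaries rather than any real log format. It shows that once A appears only in the newer records, any before-and-after comparison has to be restricted to the fields both sides share, and A itself has no history to compare against.

```python
from statistics import mean


def comparable_fields(records):
    """Return the field names present in every record, sorted."""
    if not records:
        return []
    common = set(records[0])
    for record in records[1:]:
        common &= set(record)
    return sorted(common)


def compare_means(before, after):
    """Map each shared field to (mean before the change, mean after the change)."""
    fields = comparable_fields(before + after)
    return {
        field: (mean(r[field] for r in before), mean(r[field] for r in after))
        for field in fields
    }


if __name__ == "__main__":
    # Hypothetical records logged before field "a" was added...
    before = [{"x": 1, "y": 2, "z": 3}, {"x": 3, "y": 2, "z": 5}]
    # ...and after it was added.
    after = [{"x": 2, "y": 2, "z": 4, "a": 9}]
    # "a" is silently absent from the result: there is nothing to compare it to.
    print(compare_means(before, after))
```

Every report or script that reads these records now needs this kind of special handling at the change boundary, which is the cost being described.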

When you talk about adding A and removing X and repeating this process on a daily or even weekly basis, well, you’re completely deluding yourself if you think this is a good idea. Think about how often users complain bitterly about every little change Facebook makes. Stability is about avoiding these unnecessary shocks to the system. Think of it as the process by which you carefully consider whether or not a given change is actually necessary rather than likely beneficial.

Which is why ‘being able to run as-is’ is part of it. If you can’t trust the system to operate without any changes, much less without daily or weekly changes, it’s not a stable system. Period. This is why large organizations still have large scale mainframe deployments. They’re able to count on that system, day after day, to carry out its tasks without fail. But these systems don’t get there without resilience – redundant hardware, failover mechanisms, error tests and corrections, and processes that have been tweaked to ensure they don’t fail. Downtime is the enemy.

Yes, we all like poking at the knobs and tweaking things and making it “better.” But you have to be prepared and able to just leave it the hell alone. If you can’t just leave the system in place, untouched, then it is not a stable system. If you cannot adjust the system in-situ to compensate for shocks, it is not resilient. Avoidance and accommodation are both necessary components.

Simply put, if you find yourself forced to do a patch due to external factors beyond your control just to keep running, it is not stable or resilient. If the shock of an API change breaks it, or an API you rely on changes daily or weekly or monthly, it’s just not resilient – you’re failing to practice effective avoidance. Without avoidance of shocks you cannot have stability, which is always paramount – and without resilience you can’t compensate for those shocks.

So, as I said, it’s an AND and not an OR. The argument that resilience means accepting downtime as ‘normal’ doesn’t fly, folks. It never should. You can’t have resilience without stability, and you can’t have stability without resilience. The idea that stability is tantamount to ‘technical debt’ is, frankly, intellectually offensive. The entire idea of ‘technical debt’ is intellectually offensive. Technology is and remains about getting a task done. If you’re getting that task done reliably on a system designed in the 1980s, that’s not technical debt – that’s stability and resilience.

14 Responses to “Stability + Resilience, not Stability|Resilience”


  1. Devops, complexity and anti-fragility in IT: An introduction — Tech News and Analysis

    [...] control, and whether they are good or bad for the future of software resiliency. It resulted in a well-articulated post from Phil, arguing that you can’t have resiliency without stability, and vice [...]

  9. Ron

    I completely agree it’s an “AND” game rather than an “OR”. An “OR” approach only means the business will outsource your function to someone who can give it an “AND” answer, period. Having agreed to that, I think that as an industry (I’ve been in IT since 1995) we have this bad habit of blaming complexity for everything, and that is many times absolutely wrong. In fact, very complex systems work just fine as long as they are untouched. So when do failures (big or small) happen? When we change things. And many times (I wish I could find quantitative research to prove what I am about to say), when failures are diagnosed for root cause, it is a process failure (a wrong file version somewhere, or a wrong configuration value, etc.) that caused the failure. It’s the change application process that is fragile more than anything, and there is a lot that can be done to reduce that fragility. I think that is the main promise of all existing PaaS architectures: they streamline and reduce variance in the change application process. But many times they do it at the expense of the freedom to choose an optimal architecture, so developers will not always adopt them (for good reason, as per some of your examples above). So for the rest of the world, DevOps is now focused on culture and meetings. Those are definitely needed, but culture alone will not solve the fragile nature of the change introduction process itself (upgrades, new releases, patches, etc.); there needs to be a platform that allows the process to be reliable and reusable.

  10. Devops, complexity and anti-fragility in IT: Stability and resilience — Tech News and Analysis

    [...] and cloud computing — will focus on that question, the one that prompted Phil Jaenke to write the blog post that inspired this [...]
