Discover Performance: HP Software's community for IT leaders // September 2012
A little chaos buys you a more resilient enterprise
HP Software Security Evangelist Rafal Los argues that rather than chasing stability, you should use instability to inoculate yourself against catastrophe.
By Rafal Los
Hey, CIO—what if I told you that all these years you’ve spent on building a stable IT infrastructure were actually bad for your business?
After I’d spent 10 years in information security, working always toward higher levels of stability, a colleague challenged my thinking by showing me historical models of stability vs. shock tolerance in global currency, then putting it into IT context. An analogy: If we allow our IT environments to become a sterile bubble, the slightest germ can become deadly. Therefore, could what you see as long-term stability actually be hurting your business? After wrestling with the idea, I think the answer is yes.
I wrote a blog post a while back called “Resilient is the new secure,” in which I challenged some conventional thinking by asking whether a constant state of (instrumented) chaos is better than a stable IT environment, and it has generated some conversation. Let me be clear: I am not advocating a state in which IT provides unreliable services to the business. Quite the opposite. For the sake of this discussion, let us define “stable” as the absence of change or the lack of incident. Which leaves one big question: How could that be bad?
‘No problem’ is a problem
In the long haul, stability brings four very negative results: complacency, change resistance, rigidity and a diminished capacity to respond and recover. I tackled stability’s negative impact at greater length on my blog. In short, there’s an emerging line of thinking in IT that says the longer a system experiences no adverse conditions, the less shock it can recover from, much as a search-and-rescue team that sits idle too long, without constant drilling and practice, becomes rusty under pressure.
For organizations that have never experienced a severe outage, a serious security breach, or a service disruption of any kind, this means that the longer they stay in this steady state, the less likely they are to recover meaningfully when catastrophe inevitably strikes. This challenges the entire concept of stability and security as an absolute state. I’d argue that the concept of the static enterprise should be replaced with a vision of the resilient enterprise.
Enterprise “resiliency” is the capability of an organization to respond to adverse conditions with the kind of poise and purpose that only comes from “having done this before.” Whether your incident response team is investigating a security incident, a hardware failure that’s causing an outage, or a failed change in your environment, the ability of your systems to stay available or recover from failure is critical to the continuity of your business. The thing is, the best technology in the world, on its own, won’t make your business completely resilient to failure. You need to incorporate techniques and human resources into this idea—and processes and people only improve with practice.
Therefore, crazy as it may sound, stability shouldn’t be your enterprise’s ultimate goal. The fact is, if your organization is stable, it’s not changing and growing. “Nothin’ ever goes wrong” can quickly turn into “nothing is broken, so don’t touch anything,” and at that point your IT and IT security talent stagnates. People in your IT organizations should continually be updating and using their skills, especially the ones who will mitigate a crippling disaster.
How to be chaotic
So if some measure of chaos is what keeps skills sharp, how can we introduce it without hurting our ability to deliver good service to the business? How do we break things a little so that they don’t break a lot? The answer is partly automation, and partly planning. Learning from each successive instrumented failure, we can leverage automation to detect failures, compensate for them, or both, faster and more accurately in the future, when an unplanned failure strikes. With enough instrumented chaos, the likelihood of unhandled failures should fall dramatically. Building policies and procedures that are conducive to response, and exercising them, becomes crucial in this type of enterprise mentality.
Build applications and services for component-level and system-level resiliency. Allow components (a switch, virtual machine, content cache, or even a database query) to regularly fail, and your systems and processes to detect and respond. The same goes for supporting complete systems, environments and response teams. Having built in component-level resiliency and constantly testing for it in a state of controlled chaos, you can be more confident that real failure won’t catch you off guard.
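To make the idea of a controlled failure drill concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than a description of any real tool: the `Component` class, the `inject_failure` and `detect_and_respond` helpers, and the three-item fleet are all hypothetical stand-ins, and "recovery" is reduced to flipping a health flag.

```python
import random
from dataclasses import dataclass


@dataclass
class Component:
    """Hypothetical stand-in for a switch, VM, content cache, or query path."""
    name: str
    healthy: bool = True
    restarts: int = 0

    def fail(self) -> None:
        self.healthy = False

    def restart(self) -> None:
        self.healthy = True
        self.restarts += 1


def inject_failure(components, rng):
    """Deliberately fail one randomly chosen component (the controlled chaos)."""
    victim = rng.choice(components)
    victim.fail()
    return victim


def detect_and_respond(components):
    """Find unhealthy components, recover them, and return the names handled."""
    recovered = []
    for component in components:
        if not component.healthy:
            component.restart()
            recovered.append(component.name)
    return recovered


def run_drill(components, rounds, seed=0):
    """Run several drills; log which component failed and what was recovered."""
    rng = random.Random(seed)  # seeded so a drill can be replayed
    log = []
    for _ in range(rounds):
        victim = inject_failure(components, rng)
        recovered = detect_and_respond(components)
        log.append((victim.name, recovered))
    return log


# Hypothetical fleet used only for illustration.
fleet = [Component("switch"), Component("vm"), Component("cache")]
drill_log = run_drill(fleet, rounds=5)
```

In a real environment the detection step would be your monitoring and the response step your runbooks or automation. The point of the sketch is only the loop itself: fail something on purpose, confirm it was noticed and recovered, and keep a record to learn from.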
With information security, of course, you don’t set up to allow occasional failure. Instead, you launch red team activities against your enterprise to simulate response to active security threats.
Fix failures faster
Ultimately, learning to live with an acceptable level of constant chaos teaches us to fail fast and recover fast, which also improves our MTTR (mean time to repair) metric. Defining “acceptable” has always been tricky and depends on your specific organization’s needs, but we can define it loosely as a state that pushes our response mechanisms and teams to continually improve without negatively impacting our business or environment. The role of the IT leader is to push this boundary beyond test environments and into production environments. The DevOps community is accommodating this through small, agile response teams, as my friend Ben Kepes once mentioned to me.
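If you want to track whether drills are actually shortening recovery, mean time to repair is simply the average of the repair durations across incidents. The sketch below assumes each incident is recorded as a (failure detected, service restored) pair of timestamps; the `mean_time_to_repair` function and the sample incidents are hypothetical and invented purely for illustration.

```python
from datetime import datetime, timedelta


def mean_time_to_repair(incidents):
    """Average the (restored - detected) duration over all incidents."""
    if not incidents:
        raise ValueError("need at least one incident to compute a mean")
    total = sum((restored - detected for detected, restored in incidents),
                timedelta())
    return total / len(incidents)


# Hypothetical incident records: (failure detected, service restored).
incidents = [
    (datetime(2012, 9, 1, 10, 0), datetime(2012, 9, 1, 10, 45)),   # 45 minutes
    (datetime(2012, 9, 8, 14, 0), datetime(2012, 9, 8, 14, 30)),   # 30 minutes
    (datetime(2012, 9, 15, 9, 0), datetime(2012, 9, 15, 9, 15)),   # 15 minutes
]

mttr = mean_time_to_repair(incidents)  # (45 + 30 + 15) / 3 = 30 minutes
```

Computing the figure separately for planned drills and for unplanned failures is one way to see whether practice is carrying over to the real thing.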
The shift in organizational mentality to embrace a culture of instrumented, low-key chaos is monumental. Start by asking yourself, “How did our organization respond the last time there was an unplanned failure?” and think about the biggest challenges you faced during recovery. If you could improve those lagging response components by a meaningful amount of time and effort, what would that be worth to your organization? Every organization I’ve ever been a part of has spent countless dollars and immeasurable energy striving for stability in which everything is predictable. Unfortunately, these are the organizations that recover slowest when the inevitable, unpredictable catastrophe hits.
Technical risk isn’t black-and-white, yes or no. The question isn’t whether you’re secure. It’s how fast and how often you can recover from a crisis compounded by pressure and unexpected technical or resource challenges. The only way your organization is going to outperform the catastrophe is to learn to live with it on a daily basis. Rather than living in that sterile bubble, you learn to “get a little dirty” every day to keep your immune system growing and constantly evolving to meet new threats.