Most people prefer not to think about disasters–which is why, despite all of the available information, most people are grossly underprepared when disasters do occur. The same holds true for enterprises. No one disputes that a datacenter outage, or worse, the total loss of a datacenter, can be catastrophic in terms of revenue, productivity, and reputation. And yet only a fraction of enterprises have a comprehensive, well-tested disaster recovery (DR) strategy, one that comprises tools, processes, and people. We recently conducted an informal survey of our customers to learn about their experiences with datacenter disasters and DR. We wanted to share a few of their stories to prompt you to think about your DR strategy, or lack thereof.
It may surprise you that several respondents admitted they had no DR strategy at all–unless you count responses such as “hope and prayers” and “buying all the milk, bread, toilet paper, and water” as strategies. Yet disasters come in many forms, from the spectacular and tragic (hurricanes, tornadoes, fires, and earthquakes), to the predictable (failing devices and human error), to the comically trivial.
One member of the Nutanix community recalled a full weekend outage at a hospital thanks to a rodent taking down an electrical grid by chewing through lines. “UPS [uninterruptible power supply] kicked in, but the generator did not.” It took 30 hours for the storage vendor to come on site, which meant that “the hospital lost a few million dollars from cancelling every single non-urgent appointment the Monday following. The Board never approved a DR site. ROI would have easily been 400-500% in that single day of outage.”
Wayne Conrad, a Nutanix Consulting Architect, noted the varying fortunes of enterprises during Hurricane Sandy in 2012: “Goldman Sachs HQ was lit up like a Christmas tree, while all those hospitals had gone dark. Why? Goldman Sachs realized that DR and disaster prep are like buying insurance, and the hospitals were cutting and scraping by on smaller budgets. If you’re facing a nasty credit card bill, who says, ‘eh, I’ll just skip paying the car and house insurance’? IT leadership does this all the time with DR sites.”
Some users made sensible, good-faith efforts but ran into problems nonetheless: “In my last job we placed the core of our data center in a building designed to withstand an F5 tornado. Safest place in the city. It turned out that summer heat was the biggest threat to our servers, because they started turning the A/C off to that part of the building in the afternoons to save money.”
And of course there were many close calls. One community member said when a potential disaster was imminent, such as a tornado or hurricane, he would back up his company’s data onto two external hard drives. “We were never directly hit with any of the storms, but it’s a little nerve-racking knowing that you have the whole server room stored in your backpack while being evacuated for a storm.”
"We were never directly hit with any of the storms, but it’s a little nerve-racking knowing that you have the whole server room stored in your backpack while being evacuated for a storm."
Nutanix user Tre Bell observed that there is a “common misconception that a successful backup strategy equates to a successful DR strategy.” He admonished that “the restoration of systems in a different location or environment is not always cut and dry – restoring an environment to be fully functional often requires reconfiguration beyond simple backup and restore. Let’s say you have 50 systems in an environment that is a complete loss due to a disaster – most, if not all, of these systems have integrations between one another that require reconfiguration once you are able to successfully restore them to the new backup location or environment.” Bell says that “successfully restoring systems is only the first part of a successful DR strategy – DR testing is also crucial; you don’t know what you don’t know until you perform DR testing to verify 100 percent functionality.”
Given the enormous benefits of proper disaster preparation, why don’t more people take steps to protect themselves, or their enterprises? In The Ostrich Paradox: Why We Underprepare for Disasters, Robert Meyer and Howard Kunreuther point to several widely shared cognitive biases:
- Short memories when thinking about the painful lessons of the past.
- Short horizons when thinking about the future, especially when weighing immediate costs against the potential benefits of protective actions.
- Unwarranted optimism (“it won’t happen to me”).
- Oversimplification of cost-benefit analyses when considering risk.
- A tendency to follow the actions of others, that is, herding.
- A tendency to default to the status quo when faced with complexity and uncertainty.
The good news is that there are now offerings that mitigate some of the biases that keep us from tackling DR by eliminating the complexity and uncertainty associated with traditional DR solutions. Disaster Recovery as a Service (DRaaS) solutions such as Xi Leap provide recovery automation and on-demand, non-disruptive testing to ensure business continuity.
Xi Leap is part of the Nutanix Enterprise Cloud OS, which means IT has no need to master another management console or worry about reconfiguring network and security settings during DR failover.
Meyer and Kunreuther point out that humans actually have something to learn from ostriches when preparing for disaster–not by sticking our heads in the sand, but by adapting to circumstances in order to survive. The ostrich compensates for the vulnerability of being flightless with speed and agility. Rather than avoiding proper disaster preparation by doing nothing (hope and prayers) or defaulting to the status quo (complex DR systems), enterprises may want to embrace a simpler, faster, and more agile alternative.