Taking disaster recovery from complex to simple

Disaster recovery is neither simple nor easy. The underpinnings of a successful disaster recovery strategy are complex and often difficult to implement and maintain.

If you lead a technology department, you have a running list of disaster recovery issues that rapidly goes from manageable to overwhelming. Working through that list is a time-consuming and complicated endeavor. You know that your data systems are critical to your business. Without them, your business soon grinds to a not-so-graceful halt, and the risk of losing business compounds with every passing hour. Preparing for that inevitable moment is the reason to invest in a disaster recovery strategy.

But there are so many moving parts in your core, mission-critical applications. Disaster recovery (DR) is more than just data replication. You need to copy ALL of the other components of your production site that help keep it running, from a people, technology and process perspective. Without them, your disaster recovery strategy will fail, and you do not want to learn that lesson on the worst day of your career.

What do you need?

Strong disaster recovery infrastructures run on two full stacks: one for production and one for recovery. The piecemeal philosophy of running a recovery infrastructure at less than 100 percent operational capacity is doomed to fail. I’ve done it – BAD decision. When planning for a disaster, you have to treat your production stack as if it’s completely gone and assume normal business operations will run on the recovery site for an extended period. This means that anything that supports your production site must be replicated, audited and managed in your recovery site as well. Consider the supporting components of your core, mission-critical applications: email services, data transfer services, authentication systems, timekeeping systems, certificate authorities, firewall rules, VPN tunnels, load balancers, WAN and Internet bandwidth capacity, and so on. You also need to consider the infrastructure that protects your production infrastructure: monitoring, backup services, syslog or log review services, intrusion prevention systems, anti-virus, etc. And then there is the replicating infrastructure itself, which needs to be monitored, managed, audited, and maintained.

In addition:

Do you have the staff necessary to manage the recovery site during the disaster?
How will you manage your public DNS?
Are any of your applications dependent upon static IP addresses?
How will you communicate with your customers and end users (e.g., website, phone system, email)?

At the risk of being pedantic, you can already envision your own list of requirements continuing to grow, adding to the complexity of both maintaining and orchestrating a successful DR strategy.
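
One way to keep such a list honest is to compare it against what actually exists in the recovery site. The following is only a minimal Python sketch of that idea, not a real tool: the component names, the two hard-coded inventories, and the find_gaps helper are all hypothetical and stand in for whatever inventory or audit data you already maintain.

```python
# Hypothetical inventories for illustration only; real data would come
# from your own asset inventory or audit process.
PRODUCTION = {"email", "authentication", "dns", "load_balancer", "monitoring", "backup"}
RECOVERY = {"email", "authentication", "load_balancer"}


def find_gaps(production: set[str], recovery: set[str]) -> list[str]:
    """Return the components present in production but absent from recovery, sorted by name."""
    return sorted(production - recovery)


if __name__ == "__main__":
    for name in find_gaps(PRODUCTION, RECOVERY):
        print(f"missing from recovery site: {name}")
```

Anything the comparison reports is a component your production site depends on that the recovery site cannot yet provide.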

Value of testing

Once you have mirrored your core applications and synchronized the data and critical infrastructure, you must test the recovery site regularly. Testing once a year is not enough. Every change to the primary site adds risk to a successful failover. While it would be great to test after every change to your production site, that’s not always feasible, and there will be some risk you must assume. It’s up to you to develop strong change management and maintenance processes to help manage the success of your DR strategy. Entropy is the enemy, and without regular testing, entropy wins. Without a regular test plan, maintaining the DR strategy and recovery site becomes a bottom priority for your technology departments.
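
To make the "every change adds risk" point concrete, here is a small Python sketch that decides when a failover test is due based on how many production changes have accumulated since the last test and how long ago that test was. The function names and the thresholds (5 changes, 90 days) are assumptions chosen purely for illustration; the right numbers depend on your own change management process and risk tolerance.

```python
from datetime import date


def untested_changes(change_dates: list[date], last_test: date) -> list[date]:
    """Return the production changes made after the last failover test."""
    return [d for d in change_dates if d > last_test]


def failover_test_due(
    change_dates: list[date],
    last_test: date,
    today: date,
    max_changes: int = 5,   # assumed threshold, not a recommendation
    max_days: int = 90,     # assumed threshold, not a recommendation
) -> bool:
    """True if too many changes have piled up or the last test is too old."""
    too_many = len(untested_changes(change_dates, last_test)) >= max_changes
    too_old = (today - last_test).days >= max_days
    return too_many or too_old
```

A check like this does not replace testing; it only makes the accumulated, untested risk visible so the test does not quietly slide down the priority list.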

Do-It-Yourself disaster recovery can be prohibitively complex and expensive. In my next post, I’ll outline what you can do to make such a time-consuming project easier.