Mitigating the management expense of offsite backup

For 15 years, Online Tech has provided data backup services to our colocation and hosting clients. We used to back up to tapes in a large multi-tape system, then transport them from one data center to another. We moved to disk-based backup when the total cost to operate per GB of protected data was clearly lower with disk than with tape.

We knew we had to find an approach with a lower total cost of ownership (TCO) if we were going to keep up with data growth. The tipping point came when we saw that our costs to maintain the physical tape drive (oil, pads and new arms) were five figures per year. All that money on moving parts to back up data in the modern era? Then came disk, and a whole new slew of options.

Over this period we’ve learned what really drives the TCO for backup. In fact, we analyze it all the time and make the necessary investments to deliver the most effective solution at the lowest TCO. It’s what we do. Here are a few things we’ve learned along the way:

Monitoring: Like all systems, a backup system (software, storage, etc.) needs to be monitored and someone needs to respond to it when something doesn’t work properly. Unfortunately, due to the intense pressure on IT staff, backup is often the task with the lowest priority – until there’s a data loss.

Software: Backup is an application like any other and relies on software running both on the server being protected and at some central location. Software is sometimes sold either as a single, one-time license with a smaller (10-20%) maintenance contract, or for a recurring monthly fee, often based on the amount of data protected.

Servers: The backup software runs on the servers being backed up, but there is also a server component that needs to run on a centralized server. This is where the management and administration of the backups occur. That server might also house the storage drives.

Storage: The backup software has to write the backup jobs to a storage device somewhere. This storage device is generally a disk drive of some type. Ideally it’s expandable, or you will have to take the gamble and guess how much storage you will need over time. If data is taken offsite, there is usually a local copy for file stores, etc., and a remote copy for the offsite portion.

Offsite: Proper backups should be taken offsite. The cost of transporting and storing the media offsite can grow dramatically depending on the length of retention and the quantity of data. Over time, it’s important to track this cost carefully.

Failed jobs: Every time a backup fails, someone has to manually inspect and restart the job. And not just with a simple click of a button. Restarting a backup job requires careful coordination to avoid negatively impacting production performance. On average, it takes about 60 minutes to deal with a failed job. Some are very quick. Others take a long time to diagnose. Extrapolated across thousands of backup jobs, a 1- to 5-percent failure rate (not that uncommon) can quickly consume staff hours and, more importantly, put data at risk. Thankfully, we weren’t using tape backups, or our failure rate could easily have been 50 percent or more; those stuck with tape media are still experiencing extremely high failure rates.

File restores: In many cases, restoring a single file can take hours (or more) of someone’s time, depending on how much self-serve capability the end user has and on the type of media.

Encryption: Given the plethora of new regulations surrounding data, it’s important to make sure the data is encrypted in transit and at rest. Without encryption you may be exposing yourself to significant fines.

Backup Management: All these components have to be patched (the backup software) or at times replaced (disk drives). Vendors have to be managed, and capacity and utilization have to be tracked. For many organizations this isn’t a core activity.
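
To make the failed-jobs figures above concrete, here is a rough sketch of the arithmetic in Python. The 60 minutes per failure and the 1- to 5-percent failure rates come from the discussion above; the volume of 3,000 jobs per month and the function name are assumptions chosen purely for illustration, not our actual numbers.

```python
def failed_job_hours(jobs_per_month, failure_rate, minutes_per_failure=60):
    """Estimate staff hours per month spent handling failed backup jobs."""
    failed_jobs = jobs_per_month * failure_rate
    return failed_jobs * minutes_per_failure / 60


# Hypothetical volume: 3,000 backup jobs per month (an assumption).
for rate in (0.01, 0.05):
    hours = failed_job_hours(3000, rate)
    print(f"{rate:.0%} failure rate: {hours:.0f} staff hours per month")
```

Under those assumptions, the estimate works out to 30 staff hours per month at a 1-percent failure rate and 150 at 5 percent, before counting any risk to the data itself.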

Like everyone else, our data grows exponentially. A few years back, we reached a level where our standard point solution for offsite backup was being stretched beyond capacity. With hundreds of terabytes of data to protect across hundreds of servers, the amount of data had become so large and diverse that the process of backing it up was threatening to impact production performance. Worse, our backup jobs began taking longer than the allotted backup window. This meant failed jobs and even more complications.

When we completed the total cost of ownership analysis of our backup, we found that failed jobs, backup management and offsite transport were indeed the biggest cost drivers. Further analysis showed that if we could dramatically reduce the time backups took, we would:

A) Reduce the failures, which are very expensive and add risk.

B) Reduce the costs to take the data offsite, because there would be less data to take offsite.

C) Reduce management costs, because there would be less infrastructure to manage to protect the same data.

After six months of requirements gathering, vendor research, reference checks and testing in our lab, we integrated a product called Avamar by EMC into our backup product and portal. The result:

Backup jobs that used to take many hours now take just a few minutes.
Failure rates are down 90%.
Transport costs to take data offsite are down 90%, thanks to deduplication.
Initial seed backups are much faster, allowing us to protect even more data.
Jobs can be restarted within the backup window, reducing risk of data loss.
We can protect significantly more data with the same management costs.
Clients and staff can securely self-serve a file-level restoration.
Encryption at source, in transit and at storage is built in, reducing the risk of data loss without a processing penalty.
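
For readers unfamiliar with deduplication, the idea behind the lower offsite transport costs can be sketched in a few lines of Python. This is a simplified illustration and not a description of how Avamar works internally: the fixed 4 KB chunk size, the SHA-256 hashing and the names `chunks_to_send` and `store` are all assumptions made for the example.

```python
import hashlib

CHUNK_SIZE = 4096  # hypothetical fixed chunk size in bytes


def chunks_to_send(data, known_hashes):
    """Return only the chunks whose hashes the offsite store has not seen."""
    new_chunks = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in known_hashes:
            known_hashes.add(digest)  # remember it for later backups
            new_chunks.append(chunk)
    return new_chunks


# First backup: ten distinct chunks, so all of them must travel offsite.
store = set()
monday = b"".join(bytes([i]) * CHUNK_SIZE for i in range(10))
sent_first = chunks_to_send(monday, store)

# Next backup: only the last chunk changed, so only one chunk is sent.
tuesday = monday[:-CHUNK_SIZE] + bytes([99]) * CHUNK_SIZE
sent_second = chunks_to_send(tuesday, store)

print(len(sent_first), len(sent_second))
```

In this toy case the second backup sends one chunk instead of ten. Real savings depend on how much of the data actually changes between backups.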

These types of investments in, and management of, our IT infrastructure on behalf of our clients mean our clients maximize the ROI on their IT spend. After all, that’s our real job.
