Backing Up Large Datasets Without Losing Your Mind


There is one time-honored way to get big datasets into the cloud: dump them onto removable media and call UPS.

This decidedly non-technical solution does work for one-time data dumps to the cloud, massive restores from a cloud-based backup, and large daily backups that take too long over the Internet connection. But when it comes to regularly moving big datasets to the cloud, there has got to be a better way.

There is a way, but it’s not as common as you might think. That’s because backing up large datasets to the cloud is technically demanding, and many cloud backup companies prefer to solve the problem with fatter pipes, paid for by the customer. If the customer has the need, the money, and the will to invest, great. If not, the consequence is unacceptably slow cloud backup performance. The storage alternatives are not great for long-term backup of large datasets either: tape libraries are an expensive capital purchase, and even high-capacity, low-cost disk gets expensive when bought in bulk.

Many companies see the value in uploading backups to the cloud and would prefer to pursue that rather than disk-to-tape (D2T) or disk-to-disk (D2D) options. However, many cloud backup products cannot deliver acceptable service levels for datasets above 2TB in size. The backup will technically happen, but it can take a long time, sometimes more than 24 hours.
Legacy data migration or backup tools won’t help. Without a dedicated method of speeding up large dataset backup, a company will have to move to expensive tape libraries with massive capacity or buy more and more spindles to house backup data on disk.

Challenges

Let’s look at some numbers. Consumer-grade and SMB backup products are not engineered for backup performance with terabyte-sized datasets. When you start throwing terabytes at the backup channel, you run into a real problem. Take a 75TB on-premise dataset from which you back up 6% a day, about 4.5TB. With consumer-grade cloud backup, you are looking at nightly backups that run past 24 hours. And that does not even account for other traffic on the WAN, network hops, or latency. That’s just not going to work.
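A rough back-of-the-envelope calculation shows why. The sketch below assumes the WAN link is the bottleneck, roughly 80% effective utilization, and no help from compression or deduplication; the exact figures will vary, but the order of magnitude is the point.

```python
# Back-of-the-envelope: how long does a 4.5TB nightly backup take over a
# given WAN link? Assumes the link is the bottleneck, ~80% effective
# utilization, and no compression or deduplication.

def backup_hours(dataset_tb: float, link_mbps: float, utilization: float = 0.8) -> float:
    """Hours needed to move dataset_tb terabytes over a link_mbps link."""
    bits = dataset_tb * 1e12 * 8                      # decimal TB -> bits
    return bits / (link_mbps * 1e6 * utilization) / 3600

daily_change_tb = 75 * 0.06                           # 6% of 75TB = 4.5TB

for mbps in (100, 300, 1000):
    print(f"{mbps:>5} Mbps link: {backup_hours(daily_change_tb, mbps):6.1f} hours")
```

On a 100 Mbps link, that 4.5TB change set needs roughly 125 hours; even a dedicated gigabit link lands in the 12-hour range before any other traffic competes for it.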

What to Look for Instead

When researching enterprise-grade cloud backup, look for distinct features and architectures. Native cloud architecture lets the vendor build in critical features such as WAN acceleration, traffic optimization, and the ability to send variable-sized data over the WAN. Byte-level change detection and advanced compression minimize backup sizes, and multi-threading enables parallel backups and efficient initial base uploads.
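To make change detection concrete, here is a minimal sketch of the general idea, not any particular vendor’s implementation: hash fixed-size blocks of a file, compare against the previous backup’s manifest, and send only the blocks that changed. Production tools layer rolling hashes, compression, and multi-threaded transfer on top of this. The block size, file paths, and load_manifest helper are illustrative assumptions.

```python
# Sketch of block-level change detection: hash fixed-size blocks and
# upload only the blocks whose hashes differ from the previous backup.
import hashlib

BLOCK_SIZE = 4 * 1024 * 1024  # 4 MiB blocks (illustrative choice)

def block_hashes(path: str) -> list[str]:
    """Return one SHA-256 hash per fixed-size block of the file."""
    hashes = []
    with open(path, "rb") as f:
        while chunk := f.read(BLOCK_SIZE):
            hashes.append(hashlib.sha256(chunk).hexdigest())
    return hashes

def changed_blocks(old: list[str], new: list[str]) -> list[int]:
    """Indices of blocks missing from or different in the old manifest."""
    return [i for i, h in enumerate(new) if i >= len(old) or old[i] != h]

# Usage (hypothetical paths and helper):
# previous = load_manifest("dataset.manifest")
# current  = block_hashes("/data/dataset.bin")
# to_send  = changed_blocks(previous, current)
```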

Also look for solutions that can manage backup for on-premise datasets of 100TB or more, and that tune each process with individual clients to optimize data transfer. A high degree of automation will cut down on your management time. Look, too, for application-specific backup optimization.

Recovery speed is also a major issue, of course: you can back up quickly, but can you recover just as quickly? Ask the vendor how much data you can recover in a day; 5TB is a good benchmark to aim for.

As an example of how this works in real life, Zetta.net’s enterprise-level cloud backup service can back up 5TB of data to the cloud within 12 hours. Figure on a 2TB seed upload taking less than 5 hours. And at an average 6% daily change rate, incremental backups take about an hour a day.
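Taken at face value, those figures imply a sustained effective transfer rate in the neighborhood of a saturated gigabit link. The quick check below assumes decimal terabytes and a steady rate, and does not try to separate out how much of that comes from compression and change detection shrinking the bytes that actually cross the wire.

```python
# Quick sanity check of the throughput implied by the cited figures,
# assuming decimal terabytes and a steady sustained rate.

def implied_mbps(tb: float, hours: float) -> float:
    """Effective megabits per second to move tb terabytes in hours."""
    return tb * 1e12 * 8 / (hours * 3600) / 1e6

print(f"5TB in 12 hours -> ~{implied_mbps(5, 12):.0f} Mbps sustained")
print(f"2TB seed in 5 h -> ~{implied_mbps(2, 5):.0f} Mbps sustained")
```

Since byte-level change detection and compression reduce what actually has to move over the WAN, the effective rate a customer sees can exceed what the raw link speed alone would suggest.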

You can hold off on that truck.