Erasure Coding: A Big Deal for Data Protection


Unless you are actively managing big data storage, you may not be intimately familiar with erasure coding (EC). Essentially, erasure coding exists to efficiently rebuild data on disk or flash media in petabyte- to zettabyte-sized scale-out storage. It does not replace RAID throughout the storage world by any means, but it rebuilds considerably faster than RAID in these large data environments.

And these environments are growing. It's not particularly difficult to reach petabyte levels of production data, and huge data repositories and hyperscale cloud providers that store active data on disk have surpassed exabyte levels. Erasure coding is thus well suited to storage environments running big data on high-capacity disk drives of 6TB and up, and particularly good for large object storage environments such as those found in the cloud.

Let's take a quick look at RAID, the long-established data protection technology that rebuilds data when storage media fails. RAID 1 mirroring, for example, does exactly as described: it mirrors the protected data set. RAID 5 and 6 both use parity-based data protection, where parity calculations reconstruct the data that was stored on failed media. RAID 5 uses XOR calculations and RAID 6 uses XOR plus Reed-Solomon coding. In both cases, read performance is not affected but write performance is. Calculation overhead can also affect rebuild times, which becomes a serious issue with large drives in big RAID sets. Rebuild time depends not only on the number and size of drives and the amount of data to be restored, but also on available I/O. There is also the risk of additional drive failures during a long rebuild, which can be catastrophic.
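To make the parity math concrete, here is a minimal Python sketch of RAID-5-style XOR parity over a single three-block stripe. The block contents, sizes, and function name are illustrative assumptions, not how any particular RAID implementation lays out data.

```python
def xor_blocks(blocks):
    """XOR a list of equal-length byte blocks together."""
    result = bytearray(len(blocks[0]))
    for block in blocks:
        for i, b in enumerate(block):
            result[i] ^= b
    return bytes(result)

# Three data blocks in one stripe plus one parity block.
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)

# Simulate losing the second block: XOR the surviving blocks with the
# parity block to reconstruct it, the same per-stripe math a RAID 5
# rebuild performs for every stripe on the failed drive.
recovered = xor_blocks([data[0], data[2], parity])
assert recovered == data[1]
```

RAID 6 layers a second, Reed-Solomon-based parity calculation on top of this so that a stripe can survive two simultaneous drive failures.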

For these reasons, RAID-based protection is not practical for very large scale-out storage, especially object-based storage with billions of objects.

Enter Erasure Coding

Erasure coding is not new, but it has gotten more attention over the last couple of years as storage environments grow to petabyte and zettabyte scales. EC offers much faster media rebuild times in large environments: where a RAID rebuild might take weeks, an EC rebuild can take hours.

EC algorithms break data into several fragments, add metadata to each fragment, and store each fragment on a separate disk. Should a drive fail, EC combines the remaining fragments and their metadata to recreate the original data set. The rebuild occurs quickly because recovery does not read all of the data, just enough fragments to recreate the one that failed.
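To illustrate the fragment-and-rebuild mechanics, below is a self-contained Python sketch of a toy Reed-Solomon-style erasure code over the prime field GF(257). The fragment counts, field choice, and function names are illustrative assumptions; production systems typically use systematic codes over GF(2^8) with far more efficient arithmetic.

```python
# Toy k-of-n erasure code: any k of the n fragments can rebuild the data.
# Illustrative sketch only; assumes the payload length is a multiple of k.

P = 257  # prime modulus; every byte value 0..255 fits in GF(257)

def poly_mul(a, b):
    """Multiply two polynomials (coefficient lists, lowest degree first) mod P."""
    out = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            out[i + j] = (out[i + j] + ai * bj) % P
    return out

def encode(data, k, n):
    """Split data into k chunks and emit n fragments as polynomial evaluations."""
    chunks = [data[i::k] for i in range(k)]            # k interleaved source chunks
    fragments = []
    for x in range(1, n + 1):                          # one evaluation point per fragment
        frag = [sum(chunks[c][j] * pow(x, c, P) for c in range(k)) % P
                for j in range(len(chunks[0]))]
        fragments.append((x, frag))
    return fragments

def interpolate(xs, ys):
    """Lagrange interpolation: recover the polynomial's coefficients mod P."""
    coeffs = [0] * len(xs)
    for i, xi in enumerate(xs):
        basis, denom = [1], 1
        for m, xm in enumerate(xs):
            if m != i:
                basis = poly_mul(basis, [-xm % P, 1])  # multiply by (x - xm)
                denom = denom * (xi - xm) % P
        scale = ys[i] * pow(denom, -1, P) % P
        for t, bt in enumerate(basis):
            coeffs[t] = (coeffs[t] + scale * bt) % P
    return coeffs

def decode(fragments, k):
    """Rebuild the original bytes from any k surviving (point, values) fragments."""
    xs = [x for x, _ in fragments[:k]]
    rebuilt = bytearray()
    for j in range(len(fragments[0][1])):
        ys = [vals[j] for _, vals in fragments[:k]]
        rebuilt.extend(interpolate(xs, ys))            # coefficients are the original bytes
    return bytes(rebuilt)

# Encode 20 bytes into 6 fragments; any 4 of them are enough to rebuild.
frags = encode(b"erasure coding demo!", k=4, n=6)
survivors = [frags[0], frags[2], frags[3], frags[5]]   # two fragments "failed"
assert decode(survivors, k=4) == b"erasure coding demo!"
```

With four data-bearing chunks spread across six fragments, any two fragments can be lost and the data still rebuilds from the remaining four; lose a third and no amount of computation will bring the data set back, which is exactly the limitation described next.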

However, nothing is perfect and EC is no exception. There is no recreating data sets past a certain point: if enough drives fail and take their fragments with them, there will not be sufficient fragments left to recreate the data set. Incidentally, this is why neither RAID nor erasure coding is sufficient for data recovery on its own; you must maintain backups for that.

Also, even though EC does not need to read an entire data set, it still needs sufficient CPU cycles to read and recombine the fragments required to recover the data. Although EC rebuilds may be orders of magnitude faster than RAID rebuilds, that compute overhead makes EC more useful for protecting less active production data such as disk-based nearline storage, or object-based storage in the cloud.

RAID remains an excellent choice for smaller data stores and small block data. But as your data grows, consider investing in storage management and data protection products that offer erasure coding.