MTBF: What Does It Really Mean?

You may see disk manufacturers touting that their drives have a Mean Time Between Failures (MTBF) or Mean Time To Failure (MTTF) of a million hours or so. You will never run a disk for anywhere near that long, so what does that figure mean in terms of what you can actually expect from a disk? Here is what disk manufacturer Seagate Technology LLC says about the matter:

“It is common to see MTBF ratings between 300,000 to 1,200,000 hours for hard disk drive mechanisms, which might lead one to conclude that the specification promises between 30 and 120 years of continuous operation. This is not the case! The specification is based on a large (statistically significant) number of drives running continuously at a test site, with data extrapolated according to various known statistical models to yield the results.

Based on the observed error rate over a few weeks or months, the MTBF is estimated and not representative of how long your individual drive, or any individual product, is likely to last… Historically, the field MTBF, which includes all returns regardless of cause, is typically 50-60% of projected MTBF.” (http://knowledge.seagate.com/articles/en_US/FAQ/174791en?language=en_US)
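
To see how a headline MTBF figure relates to something more tangible, here is a minimal sketch in Python. It assumes a constant failure rate (an exponential lifetime model), which is a simplification for illustration and not necessarily the statistical model any manufacturer actually uses. It converts an MTBF figure into the naive “years of continuous operation” reading and into the annual failure rate such an MTBF would imply across a large fleet of drives:

```python
# A minimal sketch, assuming a constant failure rate (exponential lifetime
# model). Illustration only; not any manufacturer's actual methodology.
import math

HOURS_PER_YEAR = 24 * 365  # 8,760 hours of continuous operation

def naive_years(mtbf_hours: float) -> float:
    """The misleading 'divide by hours per year' reading of an MTBF figure."""
    return mtbf_hours / HOURS_PER_YEAR

def implied_afr(mtbf_hours: float) -> float:
    """Annual failure rate a large fleet would see if the MTBF actually held."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

for mtbf in (300_000, 1_000_000, 1_200_000):
    print(f"MTBF {mtbf:>9,} h: naive reading ~{naive_years(mtbf):.0f} years, "
          f"implied AFR {implied_afr(mtbf):.2%}")
```

For a 1,000,000-hour MTTF this works out to an implied annual failure rate of roughly 0.9%, which is the kind of datasheet-derived figure the CMU study below compares against what actually happens in the field.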

So, setting aside the MTBF metric, how often do disks actually fail? Three studies published over the last several years address this question. (Note that the data in these studies does not apply to Zetta.net’s own enterprise-class storage system, which makes heavy use of Flash and SSDs to boost performance and relies on a RAIN 6 architecture for reliability, but it does apply to the drives many of our customers use.)

In 2007, both Google and Carnegie Mellon University (CMU) presented papers on their experiences at the 5th USENIX Conference on File and Storage Technologies (FAST07). CMU’s paper, Disk failures in the real world: What does an MTTF of 1,000,000 hours mean to you? (http://static.usenix.org/events/fast07/tech/schroeder/schroeder.pdf), looked at data on about 100,000 disks, some with a lifespan of five years. They found that, while the MTTFs on the data sheets suggested an annual failure rate of no more than 0.88%, “in the field, annual disk replacement rates typically exceed 1%, with 2-4% common and up to 13% observed on some systems.” Disk failure rates were found to rise steadily with age rather than setting in only after a nominal lifetime of five years. “Interestingly, we observe little difference in replacement rates between SCSI, FC and SATA drives, potentially an indication that disk-independent factors, such as operating conditions, affect replacement rates more than component specific factors.”
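
Replacement rates in studies like CMU’s are annualized per drive-year of service rather than per calendar year, since drives enter and leave a fleet at different times. Here is a rough sketch of that bookkeeping, using made-up fleet numbers purely for illustration (not CMU’s data):

```python
# Hypothetical fleet numbers, purely for illustration -- not CMU's data.
# Annualized replacement rate = replacements / drive-years of service.
fleet = [
    # (number of drives, years each was in service, replacements observed)
    (10_000, 1.0, 250),
    (10_000, 0.5, 90),
    (5_000,  2.0, 400),
]

drive_years = sum(count * years for count, years, _ in fleet)
replacements = sum(replaced for _, _, replaced in fleet)

print(f"{replacements} replacements over {drive_years:,.0f} drive-years "
      f"-> annualized replacement rate of {replacements / drive_years:.1%}")
```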

Google’s FAST07 paper Failure Trends in a Large Disk Drive Population (http://static.usenix.org/events/fast07/tech/full_papers/pinheiro/pinheiro.pdf) looked at that company’s experience with more than 100,000 serial and parallel ATA consumer-grade HDDs spinning at 5400 to 7200 rpm. The disks comprised at least nine different models from seven different manufacturers, with capacities from 80 to 400 GB. Google found that disks had an annualized failure rate (AFR) of 3% for the first three months, dropping to 2% for the first year. In the second year the AFR climbed to 8% and stayed in the 6% to 9% range for years three through five.

In January 2014, consumer online backup vendor Backblaze published the failure rates of the 27,134 consumer-grade disks it was using (http://blog.backblaze.com/2014/01/21/what-hard-drive-should-i-buy/). The results varied widely by vendor. The Hitachi 2TB, 3TB and 4TB drives all came in with AFRs under 2%. Certain Seagate drives performed much worse, particularly in the 1.5TB size, where the Barracuda LP had a 9.9% AFR, the Barracuda 7200 a 25.4% AFR, and the Barracuda Green a 120% AFR. However, these failure rates don’t mean that Backblaze is unhappy with all of its Seagate drives. As Backblaze engineer Brian Beach stated, “The Backblaze team has been happy with Seagate Barracuda LP 1.5TB drives. We’ve been running them for a long time – their average age is pushing 4 years. Their overall failure rate isn’t great, but it’s not terrible either.” He also said, however, that “The non-LP 7200 RPM drives have been consistently unreliable. Their failure rate is high, especially as they’re getting older.”
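
An annualized failure rate above 100%, like the Barracuda Green’s 120%, looks impossible until you remember that it is failures per drive-year, not failures per drive. If the drives in question have each been in service for only a fraction of a year and a large share of them fail, the annualized figure can exceed 100%. A toy example with invented numbers (not Backblaze’s actual data):

```python
# Invented numbers for illustration -- not Backblaze's actual data.
drives_deployed = 50
avg_years_in_service = 0.4        # each drive ran for roughly five months
failures = 24

drive_years = drives_deployed * avg_years_in_service   # 20 drive-years
print(f"{failures} failures over {drive_years:.0f} drive-years "
      f"-> AFR of {failures / drive_years:.0%}")
```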

So, what does all this mean? By all means read the full reports cited above for additional detail on differences between models, what causes disks to fail, and how to predict when a particular disk will fail. The main lesson is that disks do fail, and they fail far more often than many people expect. This is especially true of consumer-class drives. That doesn’t mean you shouldn’t use them; their cost advantage makes them a great choice for many applications.

But the high rate of disk failures, with all types of drives and under a wide variety of operating conditions, means that you definitely need a robust backup system no matter what you are using as your primary storage.