MTBF
A MEASURE OF OEM DISK DRIVE RELIABILITY

IBM STORAGE SYSTEMS DIVISION

INTRODUCTION

As a supplier of small-form-factor disk drives to original equipment manufacturers (OEMs), IBM's Storage Systems Division (SSD) is committed to delivering products that lead the industry in quality and reliability. Efforts to maximize drive reliability are at the forefront of the company's design, development, and manufacturing processes. Progress in this quest is assessed through the use of a reliability measure known as mean time between failure (MTBF).

Because MTBF is widely used by suppliers of small-form-factor disk drives and is regarded throughout the industry as an effective gauge of reliability, it is important that customers be able to interpret the relationship between MTBF claims and their own expectations for drive reliability in their applications. In general, higher MTBF correlates with fewer drive failures; but an MTBF claim is not a guarantee of product reliability and does not represent a condition of warranty. The purpose of this paper is to help clarify the meaning of MTBF by addressing some frequently asked questions regarding it.

QUESTIONS AND ANSWERS

1. What is MTBF?
MTBF is the mean of a distribution of product lifetimes. It is often estimated by dividing the total operating time accumulated by a "defined group" of drives within a given time period by the total number of failures in that time period.

2. What is this "defined group" of drives?
This is a group of drives that:
- have not reached end-of-life (typically five to seven years),
- are operated within a specified reliability temperature range, under normal usage conditions, and
- have not been damaged or abused.

3. What is considered to be a failure?
Any event that prevents a drive from performing its specified operations, given that the drive meets the group definition described in question 2.
This includes drives that fail during shipment and during what is frequently referred to as the "early life period" (failures typically resulting from manufacturing defects). It does not include drives that fail during integration into OEM system units or as a result of mishandling, nor does it encompass drives that fail beyond end-of-life.

4. If I purchase a drive with an MTBF of 1,000,000 hours (114 years), can I expect the drive to operate without failure for 1,000,000 hours?
No, because the drive will reach end-of-life before reaching 1,000,000 hours. For example, a continuously operated drive with a five-year useful life will reach end-of-life in less than 45,000 hours. But, theoretically, if the drive is replaced with a new drive when it reaches end-of-life, and each replacement is in turn replaced with a new drive when it reaches end-of-life, and so on, then the probability that 1,000,000 hours would elapse before a failure occurs would be greater than 30 percent in most cases.

5. If I purchase 1000 drives with an MTBF of 1,000,000 hours, how many can I expect to fail over a five-year period?
Assuming that any failed drive is replaced with a new drive having the same reliability characteristics and that the drives are used continuously, then the number of failures, r, you can expect is (≈ means "approximately equals"):

        (1000 drives) x (43,800 hours/drive)
    r ≈ ------------------------------------
              1,000,000 hours/failure

    Therefore r ≈ 44 failures

Note that this number is subject to statistical variation (1).

If the drives are operated for 16 hours per day instead of 24 hours per day, then the number of failures you can expect is:

        (1000 drives) x (29,200 hours/drive)
    r ≈ ------------------------------------
              1,000,000 hours/failure

    Therefore r ≈ 29 failures

6. IBM reports a "predicted MTBF." What does this mean?
It is very costly and time-consuming to actually measure high MTBFs with a reasonable degree of precision. Therefore, to assess the reliability of a new disk drive prior to volume production, reliability data from past products and from component and assembly tests are merged to create a mathematical model of the drive's reliability characteristics. The outcome of that modeling process is the "predicted MTBF." After volume production gets under way, actual field failure data is used to check the validity of the model.

7. If I buy drives that have a "predicted MTBF" of 1,000,000 hours, can I expect to achieve 1,000,000 hours MTBF from those drives?
Yes, given the conditions stated in question 2. The actual MTBF measured from any specific set of drives will depend on the usage and the environmental conditions the drives experience. Stressing a drive beyond normal usage conditions may reduce the actual MTBF to a point below the "predicted MTBF."

Generally, reliability decreases as temperature increases, so drives that are operated in warm environments with poor airflow will tend to have a lower MTBF than those operated in cool environments with good airflow. Drives that experience a high seek rate tend to have somewhat lower reliability than those that experience a low seek rate. And drives that are in portable equipment tend to be subject to higher levels of shock and vibration, which also degrade reliability.

Furthermore, because MTBF can only be measured using statistical methods, any measurement will be subject to statistical variation. The degree of variation depends on the number of drives included in the measurement. With more drives, less variation can be expected.

8. I have seen the reliability of drives characterized by the "CDF". What is CDF?
CDF is an acronym for "cumulative distribution function." It is a mathematical function that defines the probability that a drive will fail prior to some point in time.
For example, a drive with a CDF equal to four percent at five years has a four percent chance of failing sometime within its first five years of operation.

CDF can also be used to determine the number of expected failures from a group of drives. For example, say that 1000 drives are put into service simultaneously. If the CDF equals four percent at five years, then four percent of the drives, or 40, can on average be expected to fail within five years. It should be noted that if each failed drive is replaced with a new drive, the total number of failures over the five-year period will, on average, be higher than 40, since some of the replacement drives may also fail.

9. Can I compare the predicted MTBFs reported by IBM with MTBF claims by other drive vendors?
Yes, given that the assumptions behind the claims are the same. Because there is no established industry standard for calculating or reporting MTBF, other vendors may not include early life failures, and/or may not specify the same end-of-life. In general, differences such as these will affect the MTBF claim.

10. Does a predicted MTBF imply a warranty?
No. A predicted MTBF provides a reliability indicator for disk drives. It is not a guarantee of product reliability and does not represent a condition of warranty. Contact your IBM sales representative for answers to warranty questions.

********************************************************************

(1) In this example, because of statistical variation, there is approximately a 90 percent probability that the actual number of failures will be between 33 and 55.
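
The arithmetic behind questions 4, 5, and 8 and footnote (1) can be reproduced with a short Python sketch. It is illustrative only and rests on an assumption the paper does not state: that drives fail at a constant rate during their useful life, so lifetimes are exponential and failure counts follow a Poisson distribution. The function names are hypothetical and are not part of any IBM tool.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 hours, matching the paper's 43,800 hours


def expected_failures(drives, hours_per_drive, mtbf_hours):
    """Expected failures with replacement: total operating hours / MTBF."""
    return drives * hours_per_drive / mtbf_hours


def poisson_pmf(k, mean):
    """P(exactly k failures), ASSUMING Poisson-distributed failure counts."""
    # Evaluated in log space so that large k does not overflow.
    return math.exp(k * math.log(mean) - mean - math.lgamma(k + 1))


def prob_between(low, high, mean):
    """P(low <= failures <= high) under the same Poisson assumption."""
    return sum(poisson_pmf(k, mean) for k in range(low, high + 1))


def cdf_constant_rate(hours, mtbf_hours):
    """P(a drive fails before `hours`), ASSUMING a constant failure rate."""
    return 1.0 - math.exp(-hours / mtbf_hours)


MTBF = 1_000_000                      # hours, the figure used in the paper
five_years = 5 * HOURS_PER_YEAR       # 43,800 hours of continuous operation

r_24h = expected_failures(1000, five_years, MTBF)            # question 5, 24 h/day
r_16h = expected_failures(1000, five_years * 16 / 24, MTBF)  # question 5, 16 h/day
p_33_55 = prob_between(33, 55, r_24h)                        # footnote (1)
p_no_failure = math.exp(-1_000_000 / MTBF)                   # question 4
cdf_5y = cdf_constant_rate(five_years, MTBF)                 # question 8

print(f"expected failures, 24 h/day: {r_24h:.1f}")
print(f"expected failures, 16 h/day: {r_16h:.1f}")
print(f"P(33 to 55 failures):        {p_33_55:.3f}")
print(f"P(no failure in 1M hours):   {p_no_failure:.3f}")
print(f"CDF at five years:           {cdf_5y:.4f}")
```

Under this assumption, the expected counts are 43.8 and 29.2, which round to the 44 and 29 given in question 5. The probability of no failure in 1,000,000 hours is e^-1, about 37 percent, consistent with "greater than 30 percent" in question 4. The five-year CDF comes out a little above four percent, in line with the example in question 8. Real drives need not follow a constant failure rate, particularly in the early life period, so these figures are a consistency check and not a prediction.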