Statistical Methods for Reliability Data

Chapter 1 - Reliability Concepts and Reliability Data

W. Q. Meeker, L. A. Escobar, and J. K. Freels

06 October 2016

CHAPTER OVERVIEW

This chapter explains...

The background of the SMRD package we'll use throughout this course
Basic ideas behind product reliability
Reasons for collecting reliability data
Distinguishing features of reliability data
General models for reliability data
Examples of reliability data and motivations for collecting the data
A general strategy used for data analysis, modeling, and inference from reliability data

THE SMRD PACKAGE

1.1.1 - Quality And Reliability

What is reliability?

A stochastic measure of the likelihood that a component or system will perform its required function(s) for a specified period of operation under stated operating conditions
- Stochastic - The true measure of the likelihood is random/uncertain due to complex factors affecting how long a given system will survive
- Required function(s) - Systems perform many functions - the function of interest must be specifically defined
- Period of operation - Operating life may be accumulated on a continuous scale (time, distance) or a discrete scale (shots, launches, take-off/landing cycles)
- Operating conditions - Harsh environments may consume operating life more quickly than benign environments - therefore a system's operating environment must be defined
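
In symbols, this definition is commonly written using a reliability (survival) function. The notation below is introduced here for illustration and is an assumption, not something taken from these slides: \(T\) denotes the random time (or cycles) to failure of a unit performing the required function under the stated operating conditions, and \(F(t)\) is its cumulative distribution function.

\[
R(t) = \Pr(T > t) = 1 - F(t), \qquad t > 0
\]

Read this way, "reliability" is always tied to a specific period of operation \(t\); quoting a single reliability number without that period and the operating conditions is incomplete.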

What is quality?

How well a population of manufactured products conforms to the initial design requirements and specifications
- A product can have BOTH high quality AND low reliability
- This highlights the importance of developing good reliability requirements early on in a program

What is maintainability?

The probability that an unavailable or degraded function can be restored to a specified condition within a period of time when maintenance is performed in accordance with prescribed procedures

What is availability?

The probability that a component or system can perform a required function at a given point in time when used under stated operating conditions
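
The definition above is a point-in-time probability. A related long-run summary that is often quoted in practice is the steady-state (inherent) availability, MTBF / (MTBF + MTTR). The Python sketch below illustrates that ratio only; the function name and the example values are hypothetical, and the formula assumes the system alternates between operating and repair periods with stable long-run averages.

```python
def steady_state_availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Long-run fraction of time a system is up: MTBF / (MTBF + MTTR).

    Assumes the system alternates between operating and repair periods
    whose long-run averages are mtbf_hours and mttr_hours.
    """
    if mtbf_hours <= 0:
        raise ValueError("mtbf_hours must be positive")
    if mttr_hours < 0:
        raise ValueError("mttr_hours must be non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Hypothetical values: 500 h between failures, 5 h to restore on average.
availability = steady_state_availability(mtbf_hours=500.0, mttr_hours=5.0)
print(f"Steady-state availability: {availability:.4f}")
```

This makes the link to maintainability visible: shortening the repair time raises availability even when reliability (MTBF) is unchanged.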

What is reliability engineering?

Academic discipline focused on the analysis, characterization, and measurement of system failures to increase system design life and improve system availability by:
- Eliminating and/or reducing the likelihood of failures and safety risks
- Reducing downtime due to maintenance
- Merging many statistical and engineering disciplines together

"...is engineering in its most practical form..." - James R. Schlesinger, U.S. Secretary of Defense (1973-1975)

Why study reliability?

Ever-increasing system complexity and sophistication
Public awareness and insistence on product reliability
Profit considerations resulting from the high cost of failures, repairs, and warranty programs
Contractual requirements to meet reliability and maintainability performance specifications
Laws and regulations concerning product liability and safety
- Food and Drug Act
- Flammable Fabrics Act
- Federal Hazardous Substance Act
- National Traffic and Motor Vehicle Safety Act
- Fire Research and Safety Act
- Child Protection and Toy Safety Act
- Poison Prevention Packaging Act
- Occupational Safety and Health Act
- Federal Boat Safety Act
- Consumer Product Safety Act

Approaches to meeting reliability requirements

Engineering (deterministic) approach
- Prevent failures by designing in a safety factor of 4 to 10 times the expected average stress
- Can result in overdesigned products leading to dramatically increased costs
- Can also result in under-designed products if an unanticipated load or a material weakness results in a failure
Probabilistic approach
- Treats failures as random events
- In theory, if we understood the exact physics and chemistry of a failure process, many failures of a component could be predicted with certainty
- With limited data on the state of a component, and incomplete knowledge of the processes that cause failures, failures will appear to occur at random over time
- This random process may exhibit a pattern which can be modeled by some probability distribution

The product reliability environment

Product consumers & producers often have different priorities when it comes to product reliability
However, both usually work hard to save money

The EXTENDED product reliability environment

Scientists
- Develop improved materials that can enable new capabilities
- Find more efficient system architectures
Governments
- Issue new safety requirements (for products and employees)
- Create new environmental regulations
Competitors
- Develop newer/better/cheaper products that can impact your profit margins
- Compete to get their products to market before yours
Test experts
- Develop new test capabilities to find and remove failures
- Derive new analysis techniques to reduce test time or number of samples

1.1.2 - Reasons For Collecting Reliability Data

Why should reliability data be collected?

Assess performance characteristics of materials over design life
Predict product reliability
Assess the effect of a proposed design change
Compare components from different manufacturers
Assess product reliability in the field
Check the veracity of an advertising claim
Track new failure modes
Predict product warranty costs
Ensure safety requirements are met

1.1.3 - Distinguishing Features of Reliability Data

Reliability data structures can be very complex (closed-form solutions rarely exist)
Software is required to solve all but the most basic problems (this class will use R)
Data are typically censored
Observations (e.g., time or cycles to failure) are strictly positive
Estimating model parameters is usually not the primary interest
Extrapolation is often required when applying accelerated stresses

EXAMPLES OF RELIABILITY DATA

This section lists several real-world examples of reliability data sets
Demonstrates the wide range of reliability data structures
- Right, left, and interval censoring
- Multiple failure modes
- Different usage measures (time, flight hours, miles driven, rounds fired)
- Explanatory variables (accelerated stresses, differing usage environments)
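
The three censoring types listed above can all be stored the same way: as a lower and an upper bound on the unknown failure time. The Python sketch below is one possible representation, not the format used by the SMRD package; the class name, field names, and data values are hypothetical.

```python
from dataclasses import dataclass
from math import inf


@dataclass(frozen=True)
class Observation:
    """A failure time known only to lie between lower and upper."""
    lower: float
    upper: float

    def __post_init__(self):
        if not 0.0 <= self.lower <= self.upper:
            raise ValueError("require 0 <= lower <= upper")

    def kind(self) -> str:
        if self.lower == self.upper:
            return "exact"              # failure time observed directly
        if self.upper == inf:
            return "right-censored"     # still working when observation stopped
        if self.lower == 0.0:
            return "left-censored"      # already failed at first inspection
        return "interval-censored"      # failed between two inspections


# Hypothetical observations, in hours.
observations = [
    Observation(120.0, 120.0),
    Observation(500.0, inf),
    Observation(0.0, 48.0),
    Observation(200.0, 250.0),
]
kinds = [obs.kind() for obs in observations]
print(kinds)
```

Storing every observation as an interval keeps the data set uniform even when inspection schedules or test stop times differ between units.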

1.2.1 - Failure Data (No Explanatory Variables)

Explanatory variables

Explanatory variables are commonly used in regression modeling or design of experiments
- May be called regressors or experimental factors in certain contexts
- Used to reduce uncertainty in responses by incorporating differences in how units were tested
In the context of reliability, explanatory variables can be used to define the severity of an environment to which a system was exposed
- The severity of the environment determines how quickly a product's "life" is consumed
- If it can be reasonably assumed that each unit was exposed to an equivalent environment, explanatory variables may be ignored and the failure observations \(t_i, i=1,2,\ldots\) are assumed to be independent and identically distributed (\(iid\))
- Assuming the \(t_i\) are \(iid\) for all \(i\) implies that any differences in the sample population or test process are captured by the variance of the failure times, \(Var[t_i]\)

Example 1.1 - Ball Bearing Fatigue Test

Modeling time-to-event data can produce poor results if

The time-scale used to measure operating life differs between test units
- Company A specifies the lifetime of landing gear components in flight hours
- Company B specifies the lifetime of landing gear components in take-off/landing cycles
The severity of the test environment is not considered

Example 1.2

Example 1.5

1.2.2 - Failure Data (w/ Explanatory Variables)

Analyzing data with explanatory variables

Often, units are tested at multiple environments with differing levels of severity
- To drive failures more quickly
- To reduce the overall test time
Recall, explanatory variables can be used to relate how quickly "life" is consumed in each environment
Changing the value of an explanatory variable changes the value of one or more parameters in the underlying failure distribution (think regression)
Complex models are often required to compare life consumption rates between severity levels
Examples
- Cracks grow faster when a higher level of stress is applied
- Tire treads wear faster when driven on gravel roads
- Metal corrodes faster when it is exposed to humid environments
It is critical that the results observed at higher stress environments represent the behavior observed at the use-level stress
- We could subject a test unit to a temperature that would cause some component to melt, but that would not give any useful information about the time to failure out in the field
- We can generate failures very quickly, even instantly in some cases
- But, the desire to reduce the total test time should not cause failures that would never be observed in the operating environment
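
As a concrete instance of an explanatory variable changing how fast life is consumed, temperature acceleration is often described with the Arrhenius relationship. The slides above do not name a specific model, so the Arrhenius form is an assumption made here for illustration; the activation energy and temperatures in this Python sketch are made-up values.

```python
from math import exp

BOLTZMANN_EV_PER_K = 8.617333262e-5  # Boltzmann constant in eV/K


def arrhenius_acceleration_factor(activation_energy_ev,
                                  use_temp_c, stress_temp_c):
    """Ratio of life at the use temperature to life at the stress temperature.

    Assumes the failure mechanism follows the Arrhenius relationship and
    that the SAME mechanism is active at both temperatures.
    """
    use_temp_k = use_temp_c + 273.15
    stress_temp_k = stress_temp_c + 273.15
    exponent = (activation_energy_ev / BOLTZMANN_EV_PER_K) * (
        1.0 / use_temp_k - 1.0 / stress_temp_k
    )
    return exp(exponent)


# Hypothetical mechanism with Ea = 0.7 eV, used at 40 C and tested at 85 C.
af = arrhenius_acceleration_factor(0.7, use_temp_c=40.0, stress_temp_c=85.0)
print(f"One test hour at 85 C consumes roughly {af:.0f} use hours at 40 C")
```

The calculation is only meaningful inside the range where the failure mechanism is unchanged. Raising the test temperature until a component melts would make the formula return a large number that says nothing about field life.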

Example 1.8

1.2.3 - Degradation Data (No Explanatory Variables)

Degradation failures

Occur when the value of a system performance measure crosses above or below a critical value
Soft degradation failures - Occur when a performance measure crosses an acceptable level
- Power window motor raises/lowers windows too slowly
- Tire tread depth falls below a safe level
Hard degradation failures - Occur when a physical measure crosses a failure level
- Power window motor gear teeth wear down to the point where the motor fails
- Tire wear becomes so extreme that the tire fails
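
Turning a degradation path into a failure time amounts to finding when the measured quantity first crosses the critical value. The Python sketch below does this with linear interpolation between inspections; the function name, the tread-depth readings, and the 3.0 mm threshold are all hypothetical.

```python
def first_crossing_time(times, values, threshold, failing_below=True):
    """Interpolated time at which values first cross threshold.

    Returns None when the path never crosses, in which case the unit
    would be treated as right-censored at the last inspection time.
    """
    def failed(y):
        return y <= threshold if failing_below else y >= threshold

    if failed(values[0]):
        return times[0]  # already past the critical value at first reading
    pairs = list(zip(times, values))
    for (t0, y0), (t1, y1) in zip(pairs, pairs[1:]):
        if failed(y1):
            # Linear interpolation between the two bracketing readings.
            return t0 + (threshold - y0) * (t1 - t0) / (y1 - y0)
    return None


# Hypothetical tire: tread depth (mm) measured every 10 thousand miles.
miles_k = [0.0, 10.0, 20.0, 30.0, 40.0]
tread_mm = [8.0, 6.5, 5.0, 3.5, 2.0]
soft_failure_miles_k = first_crossing_time(miles_k, tread_mm, threshold=3.0)
print(f"Tread reaches 3.0 mm at about {soft_failure_miles_k:.1f} thousand miles")
```

Setting `failing_below=False` handles measures that fail by rising above a limit, such as the time a motor takes to raise a window.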

1.3.1 - Define The Target Population Or Process

Enumerative studies

Used to ASSESS the performance of MATURE products
Most statistics courses discuss enumerative studies
Test a random sample from a population of units
Observe test results, analyze data, and make conclusions
BUT, what if your conclusion is these parts suck and don't meet the reliability requirement?

Analytic Studies

Used to IMPROVE the performance of IMMATURE products
Most reliability tests are Analytic Studies
You improve the design, and now...
Your test data is based on a prior design that doesn't exist anymore

1.3.2 - Causes of Failure and Degradation Leading To Failure

Most failures are due to degradation

Failure results when a flaw is exposed to a severe environment for a sufficient period of time.
- Flaws can result from poor designs
- Flaws can result from manufacturing
- Flaws can result if the usage environment is not well understood
The environment in which the system operates imparts stresses on the system which over time can
- Extend microcracks
- Loosen joints
- Weaken electrical connections
- Magnify vibrations
- Elevate internal temperatures
These enlarged flaws weaken the system until the environmental stresses exceed the flaw's residual strength

1.3.3 - Environmental Effects On Reliability

It's well understood that a system's performance is affected by its operating environment
It's less understood how many different environments a system is actually exposed to

1.3.4 - Definition Of Time-Scale

Most systems' lifetimes can be quantified with more than one "time-scale"

Example: A car
- Timescale 1: Years of ownership
- Timescale 2: Miles driven
- Timescale 3: Number of times started
Each customer may have their own time-scale of interest
Each system component has a time-scale that is most appropriate

1.3.5 - Defining Time Origin And Failure Time

When does a product's life begin? (This can be tricky)

When it leaves the assembly line?
When it ships from factory to a retailer?
When it is purchased by a customer?
When it is installed/unboxed?
When it is first used?
Is it even possible to know when each of these events occurs?

Likewise, defining when a product's life ends can be difficult

When the customer realizes it has stopped functioning?
When a warranty claim is made?
When it is received by the manufacturer?

How will we define time origin and failure time in this course?

As this is a statistics course, defining time origin and failure time is not a concern
However, recognize that these issues must be considered when collecting time-to-event data
As part of the class project you will collect your own reliability data!

1.4.1 - Reliability Data (Non-Repairable Units)

Non-repairable systems are replaced - not repaired - upon failure

Non-safety-critical and inexpensive items are often non-repairable since replacement is far cheaper than repair
- Light bulbs
- Small appliances
- Consumable items (filters, belts, hoses)
Safety-critical and expensive items may also be non-repairable if they must be replaced prior to failure when some degradation level is reached
- Tires (tread depth)
- Computers (processing speed, obsolescence)
Non-repairable systems may be represented by a transition diagram with one working state and one or more absorbing failure states
- State 0: Initial working state
- States 1, 2,...: Absorbing failed states
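
A transition diagram of this kind can be written down as a mapping from each state to the states reachable from it; a state with no outgoing transitions is absorbing. The Python sketch below is a minimal illustration with two hypothetical failure modes, not a representation used by the course software.

```python
# State 0 is the working state; states 1 and 2 are two failure modes.
# Each key maps to the set of states reachable in one transition.
transitions = {
    0: {1, 2},   # a working unit can fail by either mode
    1: set(),    # failed: no way out, so this state is absorbing
    2: set(),    # failed: no way out, so this state is absorbing
}


def absorbing_states(transitions):
    """States that cannot be left once entered."""
    return sorted(state for state, nxt in transitions.items() if not nxt)


def is_non_repairable(transitions, working_state=0):
    """True when every state other than the working state is absorbing."""
    others = [state for state in transitions if state != working_state]
    return all(not transitions[state] for state in others)


print(absorbing_states(transitions))
print(is_non_repairable(transitions))
```

Adding a transition from a failed state back to state 0 would make that state transient, and the same check would then report the system as repairable.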

1.4.2 - Reliability Data (Repairable Systems)

Repairable systems - costs justify repair, rather than replacement

Graphically represented by two or more transient states

Most systems - have repairable AND non-repairable failure modes

1.5.1 - Planning A Reliability Study

Clearly define the purpose of the current test

Reliability tests are often campaigns composed of many disparate tests
Each test will have its own goals
The type(s) of test used depend on where the system is in its development
The goal of a particular test may not sync with the overall goal of the test campaign

Types of reliability tests

Developmental Tests - Improve the performance of immature systems by finding & removing flaws
Operational Tests - Assess the performance of mature systems in the operational environment

Define the test resources available (test equip, ranges, personnel, time)

Developmental test resources
- Environmental test chambers
- Performance diagnostic equipment
- Data acquisition systems
- Personnel - technical SMEs for root cause analyses
Operational test resources
- Test ranges
- Maintenance equipment
- Personnel - operational SMEs for realistic performance exercises

Understand the environment(s) in which the fielded system will operate

Test environments should be similar to the fielded environment
- Ensure test results are representative of fielded operations
- Consumers want the test environment to be more severe to ensure more failure modes are found and removed
- Producers want the test environment to be less severe to ensure that their product successfully passes the test
Fielded systems will be exposed to multiple stresses simultaneously
- Every failure mode will respond to each stress differently
- If a failure mode is sensitive to thermal loads, a vibration test may not produce meaningful results

What metric(s) will be used to validate performance?

MTBF - Mean time between failures
MTBM - Mean time between maintenance (any maintenance?)
MTBCF - Mean time between critical failure (what is a critical failure?)
MMH/OH - Maintenance man-hours per operational hour
Do the data being gathered in the test support the chosen metric(s)?
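
The simplest point estimates of two of these metrics are plain ratios, which shows exactly which quantities a test must record to support them. The Python sketch below uses made-up test totals; what counts as a "failure" or as "maintenance" must be decided before the data are collected, and that decision is assumed to have been made already.

```python
def mtbf(total_operating_hours, failure_count):
    """Point estimate of MTBF: total operating time per observed failure."""
    if failure_count <= 0:
        raise ValueError("point estimate is undefined with zero failures")
    return total_operating_hours / failure_count


def mmh_per_oh(maintenance_man_hours, total_operating_hours):
    """Maintenance man-hours spent per operational hour."""
    if total_operating_hours <= 0:
        raise ValueError("total_operating_hours must be positive")
    return maintenance_man_hours / total_operating_hours


# Hypothetical test: four units, five failures, 48.25 man-hours of repair.
unit_hours = [250.0, 310.0, 180.0, 225.0]
failure_count = 5
maintenance_man_hours = 48.25

total_hours = sum(unit_hours)
mtbf_estimate = mtbf(total_hours, failure_count)
mmh_oh_estimate = mmh_per_oh(maintenance_man_hours, total_hours)
print(f"MTBF: {mtbf_estimate:.1f} h, MMH/OH: {mmh_oh_estimate:.3f}")
```

If the test does not log operating hours per unit or man-hours per maintenance action, neither ratio can be computed afterward, which is the point of the closing question above.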

Define how precise the results should be (sample size)

Often, data from DT cannot be used to improve the estimates in OT
This is a very active research area

1.5.2 - Strategy For Data Analysis And Modeling

This process will be used throughout the course to assess reliability data

Assess the data nonparametrically - using simple graphical methods
Fit one or more simple parametric models to the data
Check for violations of model assumptions
Compute model parameter estimates and confidence intervals
Use numeric and graphical methods to check fit to data
Perform sensitivity analyses on parameter values and model assumptions
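
For the first step, a common nonparametric choice with right-censored data is the Kaplan-Meier estimate of the survival function. The slides do not name a specific estimator at this point, so Kaplan-Meier is an assumption made here; the Python sketch below is a bare-bones version with made-up data, and in the course the equivalent computation would be done in R.

```python
def kaplan_meier(times, failed):
    """Kaplan-Meier survival estimate at each distinct failure time.

    times  : observed time for each unit
    failed : True if the unit failed at that time, False if right-censored
    Returns a list of (time, estimated survival probability) pairs.
    """
    data = sorted(zip(times, failed))
    at_risk = len(data)
    survival = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        failures_here = 0
        removed_here = 0
        # Gather every unit whose observation ends at time t.
        while i < len(data) and data[i][0] == t:
            failures_here += 1 if data[i][1] else 0
            removed_here += 1
            i += 1
        if failures_here:
            survival *= 1.0 - failures_here / at_risk
            curve.append((t, survival))
        at_risk -= removed_here  # failed and censored units both leave
    return curve


# Hypothetical test: five units; two were removed while still working.
times = [5.0, 8.0, 8.0, 12.0, 15.0]
failed = [True, True, False, True, False]
km_curve = kaplan_meier(times, failed)
for t, s in km_curve:
    print(f"t = {t:5.1f}  S(t) = {s:.2f}")
```

Censored units still count in the at-risk total up to the time they leave the test, which is how the estimate uses their partial information instead of discarding it.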

CHAPTER 1 SUMMARY

This chapter presented:

Examples of time-to-failure data for different types of systems
A strategy for estimating system performance metrics from the data
The most common reliability metric is the failure-time distribution
Many failure-time processes are modeled using a continuous scale (e.g., time)