W. Q. Meeker and L. A. Escobar
20 May 2020
Basic ideas behind product reliability
Reasons for collecting reliability data
Distinguishing features of reliability data
General models for reliability data
Examples of reliability data and motivations for collecting the data
A general strategy used for data analysis, modeling, and inference from reliability data
A
How well a population of manufactured products conforms to the initial design requirements and specifications
A product can have BOTH high quality AND low reliability
This highlights the importance of developing good reliability requirements early on in a program
Academic discipline focused on the analysis, characterization, and measurement of system failures to increase system design life and improve system availability by:
Eliminating and/or reducing the likelihood of failures and safety risks
Reducing downtime due to maintenance
Merging many statistical and engineering disciplines together
…is engineering in its most practical form…
James R. Schlesinger U.S. Secretary of Defense (1973-1975)
Ever-increasing system complexity and sophistication
Public awareness and insistence on product reliability
Profit considerations resulting from the high cost of failures, repairs, and warranty programs
Contractual requirements to meet reliability and maintainability performance specifications
Laws and regulations concerning product liability and safety
Food and Drug Act
Flammable Fabrics Act
Federal Hazardous Substance Act
National Traffic and Motor Vehicle Safety Act
Fire Research and Safety Act
Child Protection and Toy Safety Act
Poison Prevention Packaging Act
Occupational Safety and Health Act
Federal Boat Safety Act
Consumer Product Safety Act
Engineering (deterministic) approach
Prevent failures by designing in a safety factor of 4 to 10 times the expected average stress
Can result in overdesigned products leading to dramatically increased costs
Can also result in under-designed products if an unanticipated load or a material weakness results in a failure
Probabilistic approach
Treats failures as random events
In theory, if we understood the exact physics and chemistry of a failure process, many failures of a component could be predicted with certainty
With limited data on the state of a component, and incomplete knowledge of the processes that cause failures, failures will appear to occur at random over time
This random process may exhibit a pattern which can be modeled by some probability distribution
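The idea that deterministic failures can look random when the state of the component is unknown can be sketched with a small simulation. Everything here is an assumption made for illustration: the power-law relation between flaw size and life, the constants, and the lognormal flaw-size distribution are hypothetical, not taken from any data set in these notes.

```r
# Hypothetical illustration: failure time is a deterministic function of an
# unobserved initial flaw size, yet the observed failure times look random.
set.seed(1)

n_units   <- 1000
flaw_size <- rlnorm(n_units, meanlog = log(0.1), sdlog = 0.3)  # assumed, unobserved

# Assumed deterministic "physics": life shrinks as a power of the flaw size.
life_constant <- 50   # hypothetical constant
life_exponent <- 2    # hypothetical exponent
failure_time  <- life_constant / flaw_size^life_exponent

# With perfect knowledge of the flaw size, each failure time is predicted
# exactly (log life is an exact linear function of log flaw size).
perfect_knowledge_cor <- cor(log(flaw_size), log(failure_time))

# Without the flaw size, the failure times simply look like draws from a
# lognormal distribution whose spread is inherited from the flaw sizes.
log_time_sd <- sd(log(failure_time))
flaw_log_sd <- sd(log(flaw_size))

print(quantile(failure_time, c(0.1, 0.5, 0.9)))
```

In this sketch the apparent randomness comes entirely from the unobserved flaw sizes, which is why a probability distribution is a reasonable description of the failure times.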
Consumers & producers often have different priorities when it comes to product reliability
par(mar = c(0,0,0,0))
plot(NA,
xlim = range(0,500),
ylim = range(25,525),
axes = FALSE,
xlab = '',
ylab = '')
points(x = 250,
y = 250,
col = 'green',
cex = 13,
lwd = 4,
pch = 16)
text(x = c(100,250,400),
y = c(475,250,475),
labels = c(expression(underline(' Producers ')),
'$',
expression(underline(' Consumers '))),
cex = c(4, 6, 4),
font = 2,
col = c(1, 0, 1))
text(x = rep(100, 5),
y = seq(425, 75, -75),
labels = c('Quick to Market',
'Minimize Warranty Costs',
'Reduce LCC',
'Reduce Liability Costs',
'Maintain Market Share'),
cex = rep(2.5, 5),
family = 'serif')
text(x = rep(400, 5),
y = seq(425, 75, -75),
labels = c('Newer Capabilities',
'Improve Current Functions',
'Merge Functions',
'Expect High Performance',
'Expect to Last'),
cex = rep(2.5,5),
family = 'mono')
Develop improved materials that can enable new capabilities
Find more efficient system architectures
Issue new safety requirements (for products and employees)
Create new environmental regulations
Develop newer/better/cheaper products that can impact your profit margins
Competition to get your product to market first
Develop new test capabilities to find and remove failures
Derive new analysis techniques to reduce test time or number of samples
Assess performance characteristics of materials over design life
Predict product reliability
Assess the effect of a proposed design change
Compare components from different manufacturers
Assess product reliability in field
Checking the veracity of an advertising claim
Track new failure modes
Predict product warranty costs
Ensure safety requirements are met
Reliability
Software required to solve all but the most basic problems (this class will use R)
Data are typically
Observations (e.g., time or cycles to failure) are strictly positive
Estimating model parameters is usually not the primary interest
This section lists several real-world examples of reliability data sets
Demonstrates the wide range of reliability data structures
Right, left, and interval censoring
Multiple failure modes
Different usage measures (time, flight hours, miles driven, rounds fired)
Explanatory variables (accelerated stresses, differing usage environments)
Explanatory variables are commonly used in regression modeling or design of experiments
May be called regressors or experimental factors in certain contexts
Used to reduce uncertainty in responses by incorporating differences in how units were tested
In the context of reliability, explanatory variables can be used to define the severity of an environment to which a system was exposed
The severity of the environment relates how quickly a product’s “life” is consumed
If it can be reasonably assumed that each unit was exposed to an equivalent environment, explanatory variables may be ignored and the failure observations \(t_i, i=1,2,...\) are assumed to be \(iid\)
Assuming the \(t_{i}\) are \(iid\) for all \(i\) implies that any differences in the sample population or test process are captured by the variance of the failure times, \(Var[t_i]\)
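One practical consequence of the \(iid\) assumption is that the joint density of the sample factors into a product of identical terms. The density \(f\) and the parameter vector \(\theta\) below are generic placeholders introduced only for this illustration:

\[ f(t_1, t_2, \ldots, t_n; \theta) = \prod_{i=1}^{n} f(t_i; \theta) \]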
Reasons why modeling time-to-event data can produce poor results
Inappropriate time-scale used to measure operating life (e.g. specifying the lifetime of landing gear components in flight hours instead of take-off/landing cycles)
Not considering the severity of the operating environment
Motivation for the study
Fatigue is known to affect the service life of deep-groove ball bearings
Applying higher levels of stress further reduces the fatigue life of the ball bearings
A failure time regression model was developed to describe the effect that higher levels of stress have on the fatigue life of these bearings.
However, there was disagreement within the industry on whether the estimated parameter values currently being used were accurate.
Additional testing was required to reduce the uncertainty about the parameter values used in the model.
Study objectives
Test several ball bearings at a key stress level to augment the existing failure data
The additional observations should result in better estimates of the parameter values used to model fatigue life as a function of the applied stress in deep-groove ball bearings
Data Set (Table 1.1) lzbearing
The data are the number of accumulated fatigue cycles observed when each failure occurred
The data are reported in millions of cycles for \(n = 23\) bearings
We aren’t told what stress was applied to each bearing in this test
We must assume that the stress applied to all 23 bearings was equivalent
Analysis
A full analysis would include the data observed at every stress level to determine the conditional failure-time distributions
In this simplified case, an analysis would entail fitting the data to a single distribution using graphical methods (histograms, event plots) and numerical methods (maximum likelihood)
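A minimal sketch of the numerical part of such an analysis is shown here, assuming a Weibull distribution (the notes do not say which distribution is used). The 23 failure times are simulated stand-ins with made-up parameters, not the lzbearing values.

```r
# Simulated stand-in for 23 complete failure times (millions of cycles).
# These are NOT the lzbearing data; the shape and scale are made up.
set.seed(23)
cycles <- rweibull(23, shape = 2, scale = 80)

# Negative Weibull log-likelihood. Parameters are passed on the log scale
# so the optimizer can search over all real numbers.
neg_loglik <- function(par, times) {
  shape <- exp(par[1])
  scale <- exp(par[2])
  -sum(dweibull(times, shape = shape, scale = scale, log = TRUE))
}

# Maximize the likelihood by minimizing its negative (Nelder-Mead default).
fit <- optim(par = c(0, log(mean(cycles))), fn = neg_loglik, times = cycles)
mle <- c(shape = exp(fit$par[1]), scale = exp(fit$par[2]))
print(mle)
```

With complete (uncensored) data every unit contributes its density to the likelihood, which is the simplest case of the methods used in the later examples.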
Motivation for the study
Company A is shipping products with defective integrated circuits (IC)
The company wants to identify which units contain defective IC’s and remove them from the population of items to be shipped
Identifying the defective products requires testing them to simulate their first year of operation (a process called burn-in)
Very little time is available to conduct the burn-in inspection
Temperature and humidity are known to affect the service life of IC’s
Exposing the units to elevated temperature and humidity may help speed the burn-in inspection
However, the company is concerned that this ‘accelerated’ burn-in may not produce meaningful information
Study objectives
Estimate the probability of IC’s failing before 100 hours
Estimate the hazard function at 100 hours
Estimate the required burn-in time to remove the majority of the defective units
Data set (Table 1.2) lfp1370
Lifetimes of \(n = 4156\) integrated circuits (IC’s)
IC’s were exposed to \(80^{\circ}C\) and \(80\%\) relative humidity until failure
The test was stopped after 1370 hours
Analysis
Of the \(4156\) units tested, only 28 failures were observed
The failure times for the remaining units are unknown, but would have been observed at some point if testing had continued. These are known as right-censored observations.
This analysis entails fitting the data to a probability distribution using maximum likelihood techniques adapted to include right-censored data.
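The adaptation for right censoring can be sketched as follows: failed units contribute the log density and censored units contribute the log survival probability at the time the test stopped. This is an illustration under stated assumptions, not the analysis of lfp1370. The Weibull model, the sample size, and the parameter values are all hypothetical, and the data are simulated.

```r
# Simulated stand-in for a life test stopped at a fixed time.
# The sample size and parameters are made up; this is NOT the lfp1370 data.
set.seed(1370)
n_units     <- 500
censor_time <- 1370
true_time   <- rweibull(n_units, shape = 0.8, scale = 2000)

time   <- pmin(true_time, censor_time)  # observed time on test
failed <- true_time <= censor_time      # FALSE marks a right-censored unit

# Failures contribute the log density; censored units contribute the log
# probability of surviving past the censoring time.
neg_loglik <- function(par, time, failed) {
  shape <- exp(par[1])
  scale <- exp(par[2])
  ll_fail <- dweibull(time[failed], shape, scale, log = TRUE)
  ll_cens <- pweibull(time[!failed], shape, scale,
                      lower.tail = FALSE, log.p = TRUE)
  -(sum(ll_fail) + sum(ll_cens))
}

fit   <- optim(c(0, log(censor_time)), neg_loglik, time = time, failed = failed)
shape <- exp(fit$par[1])
scale <- exp(fit$par[2])

# Two of the quantities named in the study objectives, at 100 hours.
prob_fail_100 <- pweibull(100, shape, scale)
hazard_100    <- dweibull(100, shape, scale) /
  pweibull(100, shape, scale, lower.tail = FALSE)
print(c(shape = shape, scale = scale, F100 = prob_fail_100, h100 = hazard_100))
```

Dropping the censored units instead of including their survival terms would bias the fitted distribution toward shorter lives.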
Motivation for the study
Nuclear power plants use heat exchangers to transfer energy from the reactor to the steam turbines
Heat exchangers can contain several thousand tubes to transfer the steam
As a result of continuous exposure to steam and pressure, the tubes can develop cracks over time
If the cracks grow large enough, a catastrophic failure could occur
Tubes are periodically inspected (nondestructive) for cracks
If a crack is found in a tube, that tube is sealed off with concrete and is no longer used
Once a certain number of tubes have been sealed, the entire heat exchanger must be replaced
Study objectives
Engineers want to estimate the service life of the tubes at three nuclear power plants
Only a sample of 100 tubes at each nuclear power plant can be inspected
Data set heatexchanger
The data are inspection results from 300 heat exchanger tubes, 100 inspected at each of three geographically separated plants
If the data from all three plants could be combined, a more accurate estimate of the service life might be determined
However, the data can only be combined if the tubes were produced in a similar manufacturing process and were used under similar conditions
Figure 1.6 shows the raw, un-combined data
Figure 1.7 shows the combined data
The observations are left, right, and interval censored since each plant began operating one year apart
Analysis
Of the 300 tubes inspected, 11 were found to have cracks of sufficient length to be sealed off.
In the combined data set all 300 tubes experienced the first year of operation, 200 tubes experienced the first two years of operation, and only 100 tubes experienced all three years of operation.
This analysis entails fitting the data to a probability distribution using maximum likelihood techniques adapted to include left, right, and interval-censored data.
par(family = 'serif',font = 2, mar = c(0,0,0,0))
plot(NA,
axes = FALSE,
xlab = '',
ylab = '',
xlim = range(-50,350),
ylim = range(-10,300))
segments( x0 = c(0,0,0,0,0,rep(0,5),rep(100,5),rep(200,5),100,200,300),
y0 = c(0,100,200,300,0,seq(232,280,12),seq(132,180,12),seq(32,80,12),0,0,0),
x1 = c(350,350,350,350,0,rep(300,15),100,200,300),
y1 = c(0,100,200,300,300,seq(232,280,12),seq(132,180,12),seq(32,80,12),300,300,300),
lwd = c(rep(2,5), rep(1,18)))
text(x = rep(-25,3),
y = seq(56,256,100),
labels = c('Plant 3',
'Plant 2',
'Plant 1'),
cex = rep(0.9,3))
text(x = c(50,rep(150,2),rep(250,3)),
y = c(rep(290,2),190,290,190,90),
labels = c('1 failure',
'2 failures',
'2 failures',
'2 failures',
'3 failures',
'1 failure'),
cex = rep(0.8,6))
text(x = seq(50,250,100),
y = rep(-8,3),
labels = c('1981',
'1982',
'1983'),
cex = rep(0.9,3))
text(x = rep(310,3),
y = seq(56,256,100),
labels = c('99','95','95'),
cex = rep(0.9,3))
segments(x0 = rep(320,3),
y0 = seq(56,256,100),
x1 = rep(345,3),
y1 = seq(56,256,100),
lty = rep(2,3))
arrows(x0 = rep(295,15),
y0 = c(seq(232,280,12),seq(132,180,12),seq(32,80,12)),
x1 = rep(300,15),
length = 0.1)
arrows(x0 = rep(345,3),
y0 = seq(56,256,100),
x1 = rep(350,3),
length = 0.1)
Figure 1.6 - Diagram of the raw heatexchanger data
par(family = 'serif',font = 2,mar = c(0,0,0,0))
plot(NA,
axes = FALSE,
xlab = '',
ylab = '',
xlim = range(-50,350),
ylim = range(-10,300))
segments( x0 = c(0,0,0,0,0,rep(0,15),100,200,300),
y0 = c(0,100,200,300,0,seq(232,280,12),seq(132,180,12),seq(32,80,12),0,0,0),
x1 = c(350,350,350,350,0,rep(300,5),rep(200,5),rep(100,5),100,200,300),
y1 = c(0,100,200,300,300,seq(232,280,12),seq(132,180,12),seq(32,80,12),300,300,300),
lwd = c(rep(2,5),rep(1,18)))
text(x = rep(-25,3),
y = seq(56,256,100),
labels = c('Plant 3','Plant 2','Plant 1'),
cex=rep(0.9,3))
text(x = c(rep(50,3),rep(150,2),250),
y = c(90,190,290,190,rep(290,2)),
labels = c('1 failure',
'2 failures',
'1 failure',
'3 failures',
'2 failures',
'2 failures'),
cex = rep(0.8,6))
text(x = seq(50,250,100),
y = rep(-8,3),
labels = c('Year 1',
'Year 2',
'Year 3'),
cex = rep(0.9,3))
text(x = c(110,210,310),
y = seq(56,256,100),
labels = c('99',
'95',
'95'),
cex = rep(0.9,3))
segments(x0 = c(120,220,320),
y0 = seq(56,256,100),
x1 = c(345,345,345),
y1 = seq(56,256,100),
lty = rep(2,3))
arrows(x0 = c(rep(295,5),rep(195,5),rep(95,5)),
y0 = c(seq(232,280,12),seq(132,180,12),seq(32,80,12)),
x1 = c(rep(300,5),rep(200,5),rep(100,5)),
length = 0.1)
arrows(x0 = rep(345,3),
y0 = seq(56,256,100),
x1 = rep(350,3),
length = 0.1)
Figure 1.7 - Diagram of the transformed heatexchanger data
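The combined data in Figure 1.7 can be written as counts per year of operation, which makes the censored likelihood easy to sketch. The counts below are read from the failure labels and the unfailed totals in that diagram. The Weibull model is an assumption made for illustration, since the notes do not name a distribution for this example.

```r
# Counts read from the combined-data diagram (Figure 1.7).
# Interval j is year j of operation, i.e. (j - 1, j].
cracked    <- c(4, 5, 2)     # tubes found cracked during years 1, 2, 3
unfailed   <- c(99, 95, 95)  # uncracked tubes last seen at end of years 1, 2, 3
year_end   <- 1:3
year_start <- year_end - 1

# Assumed Weibull model. A crack found in (a, b] contributes F(b) - F(a);
# for year 1 this reduces to F(1), a left-censored observation.
# An uncracked tube last inspected at time b contributes 1 - F(b).
neg_loglik <- function(par) {
  shape <- exp(par[1])
  scale <- exp(par[2])
  p_interval <- pweibull(year_end, shape, scale) -
    pweibull(year_start, shape, scale)
  log_surv <- pweibull(year_end, shape, scale,
                       lower.tail = FALSE, log.p = TRUE)
  -(sum(cracked * log(p_interval)) + sum(unfailed * log_surv))
}

fit   <- optim(c(0, log(10)), neg_loglik, control = list(maxit = 2000))
shape <- exp(fit$par[1])
scale <- exp(fit$par[2])

# Estimated fraction of tubes cracked within three years of operation.
frac_cracked_3yr <- pweibull(3, shape, scale)
print(c(shape = shape, scale = scale, F3 = frac_cracked_3yr))
```

Because only a small fraction of tubes cracked within three years, the individual shape and scale estimates are poorly determined here, while the estimated fraction cracked inside the observed range is much more stable.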
Often, units are tested at multiple environments with differing levels of severity
To drive failures more quickly
To reduce the overall test time
Recall, explanatory variables can be used to relate how quickly “life” is consumed in each environment
Changing the value of an explanatory variable changes the value of one or more parameters in the underlying failure distribution (think regression)
Complex models are often required to compare life consumption rates between severity levels
Examples
Cracks grow faster when a higher level of stress is applied
Tire treads wear faster when driven on gravel roads
Metal corrodes faster when it is exposed to humid environments
It is critical that the results observed at higher stress environments represent the behavior observed at the use-level stress
We could subject a test unit to a temperature that would cause some component to melt, but that would not give any useful information about the time to failure out in the field
We can generate failures very quickly - even instantly in some cases
But, the desire to reduce the total test time should not cause failures that would never be observed in the operating environment
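The way an explanatory variable shifts a parameter of the failure-time distribution can be sketched with a simple regression on log life. This is a hypothetical illustration: the humidity levels, the lognormal model, and the coefficients are made up, and there is no censoring, so ordinary least squares is enough.

```r
# Simulated accelerated test: higher humidity shortens life.
# All levels and parameter values here are hypothetical.
set.seed(4)
rh_levels <- c(50, 60, 70, 80)        # test humidities (% RH)
rh        <- rep(rh_levels, each = 25)

beta0 <- 12     # assumed intercept of log life
beta1 <- -0.1   # assumed change in log life per point of % RH
sigma <- 0.5    # assumed spread of log life at a fixed humidity

# The location of log life depends on humidity; the spread does not.
hours <- exp(beta0 + beta1 * rh + rnorm(length(rh), sd = sigma))

# Fit the log-linear life-stress relationship.
fit <- lm(log(hours) ~ rh)
print(coef(fit))

# Extrapolate the median life to a use-level humidity below the tested range.
median_life_use <- exp(predict(fit, newdata = data.frame(rh = 40)))
print(median_life_use)
```

The extrapolation to 40% RH is only meaningful if the failure mechanism at the test humidities is the same one that operates at the use level, which is the caution raised above.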
Motivation for the study
Relative humidity (\(\%RH\)) affects the time until a short-circuit event in PCB’s
Failures may be generated more quickly by increasing \(\%RH\)
Study objectives
Drive failures of PCB’s at several levels of \(\%RH\)
Use the failure data to determine the proportion of PCB’s failing at the “use level” humidity
Data set (Table 1.4) printedcircuitboard
\(n = 70\) PCB’s tested at \(4\) separate levels of \(\%RH\)
Each PCB was inspected for failure at designated time intervals, initially every \(4\) hours and expanding to every \(12\) hours
Analysis
This analysis entails failure-time regression with right censoring
The data should be
par(family = 'serif', font = 2)
plot(lower~rh,
data = SMRD::printedcircuitboard,
pch = 'X',
cex = .85,
log = 'y',
ylim = c(10,10000),
xlim = c(45,85),
las = 1,
ylab = 'Hours',
xlab = "% RH")
text(x = c(50,63,75,82),
y = c(7000,6000,1000,350),
labels = c('48/70 censored',
'11/68 censored',
'0/70 censored',
'0/70 censored'))
Figure 1.9 - Scatter plot of the printedcircuitboard failure data at four different stress levels
Occur when the value of a system performance measure crosses above or below a critical value
Power window motor raises/lowers windows too slowly
Tire tread depth falls below a safe level
Power window motor gear teeth thickness wears down to where it fails
Tire wear becomes so extreme that the tire fails
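The threshold-crossing definition of failure can be sketched with a degradation path. The tread-depth numbers below are hypothetical (initial depth, wear rate, noise level, and critical depth are all assumed), and a straight-line wear model is used only to keep the example short.

```r
# Hypothetical tread-depth measurements for one tire:
# linear wear plus measurement noise.
set.seed(7)
miles_k        <- seq(0, 60, by = 5)  # inspection points (thousands of miles)
initial_depth  <- 8                   # mm, assumed
wear_rate      <- 0.1                 # mm per thousand miles, assumed
critical_depth <- 3                   # mm, assumed failure threshold

depth <- initial_depth - wear_rate * miles_k +
  rnorm(length(miles_k), sd = 0.05)

# Fit a straight-line degradation path to the measurements.
fit <- lm(depth ~ miles_k)

# The failure "time" is where the fitted path crosses the critical value.
soft_failure_miles_k <- (critical_depth - coef(fit)[[1]]) / coef(fit)[[2]]
print(soft_failure_miles_k)
```

Here the failure time is defined by the chosen critical value rather than by a physical break, so changing `critical_depth` changes the failure time for the same tire.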
Used to ASSESS the performance of MATURE products
Most statistics courses discuss enumerative studies
Test a random sample from a population of units
Observe test results, analyze data, and make conclusions
BUT, what if your conclusion is these parts suck and don’t meet the reliability requirement?
Used to IMPROVE the performance of IMMATURE products
Most reliability tests are Analytic Studies
You improve the design, and now…
Your test data is based on a prior design that doesn’t exist anymore
Failure results when a flaw is exposed to a severe environment for a sufficient period of time.
Flaws result from poor designs
Flaws can result from manufacturing
Flaws can result if the usage environment is not well understood
The environment in which the system operates imparts stresses on the system which over time can
Extend microcracks
Loosen joints
Weaken electrical connections
Magnify vibrations
Elevate internal temperatures
These enlarged flaws weaken the system until the environmental stresses exceed the flaw’s residual strength
It’s well understood that a system’s performance is affected by its operating environment
It’s less understood how many different environments a system is actually exposed to
Example: A car
Timescale 1: Years of ownership
Timescale 2: Miles driven
Timescale 3: Number of times started
Each customer may have their own time-scale of interest
Each system component has a time-scale that is most appropriate
When it leaves the assembly line?
When it ships from factory to a retailer?
When it is purchased by a customer?
When it is installed/unboxed?
When it is first used?
When the customer realizes it has stopped functioning?
When a warranty claim is made?
When it is received by the manufacturer?
In this course, defining the time origin and/or the failure time will not be a concern
However, you should recognize that these values must be defined prior to collecting time-to-event data for a fielded system
Non-safety-critical & inexpensive items are often non-repairable since replacement is far cheaper than repair
Light bulbs
Small appliances
Consumable items (filters, belts, hoses)
Safety-critical & expensive items may also be non-repairable if they must be replaced prior to failure when some degradation level is reached
Tires (tread depth)
Computers (processing speed, obsolescence)
Non-repairable systems may be represented by a transition diagram with one working state and one or more absorbing failure states
The transition diagram below illustrates a nonrepairable system, where
State 0: Initial working state
States 1, 2,…: Absorbing failed states
Mat1 <- matrix(NA, nrow = 3, ncol = 3)
AA <- as.data.frame(Mat1)
AA[[2,1]] <- 'F[0:1]'
AA[[3,1]] <- 'F[0:2]'
name <- c(expression(0[Alive]),
expression(1[Dead]),
expression(2[Dead]))
par(family='serif')
diagram::plotmat(A = AA, pos = 3, curve = .575,
name = name, lwd = 2, arr.len = 0.6,
arr.width = 0.25, my = .25, box.size = 0.08,
arr.type = 'triangle', dtext = -1,
relsize=.99, box.cex=1.5, cex=1.25)
Repairable systems are those that can be restored to working condition after a failure
Repairable systems may be represented graphically by a transition diagram with one or more transient states
The transient states represent failure modes for which the cost of repair is far below the cost of replacement
As a result of the repair, the system can transition back to the operating state
library(diagram)
DiffMat <- matrix(NA, nrow = 4, ncol = 4)
AA <- as.data.frame(DiffMat)
AA[[1,2]] <- 'F[1:0]'
AA[[1,3]] <- 'F[2:0]'
AA[[2,1]] <- 'F[0:1]'
AA[[3,1]] <- 'F[0:2]'
AA[[4,1]] <- 'F[0:3]'
name <- c(expression(0[Alive]),
expression(1[Failed]),
expression(2[Failed]),
expression(3[Dead]))
par(family='serif', mar = c(0,0,0,0))
diagram::plotmat(A = AA, pos = 4, curve = .575,
name = name, lwd = 2, arr.len = 0.6,
arr.width = 0.25, my = .15, box.size = 0.08,
arr.type = 'triangle', dtext = -1,
relsize=.99, box.cex=1.5, cex=1.25)
Repairable systems also have one or more absorbing states
The absorbing states represent failure modes for which the cost of repair is equal to or greater than the cost of replacement
When these nonrepairable failure modes occur the system does not transition back to the operating state
Reliability tests are often campaigns composed of many disparate tests
Each test will have its own goals
The type(s) of test used depend on where the system is in its development
The goal of a particular test may not sync with the overall goal of the test campaign
Environmental test chambers
Performance diagnostic equipment
Data acquisition systems
Personnel - technical SME’s for root cause analyses
Test ranges
Maintenance equipment
Personnel - operational SME’s for realistic performance exercises
Test environments should be similar to the fielded environment
Ensure test results are representative of fielded operations
Consumers want the test environment to be
Producers want the test environment to be
Fielded systems will be exposed to multiple-stresses, simultaneously
Every failure mode will respond to each stress differently
If a failure mode is sensitive to thermal loads, a vibration test may not produce meaningful results
Do the data being gathered in the test support this?
Often, data from
This is a very active research area
Assess the data nonparametrically - using simple graphical methods
Fit one or more simple parametric models to the data
Check for violations of model assumptions
Compute model parameter estimates and confidence intervals
Use numeric and graphical methods to check fit to data
Perform sensitivity analyses on parameter values and model assumptions
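The first few steps of this strategy can be sketched on simulated data. The sample, the Weibull model, and the midpoint plotting positions are assumptions made for illustration. The sketch uses a simple least-squares line on the Weibull probability scale as a graphical check, which is not the same as a maximum likelihood fit.

```r
# Simulated complete sample of failure times (hypothetical parameters).
set.seed(11)
hours <- sort(rweibull(100, shape = 1.5, scale = 500))
n     <- length(hours)

# Step 1: nonparametric estimate of the fraction failing by each time,
# using midpoint plotting positions (i - 0.5) / n.
p_hat <- (seq_len(n) - 0.5) / n

# Step 2: on the Weibull probability scale, log(-log(1 - F(t))) is a
# straight line in log(t) with slope equal to the shape parameter.
x <- log(hours)
y <- log(-log(1 - p_hat))
line_fit <- lm(y ~ x)

shape_graphical <- coef(line_fit)[["x"]]
scale_graphical <- exp(-coef(line_fit)[["(Intercept)"]] / shape_graphical)

# Step 3: a rough check of the distributional assumption. Strong curvature
# on this scale (low R-squared) would suggest the Weibull model fits poorly.
r_squared <- summary(line_fit)$r.squared
print(c(shape = shape_graphical, scale = scale_graphical, r2 = r_squared))
```

Calling `plot(x, y)` followed by `abline(line_fit)` gives the corresponding probability plot, which is usually more informative than the R-squared value alone.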
Examples of time-to-failure data for different types of systems
A strategy for estimating system performance metrics from the data
The most common reliability metric is the failure-time distribution
Many failure time processes are modeled using a continuous scale (i.e. Time)
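For a continuous, strictly positive failure time \(T\), the failure-time distribution and the related quantities mentioned in these notes (such as the hazard function) can be written as follows. The symbols \(F\), \(f\), \(S\), and \(h\) are standard notation assumed here for illustration:

\[ F(t) = \Pr(T \le t), \qquad f(t) = \frac{dF(t)}{dt}, \qquad S(t) = 1 - F(t), \qquad h(t) = \frac{f(t)}{1 - F(t)}, \qquad t > 0 \]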