W. Q. Meeker and L. A. Escobar
20 May 2020
Models for continuous failure-time processes
Models used for the discrete data from these continuous failure-time processes
Common censoring mechanisms that restrict our ability to observe all of the failure times that might occur in a reliability study
Principles of likelihood and how likelihood relates to the probability of the observed data
How likelihood ideas can be used to make inferences from reliability data
The probability_functions app from the teachingApps package includes the six probability functions used most in reliability studies and highlights:
Expressions for each function
Mathematical relationships between the functions
How to compute values for each function in R - for a variety of distributions
The most current published version of the teachingApps package can be installed from CRAN by pasting the code below into the R console
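```r
install.packages('teachingApps')
```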
The development version of the teachingApps package (which may contain some newer features) can be installed from GitHub by pasting the code below into the R console

```r
if (!'devtools' %in% installed.packages()[, 1]) {
  install.packages('devtools')
}
devtools::install_github('Auburngrads/teachingApps')
```

The text introduces the probability functions using the \(WEIB(1.7,1)\) distribution as an example and illustrates four functions in Figure 2.1
The SMRD package includes the distribution.plot( ) function, which can be used to reproduce this figure or similar figures for many other distributions
If single values are given for both parameters, distribution.plot( ) returns a plot showing \(F(t), f(t), S(t)\) and \(h(t)\)
If multiple values are given for either parameter, distribution.plot( ) returns a plot showing \(F(t), f(t), h(t)\) and a legend of the parameter values displayed
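For reference, the four functions for the \(WEIB(1.7,1)\) example can also be computed directly in base R; a minimal sketch (not a substitute for distribution.plot( )):

```r
# The four probability functions for the WEIB(1.7, 1) example
t   <- seq(0.01, 3, by = 0.01)
F_t <- pweibull(t, shape = 1.7, scale = 1)  # CDF
f_t <- dweibull(t, shape = 1.7, scale = 1)  # pdf
S_t <- 1 - F_t                              # survival function
h_t <- f_t / S_t                            # hazard function

par(mfrow = c(2, 2), family = 'serif', font = 2)
plot(t, F_t, type = 'l', lwd = 2, las = 1, ylab = 'F(t)')
plot(t, f_t, type = 'l', lwd = 2, las = 1, ylab = 'f(t)')
plot(t, S_t, type = 'l', lwd = 2, las = 1, ylab = 'S(t)')
plot(t, h_t, type = 'l', lwd = 2, las = 1, ylab = 'h(t)')
```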
Failure implies that a product’s ability to perform one or more functions has been either…
Lost completely, or
Degraded to the point that the product cannot meet a performance specification
When failure occurs in fielded systems, there may not be an immediate indication to the user or the maintainers
Some component and sub-component failures may only be discovered after a subsystem or the entire system fails
Component failures in complex systems can remain undiscovered until the next inspection cycle
Because of this, inspection intervals should be chosen deliberately, balancing the cost of inspecting against the cost of leaving failures undiscovered
In real-world testing, we typically cannot observe all of the failure times that might occur
Testing may occur in different geographic regions
Sample size may be too large
Test items may be removed from a test to fill a “real-world” need
The test period may end before every item fails
Items may fail before the first inspection
We can’t constantly monitor the performance of every component when a system is in the field.
For this reason real-world failure data are often reported as the inspection time when the failures were observed
Thus, real-world failure data are often reported as DISCRETE values corresponding to the inspection cycle
To analyze this real-world failure data appropriately, we partition the test into non-overlapping intervals corresponding to the time between inspections, i.e. \[ (t_0,t_1], (t_1,t_2],...,(t_{m-1},t_m],(t_m,t_{m+1}),\;\;\;\;\; t_0=0, t_{m+1}=\infty \]
The intervals do not need to be of equal length
Inspections may be made more frequently at the beginning of a test to check for “infant mortality” failures or ensure that the test was set up correctly
Inspections may then become less frequent once the “bad actors” have been culled from the population
Estimating reliability in this scenario is similar to estimating the likelihood of a single \((m+1)\)-sided die landing on a specific number
However, in the context of failure data the ‘sides’ refer to the \(m+1\) inspection intervals
Think about how to estimate the probability of a 6-sided die landing on a given side
If the die is ‘fair’ and we rolled it many times (i.e. \(n \gtrsim 10,000\) rolls) we would expect that in the long-run \[\frac{\text{# times we roll 1,2,3,4,5,6}}{\text{# of total rolls}}\rightarrow P(X=1,2,3,4,5,6) = \frac{1}{6}\]
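A quick simulation sketch of this long-run behavior:

```r
set.seed(1)
rolls <- sample(1:6, size = 10000, replace = TRUE)  # 10,000 fair-die rolls
table(rolls) / 10000                                # each relative frequency is near 1/6
```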
However, if \(n\) is much smaller, say 100 rolls, the observed counts for each side are modeled using the multinomial distribution with probability mass function \[ f(d_{1},...,d_{m+1}|n,\pi_{1},...,\pi_{m+1})=\frac{n!}{d_{1}!\cdots d_{m+1}!}\pi_{1}^{d_{1}}\cdots\pi_{m+1}^{d_{m+1}} \] (a short R sketch of this pmf follows the variable definitions below)
In the context of rolling a single die with \(m+1\) sides, the variables in this equation represent
\(n\) - number of rolls
\(d_1,...,d_{m+1}\) - number of times each of the faces \(1,...,m+1\) are observed
\(\pi_{_{1}},...,\pi_{_{m+1}}\) - the long-run probabilities of observing faces \(1,...,m+1\)
In the context of failures occurring in \(m+1\) discrete time intervals, these variables represent
\(n\) - the number of systems under test
\(d_{1}\) - number of systems that failed in the \(1^{st}\) interval \((t_{0},t_{1}]\)
\(d_{2}\) - number of systems that failed in the \(2^{nd}\) interval \((t_{1},t_{2}]\)
\(\vdots\)
\(d_{m+1}\) - number of systems still operating at the final inspection \((t_{m},\infty)\)
\(\pi_{_{1}},...,\pi_{_{m+1}}\) - the long-run probabilities that a unit will fail in intervals \(1,...,m+1\)
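This pmf is available in base R as dmultinom( ); a quick sketch for the die-rolling case (the face counts are illustrative):

```r
# Probability of observing these exact face counts in 100 tosses of a fair die
counts <- c(15, 18, 17, 16, 14, 20)   # d_1, ..., d_6 (illustrative)
dmultinom(x = counts, size = 100, prob = rep(1/6, 6))

# The same value computed directly from the pmf
factorial(100) / prod(factorial(counts)) * prod((1/6)^counts)
```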
For a fair die and a fair toss we say that \(\pi_1=\pi_2=\pi_3=\pi_4=\pi_5=\pi_6 \approx 1/6\)
The value \(1/6\) is the propensity of the die
Think of it as the underlying physical dynamics that govern the likelihood of observing a given result
A “weighted” die would have a different propensity, represented by \(\pi_{i}\) values that are no longer all equal to \(1/6\)
Likewise, with some practice we might be able to toss the die in such a way as to affect the propensity
For systems under test, the propensity is a complex relationship between the system’s probability of failure and the environment in which it operates
For the simplest case, \(\pi_{i}=P(t_{i-1}<T\le t_{i})=F(t_{i})-F(t_{i-1})\), where
\(\pi_i\ge 0, \forall i\)
\(\displaystyle \sum_{j=1}^{m+1}\pi_j=1\)
\(F(t)\) is the underlying failure-time distribution giving the probability of failure by time \(t\)
Figure 2.5 illustrates the relationship between \(\pi_{i}\) and \(F(t)\), where the underlying failure time distribution is assumed to be \(WEIB(1.7,1)\)

```r
par(family = 'serif', font = 2)

# CDF of the WEIB(1.7, 1) distribution
y <- function(t) { pweibull(t, shape = 1.7, scale = 1) }

curve(pweibull(x, shape = 1.7, scale = 1), lwd = 2,
      xlab = 't', ylab = 'F(t)',
      from = 0, to = 3, las = 1)

# Dashed guides from each inspection time up to F(t_i) and across to the right edge
segments(c(0, .5, 1, 1.5, 2, rep(0, 5)),
         c(rep(0, 5), y(c(0.01, .5, 1, 1.5, 2))),
         c(0, .5, 1, 1.5, 2, rep(2.2, 5)),
         rep(y(c(0.01, .5, 1, 1.5, 2)), 2),
         lty = 2, col = 1)

# Label the gaps pi_i = F(t_i) - F(t_{i-1}) along the right edge
text(2.3, (y(0.5) + y(0.01)) / 2, expression(pi[1]), cex = 1.25)
text(2.3, (y(1.0) + y(0.5)) / 2, expression(pi[2]), cex = 1.25)
text(2.3, (y(1.5) + y(1.0)) / 2, expression(pi[3]), cex = 1.25)
text(2.3, (y(2.0) + y(1.5)) / 2, expression(pi[4]), cex = 1.25)
mtext(side = 3,
      expression('Figure 2.5 - Graphical interpretation of the relationship between the ' * pi[i] * ' values and F(' * t[i] * ')'),
      font = 2, line = 2)
```

Figure 2.5 - Graphical interpretation of the relationship between the \(\pi_{i}\) values and \(F(t_{i})\)
\[ S(t_i)=P(T>t_i)=1-F(t_i)=\sum_{j=i+1}^{m+1}\pi_j \]
Where
\(i\) denotes the index of the inspection interval of interest
\(j\) indexes the remaining intervals, \(i+1,...,m+1\)
In words, this equation states that the probability a unit survives beyond \(t_{i}\) equals the sum of the probabilities that it fails in one of the remaining intervals
Motivation for this example
Many aircraft (A/C) components cannot be inspected by base-level maintenance (Mx) due to lack of training, facilities, tools, etc.
Components will only be inspected every \(t\) flight-hours at the depot
For each component, \(\exists\) an underlying \(F(t), t\in \mathbb{R}^{+}\) describing time to failure
The form of \(F(t)\) is often unknown, but may sometimes be inferred from similar components/systems
Goal of this example
Estimate risk of delaying replacement until next inspection (depot cycle)
Balance safety risks against replacement costs
Estimating the risk of delaying replacement until the next inspection implies that
The system was still operating at inspection \(i-1\)
We want to know how likely it is that the system will fail before inspection \(i\)
The text defines this quantity in (2.7) as \(p_{i}\)
\(p_{i}\) is the conditional probability that a unit fails in interval \(i\), given that it was still operating at the start of the interval \[ p_{i}=\Pr(t_{i-1}<T\le t_{i}\,|\,T>t_{i-1})=\frac{\pi_{i}}{S(t_{i-1})} \]
From Equation 2.7 we can define the survival function in terms of \(\mathbf{p}=(p_{1},...,p_{m})\) as \[ \displaystyle S(t_{i})=\prod_{j=1}^{i}[1-p_{j}], i=1,...,m+1. \]
It follows that the CDF is expressed as \[ \begin{aligned} F(t_{i})&=1-\prod_{j=1}^{i}[1-p_{j}], \quad i=1,...,m+1 \\ &=1-\prod_{j=1}^{i}\left[1-\frac{F(t_{j})-F(t_{j-1})}{1-F(t_{j-1})}\right]\\ &=1-\prod_{j=1}^{i}\left[\frac{1-F(t_{j-1})}{1-F(t_{j-1})}-\frac{F(t_{j})-F(t_{j-1})}{1-F(t_{j-1})}\right]\\ &=1-\prod_{j=1}^{i}\frac{1-F(t_{j})}{1-F(t_{j-1})}\\ &=1-\prod_{j=1}^{i}\frac{S(t_{j})}{S(t_{j-1})}\\ &=\sum_{j=1}^{i}\pi_{j} \end{aligned} \] where the product telescopes to \(S(t_{i})/S(t_{0})=S(t_{i})\), since \(S(t_{0})=1\)
How would you state the final expression in this equation in words?
Table 2.1 shows the resulting values for our example in which the underlying failure time distribution is \(WEIB(1.7,1)\)
The following steps show how the values in the table were computed for the first and second inspections
| \(i\) | \(t_{i}\) | \(F(t_{i})\) | \(S(t_{i})\) | \(\pi_{i}\) | \(p_{i}\) | \(1-p_{i}\) |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.000 | 1.000 | | | |
| 1 | 0.5 | 0.265 | 0.735 | 0.265 | 0.265 | 0.735 |
| 2 | 1.0 | 0.632 | 0.368 | 0.367 | 0.500 | 0.500 |
| 3 | 1.5 | 0.864 | 0.136 | 0.231 | 0.629 | 0.371 |
| 4 | 2.0 | 0.961 | 0.0388 | 0.0976 | 0.715 | 0.285 |
| 5 | \(\infty\) | 1.000 | 0.0000 | 0.0388 | 1.000 | 0.000 |
| | | | Total | 1.000 | | |
At the first inspection \(t_{1}=0.5\) \[ \begin{aligned} F(0.5)&=1-\exp\left[-\left(\frac{0.5}{1}\right)^{1.7}\right]=\underline{0.2649275}\\ S(0.5)&=1-F(0.5)=\underline{0.7350725}\\ \pi_{1}&=F(t_{1})-F(t_{0})=F(0.5)-F(0)=\underline{0.2649275}\\ p_{1}&=\frac{\pi_{1}}{S(t_{0})}=\frac{F(0.5)-F(0)}{S(0)}=\frac{0.2649275}{1}=\underline{0.2649275} \end{aligned} \]
At the second inspection \(t_{2}=1.0\) \[ \begin{aligned} F(1)&=1-\exp\left[-\left(\frac{1}{1}\right)^{1.7}\right]=\underline{0.6321206}\\ S(1)&=1-F(1)=\underline{0.3678794}\\ \pi_{2}&=F(t_{2})-F(t_{1})=F(1)-F(0.5)=\underline{0.3671931}\\ p_{2}&=\frac{\pi_{2}}{S(t_{1})}=\frac{F(1)-F(0.5)}{S(0.5)}=\frac{0.3671931}{0.7350725}=\underline{0.4995331} \end{aligned} \]
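These hand computations, and the remaining rows of Table 2.1, can be reproduced in a few lines of base R; a minimal sketch (variable names are ours), including a numerical check of the derivation of \(F(t_{i})\) above:

```r
# Reproduce Table 2.1 for the WEIB(1.7, 1) example
t_i  <- c(0, 0.5, 1.0, 1.5, 2.0, Inf)          # inspection times t_0, ..., t_5
F_i  <- pweibull(t_i, shape = 1.7, scale = 1)  # F(t_i)
S_i  <- 1 - F_i                                # S(t_i)
pi_i <- diff(F_i)                              # pi_i = F(t_i) - F(t_{i-1})
p_i  <- pi_i / S_i[-length(S_i)]               # p_i  = pi_i / S(t_{i-1})

round(cbind(t = t_i[-1], F = F_i[-1], S = S_i[-1],
            pi = pi_i, p = p_i, "1-p" = 1 - p_i), 4)

# Check the derivation above: F(t_i) = 1 - prod(1 - p_j) = sum(pi_j)
all.equal(1 - cumprod(1 - p_i), cumsum(pi_i))
```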
Observe the event plots shown in the figure below
The event plot on the left shows the lfp1370 data set in the SMRD package
The event plot on the right shows the turbine data set in the SMRD package
What type of testing do you think would produce each of these event plots?
```r
library(SMRD)

lfp1370.ld <-
  frame.to.ld(SMRD::lfp1370,
              response.column = 1,
              censor.column = 2,
              case.weight.column = 3,
              time.units = 'Hours')

turbine.ld <-
  frame.to.ld(SMRD::turbine,
              response.column = 1,
              censor.column = 2,
              case.weight.column = 3,
              time.units = 'Hundreds of Hours')

par(mfrow = c(1, 2), family = 'serif', font = 2)
event.plot(lfp1370.ld)
event.plot(turbine.ld)
```
Event plots for the lfp1370 (left) and turbine (right) datasets
Type I (Time) Censoring - Occurs when time considerations limit our ability to observe the exact failure times of one or more samples
Samples still operating after the predetermined end of testing are Type-I right-censored
In this type of test, the number of observed failures is a random quantity
There’s no guarantee that any failures will be observed before testing ends
Type II (Failure) Censoring - Occurs when a test is stopped after a predetermined number of failures has been observed, limiting our ability to observe the exact failure times of the remaining samples
Samples still operating after the predetermined number of failures have been observed are Type-II right-censored
In this type of test the total test time is a random quantity
There’s no guarantee a sufficient number of failures will be observed in a reasonable timeframe
Modeling Type II censored data is simpler than for Type I censored data
Failure censored tests (Type II) are far less common than time-censored tests
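A minimal sketch contrasting the two censoring schemes (all values illustrative):

```r
set.seed(7)
life <- rweibull(20, shape = 1.7, scale = 1)  # latent (true) failure times

# Type I: testing stops at a fixed time t_c; the failure count is random
t_c <- 1
type1_time   <- pmin(life, t_c)               # observed times
type1_status <- as.numeric(life <= t_c)       # 1 = failure, 0 = right censored
sum(type1_status)                             # random number of failures

# Type II: testing stops at the k-th failure; the total test time is random
k <- 10
t_stop <- sort(life)[k]                       # random end-of-test time
type2_time   <- pmin(life, t_stop)
type2_status <- as.numeric(life <= t_stop)
```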
The number of censoring events determines whether the data are
Singly censored
Multiply censored
The contribution each censored observation makes to the likelihood function (\(\S\) 2.4) depends on whether it is
Right censored
Left censored
Interval censored
This course considers censoring events that are independent of the time to failure
The time at which testing ends is driven by real-world constraints - not because we want to ensure the test ends before an item fails
Inspections should not be delayed so that a longer lifetime can be reported
Items should not be removed from a test because it appears that they are ‘about to fail’
Competing risks considers dependent censoring events that affect time to failure
Dependent censoring is common in medical survivability research
Dependent censoring is beyond the scope of this course
Suppose we’re investigating the time to recurrence of symptoms in cancer patients who receive a new experimental treatment
Further suppose that the two patients described below are involved in our study
Patient A moves away during the study - we can’t collect any future observations
Patient B is killed in a car crash during the study - we can’t collect any future observations
How should we handle the data we’ve already collected for these patients?
The last observation for Patient A would be independently right censored because there’s a positive probability that the cancer symptoms could return - even though we may not be able to observe when they returned
The data observed for Patient B is an example of dependent censoring because there is zero probability that the cancer symptoms could return
Suppose we want to simulate 50 observations from a specified distribution to use for an example
How would we do it?
We’d first have to choose a distribution family to simulate from (log-normal, Weibull, exponential)
We’d then choose a set of parameter values to indicate the specific member of our chosen distribution family
Finally, we’d use a software application to generate 50 random observations using the distribution/parameter set we chose
So, let’s simulate 50 failure observations, falling mostly in the interval \([50,150]\), using the following distributions
\(LOGNOR(\mu = 4.5, \sigma = 0.2)\): `rlnorm(n = 50, meanlog = 4.5, sdlog = 0.2)`
\(WEIB(\beta = 5.5, \eta = 120)\): `rweibull(n = 50, shape = 5.5, scale = 120)`
\(EXP(\theta = 40)\): `rexp(n = 50, rate = 1/40)`
The parameter values in each of these distributions were chosen to generate observations between \(50\) and \(150\)
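A minimal sketch of the simulation itself, together with an ordered dot plot of the results (the seed is our addition, for reproducibility):

```r
set.seed(123)   # assumed seed, added for reproducibility
n <- 50
lognor_obs <- rlnorm(n, meanlog = 4.5, sdlog = 0.2)
weib_obs   <- rweibull(n, shape = 5.5, scale = 120)
exp_obs    <- rexp(n, rate = 1/40)

# Dot plot of the raw observations, ordered smallest to largest
plot(sort(lognor_obs), rep(3, n), ylim = c(0.5, 3.5),
     xlim = range(lognor_obs, weib_obs, exp_obs),
     yaxt = 'n', pch = 16, xlab = 'Simulated failure time', ylab = '')
points(sort(weib_obs), rep(2, n), pch = 16)
points(sort(exp_obs), rep(1, n), pch = 16)
axis(2, at = 1:3, labels = c('EXP', 'WEIB', 'LOGNOR'), las = 1)
```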
Paste the code in the chunk below in R to bring up the shiny app
The app shows the result of our simulation in two separate plots
The first plot shows the pdfs of each distribution along with the 50 failure times simulated from each distribution
When the observations are plotted this way, it’s easy to see how the distributions differ
The second plot shows the raw observations ordered from smallest to largest
In this plot we can see that the observations simulated from the exponential distribution are skewed toward lower values compared to the Weibull and lognormal distributions
However, it’s much harder to distinguish between the observations from the Weibull and lognormal distributions.
Specifying the distribution family and parameter values gives information about the pattern of the data that will result from the simulation
Sometimes, it’s easy to distinguish different patterns, just by looking at the data visually
More often, graphical and numerical methods are required to distinguish between the patterns of data produced by different distributions
Simulating more observations can sometimes make it easier to distinguish between the patterns
But the opposite should be true as well
The data should also provide information about the distribution family and parameter values that were used to generate them
Again, sometimes it’s easy to pick out what distribution family produced a data set just by looking at the pattern of the data
If more observations are available, it becomes easier to distinguish which distribution family produced the data
In most cases, numerical methods are needed to distinguish which distribution produced a data set
A popular numerical method of extracting information from data to identify which probability distribution is most likely to have produced it is maximum likelihood estimation
Suppose that we observe \(DATA = y_{1},y_{2},...,y_{n}\) from a test event
If these observations were produced by units that can reasonably be considered as identical, then we can say that \[ y_{1},y_{2},...,y_{n} \sim F(y|\boldsymbol{\theta}=\theta_{1},...,\theta_{p}) \]
However, the true form of \(F(y|\boldsymbol{\theta})\) and values of \(\theta_{1},...,\theta_{p}\) are unknown
Likelihood theory states that for each distribution family, we can find \(\mathbf{\theta}\) such that the negative log of the joint probability of the data is minimized, that is \[ \boldsymbol{\hat{\theta^*}} = \underset{\boldsymbol{\theta} \in \boldsymbol \Theta}{\text{arg}\;\min} \bigg(-\log\bigg[\Pr(y_1,\ldots,y_n|\boldsymbol{\theta})\bigg]\bigg) \]
If these observations are mutually independent, this joint probability can be stated as
\[ \begin{aligned} \boldsymbol{\hat{\theta^*}} &= \underset{\boldsymbol{\theta} \in \boldsymbol \Theta}{\text{arg}\;\min} \bigg(-\log\bigg[\prod_{i=1}^{n}\Pr(y_{i}|\boldsymbol{\theta})\bigg]\bigg)\\\\ &= \underset{\boldsymbol{\theta} \in \boldsymbol \Theta}{\text{arg}\;\min} \bigg(-\sum_{i=1}^{n}\log\bigg[\Pr(y_{i}|\boldsymbol{\theta})\bigg]\bigg)\\\\ \end{aligned} \]
The negative log of this joint probability is a loss function in \(\boldsymbol{\theta}\), taking values in the interval \([0,+\infty)\)
The joint probability defined above, \(\Pr(y_1,\ldots,y_n|\boldsymbol{\theta})\), is also known as the Likelihood Function and is denoted by \(\mathscr{L}(\boldsymbol{\theta}|y_1,\ldots,y_n)\), thus for observation \(i\) \[ \mathscr{L}(\boldsymbol{\theta}|y_i)=\Pr(y_i|\boldsymbol{\theta}) \]
Note that
\(\mathscr{L}(\theta|y_i)\) is a function of the parameter vector \(\theta\) for a given observation \(y_i\)
\(\Pr(y_{i}|\theta)\) is a function of the observed values \(y_i\) for a given parameter vector \(\theta\)
For a vector of observations, \(y_1,\ldots,y_n\), the total likelihood is expressed as \[ \mathscr{L}(\boldsymbol{\theta}|y_1,\ldots,y_n)=C\prod_{i=1}^{n} \mathscr{L}(\boldsymbol{\theta}|y_i)=C\prod_{i=1}^{n}\Pr(y_{i}|\boldsymbol{\theta}) \]
Unsurprisingly, the log of the joint probability \(\log\big[\Pr(y_1,\ldots,y_n|\boldsymbol{\theta})\big]\) is known as the log-likelihood function \(\log[\mathscr{L}(\boldsymbol{\theta}|y_1,\ldots,y_n)]\)
Thus, the value of \(\boldsymbol{\theta}\) that minimizes \(-\log\big[\Pr(y_1,\ldots,y_n|\boldsymbol{\theta})\big]\) also maximizes \(\log[\mathscr{L}(\boldsymbol{\theta}|y_1,\ldots,y_n)]\)
Because \(\log[x]\) is a monotone function, the value of \(\boldsymbol{\theta}\) that maximizes \(\log[\mathscr{L}(\boldsymbol{\theta}|y_1,\ldots,y_n)]\) also maximizes \(\mathscr{L}(\boldsymbol{\theta}|y_1,\ldots,y_n)\)
This value of \(\boldsymbol{\theta}\) that maximizes the total likelihood is called the maximum likelihood estimate and is denoted \(\boldsymbol{\hat{\theta}}_{_{MLE}}\)
The constant \(C\) in the total likelihood expression is unimportant and may be ignored in nearly all circumstances
In maximum likelihood estimation, the objective is to maximize the total likelihood based on two inputs:
An assumed functional form of the joint probability
A set of parameter values - \(\boldsymbol{\theta}=\theta_{1},...,\theta_{p}\)
The functional form of the joint probability is typically expressed in terms of the CDF and the PDF of a candidate probability distribution
For each candidate probability distribution we can obtain a \(\boldsymbol{\hat{\theta}}_{_{MLE}}\) that gives a distribution-specific maximum likelihood estimate \(\mathscr{L}(\boldsymbol{\hat{\theta}}_{_{MLE}}|y_1,\ldots,y_n)\)
The candidate probability distribution with the highest maximum likelihood is the best fit distribution for the data.
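As a minimal sketch of this comparison (using MASS::fitdistr( ), one of several R functions that compute MLEs; the simulated data and seed are illustrative):

```r
library(MASS)   # for fitdistr()

set.seed(42)                                   # illustrative data: actually WEIB
obs <- rweibull(50, shape = 5.5, scale = 120)

# Distribution-specific MLEs for two candidate families
weib_fit   <- fitdistr(obs, 'weibull')
lognor_fit <- fitdistr(obs, 'lognormal')

weib_fit$estimate    # theta_hat_MLE for the Weibull family
logLik(weib_fit)     # maximum log-likelihood under the Weibull family
logLik(lognor_fit)   # the larger of the two indicates the better-fitting family
```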
Each observation in the sample set under evaluation contributes to the total likelihood
The degree to which an observation contributes to the total likelihood depends on the underlying failure distribution \(F(t)\) and on the width of the interval in which the observation occurred
Figure 2.6 illustrates the contributions made to the total likelihood for ‘exact’ observations, along with right, left, and interval censored observations
```r
par(family = 'serif', font = 2)
curve(dweibull(x, shape = 1.7, scale = 1),
      lwd = 2, las = 1,
      xlim = c(0, 2.5),
      xlab = 'Time (t)', ylab = 'f(t)',
      main = 'Figure 2.6 - Likelihood contributions for different kinds of censoring')

# Shade the areas under f(t) corresponding to each type of likelihood contribution
polygon(x = c(seq(0, .5, .01), .5),
        y = c(dweibull(seq(0, .5, .01), shape = 1.7, scale = 1), 0),
        col = 1)
polygon(x = c(1, seq(1, 1.5, .01), 1.5),
        y = c(0, dweibull(seq(1, 1.5, .01), shape = 1.7, scale = 1), 0),
        col = 1)
polygon(x = c(2, seq(2, 2.4, .01), 2.4),
        y = c(0, dweibull(seq(2, 2.4, .01), shape = 1.7, scale = 1), 0),
        col = 1)
text(x = c(.16, 1.3, 2.2),
     y = c(.75, .65, .15),
     labels = c('Left Censoring', 'Interval Censoring', 'Right Censoring'))
```

| Censoring type | Range | Likelihood |
|---|---|---|
| \(d_{i}\) observations interval censored between \(t_{i-1}\) and \(t_{i}\) | \(t_{i-1}<T\le t_{i}\) | \([F(t_{i})-F(t_{i-1})]^{d_{i}}\) |
| \(l_{i}\) observations left censored at \(t_{i}\) | \(T\le t_{i}\) | \([F(t_{i})]^{l_{i}}\) |
| \(r_{i}\) observations right censored at \(t_{i}\) | \(T>t_{i}\) | \([1-F(t_{i})]^{r_{i}}\) |
Using the expressions in Table 2.2, the total likelihood for a data set containing exact observations, and arbitrarily censored observations is expressed as \[ \begin{aligned} \mathscr{L}(\theta|t_1,\ldots,t_n) &=C\prod_{i=1}^{n} \mathscr{L}(\theta|t_i)\\ &=C\prod_{i=1}^{m+1}[F(t_{i})]^{l_{i}}[F(t_{i})-F(t_{i-1})]^{d_{i}}[1-F(t_{i})]^{r_{i}} \end{aligned} \]
where \(n = \sum_{j=1}^{m+1}(d_{j}+r_{j}+l_{j})\)
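A minimal sketch of maximizing this total likelihood in R, assuming a Weibull \(F(t)\); the inspection times and counts are illustrative (chosen to roughly match the Table 2.1 proportions with \(n=50\)):

```r
# Log-likelihood (dropping C) for left, interval, and right censored counts
xlogy <- function(x, y) ifelse(x > 0, x * log(y), 0)  # avoids 0 * log(0) = NaN

loglik_censored <- function(logpar, t, d, r, l) {
  shape <- exp(logpar[1])                             # log-parametrized so both
  scale <- exp(logpar[2])                             # parameters stay positive
  Ft   <- pweibull(t, shape, scale)                   # F(t_i)
  Ftm1 <- pweibull(c(0, head(t, -1)), shape, scale)   # F(t_{i-1})
  sum(xlogy(l, Ft) + xlogy(d, Ft - Ftm1) + xlogy(r, 1 - Ft))
}

t <- c(0.5, 1.0, 1.5, 2.0)   # inspection times
d <- c(13, 18, 12, 5)        # interval censored failures (illustrative)
r <- c(0, 0, 0, 2)           # survivors right censored at the last inspection
l <- c(0, 0, 0, 0)           # no left censoring

fit <- optim(c(0, 0), function(p) -loglik_censored(p, t, d, r, l))
exp(fit$par)                 # MLEs of the Weibull shape and scale
```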
In general, the multinomial constant term \[ C=\frac{n!}{d_1!\cdots d_{m+1}!} \] does not affect the value of \(\boldsymbol{\hat{\theta}}_{_{MLE}}\) that maximizes the total likelihood
For most models used in this course \(C\) may therefore be treated as a proportionality constant (i.e. \(C=1\))
This section presents a more general expression for the total likelihood function when
The intervals in which a failure occurs overlap
The union of the intervals does not cover the entire time line \((0,\infty)\)
In this course, the form of \(\mathscr{L}\) given in equation 2.14 is sufficient
This section presents an even more general expression for the total likelihood function when
Censoring occurs at random times within the intervals according to \(f_{c}(t)\)
The distributions represented by \(f_{c}(t)\) and \(f(t)\) are independent
This type of censoring can occur if multiple failure modes are present in the unit under test
Multiple failure modes will be considered in Chapter 15
Truncated data will be discussed in Chapter 11