W. Q. Meeker, L. A. Escobar, and J. K. Freels
12 October 2016
Models for continuous failure-time processes
Models used for the discrete data from these continuous failure-time processes
Common censoring mechanisms that restrict our ability to observe all of the failure times that might occur in a reliability study
Principles of likelihood and how likelihood relates to the probability of the observed data
How likelihood ideas can be used to make inferences from reliability data
The text describes the probability functions using the \(WEIB(1.7,1)\) distribution as an example
The SMRD package includes the `distribution.plot()` function, which was used to produce Figure 2.1
If single values are given for both parameters, `distribution.plot()` returns a plot showing \(F(t), f(t), S(t)\), and \(h(t)\)
If multiple values are given for either parameter, `distribution.plot()` returns a plot showing \(F(t), f(t), h(t)\) and a legend of the parameter values displayed
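The same four functions can also be computed directly in base R; a minimal sketch (independent of SMRD) for the \(WEIB(1.7,1)\) example:

```r
# F(t), f(t), S(t), and h(t) for the WEIB(1.7, 1) distribution in base R
t  <- seq(0.01, 3, by = 0.01)
Ft <- pweibull(t, shape = 1.7, scale = 1)  # cdf F(t)
ft <- dweibull(t, shape = 1.7, scale = 1)  # pdf f(t)
St <- 1 - Ft                               # survival function S(t)
ht <- ft / St                              # hazard h(t) = f(t) / S(t)

par(mfrow = c(2, 2))
plot(t, Ft, type = "l", main = "F(t)")
plot(t, ft, type = "l", main = "f(t)")
plot(t, St, type = "l", main = "S(t)")
plot(t, ht, type = "l", main = "h(t)")
```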
Failure implies that a product's ability to perform one or more functions has been...
Lost completely
Degraded to the point that the product cannot meet a performance specification
When failure occurs in a fielded system, there may not be an immediate indication to the user or the maintainers
For complex systems, some failures can remain undiscovered until the next inspection cycle
Component and sub-component failures may only be discovered when a subsystem or the overall system fails
Inspection intervals should be short enough that failures are detected soon after they occur
Inspection intervals should be long enough that inspection costs remain practical
For systems being tested, we cannot always observe exact failure times because
Test items may be in different geographic regions
The sample size may be too large to monitor every item continuously
Test items may be removed from a test to fill a "real-world" need
A test may end before an item fails, or
Items may fail before the first inspection
We can't constantly monitor the performance of every component when a system is in the field.
For this reason real-world failure data are often reported as the inspection time when the failures were observed
Thus, real-world failure data are often reported as DISCRETE values corresponding to the inspection cycle
To analyze these real-world failure data, we partition the test period into non-overlapping intervals corresponding to the times between inspections, i.e.
\[(t_0,t_1], (t_1,t_2],...,(t_{m-1},t_m],(t_m,t_{m+1}),\;\;\;\;\; t_0=0, t_{m+1}=\infty\]
The intervals do not need to be of equal length
Inspections may be made more frequently at the beginning of a test to check for "infant mortality" failures or ensure that the test was set-up correctly
Inspections may then become less frequent once the "bad actors" have been culled from the population
Estimating reliability in this scenario is similar to estimating the probability of a single \((m+1)\)-sided die landing on a specific face
However, for failure data, the 'sides' correspond to the \(m+1\) inspection intervals
Think about how to estimate the probability of a 6-sided die landing on a given side
If the die is 'fair' and we rolled it many times (say \(n \gtrsim 10{,}000\) rolls), we would expect in the long run that \[\frac{\#\text{ of times face }k\text{ is rolled}}{\#\text{ of total rolls}}\rightarrow P(X=k) = \frac{1}{6},\qquad k=1,\ldots,6\]
However, if \(n\) is much smaller, say 100 rolls, the observed face counts are modeled using the multinomial distribution with probability mass function
\[f(d_{1},\ldots,d_{m+1}\,|\,n,\pi_{1},\ldots,\pi_{m+1})=\frac{n!}{d_{1}!\cdots d_{m+1}!}\,\pi_{1}^{d_{1}}\cdots\pi_{m+1}^{d_{m+1}}\]
In the context of rolling a single die with \(m+1\) sides, the variables in the above equation represent
\(n\) - number of dice rolls
\(d_1,...,d_{m+1}\) - number of times each of the faces \(1,...,m+1\) are observed
\(\pi_{_{1}},...,\pi_{_{m+1}}\) - the long-run probabilities of observing faces \(1,...,m+1\)
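Base R's `dmultinom()` evaluates this pmf directly; a quick sketch for a fair six-sided die (the counts below are arbitrary illustrative values):

```r
# Probability of seeing each face exactly twice in n = 12 fair rolls
dmultinom(x = rep(2, 6), prob = rep(1/6, 6))

# Long-run check: relative frequencies from many simulated rolls
rolls <- sample(1:6, size = 10000, replace = TRUE)
table(rolls) / length(rolls)  # each entry should be near 1/6
```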
In the context of failures occurring in discrete time intervals, these variables represent
\(n\) - the number of systems under test
\(d_{1}\) - number of systems that failed in the \(1^{st}\) interval \((t_{0},t_{1}]\)
\(d_{2}\) - number of systems that failed in the \(2^{nd}\) interval \((t_{1},t_{2}]\)
\(d_{m+1}\) - number of systems still operating at the final inspection \((t_{m},\infty)\)
\(\pi_{_{1}},...,\pi_{_{m+1}}\) - the long-run probabilities that a unit fails in interval \(1,...,m+1\), where \(\pi_{_{m+1}}\) is the probability of surviving beyond the final inspection
For a fair die and a fair toss \(\pi_1=\pi_2=\pi_3=\pi_4=\pi_5=\pi_6 \approx 1/6\)
This value of \(1/6\) is called a propensity
The propensity is a function of how the die was manufactured AND how the die was tossed
A "weighted" die would have a different propensity represented by different values of \(\pi_1\ne\pi_2\ne\pi_3\ne\pi_4\ne\pi_5\ne\pi_6\)
Likewise, with some practice we might be able to toss the die in such a way as to affect the propensity
For systems under test \(\pi_{i}=P(t_{i-1}<T\le t_{i})=F(t_{i})-F(t_{i-1})\) where
\(\pi_i\ge 0, \forall i\)
\(\displaystyle \sum_{j=1}^{m+1}\pi_j=1\)
\(F(t)\) is the underlying failure time distribution
Figure 2.5 illustrates the relationship between \(\pi_{i}\) and \(F(t)\), where the underlying failure time distribution is \(WEIB(1.7,1)\)
\[\displaystyle S(t_i)=P(T>t_i)=1-F(t_i)=\sum_{j=i+1}^{m+1}\pi_j\]
Where
\(i\) denotes the index of the inspection interval of interest
\(j\) indexes the intervals that come after interval \(i\)
In words, this equation states that the probability of a unit surviving beyond \(t_i\) equals the sum of the probabilities that it fails in one of the later intervals
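These relationships are easy to verify numerically; a minimal sketch in base R for the \(WEIB(1.7,1)\) example used in the text (note that `pweibull()` uses a `shape`, `scale` parameterization):

```r
# Inspection times; t_{m+1} = Inf closes the final interval
t_i <- c(0, 0.5, 1.0, 1.5, 2.0, Inf)

# pi_i = F(t_i) - F(t_{i-1}) under a Weibull(1.7, 1) failure-time model
pi_i <- diff(pweibull(t_i, shape = 1.7, scale = 1))

all(pi_i >= 0)  # nonnegative
sum(pi_i)       # sums to 1

# S(t_i) as the sum of the remaining pi_j, j = i+1, ..., m+1
rev(cumsum(rev(pi_i)))[-1]  # S(t_1), ..., S(t_m)
```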
Motivation for this example
Many aircraft (A/C) components cannot be inspected by base-level maintenance (Mx) due to a lack of training, facilities, tools, etc.
Components will only be inspected every \(t\) flight-hours at the depot
For each component, \(\exists\) an underlying \(F(t), t\in \mathbb{R}^{+}\) describing time to failure
The form of \(F(t)\) is often unknown, but may sometimes be inferred from similar components/systems
Goal of this example
Estimate risk of delaying replacement until next inspection (depot cycle)
Balance safety risks against replacement costs
Estimating the risk of delaying replacement until the next inspection implies that
The system was still operating at inspection \(i-1\)
We want to know how likely it is that the system will fail before inspection \(i\)
The text defines this quantity in (2.7) as \(p_{i}\)
\(p_{i}\) is the conditional probability that a unit fails in interval \(i\), given that it was still operating at \(t_{i-1}\)
\[\displaystyle p_i=P(t_{i-1}<T\le t_{i}|T>t_{i-1})=\frac{F(t_i)-F(t_{i-1})}{1-F(t_{i-1})}=\frac{\pi_{i}}{S(t_{i-1})}\]
\[\displaystyle S(t_{i})=\prod_{j=1}^{i}[1-p_{j}], i=1,...,m+1.\]
\[\displaystyle F(t_{i})=1-\prod_{j=1}^{i}[1-p_{j}]=\sum_{j=1}^{i}[\pi_{j}], i=1,...,m+1.\]
Table 2.1 shows the results for the \(WEIB(1.7,1)\) distribution
The following steps show how the values in the table were computed for the first and second inspections
| \(i\) | \(t_{i}\) | \(F(t_{i})\) | \(S(t_{i})\) | \(\pi_{i}\) | \(p_{i}\) | \(1-p_{i}\) |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.000 | 1.000 | | | |
| 1 | 0.5 | 0.265 | 0.735 | 0.265 | 0.265 | 0.735 |
| 2 | 1.0 | 0.632 | 0.368 | 0.367 | 0.500 | 0.500 |
| 3 | 1.5 | 0.864 | 0.136 | 0.231 | 0.629 | 0.371 |
| 4 | 2.0 | 0.961 | 0.0388 | 0.0976 | 0.715 | 0.285 |
| 5 | \(\infty\) | 1.000 | 0.0000 | 0.0388 | 1.000 | 0.000 |
| Total | | | | 1.000 | | |
\[ \begin{aligned} F(0.5)&=1-\exp\left[-\left(\frac{0.5}{1}\right)^{1.7}\right]=\underline{0.2649275}\\ S(0.5)&=1-F(0.5)=\underline{0.7350725}\\ \pi_1&=F(t_1)-F(t_0)=F(0.5)-F(0)=\underline{0.2649275}\\ p_1&=\frac{\pi_1}{S(t_0)}=\frac{F(0.5)-F(0)}{S(0)}=\frac{0.2649275}{1}=\underline{0.2649275} \end{aligned} \]
\[ \begin{aligned} F(1)&=1-\exp\left[-\left(\frac{1}{1}\right)^{1.7}\right]=\underline{0.6321206}\\ S(1)&=1-F(1)=\underline{0.3678794}\\ \pi_2&=F(t_2)-F(t_1)=F(1)-F(0.5)=\underline{0.3671931}\\ p_2&=\frac{\pi_2}{S(t_1)}=\frac{F(1)-F(0.5)}{S(0.5)}=\frac{0.3671931}{0.7350725}=\underline{0.4995331} \end{aligned} \]
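The remaining rows of Table 2.1 can be reproduced the same way; a minimal sketch in base R:

```r
# Reproduce Table 2.1 for the WEIB(1.7, 1) distribution
t_i  <- c(0, 0.5, 1.0, 1.5, 2.0, Inf)          # inspection times
F_t  <- pweibull(t_i, shape = 1.7, scale = 1)  # F(t_i)
S_t  <- 1 - F_t                                # S(t_i)
pi_i <- diff(F_t)                              # pi_i = F(t_i) - F(t_{i-1})
p_i  <- pi_i / S_t[-length(S_t)]               # p_i = pi_i / S(t_{i-1})

round(data.frame(t = t_i[-1], F = F_t[-1], S = S_t[-1],
                 pi = pi_i, p = p_i, one_minus_p = 1 - p_i), 4)
```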
Type I (Time) Censoring - Occurs when time considerations limit our ability to observe the exact failure times of one or more samples
Samples still operating after the predetermined end of testing are Type-I right-censored
In this type of test, the number of observed failures is a random quantity
There's no guarantee that any failures will be observed before testing ends
Type II (Failure) Censoring - Occurs when testing continues until a predetermined number of failures has been observed, limiting our ability to observe the exact failure times of the remaining samples
Samples still operating after the predetermined number of failures have been observed are Type-II right-censored
In this type of test, the total test time is a random quantity
There's no guarantee a sufficient number of failures will be observed in a reasonable timeframe
Modeling Type II censored data is simpler than modeling Type I censored data
Failure censored tests (Type II) are far less common than time-censored tests
Time censored test data (Type I) may be further sub-divided according to
The number of censoring events: singly censored (one common censoring time) or multiply censored (several different censoring times)
The contribution to the likelihood function (\(\S\) 2.4): right censored, left censored, or interval censored
This course considers censoring events that are independent of the time to failure
The time at which testing ends is driven by real-world constraints - not because we want to ensure the test ends before an item fails
Inspections should not be delayed so that a longer lifetime can be reported
Items should not be removed from a test because it appears that they are 'about to fail'
Competing risks analysis considers censoring events that are dependent on the time to failure
Dependent censoring is common in medical survivability research
Dependent censoring is beyond the scope of this course
We investigate the time to recurrence of symptoms in cancer patients who've received a new treatment
Patient A moves away during the study - we can't collect any future observations
Patient B is killed in a car crash during the study - we can't collect any future observations
How should we treat the data from these two patients?
Patient A is right censored because there is a positive probability that cancer symptoms could return - although we won't be able to observe it
Suppose we want to simulate 50 failure times to use for an example
How would we do it?
We'd first have to choose a distribution family to simulate from (e.g., lognormal, Weibull, or exponential)
We'd then choose a set of parameter values to indicate the specific member of the chosen distribution family
Finally, we'd use a software application to generate 50 random observations using the distribution/parameter set we chose
So, let's simulate 50 failure observations in the interval \([50,150]\) using the following distributions
\(LOGNOR(\mu = 4.5, \sigma = 0.2)\): `rlnorm(n = 50, meanlog = 4.5, sdlog = 0.2)`
\(WEIB(\beta = 5.5, \eta = 120)\): `rweibull(n = 50, shape = 5.5, scale = 120)`
\(EXP(\theta = 40)\): `rexp(n = 50, rate = 1/40)`
The parameter values in each of these distributions were chosen to generate observations between \(50\) and \(150\)
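Putting the three calls together (the seed below is an arbitrary choice for reproducibility, not a value from the text):

```r
set.seed(123)  # arbitrary seed, chosen only for reproducibility
t_lognor <- rlnorm(n = 50, meanlog = 4.5, sdlog = 0.2)
t_weib   <- rweibull(n = 50, shape = 5.5, scale = 120)
t_exp    <- rexp(n = 50, rate = 1/40)

# Compare the three simulated samples side by side
summary(data.frame(lognormal = t_lognor, weibull = t_weib, exponential = t_exp))
```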
The first plot shows the pdfs of each distribution along with the 50 failure times simulated from each distribution
When the observations are plotted this way, it's easy to see how the distributions differ
The second plot shows the raw observations ordered from smallest to largest
In this plot we can see that the observations simulated from the exponential distribution are skewed toward lower values compared to the Weibull and lognormal distributions
However, it's much harder to distinguish between the observations from the Weibull and lognormal distributions.
Specifying the distribution family and parameter values gives information about the pattern of the data that will result from the simulation
Sometimes it's easy to distinguish different patterns just by looking at the data
More often, numerical methods are required to distinguish between the patterns of data produced by different distributions
Simulating more observations can sometimes make it easier to distinguish between the patterns
But the reverse should be true as well
The data should also provide information about the distribution family and parameter values that were used to generate them
Again, sometimes it's easy to pick out which distribution family produced a data set just by looking at the pattern of the data
As more observations become available, it becomes easier to distinguish which distribution family produced the data
In most cases, numerical methods are needed to distinguish which distribution produced a data set
This numerical method of extracting information from data to identify the distribution that is most likely to have produced it is called maximum likelihood estimation
Suppose that we observe \(DATA = t_{1},t_{2},...,t_{n}\) from a test event
If these observations were produced by units that can reasonably be considered as identical, then
\[t_{1},t_{2},...,t_{n} \overset{i.i.d.}{\sim} F(t|\mathbf{\theta}), \qquad \mathbf{\theta}=\theta_{1},...,\theta_{p}\]
The form of \(F(t|\mathbf{\theta})\) and the values of \(\theta_{1},...,\theta_{p}\) are unknown
But, likelihood theory states that for each candidate distribution family and parameter set \(\mathbf{\theta}\), the contribution of observation \(i\) to the likelihood is
\[\mathscr{L_i}(\mathbf{\theta}|data_i)=f(data_i|\mathbf{\theta})=f(t_{i}|\mathbf{\theta})\]
and the total likelihood of the sample is
\[\mathscr{L}(\theta|DATA)=C\prod_{i=1}^{n} \mathscr{L}_{i}(\theta|data_i)=C\prod_{i=1}^{n}f(t_{i}|\theta)\]
Note that
\(\mathscr{L_i}(\theta|data_i)\) - function of the parameter set \(\theta\) assuming we observe \(data_i\)
\(f(t_{i}|\theta)\) - function of \(data_i\) assuming \(\theta\) is known
The idea behind maximum likelihood is to find the parameter values that maximize the likelihood of observing \(DATA\), based on two inputs:
An assumed probability model - \(f(t)\)
A set of parameter values - \(\mathbf{\theta}=\theta_{1},...,\theta_{p}\)
For example, suppose we observe the following 30 failure times:

```
[1] 11.3320098 7.4660938 3.8889996 13.0314761 9.8971047 11.8703029
[7] 7.2016601 7.1390631 14.0791268 2.2901230 14.2079391 7.1940855
[13] 14.0213653 7.0142401 11.5523583 3.6210279 12.8739551 4.1321127
[19] 12.8495079 7.6663446 15.4626644 9.9028315 7.3127562 10.8901172
[25] 9.9988580 10.9927350 0.4339705 6.1634816 5.2514609 11.3828036
```
If this were test data we wouldn't know the form of the underlying distribution
Suppose we wanted to model these data with an exponential distribution with pdf
\[f(t)=\frac{1}{\theta}\exp\left(-\frac{t}{\theta}\right)\]
The likelihood for the 30 observations is then
\[\prod_{i=1}^{30}f(t_{i}|\theta)=\left(\frac{1}{\theta}\right)^{30}\exp{\left(-\frac{\sum_{i=1}^{30}t_{i}}{\theta}\right)}\]
```
[1] 1.884537
[1] 1.884536
[1] 1.884536
```
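For the exponential model, the MLE has the closed form \(\hat{\theta}=\bar{t}\); below is a minimal sketch of checking this numerically in R (assuming the 30 observations above are stored in a vector `t`; this is not the code that produced the output shown):

```r
# Assumes the 30 observations above are stored in a vector t (illustration only)
# Log-likelihood for the exponential model
loglik <- function(theta) -length(t) * log(theta) - sum(t) / theta

# Numerically maximize the log-likelihood over a plausible range of theta
optimize(loglik, interval = c(0.1, 100), maximum = TRUE)$maximum

# Closed-form MLE: the sample mean
mean(t)
```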
Every item in the sample set under evaluation contributes to the total likelihood
For an underlying failure distribution \(F(t)\), the value of \(\mathscr{L}_i\) for a censored observation is
| Censoring type | Range | Likelihood |
|---|---|---|
| \(d_{i}\) observations interval censored between \(t_{i-1}\) and \(t_{i}\) | \(t_{i-1}<T\le t_{i}\) | \([F(t_{i})-F(t_{i-1})]^{d_{i}}\) |
| \(l_{i}\) observations left censored at \(t_{i}\) | \(T\le t_{i}\) | \([F(t_{i})]^{l_{i}}\) |
| \(r_{i}\) observations right censored at \(t_{i}\) | \(T>t_{i}\) | \([1-F(t_{i})]^{r_{i}}\) |
\[ \begin{aligned} \mathscr{L}(\theta|DATA)&=C\prod_{i=1}^{n} \mathscr{L}_{i}(\theta|data_i)\\ &=C\prod_{i=1}^{m+1}[F(t_{i})]^{l_{i}}[F(t_{i})-F(t_{i-1})]^{d_{i}}[1-F(t_{i})]^{r_{i}} \end{aligned} \]
In general, the multinomial constant term \(C\) does not affect the parameter values that maximize the likelihood function
For most models used in this course \(C\) may be considered a proportionality constant (i.e. \(C=1\))
For interval-censored data, \(C\) is the multinomial coefficient \[C=\frac{n!}{d_1!\cdots d_{m+1}!}\]
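As a minimal sketch (hypothetical inspection times and counts, a Weibull \(F(t)\), and \(C=1\)), this total likelihood can be evaluated on the log scale in R:

```r
# Hypothetical interval-censored data (for illustration only)
t_insp <- c(0, 0.5, 1.0, 1.5, 2.0)  # inspection times, t_0 = 0
d      <- c(5, 8, 4, 2)             # failures in each interval (t_{i-1}, t_i]
r      <- 11                        # units still running at the last inspection

# Log-likelihood with C = 1, assuming a Weibull failure-time distribution
loglik <- function(shape, scale) {
  Ft <- pweibull(t_insp, shape = shape, scale = scale)
  sum(d * log(diff(Ft))) + r * log(1 - Ft[length(Ft)])
}

loglik(shape = 1.7, scale = 1)  # evaluate at one candidate parameter pair
```

Maximizing this function over `shape` and `scale` (e.g., with `optim()`) would yield the ML estimates.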
This section presents a more general expression for the total likelihood function when
The intervals in which a failure occurs overlap
The union of the intervals does not cover the entire time line \((0,\infty)\)
In this course, the form of \(\mathscr{L}\) given in equation 2.14 is sufficient
This section presents an even more general expression for the total likelihood function when
Censoring occurs at random times within the intervals according to \(f_{c}(t)\)
The distributions represented by \(f_{c}(t)\) and \(f(t)\) are independent
This type of censoring can occur if multiple failure modes are present in the unit under test
Multiple failure modes will be considered in Chapter 15
Truncated data will be discussed in Chapter 11