W. Q. Meeker, L. A. Escobar, and J. K. Freels
12 October 2016
Models for continuous failure-time processes
Models used for the discrete data from these continuous failure-time processes
Common censoring mechanisms that restrict our ability to observe all of the failure times that might occur in a reliability study
Principles of likelihood and how likelihood relates to the probability of the observed data
How likelihood ideas can be used to make inferences from reliability data
The text describes the probability functions using the \(WEIB(1.7,1)\) distribution as an example
The SMRD package includes the `distribution.plot()` function, which was used to produce Figure 2.1
If single values are given for both parameters, `distribution.plot()` returns a plot showing \(F(t), f(t), S(t)\), and \(h(t)\)
If multiple values are given for either parameter, `distribution.plot()` returns a plot showing \(F(t), f(t), h(t)\) and a legend of the parameter values displayed
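The same four functions can also be computed directly in base R; a minimal sketch (independent of SMRD) for the \(WEIB(1.7,1)\) example:

```r
# F(t), f(t), S(t), and h(t) for the WEIB(1.7, 1) distribution in base R
t  <- seq(0.01, 3, by = 0.01)
Ft <- pweibull(t, shape = 1.7, scale = 1)  # cdf F(t)
ft <- dweibull(t, shape = 1.7, scale = 1)  # pdf f(t)
St <- 1 - Ft                               # survival function S(t)
ht <- ft / St                              # hazard h(t) = f(t) / S(t)

par(mfrow = c(2, 2))
plot(t, Ft, type = "l", main = "F(t)")
plot(t, ft, type = "l", main = "f(t)")
plot(t, St, type = "l", main = "S(t)")
plot(t, ht, type = "l", main = "h(t)")
```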
Failure implies that a product's ability to perform one or more functions has been...
Lost completely
Degraded to the point that the product cannot meet a performance specification
When failure occurs in a fielded system, there may not be an immediate indication to the user or the maintainers
For complex systems, some failures can remain undiscovered until the next inspection cycle
Component and sub-component failures may only be discovered when a subsystem or the overall system fails
Inspection intervals should be short enough that failures are detected soon after they occur
Inspection intervals should be long enough that inspection costs remain practical
For systems being tested, we cannot always observe exact failure times because
Test items may be in different geographic regions
The sample size may be too large to monitor every item continuously
Test items may be removed from a test to fill a "real-world" need
A test may end before an item fails, or
Items may fail before the first inspection
We can't constantly monitor the performance of every component when a system is in the field.
For this reason real-world failure data are often reported as the inspection time when the failures were observed
Thus, real-world failure data are often reported as DISCRETE values corresponding to the inspection cycle
To analyze these real-world failure data, we partition the test period into non-overlapping intervals corresponding to the times between inspections, i.e.
\[(t_0,t_1], (t_1,t_2],...,(t_{m-1},t_m],(t_m,t_{m+1}),\;\;\;\;\; t_0=0, t_{m+1}=\infty\]
The intervals do not need to be of equal length
Inspections may be made more frequently at the beginning of a test to check for "infant mortality" failures or ensure that the test was set-up correctly
Inspections may then become less frequent once the "bad actors" have been culled from the population
Estimating reliability in this scenario is similar to estimating the probability of a single \((m+1)\)-sided die landing on a specific face
However, for failure data, the 'sides' correspond to the \(m+1\) inspection intervals
Think about how to estimate the probability of a 6-sided die landing on a given side
If the die is 'fair' and we rolled it many times (say \(n \gtrsim 10{,}000\) rolls), we would expect in the long run that \[\frac{\#\text{ of times face }k\text{ is rolled}}{\#\text{ of total rolls}}\rightarrow P(X=k) = \frac{1}{6},\qquad k=1,\ldots,6\]
However, if \(n\) is much smaller, say 100 rolls, the observed face counts are modeled using the multinomial distribution with probability mass function
\[f(d_{1},\ldots,d_{m+1}\,|\,n,\pi_{1},\ldots,\pi_{m+1})=\frac{n!}{d_{1}!\cdots d_{m+1}!}\,\pi_{1}^{d_{1}}\cdots\pi_{m+1}^{d_{m+1}}\]
In the context of rolling a single die with \(m+1\) sides, the variables in the above equation represent
\(n\) - number of dice rolls
\(d_1,...,d_{m+1}\) - number of times each of the faces \(1,...,m+1\) are observed
\(\pi_{_{1}},...,\pi_{_{m+1}}\) - the long-run probabilities of observing faces \(1,...,m+1\)
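Base R's `dmultinom()` evaluates this pmf directly; a quick sketch for a fair six-sided die (the counts below are arbitrary illustrative values):

```r
# Probability of seeing each face exactly twice in n = 12 fair rolls
dmultinom(x = rep(2, 6), prob = rep(1/6, 6))

# Long-run check: relative frequencies from many simulated rolls
rolls <- sample(1:6, size = 10000, replace = TRUE)
table(rolls) / length(rolls)  # each entry should be near 1/6
```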
In the context of failures occurring in discrete time intervals, these variables represent
\(n\) - the number of systems under test
\(d_{1}\) - number of systems that failed in the \(1^{st}\) interval \((t_{0},t_{1}]\)
\(d_{2}\) - number of systems that failed in the \(2^{nd}\) interval \((t_{1},t_{2}]\)
\(d_{m+1}\) - number of systems still operating at the final inspection \((t_{m},\infty)\)
\(\pi_{_{1}},...,\pi_{_{m+1}}\) - the long-run probabilities that a unit fails in interval \(1,...,m+1\), where \(\pi_{_{m+1}}\) is the probability of surviving beyond the final inspection
For a fair die and a fair toss \(\pi_1=\pi_2=\pi_3=\pi_4=\pi_5=\pi_6 \approx 1/6\)
This value of \(1/6\) is called a propensity
The propensity is a function of how the die was manufactured AND how the die was tossed
A "weighted" die would have a different propensity represented by different values of \(\pi_1\ne\pi_2\ne\pi_3\ne\pi_4\ne\pi_5\ne\pi_6\)
Likewise, with some practice we might be able to toss the die in such a way as to affect the propensity
For systems under test \(\pi_{i}=P(t_{i-1}<T\le t_{i})=F(t_{i})-F(t_{i-1})\) where
\(\pi_i\ge 0, \forall i\)
\(\displaystyle \sum_{j=1}^{m+1}\pi_j=1\)
\(F(t)\) is the underlying failure time distribution
Figure 2.5 illustrates the relationship between \(\pi_{i}\) and \(F(t)\), where the underlying failure time distribution is \(WEIB(1.7,1)\)
\[\displaystyle S(t_i)=P(T>t_i)=1-F(t_i)=\sum_{j=i+1}^{m+1}\pi_j\]
Where
\(i\) denotes the index of the inspection interval of interest
\(j\) indexes the intervals that come after interval \(i\)
In words, this equation states that the probability of a unit surviving beyond \(t_i\) equals the sum of the probabilities that it fails in one of the later intervals
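These relationships are easy to verify numerically; a minimal sketch in base R for the \(WEIB(1.7,1)\) example used in the text (note that `pweibull()` uses a `shape`, `scale` parameterization):

```r
# Inspection times; t_{m+1} = Inf closes the final interval
t_i <- c(0, 0.5, 1.0, 1.5, 2.0, Inf)

# pi_i = F(t_i) - F(t_{i-1}) under a Weibull(1.7, 1) failure-time model
pi_i <- diff(pweibull(t_i, shape = 1.7, scale = 1))

all(pi_i >= 0)  # nonnegative
sum(pi_i)       # sums to 1

# S(t_i) as the sum of the remaining pi_j, j = i+1, ..., m+1
rev(cumsum(rev(pi_i)))[-1]  # S(t_1), ..., S(t_m)
```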
Motivation for this example
Many aircraft (A/C) components cannot be inspected by base-level maintenance (Mx) due to a lack of training, facilities, tools, etc.
Components will only be inspected every \(t\) flight-hours at the depot
For each component, \(\exists\) an underlying \(F(t), t\in \mathbb{R}^{+}\) describing time to failure
The form of \(F(t)\) is often unknown, but may sometimes be inferred from similar components/systems
Goal of this example
Estimate risk of delaying replacement until next inspection (depot cycle)
Balance safety risks against replacement costs
Estimating the risk of delaying replacement until the next inspection implies that
The system was still operating at inspection \(i-1\)
We want to know how likely it is that the system will fail before inspection \(i\)
The text defines this quantity in (2.7) as \(p_{i}\)
\(p_{i}\) is the conditional probability that a unit fails in interval \(i\), given that it was still operating at \(t_{i-1}\)
\[\displaystyle p_i=P(t_{i-1}<T\le t_{i}|T>t_{i-1})=\frac{F(t_i)-F(t_{i-1})}{1-F(t_{i-1})}=\frac{\pi_{i}}{S(t_{i-1})}\]
\[\displaystyle S(t_{i})=\prod_{j=1}^{i}[1-p_{j}], i=1,...,m+1.\]
\[\displaystyle F(t_{i})=1-\prod_{j=1}^{i}[1-p_{j}]=\sum_{j=1}^{i}[\pi_{j}], i=1,...,m+1.\]
Table 2.1 shows the results for the \(WEIB(1.7,1)\) distribution
The following steps show how the values in the table were computed for the first and second inspections
| \(i\) | \(t_{i}\) | \(F(t_{i})\) | \(S(t_{i})\) | \(\pi_{i}\) | \(p_{i}\) | \(1-p_{i}\) |
|---|---|---|---|---|---|---|
| 0 | 0.0 | 0.000 | 1.000 | | | |
| 1 | 0.5 | 0.265 | 0.735 | 0.265 | 0.265 | 0.735 |
| 2 | 1.0 | 0.632 | 0.368 | 0.367 | 0.500 | 0.500 |
| 3 | 1.5 | 0.864 | 0.136 | 0.231 | 0.629 | 0.371 |
| 4 | 2.0 | 0.961 | 0.0388 | 0.0976 | 0.715 | 0.285 |
| 5 | \(\infty\) | 1.000 | 0.0000 | 0.0388 | 1.000 | 0.000 |
| Total | | | | 1.000 | | |
\[ \begin{aligned} F(0.5)&=1-\exp\left[-\left(\frac{0.5}{1}\right)^{1.7}\right]=\underline{0.2649275}\\ S(0.5)&=1-F(0.5)=\underline{0.7350725}\\ \pi_1&=F(t_1)-F(t_0)=F(0.5)-F(0)=\underline{0.2649275}\\ p_1&=\frac{\pi_1}{S(t_0)}=\frac{F(0.5)-F(0)}{S(0)}=\frac{0.2649275}{1}=\underline{0.2649275} \end{aligned} \]
\[ \begin{aligned} F(1)&=1-\exp\left[-\left(\frac{1}{1}\right)^{1.7}\right]=\underline{0.6321206}\\ S(1)&=1-F(1)=\underline{0.3678794}\\ \pi_2&=F(t_2)-F(t_1)=F(1)-F(0.5)=\underline{0.3671931}\\ p_2&=\frac{\pi_2}{S(t_1)}=\frac{F(1)-F(0.5)}{S(0.5)}=\frac{0.3671931}{0.7350725}=\underline{0.4995331} \end{aligned} \]
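The remaining rows of Table 2.1 can be reproduced the same way; a minimal sketch in base R:

```r
# Reproduce Table 2.1 for the WEIB(1.7, 1) distribution
t_i  <- c(0, 0.5, 1.0, 1.5, 2.0, Inf)          # inspection times
F_t  <- pweibull(t_i, shape = 1.7, scale = 1)  # F(t_i)
S_t  <- 1 - F_t                                # S(t_i)
pi_i <- diff(F_t)                              # pi_i = F(t_i) - F(t_{i-1})
p_i  <- pi_i / S_t[-length(S_t)]               # p_i = pi_i / S(t_{i-1})

round(data.frame(t = t_i[-1], F = F_t[-1], S = S_t[-1],
                 pi = pi_i, p = p_i, one_minus_p = 1 - p_i), 4)
```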
Type I (Time) Censoring - Occurs when time considerations limit our ability to observe the exact failure times of one or more samples
Samples still operating after the predetermined end of testing are Type-I right-censored
In this type of test, the number of observed failures is a random quantity
There's no guarantee that any failures will be observed before testing ends
Type II (Failure) Censoring - Occurs when testing continues until a predetermined number of failures has been observed, limiting our ability to observe the exact failure times of the remaining samples
Samples still operating after the predetermined number of failures have been observed are Type-II right-censored
In this type of test, the total test time is a random quantity
There's no guarantee a sufficient number of failures will be observed in a reasonable timeframe
Modeling Type II censored data is simpler than modeling Type I censored data
Failure censored tests (Type II) are far less common than time-censored tests
Time censored test data (Type I) may be further sub-divided according to
The number of censoring events: singly censored (one common censoring time) or multiply censored (several different censoring times)
The contribution to the likelihood function (\(\S\) 2.4): right censored, left censored, or interval censored
This course considers censoring events that are independent of the time to failure
The time at which testing ends is driven by real-world constraints - not because we want to ensure the test ends before an item fails
Inspections should not be delayed so that a longer lifetime can be reported
Items should not be removed from a test because it appears that they are 'about to fail'
Competing risks analysis considers censoring events that are dependent on the time to failure
Dependent censoring is common in medical survivability research
Dependent censoring is beyond the scope of this course
We investigate the time to recurrence of symptoms in cancer patients who've received a new treatment
Patient A moves away during the study - we can't collect any future observations
Patient B is killed in a car crash during the study - we can't collect any future observations
How should we treat the data from these two patients?
Patient A is right censored because there is a positive probability that cancer symptoms could return - although we won't be able to observe it
Suppose we want to simulate 50 failure times to use for an example
How would we do it?
We'd first have to choose a distribution family to simulate from (e.g., lognormal, Weibull, or exponential)
We'd then choose a set of parameter values to indicate the specific member of the chosen distribution family
Finally, we'd use a software application to generate 50 random observations using the distribution/parameter set we chose
So, let's simulate 50 failure observations in the interval \([50,150]\) using the following distributions
\(LOGNOR(\mu = 4.5, \sigma = 0.2)\): `rlnorm(n = 50, meanlog = 4.5, sdlog = 0.2)`
\(WEIB(\beta = 5.5, \eta = 120)\): `rweibull(n = 50, shape = 5.5, scale = 120)`
\(EXP(\theta = 40)\): `rexp(n = 50, rate = 1/40)`
The parameter values in each of these distributions were chosen to generate observations between \(50\) and \(150\)
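Putting the three calls together (the seed below is an arbitrary choice for reproducibility, not a value from the text):

```r
set.seed(123)  # arbitrary seed, chosen only for reproducibility
t_lognor <- rlnorm(n = 50, meanlog = 4.5, sdlog = 0.2)
t_weib   <- rweibull(n = 50, shape = 5.5, scale = 120)
t_exp    <- rexp(n = 50, rate = 1/40)

# Compare the three simulated samples side by side
summary(data.frame(lognormal = t_lognor, weibull = t_weib, exponential = t_exp))
```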
The first plot shows the pdfs of each distribution along with the 50 failure times simulated from each distribution
When the observations are plotted this way, it's easy to see how the distributions differ
The second plot shows the raw observations ordered from smallest to largest
In this plot we can see that the observations simulated from the exponential distribution are skewed toward lower values compared to the Weibull and lognormal distributions
However, it's much harder to distinguish between the observations from the Weibull and lognormal distributions.
Specifying the distribution family and parameter values gives information about the pattern of the data that will result from the simulation
Sometimes it's easy to distinguish different patterns just by looking at the data
More often, numerical methods are required to distinguish between the patterns of data produced by different distributions
Simulating more observations can sometimes make it easier to distinguish between the patterns
But the reverse should be true as well
The data should also provide information about the distribution family and parameter values that were used to generate them
Again, sometimes it's easy to pick out which distribution family produced a data set just by looking at the pattern of the data
As more observations become available, it becomes easier to distinguish which distribution family produced the data
In most cases, numerical methods are needed to distinguish which distribution produced a data set
This numerical method of extracting information from data to identify the distribution that is most likely to have produced it is called maximum likelihood estimation
Suppose that we observe \(DATA = t_{1},t_{2},...,t_{n}\) from a test event
If these observations were produced by units that can reasonably be considered as identical, then
\[t_{1},t_{2},...,t_{n} \overset{i.i.d.}{\sim} F(t|\mathbf{\theta}), \qquad \mathbf{\theta}=\theta_{1},...,\theta_{p}\]
The form of \(F(t|\mathbf{\theta})\) and the values of \(\theta_{1},...,\theta_{p}\) are unknown
But, likelihood theory states that for each candidate distribution family and parameter set \(\mathbf{\theta}\), the contribution of observation \(i\) to the likelihood is
\[\mathscr{L_i}(\mathbf{\theta}|data_i)=f(data_i|\mathbf{\theta})=f(t_{i}|\mathbf{\theta})\]
and the total likelihood of the sample is
\[\mathscr{L}(\theta|DATA)=C\prod_{i=1}^{n} \mathscr{L}_{i}(\theta|data_i)=C\prod_{i=1}^{n}f(t_{i}|\theta)\]
Note that
\(\mathscr{L_i}(\theta|data_i)\) - function of the parameter set \(\theta\) assuming we observe \(data_i\)
\(f(t_{i}|\theta)\) - function of \(data_i\) assuming \(\theta\) is known
The idea behind maximum likelihood is to find the parameter values that maximize the likelihood of observing \(DATA\), based on two inputs:
An assumed probability model - \(f(t)\)
A set of parameter values - \(\mathbf{\theta}=\theta_{1},...,\theta_{p}\)
For example, suppose we observe the following 30 failure times:

```
[1] 11.3320098 7.4660938 3.8889996 13.0314761 9.8971047 11.8703029
[7] 7.2016601 7.1390631 14.0791268 2.2901230 14.2079391 7.1940855
[13] 14.0213653 7.0142401 11.5523583 3.6210279 12.8739551 4.1321127
[19] 12.8495079 7.6663446 15.4626644 9.9028315 7.3127562 10.8901172
[25] 9.9988580 10.9927350 0.4339705 6.1634816 5.2514609 11.3828036
```
If this were test data we wouldn't know the form of the underlying distribution
Suppose we wanted to model these data with an exponential distribution with pdf
\[f(t)=\frac{1}{\theta}\exp\left(-\frac{t}{\theta}\right)\]
The likelihood for the 30 observations is then
\[\prod_{i=1}^{30}f(t_{i}|\theta)=\left(\frac{1}{\theta}\right)^{30}\exp{\left(-\frac{\sum_{i=1}^{30}t_{i}}{\theta}\right)}\]
```
[1] 1.884537
[1] 1.884536
[1] 1.884536
```
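For the exponential model, the MLE has the closed form \(\hat{\theta}=\bar{t}\); below is a minimal sketch of checking this numerically in R (assuming the 30 observations above are stored in a vector `t`; this is not the code that produced the output shown):

```r
# Assumes the 30 observations above are stored in a vector t (illustration only)
# Log-likelihood for the exponential model
loglik <- function(theta) -length(t) * log(theta) - sum(t) / theta

# Numerically maximize the log-likelihood over a plausible range of theta
optimize(loglik, interval = c(0.1, 100), maximum = TRUE)$maximum

# Closed-form MLE: the sample mean
mean(t)
```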
Every item in the sample set under evaluation contributes to the total likelihood
For an underlying failure distribution \(F(t)\), the value of \(\mathscr{L}_i\) for a censored observation is
| Censoring type | Range | Likelihood |
|---|---|---|
| \(d_{i}\) observations interval censored between \(t_{i-1}\) and \(t_{i}\) | \(t_{i-1}<T\le t_{i}\) | \([F(t_{i})-F(t_{i-1})]^{d_{i}}\) |
| \(l_{i}\) observations left censored at \(t_{i}\) | \(T\le t_{i}\) | \([F(t_{i})]^{l_{i}}\) |
| \(r_{i}\) observations right censored at \(t_{i}\) | \(T>t_{i}\) | \([1-F(t_{i})]^{r_{i}}\) |
\[ \begin{aligned} \mathscr{L}(\theta|DATA)&=C\prod_{i=1}^{n} \mathscr{L}_{i}(\theta|data_i)\\ &=C\prod_{i=1}^{m+1}[F(t_{i})]^{l_{i}}[F(t_{i})-F(t_{i-1})]^{d_{i}}[1-F(t_{i})]^{r_{i}} \end{aligned} \]
In general, the multinomial constant term \(C\) does not affect the parameter values that maximize the likelihood function
For most models used in this course \(C\) may be considered a proportionality constant (i.e. \(C=1\))
For interval-censored data, \(C\) is the multinomial coefficient \[C=\frac{n!}{d_1!\cdots d_{m+1}!}\]
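As a minimal sketch (hypothetical inspection times and counts, a Weibull \(F(t)\), and \(C=1\)), this total likelihood can be evaluated on the log scale in R:

```r
# Hypothetical interval-censored data (for illustration only)
t_insp <- c(0, 0.5, 1.0, 1.5, 2.0)  # inspection times, t_0 = 0
d      <- c(5, 8, 4, 2)             # failures in each interval (t_{i-1}, t_i]
r      <- 11                        # units still running at the last inspection

# Log-likelihood with C = 1, assuming a Weibull failure-time distribution
loglik <- function(shape, scale) {
  Ft <- pweibull(t_insp, shape = shape, scale = scale)
  sum(d * log(diff(Ft))) + r * log(1 - Ft[length(Ft)])
}

loglik(shape = 1.7, scale = 1)  # evaluate at one candidate parameter pair
```

Maximizing this function over `shape` and `scale` (e.g., with `optim()`) would yield the ML estimates.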
This section presents a more general expression for the total likelihood function when
The intervals in which a failure occurs overlap
The union of the intervals does not cover the entire time line \((0,\infty)\)
In this course, the form of \(\mathscr{L}\) given in equation 2.14 is sufficient
This section presents an even more general expression for the total likelihood function when
Censoring occurs at random times within the intervals according to \(f_{c}(t)\)
The distributions represented by \(f_{c}(t)\) and \(f(t)\) are independent
This type of censoring can occur if multiple failure modes are present in the unit under test
Multiple failure modes will be considered in Chapter 15
Truncated data will be discussed in Chapter 11