Statistical Methods for Reliability Data

Chapter 4 - Location-Scale-Based Parametric Distributions

W. Q. Meeker, L. A. Escobar, and J. K. Freels

16 July 2018

OVERVIEW

This Chapter Explains...

Important ideas behind parametric models in the analysis of reliability data
Motivation for important functions of model parameters that are of interest in reliability studies
The location-scale family of probability distributions
Properties and importance of the exponential distribution
Properties and importance of the log-location-scale distributons such as the Weibull, lognormal, and loglogistic distributions
How to generate pseudorandom data from a specified distribition (such random data are used in simulation evaluations in subsequent chapters)

4.1 - INTRODUCTION

Parametric Models For Reliability Data

Can simplify the process of computing quantities of interest compared to nonparametric methods
Can be described concisely, by listing a few parameters rather that the entire curve
Can enable extrapolation in the upper or lower tails of the distribution
Can provide smooth estimates of failure-time distributions
Can and should both be used together with nonparametric techniques to analyze a data set

4.2 - QUANTITIES OF INTEREST IN RELIABILITY APPLICATIONS

Are not the parameters themselves, but are functions of one or more parameters

The CDF \(\rightarrow F(t|\theta)\)
- Failure probability for a single unit at or before time \(t\)
- Fraction of units from a population that have failed at or before time \(t\)
The quantile value \(\rightarrow t_{p}=F^{-1}(p|\theta)\)
- Time at which the failure probability for a single unit is \(p\)
- Time at which \(p\) units are likely to have failed
The hazard function \(\rightarrow h(t|\theta)=f(t|\theta)/S(t|\theta)\)
- Conditional failure probability in the interval \([t, t+\Delta t)\)
- Time at which \(p\) units are likely to have failed
The mean lifetime \(\;\;\;\;\;\;\hspace{1pt}\rightarrow \; E[T]=\int_{0}^{\infty}tf(t|\theta)dt=MTTF\)
The lifetime variance \(\;\;\;\rightarrow \; Var[T]=\int_{0}^{\infty}\left(t-E[T]\right)^{2}f(t|\theta)dt\)

4.3 - LOCATION-SCALE AND LOG-LOCATION-SCALE DISTRIBUTIONS

The Location-Scale Family of Distributions

The location_scale app from the teachingApps package illustrates properties of several location-scale distributions
Run the app by pasting the code below in the R console

teachingApps::teachingApp('location_scale')

The Log-Location-Scale Family of Distributions

The log_location_scale app from the teachingApps package illustrates properties of several log-location-scale distributions
Run the app by pasting the code below in the R console

teachingApps::teachingApp('log_location_scale')

4.13 - Generating Psuedorandom Observations From a Specified Distribution

Classifying Types of Psuedorandom Number Generators

Several algorithms exist to generate PRN's
These algorithms are classified by
- Randomness - no algorithm generates truly random numbers, but some generate numbers that are more random than others.
- Periodicity - if enough numbers are generated, every algorithm will eventually show a cyclical pattern. The larger the period, the more numbers that can be generated without developing this pattern
- Speed - how quickly can an algorithm generate 1 million random numbers
Specific 'brands' of PRNG
- Linear Congruential
- Combined Linear Congruential
- Inversive Linear Congruential
- Multiply-With-Carry
- Lehman
- Mersenne Twister (the default PRNG in R, C, MatLab, Python, Julia,...)

Pseudorandom observations in R

Recall, R can compute values for any available probability distribution using the standard set of four functions
These functions are differentiated by the first letter of the function name
Thus, using the normal distribution as an example:
- pnorm(q,mean = 0, sd = 1) returns the value of CDF of a \(NOR(0,1)\) distribution at \(q\)
- dnorm(x,mean = 0, sd = 1) returns the value of PDF of a \(NOR(0,1)\) distribution at \(x\)
- qnorm(p,mean = 0, sd = 1) returns the value of quantile function (inverse CDF) of a \(NOR(0,1)\) distribution at \(p\)
- rnorm(n,mean = 0, sd = 1) returns \(n\) random observations from a \(NOR(0,1)\) distribution

PRNG Seed Values

PRNG's require a seed value to establish a starting point for how the random numbers will be generated
The function set.seed( ) can be used to set the seed value in R
Setting the seed value allows you to create 'repeatable' random numbers Is that an oxymoron?
Why would anyone want repeatable random numbers?
I often use randomly generated data to discuss an example
The matrix \(\bar{A}\) below is one example of this

\[\bar{A}=\left[\begin{array}{rrrr} 17 & 40 & 37 & 38 \\ 49 & 42 & 31 & 29 \\ 33 & 3 & 10 & 12 \\ 50 & 18 & 6 & 23 \\ \end{array} \right]\]

If I say 'the diagonal elements in \(\bar{A}\) are \(30,12,31,16\)' it would be confusing if the actual elements were different
By setting the seed value I know what values will be generated

\[ \bar{A} = \left[\begin{array}{rrrr} 30 & 38 & 40 & 4 \\ 1 & 12 & 3 & 36 \\ 15 & 32 & 31 & 48 \\ 14 & 39 & 45 & 16 \\ \end{array} \right] \]

4.13 - Generating Psuedorandom Observations From a Specified Distribution

Example: Generate PRN's With and Without Seed Values

Eight observations from a \(NOR(0.75, 1.5)\) distribution set.seed(NULL)
Eight observations from a \(NOR(0.75, 1.5)\) distribution set.seed(NULL)
Eight observations from a \(NOR(0.75, 1.5)\) distribution set.seed(42)
Eight observations from a \(NOR(0.75, 1.5)\) distribution set.seed(42)

What Does This Show?

The table below shows the four sets of PRN's that were generated above
- The observations in Set 1 and Set 2 are different because no seed value was set
- The observations in Set 3 and Set 4 are the same because they were generated using the same seed value

set1	set2	set3	set4
0.2251	0.1583	0.9075	0.9075
1.0119	1.6559	1.8055	1.8055
1.3853	1.6727	1.8368	1.8368
2.7658	2.7740	3.6499	3.6499
4.0767	4.4050	3.8822	3.8822
7.4931	9.1429	5.4701	5.4701
25.2562	25.5608	16.5509	16.5509
36.6162	63.8440	20.4357	20.4357

4.13 - Generating Psuedorandom Observations From a Specified Distribution

Pseudorandom Numbers From Continuous Distributions

R simulates observations from contiuous distributions in the following way
1. Create a vector of observations from a \(UNIF(0,1)\) distribution (These observations serve as values for \(F(t_p)= p\))
2. Select a distribution and set of parameter values to simulate from
3. Invert the CDF to find the quantile function \(t_{p}=F^{-1}(p)\)
4. Substitute the \(UNIF(0,1)\) observations for \(p\) and the selected parameter values and solve for \(t_p\)
5. The resulting values for \(t_p\) will be PRN's from the selected distribution

Simulate Observations From A \(EXP(2)\) Distribution

Step 1 - Create a vector of observations from a \(UNIF(0,1)\)
- The code below will return \(n = 10\) observations from a \(UNIF(0,1)\) distribution
- These values represent results from the CDF of an arbitrary distribution
- Note: I did not set a seed here - if you paste this code into your R console you will likely generate different values

probs <- runif(10,0,1)

probs

 [1] 0.97822643 0.11748736 0.47499708 0.56033275 0.90403139 0.13871017
 [7] 0.98889173 0.94666823 0.08243756 0.51421178

Step 2 - Select a distribution and parameter values to simulate from
- We've already selected the exponential distribution with scale parameter \(\theta = 2\)
- The CDF for this specific distribution were simulating from is expressed as

\[ F(t|\theta=2)=1-\exp\left[-\frac{t}{2}\right] \]

Step 3 - Invert the CDF to determine the quantile function \(t_{p}=F^{-1}(p)\)
- The quantile function is found by inverting the CDF (if possible), solving for \(t_p\)
- Not all CDF's can be inverted analytically
- For these distributions we have to solve for \(t_p\) numerically
- Inverting the CDF of the \(EXP(2)\) distribution results in a quantile function expressed as

\[ t_{p}=-\ln[1-p]\times2 \]

Step 4 - Substitute the values generated from the \(UNIF(0,1)\) distribution into the quantile function
- The only unknown in the quantile function above is the value \(p\)
- Substituting the values we generated from the \(UNIF(0,1)\) distribution into this quantile function will return the values of \(t_p\) that correspond to each \(p\) value
- The first line in the code chunk below creates an R function that computes \(t_p\) values from and exponential distribution for any specified value of \(p \text{and} \theta\)
- Substituting the values \(p=\)probs and \(\theta=2\) in this function returns \(10\) simulated observations from the \(EXP(2)\) distribution

# Create function
t_p.exp <- function(p,theta) -log(1-p)*theta

# Evaluate function
t_p.exp(p = probs, theta = 2)

 [1] 7.6541167 0.2499643 1.2887029 1.6434742 4.6874682 0.2986484 9.0001306
 [8] 5.8624462 0.1720693 1.4439650

Step 5 - Check the resulting values for \(t_p\)
- The values returned from this process are PRN's from the selected distribution
- We can compare our simulated values to a plot of the CDF for the \(EXP(2)\) distribution
- In this example, only \(10\) observations were simulated, so the fit may be poor
- Simulating more observations (say \(1000\)) should reduce the sampling error and result in a better fit
- The left plot in the Figure below shows the result for \(n = 10\) observations, the right plot shows the result for \(n = 1000\) observations
- The smooth black line in each plot is the true shape of the \(EXP(2)\) distribution
- The step plot is a nonparametric plot of the empirical CDF of the data

par(family = "serif", bg = NA, mfrow = c(1,2))

probs10   <- runif(10,0,1)
probs1000 <- runif(1000,0,1)

t_p.exp2_10   <- t_p.exp(p = probs10,   theta = 2)
t_p.exp2_1000 <- t_p.exp(p = probs1000, theta = 2)

plot(ecdf(t_p.exp2_10), 
     main = "10 observations",
     col = 'red')
curve(pexp(x,1/2), add = TRUE, lwd = 2)

plot(ecdf(t_p.exp2_1000), 
     main = "1000 observations",
     col = 'blue')
curve(pexp(x,1/2), add = TRUE, lwd = 2)

4.13 - Generating Psuedorandom Observations From a Specified Distribution

Example: Generating Psuedorandom Censored Observations

The probability integral transform provides an efficient way to simulate random observations from many distributions
However, these observations assume that we have complete data
Suppose we wanted to generate \(n\) censored observations from a distribution \(F(t)\)
When planning a test we choose when the test will end beforehand
- After a specified time has elapsed
- After a specified number of failures have been observed
Say we know that observation \(i-1\) occurs before censoring -- what is the probability that observation \(i\) also occurs before censoring?
This can be accomplished using pseudorandom uniform order statistics as follows:
- Suppose \(U_i \sim UNIF(0,1)\) and \(U_{(i)}\) is the \(i^{th}\) order statistic
- For an arbitrary value \(u\) we want to know

\[ \Pr \left [ U_{(i)}\leq u|U_{(i-1)}=u_{(i-1)}\right ]=1 - \left [\frac{1-u } { 1-u_{(i-1)} } \right ]^{(n-i+1)}, \quad u \ge u_{(i-1)} \]

Using the methods previously discussed, this expression can be inverted, resulting in the following expressions - note that \(U(0) = 0\).

\[ \begin{eqnarray*} U_{(1)}&=& 1-[1-U_{(0)}]\times(1-U_{1})^{1/n} \\ U_{(2)}&=& 1-[1-U_{(1)}]\times (1-U_{2})^{1/(n-1)} \\ & \vdots & \\ U_{(r)}&=& 1-[1-U_{(r-1)}] \times (1-U_{r})^{1/(n-r+1)} \end{eqnarray*} \]

Generating Pseudorandom Order Statistics

Generate \(n\) observations from a \(UNIF(0,1)\) distribution

n = 10

obs  <- runif(n,0,1) ; obs

 [1] 0.93739662 0.63554887 0.48474653 0.08589623 0.94491319 0.34671377
 [7] 0.36934622 0.62414874 0.72740583 0.19792811

Sort the \(n\) observations from smallest to largest

obs <- sort(obs, decreasing = F) ; obs

 [1] 0.08589623 0.19792811 0.34671377 0.36934622 0.48474653 0.62414874
 [7] 0.63554887 0.72740583 0.93739662 0.94491319

Store the ranks of the \(n\) observations from \(1,\ldots,n\)

ranks <- rank(obs) ; ranks

 [1]  1  2  3  4  5  6  7  8  9 10

Pre-allocate a vector in which to write the pseudorandom uniform order statistics

U.order <- double(length(obs)) ; U.order

 [1] 0 0 0 0 0 0 0 0 0 0

For Type II (failure censored data) compute the order statistics for the \(r\) failures
- For \(r+1,\ldots,n\), \(U_{r+1}\ldots,U_{n} = U_{r}\)
- The function defined below can be used to produce the psedorandom order statistics

PRUOS <- function(n, r = NULL){
  
  `if`(is.null(r),
       r <- n,
       if(r > n) stop("r value greater than in_vec length - dummy"))
  
  in_vec  <- sort(runif(n,0,1))
  out_vec <- double(r) 
  
  for(i in 1:r) {
    
    `if`(i == 1,
         out_vec[i] <- 1 - (1 - in_vec[i])**(1 / (n - i + 1)),
         out_vec[i] <- 1 - (1 - out_vec[i - 1]) * (1 - in_vec[i])**(1 / (n - i + 1)))
    
    
  }
  
  if(r < n) out_vec[(r + 1):n] <- out_vec[r]
  
  df <- data.frame(probs  = out_vec,
                   censor = rep(c('failed','right'), c(r,n - r)),
                   stringsAsFactors = F)
  
  return(df)
  
}

# Generate order statistics for n=25 systems with r=4 failures
n = 25 ; r = 4
samp1 <- PRUOS(n = n, r = r)

# using these order statistics to generate values from 
# a Weibull(2.25, 10) distribution 
samp1$times <- qweibull(samp1$probs, shape = 2.25, scale = 10)

# Compare values to not using order statistics
samp2 <- data.frame(probs = sort(runif(n)),
                    censor = 'failed',
                    stringsAsFactors = F)

samp2$times <- qweibull(samp2$probs, shape = 2.25, scale = 10)
samp2$times[(r + 1):n] <- samp2$times[r]
samp2$censor[(r + 1):n] <- 'right'

dt1 <- DT::datatable(cbind(samp1,samp2), rownames = F)

DT::formatRound(dt1, 
                digits = 5, 
                columns = which(sapply(dt1$x$data, is.numeric)))

For Type I (time censored data) compute the order statistics for the \(r = n\) failures and cap off the resulting times that occur after a specified censoring time

# Generate order statistics for n=25 systems with r=25 failures
n = 25 ; r = 25
samp3 <- PRUOS(n = n, r = r)

# using these order statistics to generate values from 
# a Weibull(2.25, 10) distribution 
samp3$times <- qweibull(samp3$probs, shape = 2.25, scale = 10)

# Specify a censoring time and cap all values at this time
t_c <- 12.5
samp3$censor[samp3$times >= t_c] <- 'right'
samp3$times[samp3$censor == 'right'] <- max(samp3$times[samp3$times <= t_c])


# Compare values to not using order statistics
samp4 <- data.frame(probs = sort(runif(n)),
                    censor = 'failed',
                    stringsAsFactors = F)

samp4$times <- qweibull(samp4$probs, shape = 2.25, scale = 10)
samp4$censor[samp4$times >= t_c] <- 'right'
samp4$times[samp4$censor == 'right'] <- max(samp4$times[samp4$times <= t_c])

dt2 <- DT::datatable(cbind(samp3,samp4), rownames = F)

DT::formatRound(dt2, 
                digits = 5, 
                columns = which(sapply(dt2$x$data, is.numeric)))