W. Q. Meeker, L. A. Escobar, and J. K. Freels
23 July 2018
Applications of probability plots
Basic probability plotting concepts for both complete and censored data
How to analytically linearize a cdf on special plotting scales
How to plot a nonparametric estimate of \(F(t)\) on probability paper and how to use such a plot to judge the adequacy of a particular parametric distribution
Analytical and simulation methods of separating useful information from "noise" when using a probability plot to assess the reasonableness of a particular model
Graphical estimates of important reliability characteristics like failure probabilities and distribution quantiles
Reveal information about a population, a process, or data that might otherwise escape detection
Assess the adequacy of a particular distributional model
Provide nonparametric graphical estimates of probabilities and distribution quantiles
Obtain graphical estimates of parametric model parameters (e.g. by fitting a straight line through the points on a probability plot)
Display the results of a parametric maximum likelihood fit along with the data
Probability plots are nonparametric tools that should be used early on to eliminate poorly fitting models
Probability plots use transformed axes to plot a linearized cdf vs \(t\)
teachingApps::teachingApp('probability_plotting')Identify the event times and sort them
If we're trying to model data - then we should already have data
Data should be sorted smallest to largest
In R, we can sort data using sort( )
Invert and rearrange \(F(t_{p})\) to obtain a linear function of the form
\[ \begin{aligned} y&=m(x)\;+b\\\\ F^{-1}(p)&=m(t_{p})+b \end{aligned} \]
Figure out some a way to come up with values for \(p\) to use in the plot
The values are called
We'll discuss plotting positions at length in a minute
For now, we'll use what is known as the Hazen plotting position which is expressed as
\[\hat{p}_i=\frac{i-0.5}{n}\]
|
|
|
|
|
|
The Figures 6.1-6.5 show probability plots using ideal data (aka cheater data) where every observation falls on the line
This
If you ever present a probability plot that looks like these figures - I'll know your're lying
Sort the observations from smallest to largest \(t_1, t_2,...,t_n\)
Linearize the exponetial CDF
For \(T\sim EXP(\theta,\gamma)\) the CDF is \[F(t_{p})=p=1-\exp\left(-\frac{t_{p}-\gamma}{\theta}\right),\quad t_{p}\in [0,\infty)\]
Inverting and rearranging the exponential CDF gives the following linear equation \[ \begin{aligned} p&=1-\exp\left(-\frac{t_{p}-\gamma}{\theta}\right)\\ 1-p&=\exp\left(-\frac{t_{p}-\gamma}{\theta}\right)\\ \ln\left[\frac{1}{1-p}\right]&=\frac{t_{p}}{\theta}- \frac{\gamma}{\theta} \end{aligned} \]
Plot the points
If the points roughly fall on a straight line the data
\(\theta \equiv\;\) is the slope of this line, i.e. \(\displaystyle m = \frac{1}{\theta}\)
\(\gamma \equiv\;\) the value of the \(y\) intercept, i.e. \(\displaystyle b=\frac{\gamma}{\theta}\)
par(family='serif',font = 2)
p <- seq(.01,.99,.01)
gamma1 <- 0
theta1 <- 50
gamma2 <- 0
theta2 <-200
plot(qexp(p,rate=1/theta1) - gamma1,-log(1-p),
type='l', las = 1,
lwd=2, col=1,
xlim=range(0,800),
xlab='Time',
ylab='Standard Exponential Quantile')
lines(qexp(p,rate=1/theta2) - gamma2, -log(1-p),
lwd=2, col=2)
text(c(100,650),c(4,2.5),
c(expression(theta*', '*gamma*' = 50, 0'),
expression(theta*', '*gamma*' = 200, 0')))
box(lwd=1.25)Sort the observations from smallest to largest \(t_1, t_2,...,t_n\)
Linearize the normal CDF
For \(T\sim NOR(\mu,\sigma)\) the CDF is \[F(t_{p})=p=\Phi_{NOR}\left[\frac{t_{p}-\mu}{\sigma}\right] \;\;t_{p}\in [-\infty,\infty)\]
Inverting this CDF gives the normal quantile function \[t_p=F^{-1}(p)=\Phi^{-1}_{NOR}(p)\sigma +\mu\]
Rearranging this expression as a linear equation gives \[\Phi_{NOR}^{-1}(p)=\frac{t_{p}}{\sigma}-\frac{\mu}{\sigma}\]
Plot the points
If the points roughly fall on a straight line the data
\(\sigma \equiv\;\) the slope of this line, i.e. \(\displaystyle m=\left(\frac{1}{\sigma}\right)\)
\(\mu \equiv\;\) the \(x\)-value where the line crosses \(\displaystyle \Phi_{NOR}^{-1}(p) = 0\)
par(family='serif', font=2)
p <- seq(.01,.99,.01)
mean1 <- 40
sd1 <- 10
mean2 <- 80
sd2 <- 5
plot(qnorm(p, mean = mean1, sd = sd1), qnorm(p),
type='l',
lwd=2, col=1,
xlim=range(0,100),
las = 1,
xlab='Time',
ylab='Standard Normal Quantile')
lines(qnorm(p,mean = mean2, sd = sd2), qnorm(p), lwd=2, col=2)
abline(h=0, lty=2)
text(c(15, 85), c(-1,-2),
c(expression(mu*', '*sigma*' = 40, 10'),
expression(mu*', '*sigma*' = 80, 5')))
box(lwd=1.25)Sort the observations from smallest to largest \(t_1, t_2,...,t_n\)
Linearize the lognormal CDF
For \(T\sim LOGNOR(\mu,\sigma)\) the CDF is \[F(t_{p})=p=\Phi_{NOR}\left[\frac{\log(t_{p})-\mu}{\sigma}\right] \;\;t_{p}\in [0,\infty)\]
Inverting this CDF gives the lognormal quantile function \[\log[t_p]=F^{-1}(p)=\Phi^{-1}_{NOR}\sigma+\mu\]
Rearranging this expression as a linear equation gives \[\Phi_{NOR}^{-1}(p)=\frac{\log[t_{p}]}{\sigma}-\frac{\mu}{\sigma}\]
Plot the points
If the points roughly fall on a straight line, the data
\(\log[\sigma] \equiv\;\) the slope of this line i.e. \(\displaystyle m = \left(\frac{1}{\sigma}\right)\)
\(\log[\mu] \equiv\;\) the \(x\)-value where the line crosses \(\displaystyle \Phi_{NOR}^{-1}(p) = 0\)
par(family='serif',font=2)
p <- seq(.01,.99,.01)
meanlog1 <- log(50)
sdlog1 <- 2
meanlog2 <- log(500)
sdlog2 <- 2
meanlog3 <- log(500)
sdlog3 <- 1
plot(qlnorm(p,meanlog = meanlog1,
sdlog = sdlog1),
qnorm(p),
type='l',
lwd=2,
xlim=c(1,1000),
las = 1,
xlab='Time',
ylab='Standard Normal Quantile',
log='x')
lines(qlnorm(p,meanlog = meanlog2, sdlog = sdlog2),
qnorm(p), lwd=2, col=2)
lines(qlnorm(p,meanlog = meanlog3, sdlog = sdlog3),
qnorm(p), lwd=2, col=3)
abline(h=0, lty=2)
box(lwd=1.25)
legend('topleft',
c(expression('exp('*mu*'),'*sigma*' = 50,2'),
expression('exp('*mu*'),'*sigma*' = 500,2'),
expression('exp('*mu*'),'*sigma*' = 500,1')),
lty = 1,col = c(1:3),lwd = 2, bty = 'n')Sort the observations from smallest to largest \(t_1, t_2,...,t_n\)
Linearize the smallest extreme value CDF
For \(T\sim SEV(\mu,\sigma)\) the CDF is \[F(t_{p})=p=\Phi_{SEV}\left[\frac{t_{p}-\mu}{\sigma}\right] \;\;t_{p}\in [-\infty,\infty)\]
Inverting this CDF gives the SEV quantile function \[t_p=F^{-1}(p)=\Phi^{-1}_{SEV}(p)\sigma +\mu\]
Rearranging this expression as a linear equation gives \[\Phi_{SEV}^{-1}(p)=\frac{t_{p}}{\sigma}-\frac{\mu}{\sigma}\]
Plot the points
If the points roughly fall on a straight line the data
\(\sigma \equiv\;\) the slope of this line i.e. \(\displaystyle m =\left(\frac{1}{\sigma}\right)\)
\(\mu \equiv\;\) the \(x\)-value where the line crosses \(\displaystyle \Phi_{SEV}^{-1}(p) = 0\)
par(family='serif', font=2)
library(SMRD)
p<- seq(.01,.99,.01)
mean1 <- 40
sdev1 <- 10
mean2 <- 80
sdev2 <- 5
plot(qsev(p, location=mean1, scale=sdev1), qsev(p),
type='l',
lwd=2, col=1,
xlim=range(0,100),
las = 1,
xlab='Time',
ylab='Standard SEV Quantile')
lines(qsev(p,location=mean2, scale=sdev2), qsev(p), lwd=2, col=2)
abline(h=0, lty=2)
text(c(15, 85), c(-1,-2),
c(expression(mu*', '*sigma*' = 40, 10'),
expression(mu*', '*sigma*' = 80, 5')))
box(lwd=1.25)Sort the observations from smallest to largest \(t_1, t_2,...,t_n\)
Linearize the Weibull\((\beta,\eta)\) CDF
For \(T\sim WEIB(\beta,\eta)\) the CDF is \[F(t_{p})=p=1-\exp\left[-\left(\frac{t}{\eta}\right)^{\beta}\right]\]
Inverting this CDF gives the Weibull quantile function \[\log[t_p]=F^{-1}(p)=\Phi^{-1}_{SEV}(p)\sigma +\mu\]
Rearranging this expression as a linear equation gives \[\log\left[\log\left[\frac{1}{1-p}\right]\right]=\frac{\log\left[t_{p}\right]}{\beta}-\frac{\log[\eta]}{\beta}\]
Plot the points
If the points roughly fall on a straight line the data
\(\beta \equiv\;\) the slope of the line \(\displaystyle m = \left(\frac{1}{\beta}\right)\)
\(\theta \equiv\;\) the \(x\)-value where the line crosses \(\displaystyle \Phi_{SEV}^{-1}(p) = 0\)
Sort the observations from smallest to largest \(t_1, t_2,...,t_n\)
Linearize the Weibull\((\mu,\sigma)\) CDF
For \(T\sim WEIB(\mu,\sigma)\) the CDF is \[F(t_{p})=p=\Phi_{SEV}\left(\frac{\log[t_p]-\mu}{\sigma}\right)\]
Inverting this CDF gives the Weibull quantile function \[\log[t_p]=F^{-1}(p)=\Phi^{-1}_{SEV}(p)\sigma +\mu\]
Rearranging this expression as a linear equation gives \[\Phi_{SEV}^{-1}(p)=\frac{\log[t_{p}]}{\sigma}-\frac{\mu}{\sigma}\]
Plot the points
If the points roughly fall on a straight line the data
\(\sigma \equiv\;\) the slope of this line, i.e. \(\displaystyle m = \left(\frac{1}{\sigma}\right)\)
\(\mu \equiv\;\) the \(x\)-value where the line crosses \(\displaystyle \Phi_{SEV}^{-1}(p) = 0\)
par(family='serif',font=2)
library(SMRD)
p<- seq(.01,.99,.01)
plot(qweibull(p,.5,50), qsev(p),
type='l',
lwd = 2,
col = 1,
xlim = c(1,1000),
las = 1,
xlab='Time',
ylab='Standard SEV Quantile',
log='x')
lines(qweibull(p,0.5,500),
qsev(p), lwd = 2, col = 2)
lines(qweibull(p,1.0,500),
qsev(p), lwd = 2, col = 3)
abline(h=0, lty=2)
box(lwd=1.25)
legend(x = 400,
y = -4.6,
c(expression(eta*','*beta == ~~50*','*0.5),
expression(eta*','*beta == 500*','*0.5),
expression(eta*','*beta == 500*','*1.0)),
lty = 1,
col = c(1:3),
lwd = 2,
bty = 'n',
y.intersp = 0.75,
yjust = 0.2,
xjust = 0.5,
cex = 0.9,
seg.len = 1)Assessing the adequacy of a distribution's fit for a data set is an important application of probability plots
Probability plots are created by
Identifying the points \(\left(t_{i},\hat{F}(t_{i})=\hat{p}_{i}\right), \;\;\;\;i=1,2,...,n\)
Plotting these points on axes that have been transformed for a distribution of interest
Observe if the data fall approximately on a straight line
We discussed most of this in 6.2
These values are called
Choosing a plotting position formula depends on the purpose of the plot
Checking distributional assumptions
Estimating distribution parameters
The best plotting position formula is a function of
The assumed underlying model
The function to be estimated \((t_{p}, h(t), F(t),...)\)
The type and amount of censoring
\[ p_{i}=\frac{i-a}{n+1-2a} \]
where
\(i\) is an index of the ordered observations (smallest \(\rightarrow\)largest)
\(n\) is the sample size
\(a\) is the plotting position parameter
\(p_{i}\) is the value of the plotting position corresponding to time \(t_{i}\)
The value of \(a\) is chosen to produce approximately unbiased estimates for \(F(t_{i})\) for an assumed distribution
| Proponent | Year | \(a\) | Formula |
|---|---|---|---|
| Hazen | 1914 | 0.50 | \(\displaystyle\frac{i-0.5}{n}\) |
| Weibull | 1939 | 0 | \(\displaystyle\frac{i}{n+1}\) |
| Blom | 1958 | 0.375 | \(\displaystyle\frac{i-0.375}{n+0.25}\) |
| Gringorten | 1963 | 0.44 | \(\displaystyle\frac{i-0.44}{n+0.12}\) |
| Chegodayev | 2000 | 0.30 | \(\displaystyle\frac{i-0.30}{n+0.4}\) |
| Cunnane | 1977 | 0.40 | \(\displaystyle\frac{i-0.40}{n+0.2}\) |
| Median | 1943 | 0.3175 | \(\displaystyle\frac{i-0.3175}{n+0.365}\) |
\[ \begin{aligned} \text{Hazen}\; &p_{i}=\frac{i-0.5}{n}\\\\ \text{Chegodayev}\; &p_{i}=\frac{i-0.3}{n+0.4}\;\text{also known as "Median-Ranks"}\\\\ \text{Weibull}\; &p_{i}=\frac{i}{n+1}\\\\ \end{aligned} \]
\[ p_{i}=\begin{cases}\frac{i}{n}, &\mbox{Distribution free - (Eq 3.1)}\\\\ \frac{i-0.5}{n}, &\mbox{Assumed distribution}\end{cases} \]
Case 1 - Continuous inspection or small inspection intervals
Case 2 - Data With Continuous Inspection And Multiple Right Censoring Events
Case 3 - Interval censoring with large intervals
Case 4 - Arbitrary censoring
Left censoring
Right censoring
Interval censoring
Overlapping intervals
\[ \hat{p}_{i}=\frac{i}{n} \]
Estimator properties
Increases by \(\frac{1}{n}\) at each failure time \(t_{i}\)
Underestimates \(F(t)\) prior to the first failure time at \(t_{1}\)
Overestimates \(F(t)\) after the last failure time at \(t_{n}\)
plot(data.ld)
plot(data.ld, distribution = NULL)
A compromise plotting position for addressing this bias uses midpoint of the \(1/n\) jumps
\[ \hat{p}_{i}=\frac{i-0.5}{n}=\frac{1}{2}\left[\hat{F}(t_{(i)})+\hat{F}(t_{(i-1)})\right] \]
The SMRD package uses this comprimise estimator for parametric CDF plots
plot(data.ld, distribution = "distribution")SMRD::smrd_app('at7987_data')For this case, each failure
Censored observations between failures will reduce \(n\)
Furthermore, \(\frac{i-0.5}{n}\ne\frac{1}{2}\left[\hat{F}(t_{(i)})+\hat{F}(t_{(i-1)})\right]\)
For this case, we use the estimated values \(\hat{F}(t_{(i)})\) directly as
\[ \hat{p_{i}}=\frac{1}{2}\left[\hat{F}(t_{(i)})+\hat{F}(t_{(i-1)})\right] \]
Where Equations 3.6 & 3.7 are used to compute
\(\hat{F}(t_{(i)})\)
\(\hat{F}(t_{(i-1)})\)
\[ \begin{aligned} \hat{F}(t_{i})&=1-\prod_{j=1}^i \left[1-\hat{p}_{j}\right]\\\\ &=1-\prod_{j=1}^i \left[1-\frac{d_j}{n_j}\right] \end{aligned} \]
\[ \begin{aligned} p_i=\frac{1}{2}\left[1-\prod_{j=1}^i \left[1-\frac{d_j}{n_j}\right]+1-\prod_{j=1}^{i-1} \left[1-\frac{d_j}{n_j}\right]\right] \end{aligned} \]
SMRD::smrd_app('shockabsorber_data6')In this case, failures "occur" at the end of an interval when discovered by inspection
\(\hat{F}(t)\) defined as the upper endpoint of an interval in which failures occurred
For this case, the plotting position is
\[ p_{i}=\hat{F}(t_{(i)}) \]
Where
\(E\left[\hat{F}(t_{(i)})\right]=F(t_{(i)})\) when there are no losses due censoring
\(E\left[\hat{F}(t_{(i)})\right]\approx F(t_{(i)})\) when multiple censoring does occur
SMRD::smrd_app('heatexchanger_data6')In this case the data can consist of a mixture of
Left censored observations
Right censored observations
Interval censored observations
Exact failures
For arbitrarily censored data the plotting positions \(\hat{F}(t)\) are computed using the formulas for the other cases described above
SMRD::smrd_app('turbine_data6')par(family = 'serif', font = 2)
Bleed.ld <-
frame.to.ld(bleed,
response.column = 1,
censor.column = 2,
case.weight.column = 3,
x.columns = c(4),
time.units = 'Hours')
plot(Bleed.ld,
distribution = NULL)Sharp bends or departures from linearity could indicate the existence of
Multiple failure modes
Different usage patterns in the sample
An unexpected or unaccounted stressor
For this reason reliability data should include
The failure mode(s) observed
Variations in the usage environment
Variations in the amount of usage
SMRD::smrd_app('bleed_data6')