A potential customer would like to know how long an expensive piece of industrial equipment they intend to purchase will operate before failing. A maintenance manager needs to decide whether an operating unit should be shut down and brought in for maintenance or left running for some more time. A manufacturing company would like to allocate a budget for potential warranty claim returns and would like to know the number of product failures to expect in the next 6 months.
Questions of this nature can be addressed by analyzing data from past failures and constructing prediction intervals. In this blog article, we discuss prediction intervals, how they differ from confidence intervals, and why it is necessary to calibrate prediction intervals. The methods mentioned in this blog post are from a chapter on prediction intervals in the textbook Statistical Methods for Reliability Data, Second Edition (William Q. Meeker, Luis A. Escobar, Francis G. Pascual) that I have been reading. I highly recommend this book to anyone interested in applying statistical methods in the field of reliability engineering.
A prediction interval is an interval within which a future observation is likely to fall, whereas a confidence interval is an interval within which a population parameter is likely to fall. Let's look at this difference with the help of an example.
Suppose there are 1000 houses for sale in a particular neighborhood and we are interested in the average sale price of a house. If we had access to all 1000 sale prices, finding the average would be a straightforward exercise. Now, suppose that only 50 houses have been sold so far and we want to use this data to estimate the average sale price of the 1000 houses in the neighborhood.
In the above example, our sample size is 50 and the population size is 1000. The population parameter that we want to estimate is the mean (or average) sale price of the 1000 homes (other quantities, like the median or any other quantile, are also population parameters). This population parameter is an unknown quantity, but it is fixed! (We are talking about the frequentist approach; in the Bayesian approach, parameters are also treated as random quantities.) We can use the 50 sale prices to construct a confidence interval for the mean of the 1000 sale prices.
On the other hand, suppose we want to know the price range within which the next house that sells is going to fall. This range would be the prediction interval. Note that the sale price of the next house sold is a random quantity, unlike the unknown population parameter, which is fixed. The prediction that we are interested in is an “unknown and random” quantity, whereas the average sale price of a house in the neighborhood is an “unknown but fixed” quantity. Confidence intervals are constructed for “unknown but fixed” population parameters (like the mean in this case), and prediction intervals are constructed for “unknown and random” quantities (the sale price of the next house sold in this case). The rest of this article discusses prediction intervals and how to calibrate them.
Similar to the example of house sale prices, consider a scenario where a customer wants to purchase an expensive piece of industrial equipment and is interested in a prediction interval for that particular piece of equipment. A confidence interval would give the customer an idea of the mean time to failure of all identical equipment manufactured in the past, present, and future. Although this is good information to have, the customer wants an interval for a single piece of equipment, which must account for individual variability in addition to the uncertainty in estimating a parameter. Due to this additional variability, prediction intervals are wider than confidence intervals.
We can use data from historical failures to get a prediction interval for a piece of equipment. The difference between this case and the previous example of house prices is that historical data from equipment failures may be incomplete. This means that in addition to failures, the data might consist of censored observations, where the equipment hasn't failed by the end of the observation period (these are right censored observations; reliability data can also contain left or interval censored observations and even truncated observations).
In the rest of this article, a dataset consisting of past failure (and censored) observations is analyzed to construct a prediction interval. The interval is then calibrated to ensure adequate coverage probability (a concept discussed later in the article). All computations are performed using R and the code is available here.
Table 1 shows failure times (in millions of operations) of 40 randomly selected mechanical switches tested in a facility. 3 switches had not failed by the end of the test, leading to right censored observations. This dataset is from Nair, V. N. (1984), “Confidence bands for survival functions with censored data: A comparative study,” Technometrics 26, 265–275. It is also publicly available on DataShare, Iowa State University's open data repository, and can be accessed through this link.
Table 1. Mechanical Switches Life Test Data
| Millions of Operations | Failure Mode |
|---|---|
| 1.151 | Spring B |
| 1.170 | Spring B |
| 1.248 | Spring B |
| 1.331 | Spring B |
| 1.381 | Spring B |
| 1.499 | Spring A |
| 1.508 | Spring B |
| 1.534 | Spring B |
| 1.577 | Spring B |
| 1.584 | Spring B |
| 1.667 | Spring A |
| 1.695 | Spring A |
| 1.710 | Spring A |
| 1.955 | Spring B |
| 1.965 | Spring A |
| 2.012 | Spring B |
| 2.051 | Spring B |
| 2.076 | Spring B |
| 2.109 | Spring A |
| 2.116 | Spring B |
| 2.119 | Spring B |
| 2.135 | Spring A |
| 2.197 | Spring A |
| 2.199 | Spring B |
| 2.227 | Spring A |
| 2.250 | Spring B |
| 2.254 | Spring A |
| 2.261 | Spring B |
| 2.349 | Spring B |
| 2.369 | Spring A |
| 2.547 | Spring A |
| 2.548 | Spring A |
| 2.738 | Spring B |
| 2.794 | Spring A |
| 2.883 | Censored |
| 2.883 | Censored |
| 2.910 | Spring A |
| 3.015 | Spring A |
| 3.017 | Spring A |
| 3.793 | Censored |
There are two modes of failure, Spring A and Spring B (the two springs are identical in construction, but are subject to different stresses, thereby leading to different time-to-failure probability distributions). We are interested in constructing a prediction interval for the time to failure of a mechanical switch, irrespective of the mode of failure. Note that a proper analysis of this dataset should treat the two modes of failure as competing risks and apply appropriate methods capable of handling competing risks. We are not going to do that in this article. The purpose of this blog article is to discuss prediction intervals and their calibration, so a simple parametric model ignoring competing risks is assumed in the interest of a succinct article focused on one topic, i.e., prediction intervals.
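For the computations that follow, here is a minimal sketch of how the Table 1 data can be set up in R. The object and column names (`switches`, `millions`, `status`) are my own choices; `status = 1` marks a failure and `status = 0` a right censored observation.

```r
library(survival)

# Times in millions of operations, transcribed from Table 1
millions <- c(1.151, 1.170, 1.248, 1.331, 1.381, 1.499, 1.508, 1.534,
              1.577, 1.584, 1.667, 1.695, 1.710, 1.955, 1.965, 2.012,
              2.051, 2.076, 2.109, 2.116, 2.119, 2.135, 2.197, 2.199,
              2.227, 2.250, 2.254, 2.261, 2.349, 2.369, 2.547, 2.548,
              2.738, 2.794, 2.883, 2.883, 2.910, 3.015, 3.017, 3.793)

# 1 = failure, 0 = right censored (the two 2.883s and 3.793 are censored)
status <- as.integer(!(millions %in% c(2.883, 3.793)))

switches <- data.frame(millions, status)
```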
The log-location-scale distribution model that we consider for this dataset is shown in Eq. A below.
\[
\begin{align*}
F(t;\mu,\sigma) & = \Pr(T\le t)\\
& = \Phi \left[\frac{\log(t)-\mu}{\sigma}\right] \tag{Eq. A}
\end{align*}
\]
The choice of \(\Phi\) determines the time-to-failure distribution. For example, if \(T\) is assumed to have a Weibull distribution, \(\Phi\) is the cumulative distribution function (CDF) of the standard smallest extreme value (SEV) distribution. Similarly, if \(T\) is assumed to have a log-normal distribution, then \(\Phi\) is the CDF of the standard normal distribution.
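To make this correspondence concrete, here is a quick numerical check of Eq. A in R for both candidate distributions. The parameter values below are arbitrary, chosen only for illustration.

```r
t <- 2; mu <- 0.7; sigma <- 0.3   # arbitrary illustrative values

# Log-normal case: Phi is the standard normal CDF
plnorm(t, meanlog = mu, sdlog = sigma)
pnorm((log(t) - mu) / sigma)              # same value

# Weibull case: Phi is the standard SEV CDF, Phi_SEV(z) = 1 - exp(-exp(z)),
# with mu = log(scale) and sigma = 1/shape
pweibull(t, shape = 1 / sigma, scale = exp(mu))
1 - exp(-exp((log(t) - mu) / sigma))      # same value
```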
Weibull and log-normal distributions are widely used to describe time-to-failure distributions in reliability applications. Other distributions (e.g., exponential, log-logistic, generalized gamma, normal) can also be considered, depending on the application and motivated by the failure process. Only the log-normal and Weibull distributions are considered as candidate models in this article.
Figure 1 shows a Weibull probability plot and Figure 2 shows a log-normal probability plot of the mechanical switches dataset.
Figure 1: Weibull probability plot
Figure 2: log-normal probability plot
The log-normal model looks like a better fit to the data than the Weibull model, especially at the lower tail. The AIC of the Weibull model is 83.01 and the AIC of the log-normal model is 77.38, further reinforcing that the log-normal is a better choice for this data. The maximum likelihood (ML) estimates of the parameters of the log-normal distribution and their standard errors are shown in Table 2 below.
Table 2. Maximum Likelihood estimates of the log-normal model
| Quantity | Estimate |
|---|---|
| \(\hat{\mu}\) | 0.723 |
| \(\mathrm{SE}_{\hat{\mu}}\) | 0.048 |
| \(\hat{\sigma}\) | 0.300 |
| \(\mathrm{SE}_{\hat{\sigma}}\) | 0.036 |
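As a sketch of how these quantities can be obtained with the survival package (assuming the `switches` data frame from the earlier setup sketch): `survreg` fits log-location-scale models, reporting \(\hat{\mu}\) as the intercept and \(\hat{\sigma}\) as the scale.

```r
library(survival)

fit_wei   <- survreg(Surv(millions, status) ~ 1, data = switches, dist = "weibull")
fit_lnorm <- survreg(Surv(millions, status) ~ 1, data = switches, dist = "lognormal")

# Compare the two candidate models
AIC(fit_wei, fit_lnorm)

# Table 2 quantities for the log-normal model
mu_hat    <- unname(coef(fit_lnorm))   # intercept = mu-hat (log scale)
sigma_hat <- fit_lnorm$scale           # scale = sigma-hat
se        <- sqrt(diag(vcov(fit_lnorm)))  # survreg parameterizes (mu, log(sigma))
se_mu     <- se[1]
se_sigma  <- sigma_hat * se[2]         # delta method for the SE of sigma itself
```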
Now that we have the ML estimates of the log-normal distribution parameters, we can easily obtain a 90% approximate prediction interval (“approximate” as opposed to an “exact” prediction interval, where the confidence level is exactly met). A one-sided lower 95% approximate prediction bound and a one-sided upper 95% approximate prediction bound are combined to form an equal-tailed, two-sided 90% approximate prediction interval. Although it might be possible to find a narrower interval with unequal upper and lower tail probabilities that add up to \(\alpha = 10\%\), two-sided intervals are often reported in reliability applications even when the primary interest is on one side or the other. With the appropriate adjustment to the confidence level, a two-sided interval can be correctly interpreted as a one-sided prediction bound. For the mechanical switches, a two-sided approximate 90% prediction interval calculated using the ML estimates of the parameters is [1.26, 3.38].
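Since the fitted model is log-normal, this naive plug-in interval is simply the 0.05 and 0.95 quantiles of the fitted distribution. A quick check against the Table 2 estimates:

```r
mu_hat    <- 0.723
sigma_hat <- 0.300

# Naive plug-in 90% prediction interval: 5th and 95th percentiles
naive_pi <- exp(mu_hat + qnorm(c(0.05, 0.95)) * sigma_hat)
round(naive_pi, 2)   # 1.26 3.38
```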
The method of finding the approximate prediction interval described in the previous section is a “naive” method because it simply plugs in the ML estimates of the parameters. It ignores the uncertainty in the estimated parameter values, which leads to the coverage probability (the probability that a prediction interval procedure gives an interval containing the future observation) being generally smaller than the nominal confidence level. When there is a lot of information in the data set, i.e., a large number of failures and little censoring, the naive plug-in method may suffice. If this is not the case, it is necessary to refine/calibrate the interval to ensure better coverage probability.
Escobar and Meeker (1999) (Escobar, L. A. and W. Q. Meeker (1999). Statistical prediction based on censored life data. Technometrics 41, 113–124) discuss different prediction interval calibration methods in their paper. Here, we use the method where calibration is done by simulating \(B\) realizations of the data (\(B\) is typically chosen between 10,000 and 50,000) and averaging the conditional coverage probabilities. A comprehensive discussion of this method is outside the scope of this blog article and I encourage you to read the paper or the textbook mentioned previously. A summary of the steps involved in this method is shown below:
Step 1: Choose a particular value of the one-sided confidence level \(1 - \alpha_{c}\). In our case study, the one-sided confidence level is 95%, so we can start with \(\alpha_{c} = 0.05\).
Step 2: Simulate a realization of the available data using fractional random weight bootstrap sampling (a note on the fractional random weight bootstrap method is included in the next section).
Step 3: Find the maximum likelihood estimates of the distribution parameters for the data simulated in Step 2.
Step 4: Find the one-sided upper \(100(1-\alpha_{c})\%\) prediction bound from the parameter estimates in Step 3.
Step 5: Find the coverage probability of the bound from Step 4 using the original ML estimates of the parameters in Table 2.
Step 6: Repeat Steps 2 through 5 \(B\) times. \(B = 10{,}000\) has been used in this case.
Step 7: Take the average of the 10,000 coverage probabilities from Step 5. If this value is less than our target confidence level of 95%, decrease the value of \(\alpha_{c}\) in Step 1 and repeat the whole procedure until the average in Step 7 is at or above our target confidence level.
The above procedure gives us a calibrated one-sided approximate 95% upper prediction bound. Repeat the same procedure to find the calibrated one-sided approximate 95% lower prediction bound by modifying Step 4 to calculate the lower bound instead of the upper bound. A sketch of this calibration in R is shown below.
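Here is a minimal sketch of the upper-bound calibration, assuming the `switches` data frame from the earlier setup sketch (names are mine). One simplification relative to the step-by-step description: instead of manually decreasing \(\alpha_{c}\) and re-running everything, this version reuses a single set of bootstrap fits and lets `uniroot` search for the \(\alpha_{c}\) whose average conditional coverage hits 95%.

```r
library(survival)

set.seed(123)
n <- nrow(switches)
B <- 10000

mu_hat    <- 0.723   # original ML estimates (Table 2)
sigma_hat <- 0.300

# Steps 2-3: B fractional-random-weight bootstrap ML fits
boot_est <- t(replicate(B, {
  w   <- rexp(n)     # fractional random weights (see next section)
  fit <- survreg(Surv(millions, status) ~ 1, data = switches,
                 weights = w, dist = "lognormal")
  c(mu = unname(coef(fit)), sigma = fit$scale)
}))

# Steps 4-5: each bootstrap fit gives an upper bound exp(mu* + z * sigma*);
# its conditional coverage Pr(T <= bound) is evaluated under the original
# ML estimates, since log(T) ~ Normal(mu_hat, sigma_hat)
avg_coverage <- function(alpha_c) {
  z <- qnorm(1 - alpha_c)
  mean(pnorm((boot_est[, "mu"] + z * boot_est[, "sigma"] - mu_hat) / sigma_hat))
}

# Steps 1, 6-7: find the alpha_c whose average coverage equals 95%
alpha_cal <- uniroot(function(a) avg_coverage(a) - 0.95, c(0.001, 0.2))$root
upper_cal <- exp(mu_hat + qnorm(1 - alpha_cal) * sigma_hat)
```

The lower bound is calibrated analogously, replacing the upper bound in Steps 4 and 5 with the lower bound \(\exp(\hat{\mu}^{*} - z\,\hat{\sigma}^{*})\) and its coverage \(\Pr(T \ge \text{bound})\).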
Combining the two one-sided 95% bounds gives a two-sided 90% interval. Table 3 below shows the calibrated interval along with the naive plug-in interval.
Table 3: Approximate 90% Prediction Intervals
| Naive 90% Prediction Interval | Calibrated 90% Prediction Interval |
|---|---|
| [1.26, 3.38] | [1.22, 3.49] |
Notice that the width of the calibrated interval is greater than the width of the naive interval. The calibrated approximate prediction interval is expected to have better coverage properties than the naive approximate prediction interval.
With a traditional bootstrap approach, we randomly sample data (from Table 1) with replacement. In a given bootstrap sample, some rows from Table 1 may appear more than once while others may not appear at all. Think of the number of times that a particular row appears in a bootstrap sample as the “weight” for that observation. These weights are non-negative integers and can take values between 0 and \(n\), where \(n\) is the number of data points in a bootstrap sample. When there is heavy censoring in the data, a bootstrap sample might consist of only censored observations, or very few failures. This leads to estimation problems when we try to estimate parameters using maximum likelihood. An alternative to these integer weights is “fractional weights”, where each row in Table 1 is assigned a random non-integer weight. These non-integer weights can be generated from a continuous distribution that has the same mean and standard deviation as the integer resampling weights. Xu et al. (2020) (Xu, L., C. Gotwalt, Y. Hong, C. B. King, and W. Q. Meeker (2020). Applications of the fractional random weight bootstrap. The American Statistician 74, 345–358) discuss applications of this method and show how to generate the random weights. Although we don't have heavy censoring in our case and would probably not have encountered any estimation problems, we have still used fractional random weight bootstrapping to be on the safer side.
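As a concrete illustration (the names and the choice of Exp(1) draws below are mine; Xu et al. discuss the general recipe for generating the weights), compare the usual integer resampling weights with a fractional alternative that matches their mean and, approximately, their standard deviation:

```r
set.seed(42)
n <- 40

# Traditional bootstrap: integer counts of how often each row is drawn.
# These are Multinomial(n, 1/n) counts with mean 1 and variance (n - 1)/n,
# and a row can receive weight 0 (i.e., drop out of the sample entirely).
w_int <- tabulate(sample.int(n, n, replace = TRUE), nbins = n)

# Fractional alternative: continuous Exp(1) draws, which also have mean 1
# and standard deviation 1, but are strictly positive, so every row (and
# hence every failure) retains some weight in the weighted ML fit.
w_frac <- rexp(n)

c(mean(w_int), sd(w_int))
c(mean(w_frac), sd(w_frac))
```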
In this blog post, we discussed prediction intervals and how to construct them, using a data set containing times to failure of 40 mechanical switches. We also touched on calibrating the interval to account for uncertainty in the estimates of the distribution parameters. Based on the analysis of the mechanical switches dataset, we can say with 90% confidence that the time to failure of a future mechanical switch can be expected to fall in the range [1.22, 3.49] millions of operations.
This was a simple application of prediction intervals where all the units started testing at time zero and only three observations were censored. The concepts here can be extended, using regression analysis, to find prediction intervals for individual pieces of equipment that have been operating for different durations and under different conditions that lead to different failure-time distributions.
Most of what I have learnt about time-to-event analysis is from the books Survival Analysis: Techniques for Censored and Truncated Data, Second Edition (John P. Klein and Melvin L. Moeschberger) and Statistical Methods for Reliability Data, Second Edition (William Q. Meeker, Luis A. Escobar, Francis G. Pascual).
All analyses were performed using R Statistical Software (v4.4.1; R Core Team 2024). The survival R package (v3.7.0; Therneau T 2024) was used extensively. Please refer to the next section for all the packages used, their versions, and the names of the package developers who deserve credit for building these amazing open source packages.
This article’s format is a style that Edward Tufte uses in his books and handouts. I have used the tufte library in R and am grateful to the authors of this package.
- Grolemund G, Wickham H (2011). "Dates and Times Made Easy with lubridate." _Journal of Statistical Software_, *40*(3), 1-25. <https://www.jstatsoft.org/v40/i03/>.
- Müller K, Wickham H (2023). _tibble: Simple Data Frames_. R package version 3.2.1, <https://CRAN.R-project.org/package=tibble>.
- R Core Team (2024). _R: A Language and Environment for Statistical Computing_. R Foundation for Statistical Computing, Vienna, Austria. <https://www.R-project.org/>.
- Therneau T (2024). _A Package for Survival Analysis in R_. R package version 3.7-0, <https://CRAN.R-project.org/package=survival>. Terry M. Therneau, Patricia M. Grambsch (2000). _Modeling Survival Data: Extending the Cox Model_. Springer, New York. ISBN 0-387-98784-3.
- Wickham H (2016). _ggplot2: Elegant Graphics for Data Analysis_. Springer-Verlag New York. ISBN 978-3-319-24277-4, <https://ggplot2.tidyverse.org>.
- Wickham H (2023). _forcats: Tools for Working with Categorical Variables (Factors)_. R package version 1.0.0, <https://CRAN.R-project.org/package=forcats>.
- Wickham H (2023). _stringr: Simple, Consistent Wrappers for Common String Operations_. R package version 1.5.1, <https://CRAN.R-project.org/package=stringr>.
- Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V, Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). "Welcome to the tidyverse." _Journal of Open Source Software_, *4*(43), 1686. <https://doi.org/10.21105/joss.01686>.
- Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A Grammar of Data Manipulation_. R package version 1.1.4, <https://CRAN.R-project.org/package=dplyr>.
- Wickham H, Henry L (2023). _purrr: Functional Programming Tools_. R package version 1.0.2, <https://CRAN.R-project.org/package=purrr>.
- Wickham H, Hester J, Bryan J (2024). _readr: Read Rectangular Text Data_. R package version 2.1.5, <https://CRAN.R-project.org/package=readr>.
- Wickham H, Vaughan D, Girlich M (2024). _tidyr: Tidy Messy Data_. R package version 1.3.1, <https://CRAN.R-project.org/package=tidyr>.
- Xie Y (2025). _knitr: A General-Purpose Package for Dynamic Report Generation in R_. R package version 1.50, <https://yihui.org/knitr/>. Xie Y (2015). _Dynamic Documents with R and knitr_, 2nd edition. Chapman and Hall/CRC, Boca Raton, Florida. ISBN 978-1498716963. Xie Y (2014). "knitr: A Comprehensive Tool for Reproducible Research in R." In Stodden V, Leisch F, Peng RD (eds.), _Implementing Reproducible Computational Research_. Chapman and Hall/CRC. ISBN 978-1466561595.
- Xie Y, Allaire J (2023). _tufte: Tufte's Styles for R Markdown Documents_. R package version 0.13, <https://CRAN.R-project.org/package=tufte>.
- Zhu H (2024). _kableExtra: Construct Complex Table with 'kable' and Pipe Syntax_. R package version 1.4.0, <https://CRAN.R-project.org/package=kableExtra>.
I hope you found this article informative! If you have any comments or suggestions, or if you find any errors and are keen to help correct them, please write to me at shishir909@gmail.com.