
Introduction
- A prediction interval gives a lower and an upper limit between which a future measurement taken from a population is expected to fall with a stated probability
- It is based on what has been observed in a sample taken from the given population
- Prediction intervals therefore describe where individual future measurements are expected to fall
- Confidence intervals instead describe the uncertainty in estimates of population parameters (e.g. the mean), which cannot be observed directly (the whole population cannot be included in a sample so as to calculate the true parameter)
- Therefore, under the assumption of a normal distribution of a variable in a population, a confidence interval is used to estimate the true mean or standard deviation, or at least the limits between which it may lie
- For a given confidence level, c%, if an experiment is repeated an infinite number of times, the computed intervals will contain the true population parameter c% of the time
- A prediction interval is simply concerned with the limits within which a future measurement of that variable will fall, and expresses this as a percentage probability
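For a normally distributed variable, the two intervals can be written side by side. The following is a sketch under the standard one-sample assumptions; the symbols \(n\) (sample size), \(\bar{x}\) (sample mean), \(s\) (sample standard deviation) and \(\alpha = 1 - c/100\) are introduced here for illustration and do not appear in the notes above:

\[
\text{CI: } \bar{x} \pm t_{n-1,\,1-\alpha/2}\,\frac{s}{\sqrt{n}}
\qquad
\text{PI: } \bar{x} \pm t_{n-1,\,1-\alpha/2}\, s\,\sqrt{1 + \frac{1}{n}}
\]

The width of the confidence interval shrinks towards zero as \(n\) grows, whereas the prediction interval stays wide, because a single new measurement still carries the full spread of the population.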
Example explanation
- A serum total cholesterol value is taken from \(500\) patients after treatment with a new cholesterol-lowering drug
- The sample mean is \(180.7\) mg/dL, with a standard deviation of \(19.4\)
- The \(95\)% confidence interval for the mean is \(179\) to \(182.4\)
- This means that if the experiment were repeated infinitely many times, with a new random sample taken from the population each time, the interval calculated in this way would contain the true population mean in \(95\)% of the cases
- The \(95\)% prediction interval for a new measurement is \(142.5\) to \(218.9\)
- This means that, with a probability of \(95\)%, a new measurement taken at random from the population would fall within these limits
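The repeated-sampling interpretation of the confidence interval can be checked by simulation. The sketch below assumes the population is normal with \(\mu = 180\) and \(\sigma = 20\); the seed, the number of repetitions `n_rep` and the variable names are illustrative choices, not part of the original example:

```r
# Sketch: empirical coverage of 95% confidence intervals for the mean.
# Assumed population values: mu = 180, sigma = 20; sample size n = 500.
set.seed(42)      # hypothetical seed, used only for reproducibility
mu    <- 180
sigma <- 20
n     <- 500
n_rep <- 2000     # a finite stand-in for "infinitely many" repetitions

covered <- replicate(n_rep, {
  x <- rnorm(n, mean = mu, sd = sigma)
  half_width <- qt(0.975, df = n - 1) * sd(x) / sqrt(n)
  # TRUE when this sample's interval contains the true mean
  abs(mean(x) - mu) <= half_width
})

coverage <- mean(covered)
coverage          # proportion of intervals that contain mu
```

The proportion should land close to 0.95, and it approaches that value more tightly as `n_rep` increases.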
The code
- Set the pseudo-random seed so that the results are reproducible
- Take \(500\) samples from a normal distribution with \(\mu = 180\) and \(\sigma = 20\)
- Save the dataset as a data.frame
set.seed(123)
df <- data.frame(Cholesterol = round(rnorm(500, mean = 180, sd = 20), digits = 0))
- Mean and standard deviation
mean(df$Cholesterol)
## [1] 180.694
sd(df$Cholesterol)
## [1] 19.44285
- Confidence interval
ci <- predict(lm(df$Cholesterol ~ 1),
interval = "confidence")
ci[1,]
## fit lwr upr
## 180.6940 178.9856 182.4024
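As a cross-check, the same confidence interval for the mean can be obtained in other ways in base R. This is a minimal sketch that recreates the data with the same seed; the names `fit`, `ci_confint` and `ci_ttest` are introduced here for illustration only:

```r
# Recreate the data so that the block is self-contained
set.seed(123)
df <- data.frame(Cholesterol = round(rnorm(500, mean = 180, sd = 20), digits = 0))

# Intercept-only model: the intercept is the sample mean
fit <- lm(Cholesterol ~ 1, data = df)

# Route 1: confidence interval for the intercept
ci_confint <- confint(fit, level = 0.95)

# Route 2: one-sample t-test, which reports the same interval
ci_ttest <- t.test(df$Cholesterol, conf.level = 0.95)$conf.int

# Both routes should agree up to floating-point error
all.equal(as.numeric(ci_confint), as.numeric(ci_ttest))
```

Both routes rely on the t distribution with \(n - 1\) degrees of freedom, which is why they should agree with the `predict()` result.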
- Prediction interval (R issues a warning because no new data are supplied, so the predictions on the current data refer to future responses)
pi <- predict(lm(df$Cholesterol ~ 1),
interval = "predict")
## Warning in predict.lm(lm(df$Cholesterol ~ 1), interval = "predict"): predictions on current data refer to _future_ responses
pi[1,]
## fit lwr upr
## 180.6940 142.4559 218.9321
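The prediction interval can also be computed by hand, which makes it clear where its extra width comes from. The sketch below assumes the usual one-sample formula for normal data, \(\bar{x} \pm t \cdot s \sqrt{1 + 1/n}\); the names `pi_manual`, `pi_lm`, `new_x` and `share_inside` are illustrative, and the name `pi` is avoided so that R's built-in constant is not masked:

```r
# Recreate the data so that the block is self-contained
set.seed(123)
df <- data.frame(Cholesterol = round(rnorm(500, mean = 180, sd = 20), digits = 0))

x      <- df$Cholesterol
n      <- length(x)
t_crit <- qt(0.975, df = n - 1)   # two-sided 95% critical value

# Manual 95% prediction interval for a single new measurement
pi_manual <- mean(x) + c(-1, 1) * t_crit * sd(x) * sqrt(1 + 1 / n)

# Same interval from predict(); the expected warning is suppressed here
pi_lm <- suppressWarnings(
  predict(lm(Cholesterol ~ 1, data = df), interval = "prediction")
)
all.equal(pi_manual, unname(pi_lm[1, c("lwr", "upr")]))

# Empirical check: share of fresh draws from the assumed population
# (mu = 180, sigma = 20) that fall inside the interval
new_x <- rnorm(10000, mean = 180, sd = 20)
share_inside <- mean(new_x >= pi_manual[1] & new_x <= pi_manual[2])
share_inside
```

The share of new draws inside the interval should be close to 0.95, though not exactly, because the interval is built from sample estimates and not from the true \(\mu\) and \(\sigma\).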