Confidence Intervals

2022-09-15

What is a confidence interval?

A confidence level is the probability that an interval built by a given procedure will contain the true value being estimated, such as the mean of a dependent variable at a given value of an independent measurement.
For example, if the height of a tree is being estimated from its girth, one can calculate the height confidence interval from existing data at a 95% confidence level.
A confidence interval is the range of values within which the quantity being estimated is expected to lie at a given confidence level. The higher the confidence level, the wider the confidence interval. This can be seen in the following confidence bands, drawn around a linear regression fit.

Creating a linear model between perimeter and area of a petroleum core

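library(plotly)   # plot_ly(), add_lines(), layout() and the %>% pipe
library(ggplot2)  # ggplot(), geom_point(), geom_smooth()

# Fit area as a linear function of perimeter (rock is a built-in R data set)
mod <- lm(area ~ peri, data = rock)
x <- rock$peri; y <- rock$area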

xax <- list(
  title = "Perimeter in pixels",
  titlefont = list(family="Modern Computer Roman")
)
yax <- list(
  title = "Area of pore space in pixels squared",
  titlefont = list(family="Modern Computer Roman")
)
fig <- plot_ly(x=x, y=y, type="scatter", mode="markers", name="data",
               width=800, height=430) %>% 
              add_lines(x = x, y= fitted(mod), name="fitted") %>%
              layout(xaxis = xax, yaxis = yax)

Using plotly to plot a relationship between two variables

Using ggplot to construct a confidence interval band

g1 <- ggplot(data = rock, aes(x = peri, y = area)) + geom_point()
g <- g1 + geom_smooth(formula = y ~ x, method="lm", level=0.95) + 
  theme_bw() +
  xlab("Perimeter in pixels") +
  ylab("Area in pixels squared")

Linear regression with a 95% confidence band

Linear regression with a 99.9% confidence band
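The widening of the band can also be checked numerically. The sketch below is not part of the original analysis: it refits the same model and uses base R's predict() with interval = "confidence" to compute the bounds at both levels. The grid of perimeter values and the variable names (grid, band_95, band_999) are assumptions made for illustration.

```r
# Fit the same linear model as above (rock is a built-in R data set)
mod <- lm(area ~ peri, data = rock)

# Hypothetical grid of perimeter values spanning the observed range
grid <- data.frame(peri = seq(min(rock$peri), max(rock$peri), length.out = 50))

# predict() returns a matrix with columns fit, lwr and upr
band_95  <- predict(mod, newdata = grid, interval = "confidence", level = 0.95)
band_999 <- predict(mod, newdata = grid, interval = "confidence", level = 0.999)

# Width of each band at every grid point
width_95  <- band_95[, "upr"]  - band_95[, "lwr"]
width_999 <- band_999[, "upr"] - band_999[, "lwr"]

# TRUE if the 99.9% band is wider than the 95% band at every grid point
all(width_999 > width_95)

# Ratio of the two widths across the grid
summary(width_999 / width_95)
```

Both bands are centred on the same fitted line; only the multiplier applied to the standard error of the fit changes with the confidence level.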

How to calculate the standard deviation

-Before the confidence interval can be calculated, one must determine the standard deviation of the data points.

\({S_{x}=\sqrt{\frac{\sum_{i=1}^{n}(x_i-\overline{x})^2}{n-1}}}\)

-where \({S_{x}}\) is the sample standard deviation, \({n}\) is the number of data points, \({x_i}\) is each data value, and \({\overline{x}}\) is the average of the data points.

-The standard deviation measures how far the data points typically lie from the average value; it is the square root of the variance.
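As a minimal sketch, the formula above can be evaluated by hand and compared with R's built-in sd(). The choice of the perimeter column of rock as the data and the variable names (x_bar, s_manual, s_builtin) are assumptions made for illustration.

```r
x <- rock$peri      # example data: perimeter values from the rock data set
n <- length(x)      # number of data points
x_bar <- mean(x)    # average of the data points

# Sample standard deviation: summed squared deviations from the mean,
# divided by n - 1, then square-rooted
s_manual <- sqrt(sum((x - x_bar)^2) / (n - 1))

# Built-in equivalent; sd() also uses the n - 1 denominator
s_builtin <- sd(x)

# TRUE if the two values agree to numerical tolerance
all.equal(s_manual, s_builtin)
```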

How to calculate the confidence interval

-Then determine the confidence interval using the corresponding “t” value (for a small number of measurements) or “z” score (for a large number of measurements). Either value expresses, in units of the standard error, how far from the sample mean the bounds of the interval lie for the chosen confidence level, assuming a Gaussian-like distribution.

-The t value or z score can be looked up in published tables, or computed in R with qt() and qnorm().

\({CI = \overline{x}\pm z{s\over{\sqrt{n}}}}\)

-where \(CI\) is either the upper or lower bound of the confidence interval, \(\overline{x}\) is the sample mean, \(z\) is the “t” or “z” value, \(s\) is the sample standard deviation calculated earlier (written \(S_x\) above), and \(n\) is the sample size.
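The calculation can be sketched in R as follows. This is an illustration under stated assumptions rather than the original analysis: the perimeter column of rock stands in for the measurements, the 95% level is chosen arbitrarily, and the variable names (t_crit, z_crit, ci_t, ci_z) are hypothetical. The critical values come from qt() and qnorm() instead of a table.

```r
x <- rock$peri          # example measurements
n <- length(x)          # sample size
x_bar <- mean(x)        # sample mean
s <- sd(x)              # sample standard deviation

level <- 0.95           # assumed confidence level
alpha <- 1 - level

# Two-sided critical values: t uses n - 1 degrees of freedom, z uses the normal
t_crit <- qt(1 - alpha / 2, df = n - 1)
z_crit <- qnorm(1 - alpha / 2)

# Standard error of the mean
se <- s / sqrt(n)

# Lower and upper bounds of the interval for the mean
ci_t <- c(lower = x_bar - t_crit * se, upper = x_bar + t_crit * se)
ci_z <- c(lower = x_bar - z_crit * se, upper = x_bar + z_crit * se)

ci_t
ci_z

# Cross-check: t.test() reports the t-based interval for the mean
as.numeric(t.test(x, conf.level = level)$conf.int)
```

The t-based interval is slightly wider than the z-based one, and the two approach each other as the sample size grows.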