2024-03-21

Confidence Intervals

What are confidence intervals, and what role do they play in statistics?

  • A confidence interval is a range of values described by a lower and an upper bound. It is often reported as a point estimate plus or minus a margin of error (half the interval's width), such as ‘+/-3%’. Confidence intervals let us estimate unknown population parameters, like the population mean, from sample data.

  • Each interval comes with a confidence level. A 95% confidence level means that if the sampling process were repeated many times, about 95% of the resulting intervals would contain the true parameter value.

  • Confidence intervals are useful at any sample size, but they are especially informative for smaller samples: less information is available, so a point estimate alone hides more uncertainty, and the interval is correspondingly wider.
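The repeated-sampling reading of the confidence level can be checked with a small simulation. This is a minimal sketch, not part of the original example: the population (normal, mean 50, standard deviation 10), the sample size, and the number of repetitions are all assumed values chosen for illustration.

```r
# Simulate the "repeated sampling" interpretation of a 95% confidence level.
# Assumed (hypothetical) population: Normal with mean 50 and sd 10.
set.seed(42)

true_mean <- 50
true_sd   <- 10
n         <- 100            # observations per sample
n_sims    <- 5000           # number of repeated samples
z         <- qnorm(0.975)   # critical value for 95% confidence

covers <- replicate(n_sims, {
  x     <- rnorm(n, mean = true_mean, sd = true_sd)
  se    <- sd(x) / sqrt(n)          # standard error of the mean
  lower <- mean(x) - z * se
  upper <- mean(x) + z * se
  lower <= true_mean && true_mean <= upper   # does this interval cover the truth?
})

coverage <- mean(covers)   # share of intervals that contain the true mean
print(coverage)
```

The printed share should land close to 0.95. It will not be exactly 0.95, because the number of simulated samples is finite.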

How do you calculate Confidence Intervals?

\[ CI = \bar{x} \pm z \left( \frac{s}{\sqrt{n}} \right) \]

Where:

  • \(CI\) = Confidence Interval

  • \(\bar{x}\) = Sample Mean

  • \(z\) = Critical value (z-score) for the chosen confidence level, e.g. 1.96 for 95%

  • \(s\) = Sample Standard Deviation

  • \(n\) = Sample Size
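As a worked instance of the formula, the sketch below computes a 95% interval for the mean weight of the control group in R's built-in “PlantGrowth” dataset. The choice of the control group and the variable names are illustrative assumptions. With only 10 observations, the z-based interval is an approximation.

```r
# 95% confidence interval for the mean weight of the "ctrl" group,
# following CI = x_bar +/- z * (s / sqrt(n)).
data(PlantGrowth)

ctrl  <- PlantGrowth$weight[PlantGrowth$group == "ctrl"]

x_bar <- mean(ctrl)       # sample mean
s     <- sd(ctrl)         # sample standard deviation
n     <- length(ctrl)     # sample size
z     <- qnorm(0.975)     # critical value for 95% confidence (about 1.96)

margin <- z * s / sqrt(n) # margin of error (half the interval width)
ci     <- c(lower = x_bar - margin, upper = x_bar + margin)

print(ci)
```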

Confidence Interval ggplot with “PlantGrowth” dataset

Previous Example R Code

library(ggplot2)
library(dplyr)
data(PlantGrowth)

# Per-group mean and standard error of the mean (10 plants per group)
summary_data <- PlantGrowth %>%
  group_by(group) %>%
  summarise(mean = mean(weight),
            se = sd(weight) / sqrt(n()))

# Plot group means with 95% error bars: mean +/- 1.96 * se.
# 1.96 is the normal (z) critical value, an approximation at this sample size.
ggplot(summary_data, aes(x = group, y = mean, group = 1)) +
  geom_line() +    # group = 1 lets the line connect the categorical groups
  geom_point() +
  geom_errorbar(aes(ymin = mean - 1.96 * se, ymax = mean + 1.96 * se), width = 0.1) +
  labs(x = "Treatment Type", y = "Weight") +
  theme_minimal() +
  theme(
    axis.title = element_text(size = 16, face = "bold", margin = margin(t = 20)),
    axis.text = element_text(size = 14, face = "bold"),
    plot.title = element_text(size = 20, face = "bold", hjust = 0.5),
    legend.title = element_text(size = 16),
    legend.text = element_text(size = 14)
  )
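The error bars in the plot use the z critical value 1.96, although each “PlantGrowth” group has only 10 observations. As a rough comparison, the sketch below recomputes the interval bounds with a t critical value on n − 1 degrees of freedom (the t-distribution is covered in the next section). It uses base R only, and the column names are illustrative assumptions rather than part of the original example.

```r
# Compare z-based and t-based 95% intervals for each PlantGrowth group.
data(PlantGrowth)

n_per_group <- tapply(PlantGrowth$weight, PlantGrowth$group, length)
means       <- tapply(PlantGrowth$weight, PlantGrowth$group, mean)
ses         <- tapply(PlantGrowth$weight, PlantGrowth$group, sd) / sqrt(n_per_group)

z_crit <- qnorm(0.975)                      # normal critical value
t_crit <- qt(0.975, df = n_per_group - 1)   # t critical value per group

ci_table <- data.frame(
  group   = names(means),
  mean    = as.numeric(means),
  z_lower = as.numeric(means - z_crit * ses),
  z_upper = as.numeric(means + z_crit * ses),
  t_lower = as.numeric(means - t_crit * ses),
  t_upper = as.numeric(means + t_crit * ses)
)

print(ci_table)
```

The t-based bounds are wider in every group, which reflects the extra uncertainty from estimating the standard deviation from a small sample.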

Plotly graph with “PlantGrowth” dataset

T-Distribution

What is the T-Distribution, and what is its role in statistics and confidence intervals?

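  • The T-Distribution is a probability distribution that, like the normal distribution, has a symmetric bell-curve shape, but with heavier tails. The heavier tails account for the extra uncertainty introduced by estimating the population standard deviation from the sample. The T-Distribution is characterized by its degrees of freedom, which depend on the sample size (for a single sample mean, n − 1). As the degrees of freedom increase, it approaches the normal distribution.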

  • The T-Distribution is particularly useful for smaller sample sizes (a common rule of thumb is n < 30) and when the population standard deviation (sigma) is unknown.

  • In practice, the T-Distribution is used to obtain critical values and confidence intervals for small-sample data, and to define rejection regions in hypothesis testing.
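To see how the heavier tails affect critical values, the sketch below compares the 95% two-sided critical value of the normal distribution with t critical values at several degrees of freedom. The particular degrees of freedom are arbitrary illustrative choices.

```r
# Compare two-sided 95% critical values: normal (z) vs. t at various df.
conf_level <- 0.95
alpha      <- 1 - conf_level

df     <- c(4, 9, 29, 99, 999)          # illustrative degrees of freedom
z_crit <- qnorm(1 - alpha / 2)          # normal critical value
t_crit <- qt(1 - alpha / 2, df = df)    # t critical values, one per df

comparison <- data.frame(
  df     = df,
  t_crit = round(t_crit, 3),
  z_crit = round(z_crit, 3)
)

print(comparison)
```

The t critical values are always larger than the z value and shrink toward it as the degrees of freedom grow, so t-based intervals are wider for small samples.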

How do you calculate the t-value?

\[ t = \frac{\bar{x} - \mu}{\frac{s}{\sqrt{n}}} \]

Where:

  • \(t\) = t-value

  • \(\bar{x}\) = Sample Mean

  • \(\mu\) = Population Mean (the hypothesized value being tested)

  • \(s\) = Sample Standard Deviation

  • \(n\) = Sample Size
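The formula can be applied directly in R and checked against the built-in `t.test()` function. This is a minimal sketch: the hypothesized mean `mu0 = 5` is an assumed value chosen purely for illustration, and the control group of “PlantGrowth” is used as the sample.

```r
# One-sample t-value for the "ctrl" group, computed by hand and via t.test().
data(PlantGrowth)

ctrl <- PlantGrowth$weight[PlantGrowth$group == "ctrl"]
mu0  <- 5                  # hypothetical population mean (assumption)

x_bar <- mean(ctrl)
s     <- sd(ctrl)
n     <- length(ctrl)

t_value <- (x_bar - mu0) / (s / sqrt(n))          # t = (x_bar - mu) / (s / sqrt(n))
p_value <- 2 * pt(-abs(t_value), df = n - 1)      # two-sided p-value, n - 1 df

builtin <- t.test(ctrl, mu = mu0)                 # same test using base R

print(c(manual = t_value, builtin = unname(builtin$statistic)))
print(c(manual = p_value, builtin = builtin$p.value))
```

The manual and built-in results should agree, which confirms that `t.test()` applies this formula with n − 1 degrees of freedom in the one-sample case.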

T-Distribution ggplot with “PlantGrowth” dataset