Confidence Intervals

Confidence intervals present a range of plausible values for a population parameter. In the frequentist paradigm the population mean (\(\mu\)) is assumed to be fixed and unknown. We collect a sample of data and estimate \(\mu\) using the sample mean \(\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\).

Generate the Data

In this simulation, I will suppress the value of the population mean mu from the code to maintain the integrity of the simulation, so mu must already be defined before the code below is run.

pop_n<-10000
data<-rnorm(pop_n, mean=mu, sd=1)

pop_mean<-mean(data)
pop_mean
## [1] 1.992503
pop_sd<-sd(data)
pop_sd
## [1] 1.008436

Generate the Sample

### ONE TIME
samp_n<-30
samp<-sample(data, size=samp_n)

samp_mean<-mean(samp)
samp_mean
## [1] 2.197769
samp_sd<-sd(samp)
samp_sd
## [1] 0.9519552

Recall: Central Limit Theorem

From the Central Limit Theorem, we learned that …

No matter the underlying distribution of the data (provided it has a finite variance), the distribution of sample means is approximately Normal with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\), for a sufficiently large sample size \(n\).

\[\bar{x}\sim Normal(\mu, \frac{\sigma}{\sqrt{n}})\]
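As a quick check of this statement, the sketch below draws repeated samples from a skewed (Exponential) distribution and compares the spread of the sample means with \(\frac{\sigma}{\sqrt{n}}\). The choice of distribution, the seed, and all object names (`clt_n`, `clt_reps`, `clt_means`) are assumptions made for illustration; they are not part of the simulation in this document.

```r
set.seed(1)                      # assumed seed, only for reproducibility
clt_n    <- 30                   # size of each sample
clt_reps <- 5000                 # number of repeated samples
clt_rate <- 1                    # Exponential(1) has mean 1 and sd 1

# sample mean of each repeated sample
clt_means <- replicate(clt_reps, mean(rexp(clt_n, rate = clt_rate)))

mean(clt_means)                  # should be close to the true mean, 1
sd(clt_means)                    # should be close to sigma / sqrt(n)
1 / sqrt(clt_n)                  # theoretical sd of the sample mean

# share of sample means within 1.96 standard errors of the true mean
mean(abs(clt_means - 1) < 1.96 / sqrt(clt_n))
```

Even though the Exponential distribution is strongly right-skewed, the last line should come out close to the Normal-theory value of 0.95.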

Z-Methods

The Z-confidence interval for the mean is given by \[\bar{x} \pm z^* \times \frac{\sigma}{\sqrt{n}}\]

  • \(\bar{x}\) : Point Estimate
  • \(z^*\) : Critical value from the Standard Normal (Z) distribution
  • \(\frac{\sigma}{\sqrt{n}}\): Standard error
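The formula can be evaluated directly. The sketch below plugs in the sample mean and population standard deviation printed earlier in this document; the numbers are hard-coded so the block runs on its own, and the object names (`xbar`, `sigma`, `z_star`, `z_se`, `z_ci`) are introduced here for illustration only.

```r
xbar  <- 2.197769   # sample mean reported above (samp_mean)
sigma <- 1.008436   # population sd reported above (pop_sd)
n     <- 30         # sample size (samp_n)

z_star <- qnorm(0.975)        # critical value for 95% confidence
z_se   <- sigma / sqrt(n)     # standard error of the mean
z_ci   <- c(lower = xbar - z_star * z_se,
            upper = xbar + z_star * z_se)
z_ci
```

With these inputs the interval is roughly (1.84, 2.56), which contains the population mean of 1.992503 computed above.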

Finding the Z-Critical Values

Critical values from the Standard Normal distribution can be found using the qnorm function. When constructing a confidence interval we set a confidence level, which is given by \(100\times (1-\alpha) \%\). This expression uses the significance level \(\alpha\). Since the most common significance level is \(\alpha=0.05\), the most common confidence level is \(95\%\).

The confidence level describes the amount of area in the middle of the distribution, around the point estimate. Since the Normal distribution is symmetric, the remaining area \(\alpha\) is split equally between the two tails, so the critical value for the upper bound of the confidence interval is the \(100\times (1-\frac{\alpha}{2}) \%\) quantile. In the case of \(\alpha=0.05\) this is the \(97.5 \%\) quantile.
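The symmetry argument can be checked numerically. This is a minimal sketch for \(\alpha=0.05\) only; the names `upper_q` and `lower_q` are introduced here for illustration.

```r
alpha <- 0.05                       # significance level used in the text

upper_q <- qnorm(1 - alpha / 2)     # 97.5% quantile
lower_q <- qnorm(alpha / 2)         # 2.5% quantile, the mirror image

lower_q + upper_q                   # symmetric about zero, so numerically ~0
pnorm(upper_q) - pnorm(lower_q)     # area in the middle, equal to 1 - alpha
```

Because the two quantiles differ only in sign, a single critical value \(z^*\) is enough to build both ends of the interval.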

# CRITICAL VALUE
qnorm(0.975)
## [1] 1.959964

Try it out!

What would the critical value be for the following significance levels?

  • \(\alpha=0.2\)
  • \(\alpha=0.1\)
  • \(\alpha=0.01\)
  • \(\alpha=0.001\)

What do you notice?

T-Methods

In reality we rarely know what the population standard deviation is, so we estimate it with the sample standard deviation \(s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}\).

The T-confidence interval for the mean is given by \[\bar{x} \pm t_{df=n-1}^* \times \frac{s}{\sqrt{n}}\]

  • \(\bar{x}\) : Point Estimate
  • \(t_{df=n-1}^*\) : Critical value from the T distribution with \(n-1\) degrees of freedom.
  • \(\frac{s}{\sqrt{n}}\): Estimated standard error
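To see the T formula in action, the sketch below computes the interval by hand and compares it with the interval returned by R's built-in `t.test` function. The sample here is freshly simulated with a hypothetical mean and standard deviation (it is not the hidden mu or the data used elsewhere in this document), and the seed and object names are likewise assumptions.

```r
set.seed(42)                              # assumed seed for reproducibility
x <- rnorm(30, mean = 5, sd = 2)          # hypothetical sample, not the data above

n      <- length(x)
t_star <- qt(0.975, df = n - 1)           # critical value, n - 1 degrees of freedom
t_se   <- sd(x) / sqrt(n)                 # estimated standard error
t_ci   <- c(mean(x) - t_star * t_se,      # lower bound
            mean(x) + t_star * t_se)      # upper bound
t_ci

# built-in one-sample T interval at the same confidence level
t.test(x, conf.level = 0.95)$conf.int
```

The two results should agree, since `t.test` uses the same point estimate, critical value, and standard error for a one-sample interval.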

Finding the T-Critical Values

T-critical values are similar to Z-critical values, but you need to define the degrees of freedom. The T-confidence interval for the mean has \(n-1\) degrees of freedom, where \(n\) is the sample size.

# CRITICAL VALUE
qt(0.975, df=30-1)
## [1] 2.04523
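To compare the two kinds of critical values, the sketch below tabulates the 97.5% T quantile for several degrees of freedom next to the Z quantile. The particular degrees of freedom in `dfs` are illustrative choices.

```r
dfs     <- c(2, 5, 10, 29, 100, 1000)   # illustrative degrees of freedom
t_crits <- qt(0.975, df = dfs)          # 97.5% T quantiles
z_crit  <- qnorm(0.975)                 # 97.5% Standard Normal quantile

data.frame(df = dfs, t_crit = t_crits, z_crit = z_crit)
```

The T critical values are always larger than the Z critical value and shrink toward it as the degrees of freedom grow, so T intervals are wider than Z intervals for small samples.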

Simulation

library(tidyverse)

## SIMULATION

nsim<-1000 # repeat the process 1000 times
sim_mean<-numeric(nsim) # preallocate storage for the sample means
sim_sd<-numeric(nsim) # preallocate storage for the sample sds
for(i in seq_len(nsim)){
  this_samp<-sample(data, size=samp_n) # sample size
  sim_mean[i]<-mean(this_samp) # sample mean for this sample
  sim_sd[i]<-sd(this_samp) # sample sd for this sample
}

sim_df<-data.frame(sim=1:nsim,
                   sim_mean, 
                   sim_sd)%>%
  mutate(pop_sd=pop_sd)%>% # true population sd needed for Z
  mutate(z_crit=qnorm(0.975),  # critical values
         t_crit=qt(0.975, df=samp_n-1))%>%
  mutate(z_low=sim_mean-z_crit*pop_sd/sqrt(samp_n), # z ci lower bound
         z_up=sim_mean+z_crit*pop_sd/sqrt(samp_n), # z ci upper bound
         t_low=sim_mean-t_crit*sim_sd/sqrt(samp_n), # t ci lower bound
         t_up=sim_mean+t_crit*sim_sd/sqrt(samp_n))%>% # t ci upper bound
  mutate(z_cov =(mu>z_low & mu<z_up), # coverage
         t_cov =(mu>t_low & mu<t_up))

# COVERAGE
mean(sim_df$z_cov)
## [1] 0.946
mean(sim_df$t_cov)
## [1] 0.951
ggplot(sim_df, aes(y=sim, x=sim_mean, color=z_cov))+
  geom_point()+
  geom_errorbar(alpha=.3, aes(xmin=z_low, xmax=z_up), size=.5, width=.1)+
  ggtitle("Z-Methods Coverage Simulation \n Sigma (Known) = 1")+
  theme_bw()

ggplot(sim_df, aes(y=sim, x=sim_mean, color=t_cov))+
  geom_point()+
  geom_errorbar(alpha=.3, aes(xmin=t_low, xmax=t_up), size=.5, width=.1)+
  ggtitle("T-Methods Coverage Simulation \n Sigma (Unknown)")+
  theme_bw()

What is the relationship between the coverage and the confidence level?
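One way to explore this question is to repeat the coverage simulation at several confidence levels. The following is a self-contained sketch, not the simulation above: it draws each sample directly from a Normal distribution rather than from a finite population, and the function name `t_coverage`, its arguments, their default values, and the seed are all hypothetical.

```r
set.seed(2024)                      # assumed seed for reproducibility

# Estimated coverage of the T interval at a given confidence level.
# true_mu, true_sd, n and nsim are hypothetical settings for this sketch.
t_coverage <- function(conf_level, true_mu = 0, true_sd = 1, n = 30, nsim = 2000) {
  t_crit <- qt(1 - (1 - conf_level) / 2, df = n - 1)
  covered <- replicate(nsim, {
    x <- rnorm(n, mean = true_mu, sd = true_sd)
    half_width <- t_crit * sd(x) / sqrt(n)
    abs(mean(x) - true_mu) < half_width   # TRUE when the interval contains true_mu
  })
  mean(covered)                           # proportion of intervals that cover
}

conf_levels <- c(0.80, 0.90, 0.95, 0.99)
coverage    <- sapply(conf_levels, t_coverage)
data.frame(conf_level = conf_levels, coverage = coverage)
```

Comparing the two columns of the resulting table gives a direct way to answer the question.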