Confidence intervals present a range of reasonable values for a population parameter. In the frequentist paradigm the population mean (\(\mu\)) is assumed to be fixed and unknown. We collect a sample of data and estimate \(\mu\) using the sample mean \(\bar{x}=\frac{1}{n}\sum_{i=1}^n x_i\).
In this simulation, I will suppress the population mean mu from the code to maintain the integrity of the simulation.
pop_n<-10000 # population size
data<-rnorm(pop_n, mean=mu, sd=1) # mu is defined beforehand but deliberately not shown
pop_mean<-mean(data)
pop_mean
## [1] 1.992503
pop_sd<-sd(data)
pop_sd
## [1] 1.008436
### ONE TIME
samp_n<-30
samp<-sample(data, size=samp_n)
samp_mean<-mean(samp)
samp_mean
## [1] 2.197769
samp_sd<-sd(samp)
samp_sd
## [1] 0.9519552
From the Central Limit Theorem, we learned that …
No matter the underlying distribution of the data (provided it has finite variance), the distribution of sample means is approximately Normal with mean \(\mu\) and standard deviation \(\frac{\sigma}{\sqrt{n}}\), for a sufficiently large sample size \(n\).
\[\bar{x}\sim Normal(\mu, \frac{\sigma}{\sqrt{n}})\]
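The claim about the spread of the sample mean can be checked with a quick simulation. The sketch below is only illustrative: the Exponential population, the seed, the number of repetitions, and the names `n_clt`, `reps`, `sigma_clt`, and `xbar_clt` are all assumptions chosen for this example and do not appear in the original code.

```r
# Sketch: compare the empirical sd of sample means with sigma / sqrt(n).
# Population, seed, and sizes are illustrative assumptions.
set.seed(1)
n_clt     <- 30    # sample size
reps      <- 5000  # number of repeated samples
sigma_clt <- 1     # sd of an Exponential(rate = 1) population (its mean is also 1)

# draw `reps` samples from a skewed population and keep each sample mean
xbar_clt <- replicate(reps, mean(rexp(n_clt, rate = 1)))

mean(xbar_clt)            # should be close to the population mean, 1
sd(xbar_clt)              # empirical sd of the sample means
sigma_clt / sqrt(n_clt)   # sd of the sample mean predicted by the theorem
```

Even though the Exponential distribution is strongly skewed, the two standard deviations should agree closely, and a histogram of `xbar_clt` (for example `hist(xbar_clt)`) should look roughly bell-shaped.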
The Z-confidence interval for the mean is given by \[\bar{x} \pm z^* \times \frac{\sigma}{\sqrt{n}}\]
Critical values come from the Standard Normal distribution and can be computed using the qnorm function. When constructing a confidence interval we set a confidence level, which is given by the expression \(100\times (1-\alpha) \%\). This expression uses the significance level \(\alpha\). Since the most common significance level is \(\alpha=0.05\), the most common confidence level is \(95\%\).
The confidence level describes the amount of area in the middle of the distribution, around the point estimate. Since the Normal distribution is symmetric, the remaining area \(\alpha\) is split equally between the two tails, so the upper critical value is the \(100\times (1-\frac{\alpha}{2}) \%\) quantile. In the case of \(\alpha=0.05\) this is the \(97.5 \%\) quantile.
# CRITICAL VALUE
qnorm(0.975)
## [1] 1.959964
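To tie the critical value back to the Z-interval formula, here is a minimal sketch that computes the interval for the single sample drawn earlier. It hard-codes the sample mean and population standard deviation printed above. The names `xbar_one`, `sigma_known`, `n_one`, `alpha`, and `z_ci` are illustrative assumptions, chosen so they do not overwrite objects from the original code.

```r
# Values copied from the output printed earlier in this document
xbar_one    <- 2.197769  # mean of the one-time sample
sigma_known <- 1.008436  # population sd, treated as known for the Z method
n_one       <- 30        # sample size

alpha    <- 0.05                    # significance level
z_crit_1 <- qnorm(1 - alpha / 2)    # upper critical value (97.5% quantile)
margin_z <- z_crit_1 * sigma_known / sqrt(n_one)  # margin of error

z_ci <- c(lower = xbar_one - margin_z, upper = xbar_one + margin_z)
z_ci
```

The resulting interval runs from roughly 1.84 to 2.56, so this particular sample produces an interval that contains the population mean of 1.992503 computed above. Changing `alpha` changes the critical value and therefore the width of the interval.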
What would the critical value be for the following significance levels?
What do you notice?
In reality we rarely know what the population standard deviation is, so we estimate it with the sample standard deviation \(s=\sqrt{\frac{1}{n-1}\sum_{i=1}^n (x_i-\bar{x})^2}\).
The T-confidence interval for the mean is given by \[\bar{x} \pm t_{df=n-1}^* \times \frac{s}{\sqrt{n}}\]
T-critical values are similar to Z-critical values, but you also need to specify the degrees of freedom. The T-confidence interval for the mean has \(n-1\) degrees of freedom, where \(n\) is the sample size.
# CRITICAL VALUE
qt(0.975, df=30-1)
## [1] 2.04523
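As a worked example of the T-interval formula, the sketch below reuses the sample mean and sample standard deviation printed for the single sample above. The names `xbar_t`, `s_one`, `n_t`, `t_ci`, and `df_grid` are illustrative assumptions, and the degrees of freedom in `df_grid` are arbitrary values picked to show the trend.

```r
# Values copied from the output printed earlier in this document
xbar_t <- 2.197769   # mean of the one-time sample
s_one  <- 0.9519552  # sd of the one-time sample
n_t    <- 30         # sample size

t_crit_1 <- qt(0.975, df = n_t - 1)          # t critical value with n - 1 df
margin_t <- t_crit_1 * s_one / sqrt(n_t)     # margin of error

t_ci <- c(lower = xbar_t - margin_t, upper = xbar_t + margin_t)
t_ci

# t critical values shrink toward the z critical value as df grows
df_grid <- c(5, 29, 100, 1000)
qt(0.975, df = df_grid)
qnorm(0.975)
```

The t critical value is always a little larger than the z critical value, which widens the interval to account for estimating \(\sigma\) with \(s\). If the raw sample is still available in `samp`, `t.test(samp)$conf.int` should report the same 95% interval.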
library(tidyverse)
## SIMULATION
nsim<-1000 # repeat the process 1000 times
sim_mean<-numeric(nsim) # preallocate storage for the sample means
sim_sd<-numeric(nsim) # preallocate storage for the sample sds
for(i in seq_len(nsim)){
  this_samp<-sample(data, size=samp_n) # draw a sample of size samp_n
  sim_mean[i]<-mean(this_samp) # sample mean for this sample
  sim_sd[i]<-sd(this_samp) # sample sd for this sample
}
sim_df<-data.frame(sim=1:nsim,
                   sim_mean,
                   sim_sd)%>%
  mutate(pop_sd=pop_sd)%>% # true population sd needed for Z
  mutate(z_crit=qnorm(0.975), # critical values
         t_crit=qt(0.975, df=samp_n-1))%>%
  mutate(z_low=sim_mean-z_crit*pop_sd/sqrt(samp_n), # z ci lower bound
         z_up=sim_mean+z_crit*pop_sd/sqrt(samp_n), # z ci upper bound
         t_low=sim_mean-t_crit*sim_sd/sqrt(samp_n), # t ci lower bound
         t_up=sim_mean+t_crit*sim_sd/sqrt(samp_n))%>% # t ci upper bound
  mutate(z_cov =(mu>z_low & mu<z_up), # coverage: does the interval contain mu?
         t_cov =(mu>t_low & mu<t_up))
# COVERAGE
mean(sim_df$z_cov)
## [1] 0.946
mean(sim_df$t_cov)
## [1] 0.951
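Neither coverage estimate is exactly 0.95, which is expected because each is a proportion computed from a finite number of simulated intervals. A rough way to judge how much wobble to expect is the Monte Carlo standard error of a proportion. This is a minimal sketch: the names `n_rep`, `nominal`, `mc_se`, and `mc_band` are illustrative assumptions, and the normal approximation to the binomial is an added simplification.

```r
n_rep   <- 1000  # number of simulated intervals (same as nsim above)
nominal <- 0.95  # nominal coverage of a 95% interval

# standard error of an estimated proportion when the true value is `nominal`
mc_se <- sqrt(nominal * (1 - nominal) / n_rep)
mc_se

# approximate range in which a simulated coverage should usually fall
mc_band <- nominal + c(-1, 1) * qnorm(0.975) * mc_se
mc_band
```

With 1000 repetitions the band is roughly 0.936 to 0.964, so the observed values 0.946 and 0.951 are both consistent with the nominal 95% coverage. Increasing the number of repetitions narrows the band.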
ggplot(sim_df, aes(y=sim, x=sim_mean, color=z_cov))+
  geom_point()+
  geom_errorbar(alpha=.3, aes(xmin=z_low, xmax=z_up), size=.5, width=.1)+
  ggtitle("Z-Methods Coverage Simulation \n Sigma (Known) = 1")+
  theme_bw()
ggplot(sim_df, aes(y=sim, x=sim_mean, color=t_cov))+
  geom_point()+
  geom_errorbar(alpha=.3, aes(xmin=t_low, xmax=t_up), size=.5, width=.1)+ # use the t interval bounds
  ggtitle("T-Methods Coverage Simulation \n Sigma (Unknown)")+
  theme_bw()