load(url("http://www.openintro.org/stat/data/ames.RData"))
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
hist(samp, breaks=25)
summary(samp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 800 1092 1447 1473 1687 2790
sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1355.720 1589.447
mean(population)
## [1] 1499.69
if (mean(population) > lower & mean(population) < upper){
printo<-"TRUE"
} else{
printo<-"FALSE"
}
sd_samp_classmates<-c()
mean_samp_classmates<-c()
for(i in 1:35){
samp_classmates<-sample(samp, 60)
sd_samp_classmates[i]<-sd(samp_classmates)
mean_samp_classmates[i]<-mean(samp_classmates)
}
upper_samp_classmates<-mean_samp_classmates+(1.96*sd_samp_classmates)
lower_samp_classmates<-mean_samp_classmates-(1.96*sd_samp_classmates)
Using R, we’re going to recreate many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here (If you are unfamiliar with loops, review the Sampling Distribution Lab).
Here is the rough outline:
But before we do all of this, we need to first create empty vectors where we can save the means and standard deviations that will be calculated from each sample. And while we’re at it, let’s also store the desired sample size as n
.
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
Now we’re ready for the loop where we calculate the means and standard deviations of 50 random samples.
for(i in 1:50){
samp <- sample(population, n) # obtain a sample of size n = 60 from the population
samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
Lastly, we construct the confidence intervals.
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
Lower bounds of these 50 confidence intervals are stored in lower_vector
, and the upper bounds are in upper_vector
. Let’s view the first interval.
c(lower_vector[1], upper_vector[1])
## [1] 1328.970 1550.997
Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.
plot_ci(lower_vector, upper_vector, mean(population))
df <- data.frame(lower_vector, upper_vector)
low <- sum(df$upper_vector < mean(population))
up <- sum(df$lower_vector > mean(population))
without_mean_proportion <- round((low + up)/60 ,2)
plot_ci
function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?lower_vector <- samp_mean - 1.65 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.65 * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))
df <- data.frame(lower_vector, upper_vector)
left <- sum(df$upper_vector < mean(population))
right <- sum(df$lower_vector > mean(population))
without_mean_proportion <- round((left + right)/60 ,2)