# Ames, Iowa dataset
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
# Generate a simple random sample of size 60. Look at size of house.
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
This sample has a right-skewed distribution. The typical size is about 1403 sq ft. I interpreted ‘typical’ as the median. I chose the median over the mean because upper outliers pull the mean higher.
hist(samp, breaks=10)
summary(samp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     572    1117    1403    1480    1721    4476
No, another student's distribution would not be identical, but it would be similar. Each sample is drawn at random from the same population, so a little variation from sample to sample is expected due to sampling error.
sample_mean <- mean(samp)
# Calculate a 95% confidence interval for a sample mean
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1330.528 1629.772
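The interval above hard-codes both 1.96 and the sample size. As a minimal sketch, the same calculation can be wrapped in a helper that derives the critical value from the confidence level with `qnorm()`. The helper name `mean_ci` and the simulated stand-in sample are assumptions for illustration and are not part of the lab.

```r
# Hypothetical helper: normal-approximation confidence interval for a mean.
mean_ci <- function(x, conf_level = 0.95) {
  n <- length(x)
  se <- sd(x) / sqrt(n)                       # standard error of the mean
  z_star <- qnorm(1 - (1 - conf_level) / 2)   # about 1.96 when conf_level = 0.95
  c(lower = mean(x) - z_star * se,
    upper = mean(x) + z_star * se)
}

# Simulated stand-in for a sample of 60 house sizes (not the Ames data).
set.seed(1)
fake_samp <- rnorm(60, mean = 1500, sd = 500)
ci95 <- mean_ci(fake_samp)                     # 95% interval
ci80 <- mean_ci(fake_samp, conf_level = 0.80)  # narrower 80% interval
ci95
ci80
```

With the lab's own data, `mean_ci(samp)` should give the same bounds as the hand calculation, up to the rounding of 1.96.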
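The sample must be a random sample. The observations (i.e. homes) must be independent of each other, which is reasonable when the sample is small compared to the population (a common rule of thumb is under 10%). The sample must also be large enough for the sampling distribution of the mean to be approximately normal despite the right skew.
It means that intervals built this way capture the true population mean in about 95% of random samples. For this sample, we are 95% confident that the population mean lies between the lower bound and the upper bound.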
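Two of these conditions can be checked with simple arithmetic. The sketch below uses the common 10% and n >= 30 rules of thumb. The value of `pop_size` is a hypothetical stand-in; in the lab it would be `length(population)`.

```r
n <- 60
pop_size <- 3000   # hypothetical population size; use length(population) in the lab

small_fraction <- n / pop_size < 0.10   # sample is under 10% of the population
large_enough <- n >= 30                 # rule of thumb for the normal approximation
c(small_fraction = small_fraction, large_enough = large_enough)
```

Randomness of the sample cannot be checked this way. It follows from how `sample()` draws the observations.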
mean(population)
## [1] 1499.69
Yes, the confidence interval captures the true average size. The true population mean (1499.7) is between 1330.5 and 1629.8 sq ft.
I would expect about 95% of the confidence intervals generated by other students to capture the true population mean. If each student builds a 95% confidence interval from an independent random sample, then roughly 95% of those intervals will contain the true population mean.
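The same comparison can be made in code instead of by eye. This small sketch copies the interval bounds and the population mean from the output printed above. The names `pop_mean` and `captured` are introduced here for illustration.

```r
# Values copied from the printed output above.
lower <- 1330.528
upper <- 1629.772
pop_mean <- 1499.69

# TRUE when the interval contains the population mean.
captured <- lower <= pop_mean && pop_mean <= upper
captured
```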
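Using R, we’re going to recreate many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here (if you are unfamiliar with loops, review the Sampling Distribution Lab).
Here is the rough outline: obtain a random sample, calculate and store its mean and standard deviation, repeat this 50 times, and then use the stored statistics to calculate the confidence intervals.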
# But before we do all of this, we need to first create empty vectors where we can save the means and standard deviations that will be calculated from each sample. And while we’re at it, let’s also store the desired sample size as n.
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
# Now we’re ready for the loop where we calculate the means and standard deviations of 50 random samples.
for (i in 1:50) {
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}
# Lastly, we construct the confidence intervals. Lower bounds of these 50 confidence intervals are stored in lower_vector, and the upper bounds are in upper_vector.
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
# Let’s view the first interval.
c(lower_vector[1], upper_vector[1])
## [1] 1311.882 1571.951
2 out of 50 intervals do not include the true population mean. So, 48 out of 50, or 96% of the intervals, include the true population mean.
This proportion is not exactly equal to the confidence level, which is not unusual. The confidence level is the long-run proportion of intervals that contain the true population mean, so the proportion in any one set of 50 intervals will vary around 95%.
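Counting the misses on the plot by eye is error-prone, so the tally can also be done with a vectorized comparison. The sketch below repeats the loop on a simulated right-skewed stand-in population, which is an assumption and not the Ames data. All `sim_` names are hypothetical, chosen so they do not overwrite the lab's vectors.

```r
# Simulated right-skewed stand-in population (an assumption, not the Ames data).
set.seed(42)
sim_pop <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)
sim_n <- 60
sim_mean <- rep(NA, 50)
sim_sd <- rep(NA, 50)
for (i in 1:50) {
  s <- sample(sim_pop, sim_n)
  sim_mean[i] <- mean(s)
  sim_sd[i] <- sd(s)
}
sim_lower <- sim_mean - 1.96 * sim_sd / sqrt(sim_n)
sim_upper <- sim_mean + 1.96 * sim_sd / sqrt(sim_n)

# TRUE for each interval that contains the population mean.
captured <- sim_lower <= mean(sim_pop) & mean(sim_pop) <= sim_upper
sum(captured)    # number of intervals that capture the mean
mean(captured)   # proportion of intervals that capture the mean
```

With the lab's own vectors, the equivalent one-liner would be `mean(lower_vector <= mean(population) & mean(population) <= upper_vector)`.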
plot_ci(lower_vector, upper_vector, mean(population))
80% confidence level. The critical value is about 1.296.
# Calculate critical value for 80% CI
qt(0.90, df = n - 1) # n is 60, so df = 59
## [1] 1.296066
7 out of 50 samples do not include the true population mean. Therefore, the proportion of intervals that include the true population mean is 43 out of 50, or 86%.
This percentage is somewhat higher than the 80% confidence level, which is not unusual with only 50 intervals.
lower_vector <- samp_mean - 1.296 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.296 * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))
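A gap between the observed proportion and the confidence level is mostly a matter of how few intervals were drawn. As a rough sketch, the helper below estimates coverage for any number of repetitions. The name `coverage` and the simulated stand-in population are assumptions for illustration, not part of the lab or its data.

```r
# Hypothetical helper: share of intervals that capture the population mean.
coverage <- function(pop, n, reps, crit) {
  hits <- replicate(reps, {
    s <- sample(pop, n)
    half_width <- crit * sd(s) / sqrt(n)
    abs(mean(s) - mean(pop)) <= half_width   # TRUE if the interval captures the mean
  })
  mean(hits)
}

set.seed(2024)
sim_pop <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)  # stand-in, not the Ames data
crit_80 <- qt(0.90, df = 60 - 1)                      # 80% critical value, df = n - 1

cov_small <- coverage(sim_pop, n = 60, reps = 50, crit = crit_80)    # noisy estimate
cov_large <- coverage(sim_pop, n = 60, reps = 5000, crit = crit_80)  # should settle near 0.80
c(cov_small, cov_large)
```

With 50 intervals, the observed proportion can easily land several percentage points away from 80%. With thousands, it should stay close to the nominal level.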