Foundations for statistical inference - Confidence intervals

Sampling from Ames, Iowa - The Data

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)

Exercise 1: Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

hist(samp)

The distribution of my sample is somewhat right-skewed, with a peak between 1000 and 1500 square feet. Based on that peak, I would say the typical house size in my sample is between 1000 and 1500 square feet. By “typical” I mean the range where the largest share of house sizes falls, that is, the tallest bin of the histogram.
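A numerical summary can back up the visual read of the histogram. The sketch below is only an illustration: `fake_population` is a made-up, right-skewed stand-in (simulated lognormal values), not the real Ames data, and `samp_demo`, `typical`, and `middle_half` are hypothetical names. With the lab data loaded, the same calls can be applied to `samp` directly.

```r
# Hypothetical stand-in for ames$Gr.Liv.Area: a right-skewed lognormal population.
# These values are simulated, NOT the real Ames data.
set.seed(1)
fake_population <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
samp_demo <- sample(fake_population, 60)

# Two candidate definitions of "typical": the median and the mean.
typical <- c(median = median(samp_demo), mean = mean(samp_demo))

# The quartiles bracket the middle 50% of house sizes in the sample.
middle_half <- quantile(samp_demo, c(0.25, 0.75))

typical
middle_half
```

For a right-skewed sample, the median is usually less affected by a few very large houses than the mean, so it is a reasonable single-number version of “typical”.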

Exercise 2: Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

I would not expect another student’s distribution to be identical, since the chance of drawing exactly the same 60 values out of 2930 is vanishingly small. I would expect it to be similar, though: both samples come from the same population, so the overall shape (somewhat right-skewed, with most houses in the same size range) should be roughly the same even though the individual values differ.
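The argument above can be checked with a small experiment. This is a sketch under stated assumptions: `fake_population` is a simulated stand-in rather than the real Ames data, and `samp_a`, `samp_b`, `n_possible`, `same_values`, and `mean_gap` are hypothetical names introduced for illustration.

```r
# Hypothetical stand-in population (simulated, NOT the real Ames data).
set.seed(2)
fake_population <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Number of distinct samples of 60 houses that can be drawn from 2930.
n_possible <- choose(2930, 60)

# Two "students" each draw their own sample of 60 without replacement.
samp_a <- sample(fake_population, 60)
samp_b <- sample(fake_population, 60)

same_values <- setequal(samp_a, samp_b)        # did both draw the same houses?
mean_gap <- abs(mean(samp_a) - mean(samp_b))   # distance between the two sample means

n_possible
same_values
mean_gap
```

With the lab data, replacing `fake_population` with `population` runs the same comparison on the real house sizes.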

Confidence Intervals

sample_mean <- mean(samp)
se <- sd(samp) / sqrt(length(samp))  # standard error of the sample mean
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1459.875 1719.859
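The multiplier 1.96 used above is the standard normal quantile that leaves 2.5% in each tail. A minimal sketch of where it comes from, using base R's `qnorm`; the names `z_star`, `conf_levels`, and `z_values` are introduced here for illustration only.

```r
# z* for a 95% interval: the 97.5th percentile of the standard normal,
# which leaves 2.5% in each tail.
z_star <- qnorm(0.975)
round(z_star, 2)

# The same recipe gives the multiplier for other confidence levels.
conf_levels <- c(0.90, 0.95, 0.99)
z_values <- qnorm(1 - (1 - conf_levels) / 2)
round(z_values, 3)
```

A higher confidence level gives a larger multiplier and therefore a wider interval for the same sample.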

Exercise 3: For the confidence interval to be valid, the sample mean must be normally distributed and have standard error $s / \sqrt{n}$. What conditions must be met for this to be true?

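The observations must be independent, which holds here because the sample is random and contains less than 10% of the population. The sample must also be large enough, usually n greater than or equal to 30, and the population distribution must not be strongly skewed. The standard error is estimated from the sample standard deviation s.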
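The two sample-size checks commonly used for this kind of interval can be written out as arithmetic. This is a sketch of rule-of-thumb checks, not a formal test: the 10% sampling-fraction guideline and the n ≥ 30 guideline are conventions, and the names `n_samp`, `n_pop`, `sampling_fraction`, `independent_enough`, and `large_enough` are hypothetical.

```r
n_samp <- 60     # sample size used in this lab
n_pop <- 2930    # number of houses in the Ames data

# Sampling without replacement: observations are treated as roughly
# independent when the sample is under 10% of the population.
sampling_fraction <- n_samp / n_pop
independent_enough <- sampling_fraction < 0.10

# Common rule of thumb for the normal approximation of the sample mean.
large_enough <- n_samp >= 30

sampling_fraction
c(independent_enough = independent_enough, large_enough = large_enough)
```

The skewness condition is not covered by these checks; it is judged from a plot such as `hist(samp)`.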

Exercise 4: What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.

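95% confidence describes the method rather than any single interval: if we took many samples of 60 houses and built an interval from each one in the same way, about 95% of those intervals would contain the true average house size in Ames. For my sample, the interval runs from 1328.883 to 1554.717 square feet, so I am 95% confident that the true average lies in that range.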

mean(population)
## [1] 1499.69

Exercise 5: Does your confidence interval capture the true average size of houses in Ames?

Yes, the confidence interval ranged from 1328.883 to 1554.717, and the true mean of 1499.69 falls within that range.
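The same check can be done in code instead of by eye. The sketch below plugs in the numbers reported in this answer; `ci_lower`, `ci_upper`, `pop_mean`, and `captured` are hypothetical names used only for this illustration.

```r
# Values reported in the answer above.
ci_lower <- 1328.883
ci_upper <- 1554.717
pop_mean <- 1499.69

# The interval captures the mean when the mean lies between the endpoints.
captured <- ci_lower <= pop_mean & pop_mean <= ci_upper
captured
```

With the lab objects in the workspace, the equivalent expression is `lower <= mean(population) & mean(population) <= upper`.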

Exercise 6: Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?

I would expect about 95% of the intervals to capture the true population mean. Every student built an interval with the same procedure, and that procedure produces an interval containing the true mean for about 95% of random samples. The proportion in any one class will not be exactly 95%, because the set of samples the class happened to draw is itself random.
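A short calculation shows what “about 95%” looks like for a finite class. This is a sketch under assumptions not stated in the lab: a hypothetical class of 50 students, samples drawn independently, and each interval capturing the mean with probability exactly 0.95. The names `n_intervals`, `p_capture`, `expected_hits`, and `p_all_hit` are introduced for illustration.

```r
n_intervals <- 50    # hypothetical class size, one interval per student
p_capture <- 0.95    # nominal capture probability of each 95% interval

# Average number of intervals that capture the mean.
expected_hits <- n_intervals * p_capture

# Probability that every one of the intervals captures the mean
# (binomial probability of n_intervals successes out of n_intervals).
p_all_hit <- dbinom(n_intervals, n_intervals, p_capture)

expected_hits
round(p_all_hit, 3)
```

Under these assumptions it would be unusual for every interval in the class to capture the mean, so a couple of misses are expected rather than a sign of error.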

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
for(i in 1:50){
  samp <- sample(population, n)
  samp_mean[i] <- mean(samp)
  samp_sd[i] <- sd(samp)
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
c(lower_vector[1], upper_vector[1])
## [1] 1316.141 1537.392
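Once the 50 intervals exist, the share that capture the mean can be counted directly. The sketch below is self-contained, so it rebuilds the intervals on a simulated stand-in: `fake_population` is NOT the real Ames data, and `true_mean`, `n_reps`, `samples`, `captured`, and `coverage` are hypothetical names. It also uses `replicate` in place of the `for` loop, which is a stylistic swap rather than the lab's own approach.

```r
# Hypothetical stand-in population (simulated, NOT the real Ames data).
set.seed(3)
fake_population <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
true_mean <- mean(fake_population)

n <- 60
n_reps <- 50

# Each column of `samples` is one sample of size n.
samples <- replicate(n_reps, sample(fake_population, n))
samp_mean <- colMeans(samples)
samp_sd <- apply(samples, 2, sd)

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

# An interval captures the mean when the mean lies between its endpoints.
captured <- lower_vector <= true_mean & true_mean <= upper_vector
coverage <- mean(captured)   # proportion of the intervals that capture the mean
coverage
```

With the lab's own `lower_vector` and `upper_vector`, only the last three lines are needed, with `mean(population)` in place of `true_mean`.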