Sampling from Ames, Iowa

The Data

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)

Exercise 1. Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

hist(samp)

summary(samp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     616    1218    1368    1447    1628    2654

The distribution is slightly right-skewed, which is consistent with the median (1368) being less than the mean (1447). The range is 2038 (2654 minus 616). The “typical” size in my sample of 60 homes is about 1447 square feet. By “typical” I mean the sample mean, that is, the average above-ground living area of the sampled homes in Ames, Iowa.
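
The two checks used in this answer (range, and mean versus median as a hint of skew) can be sketched in a few lines. The vector `demo_samp` below is simulated, right-skewed data standing in for `samp`, not the Ames sample, and its parameters are arbitrary assumptions chosen only for illustration.

```r
# Simulated stand-in for `samp`: 60 right-skewed values (NOT the Ames data).
set.seed(1)
demo_samp <- round(rlnorm(60, meanlog = 7.2, sdlog = 0.3))

# Spread: largest value minus smallest value.
demo_range <- max(demo_samp) - min(demo_samp)

# A positive difference (mean above median) hints at right skew.
mean_minus_median <- mean(demo_samp) - median(demo_samp)

c(range = demo_range, mean_minus_median = mean_minus_median)

# The same arithmetic applied to the summary() output reported above:
2654 - 616    # range of the reported sample: 2038
1447 - 1368   # mean minus median: 79
```

A histogram remains the more reliable check, since a small mean-median gap can also arise by chance.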

Exercise 2. Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

Another student’s distribution will not be identical, because each student draws a different random sample of 60 homes, although a few homes may overlap. The distributions should still be similar, since both are random samples of the same size from the same population.

Confidence Intervals

sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1340.208 1553.425

Exercise 3. For the confidence interval to be valid, the sample mean must be normally distributed and have standard error s/sqrt(n). What conditions must be met for this to be true?

For the confidence interval to be valid, these conditions must hold: the sample is random, the observations are independent (the sample is less than 10% of the population), the sample size is at least 30, and the population distribution is not strongly skewed.

Confidence Levels

Exercise 4. What does “95% confidence” mean?

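95% confidence means that if we repeatedly drew random samples and built an interval from each one, about 95% of those intervals would contain the true population mean. Each interval extends roughly two standard errors (1.96, to be exact) on either side of the sample mean.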
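
The long-run reading of “95% confidence” can be checked with a small simulation. This is a sketch under stated assumptions: `demo_pop` is a simulated right-skewed population standing in for `ames$Gr.Liv.Area`, and the number of repetitions (`reps`) is an arbitrary choice.

```r
set.seed(42)

# Hypothetical right-skewed population (NOT the Ames data).
demo_pop <- rlnorm(3000, meanlog = 7.2, sdlog = 0.3)
mu <- mean(demo_pop)   # the "true" mean of the simulated population
n <- 60                # same sample size as in the lab
reps <- 2000           # number of repeated samples (arbitrary)

# For each repetition: draw a sample, build a 95% interval,
# and record whether the interval contains mu.
covered <- replicate(reps, {
  s <- sample(demo_pop, n)
  se <- sd(s) / sqrt(n)
  (mean(s) - 1.96 * se <= mu) && (mu <= mean(s) + 1.96 * se)
})

# Share of intervals that captured the true mean; expected to be near 0.95.
coverage <- mean(covered)
coverage
```

Any single interval either contains the mean or it does not; the 95% describes the procedure across many repetitions.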

mean(population)
## [1] 1499.69

Exercise 5. Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

Yes, the confidence interval computed above, 1340.208 < μ < 1553.425, includes the population mean of 1499.69.

Exercise 6. Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.

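Approximately 95% of the intervals should include the true population mean, because each interval comes from a procedure that captures the mean in about 95% of random samples.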
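
How far a class's observed proportion can drift from 95% is a binomial question. The sketch below assumes, as an idealization, that each of `k` intervals independently captures the mean with probability 0.95; the class size `k = 50` is an illustrative assumption.

```r
k <- 50      # number of intervals, e.g. one per student (illustrative)
p <- 0.95    # nominal capture probability of a 95% interval

# Expected number of intervals that capture the mean.
expected_hits <- k * p

# Standard deviation of the observed proportion of captures.
sd_prop <- sqrt(p * (1 - p) / k)

# Central 95% range for the number of captures among k intervals.
plausible_hits <- qbinom(c(0.025, 0.975), size = k, prob = p)

expected_hits          # 47.5
round(sd_prop, 3)      # about 0.031
plausible_hits
```

With a standard deviation of about three percentage points, an observed proportion a few points away from 95% is unremarkable.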

Loop for 50 samples

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
for(i in 1:50){
  samp <- sample(population, n)
  samp_mean[i] <- mean(samp)
  samp_sd[i] <- sd(samp)
}
sd(samp)
## [1] 576.9694
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
c(lower_vector[1], upper_vector[1])
## [1] 1410.820 1646.413
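
Given vectors of lower and upper bounds like those above, the share of intervals that contain a target value can be counted with a vectorized comparison. The helper `capture_rate` and the toy bounds below are hypothetical, introduced only for illustration; they are not part of the lab's code or data.

```r
# Hypothetical helper: fraction of intervals [lower, upper] that contain `target`.
capture_rate <- function(lower, upper, target) {
  mean(lower <= target & target <= upper)
}

# Toy bounds for illustration only (not the lab's intervals).
toy_lower <- c(1400, 1450, 1510, 1380)
toy_upper <- c(1600, 1650, 1700, 1490)

# The first two intervals contain 1499.69, the last two do not.
capture_rate(toy_lower, toy_upper, 1499.69)   # 0.5
```

With the lab's objects, the analogous call would be `capture_rate(lower_vector, upper_vector, mean(population))`.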

On Your Own

1. Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.

plot_ci(lower_vector, upper_vector, mean(population))

4 out of the 50 confidence intervals did not include the true population mean of 1499.69, so 46 of 50 (92%) captured it. This proportion is not exactly equal to the 95% confidence level because 95% is the long-run capture rate of the method. With only 50 intervals, the observed proportion varies from one set of samples to the next.

2. Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?

qnorm(0.95,0,1)
## [1] 1.644854

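I chose a 90% confidence level. The critical value from the standard normal distribution is about 1.645, as computed above with qnorm(0.95), and this is the value used in the intervals below. The t distribution with 59 degrees of freedom gives a slightly larger value of about 1.67.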
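
The critical value for any two-sided confidence level follows from the tail area. A minimal sketch comparing the normal and t-based values for a sample of 60 (the variable names are illustrative):

```r
conf_level <- 0.90
alpha <- 1 - conf_level        # total area left in the two tails

# Each tail holds alpha / 2, so the critical value is the 1 - alpha/2 quantile.
z_crit <- qnorm(1 - alpha / 2)             # standard normal, about 1.645
t_crit <- qt(1 - alpha / 2, df = 60 - 1)   # t with 59 degrees of freedom, about 1.67

c(z = z_crit, t = t_crit)
```

The t value is slightly larger, so t-based intervals are slightly wider; at n = 60 the difference is small.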

3. Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?

samp_mean2 <- rep(NA, 50)
samp_sd2 <- rep(NA, 50)
p <- 60
for(i in 1:50){
  samp2 <- sample(population, p)
  samp_mean2[i] <- mean(samp2)
  samp_sd2[i] <- sd(samp2)
}
lower <- samp_mean2 - 1.645 * samp_sd2 / sqrt(p) 
upper <- samp_mean2 + 1.645 * samp_sd2 / sqrt(p)
c(lower[1], upper[1])
## [1] 1377.575 1618.559
plot_ci(lower, upper, mean(population))

4 out of 50 confidence intervals did not include the true population mean of 1499.69, so 46 of 50 (92%) captured it. This is slightly higher than the 90% confidence level, but the difference is within what random sampling variation would produce: 90% is the long-run capture rate, and any particular set of 50 intervals can land a few percentage points above or below it.
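
As the exercise notes, intervals at a new confidence level can be computed from sample means and standard deviations that are already stored, without drawing new samples. The sketch below assumes a simulated population `demo_pop` (not the Ames data) and a hypothetical helper `ci_bounds`; it builds 95% and 90% intervals from the same 50 samples and compares their capture rates.

```r
set.seed(7)

# Hypothetical right-skewed population (NOT the Ames data).
demo_pop <- rlnorm(3000, meanlog = 7.2, sdlog = 0.3)
mu <- mean(demo_pop)
n <- 60

# Draw 50 samples once, storing each sample's mean and standard deviation.
demo_mean <- rep(NA, 50)
demo_sd <- rep(NA, 50)
for (i in 1:50) {
  s <- sample(demo_pop, n)
  demo_mean[i] <- mean(s)
  demo_sd[i] <- sd(s)
}

# Hypothetical helper: normal-based intervals at a given level,
# reusing the stored means and standard deviations.
ci_bounds <- function(level) {
  z <- qnorm(1 - (1 - level) / 2)
  half_width <- z * demo_sd / sqrt(n)
  list(lower = demo_mean - half_width, upper = demo_mean + half_width)
}

ci95 <- ci_bounds(0.95)
ci90 <- ci_bounds(0.90)

# Proportion of intervals at each level that capture the true mean.
rate95 <- mean(ci95$lower <= mu & mu <= ci95$upper)
rate90 <- mean(ci90$lower <= mu & mu <= ci90$upper)
c(rate95 = rate95, rate90 = rate90)
```

Because each 90% interval sits inside the 95% interval from the same sample, the 90% capture rate can never exceed the 95% rate when the samples are reused.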