Foundations for statistical inference - Confidence intervals

Sampling from Ames, Iowa

The Data

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)

summary(samp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     792    1115    1482    1472    1803    2599
hist(samp)

Exercise 1

Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

My sample is unimodal and right-skewed. A majority of the data, which I take to be the typical size, falls between 1,000 and 1,500 square feet of living area. I interpret “typical” to mean the most common house size (the mode).

Exercise 2

Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

I would not expect another student’s distribution to be identical to mine, because each of us draws a random sample of only 60 houses from a large population. Another student’s distribution will probably be similar, for example right-skewed with a comparable center, but it is very unlikely to look identical.

Confidence Intervals

sample_mean <- mean(samp)

se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1366.089 1578.611

Exercise 3

For the confidence interval to be valid, the sample mean must be normally distributed and have standard error s/sqrt(n). What conditions must be met for this to be true?

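The data must come from a random sample, the observations must be independent (when sampling without replacement, the sample should be less than 10% of the population), and the sample should be large enough (n ≥ 30) with a population distribution that is not strongly skewed. The data themselves do not need to be normally distributed; with a large enough sample, the sample mean is approximately normal.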
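
The large-sample condition can be checked by simulation. The sketch below is an illustration only: it uses a hypothetical right-skewed population generated with rlnorm() as a stand-in for ames$Gr.Liv.Area (the names sim_population, sim_n, sim_means, predicted_se, and within_share are not part of the lab). It draws many samples of size 60 and compares the spread of the sample means with the standard error formula.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical right-skewed population (assumption: stands in for ames$Gr.Liv.Area)
sim_population <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)
sim_n <- 60

# Draw 5000 samples of size sim_n and keep each sample mean
sim_means <- replicate(5000, mean(sample(sim_population, sim_n)))

# Standard error predicted by sigma / sqrt(n)
predicted_se <- sd(sim_population) / sqrt(sim_n)

# Spread of the simulated sample means next to the predicted standard error
c(simulated = sd(sim_means), predicted = predicted_se)

# Share of sample means within 1.96 predicted SEs of the population mean;
# a nearly normal sampling distribution puts this close to 0.95
within_share <- mean(abs(sim_means - mean(sim_population)) <= 1.96 * predicted_se)
within_share
```

If the two standard errors agree and the share is near 0.95, the normal approximation is reasonable for this sample size, even though the simulated population itself is skewed.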

Confidence Levels

Exercise 4

What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.

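“95% confidence” describes the method rather than any single interval: if we repeatedly took random samples of the same size and built an interval from each, about 95% of those intervals would contain the true population mean. In that sense we are 95% confident that the true population mean lies between our two calculated values.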
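
The long-run reading of a confidence level can be illustrated with a small simulation. This is a sketch under stated assumptions: the population is a hypothetical one generated with rlnorm() rather than the Ames data, and captures_mean() is a helper written for this illustration, not a function from the lab.

```r
set.seed(7)  # arbitrary seed for reproducibility

# Hypothetical population (assumption: stands in for ames$Gr.Liv.Area)
sim_population <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)
true_mean <- mean(sim_population)
sim_n <- 60

# TRUE if the 95% interval built from one random sample captures true_mean
captures_mean <- function() {
  s <- sample(sim_population, sim_n)
  half_width <- 1.96 * sd(s) / sqrt(sim_n)
  abs(mean(s) - true_mean) <= half_width
}

# Proportion of 2000 simulated intervals that capture the true mean
coverage <- mean(replicate(2000, captures_mean()))
coverage
```

The resulting proportion should land near 0.95, though not exactly on it, because the interval relies on a normal approximation and the simulation uses a finite number of samples.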

In this case we have the luxury of knowing the true population mean since we have data on the entire population. This value can be calculated using the following command:
mean(population)
## [1] 1499.69

Exercise 5

Does your confidence interval capture the true average size of houses in Ames?

Yes, my confidence interval is (1366.089, 1578.611), which includes the true population mean of 1499.69.
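
The same check can be written as a comparison in R. The sketch below hard-codes the interval printed by c(lower, upper) and the value printed by mean(population) earlier in this lab so that it runs on its own; the names ci, pop_mean, and captured are introduced here for illustration.

```r
# Interval and population mean as printed earlier in this lab
ci <- c(lower = 1366.089, upper = 1578.611)
pop_mean <- 1499.69

# TRUE when the population mean lies inside the interval
captured <- ci[["lower"]] <= pop_mean && pop_mean <= ci[["upper"]]
captured
```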

Exercise 6

Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?

I would expect about 95% of the class to have confidence intervals that include the true population mean. This is because each of us has calculated a 95% confidence interval from our own random sample of the same size.

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
c(lower_vector[1], upper_vector[1])
## [1] 1277.675 1529.125

On Your Own

plot_ci(lower_vector, upper_vector, mean(population))

1) Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.

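47 of my 50 confidence intervals (94%) include the true population mean. This is slightly lower than the 95% confidence level, but the two are not expected to match exactly. A 95% confidence level means that about 95% of intervals capture the true mean in the long run, not that at least 95% of any particular set will. With only 50 intervals, the proportion cannot equal 95% exactly (that would be 47.5 intervals), and it varies from one set of samples to the next.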
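
How much the count of capturing intervals can vary across sets of 50 can be sketched with a binomial model. This assumes the 50 samples are independent and that each interval captures the mean with probability 0.95, which is only approximately true; the variable names are introduced here for illustration.

```r
n_intervals <- 50
conf_level <- 0.95

# Expected number of intervals that capture the mean, and its standard deviation
expected_hits <- n_intervals * conf_level                      # 47.5
sd_hits <- sqrt(n_intervals * conf_level * (1 - conf_level))   # about 1.54

# Probability of seeing 47 or fewer captures out of 50 under this model
p_47_or_fewer <- pbinom(47, size = n_intervals, prob = conf_level)

c(expected = expected_hits, sd = sd_hits, p_47_or_fewer = p_47_or_fewer)
```

With the lab's own objects, the observed proportion can be computed directly as mean(lower_vector <= mean(population) & mean(population) <= upper_vector).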

2) Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?

For a 99% confidence level, the critical value is z* = 2.58, so the interval is the sample mean +/- 2.58 * standard error.
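
The critical value can also be obtained from the normal quantile function instead of a table. A minimal sketch using base R's qnorm(); critical_value() is a hypothetical helper written for this illustration.

```r
# Critical value z* for a two-sided interval at a given confidence level
critical_value <- function(conf_level) {
  qnorm(1 - (1 - conf_level) / 2)
}

critical_value(0.95)  # about 1.96
critical_value(0.99)  # about 2.58
```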

3) Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?

lower_vector <- samp_mean - 2.58 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 2.58 * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))

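All 50 of my 99% confidence intervals include the population mean, a proportion of 100%. This is higher than the 94% I observed with the 95% intervals and slightly above the selected 99% confidence level, which is expected because the wider 99% intervals miss the true mean less often.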
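
A rough calculation shows how often a set of 50 intervals would capture the mean every time. It assumes the 50 samples are independent and that each interval captures the mean with probability equal to its confidence level, which is only an approximation; the variable names are introduced here for illustration.

```r
n_intervals <- 50

# Chance that every one of 50 independent intervals captures the mean,
# assuming each does so with probability equal to its confidence level
p_all_99 <- 0.99^n_intervals   # about 0.61
p_all_95 <- 0.95^n_intervals   # about 0.08

c(all_99 = p_all_99, all_95 = p_all_95)
```

Under these assumptions, full coverage is a common outcome for 50 intervals at the 99% level and an uncommon one at the 95% level.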