The Data

Load up the data for this lab by inserting a code chunk here:

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
  1. Describing the Data
    1. Describe the distribution of your sample. (Be sure to use appropriate terminology such as symmetric, skewed right/left, unimodal, bimodal, etc.)
    2. What would you say is the “typical” size within your sample? State precisely what you interpreted “typical” to mean.
population <- ames$Gr.Liv.Area
samp <- sample(population,60)
hist(samp)

mean(samp)
## [1] 1577.717

Solution: The distribution of the sample is unimodal and skewed right. For the “typical” size in the sample, we use the sample mean computed above, about 1578 square feet.
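Because the sample is skewed right, it can be worth checking how far the mean sits from the median before settling on a meaning for “typical”. The chunk below is only a sketch: `population` here is a simulated right-skewed stand-in (an assumption, drawn with `rlnorm` using arbitrarily chosen parameters) so that it runs without downloading `ames.RData`. With the real data, `population <- ames$Gr.Liv.Area` would take its place.

```r
# Hypothetical stand-in for ames$Gr.Liv.Area: a simulated right-skewed population.
set.seed(1)
population <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Draw a sample of 60, as in the lab.
samp <- sample(population, 60)

# Two common readings of "typical": the mean and the median.
typical <- c(mean = mean(samp), median = median(samp))
typical

# Five-number summary plus the mean, to help describe the shape.
summary(samp)
```

If the two values differ noticeably, the answer should state which one was used and why.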

  1. Would you expect another group’s sample distribution to be identical to yours? Would you expect it to be similar? Why or why not?

Solution: Because sampling is random, we shouldn’t expect another group’s distribution to be identical. However, it should be somewhat similar because the samples we are taking are reasonably large.

Confidence intervals

  1. Create a variable called se that contains the standard error. Then create a variable called za that has the critical value for a 95% confidence interval. Finally, create a variable called lower that has the lower bound of the confidence interval and a variable called upper that has the upper bound of the confidence interval. After doing that, use the command c(lower, upper) to show the confidence interval you just created.
sample_mean <- mean(samp)
za <- qnorm(0.025, lower.tail=FALSE)
se <- sd(population)/sqrt(60)  # sigma/sqrt(n), using the population SD; sd(samp) is the usual substitute when sigma is unknown
lower <- sample_mean - za*se
upper <- sample_mean + za*se
c(lower,upper)
## [1] 1449.808 1705.626
  1. For the confidence interval to be valid, the sample mean must be normally distributed and have standard error sigma/sqrt(n). What conditions must be met for this to be true?

Solution: The observations must come from a random sample and be independent, and either the population must be normal or the sample size must be sufficiently large (a common rule of thumb is more than 30) for the central limit theorem to apply. The latter holds here, since n = 60.
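One way to see the condition at work is to simulate the sampling distribution and compare its spread with sigma/sqrt(n). This is a sketch under assumptions, not part of the lab: `population` is again a simulated right-skewed stand-in for `ames$Gr.Liv.Area`, and the 5000 repetitions are an arbitrary choice.

```r
# Hypothetical stand-in for ames$Gr.Liv.Area: a simulated right-skewed population.
set.seed(42)
population <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
n <- 60

# Draw many samples of size n and keep the mean of each one.
sample_means <- replicate(5000, mean(sample(population, n)))

# Spread of the simulated sample means ...
observed_se <- sd(sample_means)

# ... compared with the theoretical standard error sigma / sqrt(n).
theoretical_se <- sd(population) / sqrt(n)

c(observed = observed_se, theoretical = theoretical_se)

# The sample means should also centre on the population mean.
c(mean_of_means = mean(sample_means), population_mean = mean(population))
```

The two standard errors should be close but not identical, partly because the simulation is random and partly because `sample()` draws without replacement from a finite population.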

  1. Does your confidence interval capture the true average size of houses in Ames?
mean(population)
## [1] 1499.69

Solution: Yes, the confidence interval contains the true population mean. (Note: this may not be true on some knits of this notebook due to the random nature of sampling.)

  1. Each student in your class should have gotten a slightly different confidence interval. Most of them probably contain the true population mean, but it’s possible that some might not. What percent do you expect to contain the true population mean?

Solution: We expect 95% to contain the true population mean because we are creating 95% confidence intervals.
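The 95% figure can be checked by imagining a very large class. The sketch below builds 2000 intervals (an arbitrary number) from a simulated right-skewed stand-in for `ames$Gr.Liv.Area` (an assumption, so the chunk runs on its own) and reports the share that capture the population mean. Because it uses the sample standard deviation with a normal critical value on skewed data, the share may land a little below 95%.

```r
# Hypothetical stand-in for ames$Gr.Liv.Area: a simulated right-skewed population.
set.seed(123)
population <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
n <- 60
za <- qnorm(0.025, lower.tail = FALSE)  # critical value for 95% confidence
reps <- 2000                            # number of simulated "students"

# For each repetition, draw a sample, build the interval, and record
# whether the population mean falls inside it.
covered <- replicate(reps, {
  s <- sample(population, n)
  half_width <- za * sd(s) / sqrt(n)
  abs(mean(s) - mean(population)) <= half_width
})

# Share of intervals that contain the true mean.
mean(covered)
```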

  1. Looping to create many samples and confidence intervals:
    1. Create a code chunk that performs the following actions: First, create empty vectors of size 50 called samp_mean and samp_sd. Then create a variable called n that holds the sample size, 60.
    2. Now we’re ready for the loop where we calculate the means and standard deviations of 50 random samples. Create a for-loop that repeats 50 times. Inside the loop, first collect a sample of size n from the population. Store the sample in a variable called samp. Then compute the mean of samp and store it as the ith entry in the samp_mean vector you created. Finally, compute the standard deviation of samp and store it as the ith entry in the samp_sd vector.
samp_mean <- rep(0,50)
samp_sd <- rep(0,50)
n <- 60

for(i in 1:50){
   samp <- sample(population, n)
   samp_mean[i] <- mean(samp)
   samp_sd[i] <- sd(samp)
   }
  1. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.
za <- qnorm(0.025, lower.tail=FALSE)
lower_vector <- samp_mean - za * samp_sd / sqrt(n) 
upper_vector <- samp_mean + za * samp_sd / sqrt(n)
plot_ci(lower_vector,upper_vector,mean(population))

Solution: This will vary from run to run. However, with 50 intervals the proportion can never exactly match the confidence level, because 95% of 50 is 47.5, which is not a whole number. The closest possible results are 47 out of 50 (94%) and 48 out of 50 (96%).
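The plot shows which intervals miss, but the proportion can also be computed directly by comparing each pair of bounds with the population mean. This is a sketch that rebuilds the 50 intervals so it runs on its own; `population` is a simulated stand-in for `ames$Gr.Liv.Area` and `captured` is a name introduced here for illustration.

```r
# Hypothetical stand-in for ames$Gr.Liv.Area: a simulated right-skewed population.
set.seed(2024)
population <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
n <- 60

# Collect the mean and standard deviation of 50 samples of size n.
samp_mean <- rep(0, 50)
samp_sd <- rep(0, 50)
for (i in 1:50) {
  samp <- sample(population, n)
  samp_mean[i] <- mean(samp)
  samp_sd[i] <- sd(samp)
}

# 95% confidence bounds for each of the 50 samples.
za <- qnorm(0.025, lower.tail = FALSE)
lower_vector <- samp_mean - za * samp_sd / sqrt(n)
upper_vector <- samp_mean + za * samp_sd / sqrt(n)

# TRUE where an interval contains the population mean.
captured <- lower_vector <= mean(population) & mean(population) <= upper_vector

sum(captured)   # number of intervals that capture the mean
mean(captured)  # the same count as a proportion, always a multiple of 1/50
```

With the lab's own `lower_vector` and `upper_vector` already in the workspace, only the last three statements are needed.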

  1. More confidence intervals:
    1. Pick a new confidence level and use R to compute the appropriate critical value.
    2. Calculate 50 confidence intervals at the confidence level you chose. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected.
    3. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?
za <- qnorm(0.005, lower.tail=FALSE)
lower_vector_99 <- samp_mean - za * samp_sd / sqrt(n) 
upper_vector_99 <- samp_mean + za * samp_sd / sqrt(n)
plot_ci(lower_vector_99,upper_vector_99,mean(population))

Solution: I did a 99% confidence interval here. As expected, I often get about 49 out of 50 confidence intervals that contain the true population mean.
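The critical values above follow one pattern: for confidence level C, put (1 - C)/2 in each tail. A small helper makes it easy to try other levels. This is a sketch, and `critical_value` is a name introduced here, not something provided by the lab.

```r
# Critical value for a two-sided normal confidence interval.
# conf_level is a proportion strictly between 0 and 1, e.g. 0.95.
critical_value <- function(conf_level) {
  stopifnot(conf_level > 0, conf_level < 1)
  alpha <- 1 - conf_level              # total area left in the two tails
  qnorm(alpha / 2, lower.tail = FALSE)
}

# Critical values for three common confidence levels.
conf_levels <- c(0.90, 0.95, 0.99)
z_values <- sapply(conf_levels, critical_value)
round(z_values, 3)
## [1] 1.645 1.960 2.576
```

A higher confidence level gives a larger critical value and therefore wider intervals, which is why more of the 99% intervals capture the true mean.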