Lab 5: Confidence Intervals

Complete all questions, and submit your final document as a PDF on Canvas.

The Goal

Today, we will work with a data set of residential home sales in Ames, Iowa. We will use confidence intervals as a way to attempt to capture the true population parameter. We actually have the whole population, so we know what the population parameter is. We are going to take samples to explore properties of confidence intervals.

If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation is there in sizes of houses?”. If you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make inferences about what your population looks like.

The Data

For today, we are going to start with a simple random sample of size 60 from the population. We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just one of the variables: the above ground living area of the house in square feet. To save some effort throughout the lab, we're going to save all the population values of Gr.Liv.Area in a vector called population. We are then going to save our random sample in a vector called samp. Running all of the code below will load the data as well as create these two vectors.

load(url("http://www.openintro.org/stat/data/ames.RData"))
population <- ames$Gr.Liv.Area
set.seed(344)
samp <- sample(population, 60)

Confidence intervals

In our last lab, we used simulation methods to attempt to determine the variability of the sample mean. We drew many samples from the population, computed the sample mean of each, and then built the sampling distribution from these sample means. Through this, we obtained a measure of the center and spread of the sampling distribution, helping us use the samples to try to draw some conclusions about the population.

Today, we are going to use a different technique to connect the results of our single sample to the population value. Specifically, we are going to use confidence intervals. We recall that confidence intervals are ranges of plausible values where we believe the population parameter of interest lies. For today, that population parameter of interest is a population mean.

Recall that we can calculate and store the mean of the sample using

sample_mean <- mean(samp)

Return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it sample_mean). This value serves as a good point estimate for the population mean. However, in statistics, we always report more than just a sample mean. Specifically, it is critical to communicate how uncertain we are of that point estimate before we use it to make any decisions. This uncertainty can be captured by using a confidence interval.

We have learned that there are two types of confidence intervals. An approximate confidence interval is based on the fact that we tend to consider anything more than two standard deviations away from the mean unusual. Accordingly, we build an approximate 95% confidence interval by adding and subtracting 2 standard errors from our point estimate. We expect roughly 95% of sample means to fall within 2 standard errors of the population mean.

However, if we can conclude that the central limit theorem applies, we can build a more precise confidence interval. We can calculate a 95% confidence interval for the population mean by adding 1.96 standard errors to, and subtracting 1.96 standard errors from, the point estimate. In a normal distribution, 95% of values fall within 1.96 standard deviations of the mean, so about 95% of sample means fall within 1.96 standard errors of the population mean.
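
The value 1.96 is a rounded quantile of the standard normal distribution. As a small sketch, base R's qnorm can recover it for any confidence level. The helper name critical_value is our own illustration and is not part of the lab materials.

```r
# Hypothetical helper: two-sided critical value for a given confidence level.
critical_value <- function(level) {
  alpha <- 1 - level        # total probability left over in the two tails
  qnorm(1 - alpha / 2)      # upper-tail quantile of the standard normal
}

z95 <- critical_value(0.95)  # approximately 1.96
z90 <- critical_value(0.90)  # approximately 1.645
round(c(z95, z90), 3)

# Check: the central area between -z95 and z95 under the normal curve
central_area <- pnorm(z95) - pnorm(-z95)
central_area
```

The last line confirms that the area between the two critical values matches the stated confidence level.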

For the more precise confidence interval to be valid, the sample mean must be normally distributed and have standard error \(\sigma / \sqrt{n}\). Which of the following is not a condition needed for the central limit theorem to hold?

A) The sample is random.
B) The sample size, 60, is less than 10% of all houses.
C) The sample distribution must be nearly normal.

For this sample, does the central limit theorem apply? Explain. Based on this, what critical value will you use to build your confidence interval?

Regardless of whether we are building an approximate or a more precise confidence interval, our first step has to be to compute the standard error. We are going to be using R as a calculator here, but we will also be storing our values along the way. Keep an eye on your workspace and notice that we are storing the standard error and other values as we compute them. Let's start off by computing the standard error.

For now, we need to know the population standard deviation in order to compute the standard error. We will learn soon how we can still build confidence intervals when we do not have the population standard deviation! For now, luckily we have it. The population standard deviation is 505.5089. Because of this, we can compute the standard error of the sample mean to be

se <- 505.5089/sqrt(60)

The command sqrt computes the square root of whatever value is in the parentheses. You will notice that if you type se into a chunk and run it, the standard error you have just computed will appear. You have stored the standard error under the name se. Now that you have done that, you are able to use the standard error to compute your confidence interval.
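
To see how the formula \(\sigma / \sqrt{n}\) responds to the sample size, here is a brief sketch that evaluates it for a few sample sizes. The sizes 15 and 240 are hypothetical comparison values chosen for illustration; only n = 60 is used in this lab.

```r
sigma <- 505.5089               # population standard deviation given in the lab
n_values <- c(15, 60, 240)      # hypothetical sample sizes, each 4 times the last
se_values <- sigma / sqrt(n_values)

# One row per sample size, with the matching standard error
data.frame(n = n_values, se = round(se_values, 2))
```

Because n sits under a square root, each fourfold increase in the sample size cuts the standard error in half.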

If you have chosen to build a 95% confidence interval for the population mean, you will use the code:

lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
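
The same steps can be bundled into a small function, which is convenient if you want to try other sample means or sample sizes. This is only a sketch: the name z_interval and the sample mean of 1500 are made up for illustration, and the function uses the unrounded critical value from qnorm, so its bounds will differ very slightly from those computed with 1.96.

```r
# Hypothetical helper (not part of the lab): z-based confidence interval for a
# mean when the population standard deviation sigma is known.
z_interval <- function(x_bar, sigma, n, level = 0.95) {
  z_star <- qnorm(1 - (1 - level) / 2)  # critical value for the chosen level
  se <- sigma / sqrt(n)                 # standard error of the sample mean
  c(lower = x_bar - z_star * se, upper = x_bar + z_star * se)
}

# Illustrative call with a made-up sample mean of 1500 square feet
ci <- z_interval(x_bar = 1500, sigma = 505.5089, n = 60)
ci
```

In the lab itself you would pass sample_mean as x_bar instead of the made-up value.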

State and interpret your confidence interval.

Based on our interval, is it plausible that the true population mean could be 1400? What about 1700? Justify using your interval.

This is an important inference that we’ve just made: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the values lower and upper.

What does “95% confidence” mean?

A) 95% of the time the true average area of houses in Ames, Iowa, will be in this interval.
B) 95% of random samples of size 60 will yield confidence intervals that contain the true average area of houses in Ames, Iowa.
C) 95% of the houses in Ames have an area in this interval.
D) 95% confident that the sample mean is in this interval.

In this case we have the luxury of knowing the true population mean since we have data on the entire population. This value can be calculated using the following command:

mean(population)

Does your confidence interval capture the true average size of houses in Ames? If so, is the true value near the center of your interval, closer to the lower bound or closer to the upper bound?

Understanding Confidence levels

What proportion of 95% confidence intervals would you expect to capture the true population mean?

It's great to state this, but can we actually check it? In this particular case, as we have the population available to us, we can! We can take 100 samples, compute a sample mean and 95% confidence interval for each, and check to see whether or not each confidence interval catches the true mean. The proportion of these confidence intervals that capture the true mean should be roughly 0.95. Let's find out.

A special kind of computing structure, called a loop, comes in handy here. Here is the rough outline of what we are about to do:

Step 1: Obtain a random sample of size 60 from the population.
Step 2: Calculate the mean of the sample.
Step 3: Use these statistics to calculate a confidence interval.
Step 4: Repeat steps (1)-(3) 100 times.

Our first task is to prepare for Steps 1-4 by creating an empty vector where we can save the mean that will be calculated from each sample. Remember that when you create objects in R, you are essentially creating an empty box of a specific size. You are then able to fill up the space in the box with whatever you choose. Creating an empty vector 100 units long means that you have created a box with space to store 100 numbers. We are going to create a "box" called samp_mean_store to store the sample means. Each time we compute a sample mean, we will store it in one of the slots in samp_mean_store. Running the code below creates the vector samp_mean_store.

samp_mean_store <- rep(NA, 100)
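
To make the "box" idea concrete, here is a tiny sketch with only three slots. The name box and the numbers stored in it are hypothetical and chosen just for illustration.

```r
box <- rep(NA, 3)    # three empty slots, each holding NA ("not available")
box[1] <- 10.5       # fill the first slot
box[3] <- 7          # fill the third slot; the second slot is still NA
box

sum(is.na(box))      # counts the slots that have not been filled yet
```

Starting with NA rather than 0 makes it easy to spot slots that were never filled.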

Now we’re ready for the loop where we calculate the means of 100 random samples.

set.seed(243)
for(i in 1:100){
  samp1 <- sample(population, 60) # obtain a sample of size n = 60 from the population
  samp_mean_store[i] <- mean(samp1)    # save the sample mean in the ith element of samp_mean_store
}
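
As an aside, base R's replicate function can express the same "repeat and store" pattern in one line, without creating the empty vector first. The sketch below uses a simulated stand-in population so that it runs on its own. The name toy_population, the gamma distribution, and the seed are all assumptions made for illustration and are not the Ames data.

```r
set.seed(1)  # arbitrary seed, used only so the sketch is reproducible

# Stand-in population (an assumption, NOT the Ames data); in the lab you
# would use the population vector created earlier instead.
toy_population <- rgamma(2000, shape = 9, scale = 165)

# replicate() evaluates the expression 100 times and collects the results
# into a vector, so no pre-allocated "box" is needed.
toy_means <- replicate(100, mean(sample(toy_population, 60)))

length(toy_means)
head(toy_means)
```

The explicit loop is still worth understanding, since it makes each step of the simulation visible.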

Now, we have all of our sample means. We can now construct the confidence intervals.

lower <- samp_mean_store - 1.96 * 505.5089 / sqrt(60)
upper <- samp_mean_store + 1.96 * 505.5089 / sqrt(60)

The lower bounds of these 100 confidence intervals are stored in lower, and the upper bounds are in upper. Now, we are going to use a special function called plot_ci to see how many of these confidence intervals capture the true population mean.

plot_ci(lower, upper, mean(population))

In this plot, the dashed vertical line represents the population mean. Each dot is a sample mean (there are 100 of them since we took 100 samples). For each dot, we have a horizontal bar that represents the 95% confidence interval around that sample mean. Any intervals that are red fail to capture the population mean.
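
If you are curious how a plot like this could be drawn, the following is a rough sketch using base R graphics. It is not the lab's plot_ci function, whose actual implementation may differ. The name plot_ci_sketch and the three toy intervals are hypothetical.

```r
# Hypothetical stand-in for the lab's plot_ci(); the real function may differ.
plot_ci_sketch <- function(lower, upper, mu) {
  k <- length(lower)
  misses <- upper < mu | lower > mu          # TRUE for intervals that miss mu
  cols <- ifelse(misses, "red", "black")
  # Empty plotting region wide enough for every interval and for mu
  plot(NA, xlim = range(c(lower, upper, mu)), ylim = c(1, k),
       xlab = "Interval", ylab = "Sample number")
  segments(lower, seq_len(k), upper, seq_len(k), col = cols)    # one bar per interval
  points((lower + upper) / 2, seq_len(k), pch = 20, col = cols) # midpoint = sample mean
  abline(v = mu, lty = 2)                    # dashed line at the population mean
  invisible(sum(misses))                     # quietly return the number of misses
}

# Three toy intervals; only the last one fails to contain mu = 2.5
n_missed <- plot_ci_sketch(lower = c(1, 2, 5), upper = c(3, 4, 7), mu = 2.5)
n_missed
```

The dot sits at the midpoint of each bar because every interval is built symmetrically around its sample mean.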

What proportion of confidence intervals include the true population mean? How does this compare to the confidence level?
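
Counting red intervals by eye works, but the proportion can also be computed directly. Here is a minimal sketch on a simulated stand-in population so that it runs on its own. All names beginning with toy_, the normal distribution, and the seed are assumptions made for illustration, not the Ames data.

```r
set.seed(2)  # arbitrary seed, used only so the sketch is reproducible

# Stand-in population (an assumption, NOT the Ames data)
toy_population <- rnorm(5000, mean = 1500, sd = 500)
mu <- mean(toy_population)
sigma <- sd(toy_population)  # sd() divides by n - 1; negligible for 5000 values

# 100 sample means, each from a random sample of size 60
toy_means <- replicate(100, mean(sample(toy_population, 60)))
toy_lower <- toy_means - 1.96 * sigma / sqrt(60)
toy_upper <- toy_means + 1.96 * sigma / sqrt(60)

# TRUE when an interval contains mu; the mean of TRUE/FALSE values is a proportion
covered <- toy_lower <= mu & mu <= toy_upper
mean(covered)
```

With the lab's own vectors, the analogous expression would compare lower and upper against mean(population).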

Changing Confidence levels

While 95% is perhaps the confidence level you will hear most often, there are other confidence levels we can choose. Let's try building a 99% confidence interval.

Will a 99% confidence interval be narrower or wider than our 95% confidence interval? Explain.

To change the confidence level of a confidence interval, we need to change the critical value used to compute the interval.

What is the appropriate critical value for a 99% confidence level?

Calculate and interpret a 99% confidence interval using the sample samp. Is the true population mean in this interval?

Now, we are going to repeat our simulation. Using the 100 samples we have already collected, we are going to compute 99% confidence intervals for each, and see how many of these intervals capture the true population mean. Use the code below, but replace the critical value 1.96 with the critical value needed for a 99% confidence interval.

lower <- samp_mean_store - 1.96 * 505.5089 / sqrt(60)
upper <- samp_mean_store + 1.96 * 505.5089 / sqrt(60)
plot_ci(lower, upper, mean(population))

What proportion of the confidence intervals contain the truth?

If the 99% confidence intervals capture the truth more often, why don't we always build 99% confidence intervals? Explain your response.

This lab was adapted by Nicole Dalzell from a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. That lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.

STA 111 Lab 5: Confidence Intervals
