Today, we will work with a data set of residential home sales in Ames, Iowa. We will use confidence intervals as a way to try to capture the true population parameter. We actually have the whole population, so we know what the population parameter is. We are going to take samples to explore the properties of confidence intervals.
If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation is there in sizes of houses?”. If you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make inferences about what your population looks like.
For today, we are going to start with a simple random sample of size 60 from the population. We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just one of the variables: the above ground living area of the house in square feet. To save some effort throughout the lab, we're going to save all the population values of Gr.Liv.Area in a vector called population. We are then going to save our random sample in a vector called samp. Running all of the code below will load the data and create these two vectors.
load(url("http://www.openintro.org/stat/data/ames.RData"))
population <- ames$Gr.Liv.Area
set.seed(344)
samp <- sample(population, 60)
In our last lab, we used simulation methods to attempt to determine the variability of the sample mean. We drew many samples from the population, computed the sample mean of each, and then built the sampling distribution from these sample means. Through this, we obtained a measure of the center and spread of the sampling distribution, helping us use the samples to try to draw some conclusions about the population.
Today, we are going to use a different technique to connect the results of our single sample to the population value. Specifically, we are going to use confidence intervals. We recall that confidence intervals are ranges of plausible values where we believe the population parameter of interest lies. For today, that population parameter of interest is a population mean.
Recall that we can calculate and store the mean of the sample using
sample_mean <- mean(samp)
Return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it sample_mean). This value serves as a good point estimate for the population mean. However, in statistics, we always report more than just a sample mean. Specifically, it is critical to communicate how uncertain we are of that point estimate before we use it to make any decisions. This uncertainty can be captured by using a confidence interval.
We have learned that there are two types of confidence intervals. An approximate confidence interval is based on the fact that we tend to consider anything more than two standard deviations away from the mean unusual. Accordingly, we build an approximate 95% confidence interval by taking our point estimate plus or minus 2 standard errors. We expect roughly 95% of sample means to fall within 2 standard errors of the population mean.
However, if we can conclude that the central limit theorem applies, we can build a more precise confidence interval. We can calculate a 95% confidence interval for a population mean by taking the point estimate plus or minus 1.96 standard errors. It turns out that almost exactly 95% of a normal distribution lies within 1.96 standard deviations of its mean.
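The two multipliers, 2 and 1.96, can be checked against the standard normal distribution. The short sketch below uses base R's pnorm and qnorm; the variable names are illustrative and are not used elsewhere in the lab.

```r
# Share of a standard normal distribution within 2 standard deviations of its mean
within_2 <- pnorm(2) - pnorm(-2)
within_2      # about 0.954, slightly more than 95%

# Share within 1.96 standard deviations of the mean
within_196 <- pnorm(1.96) - pnorm(-1.96)
within_196    # about 0.950

# Multiplier that leaves 2.5% in each tail of the standard normal
z_95 <- qnorm(0.975)
z_95          # about 1.96
```

This is why 2 gives an approximate 95% interval while 1.96 gives a more precise one.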
Regardless of whether we are building an approximate or a more precise confidence interval, our first step has to be to compute the standard error. We are going to be using R as a calculator here, but we will also be storing our values along the way. Keep an eye on your workspace and notice that we are storing the standard error and other values as we compute them. Let's start off by computing the standard error.
For now, we need to know the population standard deviation in order to compute the standard error. We will soon learn how to build confidence intervals when we do not have the population standard deviation! Luckily, this time we have it. The population standard deviation is 505.5089. Because of this, we can compute the standard error of the sample mean to be
se <- 505.5089/sqrt(60)
The command sqrt computes the square root of whatever value is in the parentheses. You will notice that if you type se into a chunk and run it, the standard error you have just computed will appear. You have stored the standard error under the name se. Now that you have done that, you are able to use the standard error to compute your confidence interval.
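To get a feel for how the standard error depends on sample size, the sketch below repeats the same calculation for a few sample sizes. The sizes 15 and 240 are hypothetical comparison values chosen for illustration; only n = 60 is used in this lab.

```r
sigma <- 505.5089                  # population standard deviation quoted in the lab
n_values <- c(15, 60, 240)         # hypothetical sample sizes; 60 is the lab's size
se_values <- sigma / sqrt(n_values)

# Each fourfold increase in n cuts the standard error in half
data.frame(n = n_values, se = se_values)
```

Larger samples give smaller standard errors, and therefore narrower confidence intervals at the same confidence level.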
If you have chosen to build a 95% confidence interval for the population mean, you will use the code:
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
This is an important inference that we’ve just made: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the values lower and upper.
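The interval calculation can also be wrapped in a small function so that it can be reused with other samples or confidence levels. The function z_interval below is a hypothetical helper, not part of the lab's data file, and the data it is run on is simulated stand-in data rather than the Ames sample, so that the sketch runs on its own.

```r
# Hypothetical helper: z-based confidence interval for a mean when sigma is known
z_interval <- function(x, sigma, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)   # critical value for the chosen level
  se <- sigma / sqrt(length(x))      # standard error of the sample mean
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Stand-in data (NOT the Ames sample), used only to make the sketch runnable
set.seed(1)
fake_sample <- rnorm(60, mean = 1500, sd = 500)
z_interval(fake_sample, sigma = 500)
```

With the lab's objects, z_interval(samp, sigma = 505.5089) should agree with c(lower, upper) up to the rounding of the critical value to 1.96.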
In this case we have the luxury of knowing the true population mean since we have data on the entire population. This value can be calculated using the following command:
mean(population)
It's great to say that we are 95% confident, but can we actually check that claim? In this particular case, as we have the population available to us, we can! We can take 100 samples, compute a sample mean and 95% confidence interval for each, and check whether or not each confidence interval catches the true mean. The proportion of these confidence intervals that capture the true mean should be roughly 0.95. Let's find out.
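A special kind of computing structure, called a loop, comes in handy here. Here is the rough outline of what we are about to do: (1) obtain a random sample of size 60 from the population; (2) calculate and store the sample mean; (3) repeat steps 1 and 2 until we have 100 sample means; (4) use the stored sample means to build a 95% confidence interval for each sample.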
Our first task is to prepare for Steps 1-4 by creating an empty vector where we can save the mean that will be calculated from each sample. Remember that when you create objects in R, you are essentially creating an empty box of a specific size. You are then able to fill up the space in the box with whatever you choose. Creating an empty vector 100 units long means that you have created a box with space to store 100 numbers. We are going to create a "box" called samp_mean_store to store the sample means. Each time we create a sample mean, we will store it in one of the slots in samp_mean_store. Running the code below creates the vector samp_mean_store.
samp_mean_store <- rep(NA, 100)
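To see the "box" idea on a small scale, here is a sketch with a vector of only 5 slots. The name box and the values stored in it are made up for illustration.

```r
box <- rep(NA, 5)   # five empty slots, each holding NA
box[1] <- 10        # fill the first slot
box[3] <- 30        # fill the third slot
box                 # slots 2, 4, and 5 are still NA
```

The loop in the next step fills samp_mean_store the same way, one slot per pass.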
Now we’re ready for the loop where we calculate the means of 100 random samples.
set.seed(243)
for(i in 1:100){
  samp1 <- sample(population, 60)     # obtain a sample of size n = 60 from the population
  samp_mean_store[i] <- mean(samp1)   # save sample mean in ith element of samp_mean_store
}
Now, we have all of our sample means. We can now construct the confidence intervals.
lower <- samp_mean_store - 1.96 * 505.5089 / sqrt(60)
upper <- samp_mean_store + 1.96 * 505.5089 / sqrt(60)
The lower bounds of these 100 confidence intervals are stored in lower, and the upper bounds are in upper. Now, we are going to use a special function called plot_ci to see how many of these confidence intervals capture the true population mean.
plot_ci(lower, upper, mean(population))
In this plot, the dashed vertical line represents the population mean. Each dot is a sample mean (there are 100 of them since we took 100 samples). For each dot, we have a horizontal bar that represents the 95% confidence interval around that sample mean. Any intervals that are red fail to capture the population mean.
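Counting red intervals by eye works, but the capture rate can also be computed directly. The sketch below is self-contained and runs on a simulated, right-skewed stand-in population (an assumption for illustration, NOT the Ames data); all names ending in _demo are hypothetical.

```r
set.seed(2)
# Stand-in population (NOT the Ames data): 3000 right-skewed values
pop_demo   <- rgamma(3000, shape = 9, scale = 165)
mu_demo    <- mean(pop_demo)   # "true" mean of the stand-in population
sigma_demo <- sd(pop_demo)     # standard deviation of the stand-in population
n_demo     <- 60

# Means of 100 random samples of size 60
means_demo <- replicate(100, mean(sample(pop_demo, n_demo)))
lower_demo <- means_demo - 1.96 * sigma_demo / sqrt(n_demo)
upper_demo <- means_demo + 1.96 * sigma_demo / sqrt(n_demo)

# TRUE where an interval contains the stand-in population mean
captured <- lower_demo <= mu_demo & mu_demo <= upper_demo
mean(captured)   # proportion of the 100 intervals that capture mu_demo
```

With the lab's own objects, the analogous check would be mean(lower <= mean(population) & mean(population) <= upper), which should come out close to 0.95.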
While 95% is perhaps the confidence level you will hear most often, there are other confidence levels we can choose. Let's try building a 99% confidence interval.
To change the confidence level of a confidence interval, we need to change the critical value used to compute the interval.
Build a 99% confidence interval for the population mean using the sample stored in samp. Is the true population mean in this interval?
Now, we are going to repeat our simulation. Using the 100 samples we have already collected, we are going to compute 99% confidence intervals for each, and see how many of these intervals capture the true population mean. Use the code below, but replace the critical value 1.96 with the critical value needed for a 99% confidence interval.
lower <- samp_mean_store - 1.96 * 505.5089 / sqrt(60)
upper <- samp_mean_store + 1.96 * 505.5089 / sqrt(60)
plot_ci(lower, upper, mean(population))
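To compare confidence levels side by side, the sketch below computes the critical value and the resulting margin of error for a few levels, using base R's qnorm. The 90% level and the variable names are added here only for comparison and are not part of the lab.

```r
sigma       <- 505.5089             # population standard deviation quoted in the lab
n           <- 60                   # sample size used in the lab
conf_levels <- c(0.90, 0.95, 0.99)  # 0.90 is included only for comparison

# Critical value: leaves (1 - level) / 2 in each tail of the standard normal
z_star <- qnorm(1 - (1 - conf_levels) / 2)
margin <- z_star * sigma / sqrt(n)  # half-width of each interval

data.frame(level = conf_levels, critical_value = z_star, margin_of_error = margin)
```

Higher confidence requires a larger critical value, so the intervals become wider and more of them should capture the population mean.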