Sampling from Ames, Iowa

If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straight forward to answer questions like, “How big is the typical house in Ames?” and “How much variation is there in sizes of houses?”. If you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make an inference on what your population looks like.

By now, we have quite a bit of practice coding in R. This lab doesn’t contain much code for you to copy and paste. Rather, you’ll be directed to create certain variables and/or vectors, but will need to look at old labs (or just remember) to figure out the appropriate commands.

The data

In the previous lab, “Sampling Distributions”, we looked at the population data of houses from Ames, Iowa. Let’s start by loading that data set.

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")

In this lab we’ll start with a sample from the population. Specifically, this is a simple random sample of size 60. Note that the data set has information on many housing variables, but we’ll focus on the size of the house, represented by the variable Gr.Liv.Area.

First, create two variables called population and samp. The variable population should contain all of the data in the Gr.Liv.Area variable. The variable samp should contain a random sample of size 60, taken from population. After creating those variables, complete the following exercises.

Now do Exercise 1.

Now do Exercise 2.

Confidence intervals

One of the most common ways to describe the typical or central value of a distribution is to use the mean. Create a variable called sample_mean that contains the mean of your sample.

Return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it sample_mean). This value serves as a good point estimate, but it would be useful to also communicate how uncertain we are of that estimate. This can be captured by using a confidence interval.

We can calculate a 95% confidence interval for a sample mean by adding and subtracting the margin of error. Recall that the margin of error for the sample mean is the \(z_{\alpha/2}\) value multiplied by the standard error, \(\sigma/\sqrt{n}\). Since \(\alpha = 0.05\) for a 95% confidence interval, we want \(z_{0.025}\). We can find this value using the qnorm function. Enter the following command in the console.

qnorm(0.025, lower.tail=FALSE)

Without the lower.tail=FALSE part of the command, it would return a negative \(z\) value. Now that we know how to use R to compute critical values, we can create a confidence interval.

Now do Exercise 3.

This is an important inference that we’ve just made: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the values lower and upper. There are a few conditions that must be met for this interval to be valid.

Now do Exercise 4.

Confidence levels

We just created a 95% confidence interval, but what does 95% confidence actually mean? We’ll explore that question through the next few execises.

Usually, we use a sample mean because we don’t have access to the full population data. In this case we have the luxury of knowing the true population mean since we have data on the entire population. Compute the mean for the entire population, and then complete the following exercises.

Now do Exercise 5.

Now do Exercise 6.

Using R, we’re going to recreate many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here (We learned about loops in the Sampling Distribution lab).

Here is the rough outline:

Obtain a random sample.
Calculate and store the sample’s mean and standard deviation.
Repeat steps (1) and (2) 50 times.
Use these stored statistics to calculate many confidence intervals.

Now do Exercise 7.

Now that we have the mean and standard deviation for each of the 50 samples, we can construct 50 different confidence intervals. While we could do this in another loop, running commands with the whole vectors samp_mean and samp_sd (instead of with individual values inside) will result in full vectors as the output.

za <- qnorm(0.025, lower.tail=FALSE)
lower_vector <- samp_mean - za * samp_sd / sqrt(n) 
upper_vector <- samp_mean + za * samp_sd / sqrt(n)

Note that we should really be using t values instead of z values in the above code. This is because we’re using the sample standard deviation, not the population standard deviation. However, for samples of size 60, the difference between s and sigma should be small and the difference between the t value and corresponding z value is very small. To see this, you could type qt(0.025,59,lower.tail=FALSE) in the console. You’ll see a t value of 2.000995, a number that is less than 0.05 away from the za value we have been using.

After running the above commands, look over in your environment. You should see entries in the Values section called lower_vector and upper_vector, each of which contains 50 entries. Lower bounds of these 50 confidence intervals are stored in lower_vector, and the upper bounds are in upper_vector. Type the following command in the console to see the lower and upper values for the first confidence interval.

c(lower_vector[1], upper_vector[1])

The user-created function plot_ci (which was downloaded with the data set) plots all of the confidence intervals. In particular, it has three inputs: a vector of lower values, a vector of upper values, and the population mean. It creates a horizontal line for each confidence interval, with the sample mean shown as a black dot. It shows the population mean as a vertical dotted line.
Any confidence interval that does not contain the true population mean is highlighted in red.

Use the plot_ci function to plot all intervals. Then complete the following exercises.

Now do Exercise 8.

Now do Exercise 9.

Acknowledgements

This lab is a modification of a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.

MATH 106: Introduction to Statistics

Lab 7: Confidence Intervals

Sampling from Ames, Iowa

The data

Confidence intervals

Confidence levels

Acknowledgements