Sampling from Ames, Iowa

If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation is there in sizes of houses?”. If you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make an inference on what your population looks like.

By now, we have quite a bit of practice coding in R. This lab provides code for some of the more complex tasks, but for many steps you’ll be directed to create variables and/or vectors using commands you’ve learned in previous labs.

The data

In the previous lab, “Sampling Distributions”, we looked at the population data of houses from Ames, Iowa. Let’s start by loading that data set.

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")

In this lab we’ll start with a sample from the population. Specifically, this is a simple random sample of size 60. Note that the data set has information on many housing variables, but we’ll focus on the size of the house, represented by the variable Gr.Liv.Area.

First, create two variables called population and samp. The variable population should contain all of the data in the Gr.Liv.Area variable. The variable samp should contain a random sample of size 60, taken from population.

Note: if you want your results to stay the same every time you knit this document, add set.seed(123) (or any number you choose) on the line before your sample() call. This is good practice for reproducible work — you’ll see it used routinely in data science. If you don’t set a seed that is okay, but keep in mind that your samples will change every time you execute the code, including when you knit it to pdf.

# set.seed(123)   # Uncomment this line for reproducible results
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
sample_mean <- mean(samp)

After creating those variables, complete the following exercises.

Now do Exercise 1.

Now do Exercise 2.

Confidence intervals

One of the most common ways to describe the typical or central value of a distribution is to use the mean. Return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it sample_mean). This value serves as a good point estimate, but it would be useful to also communicate how uncertain we are of that estimate. This can be captured by using a confidence interval.

We can calculate a 95% confidence interval for a sample mean by adding and subtracting the margin of error. Recall that the margin of error for the sample mean is the \(z_{\alpha/2}\) value multiplied by the standard error, \(\hat{\sigma}/\sqrt{n}\), where \(\hat{\sigma}\) is the sample standard deviation sd(samp) — our best estimate of the unknown population standard deviation \(\sigma\). Since \(\alpha = 0.05\) for a 95% confidence interval, we want \(z_{0.025}\). We can find this value using the qnorm function. Enter the following command in the console.

qnorm(0.025, lower.tail=FALSE)

Without the lower.tail=FALSE part of the command, it would return a negative \(z\) value. Now that we know how to use R to compute critical values, we can create a confidence interval.

Now do Exercise 3.

This is an important inference that we’ve just made: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the values lower and upper. There are a few conditions that must be met for this interval to be valid.

Now do Exercise 4.

Confidence levels

We just created a 95% confidence interval, but what does 95% confidence actually mean? We’ll explore that question through the next few exercises.

Usually, we use a sample mean because we don’t have access to the full population data. In this case we have the luxury of knowing the true population mean since we have data on the entire population. Use the following code to compute the true population mean:

pop_mean <- mean(population)
pop_mean

Now do Exercise 5.

Now do Exercise 6.

Using R, we’re going to recreate many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here. The rough outline is:

Here is the basic structure of a for loop in R. Fill in the blanks (marked with ___) to complete the pattern — you will need to adapt this structure for Exercise 7.

# Initialize empty vectors to store results
___ <- rep(NA, ___)
___ <- rep(NA, ___)

# Set the sample size
n <- ___

# Loop
for (i in 1:___) {
  # Draw a random sample from your population
  samp <- sample(___, ___)
  
  # Store the mean and standard deviation of the sample
  ___[i] <- mean(samp)
  ___[i] <- sd(samp)
}

Note that rep(NA, ___) creates an empty vector of a specified length, pre-filled with missing values. This is good practice before a loop because it reserves the right amount of space for your results before you start filling them in.

Now do Exercise 7.

Now that we have the mean and standard deviation for each of the 50 samples, we can construct 50 different confidence intervals. While we could do this in another loop, running commands with the whole vectors samp_mean and samp_sd (instead of with individual values inside) will result in full vectors as the output.

za <- qnorm(0.025, lower.tail=FALSE)
lower_vector <- samp_mean - za * samp_sd / sqrt(n) 
upper_vector <- samp_mean + za * samp_sd / sqrt(n)

Note that we should really be using t values instead of z values in the above code — this is because we’re using the sample standard deviation, not the population standard deviation. You will learn about this distinction soon in class. However, for samples of size 60, the difference between s and sigma is small, and the difference between the t value and the corresponding z value is very small. (To see this, type qt(0.025, 59, lower.tail=FALSE) in the console — you’ll see a t value of 2.000995, less than 0.05 away from the za value we have been using.)

After running the above commands, look over in your environment. You should see entries in the Values section called lower_vector and upper_vector, each of which contains 50 entries. Lower bounds of these 50 confidence intervals are stored in lower_vector, and the upper bounds are in upper_vector. Type the following command in the console to see the lower and upper values for the first confidence interval.

c(lower_vector[1], upper_vector[1])

The function plot_ci (which was loaded with the data set) plots all of the confidence intervals. It takes three arguments: a vector of lower bounds, a vector of upper bounds, and the true population mean. It creates a horizontal line for each confidence interval with the sample mean shown as a black dot, and marks the population mean as a vertical dotted line. Any confidence interval that does not contain the true population mean is highlighted in red.

plot_ci(lower_vector, upper_vector, mean(population))

Now do Exercise 8.

Now do Exercise 9.

Acknowledgements

This lab is a modification of a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel. It has been edited by Ross Magi, Stefan Scremac, and Benjamin Jackson.