If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation is there in sizes of houses?”. If you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make an inference on what your population looks like.
By now, we have quite a bit of practice coding in R. This lab provides code for some of the more complex tasks, but for many steps you’ll be directed to create variables and/or vectors using commands you’ve learned in previous labs.
In the previous lab, “Sampling Distributions”, we looked at the population data of houses from Ames, Iowa. Let’s start by loading that data set.
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
In this lab we’ll start with a sample from the population.
Specifically, this is a simple random sample of size 60. Note that the
data set has information on many housing variables, but we’ll focus on
the size of the house, represented by the variable
Gr.Liv.Area.
First, create two variables called population and
samp. The variable population should contain
all of the data in the Gr.Liv.Area variable. The variable
samp should contain a random sample of size 60, taken from
population.
Note: if you want your results to stay the same every time you knit
this document, add set.seed(123) (or any number you choose)
on the line before your sample() call. This is
good practice for reproducible work — you’ll see it used routinely in
data science. If you don’t set a seed that is okay, but keep in mind
that your samples will change every time you execute the code, including
when you knit it to pdf.
# set.seed(123) # Uncomment this line for reproducible results
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
sample_mean <- mean(samp)
After creating those variables, complete the following exercises.
Now do Exercise 1.
Now do Exercise 2.
One of the most common ways to describe the typical or central value
of a distribution is to use the mean. Return for a moment to the
question that first motivated this lab: based on this sample, what can
we infer about the population? Based only on this single sample, the
best estimate of the average living area of houses sold in Ames would be
the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it
sample_mean). This value serves as a good point
estimate, but it would be useful to also communicate how uncertain
we are of that estimate. This can be captured by using a confidence
interval.
We can calculate a 95% confidence interval for a sample mean by
adding and subtracting the margin of error. Recall that the margin of
error for the sample mean is the \(z_{\alpha/2}\) value multiplied by the
standard error, \(\hat{\sigma}/\sqrt{n}\), where \(\hat{\sigma}\) is the sample
standard deviation sd(samp) — our best estimate of
the unknown population standard deviation \(\sigma\). Since \(\alpha = 0.05\) for a 95% confidence
interval, we want \(z_{0.025}\). We can
find this value using the qnorm function. Enter the
following command in the console.
qnorm(0.025, lower.tail=FALSE)
Without the lower.tail=FALSE part of the command, it
would return a negative \(z\) value.
Now that we know how to use R to compute critical values, we can create
a confidence interval.
Now do Exercise 3.
This is an important inference that we’ve just made: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the values lower and upper. There are a few conditions that must be met for this interval to be valid.
Now do Exercise 4.
We just created a 95% confidence interval, but what does 95% confidence actually mean? We’ll explore that question through the next few exercises.
Usually, we use a sample mean because we don’t have access to the full population data. In this case we have the luxury of knowing the true population mean since we have data on the entire population. Use the following code to compute the true population mean:
pop_mean <- mean(population)
pop_mean
Now do Exercise 5.
Now do Exercise 6.
Using R, we’re going to recreate many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here. The rough outline is:
Here is the basic structure of a for loop in R. Fill in
the blanks (marked with ___) to complete the pattern — you
will need to adapt this structure for Exercise 7.
# Initialize empty vectors to store results
___ <- rep(NA, ___)
___ <- rep(NA, ___)
# Set the sample size
n <- ___
# Loop
for (i in 1:___) {
# Draw a random sample from your population
samp <- sample(___, ___)
# Store the mean and standard deviation of the sample
___[i] <- mean(samp)
___[i] <- sd(samp)
}
Note that rep(NA, ___) creates an empty vector of a
specified length, pre-filled with missing values. This is good practice
before a loop because it reserves the right amount of space for your
results before you start filling them in.
Now do Exercise 7.
Now that we have the mean and standard deviation for each of the 50
samples, we can construct 50 different confidence intervals. While we
could do this in another loop, running commands with the whole vectors
samp_mean and samp_sd (instead of with
individual values inside) will result in full vectors as the output.
za <- qnorm(0.025, lower.tail=FALSE)
lower_vector <- samp_mean - za * samp_sd / sqrt(n)
upper_vector <- samp_mean + za * samp_sd / sqrt(n)
Note that we should really be using t values instead of z values in
the above code — this is because we’re using the sample standard
deviation, not the population standard deviation. You will learn about
this distinction soon in class. However, for samples of size 60, the
difference between s and sigma is small, and the difference between the
t value and the corresponding z value is very small. (To see this, type
qt(0.025, 59, lower.tail=FALSE) in the console — you’ll see
a t value of 2.000995, less than 0.05 away from the za value we have
been using.)
After running the above commands, look over in your environment. You
should see entries in the Values section called
lower_vector and upper_vector, each of which
contains 50 entries. Lower bounds of these 50 confidence intervals are
stored in lower_vector, and the upper bounds are in
upper_vector. Type the following command in the console to
see the lower and upper values for the first confidence interval.
c(lower_vector[1], upper_vector[1])
The function plot_ci (which was loaded with the data
set) plots all of the confidence intervals. It takes three arguments: a
vector of lower bounds, a vector of upper bounds, and the true
population mean. It creates a horizontal line for each confidence
interval with the sample mean shown as a black dot, and marks the
population mean as a vertical dotted line. Any confidence interval that
does not contain the true population mean is highlighted in red.
plot_ci(lower_vector, upper_vector, mean(population))
Now do Exercise 8.
Now do Exercise 9.
This lab is a modification of a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel. It has been edited by Ross Magi, Stefan Scremac, and Benjamin Jackson.