If you have access to data on an entire population, say the size of every house in Ames, Iowa, it’s straightforward to answer questions like, “How big is the typical house in Ames?” and “How much variation is there in sizes of houses?”. If you have access to only a sample of the population, as is often the case, the task becomes more complicated. What is your best guess for the typical size if you only know the sizes of several dozen houses? This sort of situation requires that you use your sample to make an inference about what your population looks like.
By now, we have quite a bit of practice coding in R. This lab doesn’t contain much code for you to copy and paste. Rather, you’ll be directed to create certain variables and/or vectors, but will need to look at old labs (or just remember) to figure out the appropriate commands.
In the previous lab, “Sampling Distributions”, we looked at the population data of houses from Ames, Iowa. Let’s start by loading that data set.
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
In this lab we’ll start with a sample from the population. Specifically, this is a simple random sample of size 60. Note that the data set has information on many housing variables, but we’ll focus on the size of the house, represented by the variable Gr.Liv.Area.
First, create two variables called population and samp. The variable population should contain all of the data in the Gr.Liv.Area variable. The variable samp should contain a random sample of size 60, taken from population. After creating those variables, complete the following exercises.
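As a minimal sketch of one way to create these two variables, assuming the data frame ames was loaded by the load() call above. The fallback data frame is a made-up stand-in (not the real Ames data) so the snippet still runs without the file, and the seed value is an arbitrary choice.

```r
# Assumes `ames` exists after load("ames.RData"). The fallback below is a
# simulated stand-in, NOT the real Ames data, used only so this runs alone.
if (!exists("ames")) {
  set.seed(1)
  ames <- data.frame(
    Gr.Liv.Area = round(rlnorm(2000, meanlog = 7.26, sdlog = 0.33))
  )
}

population <- ames$Gr.Liv.Area    # living area of every house in the data

set.seed(42)                      # optional: makes the sample reproducible
samp <- sample(population, 60)    # simple random sample, without replacement

length(samp)                      # number of houses in the sample
```

Because sample() draws at random, your sample (and every statistic computed from it) will differ from a classmate's unless you both set the same seed.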
Now do Exercise 1.
Now do Exercise 2.
One of the most common ways to describe the typical or central value of a distribution is to use the mean. Create a variable called sample_mean that contains the mean of your sample.
Return for a moment to the question that first motivated this lab: based on this sample, what can we infer about the population? Based only on this single sample, the best estimate of the average living area of houses sold in Ames would be the sample mean, usually denoted as \(\bar{x}\) (here we’re calling it sample_mean). This value serves as a good point estimate, but it would be useful to also communicate how uncertain we are of that estimate. This can be captured by using a confidence interval.
We can calculate a 95% confidence interval for a sample mean by adding and subtracting the margin of error. Recall that the margin of error for the sample mean is the \(z_{\alpha/2}\) value multiplied by the standard error, \(\sigma/\sqrt{n}\). Since \(\alpha = 0.05\) for a 95% confidence interval, we want \(z_{0.025}\). We can find this value using the qnorm function. Enter the following command in the console.
qnorm(0.025, lower.tail=FALSE)
Without the lower.tail=FALSE part of the command, it would return a negative \(z\) value.
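To see the effect of lower.tail directly, the short sketch below computes the critical value both ways. The names z_upper and z_lower are chosen here for illustration only.

```r
# Upper-tail critical value: the point with 2.5% of the area to its right.
z_upper <- qnorm(0.025, lower.tail = FALSE)   # about 1.96

# Default (lower.tail = TRUE): the point with 2.5% of the area to its left.
z_lower <- qnorm(0.025)                       # about -1.96

c(z_lower, z_upper)

# By symmetry of the normal curve the two values differ only in sign,
# and the upper-tail value equals the 97.5th percentile.
all.equal(z_upper, -z_lower)
all.equal(z_upper, qnorm(0.975))
```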
Now that we know how to use R to compute critical values, we can create
a confidence interval.
Now do Exercise 3.
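One possible way to carry out the interval calculation is sketched below. It assumes samp holds your 60 sampled house sizes and uses the sample standard deviation in place of \(\sigma\), which is an approximation. The names z and se are chosen here for illustration, and the fallback sample is simulated, not real Ames data.

```r
# Assumes `samp` already exists from the earlier step. The fallback is a
# simulated stand-in sample so that this snippet runs on its own.
if (!exists("samp")) {
  set.seed(1)
  samp <- round(rlnorm(60, meanlog = 7.26, sdlog = 0.33))
}

sample_mean <- mean(samp)                  # point estimate, x-bar
z  <- qnorm(0.025, lower.tail = FALSE)     # critical value for 95% confidence
se <- sd(samp) / sqrt(length(samp))        # s / sqrt(n), estimating sigma / sqrt(n)

lower <- sample_mean - z * se              # point estimate minus margin of error
upper <- sample_mean + z * se              # point estimate plus margin of error

c(lower, upper)
```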
This is an important inference that we’ve just made: even though we don’t know what the full population looks like, we’re 95% confident that the true average size of houses in Ames lies between the values lower and upper. There are a few conditions that must be met for this interval to be valid.
Now do Exercise 4.
We just created a 95% confidence interval, but what does 95% confidence actually mean? We’ll explore that question through the next few exercises.
Usually, we use a sample mean because we don’t have access to the full population data. In this case we have the luxury of knowing the true population mean since we have data on the entire population. Compute the mean for the entire population, and then complete the following exercises.
Now do Exercise 5.
Now do Exercise 6.
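Using R, we’re going to draw many samples to learn more about how sample means and confidence intervals vary from one sample to another. Loops come in handy here (we learned about loops in the Sampling Distributions lab).
Here is the rough outline: obtain a random sample of size 60 from the population; calculate and store the sample’s mean and standard deviation; repeat these two steps 50 times; then use the stored statistics to calculate many confidence intervals.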
Now do Exercise 7.
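The outline above can be sketched as a for loop. This is one possible version, assuming population was created earlier; the names samp_mean, samp_sd, and n match those used in the commands that follow, and the fallback population is simulated rather than the real Ames data.

```r
# Assumes `population` exists from the earlier step. The fallback is a
# simulated stand-in so that this snippet runs on its own.
if (!exists("population")) {
  set.seed(1)
  population <- round(rlnorm(2000, meanlog = 7.26, sdlog = 0.33))
}

n <- 60                       # size of each sample
samp_mean <- rep(NA, 50)      # empty vector to hold the 50 sample means
samp_sd   <- rep(NA, 50)      # empty vector to hold the 50 sample sds

for (i in 1:50) {
  samp <- sample(population, n)   # draw a fresh random sample of size n
  samp_mean[i] <- mean(samp)      # store the mean of sample i
  samp_sd[i]   <- sd(samp)        # store the standard deviation of sample i
}
```

Note that the loop overwrites samp on every pass, so after it finishes samp holds only the last of the 50 samples.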
Now that we have the mean and standard deviation for each of the 50 samples, we can construct 50 different confidence intervals. While we could do this in another loop, running commands with the whole vectors samp_mean and samp_sd (instead of with individual values inside) will result in full vectors as the output.
za <- qnorm(0.025, lower.tail=FALSE)
lower_vector <- samp_mean - za * samp_sd / sqrt(n)
upper_vector <- samp_mean + za * samp_sd / sqrt(n)
Note that we should really be using t values instead of z values in the above code. This is because we’re using the sample standard deviation, not the population standard deviation. However, for samples of size 60, the difference between s and sigma should be small and the difference between the t value and corresponding z value is very small. To see this, you could type qt(0.025,59,lower.tail=FALSE) in the console. You’ll see a t value of 2.000995, a number that is less than 0.05 away from the za value we have been using.
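The comparison between the two critical values can be checked with a few lines. The name ta is chosen here for illustration; the 59 degrees of freedom come from n - 1 with n = 60.

```r
za <- qnorm(0.025, lower.tail = FALSE)         # z critical value
ta <- qt(0.025, df = 59, lower.tail = FALSE)   # t critical value, df = 60 - 1

# Side-by-side view of the two critical values and the gap between them.
c(z = za, t = ta, difference = ta - za)

# The t value is slightly larger, so t-based intervals are slightly wider.
ta > za
```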
After running the above commands, look over in your environment. You should see entries in the Values section called lower_vector and upper_vector, each of which contains 50 entries. Lower bounds of these 50 confidence intervals are stored in lower_vector, and the upper bounds are in upper_vector. Type the following command in the console to see the lower and upper values for the first confidence interval.
c(lower_vector[1], upper_vector[1])
The user-created function plot_ci (which was downloaded with the data set) plots all of the confidence intervals. In particular, it has three inputs: a vector of lower values, a vector of upper values, and the population mean. It creates a horizontal line for each confidence interval, with the sample mean shown as a black dot. It shows the population mean as a vertical dotted line. Any confidence interval that does not contain the true population mean is highlighted in red.
Use the plot_ci function to plot all intervals. Then complete the following exercises.
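A hedged sketch of the plotting call, with a numeric check of how many intervals miss the population mean, is shown below. It assumes the argument order described above (lower values, upper values, population mean). The names pop_mean, missed, samps, and half_width are chosen here for illustration, and the fallback objects are simulated stand-ins so the snippet can run without the lab data.

```r
# Stand-ins so the snippet runs on its own; in the lab these objects already
# exist from the earlier steps and the two if-blocks are skipped.
if (!exists("population")) {
  set.seed(1)
  population <- round(rlnorm(2000, meanlog = 7.26, sdlog = 0.33))
}
if (!exists("lower_vector") || !exists("upper_vector")) {
  n <- 60
  samps <- replicate(50, sample(population, n))    # 60 x 50 matrix of samples
  za <- qnorm(0.025, lower.tail = FALSE)
  half_width <- za * apply(samps, 2, sd) / sqrt(n) # margin of error per sample
  lower_vector <- colMeans(samps) - half_width
  upper_vector <- colMeans(samps) + half_width
}

pop_mean <- mean(population)

# plot_ci is loaded with ames.RData; guard the call in case it is missing.
if (exists("plot_ci")) plot_ci(lower_vector, upper_vector, pop_mean)

# An interval misses when the population mean falls outside its bounds.
missed <- lower_vector > pop_mean | upper_vector < pop_mean
sum(missed)      # number of intervals that miss the population mean
mean(!missed)    # proportion of intervals that capture it
```

The count from sum(missed) should agree with the number of red intervals in the plot.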
Now do Exercise 8.
Now do Exercise 9.
This lab is a modification of a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.