In this lab, we investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its center and spread.
We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
Remember to also do this in a code chunk at the beginning of your R notebook!
We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice). To save some effort throughout the lab, create two variables with short names that represent these two variables. Do this in your R notebook also.
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
Let’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram. Then complete the following exercise.
summary(area)
hist(area, breaks=25)
Remember that the breaks = 25 argument specifies the number of bars in the histogram.
Now do Exercise 1.
In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.
If we were interested in estimating the mean living area in Ames based on a sample, we could use the following command to draw that sample from the population.
samp1 <- sample(area, 50)
This command collects a simple random sample of size 50 from the vector area and assigns it to samp1. This is like going into the City Assessor’s database and pulling up the files on 50 random home sales. Working with these 50 files would be considerably simpler than working with all 2930 home sales.
If we’re interested in estimating the average living area in homes in Ames using the sample, our best single guess is the sample mean.
mean(samp1)
Depending on which 50 homes you selected, your estimate could be a bit above or a bit below the true population mean of 1499.69 square feet. In general, though, the sample mean usually turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.
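Since we have the full population in hand (a luxury we rarely have in practice), you can check how close your particular sample came. A quick sketch, assuming samp1 was drawn as above:
mean(area)                 # population mean living area, about 1499.69 square feet
mean(samp1) - mean(area)   # how far this particular sample mean is from the population mean
50 / length(area)          # fraction of the population sampled, about 1.7%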
Now do Exercise 2.
Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way. The distribution of sample means, called the sampling distribution, can help us understand this variability. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps many times. Here we will generate 5000 samples and compute the sample mean of each.
sample_means50 <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50)
Note: If you are typing the above commands into the console, you can use shift+enter to go to a new line without evaluating any commands.
If you would like to adjust the bin width of your histogram to show a little more detail, you can do so by changing the breaks argument.
hist(sample_means50, breaks = 25)
Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called sample_means50. Below, we’ll review how this code works.
Now do Exercise 3.
The for loop

Let’s take a break from the statistics for a moment to let that last block of code sink in. You have just run your first for loop, a cornerstone of computer programming. The idea behind the for loop is iteration: it allows you to execute code as many times as you want without having to type out every iteration. In the case above, we wanted to iterate the two lines of code inside the curly braces that take a random sample of size 50 from area and then save the mean of that sample into the sample_means50 vector. Without the for loop, this would be painful:
sample_means50 <- rep(NA, 5000)
samp <- sample(area, 50)
sample_means50[1] <- mean(samp)
samp <- sample(area, 50)
sample_means50[2] <- mean(samp)
samp <- sample(area, 50)
sample_means50[3] <- mean(samp)
samp <- sample(area, 50)
sample_means50[4] <- mean(samp)
and so on…
With the for loop, these thousands of lines of code are compressed into the following handful of lines.
sample_means50 <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
Let’s consider this code line by line to figure out what it does. In the first line we initialized a vector. In this case, we created a vector of 5000 zeros called sample_means50. This vector will store values generated within the for loop. We’ll replace each of the 5000 zeros with the mean of a sample.
The second line calls the for loop itself. The syntax can be loosely read as, “for every element i from 1 to 5000, run the following lines of code”. You can think of i as the counter that keeps track of which loop you’re on. Therefore, more precisely, the loop will run once when i = 1, then once when i = 2, and so on up to i = 5000.
The body of the for loop is the part inside the curly braces, and this set of code is run for each value of i. Here, on every loop, we take a random sample of size 50 from area, take its mean, and store it as the \(i\)th element of sample_means50.
The for loop allows us to not just run the code 5000 times, but to neatly package the results, element by element, into the empty vector that we initialized at the outset.
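If the mechanics still feel abstract, a tiny toy loop (unrelated to the Ames data) makes the counter visible; on each pass it stores the square of i in the corresponding element of a vector.
squares <- rep(0, 5)
for(i in 1:5){
  squares[i] <- i^2
}
squares   # 1 4 9 16 25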
Now do Exercise 4.
Mechanics aside, let’s return to the reason we used a for loop: to compute a sampling distribution, specifically, this one.
hist(sample_means50)
The sampling distribution that we computed tells us much about estimating the average living area in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average living area of the population, and the spread of the distribution indicates how much variability is induced by sampling only 50 home sales.
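Because we happen to have the whole population, both claims can be checked numerically. A quick sketch, assuming sample_means50 was generated as above:
mean(sample_means50)   # should sit very close to mean(area), the population mean
sd(sample_means50)     # the variability induced by sampling only 50 home sales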
To get a sense of the effect that sample size has on our distribution, let’s build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100.
sample_means10 <- rep(0, 5000)
sample_means100 <- rep(0, 5000)
for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
Here we’re able to use a single for loop to build two distributions by adding additional lines inside the curly braces. Don’t worry about the fact that samp is used for the name of two different objects. In the second command of the for loop, the mean of samp is saved to the relevant place in the vector sample_means10. With the mean saved, we’re now free to overwrite the object samp with a new sample, this time of size 100. In general, anytime you create an object using a name that is already in use, the old object will get replaced with the new one. (Many of you have accidentally learned this fact in previous labs…)
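If that overwriting behaviour is new to you, here is a minimal illustration of reusing a name, with a throwaway object x:
x <- sample(area, 10)
mean(x)                  # mean of a sample of size 10
x <- sample(area, 100)   # the old x is discarded; x now holds a sample of size 100
mean(x)                  # mean of the new, larger sample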
To see the effect that different sample sizes have on the sampling distribution, plot the three distributions on top of one another.
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
par(mfrow = c(1,1))
The first command specifies that you’d like to divide the plotting area into 3 rows and 1 column of plots. We then plot 3 histograms, one in each of the rows of the plot. The final command returns to the default setting of making one plot at a time. The breaks argument specifies the number of bins used in constructing the histogram. The xlim argument specifies the range of the x-axis of the histogram, and by setting it equal to xlimits for each histogram, we ensure that all three histograms will be plotted with the same limits on the x-axis.
Now do Exercise 5.
So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price. Complete the following exercises.
Now do Exercise 6.
Now do Exercise 7.
Now do Exercise 8.
Now do Exercise 9.
The exercises above have been pointing us towards an amazing and surprising result in statistics called the Central Limit Theorem. The theorem says that for sufficiently large samples (the usual cutoff is 30 or bigger), the sampling distribution of sample means will look approximately normal. Specifically, it will look like a normal distribution whose mean is the same as the population mean \(\mu\) and whose standard deviation is equal to \(\sigma/\sqrt{n}\), the population standard deviation divided by the square root of the size of the sample. Amazingly, the sampling distribution is approximately normal even if the underlying population isn’t normally distributed. Let’s see how this plays out with our area data.
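Before relying on the theorem, it is worth checking its claim about the standard deviation numerically. A quick sketch, assuming sample_means50 was generated earlier, compares the observed spread of the simulated sample means with the value \(\sigma/\sqrt{n}\) that the theorem predicts:
sd(area) / sqrt(50)   # theoretical standard deviation of the sample mean, sigma/sqrt(n)
sd(sample_means50)    # observed spread of the 5000 simulated sample means
The two values should be quite close.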
To help with this analysis, the following function is provided. Copy this code into a code chunk in your R notebook and execute it.
plot_normal_overlay <- function(data_vector) {
  # Capture the name of the vector passed in, for use in the plot labels
  vector_name <- deparse(substitute(data_vector))
  # Extend the x-axis slightly beyond the range of the data
  lb <- min(data_vector) - 100
  ub <- max(data_vector) + 100
  # Compute the histogram without plotting it, to find a suitable y-axis limit
  list_histo <- hist(data_vector, plot = FALSE)
  y_max <- max(list_histo$density) + max(list_histo$density)/10
  # Draw a density-scale histogram...
  hist(data_vector, probability = TRUE, ylim = c(0, y_max),
       main = paste("Histogram of", vector_name), xlab = vector_name)
  # ...and overlay a normal curve with the same mean and standard deviation as the data
  x <- lb:ub
  y <- dnorm(x = x, mean = mean(data_vector), sd = sd(data_vector))
  lines(x = x, y = y, col = 'red')
}
When you execute the above code block, nothing visible will happen except that a new function named plot_normal_overlay will be added to your workspace.
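As an illustration of the calling syntax, you could pass the function any of the vectors created earlier, for example:
plot_normal_overlay(sample_means50)
This draws a density-scale histogram of sample_means50 with a normal curve of matching mean and standard deviation overlaid in red.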
Now do Exercise 10.
This lab is a modification of a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.