Foundations for statistical inference - Sampling distributions

In this lab, we investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its center and spread.

The data

We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.

library(mosaic)
library(oilabs)
data(ames)
head(ames)
# rename two variables to make life easier
ames <- ames %>%
  rename(area = Gr.Liv.Area) %>%
  rename(price = SalePrice)

We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (area) and the sale price (price).

We can explore the distribution of areas of homes in the population of home sales visually and with summary statistics. Let’s first create a visualization, a histogram:

histogram(~area, data = ames)

Let’s also obtain some summary statistics. Note that we can do this using the favstats function. Finding these values is useful for describing the distribution, as we can use them for descriptions like “the middle 50% of the homes have areas between such and such square feet”.

favstats(~area, data = ames)
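
To see where a statement like “the middle 50% of the homes” comes from, here is a minimal base R sketch. It assumes a simulated, right-skewed stand-in for the living areas (the name area_sim and the rlnorm parameters are hypothetical, chosen only so the code runs without the ames data); the first and third quartiles are the same Q1 and Q3 that favstats reports.

```r
set.seed(7)
# Hypothetical stand-in for the living areas; NOT the real Ames data
area_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# First and third quartiles: the bounds of the middle 50% of the values
q <- quantile(area_sim, probs = c(0.25, 0.75))

# Width of the middle 50%, i.e. the interquartile range
iqr_width <- unname(q[2] - q[1])

# Fraction of simulated homes that fall between the two quartiles
middle_share <- mean(area_sim >= q[1] & area_sim <= q[2])
middle_share
```

With the real data, the two quartiles printed by favstats play the role of q here.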

Describe this population distribution using the visualization and the summary statistics. You don’t have to use all of the summary statistics in your description; you will need to decide which ones are relevant based on the shape of the distribution. Make sure to include the plot and the summary statistics output in your report along with your narrative.

The unknown sampling distribution

In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.

If we are interested in estimating the mean living area in Ames based on a sample, we can use the following command to survey the population.

samp1 <- ames %>%
  sample_n(50)

The sample_n function collects a simple random sample of 50 rows from the ames dataset, which is assigned to samp1. This is like going into the City Assessor’s database and pulling up the files on 50 random home sales. Working with these 50 files would be considerably simpler than working with all 2930 home sales.

Describe the distribution of area in this sample. How does it compare to the distribution of the population? Hint: the sample_n function takes a random sample of observations (i.e. rows) from the dataset, so you can still refer to the variables in the dataset by the same names. Code you used in the previous exercise will also be helpful for visualizing and summarizing the sample; however, be careful not to label values mu and sigma anymore, since these are sample statistics, not population parameters. You can customize the labels of any of the statistics to indicate that they come from the sample.

If we’re interested in estimating the average living area in homes in Ames using the sample, our best single guess is the sample mean.

mean(~area, data = samp1)

Depending on which 50 homes you selected, your estimate could be a bit above or a bit below the true population mean of 1499.69 square feet. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which do you think would provide a more precise estimate of the population mean?

Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way. The distribution of sample means, called the sampling distribution, can help us understand this variability. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps many times. Here we will generate 2000 samples and compute the sample mean of each. Note that since we are sampling with replacement, we use the resample function instead of sample_n.

sample_means50 <- do(2000) * mean(~area, data = resample(ames, 50))
histogram(~mean, data = sample_means50)

Here we use R to take 2000 samples of size 50 from the population, calculate the mean of each sample, and store the results in a data frame called sample_means50. In the next section, we’ll review how this set of code works.
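
The difference between sampling without replacement (the default for sample_n) and with replacement (what resample does) can be sketched with base R's sample function. This is only an illustration on a hypothetical toy population of ten values, not the Ames data.

```r
set.seed(1)
# Hypothetical toy population of 10 distinct values (not the Ames data)
toy_pop <- 1:10

# Without replacement: each value can be drawn at most once
without_repl <- sample(toy_pop, size = 10, replace = FALSE)

# With replacement: a value goes back into the pool after being drawn,
# so the same value may appear more than once
with_repl <- sample(toy_pop, size = 10, replace = TRUE)

# anyDuplicated() returns 0 when there are no repeated values
anyDuplicated(without_repl)  # 0: repeats are impossible here
```

Drawing all ten values without replacement simply reshuffles the population, whereas drawing with replacement will usually repeat some values and miss others.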

How many rows are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 10,000 sample means?

Interlude: The `do` function

Let’s take a break from the statistics for a moment to let that last block of code sink in. The idea behind the do function is repetition: it allows you to execute a line of code as many times as you want and put the results in a data frame. In the case above, we wanted to repeatedly take a random sample of 50 homes from ames, compute the mean of area in that sample, and save each mean as a row of the sample_means50 data frame.

Without the do function, this would be painful. First, we’d have to create an empty vector filled with NAs to hold the 2000 sample means. Then, we’d have to compute each of the 2000 sample means one line at a time, putting them individually into the slots of the sample_means50 vector:

sample_means50 <- rep(NA, 2000)

sample_means50[1] <- mean(~area, data = resample(ames, 50))
sample_means50[2] <- mean(~area, data = resample(ames, 50))
sample_means50[3] <- mean(~area, data = resample(ames, 50))
sample_means50[4] <- mean(~area, data = resample(ames, 50))
# ...and so on, 2000 times

With the do function, these thousands of lines of code are compressed into one line:

sample_means50 <- do(2000) * mean(~area, data = resample(ames, 50))

Note that for each of the 2,000 times we computed a mean, we did so from a different sample!
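
For comparison, the same kind of repetition can be sketched in base R with replicate, which evaluates an expression a given number of times and collects the results in a vector rather than a data frame. The sketch below assumes a simulated stand-in population (pop_area and its rlnorm parameters are hypothetical) so that it runs without the ames data or the mosaic package.

```r
set.seed(2024)
# Hypothetical stand-in for ames$area: a right-skewed simulated population
pop_area <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))

# replicate() re-evaluates the expression 2000 times; each evaluation
# draws a fresh sample of size 50 with replacement and takes its mean
sample_means50_base <- replicate(
  2000,
  mean(sample(pop_area, size = 50, replace = TRUE))
)

# One sample mean per repetition
length(sample_means50_base)  # 2000
```

The do function stores its results as a column named mean in a data frame, which is why the histogram calls in this lab use ~mean; the base R version returns a plain numeric vector instead.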

To make sure you understand what the resample and do functions do, try modifying the code to take only 25 sample means from samples of size 10, and put them in a data frame named sample_means_small. Print the output. How many elements are there in this object called sample_means_small? What does each element represent?

Sample size and the sampling distribution

Mechanics aside, let’s return to the reason we used the do function: to compute a sampling distribution, specifically, this one.

histogram(~mean, data = sample_means50)

The sampling distribution that we computed tells us much about estimating the average living area in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average living area of the population, and the spread of the distribution indicates how much variability is induced by sampling only 50 home sales. The standard deviation of this sampling distribution is called the standard error.

In the remainder of this section we will work on getting a sense of the effect that sample size has on our sampling distribution.
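
As a rough check on the idea of a standard error, the sketch below compares the standard deviation of simulated sample means with the textbook value sigma / sqrt(n) for samples drawn with replacement. It assumes a simulated stand-in population (pop_area and its rlnorm parameters are hypothetical, not the Ames data) and uses base R only.

```r
set.seed(42)
# Hypothetical right-skewed population standing in for the living areas
pop_area <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
n <- 50

# 2000 sample means, each from a sample of size n drawn with replacement
sample_means <- replicate(2000, mean(sample(pop_area, size = n, replace = TRUE)))

# Population standard deviation: divide by N (not N - 1), because
# pop_area is treated as the entire population
sigma <- sqrt(mean((pop_area - mean(pop_area))^2))

empirical_se   <- sd(sample_means)   # spread of the simulated sampling distribution
theoretical_se <- sigma / sqrt(n)    # textbook standard error of the mean

c(empirical = empirical_se, theoretical = theoretical_se)
```

The two numbers should be close but not identical, since the empirical value is itself based on a finite number of simulated samples.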

Step 1: Create three sampling distributions, each containing 2,000 sample means, with sample means coming from samples of size \(n = 10\), \(n = 50\), and \(n = 100\).

sample_means10 <- do(2000) * mean(~area, data = resample(ames, 10))
sample_means50 <- do(2000) * mean(~area, data = resample(ames, 50))
sample_means100 <- do(2000) * mean(~area, data = resample(ames, 100))

Step 2: Plot these three sampling distributions on top of each other on the same scale so that we can easily compare their shapes, centers, and spreads to each other.

– Step 2a: Combine these three sampling distributions (three \(2000 \times 1\) data frames) into one \(6000 \times 1\) data frame. Note that we’re just doing this for ease in plotting later. To combine data frames by row, we use the bind_rows function.

sampling_dist <- bind_rows(sample_means10, sample_means50, sample_means100)

– Step 2b: Add a new column called sample_size to the data frame you just created that indicates the sample size that each case (each sample mean) came from. Remember the first 2,000 sample means came from samples of size 10, the next 2,000 sample means came from samples of size 50, and the last 2,000 sample means came from samples of size 100. Hence, this new variable is simply \(10\) repeated 2,000 times, followed by \(50\) repeated 2000 times, followed by \(100\) repeated 2,000 times. The use of the factor function will ensure that R considers this to be a categorical variable, and not a numerical one. Also remember that we use the mutate function to create new variables in data frames.

sampling_dist <- sampling_dist %>%
  mutate(sample_size = factor(c(rep(10, 2000), rep(50, 2000), rep(100, 2000))))

Type View(sampling_dist) to see sampling_dist’s contents.
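
As an aside, the sample_size column can be written more compactly with the each argument of rep. The short base R sketch below checks that the two forms produce the same factor; the names long_form and short_form are ours, introduced only for this comparison.

```r
# The column as built in Step 2b, written out explicitly
long_form <- factor(c(rep(10, 2000), rep(50, 2000), rep(100, 2000)))

# Shorthand: repeat each sample size 2000 times, in order
short_form <- factor(rep(c(10, 50, 100), each = 2000))

# Both forms give the same factor
identical(long_form, short_form)  # TRUE

# Count of cases at each sample size: 2000 per level
size_counts <- table(short_form)
size_counts
```

A quick table like this is also a handy way to confirm that each sample size labels exactly 2,000 rows before plotting.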

– Step 2c: Finally, draw three histograms or densityplots representing these three sampling distributions. We can do this by using the | operator to create a separate facet in the plot for each of the three histograms. Remember that we identify the distributions with the sample_size variable we created earlier.

histogram(~mean | sample_size, data = sampling_dist, layout=c(1,3))

When the sample size is larger, what happens to the center? What about the spread? Make sure to include the plots in your answer.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.
