Foundations for statistical inference - Sampling distributions

In this lab, we investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its center, spread, and shape.

Setting a seed: We will take some random samples and build sampling distributions in this lab, which means you should set a seed at the beginning of your lab.
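As a minimal sketch of what setting a seed does, the lines below draw the same random sample twice using base R's set.seed. The seed value 1234 is arbitrary and chosen only for illustration; any integer works.

```r
# Setting the same seed before a random draw reproduces that draw exactly.
set.seed(1234)              # 1234 is an arbitrary, illustrative seed
draw_a <- sample(1:100, 5)

set.seed(1234)              # reset to the same seed
draw_b <- sample(1:100, 5)

identical(draw_a, draw_b)   # TRUE: same seed, same sample
```

Without the second set.seed call, draw_b would almost certainly differ from draw_a, which is why a seed set once at the top of the lab makes the whole document reproducible.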

Getting Started

Load packages

In this lab we will explore the data using the dplyr package and visualize it using the ggplot2 package. Both are loaded as part of the tidyverse. The here package builds the file path to the data.

Let’s load the packages.

library(tidyverse)
library(here)

The data

We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population.

Let’s load the data. First download the ames.csv file, save it in your data folder, and then read it into RStudio.

ames <- read_csv(here('data', 'ames.csv'))

We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (area) and the sale price (price).

We can explore the distribution of areas of homes in the population of home sales visually and with summary statistics. Let’s first create a visualization.

Create a histogram of the area variable (set binwidth = 250).
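One possible shape for such a plot, sketched on made-up data so that it runs without the CSV file. The data frame fake_ames and its gamma-distributed values are assumptions for illustration only; in the lab you would pass the real ames data frame to ggplot instead.

```r
library(ggplot2)

set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for the ames data: 2930 made-up areas
fake_ames <- data.frame(area = rgamma(2930, shape = 9, rate = 0.006))

# Histogram with bins 250 square feet wide
p <- ggplot(fake_ames, aes(x = area)) +
  geom_histogram(binwidth = 250)
p
```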

Let’s also obtain some summary statistics. Note that we can do this using the summarize function. We can calculate as many statistics as we want using this function, and just combine the results. Some of the functions below should be familiar (like mean, median, sd, IQR, min, and max). A new function here is the quantile function which we can use to calculate values corresponding to specific percentile cutoffs in the distribution. For example quantile(x, 0.25) will yield the cutoff value for the 25th percentile (Q1) in the distribution of x. Finding these values is useful for describing the distribution, as we can use them for descriptions like “the middle 50% of the homes have areas between such and such square feet”.

ames %>%
  summarize(mu = mean(area), pop_med = median(area), 
            sigma = sd(area), pop_iqr = IQR(area),
            pop_min = min(area), pop_max = max(area),
            pop_q1 = quantile(area, 0.25),  # first quartile, 25th percentile
            pop_q3 = quantile(area, 0.75))  # third quartile, 75th percentile
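To see how quantile supports a "middle 50%" statement, here is a small sketch on a toy vector. The five values are invented for illustration and are not taken from the Ames data.

```r
# Toy vector of hypothetical areas (not from the Ames data)
x <- c(800, 1000, 1200, 1500, 2100)

q1 <- quantile(x, 0.25)   # 25th percentile
q3 <- quantile(x, 0.75)   # 75th percentile

# The middle 50% of values lie between q1 and q3;
# the width of that interval is the interquartile range.
c(q1 = unname(q1), q3 = unname(q3), iqr = IQR(x))
# q1 = 1000, q3 = 1500, iqr = 500
```

With R's default quantile method, the middle 50% of these toy values fall between 1000 and 1500 square feet.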

Describe this population distribution based on the visualization above and these summary statistics. You don’t have to use all of the summary statistics in your description; you will need to decide which ones are relevant based on the shape of the distribution.

The unknown sampling distribution

In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that sample to understand the properties of, or to infer something about, the population.

If we are interested in estimating the mean living area of houses in Ames based on a sample, we can use the sample_n command to survey the population.

Use sample_n to select a random sample of 50 houses from our data frame. Store the results in a new variable called samp1.

This is like going into the City Assessor’s database and pulling up the files on 50 random home sales. Working with these 50 files would be considerably simpler than working with all 2930 home sales.
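A minimal sketch of this step, run on a made-up population so that it works without the CSV file. The tibble fake_ames and the gamma parameters used to generate it are assumptions for illustration; in the lab you would sample from the real ames data frame.

```r
library(dplyr)

set.seed(2023)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in for the ames data frame: 2930 made-up areas.
# In the lab you would use the real `ames` data instead.
fake_ames <- tibble(area = round(rgamma(2930, shape = 9, rate = 0.006)))

# Draw 50 rows at random (without replacement, the sample_n default)
samp1 <- fake_ames %>% sample_n(50)

# Sample statistics: labelled x_bar and s, not mu and sigma
samp1 %>% summarize(x_bar = mean(area), s = sd(area), n = n())
```

In recent versions of dplyr, sample_n is superseded by slice_sample(n = 50), which behaves the same way for this purpose.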

Describe the distribution of area in this sample. How does it compare to the distribution of the population? Hint: the sample_n function takes a random sample of observations (i.e. rows) from the dataset, so you can still refer to the variables in the dataset with the same names. Code you used in the previous exercise will also be helpful for visualizing and summarizing the sample; however, be careful not to label values mu and sigma anymore, since these are sample statistics, NOT population parameters. You can change the labels of any of the statistics to indicate that these come from the sample.

If we’re interested in estimating the average living area in homes in Ames using the sample, our best single guess is the sample mean.

Calculate the mean area of the homes in this sample of 50.
Calculate the mean area of all the homes in our population.

Depending on which 50 homes you selected, your estimate could be a bit above or a bit below the true population mean. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.

Would you expect the mean of your sample to match the mean of another classmate’s sample? Why, or why not? If the answer is no, would you expect the means to just be somewhat different or very different? Confirm your answer by comparing with a classmate.
Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1?
Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean? Check your answer by taking the two samples and calculating the mean of each.

Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way. The distribution of sample means, called the sampling distribution (of the mean), can help us understand this variability. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps many times. Here we will generate 15,000 samples and compute the sample mean of each.

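Note that we specify replace = TRUE so that the draws within each sample are independent of one another. With only 50 draws from 2930 home sales, the result differs very little from sampling without replacement, which is what sample_n does by default.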

sample_means50 <- tibble(sample_means = 
                           replicate(15000, 
                                     mean(sample(ames$area, 50, replace = TRUE))))

Create a histogram of the results stored in sample_means50.

Here we use R to take 15,000 different samples (the replicate(15000, ..) part) of size 50 (the sample(..., 50, ...) part) from the population, calculate the mean of each sample (the mean(...) part), and store each result in a data frame called sample_means50.
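The same pattern can be sketched on a smaller scale. The population pop_area below is made up (a stand-in for ames$area), and 200 replications are used instead of 15,000 only to keep the sketch quick. Both choices are assumptions for illustration.

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Hypothetical population of 2930 areas (stand-in for ames$area)
pop_area <- rgamma(2930, shape = 9, rate = 0.006)

# 200 samples of size 50 keeps the sketch fast; the lab uses 15,000
means <- replicate(200, mean(sample(pop_area, 50, replace = TRUE)))

length(means)                 # 200: one sample mean per replication
c(center = mean(means),       # close to the population mean
  pop_mean = mean(pop_area))
```

replicate evaluates its second argument afresh on each repetition, so every one of the 200 means comes from a newly drawn sample.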

How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center.

Interlude: Sampling distributions

The idea behind the code above is repetition. Earlier we took a single sample of size n = 50 from the population of all houses in Ames. With the code above we are able to repeat this sampling procedure as many times as we’d like in order to build a distribution of a series of sample means, which is called the sampling distribution.

Note that in practice one rarely gets to build true sampling distributions, because we rarely have access to data from the entire population.

Note that for each of the 15,000 times we computed a mean, we did so from a different sample!

To make sure you understand how sampling distributions are built, try modifying the code to create a sampling distribution of 25 sample means from samples of size 10, and put them in a data frame named sample_means_small. Plot the results. How many observations are there in this object called sample_means_small? What does each observation represent?

Sample size and the sampling distribution

The sampling distribution that we computed tells us much about estimating the average living area in homes in Ames. The sampling distribution is centered at the true average living area of the population, and the spread of the distribution indicates how much variability is incurred by sampling only 50 home sales.

In the remainder of this section we will work on getting a sense of the effect that sample size has on our sampling distribution.

Use the code below to create sampling distributions of means of areas from samples of size 10, 50, and 100. Use 5,000 simulations. What does each observation in the sampling distribution represent? How do the mean, standard error (i.e. the standard deviation of the sampling distribution), and shape of the sampling distribution change as the sample size increases? For a sample size of 30, does the shape of the distribution change if you increase the number of simulations from 50 to 1050 in steps of 250?
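A minimal sketch of such a simulation, not the lab's own code. The population pop_area is made up (a stand-in for ames$area) and the helper name sim_means is hypothetical. The last column compares each simulated standard error with the textbook value sigma / sqrt(n) for independent draws.

```r
set.seed(7)  # arbitrary seed so the sketch is reproducible

# Hypothetical population (stand-in for ames$area)
pop_area <- rgamma(2930, shape = 9, rate = 0.006)

# Population standard deviation (divides by N, not N - 1)
sigma <- sqrt(mean((pop_area - mean(pop_area))^2))

# Hypothetical helper: n_sims sample means from samples of size n
sim_means <- function(n, n_sims = 5000) {
  replicate(n_sims, mean(sample(pop_area, n, replace = TRUE)))
}

sizes <- c(10, 50, 100)
sims <- lapply(sizes, sim_means)   # one vector of 5,000 means per size

results <- data.frame(
  n             = sizes,
  mean_of_means = sapply(sims, mean),   # center of each sampling distribution
  se_simulated  = sapply(sims, sd),     # spread of each sampling distribution
  se_theory     = sigma / sqrt(sizes)   # sigma / sqrt(n)
)
results
```

In this sketch the center stays near the population mean for every sample size, while the standard error shrinks as n grows. Changing n_sims alters how smooth the simulated distribution looks, not where it is centered or how wide it is.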

More Practice

So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.

Take a sample of size 15 from the population and calculate the mean price of the homes in this sample. Using this sample, what is your best point estimate of the population mean of prices of homes?
Since you have access to the population, simulate the sampling distribution of \(\overline{price}\) for samples of size 15 by taking 2000 samples of size 15 from the population and computing 2000 sample means. Store these means in a vector called sample_means15. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
Change your sample size from 15 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 15. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
Of the two sampling distributions calculated for price, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a sampling distribution with a large or small spread?

This is a modified version of a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.