Foundations for statistical inference - Sampling distributions

In this lab, we investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its distribution.

The data

We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.

load("more/ames.RData")

We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice). To save some effort throughout the lab, create two variables with short names that represent these two variables.

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

Let’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

Describe this population distribution.

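Answer: The population distribution is unimodal and right-skewed rather than normal: the mean (1500) exceeds the median (1442), and the maximum (5642) lies far above the third quartile (1743).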
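A quick numerical check on shape is to compare the mean with the median and the lengths of the two tails. In the summary above, the mean (1500) sits above the median (1442) and the maximum is far from the third quartile. The sketch below runs the same check on a synthetic right-skewed population; pop_sim is a hypothetical stand-in generated with rlnorm(), not the Ames data.

```r
# Synthetic right-skewed population standing in for `area`
# (an assumption: simulated values, not the Ames data).
set.seed(42)
pop_sim <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

# In a right-skewed distribution the long upper tail pulls the mean above the median.
pop_mean <- mean(pop_sim)
pop_median <- median(pop_sim)
mean_exceeds_median <- pop_mean > pop_median

# Compare tail lengths: (max - Q3) versus (Q1 - min).
q <- quantile(pop_sim, c(0.25, 0.75))
lower_tail <- q[[1]] - min(pop_sim)
upper_tail <- max(pop_sim) - q[[2]]

print(c(mean = pop_mean, median = pop_median))
print(c(lower_tail = lower_tail, upper_tail = upper_tail))
```

With the real data, the same comparison could be made by replacing pop_sim with area.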

set.seed(123)
samp1 <- sample(area, 50)
summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     808    1060    1444    1521    1798    3395

hist(samp1)

Describe the distribution of this sample. How does it compare to the distribution of the population?

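Answer: The mean of this sample (1521) is slightly higher than the population mean (1500). Its shape is a noisier version of the population's, with a slight right skew (the sample mean exceeds the sample median of 1444). The extra noise is because the sample, at 50 observations, is so much smaller than the population.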

mean(samp1)

## [1] 1520.62

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

samp2 <- sample(area, 50)
mean(samp2)

## [1] 1476.4

Answer: In this run, the mean of samp2 (1476.4) is lower than the mean of samp1 (1520.62). Ordinarily each run of this code draws a new sample with a new sample mean, but I added the line set.seed(123) so that the random samples stay the same on re-running. The larger the sample, the more closely its mean should hew to the population mean. A sample of N = 1000 is more likely to land close to the true mean than one of N = 100, but because each sample is random, it is possible that it would not.
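The claim about sample sizes of 100 and 1000 can be checked by simulation. The sketch below is a rough illustration on a synthetic population (pop_sim and the helper abs_error() are hypothetical names, not part of the lab): it measures how far the sample mean lands from the true mean for each sample size, and how often the larger sample is the closer of the two.

```r
# Synthetic right-skewed population (an assumption, not the Ames data).
set.seed(123)
pop_sim <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)
true_mean <- mean(pop_sim)

# Absolute error of the sample mean for one random sample of size n.
abs_error <- function(n) abs(mean(sample(pop_sim, n)) - true_mean)

# Repeat the experiment 2000 times for each sample size.
err100  <- replicate(2000, abs_error(100))
err1000 <- replicate(2000, abs_error(1000))

# The typical error shrinks as n grows ...
print(c(n100 = mean(err100), n1000 = mean(err1000)))

# ... but in a single head-to-head comparison the larger sample does not always win.
share_larger_wins <- mean(err1000 < err100)
print(share_larger_wins)
```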

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   }

hist(sample_means50)

hist(sample_means50, breaks = 25)

mean(sample_means50)

## [1] 1499.23

Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called sample_means50. On the next page, we’ll review how this set of code works.

How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

Answer: sample_means50 has 5000 elements, one per sample. The sampling distribution is approximately normal and centered at about 1500 (the simulated mean is 1499.23), essentially the population mean. Collecting 50,000 sample means would change little: the center and spread of the sampling distribution depend on the sample size of 50, not on the number of samples drawn, so the histogram would mainly look smoother and its mean would be expected to sit slightly closer to the true mean.
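To see what changes when more sample means are collected, the following sketch compares 5,000 and 50,000 sample means of size 50. It assumes a synthetic population (pop_sim) and a hypothetical helper sim_means(); neither comes from the lab.

```r
# Synthetic right-skewed population (an assumption, not the Ames data).
set.seed(2024)
pop_sim <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

# Draw `reps` samples of size n and return the vector of their means.
sim_means <- function(reps, n) replicate(reps, mean(sample(pop_sim, n)))

means_5k  <- sim_means(5000, 50)
means_50k <- sim_means(50000, 50)

# One element per simulated sample.
print(c(length(means_5k), length(means_50k)))

# Center and spread are driven by the sample size (50), so they barely move.
print(rbind(
  reps_5000  = c(center = mean(means_5k),  spread = sd(means_5k)),
  reps_50000 = c(center = mean(means_50k), spread = sd(means_50k))
))
```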

sample_means50 <- rep(NA, 5000)

samp <- sample(area, 50)
sample_means50[1] <- mean(samp)

samp <- sample(area, 50)
sample_means50[2] <- mean(samp)

samp <- sample(area, 50)
sample_means50[3] <- mean(samp)

samp <- sample(area, 50)
sample_means50[4] <- mean(samp)

and so on…

With the for loop, these thousands of lines of code are compressed into a handful of lines. We’ve added one extra line to the code below, which prints the variable i during each iteration of the for loop. Run this code.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   # print(i)  # uncomment to print the loop counter i on each iteration
   }
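The same simulation can also be written without an explicit loop. The sketch below uses base R's replicate() on a synthetic stand-in population (pop_sim is hypothetical, not the Ames data) and checks that, given the same seed, it reproduces the for loop's results.

```r
# Synthetic right-skewed population (an assumption, not the Ames data).
set.seed(1)
pop_sim <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)

# Loop version, as in the lab: preallocate, then fill one element per iteration.
set.seed(99)
loop_means <- rep(NA, 1000)
for (i in 1:1000) {
  samp <- sample(pop_sim, 50)
  loop_means[i] <- mean(samp)
}

# replicate() evaluates the expression 1000 times and collects the results.
set.seed(99)
rep_means <- replicate(1000, mean(sample(pop_sim, 50)))

# Resetting the seed makes both versions draw the same samples.
same_result <- isTRUE(all.equal(loop_means, rep_means))
print(same_result)
```

The loop makes each step visible, which suits a first pass; replicate() is shorter once the idea is familiar.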

To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

Answer: There are 100 elements in the vector sample_means_small, each one representing one sample mean drawn from a different random sample of area.

sample_means_small <- rep(0, 100)

for(i in 1:100){
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp) 
}

sample_means_small

##   [1] 1463.38 1427.88 1610.28 1463.06 1360.56 1408.22 1433.02 1521.14
##   [9] 1528.26 1513.94 1539.82 1626.32 1400.56 1535.46 1472.12 1481.70
##  [17] 1558.20 1455.66 1505.76 1434.90 1380.14 1331.14 1354.64 1460.60
##  [25] 1492.60 1514.86 1533.10 1529.56 1516.34 1498.74 1501.96 1479.36
##  [33] 1441.56 1431.00 1491.22 1572.42 1420.94 1464.06 1412.28 1594.86
##  [41] 1418.58 1466.76 1444.38 1571.82 1425.78 1534.66 1521.08 1611.02
##  [49] 1521.40 1428.92 1540.46 1451.22 1531.78 1455.32 1580.06 1577.18
##  [57] 1470.82 1383.42 1536.88 1540.78 1619.26 1470.48 1495.52 1562.30
##  [65] 1528.76 1543.78 1457.20 1477.66 1659.18 1569.00 1472.14 1484.30
##  [73] 1565.40 1440.92 1413.96 1508.28 1473.14 1479.24 1442.52 1463.74
##  [81] 1611.94 1372.94 1606.96 1528.76 1496.46 1553.46 1373.42 1440.48
##  [89] 1528.08 1541.38 1545.88 1503.40 1442.22 1487.66 1536.74 1666.74
##  [97] 1642.26 1559.60 1605.92 1575.18

Sample size and the sampling distribution

Mechanics aside, let’s return to the reason we used a for loop: to compute a sampling distribution, specifically, this one.

hist(sample_means50)

The sampling distribution that we computed tells us much about estimating the average living area in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average living area of the population, and the spread of the distribution indicates how much variability is induced by sampling only 50 home sales.

To get a sense of the effect that sample size has on our distribution, let’s build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100.

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

Here we’re able to use a single for loop to build two distributions by adding additional lines inside the curly braces. Don’t worry about the fact that samp is used for the name of two different objects. In the second command of the for loop, the mean of samp is saved to the relevant place in the vector sample_means10. With the mean saved, we’re now free to overwrite the object samp with a new sample, this time of size 100. In general, anytime you create an object using a name that is already in use, the old object will get replaced with the new one.

To see the effect that different sample sizes have on the sampling distribution, plot the three distributions on top of one another.

par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

The first command specifies that you’d like to divide the plotting area into 3 rows and 1 column of plots (to return to the default setting of plotting one at a time, use par(mfrow = c(1, 1))). The breaks argument specifies the approximate number of bins used in constructing the histogram; R treats a single number as a suggestion and may choose a slightly different count to get round break points. The xlim argument specifies the range of the x-axis of the histogram, and by setting it equal to xlimits for each histogram, we ensure that all three histograms will be plotted with the same limits on the x-axis.
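An alternative to resetting the layout by hand is to save the previous settings, which par() returns when it changes them, and restore them afterwards. The sketch below is one way to do that; x_wide and x_narrow are made-up vectors used only for illustration, and pdf(NULL) is there so the example runs without opening a plot window.

```r
# Null graphics device: plotting calls run, but nothing is drawn or saved.
pdf(NULL)

# Two made-up vectors with different spreads (hypothetical data).
set.seed(7)
x_wide   <- rnorm(500, mean = 0, sd = 3)
x_narrow <- rnorm(500, mean = 0, sd = 1)

# par() returns the previous values of the settings it changes.
old_par <- par(mfrow = c(2, 1))

# A shared x-axis range makes the two spreads directly comparable.
shared_xlim <- range(x_wide, x_narrow)
h_wide   <- hist(x_wide,   breaks = 20, xlim = shared_xlim)
h_narrow <- hist(x_narrow, breaks = 20, xlim = shared_xlim)

# `breaks = 20` is a suggestion; the actual bin counts can be inspected.
print(c(wide = length(h_wide$counts), narrow = length(h_narrow$counts)))

# Restore the layout that was in effect before.
par(old_par)
restored <- par("mfrow")
invisible(dev.off())
```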

When the sample size is larger, what happens to the center? What about the spread?

Answer: The centers are all fairly close to 1500, the true mean, regardless of sample size. The spread is noticeably wider for the smaller sample sizes and narrows as the sample size grows.
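The narrowing spread can be compared against the standard error formula. Because sample() draws without replacement, the sketch below uses the finite-population version, $\sigma/\sqrt{n}$ multiplied by $\sqrt{(N-n)/(N-1)}$, and checks it against simulated sampling distributions. It assumes a synthetic population (pop_sim), not the Ames data.

```r
# Synthetic right-skewed population (an assumption, not the Ames data).
set.seed(10)
pop_sim <- rlnorm(3000, meanlog = 7.25, sdlog = 0.35)
N <- length(pop_sim)

# Population standard deviation (divides by N, unlike sd()).
sigma <- sqrt(mean((pop_sim - mean(pop_sim))^2))

sizes <- c(10, 50, 100)

# Simulated spread: SD of 5000 sample means for each sample size.
sim_se <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(pop_sim, n))))
})

# Theoretical standard error with the finite population correction.
theory_se <- sigma / sqrt(sizes) * sqrt((N - sizes) / (N - 1))

print(data.frame(n = sizes, simulated = sim_se, theoretical = theory_se))
```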

On your own

So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.

Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

samp <- sample(price, 50)
mean(samp)

## [1] 182664.3

Answer: The best point estimate is the sample mean, which is $182,664.30.

Since you have access to the population, simulate the sampling distribution for $\bar{x}_{price}$ by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}

hist(sample_means50)

mean(sample_means50)

## [1] 180676.4

Answer: The sampling distribution is approximately normal and centered at about $180,000 (the mean of the 5000 sample means is $180,676.40), which should be roughly the population mean.

Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

sample_means150 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}

hist(sample_means150, breaks = 25)

mean(sample_means150)

## [1] 180750.6

Answer: Once again, we see an approximately normal distribution centered on about $180,000. (Its mean differs from the previous calculation by less than 1%.) To be more precise, the best guess is that the mean price is $180,750.60.

Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

Answer: The larger sample size produces a smaller spread in the sampling distribution, which we would prefer because it yields estimates that are more often close to the true mean.
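One way to make "more often close to the true value" concrete is to count the share of sample means that land within a fixed tolerance of the population mean. The sketch below does this for sample sizes of 50 and 150 on a synthetic stand-in for price; price_sim, the helper share_within(), and the 5% tolerance are all assumptions chosen for illustration.

```r
# Synthetic right-skewed stand-in for `price` (an assumption, not the Ames data).
set.seed(5)
price_sim <- rlnorm(3000, meanlog = 12, sdlog = 0.4)
true_mean <- mean(price_sim)

# Share of `reps` sample means of size n that fall within
# tol (as a fraction) of the population mean.
share_within <- function(n, tol = 0.05, reps = 5000) {
  means <- replicate(reps, mean(sample(price_sim, n)))
  mean(abs(means - true_mean) <= tol * true_mean)
}

within50  <- share_within(50)
within150 <- share_within(150)

# The narrower sampling distribution (n = 150) puts more of its mass near the truth.
print(c(n50 = within50, n150 = within150))
```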