Foundations for statistical inference - Sampling distributions

In this lab, we investigate the ways in which the statistics from a random sample of data can serve as point estimates for population parameters. We’re interested in formulating a sampling distribution of our estimate in order to learn about the properties of the estimate, such as its center, spread, and shape.

The data

We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.

load("more/ames.RData")

We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice). To save some effort throughout the lab, create two variables with short names that represent these two variables.

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

Let’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

Describe this population distribution.

Answer: The distribution is unimodal, and almost symmetric. The mean of 1500 SF is greater than the median of 1442 SF, from which we can infer that the distribution is slightly skewed to the right. Also the left tail is truncated, since housing areas have to be greater than zero.

The unknown sampling distribution

In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.

If we were interested in estimating the mean living area in Ames based on a sample, we could use the following command to survey the population.

samp1 <- sample(area, 50)

This command collects a simple random sample of size 50 from the vector area, which is assigned to samp1. This is like going into the City Assessor’s database and pulling up the files on 50 random home sales. Working with these 50 files would be considerably simpler than working with all 2930 home sales.
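Because sample() draws at random, every run returns a different set of homes, which is why the numbers quoted in the answers will not match yours. A minimal sketch of making a draw repeatable with set.seed() follows. It uses a simulated stand-in vector (`area_demo` is a hypothetical name and is not the Ames data), and the seed values are arbitrary.

```r
# Hypothetical stand-in for `area`: 2930 simulated right-skewed values (not the Ames data)
set.seed(2024)
area_demo <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))

# sample() draws without replacement by default, so no position is picked twice
set.seed(1)
idx_a  <- sample(length(area_demo), 50)   # 50 distinct positions in the population
samp_a <- area_demo[idx_a]                # the sampled values themselves

# Resetting the seed reproduces exactly the same draw
set.seed(1)
idx_b <- sample(length(area_demo), 50)

anyDuplicated(idx_a)      # 0 means no position was drawn twice
identical(idx_a, idx_b)   # TRUE: same seed, same sample
```

Calling set.seed() once at the top of a document is usually enough to make every later random draw repeatable when the document is knit again.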

Describe the distribution of this sample. How does it compare to the distribution of the population?

Answer: The sample distribution is multi-modal and asymmetric, with a pronounced skew toward the right. Compared to the population distribution, the sample distribution (based on the numbers from my last knitting of this document) has:
- A comparable median (1483 vs. 1442) and mean (1506 vs. 1500)
- A smaller range of values (800 min to 2514 max vs. 334 min to 5642 max)
- A comparable IQR (604 vs. 617)
- Less well-defined shape of the distribution; i.e., lumpier and less smooth than the population distribution.
```
summary(samp1)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     800    1190    1483    1506    1794    2514
```
```
hist(samp1)
```

If we’re interested in estimating the average living area in homes in Ames using the sample, our best single guess is the sample mean.

mean(samp1)

## [1] 1506.08

Depending on which 50 homes you selected, your estimate could be a bit above or a bit below the true population mean of 1499.69 square feet. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.
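As a quick check on the figures quoted above, the arithmetic can be sketched as follows. The variable names are made up for illustration; the numbers are the ones reported in this lab (a sample of 50 out of 2930 sales, a sample mean of 1506.08, and a population mean of 1499.69).

```r
# Figures quoted in the lab text
n_sample  <- 50        # homes in the sample
n_pop     <- 2930      # home sales in the population
samp_mean <- 1506.08   # mean(samp1) from the run shown above
pop_mean  <- 1499.69   # true population mean

sampling_fraction <- n_sample / n_pop    # share of the population surveyed
abs_error <- samp_mean - pop_mean        # how far the estimate landed from the truth
rel_error <- abs_error / pop_mean        # the same error as a share of the true mean

round(100 * sampling_fraction, 1)   # percent of the population sampled
round(abs_error, 2)                 # error in square feet
round(100 * rel_error, 2)           # error in percent
```

For this particular sample, surveying roughly 1.7% of the population gave an estimate within about 6.4 square feet of the true mean. A different sample would give a different error.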

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
Answer:
- The mean of samp2 is 1684, which is greater than the mean of samp1 at 1506.
- Generally, the larger the sample size, the closer the sample mean tends to be to the population mean, because larger samples produce less variable estimates. In this run, both samp100 (mean 1500) and samp1000 (mean 1514) are closer to the population mean than samp2 (mean 1684), although by chance the sample mean from samp100 happens to be closer than that from samp1000.
```
samp2 <- sample(area, 50)
summary(samp2)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     844    1339    1557    1684    1868    2978
```
```
samp100 <- sample(area, 100)
summary(samp100)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     848    1164    1449    1500    1679    4316
```
```
samp1000 <- sample(area, 1000)
summary(samp1000)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     438    1144    1458    1514    1734    5642
```

Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way. The distribution of sample means, called the sampling distribution, can help us understand this variability. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps many times. Here we will generate 5000 samples and compute the sample mean of each.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
}

hist(sample_means50)

If you would like to adjust the bin width of your histogram to show a little more detail, you can do so by changing the breaks argument.

hist(sample_means50, breaks = 25)

Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called sample_means50. In the interlude on the for loop below, we’ll review how this set of code works.
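The same simulation can also be written without an explicit loop. The sketch below uses base R's replicate(), which is an alternative idiom and not the approach this lab teaches. It runs on a simulated stand-in population (`area_demo` and `sample_means50_demo` are hypothetical names, not the Ames data).

```r
# Hypothetical stand-in for `area`: 2930 simulated right-skewed values (not the Ames data)
set.seed(123)
area_demo <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))

# replicate() evaluates the expression 5000 times and collects the results in a vector
sample_means50_demo <- replicate(5000, mean(sample(area_demo, 50)))

length(sample_means50_demo)   # one mean per simulated sample

# The center of the simulated sampling distribution next to the population mean
c(sampling_dist_center = mean(sample_means50_demo),
  population_mean      = mean(area_demo))
```

The for loop makes each step visible, which is why the lab uses it; replicate() simply hides the bookkeeping of creating and filling the result vector.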

How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
Answer:
- There are 5,000 elements in sample_means50.
- The sampling distribution is unimodal and nearly symmetric, and approximately resembles a normal distribution. The distribution is centered about 1500, with a median of 1498 and mean of 1500.
- If we collect 50,000 sample means, the distribution should appear smoother and more closely resemble a normal distribution centered about the population mean of 1500. Note, however, that between the 5,000 and the 50,000 sampling distributions, there is no change in the estimated median, mean, 1st quartile or 3rd quartile.
```
length(sample_means50)
```
```
## [1] 5000
```
```
summary(sample_means50)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1259    1451    1498    1500    1546    1814
```
```
# sample of 50,000 means
sample_means50K <- rep(NA, 50000)
for (i in 1:50000){
    samp <- sample(area, 50)
    sample_means50K[i] <- mean(samp)
}
hist(sample_means50K, breaks = 25)
```
```
summary(sample_means50K)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1243    1451    1498    1500    1546    1900
```
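The point that more sample means smooth the histogram without changing its center or spread can be checked numerically. This is a rough sketch on a simulated stand-in population (`area_demo`, `means_5k`, and `means_50k` are hypothetical names, not the Ames data).

```r
# Hypothetical stand-in for `area`: 2930 simulated right-skewed values (not the Ames data)
set.seed(314)
area_demo <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))

# Same sample size (50) in both cases; only the number of repetitions differs
means_5k  <- replicate(5000,  mean(sample(area_demo, 50)))
means_50k <- replicate(50000, mean(sample(area_demo, 50)))

# Compare center and spread of the two simulated sampling distributions
rbind(
  reps_5000  = c(mean = mean(means_5k),  sd = sd(means_5k)),
  reps_50000 = c(mean = mean(means_50k), sd = sd(means_50k))
)
```

The spread of a sampling distribution is governed by the size of each sample, not by how many samples we simulate, so the two rows should be nearly identical.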

Interlude: The `for` loop

Let’s take a break from the statistics for a moment to let that last block of code sink in. You have just run your first for loop, a cornerstone of computer programming. The idea behind the for loop is iteration: it allows you to execute code as many times as you want without having to type out every iteration. In the case above, we wanted to iterate the two lines of code inside the curly braces that take a random sample of size 50 from area, then save the mean of that sample into the sample_means50 vector. Without the for loop, this would be painful:

sample_means50 <- rep(NA, 5000)

samp <- sample(area, 50)
sample_means50[1] <- mean(samp)

samp <- sample(area, 50)
sample_means50[2] <- mean(samp)

samp <- sample(area, 50)
sample_means50[3] <- mean(samp)

samp <- sample(area, 50)
sample_means50[4] <- mean(samp)

and so on…

With the for loop, these thousands of lines of code are compressed into a handful of lines. We’ve added one extra line to the code below, which prints the variable i during each iteration of the for loop. Run this code.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   # comment out to reduce doc length
   # print(i)
}

Let’s consider this code line by line to figure out what it does. In the first line we initialized a vector. In this case, we created a vector of 5000 missing values (NA) called sample_means50. This vector will store the values generated within the for loop.

The second line calls the for loop itself. The syntax can be loosely read as, “for every element i from 1 to 5000, run the following lines of code”. You can think of i as the counter that keeps track of which loop you’re on. Therefore, more precisely, the loop will run once when i = 1, then once when i = 2, and so on up to i = 5000.

The body of the for loop is the part inside the curly braces, and this set of code is run for each value of i. Here, on every loop, we take a random sample of size 50 from area, take its mean, and store it as the $i^{th}$ element of sample_means50.

In order to display that this is really happening, we asked R to print i at each iteration. This line of code is optional and is only used for displaying what’s going on while the for loop is running.

The for loop allows us to not just run the code 5000 times, but to neatly package the results, element by element, into the empty vector that we initialized at the outset.
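To watch the mechanics on a scale small enough to read, here is a toy loop with five iterations. It is only an illustration: the vector name `squares` and the computation are made up and have nothing to do with the Ames data.

```r
# Start with a vector of 5 missing values (NA), one slot per iteration
squares <- rep(NA, 5)
print(squares)

for (i in 1:5) {
  squares[i] <- i^2   # fill the i-th slot with the square of the counter
  print(i)            # show which iteration is running
}

print(squares)   # every NA has been replaced by a computed value
```

Initializing with NA rather than 0 has a practical benefit: if the loop stops early or skips a slot, the leftover NA values make the problem obvious.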

To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

Answer:

- There are 100 elements in sample_means_small.
- Each element represents the sample mean of one of the 100 random samples of size 50 observations taken from the area population.

# initialize vector of NA's
sample_means_small <- rep(NA, 100)
for (i in 1:100){
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
}
sample_means_small

##   [1] 1508.48 1417.60 1567.46 1570.48 1662.18 1486.34 1635.84 1625.04
##   [9] 1500.80 1609.30 1453.76 1560.22 1459.38 1557.88 1353.28 1413.94
##  [17] 1596.92 1521.80 1536.24 1416.18 1541.68 1395.30 1430.78 1647.84
##  [25] 1590.54 1445.00 1576.74 1611.30 1489.42 1514.66 1608.46 1520.84
##  [33] 1369.88 1447.92 1491.34 1500.34 1680.48 1558.30 1609.24 1430.22
##  [41] 1539.18 1494.34 1551.82 1475.70 1568.40 1526.52 1423.06 1496.52
##  [49] 1560.74 1405.12 1567.16 1589.38 1600.42 1467.76 1436.24 1608.26
##  [57] 1417.82 1552.20 1491.60 1579.26 1407.60 1627.24 1463.04 1499.36
##  [65] 1504.76 1465.58 1618.78 1626.02 1502.28 1540.46 1588.52 1464.26
##  [73] 1457.36 1417.62 1601.78 1453.22 1449.16 1449.46 1439.76 1541.34
##  [81] 1424.80 1504.86 1554.78 1455.46 1591.18 1400.26 1516.66 1528.42
##  [89] 1520.38 1622.86 1492.06 1550.84 1537.58 1399.42 1456.08 1436.40
##  [97] 1518.04 1528.18 1580.34 1557.80

summary(sample_means_small)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1353    1455    1517    1516    1569    1680

hist(sample_means_small)

Sample size and the sampling distribution

Mechanics aside, let’s return to the reason we used a for loop: to compute a sampling distribution, specifically, this one.

hist(sample_means50)

The sampling distribution that we computed tells us much about estimating the average living area in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average living area of the population, and the spread of the distribution indicates how much variability is induced by sampling only 50 home sales.

To get a sense of the effect that sample size has on our distribution, let’s build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100.

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

Here we’re able to use a single for loop to build two distributions by adding additional lines inside the curly braces. Don’t worry about the fact that samp is used for the name of two different objects. In the second command of the for loop, the mean of samp is saved to the relevant place in the vector sample_means10. With the mean saved, we’re now free to overwrite the object samp with a new sample, this time of size 100. In general, anytime you create an object using a name that is already in use, the old object will get replaced with the new one.
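The reuse of a name can be seen in isolation with a short sketch. It runs on a simulated stand-in population (`area_demo`, `mean10`, and `mean100` are hypothetical names, not the Ames data).

```r
# Hypothetical stand-in for `area`: 2930 simulated right-skewed values (not the Ames data)
set.seed(7)
area_demo <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))

samp   <- sample(area_demo, 10)
mean10 <- mean(samp)    # save what we need from the first sample
length(samp)            # samp currently holds 10 values

samp    <- sample(area_demo, 100)   # same name, new object: the 10-value sample is gone
mean100 <- mean(samp)
length(samp)            # samp now holds 100 values
```

The two means survive because they were copied into their own objects before samp was overwritten.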

To see the effect that different sample sizes have on the sampling distribution, plot the three distributions on top of one another.

par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

The first command specifies that you’d like to divide the plotting area into 3 rows and 1 column of plots (to return to the default setting of plotting one at a time, use par(mfrow = c(1, 1))). The breaks argument suggests the number of bins to use in constructing the histogram; R treats a single number as a suggestion and may choose a nearby value that gives tidy break points. The xlim argument specifies the range of the x-axis of the histogram, and by setting it equal to xlimits for each histogram, we ensure that all three histograms will be plotted with the same limits on the x-axis.
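One optional refinement, sketched here as a suggestion and not as part of the original lab, is to save the old graphical settings when changing them and restore them afterwards. The example uses two simulated sampling distributions (`area_demo`, `means10_demo`, and `means100_demo` are hypothetical names, not the Ames data) and draws to a throwaway device so that it can run anywhere.

```r
# Hypothetical stand-in for `area`: 2930 simulated right-skewed values (not the Ames data)
set.seed(99)
area_demo     <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))
means10_demo  <- replicate(1000, mean(sample(area_demo, 10)))
means100_demo <- replicate(1000, mean(sample(area_demo, 100)))

pdf(NULL)                         # throwaway device; omit this line to see the plots
old_par <- par(mfrow = c(2, 1))   # par() returns the previous settings when it changes them
xlimits <- range(means10_demo)    # the smallest sample size typically has the widest range
hist(means10_demo,  breaks = 20, xlim = xlimits)
hist(means100_demo, breaks = 20, xlim = xlimits)
par(old_par)                      # restore the layout that was in place before
mfrow_after <- par("mfrow")       # layout after restoring
dev.off()                         # close the throwaway device
```

Restoring with par(old_par) has the same effect here as par(mfrow = c(1, 1)), but it also works when the previous layout was something other than the default.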

When the sample size is larger, what happens to the center? What about the spread?
Answer:
- As the sample size increases, the center of the sampling distribution stays essentially the same. Because the sample mean is an unbiased estimator, all three distributions are centered near the true population mean of about 1500 (the means below are 1501, 1500, and 1499).
- As the sample size increases, the spread of the sampling distribution narrows. From this week’s reading, we know that the standard error of the sample mean is inversely proportional to the square root of the sample size ($SE = s / \sqrt{n}$), so the standard deviation of the sampling distribution shrinks for larger sample sizes. This can also be seen below: the IQR shrinks from 213 to 94 to 69 as the sample size increases from 10 to 50 to 100.
```
summary(sample_means10)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1028    1388    1493    1501    1601    2391
```
```
summary(sample_means50)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1284    1452    1499    1500    1546    1773
```
```
summary(sample_means100)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1336    1464    1499    1499    1533    1696
```
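The relationship $SE = s / \sqrt{n}$ cited in the answer can be checked by simulation. The sketch below is an illustration on a simulated stand-in population (`area_demo` is a hypothetical name, not the Ames data). Since the whole population is available, it uses the population standard deviation in place of a sample standard deviation $s$.

```r
# Hypothetical stand-in for `area`: 2930 simulated right-skewed values (not the Ames data)
set.seed(2023)
area_demo <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))
sigma <- sd(area_demo)   # spread of the simulated population

sizes <- c(10, 50, 100)

# Standard deviation of 5000 simulated sample means at each sample size
empirical_se <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(area_demo, n))))
})

# What the formula predicts for each sample size
theoretical_se <- sigma / sqrt(sizes)

data.frame(n = sizes, empirical_se, theoretical_se,
           ratio = empirical_se / theoretical_se)
```

The ratios should be close to 1, with the empirical values falling slightly below the formula for the larger samples because sample() draws without replacement from a finite population.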

On your own

So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.

Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?
Answer:
- The sample mean below is $173,680, which we can use to estimate the population mean.
```
samp1 <- sample(price, 50)
summary(samp1)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   76000  127250  159217  173680  192125  418000
```
Since you have access to the population, simulate the sampling distribution for $\bar{x}_{price}$ by taking 5000 samples of size 50 from the population and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
Answer: See code below.
- The sampling distribution is approximately normal, with a mean of $181K and standard deviation of $11K.
- We can use the mean of the sampling distribution to estimate the true population mean, i.e., $181,035.
- The population mean is in fact $180,796, which is very close to the mean of the sampling distribution (0.13% lower).
```
sample_means50 <- rep(NA, 5000)
for (i in 1:5000){
    samp <- sample(price, 50)
    sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks = 30)
```
```
summary(sample_means50)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  149771  173153  180612  181035  188325  240940
```
```
sd(sample_means50)
```
```
## [1] 11203.66
```
```
# calc stats for population
summary(price)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12789  129500  160000  180796  213500  755000
```
Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
Answer: See code below.
- Like the sampling distribution above (sample size of 50), the sampling distribution here (sample size of 150) approximately follows a normal distribution, with a mean of $181K and standard deviation of $6K. Note that the standard deviation has shrunk by a factor of roughly $1 / \sqrt{3}$ (from $11,204 to $6,355) as the sample size has increased by a factor of 3 (from 50 to 150).
- Compared to the sampling distribution for a sample size of 50, the sampling distribution for a sample size of 150 has, in this run, a mean that is slightly closer to the population mean ($180,726 vs. $181,035, compared to a population mean of $180,796), though both differ from it by well under 1%. More importantly, the sampling distribution for a sample size of 150 has a narrower spread (lower variability) than the sampling distribution for a sample size of 50.
- Based on this sampling distribution, we could estimate the population mean using the mean of the sampling distribution, i.e., $180,726, which is only 0.04% less than the population mean of $180,796.
```
sample_means150 <- rep(NA, 5000)
for (i in 1:5000){
    samp <- sample(price, 150)
    sample_means150[i] <- mean(samp)
}
hist(sample_means150, breaks = 30)
```
```
summary(sample_means150)
```
```
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  160000  176421  180774  180726  184937  206655
```
```
sd(sample_means150)
```
```
## [1] 6355.491
```
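The ratio of the two standard deviations reported above comes out slightly below $1 / \sqrt{3}$. One plausible explanation, offered as an aside and not as part of the original lab, is that sample() draws without replacement from a finite population of 2930 sales, so the standard error carries a finite population correction factor of $\sqrt{(N - n) / (N - 1)}$. A quick arithmetic check using only the figures quoted above (the variable names are made up for illustration):

```r
# Figures reported in the answers above
sd_50  <- 11203.66   # sd(sample_means50), samples of size 50
sd_150 <- 6355.491   # sd(sample_means150), samples of size 150
N      <- 2930       # home sales in the population

observed_ratio <- sd_150 / sd_50
simple_ratio   <- sqrt(50 / 150)   # what SE = s / sqrt(n) predicts: 1 / sqrt(3)

# Finite population correction for sampling without replacement:
#   SE = (sigma / sqrt(n)) * sqrt((N - n) / (N - 1))
# In the ratio of two standard errors the (N - 1) terms cancel.
fpc_ratio <- simple_ratio * sqrt((N - 150) / (N - 50))

round(c(observed = observed_ratio, simple = simple_ratio, with_fpc = fpc_ratio), 4)
```

With the correction included, the predicted ratio lands much closer to the observed one, which suggests the small gap from $1 / \sqrt{3}$ is not just simulation noise.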
Of the sampling distributions from parts 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?
Answer:
- The sampling distribution in part 3 (sample size of 150) has a smaller spread than the sampling distribution in part 2 (sample size of 50), i.e., standard deviation of $6K vs. $11K.
- Generally, we would prefer a sampling distribution with a smaller spread, or lower variability, in order to make more accurate estimates of population parameters.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.