Foundations for statistical inference - Sampling distributions

In this analysis, we investigate how statistics computed from a random sample of data can serve as point estimates for population parameters. We build a sampling distribution of each estimate in order to learn about its properties, such as its center, spread, and shape. The instructions for this lab can be found at https://github.com/andrewpbray/oiLabs-base-R/tree/master/sampling_distributions

The Data

We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")

For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice). To save some effort throughout the lab, create two variables with short names that represent these two variables.

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

Let’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

The Unknown Sampling Distribution on `area`

We are interested in estimating the mean living area in Ames based on a sample of size 50. Let’s draw one such sample from the population.

set.seed(12345)
samp1 <- sample(area, 50)
mean(samp1)

## [1] 1455.42

The sample mean area based on 50 randomly selected homes is 1455.42 in this instance, which is close to, but not exactly equal to, the true population mean area of about 1500.

Here we will generate 5000 samples and compute the sample mean of each.

sample_means50 <- rep(NA, 5000)

for (i in 1:5000){
        samp <- sample(area,50)
        sample_means50[i] <- mean(samp)
}
summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1304    1453    1499    1501    1548    1797

## Histogram of sample means (5000 samples, each of size 50)
hist(sample_means50, breaks = 25)

## Vertical line at population mean
abline(v = mean(area), col = "red", lwd = 3, lty = 2)

Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called sample_means50. The histogram of these means approximates the sampling distribution of the sample mean, and it is centered near the population mean marked by the dashed red line.
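The same simulation can also be written more compactly with replicate(). The sketch below is one possible alternative, not the lab's own code. So that it runs on its own, it uses a simulated right-skewed stand-in population called `pop` (a hypothetical name and made-up parameters) instead of the Ames data.

```r
# Hypothetical stand-in population so the block runs without ames.RData.
# With the lab data loaded, use `area` in place of `pop`.
set.seed(1)
pop <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)

# replicate() evaluates the expression 5000 times and collects the
# results in a numeric vector, so no pre-allocation or loop index is needed.
sample_means <- replicate(5000, mean(sample(pop, 50)))

length(sample_means)             # number of simulated sample means
mean(sample_means) - mean(pop)   # small relative to mean(pop)
```

With the Ames data loaded, replicate(5000, mean(sample(area, 50))) should play the same role as the for loop above.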

To get a sense of the effect that sample size has on the sampling distribution, let’s build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100.

## Each vector holds 5000 sample means, one per iteration of the loop
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for (i in 1:5000){
        samp <- sample(area,10)
        sample_means10[i] <- mean(samp)
        samp <- sample(area,100)
        sample_means100[i] <- mean(samp)
}

To see the effect that different sample sizes have on the sampling distribution, plot the three distributions on top of one another, using the same x-axis limits for each.

par(mfrow = c(3,1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

From the three histograms above, it’s clear that as the sample size gets larger, the sampling distribution becomes narrower and looks more like a normally distributed bell curve.
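The narrowing can also be checked numerically against the usual standard error approximation, sd(population) / sqrt(n). The following is a minimal sketch under stated assumptions: `pop` is a simulated stand-in population (hypothetical name and parameters) so the block runs without the Ames data, and the formula ignores the small finite-population correction that applies when sampling without replacement.

```r
# Hypothetical stand-in population; with the lab data loaded, use `area`.
set.seed(2)
pop <- rlnorm(3000, meanlog = 7.2, sdlog = 0.35)

sizes <- c(10, 50, 100)

# For each sample size, simulate 5000 sample means and record their spread.
simulated <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(pop, n))))
})

# Approximate standard error of the mean for each sample size.
theoretical <- sd(pop) / sqrt(sizes)

data.frame(n = sizes, simulated = simulated, theoretical = theoretical)
```

The two columns should agree closely, with the simulated spread shrinking as n grows.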

The Unknown Sampling Distribution on `price`

Take a random sample of size 50 from price. Using this sample, the best point estimate of the population mean is the sample mean.

## sample mean
set.seed(54321)
samp2 <- sample(price, 50)
mean(samp2)

## [1] 162016.4

Since we have access to the population, simulate the sampling distribution for \(\bar{x}_{price}\) by taking 5000 samples of size 50 from the population and computing the mean of each.

## Note: this reuses the name sample_means50, replacing the values computed for area
sample_means50 <- rep(NA, 5000)
for (i in 1:5000){
        sample_means50[i] <- mean(sample(price, 50))
}

## Histogram of sample means
hist(sample_means50, breaks = 20, xlim = range(sample_means50))

## Population mean
mean(price)

## [1] 180796.1

## Vertical line of population mean
abline(v = mean(price), col = "red", lwd = 3, lty = 2)

Increase the sample size from 50 to 150.

sample_means150 <- rep(NA, 5000)
for (i in 1:5000){
        sample_means150[i] <- mean(sample(price,150))
}

hist(sample_means150, breaks = 20, xlim = range(sample_means150))
abline(v = mean(price), col = "blue", lwd = 3, lty = 2)

Comparing the sampling distributions for sample sizes of 50 and 150, the latter has a smaller spread. If we’re concerned with making estimates that are more often close to the true value, we’d prefer the sampling distribution with the smaller spread, which here means the larger sample size.
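The size of this reduction can be estimated with a rough worked check. It assumes approximately independent draws (the samples here are taken without replacement from a finite population, so this is only an approximation) and writes \(\sigma\) for the population standard deviation of price, a symbol the lab itself does not use:

\[
\mathrm{SE}(\bar{x}) \approx \frac{\sigma}{\sqrt{n}}, \qquad \frac{\mathrm{SE}_{150}}{\mathrm{SE}_{50}} \approx \sqrt{\frac{50}{150}} = \frac{1}{\sqrt{3}} \approx 0.58
\]

Under these assumptions, tripling the sample size should shrink the spread of the sampling distribution to roughly 58% of its previous value.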

Roger (Piaoyang) Hu

6/22/2019
