In this analysis, we investigate how statistics from a random sample of data can serve as point estimates for population parameters. We’re interested in formulating a sampling distribution of our estimate in order to learn about its properties, such as its center and spread. The instructions for this lab can be found at https://github.com/andrewpbray/oiLabs-base-R/tree/master/sampling_distributions
We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice). To save some effort throughout the lab, create two variables with short names that represent these two variables.
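The assignment step itself is not shown in this write-up. A minimal sketch, assuming the loaded data frame is named ames and that the short names are area and price, as used in the rest of the lab. The small stand-in data frame is hypothetical, holds made-up values, and exists only so the snippet runs without the download:

```r
# Hypothetical stand-in for the data frame provided by load("ames.RData");
# the values are made up and are only used if `ames` is not already loaded.
if (!exists("ames")) {
  ames <- data.frame(
    Gr.Liv.Area = c(1000, 1500, 2000),
    SalePrice   = c(120000, 180000, 240000)
  )
}

# Short names for the two variables used throughout the lab
area  <- ames$Gr.Liv.Area
price <- ames$SalePrice
```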
Let’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     334    1126    1442    1500    1743    5642
[Figure: histogram of area]
We are interested in estimating the mean living area in Ames based on a sample size of 50. Let’s survey the population.
## [1] 1455.42
Based on this random sample of 50 homes, the estimated mean area is 1455.42 square feet, which is close to, but not exactly equal to, the population mean of about 1500 square feet.
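The code behind this single estimate is not shown. A minimal sketch of the step, assuming a simulated stand-in population (area_demo is hypothetical and is not the Ames data; with the real data one would sample from area directly):

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in population of living areas (not the Ames data)
area_demo <- round(rlnorm(3000, meanlog = 7.25, sdlog = 0.33))

# Draw one simple random sample of 50 homes, without replacement
samp1 <- sample(area_demo, 50)

# The sample mean is the point estimate; the population mean is the target
mean(samp1)
mean(area_demo)
```

Rerunning the sample() line without resetting the seed gives a different sample and therefore a slightly different estimate, which is the variability the sampling distribution describes.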
Here we will generate 5000 samples and compute the sample mean of each.
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
summary(sample_means50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1304    1453    1499    1501    1548    1797
## Histogram of sample means (5000 samples, each of size 50)
hist(sample_means50, breaks = 25)
## Vertical line at population mean
abline(v = mean(area), col = "red", lwd = 3, lty = 2)
Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called sample_means50.
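The same simulation can be written more compactly with replicate(), which avoids pre-allocating the vector and managing the loop index. This is an alternative sketch, not the lab’s own code; area_demo and sample_means50_demo are hypothetical names, and the population is simulated rather than taken from the Ames data:

```r
set.seed(2)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in population (not the Ames data)
area_demo <- round(rlnorm(3000, meanlog = 7.25, sdlog = 0.33))

# replicate() evaluates the expression 5000 times and collects the results
# in a numeric vector, one sample mean per draw of 50 homes
sample_means50_demo <- replicate(5000, mean(sample(area_demo, 50)))

summary(sample_means50_demo)
```

The explicit for loop makes each step visible, which suits a teaching lab; replicate() is the shorter idiom once the logic is familiar.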
To get a sense of the effect that sample size has on the sampling distribution, let’s build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100.
## Pre-allocate one slot per simulated sample (5000 each)
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
To see the effect that different sample sizes have on the sampling distribution, plot the three distributions on top of one another.
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
## Restore the single-panel layout for later plots
par(mfrow = c(1, 1))
From the three histograms above, it’s clear that the sampling distribution looks more like a normally distributed bell curve as the sample size gets larger.
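Sample size also affects the spread of the sampling distribution. For simple random samples that are small relative to the population, the standard deviation of the sample mean is approximately σ/√n, where σ is the population standard deviation. The sketch below checks this on a simulated stand-in population; area_demo, sizes, empirical_se, and theoretical_se are hypothetical names, and the numbers it produces are not results from the Ames data:

```r
set.seed(3)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in population (not the Ames data)
area_demo <- round(rlnorm(3000, meanlog = 7.25, sdlog = 0.33))

# Population standard deviation (divides by N, not N - 1)
sigma <- sqrt(mean((area_demo - mean(area_demo))^2))

sizes <- c(10, 50, 100)

# Empirical spread of 5000 sample means at each sample size
empirical_se <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(area_demo, n))))
})

# Approximation sigma / sqrt(n); it ignores the finite population
# correction, which is small when n is far below the population size
theoretical_se <- sigma / sqrt(sizes)

data.frame(n = sizes, empirical_se, theoretical_se)
```

Because the spread shrinks with the square root of n, going from 10 to 100 observations cuts it by a factor of roughly three rather than ten.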
We now repeat the exercise for price, starting with a single random sample. Using this sample, the best point estimate of the population mean is the sample mean.
## [1] 162016.4
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  sample_means50[i] <- mean(sample(price, 50))
}
## Histogram of sample means
hist(sample_means50, breaks = 20, xlim = range(sample_means50))
## Population mean
mean(price)
## [1] 180796.1
sample_means150 <- rep(NA, 5000)
for (i in 1:5000) {
  sample_means150[i] <- mean(sample(price, 150))
}
hist(sample_means150, breaks = 20, xlim = range(sample_means150))
abline(v = mean(price), col = "blue", lwd = 3, lty = 2)
Comparing the sampling distributions for sample sizes of 50 and 150, the latter has the smaller spread. If we’re concerned with making estimates that are more often close to the true value, we’d prefer a distribution with a small spread, which here means the larger sample size.
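One way to make "more often close to the true value" concrete is to count how many sample means land within a fixed tolerance of the population mean. A minimal sketch, assuming a simulated stand-in population of prices (price_demo is hypothetical and is not the Ames data) and an arbitrary 5% tolerance chosen only for illustration:

```r
set.seed(4)  # arbitrary seed so the sketch is reproducible

# Hypothetical stand-in population of sale prices (not the Ames data)
price_demo <- round(rlnorm(3000, meanlog = 12, sdlog = 0.4))
mu <- mean(price_demo)

# Sampling distributions of the mean for n = 50 and n = 150
means50_demo  <- replicate(5000, mean(sample(price_demo, 50)))
means150_demo <- replicate(5000, mean(sample(price_demo, 150)))

# Spread of each sampling distribution
c(n50 = sd(means50_demo), n150 = sd(means150_demo))

# Share of sample means landing within 5% of the population mean
within_5pct <- function(x) mean(abs(x - mu) <= 0.05 * mu)
c(n50 = within_5pct(means50_demo), n150 = within_5pct(means150_demo))
```

With the real data, the same two comparisons could be run on sample_means50 and sample_means150 against mean(price).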