In this analysis, we investigate how statistics computed from a random sample of data can serve as point estimates for population parameters. We’re interested in building the sampling distribution of an estimate in order to learn about its properties, such as its center, spread, and shape. The instructions for this lab can be found at https://github.com/andrewpbray/oiLabs-base-R/tree/master/sampling_distributions

The Data

We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab is all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.

For this lab, we’ll restrict our attention to just two of the variables: the above-ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice). To save some effort throughout the lab, we create two variables with short names, area and price, that hold these two columns.
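A minimal sketch of this step, assuming the data set has been loaded as a data frame called ames (that name is an assumption, since the loading code is not shown here). So that the snippet runs on its own, it first builds a simulated stand-in with the same two column names. The stand-in values and its size of 2000 rows are invented for illustration and are not the Ames records.

```r
# Hypothetical stand-in for the real data frame: simulated values only,
# NOT the Ames records. Skip this part if `ames` is already loaded.
set.seed(1)
n_homes <- 2000  # arbitrary size chosen for the stand-in
ames <- data.frame(
  Gr.Liv.Area = round(rlnorm(n_homes, meanlog = 7.26, sdlog = 0.33)),
  SalePrice   = round(rlnorm(n_homes, meanlog = 12.0, sdlog = 0.40))
)

# Short names for the two variables used throughout the lab
area  <- ames$Gr.Liv.Area
price <- ames$SalePrice
```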

Let’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
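Output like the summary above could be produced along these lines. This is a sketch, not the lab's own code: area here is a simulated stand-in so the snippet runs on its own, which means its numbers will differ from the ones reported above.

```r
# Simulated stand-in for the population of living areas (NOT the Ames data)
set.seed(1)
area <- round(rlnorm(2000, meanlog = 7.26, sdlog = 0.33))

summary(area)  # minimum, quartiles, median, mean, maximum

# Histogram of the population; `breaks` is only a suggestion to hist()
hist(area, breaks = 30,
     main = "Above-ground living area", xlab = "Square feet")
```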

The Unknown Sampling Distribution on area

We are interested in estimating the mean living area in Ames based on a sample of size 50. Let’s draw one such random sample from the population and compute its mean.

## [1] 1455.42

The sample mean area based on 50 randomly selected homes is, in this instance, 1455.42, which is close to, but not exactly equal to, the true population mean area of 1500.
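Drawing one sample and computing its mean might look like the sketch below. The name samp1 is hypothetical, and area is again a simulated stand-in, so the values will not match 1455.42 or 1500.

```r
# Simulated stand-in for the population of living areas (NOT the Ames data)
set.seed(1)
area <- round(rlnorm(2000, meanlog = 7.26, sdlog = 0.33))

samp1 <- sample(area, 50)  # 50 homes drawn at random, without replacement
mean(samp1)                # point estimate of the population mean
mean(area)                 # population mean, for comparison
```

Running the last three lines again without resetting the seed gives a different sample and therefore a slightly different estimate, which is the variability the sampling distribution describes.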

Next we generate 5000 samples of size 50 and compute the sample mean of each.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1304    1453    1499    1501    1548    1797

The summary above comes from using R to take 5000 samples of size 50 from the population, calculating the mean of each sample, and storing each result in a vector called sample_means50. The mean of these sample means, 1501, is very close to the population mean of 1500.
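One way to build sample_means50 is a simple loop, sketched below. The loop body and the stand-in population are assumptions made for illustration; only the vector name comes from the text.

```r
# Simulated stand-in for the population of living areas (NOT the Ames data)
set.seed(1)
area <- round(rlnorm(2000, meanlog = 7.26, sdlog = 0.33))

sample_means50 <- rep(NA_real_, 5000)  # pre-allocate the result vector
for (i in seq_along(sample_means50)) {
  samp <- sample(area, 50)             # one random sample of size 50
  sample_means50[i] <- mean(samp)      # store that sample's mean
}

summary(sample_means50)
hist(sample_means50, breaks = 25)
```

Pre-allocating the vector avoids growing it on every iteration. The same result can be written more compactly with replicate().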

To get a sense of the effect that sample size has on the sampling distribution, let’s build up two more sampling distributions: one based on a sample size of 10 and another based on a sample size of 100.

To see the effect that different sample sizes have on the sampling distribution, we plot the three distributions on top of one another.

From the three histograms above, we can see that as the sample size gets larger, the sampling distribution looks more like a normally distributed bell curve and its spread becomes narrower.
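The comparison across sample sizes can be sketched as follows. The helper sampling_means() is a hypothetical convenience function, not part of the lab, and area is a simulated stand-in for the population.

```r
# Simulated stand-in for the population of living areas (NOT the Ames data)
set.seed(1)
area <- round(rlnorm(2000, meanlog = 7.26, sdlog = 0.33))

# Hypothetical helper: means of `reps` random samples of size `n`
sampling_means <- function(population, n, reps = 5000) {
  replicate(reps, mean(sample(population, n)))
}

sample_means10  <- sampling_means(area, 10)
sample_means50  <- sampling_means(area, 50)
sample_means100 <- sampling_means(area, 100)

# Stack the histograms on a shared x-axis so the spreads are comparable
xlimits <- range(sample_means10)
old_par <- par(mfrow = c(3, 1))
hist(sample_means10,  breaks = 20, xlim = xlimits)
hist(sample_means50,  breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
par(old_par)  # restore the previous plotting layout

# Numerical check: the standard deviation shrinks as n grows
c(n10 = sd(sample_means10), n50 = sd(sample_means50), n100 = sd(sample_means100))
```

Fixing xlim to the range of the widest distribution keeps all three panels on the same scale, so the narrowing is visible at a glance.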

The Unknown Sampling Distribution on price

## [1] 162016.4
## [1] 180796.1

Comparing the sampling distributions for sample sizes of 50 and 150, the latter has the smaller spread. If we’re concerned with making estimates that are more often close to the true value, we’d prefer the distribution with the smaller spread.
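The difference in spread can be checked numerically. The sketch below uses a simulated stand-in for price, so the figures are illustrative only. It also compares the observed spreads with the usual benchmark of the population standard deviation divided by the square root of the sample size.

```r
# Simulated stand-in for the population of sale prices (NOT the Ames data)
set.seed(1)
price <- round(rlnorm(2000, meanlog = 12.0, sdlog = 0.40))

sample_means50  <- replicate(5000, mean(sample(price, 50)))
sample_means150 <- replicate(5000, mean(sample(price, 150)))

sd(sample_means50)   # observed spread of the sample means, n = 50
sd(sample_means150)  # observed spread of the sample means, n = 150

# Rough benchmark sigma / sqrt(n); it ignores the finite-population
# correction that applies when sampling without replacement
sd(price) / sqrt(c(50, 150))
```

Tripling the sample size from 50 to 150 should cut the spread by a factor of roughly the square root of 3, or about 1.7.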