Sampling distributions

Welcome back, dear student! In this lab, we’ll lay the foundations for statistical inference. You’ll learn a lot in this two-part lab, but don’t be intimidated: perseverance will be rewarded!

In Part A of this lab, we investigate how statistics computed from a random sample of data can serve as point estimates for population parameters. We’re interested in building the sampling distribution of an estimate in order to learn about its properties, such as its center and variability.

We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest (which is rare to have access to), but we will also work with smaller samples from this population.

load(url('http://s3.amazonaws.com/assets.datacamp.com/course/dasi/ames.RData'))

names(ames)

##  [1] "Order"           "PID"             "MS.SubClass"    
##  [4] "MS.Zoning"       "Lot.Frontage"    "Lot.Area"       
##  [7] "Street"          "Alley"           "Lot.Shape"      
## [10] "Land.Contour"    "Utilities"       "Lot.Config"     
## [13] "Land.Slope"      "Neighborhood"    "Condition.1"    
## [16] "Condition.2"     "Bldg.Type"       "House.Style"    
## [19] "Overall.Qual"    "Overall.Cond"    "Year.Built"     
## [22] "Year.Remod.Add"  "Roof.Style"      "Roof.Matl"      
## [25] "Exterior.1st"    "Exterior.2nd"    "Mas.Vnr.Type"   
## [28] "Mas.Vnr.Area"    "Exter.Qual"      "Exter.Cond"     
## [31] "Foundation"      "Bsmt.Qual"       "Bsmt.Cond"      
## [34] "Bsmt.Exposure"   "BsmtFin.Type.1"  "BsmtFin.SF.1"   
## [37] "BsmtFin.Type.2"  "BsmtFin.SF.2"    "Bsmt.Unf.SF"    
## [40] "Total.Bsmt.SF"   "Heating"         "Heating.QC"     
## [43] "Central.Air"     "Electrical"      "X1st.Flr.SF"    
## [46] "X2nd.Flr.SF"     "Low.Qual.Fin.SF" "Gr.Liv.Area"    
## [49] "Bsmt.Full.Bath"  "Bsmt.Half.Bath"  "Full.Bath"      
## [52] "Half.Bath"       "Bedroom.AbvGr"   "Kitchen.AbvGr"  
## [55] "Kitchen.Qual"    "TotRms.AbvGrd"   "Functional"     
## [58] "Fireplaces"      "Fireplace.Qu"    "Garage.Type"    
## [61] "Garage.Yr.Blt"   "Garage.Finish"   "Garage.Cars"    
## [64] "Garage.Area"     "Garage.Qual"     "Garage.Cond"    
## [67] "Paved.Drive"     "Wood.Deck.SF"    "Open.Porch.SF"  
## [70] "Enclosed.Porch"  "X3Ssn.Porch"     "Screen.Porch"   
## [73] "Pool.Area"       "Pool.QC"         "Fence"          
## [76] "Misc.Feature"    "Misc.Val"        "Mo.Sold"        
## [79] "Yr.Sold"         "Sale.Type"       "Sale.Condition" 
## [82] "SalePrice"

A first distribution analysis

The names function tells us there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice).

area = ames$Gr.Liv.Area
price = ames$SalePrice

# Calculate the summary and draw a histogram of 'area'
summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1440    1500    1740    5640

hist(area)


Which of the following is false?

The distribution of areas of houses in Ames is unimodal and right-skewed.
50% of houses in Ames are smaller than 1,500 square feet.
The middle 50% of the houses range between approximately 1,130 square feet and 1,740 square feet.
The IQR is approximately 610 square feet.
The smallest house is 334 square feet and the largest is 5,642 square feet.

boxplot(area)


IQR(area)

## [1] 616.8

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1130    1440    1500    1740    5640

sum(area < 1500) / length(area)

## [1] 0.5526

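FALSE: 50% of houses in Ames are smaller than 1,500 square feet. The median is about 1,440 square feet, and roughly 55% of houses are smaller than 1,500 square feet.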
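
The check above can be reproduced on any numeric vector. The sketch below uses a simulated right-skewed vector, area_sim, as a hypothetical stand-in for area, so its numbers will not match the Ames output. It shows that half of the values fall below the median, that more than half fall below the mean when the data are right-skewed, and that the IQR is the distance between the third and first quartiles.

```r
# Hypothetical stand-in for 'area': a simulated right-skewed vector,
# not the Ames data.
set.seed(1)
area_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.3)

# Proportion of values below the median: one half by construction.
prop_below_median <- mean(area_sim < median(area_sim))

# Proportion below the mean: more than half for right-skewed data,
# because the long right tail pulls the mean above the median.
prop_below_mean <- mean(area_sim < mean(area_sim))

# The IQR is the third quartile minus the first quartile.
q <- quantile(area_sim, c(0.25, 0.75))
iqr_by_hand <- unname(q[2] - q[1])

prop_below_median
prop_below_mean
iqr_by_hand
IQR(area_sim)
```

Here mean(x < value) is shorthand for sum(x < value) / length(x), the expression used in the lab.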

Sampling from the population

In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population. If we are interested in estimating the mean living area in Ames based on a sample, we can use the sample function to draw one from the population: sample(area, 50).

This command collects a simple random sample of size 50 from the vector area. This is like going into the City Assessor’s database and pulling up the files on 50 random home sales. If we didn’t have access to the population data, working with these 50 files would be considerably simpler than having to go through all 2930 home sales.
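
A minimal sketch of how sample behaves, assuming a small made-up vector, toy_population, in place of area. By default sample draws without replacement, so no element is picked twice. Calling set.seed first (an addition here, not part of the lab code) makes a draw reproducible.

```r
# Hypothetical toy population standing in for 'area'.
toy_population <- seq(1000, 2000, by = 10)   # 101 distinct values

set.seed(42)                        # fix the random number generator state
samp_a <- sample(toy_population, 5)

set.seed(42)                        # same seed, so the same draw
samp_b <- sample(toy_population, 5)

identical(samp_a, samp_b)           # TRUE: the draw is reproducible
anyDuplicated(samp_a) == 0          # TRUE: sampling is without replacement
all(samp_a %in% toy_population)     # TRUE: every draw comes from the population
```

Without a fixed seed, each call to sample returns a different set of values, which is why your histograms will differ from the ones shown in this lab.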

samp0 = sample(area, 50)
samp1 = sample(area, 50)

hist(samp0)


hist(samp1)


If we’re interested in estimating the average living area of homes in Ames using the sample, our best single guess is the sample mean: mean(samp1).

Depending on which 50 homes you selected, your estimate could be a bit above or a bit below the true population mean of approximately 1,500 square feet. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population. Suppose we took two more samples, one of size 100 and one of size 1000.

Which do you think would provide a more accurate estimate of the population mean?

Sample size of 50
Sample size of 100
Sample size of 1000

Sample size of 1000

Yes! Larger samples tend to be more representative of the complete population, so their sample means tend to fall closer to the population mean.
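
One way to check this answer empirically is to repeat the sampling many times at each size and look at the typical distance between the sample mean and the population mean. The sketch below assumes a simulated population, pop_sim (not the Ames data), a hypothetical helper, mean_abs_error, and an arbitrary 2000 repetitions per sample size.

```r
# Simulated right-skewed population; a stand-in for 'area', not the Ames data.
set.seed(2)
pop_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.3)
true_mean <- mean(pop_sim)

# Average absolute error of the sample mean for a given sample size.
mean_abs_error <- function(n, reps = 2000) {
  errors <- replicate(reps, abs(mean(sample(pop_sim, n)) - true_mean))
  mean(errors)
}

sizes <- c(50, 100, 1000)
err <- sapply(sizes, mean_abs_error)
names(err) <- sizes
err   # the typical error shrinks as the sample size grows
```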

The sampling distribution

Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way.

The distribution of sample means, called the sampling distribution, can help us understand this variability. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps many times. Here we will generate 5000 samples and compute the sample mean of each.

The code in the editor takes 5000 samples of size 50 from the population, calculates the mean of each sample, and stores each result in a vector called sample_means50.

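# Preallocate a vector to hold the 5000 sample means
sample_means50 = rep(NA, 5000)

for (i in 1:5000) {
    # Draw a simple random sample of 50 areas and store its mean
    samp = sample(area, 50)
    sample_means50[i] = mean(samp)
}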

hist(sample_means50, breaks = 13)


More on sampling

Mechanics aside, let’s return to the reason we used a for loop: to compute a sampling distribution, specifically, this one: hist(sample_means50).

The sampling distribution that we computed tells us much about estimating the average living area in homes in Ames. Because the sample mean is an unbiased estimator, the sampling distribution is centered at the true average living area of the population, and the spread of the distribution indicates how much variability is induced by sampling only 50 home sales.
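
Both properties can be checked numerically. The sketch below assumes a simulated population, pop_sim, in place of area, and uses replicate as a compact alternative to the for loop. Because sample draws without replacement from a finite population, the theoretical standard error used for comparison includes the finite population correction.

```r
# Simulated right-skewed population; a stand-in for 'area', not the Ames data.
set.seed(3)
pop_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.3)
N <- length(pop_sim)
n <- 50

# 5000 sample means, each from a simple random sample of size n.
means_sim <- replicate(5000, mean(sample(pop_sim, n)))

# Center: the mean of the sample means should sit close to the population mean.
center_gap <- mean(means_sim) - mean(pop_sim)

# Spread: compare the observed SD of the sample means with the theoretical
# standard error for sampling without replacement.
pop_sd <- sqrt(mean((pop_sim - mean(pop_sim))^2))   # population SD (divisor N)
se_theory <- pop_sd / sqrt(n) * sqrt((N - n) / (N - 1))
se_observed <- sd(means_sim)

c(center_gap = center_gap, se_theory = se_theory, se_observed = se_observed)
```

The gap in the centers should be small relative to the standard error, and the two standard errors should agree closely.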

Let’s see what effect the sample size has on our distribution. Take a look at the code. Here we create two more sampling distributions, sample_means10 and sample_means100, from samples of size 10 and 100.

sample_means10 = rep(NA, 5000)
sample_means100 = rep(NA, 5000)

# Run the for loop:
for (i in 1:5000) {
    samp = sample(area, 10)
    sample_means10[i] = mean(samp)
    samp = sample(area, 100)
    sample_means100[i] = mean(samp)
}

head(sample_means10)

## [1] 1291 1639 1578 1547 1396 1908

head(sample_means50)

## [1] 1473 1429 1566 1349 1505 1498

head(sample_means100)

## [1] 1485 1461 1417 1444 1514 1489

Influence of sample size

To see the effect that different sample sizes have on the sampling distribution, let’s plot the three distributions on top of one another.

In R you can plot all three of them on the same graph by specifying that you’d like to divide the plotting area into three rows and one column of plots. You do this with the following command: par(mfrow = c(3, 1)).

For easy comparison, we’d also like to use the same scale for each histogram. As a common scale, we’ll use the limits (min, max) of the first sampling distribution: xlimits = range(sample_means10).

Now we can set the xlim argument of the hist function to the xlimits object to specify the range of the x-axis of the histograms.

par(mfrow = c(3, 1))

# Define the limits for the x-axis:
xlimits = range(sample_means10)

# Draw the histograms:

hist(sample_means10, xlim=xlimits, breaks=20)
hist(sample_means50, xlim=xlimits, breaks=20)
hist(sample_means100, xlim=xlimits, breaks=20)


par(mfrow = c(1, 1))

It makes intuitive sense that as the sample size increases, the sample mean becomes a more reliable estimate of the true population mean. Also, as the sample size increases, the variability of the sampling distribution _________.

decreases
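
To put numbers on this answer, one can compare the standard deviations of the three sampling distributions. The sketch below assumes a simulated population, pop_sim, instead of area, so the values differ from the lab’s, and a hypothetical helper, sampling_sd. It uses 5000 repetitions per sample size, as in the lab.

```r
# Simulated right-skewed population; a stand-in for 'area', not the Ames data.
set.seed(4)
pop_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.3)

# SD of the sampling distribution of the mean for one sample size.
sampling_sd <- function(n, reps = 5000) {
  sd(replicate(reps, mean(sample(pop_sim, n))))
}

spread <- vapply(c(10, 50, 100), sampling_sd, numeric(1))
names(spread) <- c("n = 10", "n = 50", "n = 100")
spread

# The spread scales roughly with 1 / sqrt(n), so going from n = 10 to n = 100
# should shrink it by a factor of about sqrt(10), a little over 3.
ratio_10_100 <- spread[["n = 10"]] / spread[["n = 100"]]
ratio_10_100
```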

Now: prices!

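So far, we have only focused on estimating the mean living area in homes in Ames. Now let’s estimate the mean sale price, starting with a single random sample of size 50 from price.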

sample_50 = sample(price, 50)
mean(sample_50)

## [1] 177255

Sampling distribution of prices

Since we have access to the population, we can simulate the sampling distribution for \(\bar{x}\) by taking 5000 samples of size 50 from the population and computing the 5000 sample means. Note that this overwrites the sample_means50 vector of area means created earlier.
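
For reference, the quantity being simulated can be written out. With \(x_1, \dots, x_n\) denoting the sale prices in one sample of size \(n\) (here \(n = 50\)), the sample mean is

\[ \bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i \]

If \(\sigma\) denotes the population standard deviation of price (notation introduced here, not used elsewhere in the lab), the standard deviation of \(\bar{x}\) across repeated samples is approximately

\[ \mathrm{SE}(\bar{x}) \approx \frac{\sigma}{\sqrt{n}} \]

This approximation ignores the small correction for sampling without replacement from a finite population.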

n = 5000
sample_means50 = rep(NA, n)

for (i in 1:n) {
  sample_means50[i] = mean(sample(price, 50))
}

head(sample_means50)

## [1] 210788 179081 176709 176410 208226 202882

hist(sample_means50)


More on sampling distribution of prices

Do the same thing, but now create a vector called sample_means150 to store means of samples of size 150.

Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

n = 5000
sample_means150 = rep(NA, n)

for (i in 1:n) {
  sample_means150[i] = mean(sample(price, 150))
}

head(sample_means150)

## [1] 180297 173609 183966 191916 190480 194980

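# Use the same x-axis limits for both histograms
xlim = range(sample_means50)

par(mfrow = c(2, 1))
hist(sample_means50, xlim=xlim)
hist(sample_means150, xlim=xlim)

# Restore the default single-plot layout
par(mfrow = c(1, 1))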


median(sample_means150)

## [1] 180685

median(sample_means50)

## [1] 180582
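
The shape comparison asked for in this exercise can also be quantified. Base R has no built-in skewness function, so the sketch below defines a simple moment-based one, skewness_simple (a hypothetical helper), and applies it to price_sim, a simulated right-skewed stand-in for price. The numbers are therefore illustrative only and say nothing about the actual Ames prices.

```r
# Simulated right-skewed population; a stand-in for 'price', not the Ames data.
set.seed(6)
price_sim <- rlnorm(2930, meanlog = 12, sdlog = 0.4)

# Moment-based skewness: near 0 for a symmetric distribution,
# positive for a right-skewed one.
skewness_simple <- function(x) {
  z <- x - mean(x)
  mean(z^3) / mean(z^2)^1.5
}

means50_sim  <- replicate(5000, mean(sample(price_sim, 50)))
means150_sim <- replicate(5000, mean(sample(price_sim, 150)))

skew <- c(population = skewness_simple(price_sim),
          n50        = skewness_simple(means50_sim),
          n150       = skewness_simple(means150_sim))
skew
```

Even though the simulated population is clearly right-skewed, the sample means at both sizes are much closer to symmetric, which is the pattern to look for in the two histograms.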