Foundations for statistical inference - Sampling distributions

The data

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")

For this lab, we’ll restrict our attention to just two of the variables: the above-ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice).

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

Calculate a few summary statistics and make a histogram

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

Exercise 1

Describe this population distribution.

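Answer: It is skewed right. The mean (1500) exceeds the median (1442), and the maximum (5642) lies much farther from the median than the minimum (334) does.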

The unknown sampling distribution

We can use the following command to draw a random sample of 50 houses from the population

samp1 <- sample(area, 50)
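Because sample() draws at random, each run of this lab produces a different samp1 and therefore different means and histograms. The sketch below shows one way to make a draw reproducible with set.seed(). It is only an illustration: area_sim is a hypothetical, simulated right-skewed stand-in for area so that the block runs without the data file, and the seed values are arbitrary.

```r
# Hypothetical stand-in population: simulated, right-skewed values.
# With the lab data loaded, `area` could be used in its place.
set.seed(1)
area_sim <- rlnorm(3000, meanlog = 7.26, sdlog = 0.33)

# Resetting the same seed before each draw gives the same sample.
set.seed(42)
samp_a <- sample(area_sim, 50)
set.seed(42)
samp_b <- sample(area_sim, 50)
identical(samp_a, samp_b)   # TRUE

# Drawing again without resetting the seed gives a new sample.
samp_c <- sample(area_sim, 50)
identical(samp_a, samp_c)
```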

Exercise 2

Describe the distribution of this sample. How does it compare to the distribution of the population?

hist(samp1)

Answer: It is still skewed right, but not as much as the population distribution.

Estimate the average living area

mean(samp1)

## [1] 1538.5

Exercise 3

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

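Answer: The two means differ (1538.5 versus 1570.68) but are close in value. I would expect the same from any further samples of size 50: each mean would differ slightly from the others. Of the samples of size 100 and 1000, the sample of size 1000 should provide the more precise estimate of the population mean, because the means of larger samples vary less from sample to sample.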

samp2 <- sample(area, 50)
mean(samp2)

## [1] 1570.68

hist(samp2)

samp3 <- sample(area, 100)
mean(samp3)

## [1] 1455.18

hist(samp3)

samp4 <- sample(area, 1000)
mean(samp4)

## [1] 1508.034

hist(samp4)

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}

hist(sample_means50, breaks = 25)
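The same simulation can also be written without an explicit loop. The following is a minimal sketch using base R's replicate(), not a replacement for the lab's approach. The names area_sim and sample_means50_alt are hypothetical: area_sim is a simulated right-skewed stand-in for area so that the block runs on its own.

```r
# Hypothetical stand-in population: simulated, right-skewed values.
# With the lab data loaded, `area` could be used in its place.
set.seed(1)
area_sim <- rlnorm(3000, meanlog = 7.26, sdlog = 0.33)

# replicate() evaluates the expression 5000 times and collects the
# results in a vector, one sample mean per repetition.
sample_means50_alt <- replicate(5000, mean(sample(area_sim, 50)))

length(sample_means50_alt)   # 5000
```

The resulting vector can be passed to hist() exactly as sample_means50 is above.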

Exercise 4

How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

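Answer: There are 5000 elements in sample_means50. The sampling distribution is roughly symmetric and centered at about 1500, close to the population mean. If we collected 50,000 sample means instead, I would expect both the center and the spread to stay about the same, because they depend on the sample size (50), not on the number of samples drawn. The histogram would simply look smoother.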
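One way to check how the number of simulated means affects the center and spread is to run the simulation twice and compare. The sketch below does this on a hypothetical simulated population (area_sim, a right-skewed stand-in for area); means_5k and means_50k are illustrative names. The two standard deviations are expected to come out close to each other, since the sample size is 50 in both runs.

```r
# Hypothetical stand-in population: simulated, right-skewed values.
# With the lab data loaded, `area` could be used in its place.
set.seed(1)
area_sim <- rlnorm(3000, meanlog = 7.26, sdlog = 0.33)

# Same sample size (50), different numbers of simulated sample means.
means_5k  <- replicate(5000,  mean(sample(area_sim, 50)))
means_50k <- replicate(50000, mean(sample(area_sim, 50)))

# Compare the centers, then the spreads, of the two simulations.
c(mean(means_5k), mean(means_50k))
c(sd(means_5k), sd(means_50k))
```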

Interlude: The for loop

Exercise 5

To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

Answer: There are 100 elements in sample_means_small. Each element is the mean living area of one random sample of 50 houses drawn from area.
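The smaller loop described in Exercise 5 could be sketched as follows. Here area_sim is a hypothetical simulated stand-in for area, used only so that the block runs without the data file; with the lab data loaded, area can be substituted directly.

```r
# Hypothetical stand-in population: simulated, right-skewed values.
set.seed(1)
area_sim <- rlnorm(3000, meanlog = 7.26, sdlog = 0.33)

# Initialize a vector of 100 zeros to hold the results.
sample_means_small <- rep(0, 100)

# Each pass draws a sample of size 50 and stores its mean in slot i.
for (i in 1:100) {
  samp <- sample(area_sim, 50)
  sample_means_small[i] <- mean(samp)
}

sample_means_small           # print all 100 sample means
length(sample_means_small)   # 100
```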

Sample size and the sampling distribution

hist(sample_means50)

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

Exercise 6

When the sample size is larger, what happens to the center? What about the spread?

Answer: The center stays about the same, but the spread narrows, so the mean of a larger sample tends to be a more precise estimate of the population mean.
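The narrowing can be compared with the standard formula for the standard deviation of a sample mean, sigma / sqrt(n), multiplied by the finite population correction sqrt((N - n) / (N - 1)) because sample() draws without replacement by default. The sketch below is a rough check under stated assumptions: area_sim is a hypothetical simulated stand-in for area, and sizes, sim_sd and theory_sd are illustrative names.

```r
# Hypothetical stand-in population: simulated, right-skewed values.
# With the lab data loaded, `area` could be used in its place.
set.seed(1)
area_sim <- rlnorm(3000, meanlog = 7.26, sdlog = 0.33)

N <- length(area_sim)
# Population standard deviation (divisor N, not N - 1).
sigma <- sqrt(mean((area_sim - mean(area_sim))^2))

sizes <- c(10, 50, 100)

# Simulated spread: sd of 5000 sample means for each sample size.
sim_sd <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(area_sim, n))))
})

# Theoretical spread for sampling without replacement.
theory_sd <- sigma / sqrt(sizes) * sqrt((N - sizes) / (N - 1))

data.frame(n = sizes, simulated = sim_sd, theoretical = theory_sd)
```

The simulated and theoretical columns should agree closely, and both shrink as n grows.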