Foundations for statistical inference - Sampling distributions

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
area <- ames$Gr.Liv.Area                     # sq. feet
price <- ames$SalePrice                      # dollars

summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
hist(area, col = "lightseagreen",            # histogram
     main = "Area population histogram",
     xlab = "area (sq. ft.)")

boxplot(area, col = "lightseagreen",         # boxplot
        main = "Area population boxplot")

Exercise 1 - population distribution

Describe this population distribution.

The distribution is strongly right skewed: many high outliers create a long, thin right tail. The mean is 1500 sq. ft., above the median of 1442 sq. ft., which is consistent with that skew.
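A rough numerical check of right skew is to compare the mean with the median and to count points beyond the boxplot fences. The sketch below does this on a simulated log-normal population; `pop` and its parameters are assumptions chosen to resemble `area`, not the Ames data, so the block runs without the download.

```r
# Simulated right-skewed population; a hypothetical stand-in for `area`
set.seed(1)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# In a right-skewed distribution the mean is pulled above the median
c(mean = mean(pop), median = median(pop))

# Boxplot fences: Q1 - 1.5 * IQR and Q3 + 1.5 * IQR
lower_fence <- quantile(pop, 0.25) - 1.5 * IQR(pop)
upper_fence <- quantile(pop, 0.75) + 1.5 * IQR(pop)

# Count outliers on each side; right skew shows up as many more high ones
n_low  <- sum(pop < lower_fence)
n_high <- sum(pop > upper_fence)
c(low = n_low, high = n_high)
```

Replacing `pop` with `area` should give the same kind of comparison for the real population.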

Exercise 2 - Sample 1

Create a sample and describe its distribution.

set.seed(21)                              # set seed          
samp1 <- sample(area, 50)                 # create sample
summary(samp1)                            # sample's statistics
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     796    1106    1390    1520    1731    2978
hist(samp1, col = "slateblue",            # histogram                                          
     main = "Area Samp1 Histogram",
     xlab = "area (sq. ft.)")

boxplot(samp1, col = "slateblue",         # boxplot
        main = "Area Samp1 boxplot")

The sample mean is 1520, slightly higher than the population mean of 1500 but close to it. The distribution is somewhat right skewed, with a high outlier.
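The summary above is reproducible only because of the `set.seed()` call. A minimal sketch of that behavior follows, using a made-up vector `pop` (an assumption, not the Ames data) in place of `area`.

```r
pop <- seq(500, 3000, by = 5)    # hypothetical stand-in for `area`

set.seed(21)
first_draw <- sample(pop, 50)    # sample drawn right after seeding

set.seed(21)
second_draw <- sample(pop, 50)   # same seed, so the same 50 values

third_draw <- sample(pop, 50)    # no reseed, so the generator has moved on

identical(first_draw, second_draw)   # the reseeded draws match
identical(first_draw, third_draw)    # the unseeded draw almost surely differs
```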

Exercise 3 - Sample 2

set.seed(6)                               # different seed, different sample
samp2 <- sample(area, 50)                 # create second sample
summary(samp2)                            # sample's statistics
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     733    1106    1352    1497    1827    3493
hist(samp2, col = "turquoise",            # histogram
     main = "Area Samp2 Histogram",
     xlab = "area (sq. ft.)")

boxplot(samp2, col = "turquoise",         # boxplot
        main = "Area Samp2 boxplot")

How does the mean of samp2 compare with the mean of samp1?

The samp2 mean (1497) is slightly lower than the samp1 mean (1520). It is also closer to the population mean (1500) than the samp1 mean is.

Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

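The sample of size 1000. The standard error of the sample mean shrinks as the sample size grows (roughly in proportion to 1/sqrt(n)), so means of larger samples tend to fall closer to the population mean. A single small sample can still land closer by chance. In the two samples below, the mean for n = 1000 (1510) is closer to the population mean (1500) than the mean for n = 100 (1453).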

summary(sample(area, 100))                      # statistics for sample size 100
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     630    1091    1438    1453    1755    2500
summary(sample(area, 1000))                     # statistics for sample size 1000
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     407    1143    1456    1510    1764    3820
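A single pair of samples cannot show how accuracy depends on sample size, so the sketch below repeats the sampling many times and compares the typical error for n = 100 and n = 1000. The population `pop` and the helper `mean_abs_error()` are assumptions introduced for illustration; `pop` is simulated rather than taken from the Ames data.

```r
set.seed(2)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)  # hypothetical stand-in for `area`

# Average absolute distance between the sample mean and the population mean,
# estimated from `reps` samples of size n drawn without replacement
mean_abs_error <- function(n, reps = 2000) {
  means <- replicate(reps, mean(sample(pop, n)))
  mean(abs(means - mean(pop)))
}

err100  <- mean_abs_error(100)
err1000 <- mean_abs_error(1000)

c(n100 = err100, n1000 = err1000)   # the n = 1000 error should be clearly smaller
```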

Exercise 4 - sampling distribution

sample_means50 <- rep(NA, 5000)     # create vector for sampling

for(i in 1:5000){                   # collect means for sampling 
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   }

hist(sample_means50,                # histogram of the sampling  
     breaks = 25,
     col = "plum",
     main = "Sampling distribution")

summary(sample_means50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1281    1452    1498    1500    1547    1800

How many elements are there in sample_means50?

There are 5000 elements in the sample_means50 vector. Each is the mean of a sample of 50 areas taken from the area population.

Describe the sampling distribution, and be sure to specifically note its center.

The sampling distribution is approximately normal and roughly symmetric. Its center is very close to the population mean of 1500 sq. ft., within 1 or 2 sq. ft. depending on the run of the simulation.

Would you expect the distribution to change if we instead collected 50,000 sample means?

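The distribution would not change much. Its shape, center, and spread depend on the sample size (50), not on the number of sample means collected, so it would still be approximately normal and centered near the population mean. With 50,000 means the histogram would simply look smoother.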
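The effect of collecting more sample means can be checked directly. This sketch, run on a simulated population `pop` (an assumption standing in for `area`), compares the standard deviation of 5000 and 50,000 sample means, each from samples of size 50.

```r
set.seed(3)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)  # hypothetical stand-in for `area`

# Same sample size (50), different numbers of simulated sample means
means_5k  <- replicate(5000,  mean(sample(pop, 50)))
means_50k <- replicate(50000, mean(sample(pop, 50)))

# Center and spread should be nearly the same in both cases
c(mean_5k = mean(means_5k), mean_50k = mean(means_50k))
c(sd_5k = sd(means_5k), sd_50k = sd(means_50k))
```

More replications give a more precise picture of the same sampling distribution; narrowing the distribution itself would require a larger sample size.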

Exercise 5 - for loop

sample_means_small <- rep(0, 100)   # initialize small vector

for(i in 1:100){                         # collect means for 100 samples
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
   }

sample_means_small                       # look at the vector of sample means
##   [1] 1433.32 1606.66 1572.74 1496.42 1433.88 1453.48 1448.00 1542.00 1554.96
##  [10] 1537.92 1472.90 1491.74 1619.22 1446.66 1373.78 1541.90 1445.54 1502.92
##  [19] 1479.34 1448.38 1509.46 1518.56 1416.28 1457.22 1573.80 1426.44 1447.70
##  [28] 1506.98 1500.52 1417.06 1527.28 1538.06 1427.80 1429.36 1482.12 1466.06
##  [37] 1597.94 1435.04 1509.88 1522.00 1495.54 1463.12 1571.70 1396.36 1600.52
##  [46] 1439.70 1555.46 1496.98 1454.80 1574.60 1310.30 1505.44 1524.64 1486.72
##  [55] 1523.00 1557.02 1386.34 1509.52 1554.54 1480.14 1521.66 1523.42 1509.66
##  [64] 1527.04 1463.02 1556.38 1546.52 1396.88 1483.56 1434.46 1590.30 1456.24
##  [73] 1465.22 1506.84 1684.14 1392.64 1400.28 1504.24 1434.14 1582.40 1521.50
##  [82] 1658.28 1509.68 1584.62 1554.14 1471.40 1552.40 1349.28 1489.50 1481.60
##  [91] 1548.30 1537.58 1497.82 1499.66 1560.90 1514.92 1514.36 1395.62 1524.62
## [100] 1393.48

How many elements are there in this object called sample_means_small? What does each element represent?

There are 100 elements in sample_means_small. Each is the mean of a sample of 50 areas taken from the area population.
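The same vector can also be built without an explicit loop. As a sketch of one alternative, base R's `replicate()` evaluates an expression a given number of times and collects the results. The vector `pop` and the names `loop_means` and `rep_means` are assumptions introduced for this example.

```r
set.seed(4)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)  # hypothetical stand-in for `area`

# Version 1: preallocate, then fill inside a for loop
set.seed(10)
loop_means <- rep(NA, 100)
for (i in 1:100) {
  loop_means[i] <- mean(sample(pop, 50))
}

# Version 2: replicate() runs the expression 100 times and returns a vector
set.seed(10)
rep_means <- replicate(100, mean(sample(pop, 50)))

length(rep_means)                  # one mean per simulated sample
all.equal(loop_means, rep_means)   # same seed, same draws, same means
```

The loop makes each step explicit, while `replicate()` is shorter once the pattern is familiar.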

Exercise 6 - sample size and the sampling distribution

sample_means10 <- rep(NA, 5000)        # initialize two vectors of
sample_means100 <- rep(NA, 5000)       # different sizes

for(i in 1:5000){                      # create sampling distributions 
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

par(mfrow = c(3, 1))                   # set display rows, columns 

xlimits <- range(sample_means10)       # set x axes limits to the widest range

hist(sample_means10, breaks = 20, xlim = xlimits, col = "plum1")
hist(sample_means50, breaks = 20, xlim = xlimits, col = "plum2")
hist(sample_means100, breaks = 20, xlim = xlimits, col = "plum3")

When the sample size is larger, what happens to the center? What about the spread?

The center does not change: it still closely approximates the population mean. The spread of the sampling distribution gets narrower as the sample size increases.
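The narrowing can be put in numbers by comparing the standard deviation of each simulated sampling distribution with the usual approximation sd(population) / sqrt(n). The sketch below does this for a simulated population `pop` (an assumption, not the Ames data). Because sampling is without replacement, the simulated values are expected to sit slightly below the approximation.

```r
set.seed(5)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)  # hypothetical stand-in for `area`

sizes <- c(10, 50, 100)

# Standard deviation of 5000 simulated sample means for each sample size
simulated <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(pop, n))))
})

# Approximate standard error of the mean: population sd over sqrt(n)
theory <- sd(pop) / sqrt(sizes)

round(rbind(n = sizes, simulated = simulated, theory = theory), 1)
```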

Price

1. Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

library(scales)                  # provides dollar() for currency formatting

price1 <- sample(price, 50)      # create sample of price data

paste("Sample mean price", dollar(mean(price1)))   # get the sample mean
## [1] "Sample mean price $182,315"

The estimate of the population mean based on this sample is $182,315.

Simulate the sampling distribution for x̄_price by taking 5000 samples of size 50 from the population and computing 5000 sample means. Store these means in a vector called sample_means50.

sample_means50 <- rep(NA, 5000)        # initialize vector

for(i in 1:5000){                      # create sampling data
   samp <- sample(price, 50)
   sample_means50[i] <- mean(samp)
   }

Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be?

paste("Sampling (n=50) mean price", dollar(mean(sample_means50)))
## [1] "Sampling (n=50) mean price $181,032"

Based on the sampling distribution with n = 50, the estimated mean home price is $181,032.

Finally, calculate and report the population mean.

paste("Price population mean", dollar(mean(price)))
## [1] "Price population mean $180,796"

Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150.

sample_means150 <- rep(NA, 5000)   # initialize vector

for(i in 1:5000){                     # get means of samples
   samp <- sample(price, 150)
   sample_means150[i] <- mean(samp)
   }

Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50.

par(mfrow = c(2, 1))                            # set display rows, columns

hist(sample_means50, col = "darkseagreen",        
     main = "Price Sampling (n=50) Histogram",
     xlab = "price")

hist(sample_means150, col = "darkseagreen",
     main = "Price Sampling (n=150) Histogram",
     xlab = "price")

Both sampling distributions are approximately normal, with centers around $180,000. The spread for n = 150 is smaller than the spread for n = 50.

Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

paste("Sampling (n=150) mean price", dollar(mean(sample_means150)))
## [1] "Sampling (n=150) mean price $180,786"

Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

The sampling distribution with n = 150 has a smaller spread than the one with n = 50. We would prefer a sampling distribution with a small spread, because its estimates are more often close to the true value.
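"More often close to the true value" can be measured as the share of sample means that land within some tolerance of the population mean. The sketch below uses a simulated population `pop_price`, a helper `share_within()`, and a 5% tolerance, all of which are assumptions made for illustration and not part of the lab.

```r
set.seed(6)
pop_price <- rlnorm(3000, meanlog = 12, sdlog = 0.4)  # hypothetical stand-in for `price`

# Share of simulated sample means within `tol` (as a fraction) of the true mean
share_within <- function(n, tol = 0.05, reps = 5000) {
  means <- replicate(reps, mean(sample(pop_price, n)))
  mean(abs(means - mean(pop_price)) <= tol * mean(pop_price))
}

p50  <- share_within(50)
p150 <- share_within(150)

c(n50 = p50, n150 = p150)   # the larger sample size should give the larger share
```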