Lab 4a

load("more/ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
hist(area)

  1. Describe this population distribution.

This distribution is not normal: it lacks a symmetric “bell curve” shape and is skewed to the right. The mean (1500) is above the median (1442), and the right tail stretches to a maximum of 5642, far beyond the third quartile of 1743.

  1. Describe the distribution of this sample. How does it compare to the distribution of the population?

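The mean (1541) and median (1444) of the sample are close to the population’s (1500 and 1442). The shape differs, though: with only 50 observations the histogram is much rougher, and the sample covers a narrower range (796 to 2828 versus 334 to 5642), so it misses the population’s most extreme values.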

samp1 <- sample(area, 50)
summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
summary(samp1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     796    1198    1444    1541    1814    2828
hist(samp1)

  1. Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

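The means of samp1 (1541) and samp2 (1585) differ by about 44 square feet, and both sit somewhat above the population mean of 1500; the mean of samp3 (1547) is in the same range. Of the two larger samples, the one with 1000 observations should give the more accurate estimate, and here the mean of samp4 (1498) is indeed the closest to the population mean.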
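
Why the larger sample is expected to do better can be sketched with the standard error of the sample mean. This assumes simple random sampling from a population with standard deviation \(\sigma\) (a symbol introduced here, not used in the lab) and ignores the finite population correction:

\[
SE_{\bar{x}} = \frac{\sigma}{\sqrt{n}}, \qquad
\frac{SE_{n=50}}{SE_{n=1000}} = \sqrt{\frac{1000}{50}} = \sqrt{20} \approx 4.47
\]

Under these assumptions, a mean based on 1000 observations varies roughly 4.5 times less than one based on 50. Because sample() draws without replacement from a finite population, the actual gain for the sample of 1000 should be somewhat larger than this ratio suggests.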

samp2 <- sample(area, 50)
summary(samp1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     796    1198    1444    1541    1814    2828
summary(samp2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     816    1277    1532    1585    1843    2855
samp3 <- sample(area,100)
samp4 <- sample(area,1000)
summary(samp3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     641    1106    1382    1547    1760    5642
summary(samp4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     630    1102    1453    1498    1768    5642

  1. How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

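There are 5000 elements in the vector. The sampling distribution is centered at about 1500 (median 1499, mean 1500), which matches the population mean of 1500 and sits above the population median of 1442. I wouldn’t expect a meaningful change from collecting 50,000 sample means: the center and spread of the sampling distribution depend on the population and on the sample size of 50, not on how many samples are drawn. With a population of only 2930 homes, 5000 sample means already trace out the distribution well, so the histogram would only become smoother.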
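
A minimal sketch of this reasoning, assuming a synthetic right-skewed stand-in for area (the `pop` vector and the helper `sim_means` are hypothetical and not part of the lab). If the argument holds, the center and spread of the simulated sampling distribution should be nearly the same for 5,000 and 50,000 replications:

```r
set.seed(1)

# Hypothetical stand-in for `area`: 2930 right-skewed values (not the Ames data)
pop <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))

# Draw `reps` samples of size `n` without replacement and return their means
sim_means <- function(pop, n, reps) {
  replicate(reps, mean(sample(pop, n)))
}

means_5k  <- sim_means(pop, 50, 5000)
means_50k <- sim_means(pop, 50, 50000)

# Center and spread should be nearly identical across the two runs
c(pop_mean = mean(pop), mean_5k = mean(means_5k), mean_50k = mean(means_50k))
c(sd_5k = sd(means_5k), sd_50k = sd(means_50k))
```

The same comparison could be run on the real data by replacing `pop` with area.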

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
}
summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
summary(sample_means50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1261    1452    1499    1500    1548    1759
hist(sample_means50)

hist(sample_means50, breaks = 25)

  1. To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

There are 100 elements. Each one is the mean living area of one random sample of 50 homes drawn from area.

sample_means_small <- rep(0, 100)

for(i in 1:100){
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
}
print(sample_means_small)
##   [1] 1673.12 1508.70 1410.40 1477.98 1573.92 1468.52 1537.50 1544.62
##   [9] 1452.12 1583.76 1411.84 1630.64 1519.82 1558.08 1406.52 1518.58
##  [17] 1522.52 1562.40 1612.30 1559.52 1406.70 1462.06 1491.04 1504.38
##  [25] 1460.94 1517.48 1519.78 1506.36 1526.88 1542.32 1618.36 1434.70
##  [33] 1509.22 1552.76 1549.54 1673.34 1441.06 1426.84 1457.94 1563.66
##  [41] 1446.04 1533.00 1316.36 1531.62 1512.40 1540.16 1549.12 1399.96
##  [49] 1549.98 1565.32 1515.36 1454.56 1437.08 1476.62 1387.14 1466.98
##  [57] 1507.44 1474.30 1632.76 1416.86 1507.56 1457.90 1377.18 1494.30
##  [65] 1476.74 1448.02 1590.96 1345.06 1536.68 1569.88 1511.00 1472.64
##  [73] 1510.84 1564.94 1434.22 1469.96 1444.46 1550.06 1455.10 1487.48
##  [81] 1491.04 1435.04 1523.10 1534.80 1384.32 1548.66 1506.24 1524.20
##  [89] 1434.20 1471.00 1502.48 1493.66 1595.24 1531.46 1366.52 1528.06
##  [97] 1607.70 1328.38 1500.20 1420.16
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

  1. When the sample size is larger, what happens to the center? What about the spread?

As the sample size gets larger, the center stays in roughly the same place, near the population mean, while the spread narrows: the sample means concentrate more tightly around the center.
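
The pattern can be checked numerically with a short sketch. It assumes a synthetic stand-in population (`pop` is hypothetical, not the Ames data) and compares the simulated spread for each sample size with the textbook standard error, including the finite population correction that applies because sample() draws without replacement:

```r
set.seed(2)

# Hypothetical stand-in for `area` (not the Ames data)
pop <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))
N <- length(pop)

sizes <- c(10, 50, 100)

# For each sample size, simulate 5000 sample means
sims <- lapply(sizes, function(n) replicate(5000, mean(sample(pop, n))))

# Population standard deviation (divide by N, not N - 1)
sigma <- sqrt(mean((pop - mean(pop))^2))

result <- data.frame(
  n         = sizes,
  center    = sapply(sims, mean),
  spread    = sapply(sims, sd),
  # Theoretical standard error with finite population correction
  theory_se = sigma / sqrt(sizes) * sqrt((N - sizes) / (N - 1))
)
result
```

In this sketch the `center` column should stay close to mean(pop) for every sample size, while `spread` shrinks as n grows and tracks `theory_se`.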


On your own

So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.

1 Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

From my random sample of 50 observations, the best point estimate of the population mean is the sample mean, $196,600.

sample_price <- sample(price,50)
summary(price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12790  129500  160000  180800  213500  755000
summary(sample_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   40000  146000  173500  196600  217000  437200

2 Since you have access to the population, simulate the sampling distribution for \(\bar{x}_{price}\) by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

The sampling distribution looks almost like a perfect bell curve (i.e., approximately normal). From the histogram, I would guess the mean home price is about $180,000. The population mean is actually $180,800.
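
One rough numeric check of the “bell curve” impression is to see how many sample means fall within one and two standard deviations of their center; for a normal distribution these shares are about 68% and 95%. The sketch below assumes a synthetic right-skewed stand-in for price (`pop` is hypothetical, not the Ames data):

```r
set.seed(3)

# Hypothetical stand-in for `price`: 2930 right-skewed values (not the Ames data)
pop <- round(rlnorm(2930, meanlog = 12, sdlog = 0.4))

# 5000 sample means from samples of size 50
means50 <- replicate(5000, mean(sample(pop, 50)))

# Standardize the sample means and measure coverage
z <- (means50 - mean(means50)) / sd(means50)
coverage <- c(within_1sd = mean(abs(z) < 1), within_2sd = mean(abs(z) < 2))
coverage
```

Running the last three statements on the real sample_means50 vector would give the corresponding shares for the Ames prices.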

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}
summary(sample_means50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  145200  173200  180400  181000  188400  222000
hist(sample_means50)

summary(price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12790  129500  160000  180800  213500  755000

3 Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

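This sampling distribution is also bell-shaped and centered in nearly the same place (mean $180,700 versus $181,000 for a sample size of 50), but it is noticeably narrower: the sample means range from $161,000 to $204,100, compared with $145,200 to $222,000 for a sample size of 50. I would guess the mean is slightly above $180,000.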

sample_means150 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}
summary(sample_means150)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  161000  176400  180600  180700  184900  204100
summary(sample_means50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  145200  173200  180400  181000  188400  222000
par(mfrow = c(2, 1))

xlimits2 <- range(sample_means50)

hist(sample_means50, breaks = 20, xlim = xlimits2)
hist(sample_means150, breaks = 20, xlim = xlimits2)

4 Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

The sampling distribution from 3, based on samples of 150 observations, has the smaller spread. We would prefer a distribution with a small spread (less variability), because its estimates land close to the true value more often.
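
As a quick arithmetic check, the quartiles printed by summary() above can be used to compare the two spreads. The sketch assumes the standard error is proportional to \(1/\sqrt{n}\) and ignores the finite population correction, so the predicted ratio is only approximate:

```r
# Quartiles copied from the summary() output above
iqr50  <- 188400 - 173200   # sample_means50:  3rd Qu. - 1st Qu.
iqr150 <- 184900 - 176400   # sample_means150: 3rd Qu. - 1st Qu.

# How many times wider the middle half is for n = 50 than for n = 150
observed_ratio <- iqr50 / iqr150

# Ratio predicted by a standard error proportional to 1 / sqrt(n)
predicted_ratio <- sqrt(150 / 50)

c(iqr50 = iqr50, iqr150 = iqr150,
  observed = observed_ratio, predicted = predicted_ratio)
```

The interquartile ranges are $15,200 and $8,500, a ratio of about 1.79, which is close to the predicted value of about 1.73.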