load("more/ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     334    1126    1442    1500    1743    5642
hist(area)
The distribution is not normal: it lacks a symmetric bell shape and is skewed to the right (the mean, 1500, exceeds the median, 1442, and the maximum is 5642).
The sample’s mean (1541) and median (1444) are close to the population’s (1500 and 1442). However, the shape of the distribution differs, as the sample distribution has fatter tails.
samp1 <- sample(area, 50)
summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     334    1126    1442    1500    1743    5642
summary(samp1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     796    1198    1444    1541    1814    2828
hist(samp1)
Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
The means of samp1 (1541), samp2 (1585), and samp3 (1547) are fairly close to one another. However, samp4, with 1000 observations, gives the most accurate estimate: its mean of 1498 is closest to the population mean of 1500.
samp2 <- sample(area, 50)
summary(samp1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     796    1198    1444    1541    1814    2828
summary(samp2)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     816    1277    1532    1585    1843    2855
samp3 <- sample(area, 100)
samp4 <- sample(area, 1000)
summary(samp3)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     641    1106    1382    1547    1760    5642
summary(samp4)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     630    1102    1453    1498    1768    5642
How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
There are 5000 elements in the vector. The sampling distribution is centered at about 1500 (median 1499, mean 1500), essentially equal to the population mean. I wouldn’t expect a significant change from collecting 50,000 sample means: the population has only 2930 observations, and 5000 sample means already appear to have converged on the population mean.
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     334    1126    1442    1500    1743    5642
summary(sample_means50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1261    1452    1499    1500    1548    1759
hist(sample_means50)
hist(sample_means50, breaks = 25)
Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?
There are 100 elements; each one is the mean of a random sample of 50 observations drawn from area.
sample_means_small <- rep(0, 100)
for(i in 1:100){
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
}
print(sample_means_small)
## [1] 1673.12 1508.70 1410.40 1477.98 1573.92 1468.52 1537.50 1544.62
## [9] 1452.12 1583.76 1411.84 1630.64 1519.82 1558.08 1406.52 1518.58
## [17] 1522.52 1562.40 1612.30 1559.52 1406.70 1462.06 1491.04 1504.38
## [25] 1460.94 1517.48 1519.78 1506.36 1526.88 1542.32 1618.36 1434.70
## [33] 1509.22 1552.76 1549.54 1673.34 1441.06 1426.84 1457.94 1563.66
## [41] 1446.04 1533.00 1316.36 1531.62 1512.40 1540.16 1549.12 1399.96
## [49] 1549.98 1565.32 1515.36 1454.56 1437.08 1476.62 1387.14 1466.98
## [57] 1507.44 1474.30 1632.76 1416.86 1507.56 1457.90 1377.18 1494.30
## [65] 1476.74 1448.02 1590.96 1345.06 1536.68 1569.88 1511.00 1472.64
## [73] 1510.84 1564.94 1434.22 1469.96 1444.46 1550.06 1455.10 1487.48
## [81] 1491.04 1435.04 1523.10 1534.80 1384.32 1548.66 1506.24 1524.20
## [89] 1434.20 1471.00 1502.48 1493.66 1595.24 1531.46 1366.52 1528.06
## [97] 1607.70 1328.38 1500.20 1420.16
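The loops above preallocate a vector and fill it one sample mean at a time. As a minimal sketch of an equivalent formulation, base R's replicate() evaluates an expression repeatedly and collects the results. The data here are simulated: area_sim is a hypothetical stand-in for area, and the log-normal parameters are arbitrary choices to give a right-skewed population of similar size.

```r
# Hypothetical stand-in for the Ames living-area data; values are simulated.
set.seed(1)
area_sim <- round(rlnorm(2930, meanlog = 7.26, sdlog = 0.33))

# Loop version, as in the lab: preallocate, then fill one mean per iteration.
means_loop <- rep(NA, 100)
for (i in 1:100) {
  means_loop[i] <- mean(sample(area_sim, 50))
}

# replicate() version: evaluates the expression 100 times and
# returns the 100 sample means as a numeric vector.
means_rep <- replicate(100, mean(sample(area_sim, 50)))

length(means_loop)  # 100
length(means_rep)   # 100
```

Both vectors hold 100 sample means. Their values differ because each call to sample() makes a fresh random draw.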
sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)
As the sample size increases, the sample means concentrate more tightly around the center, which stays near the population mean, and the spread narrows.
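The narrowing spread can be checked against the usual standard-error formula. The sketch below assumes a simulated right-skewed population (pop, with arbitrary log-normal parameters) in place of the Ames data, and a hypothetical helper, se_compare. It compares the observed standard deviation of 5000 sample means with sigma / sqrt(n), multiplied by the finite-population correction that applies because sample() draws without replacement.

```r
set.seed(42)
N <- 2930
# Simulated, right-skewed population; not the real Ames data.
pop <- rlnorm(N, meanlog = 7.26, sdlog = 0.33)
# Population standard deviation (divides by N, not N - 1).
sigma <- sqrt(mean((pop - mean(pop))^2))

# Hypothetical helper: observed vs. theoretical spread of the sample mean.
se_compare <- function(n, reps = 5000) {
  means <- replicate(reps, mean(sample(pop, n)))
  c(n = n,
    observed = sd(means),
    theory = sigma / sqrt(n) * sqrt((N - n) / (N - 1)))
}

# One row per sample size, with columns n, observed, theory.
result <- t(sapply(c(10, 50, 100), se_compare))
print(round(result, 1))
```

Under these assumptions the observed and theoretical columns should agree closely, and both shrink as n grows.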
So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.
1 Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?
From my random sample of 50 observations, the best point estimate of the population mean is the sample mean, $196,600.
sample_price <- sample(price,50)
summary(price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   12790  129500  160000  180800  213500  755000
summary(sample_price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   40000  146000  173500  196600  217000  437200
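Because sample() draws at random, every run of the analysis produces a different sample and therefore a different point estimate. One way to keep a reported estimate reproducible is to fix the random seed before sampling. This is a minimal sketch: price_sim is a hypothetical, simulated stand-in for price, and the seed values are arbitrary.

```r
# Hypothetical stand-in for ames$SalePrice; values are simulated.
set.seed(2024)
price_sim <- round(rlnorm(2930, meanlog = 12, sdlog = 0.4))

# Fixing the seed makes the "random" sample repeatable.
set.seed(123)  # arbitrary seed
first_draw <- sample(price_sim, 50)

set.seed(123)  # same seed again
second_draw <- sample(price_sim, 50)

identical(first_draw, second_draw)  # TRUE
mean(first_draw)                    # the same point estimate on every run
```

With the seed set once at the top of the analysis, the numbers quoted in the text stay in step with the printed output.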
2 Since you have access to the population, simulate the sampling distribution for \(\bar{x}_{price}\) by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
The sampling distribution looks almost perfectly bell-shaped (i.e., approximately normal). From the histogram, I would guess the mean home price is about $180,000. The population mean is actually $180,800.
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}
summary(sample_means50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  145200  173200  180400  181000  188400  222000
hist(sample_means50)
summary(price)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   12790  129500  160000  180800  213500  755000
3 Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
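I would guess the mean is slightly above $180,000; the sampling distribution has a mean of $180,700. It is also narrower than the one for samples of size 50: the sample means range from $161,000 to $204,100, compared with $145,200 to $222,000.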
sample_means150 <- rep(NA, 5000)
for(i in 1:5000){
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}
summary(sample_means150)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  161000  176400  180600  180700  184900  204100
summary(sample_means50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##  145200  173200  180400  181000  188400  222000
par(mfrow = c(2, 1))
xlimits2 <- range(sample_means50)
hist(sample_means50, breaks = 20, xlim = xlimits2)
hist(sample_means150, breaks = 20, xlim = xlimits2)
4 Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?
The sampling distribution for samples of size 150 has the smaller spread. We would prefer a distribution with a small spread, because less variability means estimates land close to the true value more often.
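As a rough check on this answer, the standard-error formula for a mean predicts how much the spread should shrink. It is applied here with a finite-population correction, on the assumption that samples are drawn without replacement from the N = 2930 homes mentioned earlier in the report:

```latex
\[
\mathrm{SE}(\bar{x}_n) = \frac{\sigma}{\sqrt{n}}\sqrt{\frac{N-n}{N-1}},
\qquad
\frac{\mathrm{SE}(\bar{x}_{150})}{\mathrm{SE}(\bar{x}_{50})}
= \sqrt{\frac{50}{150}}\,\sqrt{\frac{2930-150}{2930-50}}
\approx 0.577 \times 0.983 \approx 0.57
\]
```

The summaries above are consistent with this: the interquartile range of sample_means150 is 184900 - 176400 = 8500, against 188400 - 173200 = 15200 for sample_means50, a ratio of about 0.56.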