We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames is recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.
load("more/ames.RData")We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice). To save some effort throughout the lab, create two variables with short names that represent these two variables.
area <- ames$Gr.Liv.Area
price <- ames$SalePriceLet’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.
summary(area)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area)set.seed(42)
samp1 <- sample(area, 50)
samp1## [1] 1576 1124 1764 1838 1344 1694 1992 1002 1056 943 1158 2140 2287 1582
## [15] 1362 1057 1784 1434 1073 1088 1348 1365 641 1208 1172 1656 1663 1150
## [29] 899 1248 879 1552 1992 1141 1187 1456 1173 864 1040 2013 2322 1436
## [43] 1479 1552 1152 1596 1216 1573 1939 1795
summary(samp1)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 641 1143 1364 1420 1661 2322
hist(samp1)summary(area)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area)mean(samp1)## [1] 1420.1
samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?set.seed(40)
samp2 <- sample(area, 50)
mean(samp2)## [1] 1434.64
mean(samp1)## [1] 1420.1
samp3<- sample(area,100)
samp4 <- sample(area,1000)
mean(samp3)## [1] 1508.87
mean(samp4)## [1] 1512.161
sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
}
hist(sample_means50)If you would like to adjust the bin width of your histogram to show a little more detail, you can do so by changing the breaks argument.
hist(sample_means50, breaks = 25)sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?sample_means50 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 50)
sample_means50[i] <- mean(samp)
# print(i)
}The body of the for loop is the part inside the curly braces, and this set of code is run for each value of i. Here, on every loop, we take a random sample of size 50 from area, take its mean, and store it as the \(i\)th element of sample_means50.
In order to display that this is really happening, we asked R to print i at each iteration. This line of code is optional and is only used for displaying what’s going on while the for loop is running.
The for loop allows us to not just run the code 5000 times, but to neatly package the results, element by element, into the empty vector that we initialized at the outset.
sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?set.seed(40)
sample_means_small <- rep(0,100)
sample(area,50)## [1] 1248 747 986 1803 1471 520 925 1496 1792 1322 960 1029 1440 1461
## [15] 2049 1670 2078 498 1120 2046 1588 1602 1908 1109 768 2362 1604 1232
## [29] 1082 1298 1671 1464 1851 1324 1242 2340 988 1330 1736 1149 1040 1852
## [43] 1258 1310 1453 1003 1884 2327 1812 1484
for (i in 1:100){
samp <- sample(area,50)
sample_means_small[i] <- mean(samp)
}
sample_means_small## [1] 1482.10 1447.46 1653.28 1416.52 1503.90 1600.72 1442.80 1524.26
## [9] 1611.54 1555.70 1553.78 1604.82 1490.56 1553.30 1509.76 1409.88
## [17] 1433.50 1352.20 1479.50 1417.98 1583.80 1431.96 1640.26 1479.94
## [25] 1548.28 1471.94 1483.36 1473.06 1523.66 1570.10 1353.12 1487.46
## [33] 1442.70 1562.64 1583.50 1567.36 1661.56 1608.14 1502.18 1480.34
## [41] 1483.24 1531.88 1532.88 1350.62 1472.38 1567.84 1410.82 1429.04
## [49] 1407.02 1644.90 1411.38 1506.46 1480.64 1461.78 1553.24 1495.86
## [57] 1634.32 1417.78 1541.88 1547.68 1471.98 1462.50 1481.88 1605.84
## [65] 1539.02 1607.32 1449.26 1670.14 1586.32 1464.44 1574.50 1534.82
## [73] 1463.50 1369.88 1514.00 1458.16 1446.64 1510.04 1472.82 1614.26
## [81] 1401.76 1596.28 1469.62 1560.30 1465.20 1682.34 1535.88 1419.58
## [89] 1606.52 1444.92 1607.46 1437.78 1581.32 1506.50 1602.22 1459.54
## [97] 1497.08 1408.38 1457.10 1505.30
hist(sample_means50)sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)
for(i in 1:5000){
samp <- sample(area, 10)
sample_means10[i] <- mean(samp)
samp <- sample(area, 100)
sample_means100[i] <- mean(samp)
}par(mfrow = c(3, 1))
xlimits <- range(sample_means10)
hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)The first command specifies that you’d like to divide the plotting area into 3 rows and 1 column of plots (to return to the default setting of plotting one at a time, use par(mfrow = c(1, 1))). The breaks argument specifies the number of bins used in constructing the histogram. The xlim argument specifies the range of the x-axis of the histogram, and by setting it equal to xlimits for each histogram, we ensure that all three histograms will be plotted with the same limits on the x-axis.
So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.
price. Using this sample, what is your best point estimate of the population mean?set.seed(40)
samp <- sample(price, 50)
mean(samp) ## [1] 174593.1
sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.sample_means_50 <- rep(NA,5000)
for (i in 1:5000){
samp <- sample(price,50)
sample_means_50[i] <- mean(samp)
}
str(sample_means_50)## num [1:5000] 167796 176265 211458 163229 179243 ...
summary(sample_means_50)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 143472 172691 180110 180568 188040 235680
hist(sample_means_50)actual_pop_mean=mean(price)
actual_pop_mean## [1] 180796.1
sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?sample_means_150 <- rep(NA,5000)
for (i in 1:5000){
samp <- sample(price,150)
sample_means_150[i] <- mean(samp)
}
str(sample_means_150)## num [1:5000] 171947 181518 183157 184091 186958 ...
hist(sample_means_50, breaks = 50)hist(sample_means_150, breaks = 50)summary(sample_means_50)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 143472 172691 180110 180568 188040 235680
summary(sample_means_150)## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 161713 176270 180515 180765 184909 211040
I would guess around 180000
The larger sample distribution has a more narrow range. If we want accruate estimates, a small spread is preferred. A small spread means that our sample of means are closer in accuracy to a true population mean
This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.