The data

We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames is recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.

load("more/ames.RData")

We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice). To save some effort throughout the lab, create two variables with short names that represent these two variables.

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

Let’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.

summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
hist(area)

  1. Describe this population distribution.
set.seed(42)
samp1 <- sample(area, 50)
samp1
##  [1] 1576 1124 1764 1838 1344 1694 1992 1002 1056  943 1158 2140 2287 1582
## [15] 1362 1057 1784 1434 1073 1088 1348 1365  641 1208 1172 1656 1663 1150
## [29]  899 1248  879 1552 1992 1141 1187 1456 1173  864 1040 2013 2322 1436
## [43] 1479 1552 1152 1596 1216 1573 1939 1795
  1. Describe the distribution of this sample. How does it compare to the distribution of the population?
summary(samp1)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     641    1143    1364    1420    1661    2322
hist(samp1)

summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
hist(area)

mean(samp1)
## [1] 1420.1
  1. Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
set.seed(40)
samp2 <- sample(area, 50)
mean(samp2)
## [1] 1434.64
mean(samp1)
## [1] 1420.1
samp3<- sample(area,100)
samp4 <- sample(area,1000)
mean(samp3)
## [1] 1508.87
mean(samp4)
## [1] 1512.161
sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   }

hist(sample_means50)

If you would like to adjust the bin width of your histogram to show a little more detail, you can do so by changing the breaks argument.

hist(sample_means50, breaks = 25)

  1. How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
 #  print(i)
   }

The body of the for loop is the part inside the curly braces, and this set of code is run for each value of i. Here, on every loop, we take a random sample of size 50 from area, take its mean, and store it as the \(i\)th element of sample_means50.

In order to display that this is really happening, we asked R to print i at each iteration. This line of code is optional and is only used for displaying what’s going on while the for loop is running.

The for loop allows us to not just run the code 5000 times, but to neatly package the results, element by element, into the empty vector that we initialized at the outset.

  1. To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?
set.seed(40)

sample_means_small <- rep(0,100)
sample(area,50)
##  [1] 1248  747  986 1803 1471  520  925 1496 1792 1322  960 1029 1440 1461
## [15] 2049 1670 2078  498 1120 2046 1588 1602 1908 1109  768 2362 1604 1232
## [29] 1082 1298 1671 1464 1851 1324 1242 2340  988 1330 1736 1149 1040 1852
## [43] 1258 1310 1453 1003 1884 2327 1812 1484
for (i in 1:100){
    samp <- sample(area,50)
    sample_means_small[i] <- mean(samp)
} 
sample_means_small
##   [1] 1482.10 1447.46 1653.28 1416.52 1503.90 1600.72 1442.80 1524.26
##   [9] 1611.54 1555.70 1553.78 1604.82 1490.56 1553.30 1509.76 1409.88
##  [17] 1433.50 1352.20 1479.50 1417.98 1583.80 1431.96 1640.26 1479.94
##  [25] 1548.28 1471.94 1483.36 1473.06 1523.66 1570.10 1353.12 1487.46
##  [33] 1442.70 1562.64 1583.50 1567.36 1661.56 1608.14 1502.18 1480.34
##  [41] 1483.24 1531.88 1532.88 1350.62 1472.38 1567.84 1410.82 1429.04
##  [49] 1407.02 1644.90 1411.38 1506.46 1480.64 1461.78 1553.24 1495.86
##  [57] 1634.32 1417.78 1541.88 1547.68 1471.98 1462.50 1481.88 1605.84
##  [65] 1539.02 1607.32 1449.26 1670.14 1586.32 1464.44 1574.50 1534.82
##  [73] 1463.50 1369.88 1514.00 1458.16 1446.64 1510.04 1472.82 1614.26
##  [81] 1401.76 1596.28 1469.62 1560.30 1465.20 1682.34 1535.88 1419.58
##  [89] 1606.52 1444.92 1607.46 1437.78 1581.32 1506.50 1602.22 1459.54
##  [97] 1497.08 1408.38 1457.10 1505.30
hist(sample_means50)

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

The first command specifies that you’d like to divide the plotting area into 3 rows and 1 column of plots (to return to the default setting of plotting one at a time, use par(mfrow = c(1, 1))). The breaks argument specifies the number of bins used in constructing the histogram. The xlim argument specifies the range of the x-axis of the histogram, and by setting it equal to xlimits for each histogram, we ensure that all three histograms will be plotted with the same limits on the x-axis.

  1. When the sample size is larger, what happens to the center? What about the spread?

On your own

So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.

set.seed(40)


samp <- sample(price, 50)
mean(samp) 
## [1] 174593.1

Answer 174593.1

  1. Since you have access to the population, simulate the sampling distribution for \(\bar{x}_{price}\) by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
sample_means_50 <- rep(NA,5000)

for (i in 1:5000){
    samp <- sample(price,50)
    sample_means_50[i] <- mean(samp)
} 
str(sample_means_50)
##  num [1:5000] 167796 176265 211458 163229 179243 ...
summary(sample_means_50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  143472  172691  180110  180568  188040  235680
hist(sample_means_50)

actual_pop_mean=mean(price)
actual_pop_mean
## [1] 180796.1
  1. Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
sample_means_150 <- rep(NA,5000)

for (i in 1:5000){
    samp <- sample(price,150)
    sample_means_150[i] <- mean(samp)
} 
str(sample_means_150)
##  num [1:5000] 171947 181518 183157 184091 186958 ...
hist(sample_means_50, breaks = 50)

hist(sample_means_150, breaks = 50)

summary(sample_means_50)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  143472  172691  180110  180568  188040  235680
summary(sample_means_150)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  161713  176270  180515  180765  184909  211040

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.