Foundations for statistical inference

The data

We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames is recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.

load("more/ames.RData")

We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice). To save some effort throughout the lab, create two variables with short names that represent these two variables.

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

Let’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

Describe this population distribution.

The Distribution is somewhat rightward skewed but overall normal

set.seed(42)
samp1 <- sample(area, 50)
samp1

##  [1] 1576 1124 1764 1838 1344 1694 1992 1002 1056  943 1158 2140 2287 1582
## [15] 1362 1057 1784 1434 1073 1088 1348 1365  641 1208 1172 1656 1663 1150
## [29]  899 1248  879 1552 1992 1141 1187 1456 1173  864 1040 2013 2322 1436
## [43] 1479 1552 1152 1596 1216 1573 1939 1795

Describe the distribution of this sample. How does it compare to the distribution of the population?

summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     641    1143    1364    1420    1661    2322

hist(samp1)

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

My sample is close to normal but looks bimodal, it has longer tails on both sides and is low in its center. the IQR meidan and mean are similar

mean(samp1)

## [1] 1420.1

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

set.seed(40)
samp2 <- sample(area, 50)
mean(samp2)

## [1] 1434.64

mean(samp1)

## [1] 1420.1

samp3<- sample(area,100)
samp4 <- sample(area,1000)
mean(samp3)

## [1] 1508.87

mean(samp4)

## [1] 1512.161

As N increases I would assume it would get closer to the population Average.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   }

hist(sample_means50)

If you would like to adjust the bin width of your histogram to show a little more detail, you can do so by changing the breaks argument.

hist(sample_means50, breaks = 25)

How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

5000
The distribution is Normal, and the center is at 1500
at 50,000 samples I would expect the range to narrow more and the height of the center at 1500 to raise

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
 #  print(i)
   }

The body of the for loop is the part inside the curly braces, and this set of code is run for each value of i. Here, on every loop, we take a random sample of size 50 from area, take its mean, and store it as the \(i\)^th element of sample_means50.

In order to display that this is really happening, we asked R to print i at each iteration. This line of code is optional and is only used for displaying what’s going on while the for loop is running.

The for loop allows us to not just run the code 5000 times, but to neatly package the results, element by element, into the empty vector that we initialized at the outset.

To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

set.seed(40)

sample_means_small <- rep(0,100)
sample(area,50)

##  [1] 1248  747  986 1803 1471  520  925 1496 1792 1322  960 1029 1440 1461
## [15] 2049 1670 2078  498 1120 2046 1588 1602 1908 1109  768 2362 1604 1232
## [29] 1082 1298 1671 1464 1851 1324 1242 2340  988 1330 1736 1149 1040 1852
## [43] 1258 1310 1453 1003 1884 2327 1812 1484

for (i in 1:100){
    samp <- sample(area,50)
    sample_means_small[i] <- mean(samp)
} 
sample_means_small

##   [1] 1482.10 1447.46 1653.28 1416.52 1503.90 1600.72 1442.80 1524.26
##   [9] 1611.54 1555.70 1553.78 1604.82 1490.56 1553.30 1509.76 1409.88
##  [17] 1433.50 1352.20 1479.50 1417.98 1583.80 1431.96 1640.26 1479.94
##  [25] 1548.28 1471.94 1483.36 1473.06 1523.66 1570.10 1353.12 1487.46
##  [33] 1442.70 1562.64 1583.50 1567.36 1661.56 1608.14 1502.18 1480.34
##  [41] 1483.24 1531.88 1532.88 1350.62 1472.38 1567.84 1410.82 1429.04
##  [49] 1407.02 1644.90 1411.38 1506.46 1480.64 1461.78 1553.24 1495.86
##  [57] 1634.32 1417.78 1541.88 1547.68 1471.98 1462.50 1481.88 1605.84
##  [65] 1539.02 1607.32 1449.26 1670.14 1586.32 1464.44 1574.50 1534.82
##  [73] 1463.50 1369.88 1514.00 1458.16 1446.64 1510.04 1472.82 1614.26
##  [81] 1401.76 1596.28 1469.62 1560.30 1465.20 1682.34 1535.88 1419.58
##  [89] 1606.52 1444.92 1607.46 1437.78 1581.32 1506.50 1602.22 1459.54
##  [97] 1497.08 1408.38 1457.10 1505.30

Each element is the mean of a sample of size 50 taken from the area vector

hist(sample_means50)

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}

par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

The first command specifies that you’d like to divide the plotting area into 3 rows and 1 column of plots (to return to the default setting of plotting one at a time, use par(mfrow = c(1, 1))). The breaks argument specifies the number of bins used in constructing the histogram. The xlim argument specifies the range of the x-axis of the histogram, and by setting it equal to xlimits for each histogram, we ensure that all three histograms will be plotted with the same limits on the x-axis.

When the sample size is larger, what happens to the center? What about the spread?

The cneter becomes taller and the spread becomes thinner.

On your own

So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.

Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

set.seed(40)


samp <- sample(price, 50)
mean(samp)

## [1] 174593.1

Answer 174593.1

Since you have access to the population, simulate the sampling distribution for \(\bar{x}_{price}\) by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

sample_means_50 <- rep(NA,5000)

for (i in 1:5000){
    samp <- sample(price,50)
    sample_means_50[i] <- mean(samp)
} 
str(sample_means_50)

##  num [1:5000] 167796 176265 211458 163229 179243 ...

summary(sample_means_50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  143472  172691  180110  180568  188040  235680

hist(sample_means_50)

actual_pop_mean=mean(price)
actual_pop_mean

## [1] 180796.1

The distribution looks normal
It appears the mean is around 180000
the actual population mean is 180796

Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

sample_means_150 <- rep(NA,5000)

for (i in 1:5000){
    samp <- sample(price,150)
    sample_means_150[i] <- mean(samp)
} 
str(sample_means_150)

##  num [1:5000] 171947 181518 183157 184091 186958 ...

hist(sample_means_50, breaks = 50)

hist(sample_means_150, breaks = 50)

summary(sample_means_50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  143472  172691  180110  180568  188040  235680

summary(sample_means_150)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  161713  176270  180515  180765  184909  211040

the distributions of 50 and 150 are very similar
I would guess around 180000
Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?
The larger sample distribution has a more narrow range. If we want accruate estimates, a small spread is preferred. A small spread means that our sample of means are closer in accuracy to a true population mean

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported. This lab was written for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel.

Foundations for statistical inference - Confidence intervals

The data

On your own

Answer 174593.1