606 Lab4a Foundations for statistical inference

The data

load("more/ames.RData")

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

Let’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

Describe this population distribution.

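Answer: The population distribution is unimodal and right-skewed rather than normal: the mean (1500) exceeds the median (1442), and the maximum (5642) lies much farther from the median than the minimum (334) does. The middle 50% of homes have a living area between 1126 and 1743 square feet.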
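The skew described above can be checked numerically by comparing the mean with the median and reading the middle 50% from the quartiles. The following is a minimal sketch, not the lab's own code: because the Ames file is not bundled here, `pop` is a hypothetical, synthetic right-skewed stand-in for `area`, and its size and log-normal parameters are assumptions chosen only for illustration.

```r
# Hypothetical stand-in for `area`: a synthetic right-skewed population.
# The size (2930) and the log-normal parameters are illustrative assumptions.
set.seed(1)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# In a right-skewed distribution the mean is pulled above the median.
skew_gap <- mean(pop) - median(pop)

# The first and third quartiles bound the middle 50% of the values.
middle_half <- quantile(pop, c(0.25, 0.75))
iqr_width <- IQR(pop)

print(skew_gap)
print(middle_half)
print(iqr_width)
```

With the real data, replacing `pop` by `area` would give the same kind of check: a positive gap between mean and median is consistent with a right skew.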

The unknown sampling distribution

If we are interested in estimating the mean living area in Ames based on a sample, we can use the following command to survey the population.

samp1 <- sample(area, 50)

Describe the distribution of this sample. How does it compare to the distribution of the population?

Answer: The sample has a mean of 1554 and a median of 1474, and its middle 50% lies between 1154 and 1788 square feet. Its shape is similar to the population's: both are unimodal and right-skewed. The sample mean is higher than the population mean (1500), which reflects ordinary sampling variability in a random sample of only 50 homes rather than bias. The interquartile range is also slightly wider (634 versus 617), so the middle of the sample distribution is a little flatter than the population's.

summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     672    1154    1474    1554    1788    4316

hist(samp1, breaks=10)

If we’re interested in estimating the average living area in homes in Ames using the sample, our best single guess is the sample mean.

mean(samp1)

## [1] 1553.64

Depending on which 50 homes you selected, your estimate could be a bit above or a bit below the true population mean of 1499.69 square feet. In general, though, the sample mean turns out to be a pretty good estimate of the average living area, and we were able to get it by sampling less than 3% of the population.

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

Answer: (1) samp2 has a mean of 1431, which is farther from the population mean (1499.69) than the samp1 mean of 1553.64, so in this draw samp2 is the less accurate estimate. (2) Of the two larger samples, samp1000 provides the more accurate estimate of the population mean: its mean is about 0.05% away from the population mean, compared with about 0.32% for samp100. In general, larger samples give more accurate estimates because the sample mean varies less from sample to sample.

samp2 <- sample(area, 50)
summary(samp2)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     492    1142    1405    1431    1682    3627

hist(samp2)

samp100 <- sample(area, 100)
summary(samp100)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     640    1137    1491    1505    1712    2956

hist(samp100)

samp1000 <- sample(area, 1000)
summary(samp1000)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     407    1103    1442    1499    1747    5095

hist(samp1000)

1-(abs(mean(samp100) - mean(area))/mean(area))

## [1] 0.996773

1-(abs(mean(samp1000) - mean(area))/mean(area))

## [1] 0.9994709
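The comparison above rests on one sample of each size, so it could have come out differently on another draw. A more general way to see why larger samples tend to be more accurate is the standard error of the sample mean, roughly sigma / sqrt(n). The sketch below is an illustration under stated assumptions: `pop` is a hypothetical synthetic stand-in for `area`, and the formula ignores the finite-population correction that applies when sampling without replacement.

```r
# Hypothetical synthetic stand-in for `area` (parameters are assumptions).
set.seed(2)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Population standard deviation (divide by N, not N - 1).
sigma <- sqrt(mean((pop - mean(pop))^2))

# Approximate standard error of the sample mean for each sample size,
# ignoring the finite-population correction.
sizes <- c(50, 100, 1000)
se <- sigma / sqrt(sizes)
names(se) <- sizes

print(round(se, 1))
```

The standard error shrinks as n grows, so a sample of 1000 is expected to land closer to the population mean than a sample of 100, even though any single pair of samples can disagree with that ordering.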

Here we will generate 5000 samples and compute the sample mean of each.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
}

hist(sample_means50)

hist(sample_means50, breaks = 25)

Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called sample_means50. In the interlude below, we’ll review how this set of code works.

How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

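Answer: There are 5000 elements in sample_means50; each element is the mean of one sample of 50 observations drawn from the population of area. The sampling distribution is approximately normal and centered near the population mean of about 1500 square feet, but its spread is much smaller than the population's because averaging 50 observations smooths out individual variation. Collecting 50,000 sample means instead would not materially change the center, spread, or shape; the histogram would only look smoother, since the sampling distribution depends on the sample size (50), not on how many samples we simulate.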
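The question of 5000 versus 50,000 sample means can be explored by simulation. This is a hedged sketch rather than part of the lab: `pop` is a hypothetical synthetic stand-in for `area`, and `sim_means` is a helper name introduced here for illustration.

```r
# Hypothetical synthetic stand-in for `area` (parameters are assumptions).
set.seed(3)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Illustrative helper: draw `reps` samples of size `n` and return their means.
sim_means <- function(reps, n) replicate(reps, mean(sample(pop, n)))

means_5000  <- sim_means(5000, 50)
means_50000 <- sim_means(50000, 50)

# Center and spread for each number of simulated samples.
print(c(center = mean(means_5000),  spread = sd(means_5000)))
print(c(center = mean(means_50000), spread = sd(means_50000)))
```

Both runs use samples of size 50, so their centers and spreads should be close to each other; only the smoothness of a histogram would differ noticeably.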

Interlude: The `for` loop

To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

Answer: There are 100 elements in sample_means_small; each element is the mean of one sample of 50 observations drawn from the population of area.

sample_means_small <- rep(NA, 100)

for(i in 1:100){
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
}
sample_means_small

##   [1] 1356.42 1398.98 1462.86 1451.34 1502.36 1471.50 1517.24 1429.66
##   [9] 1455.82 1431.14 1599.38 1530.02 1538.08 1507.94 1564.38 1487.22
##  [17] 1469.24 1497.28 1606.82 1509.16 1387.26 1360.94 1545.50 1390.60
##  [25] 1621.66 1519.74 1419.46 1375.76 1717.76 1451.30 1605.04 1466.48
##  [33] 1542.80 1566.26 1530.92 1500.54 1479.22 1512.64 1451.98 1562.64
##  [41] 1493.64 1480.08 1488.66 1579.84 1550.78 1481.48 1437.46 1416.28
##  [49] 1531.60 1569.64 1488.28 1460.62 1503.80 1478.20 1456.04 1597.98
##  [57] 1525.26 1404.58 1542.76 1456.56 1463.62 1540.28 1460.74 1528.52
##  [65] 1460.14 1575.60 1470.98 1400.68 1505.44 1428.64 1435.82 1461.34
##  [73] 1490.50 1702.66 1482.50 1472.12 1510.60 1593.26 1571.12 1514.92
##  [81] 1605.14 1546.50 1429.34 1498.16 1452.12 1439.62 1557.18 1517.90
##  [89] 1426.68 1559.88 1469.84 1480.70 1561.24 1458.56 1479.40 1540.98
##  [97] 1558.90 1464.48 1539.34 1571.70
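To see what each pass of the loop does, it can help to shrink it to three iterations and print the index as it runs. The sketch below assumes a hypothetical synthetic population `pop` in place of `area`; it also shows `replicate()`, a base R function that expresses the same repetition without an explicit index. The names `small` and `small_rep` are introduced here for illustration.

```r
# Hypothetical synthetic stand-in for `area` (parameters are assumptions).
set.seed(4)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Pre-allocate the result vector, as in the lab.
small <- rep(NA, 3)

for (i in 1:3) {
  samp <- sample(pop, 50)    # draw a fresh sample on every pass
  small[i] <- mean(samp)     # store its mean in slot i
  cat("iteration", i, "-> mean", round(small[i], 2), "\n")
}

# replicate() repeats an expression and collects the results in a vector.
small_rep <- replicate(3, mean(sample(pop, 50)))
print(small_rep)
```

Each element of `small` corresponds to exactly one iteration, which is why the length of the result always equals the number of iterations.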

Sample size and the sampling distribution

When the sample size is larger, what happens to the center? What about the spread?

Answer: When the sample size is larger, the center of the sampling distribution stays about the same, close to the population mean. The spread decreases: the sample means cluster more tightly around the center and the tails shrink.
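The effect of sample size on center and spread can be sketched with a small simulation. As before, this is an illustration under assumptions: `pop` is a hypothetical synthetic stand-in for `area`, and the sample sizes and the 2000 repetitions are arbitrary choices.

```r
# Hypothetical synthetic stand-in for `area` (parameters are assumptions).
set.seed(5)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

sizes <- c(10, 50, 100)

# For each sample size, simulate 2000 sample means and summarize them.
results <- sapply(sizes, function(n) {
  m <- replicate(2000, mean(sample(pop, n)))
  c(center = mean(m), spread = sd(m))
})
colnames(results) <- paste0("n=", sizes)

print(round(results, 1))
print(mean(pop))
```

The `center` row should stay close to `mean(pop)` for every sample size, while the `spread` row should fall as n increases.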

On your own

So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.

Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

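Answer: The best point estimate of the population mean is the sample mean, 188792. It differs from the population mean of 180796.1 by about 4.42%, a difference due to sampling variability in a sample of only 50 homes.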

samp1 <- sample(price, 50)
summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   39300  141656  171000  188792  214425  468000

hist(samp1, breaks=10)

mean(price)

## [1] 180796.1

abs(mean(samp1) - mean(price))/mean(price)

## [1] 0.04422607

Since you have access to the population, simulate the sampling distribution for \(\bar{x}_{price}\) by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

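Answer: The sampling distribution is unimodal and roughly bell-shaped, with a slight right skew (the mean, 180620, sits a little above the median, 180185, and the upper tail extends farther than the lower). Based on it, the best guess for the mean home price is the mean of the 5000 sample means, 180620. This is within about 0.1% of the population mean of 180796.1.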

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
  samp50 <- sample(price, 50)
  sample_means50[i] <- mean(samp50)
}
summary(sample_means50)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  145546  172742  180185  180620  188145  227605

xlimits <- range(sample_means50)

hist(sample_means50, breaks = 25, xlim = xlimits)

mean(sample_means50)

## [1] 180620

mean(price)

## [1] 180796.1

abs(mean(sample_means50)-mean(price))/mean(price)

## [1] 0.0009736935

Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

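Answer: The sampling distribution is again approximately normal and centered in nearly the same place, but it is noticeably narrower than the one for samples of size 50: the middle 50% of sample means spans 176455 to 184961, compared with 172742 to 188145. Based on this distribution, the mean sale price of homes in Ames is approximately 180809.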

sample_means150 <- rep(NA, 5000)

for(i in 1:5000){
  samp150 <- sample(price, 150)
  sample_means150[i] <- mean(samp150)
}
summary(sample_means150)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  159465  176455  180756  180809  184961  205215

par(mfrow = c(2, 1))

xlimits <- range(sample_means50)

hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means150, breaks = 20, xlim = xlimits)

Of the sampling distributions from the previous two exercises (sample sizes 50 and 150), which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

Answer: sample_means150, with mean 180809, has the smaller spread. A larger sample size reduces the spread of the sampling distribution, so individual estimates fall closer to the population mean more often. We therefore prefer a distribution with a small spread.
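The difference in spread can also be quantified. If the standard error scales as sigma / sqrt(n), tripling the sample size from 50 to 150 should shrink the spread by a factor of roughly sqrt(3), or about 1.73. The sketch below checks this on a hypothetical synthetic population `price_pop` standing in for `price`; the helper `sim_means` and the 3000 repetitions are illustrative assumptions, and sampling without replacement makes the ratio only approximately sqrt(3).

```r
# Hypothetical synthetic stand-in for `price` (parameters are assumptions).
set.seed(6)
price_pop <- rlnorm(2930, meanlog = 12, sdlog = 0.4)

# Illustrative helper: means of `reps` samples of size `n`.
sim_means <- function(n, reps = 3000) {
  replicate(reps, mean(sample(price_pop, n)))
}

sd50  <- sd(sim_means(50))
sd150 <- sd(sim_means(150))

# Ratio of spreads, to compare with sqrt(150 / 50) = sqrt(3).
spread_ratio <- sd50 / sd150
print(c(sd50 = sd50, sd150 = sd150, ratio = spread_ratio, sqrt3 = sqrt(3)))
```

With the lab's own vectors, `sd(sample_means50) / sd(sample_means150)` would give the corresponding ratio for the Ames prices.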

606 Lab4a Foundations for statistical inference - Sampling distributions

Chunmei Zhu

September 28, 2017
