data606_lab5A

## The data

We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")

We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice). To save some effort throughout the lab, create two variables with short names that represent these two variables.

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

Let’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

Describe this population distribution.

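Answer: The histogram is unimodal and right-skewed. The mean is about 1500 square feet, above the median of 1442, and the largest homes stretch the right tail out to 5642 square feet.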
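The right-skew claim can be checked numerically from the summary statistics alone. The following is a minimal sketch that uses only the rounded values printed by summary(area); the object names (area_summary, upper_fence, and so on) are illustrative and not part of the lab.

```r
# Rounded values copied from the summary(area) output above
area_summary <- c(min = 334, q1 = 1126, median = 1442,
                  mean = 1500, q3 = 1743, max = 5642)

# In a right-skewed distribution the mean is pulled above the median
mean_minus_median <- unname(area_summary["mean"] - area_summary["median"])

# Interquartile range and the conventional 1.5 * IQR upper fence
iqr <- unname(area_summary["q3"] - area_summary["q1"])
upper_fence <- unname(area_summary["q3"] + 1.5 * iqr)

# Length of each tail, measured from the nearest quartile
upper_tail <- unname(area_summary["max"] - area_summary["q3"])
lower_tail <- unname(area_summary["q1"] - area_summary["min"])

mean_minus_median   # 58
iqr                 # 617
upper_fence         # 2668.5
upper_tail          # 3899
lower_tail          # 792
```

The maximum (5642) lies far beyond the upper fence and the upper tail is several times longer than the lower one, which is consistent with a right-skewed histogram.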

## The unknown sampling distribution

In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.

If we were interested in estimating the mean living area in Ames based on a sample, we can use the following command to survey the population.

samp1 <- sample(area, 50)

This command collects a simple random sample of size 50 from the vector area, which is assigned to samp1. This is like going into the City Assessor’s database and pulling up the files on 50 random home sales. Working with these 50 files would be considerably simpler than working with all 2930 home sales.
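By default, sample() draws without replacement, so no home sale can appear twice in a sample, and set.seed() makes a draw repeatable. The sketch below illustrates both points. It samples row positions 1 to 2930 instead of the real area vector so that it runs without the Ames data; the seed value and the names idx and idx_again are arbitrary assumptions.

```r
n_homes <- 2930          # population size stated in the lab

set.seed(606)            # arbitrary seed, chosen only for reproducibility
idx <- sample(n_homes, 50)   # 50 row positions drawn from 1:2930

length(idx)              # 50
anyDuplicated(idx)       # 0 means no position was drawn twice

# Resetting the same seed reproduces exactly the same sample
set.seed(606)
idx_again <- sample(n_homes, 50)
identical(idx, idx_again)    # TRUE
```

With the real data, area[idx] would give the living areas of the sampled homes. Calling set.seed() before sample(area, 50) is one way to keep the numbers quoted in a write-up consistent with the printed output.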

Describe the distribution of this sample. How does it compare to the distribution of the population?

hist(samp1)

mean(samp1)

## [1] 1446.36

Answer: The sample histogram changes shape every time I run the code because the 50 homes are picked at random. Depending on the run, it can look unimodal, bimodal, or multimodal. The population histogram is fixed and unimodal, so a single sample of 50 only roughly resembles it.

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

samp2 <- sample(area, 50)
mean(samp2)

## [1] 1442.14

samp3 <- sample(area, 100)
mean(samp3)

## [1] 1492.19

samp4 <- sample(area, 1000)
mean(samp4)

## [1] 1492.714

Answer: The mean of samp2 differs from the mean of samp1. In the run shown above, the mean of samp1 is 1446.36 and the mean of samp2 is 1442.14; both values change on every run because the samples are random. Of the two additional samples, the larger one (size 1000) should provide the more accurate estimate of the population mean.

Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way. The distribution of sample means, called the sampling distribution, can help us understand this variability. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps many times. Here we will generate 5000 samples and compute the sample mean of each.

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   }

hist(sample_means50)

If you would like to adjust the bin width of your histogram to show a little more detail, you can do so by changing the breaks argument.

hist(sample_means50, breaks = 25)

Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called sample_means50. On the next page, we’ll review how this set of code works.
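The same simulation can also be written without an explicit loop. The sketch below compares the loop with replicate(), which evaluates an expression repeatedly and collects the results. Because the Ames data may not be at hand, it uses a simulated right-skewed stand-in called area_sim; the rlnorm() parameters and the seeds are assumptions chosen for illustration, not values from the lab.

```r
set.seed(606)
# Hypothetical stand-in for the 2930 living areas (NOT the Ames data)
area_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Loop version, structured like the lab code
set.seed(1)
means_loop <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area_sim, 50)
  means_loop[i] <- mean(samp)
}

# replicate() version: same computation in one expression
set.seed(1)
means_rep <- replicate(5000, mean(sample(area_sim, 50)))

length(means_rep)                  # 5000
all.equal(means_loop, means_rep)   # TRUE: same seed, same order of draws
```

Both versions consume the random number stream in the same order, so with the same seed they produce the same 5000 sample means. The loop makes each step explicit, while replicate() is more compact.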

How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

length(sample_means50)

## [1] 5000

mean(sample_means50)

## [1] 1499.86

Answer: There are 5,000 elements in sample_means50. The sampling distribution is approximately normal and centered at 1499.86, very close to the population mean of about 1500. With 50,000 sample means the shape, center, and spread should stay essentially the same; the histogram would only look smoother, and the mean of the sample means would settle even closer to the population mean.
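The question about 50,000 sample means can be explored directly by simulation. The sketch below is a rough check under stated assumptions: area_sim is a simulated right-skewed stand-in for the real area vector, and the rlnorm() parameters and seed are arbitrary. It compares the center and spread of 5,000 and 50,000 sample means, both from samples of size 50.

```r
set.seed(606)
# Hypothetical stand-in for the 2930 living areas (NOT the Ames data)
area_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Same sample size (50), different numbers of simulated sample means
means_5k  <- replicate(5000,  mean(sample(area_sim, 50)))
means_50k <- replicate(50000, mean(sample(area_sim, 50)))

# Centers: both should sit close to the population mean
c(population = mean(area_sim),
  center_5k  = mean(means_5k),
  center_50k = mean(means_50k))

# Spreads: both estimate the same quantity, so they should nearly agree
c(spread_5k = sd(means_5k), spread_50k = sd(means_50k))
```

The spread of the sampling distribution is governed by the sample size, not by how many sample means are simulated, so running more repetitions mainly gives a smoother picture of the same distribution.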

To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

sample_means_small <- rep(NA, 100)

for(i in 1:100){
   # use the name samp so the sample() function is not masked
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
   }
sample_means_small

##   [1] 1668.58 1465.20 1409.60 1514.08 1542.04 1572.22 1417.82 1570.18
##   [9] 1470.08 1437.08 1441.86 1467.70 1485.12 1477.90 1597.36 1546.50
##  [17] 1660.94 1585.22 1501.00 1475.56 1498.08 1456.52 1514.98 1461.08
##  [25] 1422.06 1523.00 1383.02 1456.08 1485.98 1506.38 1407.64 1424.06
##  [33] 1472.12 1434.10 1598.92 1416.24 1542.94 1570.80 1498.64 1465.94
##  [41] 1324.00 1521.54 1343.24 1629.30 1514.08 1411.56 1659.82 1457.88
##  [49] 1447.42 1554.82 1560.96 1357.60 1575.18 1412.96 1470.66 1611.02
##  [57] 1525.06 1420.22 1381.76 1303.62 1397.08 1575.50 1601.18 1573.38
##  [65] 1422.88 1495.52 1652.14 1576.06 1389.10 1520.12 1502.12 1499.80
##  [73] 1511.28 1442.66 1514.86 1499.98 1480.32 1485.26 1445.76 1518.18
##  [81] 1474.24 1534.30 1528.50 1541.18 1512.48 1423.68 1421.94 1517.52
##  [89] 1534.48 1559.70 1410.56 1507.20 1538.22 1462.96 1584.76 1403.24
##  [97] 1485.36 1460.82 1515.86 1516.04

hist(sample_means_small)

Answer: sample_means_small has 100 elements. Each element is the mean living area of one random sample of 50 homes.

## Sample size and the sampling distribution

When the sample size is larger, what happens to the center? What about the spread?

Answer: As the sample size grows, the center stays near the population mean, while the spread shrinks. The histogram of sample means therefore becomes narrower and taller around the center.
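How the spread depends on sample size can be made concrete with a short simulation. The sketch below compares the simulated standard deviation of the sample means with the textbook value, sigma / sqrt(n), adjusted by the finite population correction because sample() draws without replacement. The stand-in population area_sim, the rlnorm() parameters, the seed, and the sample sizes 10, 50, and 100 are all assumptions made for illustration.

```r
set.seed(606)
# Hypothetical stand-in for the 2930 living areas (NOT the Ames data)
area_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
N <- length(area_sim)

sizes <- c(10, 50, 100)   # illustrative sample sizes

# Simulated spread: sd of 5000 sample means for each sample size
sim_sd <- sapply(sizes, function(n) {
  sd(replicate(5000, mean(sample(area_sim, n))))
})

# Theoretical spread: sigma / sqrt(n) times the finite population correction
sigma <- sqrt(mean((area_sim - mean(area_sim))^2))   # population sd (divides by N)
theory_sd <- sigma / sqrt(sizes) * sqrt((N - sizes) / (N - 1))

round(data.frame(n = sizes, simulated = sim_sd, theoretical = theory_sd), 1)
```

The simulated and theoretical columns should agree closely, and both shrink as n grows, which is the narrowing visible in the histograms.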

## On your own

So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.

Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

price_sample1<-sample(price,50)   
hist(price_sample1)

mean(price)

## [1] 180796.1

#Best point estimate of the population mean
mean(price_sample1)

## [1] 186413

Since you have access to the population, simulate the sampling distribution for \(\bar{x}_{price}\) by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

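sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
    # use the name samp so the sample() function is not masked
    samp <- sample(price, 50)
    sample_means50[i] <- mean(samp)
}
hist(sample_means50)
# make sure the plotting layout is a single panel
par(mfrow = c(1, 1))
hist(sample_means50, breaks = 20)

# Mean of the 5000 sample means (n = 50)
mean(sample_means50)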

## [1] 181014.7

#The population mean
mean(price)

## [1] 180796.1

Answer: The histogram looks approximately normal and is centered around 180,750, so that is my guess for the population mean. The calculated population mean is 180,796.1.

Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

sample_means150 <- rep(NA, 5000)

for(i in 1:5000){
    # use the name samp so the sample() function is not masked
    samp <- sample(price, 150)
    sample_means150[i] <- mean(samp)
}
hist(sample_means150)

# Mean of the 5000 sample means (n = 150)
mean(sample_means150)

## [1] 180817.4

#The population mean
mean(price)

## [1] 180796.1

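# stack the two histograms and give them a common x-axis
par(mfrow = c(2, 1))
# the n = 50 means have the wider range, so use it for both plots
xlimits <- range(sample_means50)
hist(sample_means50, breaks = 25, xlim = xlimits)
hist(sample_means150, breaks = 25, xlim = xlimits)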

Answer: The sampling distribution in sample_means150 is also approximately normal. Compared with sample_means50, its histogram is narrower and taller. Its mean (180,817.4) is closer to the population mean than the mean of sample_means50 (181,014.7), so I would guess the mean sale price is around 180,810.

Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

Answer: The sampling distribution from exercise 3 (sample size 150) has the smaller spread. To make estimates that are more often close to the true value, we prefer a distribution with a small spread.

Mengqin Cai

10/8/2019
