We consider real estate data from the city of Ames, Iowa. The details of every real estate transaction in Ames are recorded by the City Assessor’s office. Our particular focus for this lab will be all residential home sales in Ames between 2006 and 2010. This collection represents our population of interest. In this lab we would like to learn about these home sales by taking smaller samples from the full population. Let’s load the data.
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
We see that there are quite a few variables in the data set, enough to do a very in-depth analysis. For this lab, we’ll restrict our attention to just two of the variables: the above ground living area of the house in square feet (Gr.Liv.Area) and the sale price (SalePrice). To save some effort throughout the lab, create two variables with short names that represent these two variables.
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
Let’s look at the distribution of area in our population of home sales by calculating a few summary statistics and making a histogram.
summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     334    1126    1442    1500    1743    5642
hist(area)
Answer: The histogram is unimodal and right-skewed. Its mean is about 1500.
In this lab we have access to the entire population, but this is rarely the case in real life. Gathering information on an entire population is often extremely costly or impossible. Because of this, we often take a sample of the population and use that to understand the properties of the population.
If we are interested in estimating the mean living area in Ames based on a sample, we can use the following command to survey the population.
samp1 <- sample(area, 50)
This command collects a simple random sample of size 50 from the vector area, which is assigned to samp1. This is like going into the City Assessor’s database and pulling up the files on 50 random home sales. Working with these 50 files would be considerably simpler than working with all 2930 home sales.
hist(samp1)
mean(samp1)
## [1] 1446.36
Answer: The histogram of the sample changes shape every time I run the code, because the 50 values are picked at random. Across runs it can look unimodal, bimodal, or multimodal. The histogram of the population is fixed, and it is unimodal.
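Because sample() draws at random, the histogram and mean of a sample change on every run. One way to make a run repeatable is to fix the random seed with set.seed() before sampling. The sketch below is an illustration rather than part of the original lab: the seed values are arbitrary, and pop is a simulated stand-in for area so that the block runs without the Ames data.

```r
# Simulated stand-in for the population (assumption; in the lab, use `area`).
set.seed(1)
pop <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))

# Fixing the seed makes the "random" sample repeatable.
set.seed(42)                 # arbitrary seed value
samp_a <- sample(pop, 50)

set.seed(42)                 # same seed, so the same 50 values are drawn
samp_b <- sample(pop, 50)

identical(samp_a, samp_b)    # TRUE: both samples contain the same values
mean(samp_a)
```

Without the second set.seed() call, samp_b would be a fresh random draw and the two means would generally differ.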
Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
samp2 <- sample(area, 50)
mean(samp2)
## [1] 1442.14
samp3 <- sample(area, 100)
mean(samp3)
## [1] 1492.19
samp4 <- sample(area, 1000)
mean(samp4)
## [1] 1492.714
Answer: The mean of samp2 differs from the mean of samp1, and both change from run to run because each sample is random. In the output shown above, samp1’s mean is 1446.36 and samp2’s mean is 1442.14. Of the two additional samples, the larger one (size 1000) will provide a more accurate estimate of the population mean.
Not surprisingly, every time we take another random sample, we get a different sample mean. It’s useful to get a sense of just how much variability we should expect when estimating the population mean this way. The distribution of sample means, called the sampling distribution, can help us understand this variability. In this lab, because we have access to the population, we can build up the sampling distribution for the sample mean by repeating the above steps many times. Here we will generate 5000 samples and compute the sample mean of each.
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50)
If you would like to adjust the bin width of your histogram to show a little more detail, you can do so by changing the breaks argument.
hist(sample_means50, breaks = 25)
Here we use R to take 5000 samples of size 50 from the population, calculate the mean of each sample, and store each result in a vector called sample_means50. On the next page, we’ll review how this set of code works.
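The loop above fills a pre-allocated vector one element at a time. As a sketch of an equivalent and more compact form, base R's replicate() evaluates an expression repeatedly and collects the results. Here pop is a simulated stand-in for area (an assumption, so that the block is self-contained), and the names means_loop and means_rep are illustrative only.

```r
set.seed(1)
pop <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))  # stand-in for `area`

# Loop version, as in the lab.
set.seed(2024)
means_loop <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(pop, 50)
  means_loop[i] <- mean(samp)
}

# replicate() version: evaluates the expression 5000 times.
set.seed(2024)
means_rep <- replicate(5000, mean(sample(pop, 50)))

length(means_rep)                  # 5000
all.equal(means_loop, means_rep)   # TRUE when both start from the same seed
```

Both versions make the same sequence of sample() calls, so the choice between them is mostly a matter of readability.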
How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?
length(sample_means50)
## [1] 5000
mean(sample_means50)
## [1] 1499.86
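Answer: There are 5,000 elements in sample_means50. The sampling distribution is approximately normal, with a mean of 1499.86 in the output shown above, which is very close to the population mean of 1500. If we collected 50,000 sample means instead, the shape, center, and spread would stay essentially the same; the histogram would just look smoother, and its mean would tend to land even closer to 1500.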
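Besides the center, the spread of the sampling distribution can be compared with the theoretical standard error. The sketch below assumes simple random sampling without replacement, for which the standard error of the sample mean is sd(pop) / sqrt(n) * sqrt(1 - n / N). As before, pop is a simulated stand-in for area rather than the Ames data, and the variable names are illustrative.

```r
set.seed(1)
pop <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))  # stand-in for `area`
N <- length(pop)
n <- 50

# 5000 sample means, each from a sample of size n drawn without replacement.
set.seed(99)
sample_means <- replicate(5000, mean(sample(pop, n)))

# Center: the mean of the sample means should sit close to the population mean.
c(simulated = mean(sample_means), population = mean(pop))

# Spread: compare with the standard error for sampling without replacement.
se_theory <- sd(pop) / sqrt(n) * sqrt(1 - n / N)
c(simulated = sd(sample_means), theory = se_theory)
```

The factor sqrt(1 - n / N) is the finite population correction; with 50 draws out of 2930 it is close to 1, so sd(pop) / sqrt(n) alone is already a good approximation.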
Initialize a vector of length 100 called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?
sample_means_small <- rep(NA, 100)
for (i in 1:100) {
  # use `samp` rather than `sample` so the sample() function is not masked
  samp <- sample(area, 50)
  sample_means_small[i] <- mean(samp)
}
sample_means_small
## [1] 1668.58 1465.20 1409.60 1514.08 1542.04 1572.22 1417.82 1570.18
## [9] 1470.08 1437.08 1441.86 1467.70 1485.12 1477.90 1597.36 1546.50
## [17] 1660.94 1585.22 1501.00 1475.56 1498.08 1456.52 1514.98 1461.08
## [25] 1422.06 1523.00 1383.02 1456.08 1485.98 1506.38 1407.64 1424.06
## [33] 1472.12 1434.10 1598.92 1416.24 1542.94 1570.80 1498.64 1465.94
## [41] 1324.00 1521.54 1343.24 1629.30 1514.08 1411.56 1659.82 1457.88
## [49] 1447.42 1554.82 1560.96 1357.60 1575.18 1412.96 1470.66 1611.02
## [57] 1525.06 1420.22 1381.76 1303.62 1397.08 1575.50 1601.18 1573.38
## [65] 1422.88 1495.52 1652.14 1576.06 1389.10 1520.12 1502.12 1499.80
## [73] 1511.28 1442.66 1514.86 1499.98 1480.32 1485.26 1445.76 1518.18
## [81] 1474.24 1534.30 1528.50 1541.18 1512.48 1423.68 1421.94 1517.52
## [89] 1534.48 1559.70 1410.56 1507.20 1538.22 1462.96 1584.76 1403.24
## [97] 1485.36 1460.82 1515.86 1516.04
hist(sample_means_small)
Answer: There are 100 elements in sample_means_small. Each element is the mean of one random sample of size 50 drawn from area.
Answer: When the sample size is larger, the center of the sampling distribution stays close to the population mean while the spread narrows, so the histogram is taller and more tightly concentrated around its center.
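The effect of sample size on center and spread can be checked numerically. The sketch below simulates sampling distributions for three sample sizes and reports the center and spread of each. The sizes 10, 50, and 100 and the helper name sim_means are illustrative choices, and pop is a simulated stand-in for area.

```r
set.seed(1)
pop <- round(rlnorm(2930, meanlog = 7.25, sdlog = 0.33))  # stand-in for `area`

# Hypothetical helper: `reps` sample means for samples of size n.
sim_means <- function(n, reps = 5000) {
  replicate(reps, mean(sample(pop, n)))
}

set.seed(7)
sizes <- c(10, 50, 100)
results <- sapply(sizes, function(n) {
  m <- sim_means(n)
  c(n = n, center = mean(m), spread = sd(m))
})

# One column per sample size: the center stays near mean(pop),
# while the spread shrinks as n grows.
round(results, 1)
```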
So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.
Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?
price_sample1 <- sample(price, 50)
hist(price_sample1)
mean(price)
## [1] 180796.1
#Best point estimate of the population mean
mean(price_sample1)
## [1] 186413
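Take 5000 samples of size 50 from price, compute the mean of each, and store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  # use `samp` rather than `sample` so the sample() function is not masked
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}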
hist(sample_means50)
#reset par settings
par(mfrow=c(1,1))
hist(sample_means50, breaks = 20)
# Mean of the 5000 sample means (n = 50)
mean(sample_means50)
## [1] 181014.7
#The population mean
mean(price)
## [1] 180796.1
Answer: The histogram is approximately normal and centered around 180,750, so I would guess a population mean near that value. The calculated population mean is 180,796.1.
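Change the sample size from 50 to 150, compute the sampling distribution the same way, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
sample_means150 <- rep(NA, 5000)
for (i in 1:5000) {
  # use `samp` rather than `sample` so the sample() function is not masked
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}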
hist(sample_means150)
# Mean of the 5000 sample means (n = 150)
mean(sample_means150)
## [1] 180817.4
#The population mean
mean(price)
## [1] 180796.1
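# stack the two histograms in a 2 x 1 grid for comparison
par(mfrow = c(2, 1))
# share the x-axis of the wider (n = 50) distribution so neither plot is clipped
xlimits <- range(sample_means50)
hist(sample_means50, breaks = 25, xlim = xlimits)
hist(sample_means150, breaks = 25, xlim = xlimits)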
Answer: The sampling distribution in sample_means150 is also approximately normal. Compared with the histogram of sample_means50, it is narrower and taller. The mean of the 5000 sample means (180,817.4) is closer to the population mean, so we can guess that the mean sale price is around 180,810.
Answer: The sampling distribution for samples of size 150 has the smaller spread. We prefer a small spread when estimating the true mean, because our estimates then fall close to the true value more often.
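The preference for a small spread can be made concrete by counting how often a sample mean lands near the true mean. The sketch below is an illustration under stated assumptions: pop is a simulated, right-skewed stand-in for price (not the Ames data), and the tolerance of 10,000 is an arbitrary choice.

```r
set.seed(1)
# Simulated, right-skewed stand-in for `price` (assumption, not the Ames data).
pop <- round(rlnorm(2930, meanlog = 12, sdlog = 0.4))
mu <- mean(pop)
tol <- 10000   # arbitrary tolerance around the true mean

set.seed(11)
means50  <- replicate(5000, mean(sample(pop, 50)))
means150 <- replicate(5000, mean(sample(pop, 150)))

# Share of sample means that fall within `tol` of the population mean.
share <- c(n50  = mean(abs(means50  - mu) <= tol),
           n150 = mean(abs(means150 - mu) <= tol))
share
```

A larger share for the samples of size 150 means that an estimate based on 150 homes is more often close to the true mean than one based on 50 homes.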