library(plyr)   # load plyr before dplyr so that dplyr's functions are not masked
library(dplyr)
library(data.table)
library(knitr)
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
hist(area)
Exercise 1 Describe this population distribution.
The population distribution is unimodal and right-skewed. The skew shows in the mean (1500) sitting above the median (1442), pulled toward the right tail. The range is about 5,300 (from 334 to 5,642). With narrower bins (for example, breaks = 20), some extreme outliers become visible between 4,000 and 6,000.
samp1 <- sample(area, 50)
hist(samp1)
Exercise 2 Describe the distribution of this sample. How does it compare to the distribution of the population?
mean(samp1)
## [1] 1594.16
summary(samp1)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 756 1332 1583 1594 1728 2775
The distribution of the 50 sampled values follows the shape of the population distribution, which agrees with the basic properties of point estimates. The sample mean (the point estimate, 1594) is close to the population mean (1500). However, the range is narrower (about 2,000, from 756 to 2,775) and the outliers are less extreme. Because this is a simple random sample, the results will differ on every run unless set.seed is called first.
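The effect of set.seed can be checked with a small sketch. It uses a simulated right-skewed vector pop as a stand-in for area (an assumption, so that the block runs without ames.RData), and the seed value 42 is arbitrary.

```r
# Stand-in population (assumption): simulated right-skewed values, not the Ames data
set.seed(1)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Same seed before each call: the two samples are identical
set.seed(42)
draw_a <- sample(pop, 50)
set.seed(42)
draw_b <- sample(pop, 50)
same_seed_match <- identical(draw_a, draw_b)
same_seed_match

# No reseeding: the generator has moved on, so a new sample is drawn
draw_c <- sample(pop, 50)
c(mean_a = mean(draw_a), mean_b = mean(draw_b), mean_c = mean(draw_c))
```

Calling set.seed once at the top of the report should be enough to make every later sample and histogram reproducible.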
Exercise 3 Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
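One way to explore this exercise is a simulation along the following lines. This is only a sketch: pop is a simulated stand-in for area (an assumption, not the Ames data), and the 2,000 repetitions and the name mean_abs_error are my own choices.

```r
# Stand-in population (assumption): simulated right-skewed values, not the Ames data
set.seed(2024)
pop <- rexp(3000, rate = 1 / 1500)
pop_mean <- mean(pop)

# Two samples of size 50 give two different point estimates
samp1 <- sample(pop, 50)
samp2 <- sample(pop, 50)
c(samp1 = mean(samp1), samp2 = mean(samp2), population = pop_mean)

# Average distance between the sample mean and the population mean,
# over 2000 repeated samples of each size
sizes <- c(50, 100, 1000)
mean_abs_error <- sapply(sizes, function(n) {
  mean(replicate(2000, abs(mean(sample(pop, n)) - pop_mean)))
})
names(mean_abs_error) <- sizes
mean_abs_error
```

The two size-50 means will generally differ from each other. The average error shrinks as the sample size grows, which suggests that the sample of size 1000 would give the most accurate estimate of the population mean.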
So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.
Exercise 1 Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?
samp_price <- sample(price, 50)
samp_priceMean <- mean(samp_price)
Exercise 2 Since you have access to the population, simulate the sampling distribution for x̄_price by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.
sample_means50 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks = 25)
summary(price)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 12790 129500 160000 180800 213500 755000
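# Mean of the 5000 sample means: an estimate of the population mean, not the population mean itself
mean_of_sample_means <- mean(sample_means50)
mean_of_sample_means
## [1] 180365.5
The sampling distribution appears approximately normal, centered near $180,000. The mean of the 5,000 sample means, $180,365.5, is my guess for the population mean. The actual population mean, from the summary(price) output above, is about $180,800.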
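The for loop can also be written with replicate(), and the center of the sampling distribution can then be compared with the population mean directly. A sketch, assuming a simulated stand-in pop for price (not the Ames data); the names sample_means, estimate and truth are illustrative only.

```r
# Stand-in population (assumption): simulated right-skewed prices, not the Ames data
set.seed(123)
pop <- rlnorm(2930, meanlog = 12, sdlog = 0.4)

# 5000 sample means from samples of size 50, without an explicit loop
sample_means <- replicate(5000, mean(sample(pop, 50)))

# The mean of the sample means estimates the population mean
estimate <- mean(sample_means)
truth <- mean(pop)
c(estimate = estimate, truth = truth, relative_gap = abs(estimate - truth) / truth)
```

With the real data, replacing pop with price should reproduce the comparison between the mean of sample_means50 and mean(price).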
Exercise 3 Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?
sample_means150 <- rep(NA, 5000)
for (i in 1:5000) {
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}
hist(sample_means150, breaks = 25)
Exercise 4 Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?
par(mfrow = c(1, 2))
xlimits <- range(sample_means50)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means150, breaks = 20, xlim = xlimits)
xlimits50 <- range(sample_means50)
xlimits150 <- range(sample_means150)
xlimits50
## [1] 145314.4 226304.1
xlimits150
## [1] 160691.0 203990.3
xlimits50[2] - xlimits50[1]
## [1] 80989.76
xlimits150[2] - xlimits150[1]
## [1] 43299.3
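The sampling distribution from Exercise 3 (sample size 150) has the smaller spread: its sample means span about $43,300, compared with about $81,000 for sample size 50. We would prefer a sampling distribution with a small spread, because the sample means then fall close to the true population mean more often.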
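The range depends only on the two most extreme sample means, so the standard deviation of the sample means may be a steadier measure of spread. The sketch below compares it with the theoretical standard error for sampling without replacement, sigma / sqrt(n) * sqrt((N - n) / (N - 1)). It assumes a simulated stand-in pop for price (not the Ames data); se_theory and se_sim are hypothetical helper names.

```r
# Stand-in population (assumption): simulated right-skewed prices, not the Ames data
set.seed(7)
pop <- rlnorm(2930, meanlog = 12, sdlog = 0.4)
N <- length(pop)
sigma <- sqrt(mean((pop - mean(pop))^2))  # population standard deviation

# Theoretical standard error of the mean when sampling without replacement
se_theory <- function(n) sigma / sqrt(n) * sqrt((N - n) / (N - 1))

# Simulated standard error: sd of 5000 sample means of size n
se_sim <- function(n, reps = 5000) sd(replicate(reps, mean(sample(pop, n))))

spread <- rbind(
  simulated = c(n50 = se_sim(50), n150 = se_sim(150)),
  theory    = c(n50 = se_theory(50), n150 = se_theory(150))
)
spread
```

Both rows should show the spread falling as the sample size rises from 50 to 150, which matches the narrower histogram for sample_means150.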