DATA606_Lab4a_Statistical_Inference_Sampling_Distributions

# Load plyr before dplyr so that plyr does not mask dplyr's functions
library(plyr)
library(dplyr)
library(data.table)
library(knitr)

# mode = "wb" keeps the binary .RData file intact on Windows
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData", mode = "wb")
load("ames.RData")
area <- ames$Gr.Liv.Area
price <- ames$SalePrice

summary(area)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642

hist(area)

Exercise 1 Describe this population distribution.

The population distribution is unimodal and right-skewed. The skew shows in the summary: the mean (1500) is larger than the median (1442), pulled toward the long right tail. The range is about 5,300 (334 to 5642). Using smaller bins (20), we can see some extreme outliers between 4,000 and 6,000.

samp1 <- sample(area, 50)

hist(samp1)

Exercise 2

Describe the distribution of this sample. How does it compare to the distribution of the population?

mean(samp1)

## [1] 1594.16

summary(samp1)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     756    1332    1583    1594    1728    2775

The distribution of the sample of 50 observations roughly follows the population distribution, which agrees with the basic properties of point estimates: the sample mean (the point estimate, 1594) is close to the population mean (1500). However, the range is narrower (about 2,000, versus about 5,300 for the population) and the outliers are less extreme. Because this is a simple random sample, the results will differ from run to run unless a seed is fixed with set.seed() before sampling.
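The remark about set.seed can be illustrated with a short sketch. It uses a simulated right-skewed vector, area_sim, as a stand-in for area (an assumption, so that the snippet runs without the download); the seed values are arbitrary.

```r
# Hypothetical stand-in for `area`: a simulated right-skewed population
set.seed(1)
area_sim <- rlnorm(3000, meanlog = 7.25, sdlog = 0.3)

# Fixing the seed before sampling makes the draw reproducible
set.seed(606)
samp_a <- sample(area_sim, 50)

set.seed(606)
samp_b <- sample(area_sim, 50)

identical(samp_a, samp_b)   # same seed, same sample

# Without resetting the seed, the next draw is a different sample
samp_c <- sample(area_sim, 50)
identical(samp_a, samp_c)
```

Calling set.seed once at the top of the lab would make every sample, histogram, and summary in the report repeatable.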

Exercise 3 Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
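One way to reason about the sample-size question is through the standard error of the sample mean. The sketch below is not part of the original lab: it uses a simulated stand-in population, area_sim (an assumption), and the textbook formula for a simple random sample drawn without replacement, including the finite population correction.

```r
# Hypothetical stand-in for `area`: a simulated right-skewed population
set.seed(1)
area_sim <- rlnorm(3000, meanlog = 7.25, sdlog = 0.3)

N <- length(area_sim)

# Population standard deviation (divide by N, not N - 1)
sigma <- sqrt(mean((area_sim - mean(area_sim))^2))

# Standard error of the sample mean for each sample size,
# with the finite population correction for sampling without replacement
n  <- c(50, 100, 1000)
se <- sigma / sqrt(n) * sqrt((N - n) / (N - 1))

data.frame(n = n, se = se)
```

The standard error shrinks as n grows, which suggests that the sample of size 1000 would typically give the most accurate estimate of the population mean.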

On your own

So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.

Exercise 1 Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

samp_price <- sample(price, 50)
samp_priceMean <- mean(samp_price)

Exercise 2 Since you have access to the population, simulate the sampling distribution for x̄_price by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

sample_means50 <- rep(NA, 5000)

for (i in 1:5000) {
  samp <- sample(price, 50)
  sample_means50[i] <- mean(samp)
}

hist(sample_means50, breaks = 25)

summary(price)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   12790  129500  160000  180800  213500  755000

# Mean of the 5000 sample means: an estimate of the population mean
mean_of_sample_means <- mean(sample_means50)
mean_of_sample_means

## [1] 180365.5

The sampling distribution is approximately normal and centered near $180,000. The mean of the 5000 sample means computed above is $180,365.5, which is close to the population mean of about $180,800 reported by summary(price).
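The same simulation can also be written without an explicit loop. The sketch below uses replicate() on a simulated stand-in population, price_sim (an assumption, so that it runs without the Ames data), and compares the center of the simulated sampling distribution with the mean of that stand-in population.

```r
# Hypothetical stand-in for `price`: a simulated right-skewed population
set.seed(1)
price_sim <- rlnorm(3000, meanlog = 12, sdlog = 0.4)

# 5000 sample means, each from a simple random sample of size 50
set.seed(606)
sample_means50_sim <- replicate(5000, mean(sample(price_sim, 50)))

mean(sample_means50_sim)  # center of the simulated sampling distribution
mean(price_sim)           # mean of the stand-in population
```

The two values should agree closely, because the sample mean is an unbiased estimator of the population mean.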

Exercise 3 Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

sample_means150 <- rep(NA, 5000)

for (i in 1:5000) {
  samp <- sample(price, 150)
  sample_means150[i] <- mean(samp)
}

hist(sample_means150, breaks = 25)

Exercise 4 Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

Comparison:

par(mfrow = c(1, 2))

xlimits <- range(sample_means50)

hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means150, breaks = 20, xlim = xlimits)

xlimits50 <- range(sample_means50)
xlimits150 <- range(sample_means150)


xlimits50

## [1] 145314.4 226304.1

xlimits150

## [1] 160691.0 203990.3

xlimits50[2] - xlimits50[1]

## [1] 80989.76


xlimits150[2] - xlimits150[1]

## [1] 43299.3


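The sampling distribution from Exercise 3 (sample size 150) has the smaller spread: its range is about $43,300, compared with about $81,000 for sample size 50. We would prefer a distribution with a small spread, because the sample means then cluster more tightly around the true population mean, so any single estimate is more likely to be close to it.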
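The range depends only on the two most extreme sample means, so the standard deviation may be a steadier measure of spread. The sketch below makes that comparison on a simulated stand-in population, price_sim (an assumption); the names means50_sim and means150_sim are hypothetical and mirror sample_means50 and sample_means150.

```r
# Hypothetical stand-in for `price`: a simulated right-skewed population
set.seed(1)
price_sim <- rlnorm(3000, meanlog = 12, sdlog = 0.4)

# Simulated sampling distributions for sample sizes 50 and 150
set.seed(606)
means50_sim  <- replicate(5000, mean(sample(price_sim, 50)))
means150_sim <- replicate(5000, mean(sample(price_sim, 150)))

sd(means50_sim)
sd(means150_sim)

# If spread scales roughly as 1 / sqrt(n), this ratio is near sqrt(150 / 50)
sd(means50_sim) / sd(means150_sim)
```

Tripling the sample size should cut the spread by a factor of roughly sqrt(3), about 1.7, which is in line with the ratio of the two ranges reported above.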


Matheesha Thambeliyagodage

April 29, 2017
