## The Data

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")

This data set contains real estate sales from Ames, Iowa. We treat these sales as the population and take samples from it.

area <- ames$Gr.Liv.Area
price <- ames$SalePrice

The two variables we created are the above-ground living area in square feet (“area”) and the sale price (“price”).

summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
hist(area, col = "purple")

Exercise 1

Describe this population distribution.
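A few numeric summaries can back up a visual description of the histogram. The sketch below is only an illustration: `pop` is a simulated, right-skewed stand-in (an assumption, not the Ames data), and the names `center`, `spread`, and `mean_above_median` are hypothetical. With the lab data loaded, `pop <- area` could be used instead.

```r
# Stand-in population (assumption): a simulated right-skewed vector.
# With the lab data loaded, replace this line with `pop <- area`.
set.seed(1)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.3)

# Center: the mean is pulled toward the long tail, the median is not
center <- c(mean = mean(pop), median = median(pop))

# Spread: standard deviation and interquartile range
spread <- c(sd = sd(pop), iqr = IQR(pop))

# A mean above the median is a common sign of right skew
mean_above_median <- center[["mean"]] > center[["median"]]

print(round(center))
print(round(spread))
print(mean_above_median)
```

In the summary output above, the mean (1500) also sits above the median (1442) and the maximum (5642) is far from the third quartile, which is the same pattern this check looks for.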

## The Unknown Sampling Distribution

samp1 <- sample(area, 50)

This draws a simple random sample of n = 50 homes from the population. We will use it to estimate the mean living area.

Exercise 2

Describe the distribution of this sample. How does it compare to the distribution of the population?

mean(samp1)
## [1] 1415.8

The sample mean is our point estimate of the average living area in the population.
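Because `sample()` is random, rerunning the code gives a different sample and a different mean each time. One way to make a result repeatable is `set.seed()`. The sketch below assumes a simulated stand-in population `pop` and an arbitrary seed value of 42; `samp_a`, `samp_b`, and `samp_c` are hypothetical names.

```r
# Stand-in population (assumption); with the lab data use `pop <- area`
set.seed(1)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.3)

# Fixing the seed before sampling makes the draw repeatable
set.seed(42)                 # 42 is an arbitrary choice
samp_a <- sample(pop, 50)

set.seed(42)                 # same seed again
samp_b <- sample(pop, 50)

# Without resetting the seed, the next draw is a different sample
samp_c <- sample(pop, 50)

print(identical(samp_a, samp_b))
print(identical(samp_a, samp_c))
print(c(mean(samp_a), mean(samp_c)))
```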

Exercise 3

Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

samp2 <- sample(area, 50)
mean(samp2)
## [1] 1447.74
samp3 <- sample(area, 50)
mean(samp3)
## [1] 1460

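The means of samp2 and samp3 (1447.74 and 1460) differ from each other and from the mean of samp1 (1415.8) because each comes from a different random sample. Of the sizes considered, the sample of n = 1000 would provide the most accurate estimate of the population mean, because sample means vary less from sample to sample as the sample size grows.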
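The effect of sample size can be checked by simulation. This is a rough sketch, not part of the lab: it assumes a simulated stand-in population `pop`, uses 2000 repetitions per size as an arbitrary choice, and the names `sizes` and `spread_by_n` are hypothetical.

```r
# Stand-in population (assumption); with the lab data use `pop <- area`
set.seed(1)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.3)

sizes <- c(50, 100, 1000)

# For each sample size, draw many samples and measure how much
# the sample means vary from one sample to the next
spread_by_n <- sapply(sizes, function(n) {
  means <- replicate(2000, mean(sample(pop, n)))
  sd(means)
})
names(spread_by_n) <- sizes

print(round(spread_by_n, 1))
```

A smaller standard deviation of the sample means indicates that a single sample of that size tends to land closer to the population mean.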

sample_means50 <- rep(NA, 5000)

for (i in 1:5000) {
  samp <- sample(area, 50)
  sample_means50[i] <- mean(samp)
}

hist(sample_means50)

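The loop draws 5000 random samples of size 50 from area and stores the mean of each one in sample_means50. The histogram of these means approximates the sampling distribution of the sample mean.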
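The same simulation can be written more compactly with `replicate()`, which repeats an expression and collects the results. This is an alternative sketch, not the lab's own code; `pop` is a simulated stand-in population (an assumption) and `sample_means_alt` is a hypothetical name.

```r
# Stand-in population (assumption); with the lab data use `pop <- area`
set.seed(1)
pop <- rlnorm(3000, meanlog = 7.25, sdlog = 0.3)

# Repeat "draw 50 values and take their mean" 5000 times
sample_means_alt <- replicate(5000, mean(sample(pop, 50)))

# The means should center close to the population mean
print(length(sample_means_alt))
print(c(mean_of_means = mean(sample_means_alt), pop_mean = mean(pop)))
```

The explicit `for` loop and `replicate()` do the same work; the loop makes each step visible, while `replicate()` avoids pre-allocating the result vector.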

hist(sample_means50, breaks = 25)

Setting breaks = 25 increases the number of bins, which shows a little more detail in the shape of the distribution.
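When `breaks` is a single number, `hist()` treats it as a suggestion and rounds the breakpoints to convenient values, so the number of bins drawn may not be exactly 25. The sketch below inspects the bins without plotting; `x` is a simulated stand-in for `sample_means50` (an assumption), and `h_default` and `h_finer` are hypothetical names.

```r
# Stand-in for sample_means50 (assumption): 5000 roughly normal values
set.seed(1)
x <- rnorm(5000, mean = 1500, sd = 70)

# plot = FALSE returns the histogram object instead of drawing it
h_default <- hist(x, plot = FALSE)
h_finer   <- hist(x, breaks = 25, plot = FALSE)

# Compare how many bins each call actually used
print(length(h_default$counts))
print(length(h_finer$counts))

# Width of the bins in the finer histogram
print(unique(diff(h_finer$breaks)))
```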