library(knitr)
load(url("http://www.openintro.org/stat/data/ames.RData"))
area <- ames$Gr.Liv.Area
price <- ames$SalePrice
summary(area)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     334    1126    1442    1500    1743    5642
hist(area)

  1. Describe this population distribution.

    The population distribution is right-skewed: values range from 334 to 5642, the middle half of the homes fall between 1126 and 1743, and the mean (1500) is larger than the median (1442).

samp1 <- sample(area, 50)
  1. Describe the distribution of this sample. How does it compare to the distribution of the population?

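    This sample is also right-skewed, like the population, although with only 50 observations its shape is rougher and will change from one sample to the next.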

mean(samp1)
## [1] 1504.94
  1. Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?
mu1 <- mean(samp1)
samp2 <- sample(area, 50)
mu2 <- mean(samp2)
# Build the sentence that compares the two sample means
if (mu1 < mu2) {
  printo <- "The mean of samp2 is greater than samp1"
} else if (mu1 > mu2) {
  printo <- "The mean of samp2 is less than samp1"
} else {
  printo <- "The mean of samp2 is equal to samp1"
}

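The mean of samp2 is 1358.04, versus 1504.94 for samp1, so the mean of samp2 is less than that of samp1. Of the two larger samples, the one of size 1000 should give the more accurate estimate of the population mean, because sample means vary less as the sample size grows.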
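
The question about samples of size 100 and 1000 can be checked with a quick simulation. The sketch below is only an illustration under stated assumptions: it uses a synthetic right-skewed population (`pop`, drawn with `rlnorm`) as a stand-in for `area` so that it runs without the Ames data, and the names `pop`, `sizes`, and `mean_abs_error` are hypothetical.

```r
set.seed(1)
# Synthetic right-skewed population, a stand-in for `area` (assumption, not the Ames data)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
mu <- mean(pop)

sizes <- c(50, 100, 1000)
# For each sample size, average the absolute error of the sample mean
# over 2000 repeated samples drawn without replacement
mean_abs_error <- sapply(sizes, function(n) {
  mean(abs(replicate(2000, mean(sample(pop, n))) - mu))
})
names(mean_abs_error) <- sizes
print(round(mean_abs_error, 1))
```

The typical error should shrink as the sample size increases, which is the sense in which the larger sample is "more accurate".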

sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   }

hist(sample_means50)

hist(sample_means50, breaks = 25)

  1. How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

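    sample_means50 has 5000 elements. The sampling distribution is approximately normal and centered at about 1500.17, very close to the population mean of 1500. Collecting 50,000 sample means instead should not change its center or spread noticeably; the histogram would only look smoother.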
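
Whether 50,000 sample means would change the picture can be explored directly. This is a minimal sketch, assuming a synthetic right-skewed population `pop` in place of `area`; the helper `sim_means` is a hypothetical name introduced only for this illustration.

```r
set.seed(2)
# Synthetic right-skewed population, a stand-in for `area` (assumption)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Hypothetical helper: draw `reps` samples of size n and return their means
sim_means <- function(reps, n = 50) replicate(reps, mean(sample(pop, n)))

means_5k  <- sim_means(5000)
means_50k <- sim_means(50000)

# Compare the center and spread of the two simulated sampling distributions
print(c(center_5k = mean(means_5k), center_50k = mean(means_50k)))
print(c(sd_5k = sd(means_5k), sd_50k = sd(means_50k)))
```

The number of simulated means controls how smooth the histogram looks, while the sample size (50 here) is what determines the spread.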

sample_means50 <- rep(NA, 5000)

samp <- sample(area, 50)
sample_means50[1] <- mean(samp)

samp <- sample(area, 50)
sample_means50[2] <- mean(samp)

samp <- sample(area, 50)
sample_means50[3] <- mean(samp)

samp <- sample(area, 50)
sample_means50[4] <- mean(samp)
sample_means50 <- rep(NA, 5000)

for(i in 1:5000){
   samp <- sample(area, 50)
   sample_means50[i] <- mean(samp)
   if(i==1){
    print("from ")
    print(i)
    print(" to ")
   }
   if(i==5000)
    print(i)
   }
## [1] "from "
## [1] 1
## [1] " to "
## [1] 5000
  1. To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?
sample_means_small <- rep(NA, 100)

for(i in 1:100){
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
   if(i==1){
    print("from ")
    print(i)
    print(" to ")
   }
   if(i==100)
    print(i)
}
## [1] "from "
## [1] 1
## [1] " to "
## [1] 100

sample_means_small has 100 elements, with a mean of 1492.16. Each element is the mean of one random sample of 50 homes drawn from area; the loop repeats that draw 100 times.
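
The same 100 sample means can be produced without an explicit loop. The sketch below uses base R's `replicate()` as an alternative to the `for` loop in the lab; it assumes a synthetic population `pop` in place of `area` so that it is self-contained.

```r
set.seed(3)
# Synthetic right-skewed population, a stand-in for `area` (assumption)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

# Each call to mean(sample(pop, 50)) yields one sample mean;
# replicate() repeats it 100 times and collects the results in a vector
sample_means_small <- replicate(100, mean(sample(pop, 50)))

print(length(sample_means_small))
print(head(sample_means_small))
```

The `for` loop makes the bookkeeping explicit, which is the point of the exercise; `replicate()` is simply a more compact way to write the same repetition.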

hist(sample_means50)

sample_means10 <- rep(NA, 5000)
sample_means100 <- rep(NA, 5000)

for(i in 1:5000){
  samp <- sample(area, 10)
  sample_means10[i] <- mean(samp)
  samp <- sample(area, 100)
  sample_means100[i] <- mean(samp)
}
par(mfrow = c(3, 1))

xlimits <- range(sample_means10)

hist(sample_means10, breaks = 20, xlim = xlimits)
hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means100, breaks = 20, xlim = xlimits)

  1. When the sample size is larger, what happens to the center? What about the spread?
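
One way to examine this question numerically is to tabulate the center and spread of the sampling distribution for each sample size. This is a sketch under assumptions: `pop` is a synthetic right-skewed stand-in for `area`, and `sizes` and `results` are hypothetical names. The `theory` column is the usual approximation sd(pop) / sqrt(n), which ignores the small finite-population correction that applies when sampling without replacement.

```r
set.seed(4)
# Synthetic right-skewed population, a stand-in for `area` (assumption)
pop <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)

sizes <- c(10, 50, 100)
# For each sample size, simulate 5000 sample means and summarize them
results <- t(sapply(sizes, function(n) {
  means <- replicate(5000, mean(sample(pop, n)))
  c(n = n, center = mean(means), spread = sd(means), theory = sd(pop) / sqrt(n))
}))
print(round(results, 1))
print(mean(pop))  # population mean, for comparison with the `center` column
```

In a run like this the `center` column should stay close to the population mean for every sample size, while the `spread` column should fall as n increases.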

On your own

So far, we have only focused on estimating the mean living area in homes in Ames. Now you’ll try to estimate the mean home price.

sample_price50<-sample(price,50)
mean_sample_price50<-mean(sample_price50)
# The point estimate of the population mean is the sample mean, not the sample itself
point_estimate_sample_price50<-format(mean_sample_price50,scientific=FALSE)
  • sample_price_of_50 = 64000, 229800, 137000, 130000, 126000, 126000, 245700, 225000, 275000, 270000, 174500, 135000, 155000, 124000, 184000, 236500, 71000, 138000, 359900, 402861, 272000, 103000, 190000, 140000, 149000, 158000, 104000, 158500, 197900, 116000, 203160, 264132, 153000, 132000, 119000, 155000, 93000, 75000, 133500, 151500, 148000, 193000, 275000, 279000, 155000, 180000, 132500, 139000, 135000, 176400
  • with a mean of: mean_sample_price_of_50 = 173797.06
  • with a point estimate of the population mean home price of: point_estimate_sample_price_of_50 = 173797.06
sample_means50<-rep(NA, 5000)
for(i in 1:5000){
  samp<-sample(price, 50)
  sample_means50[i]<-mean(samp)
}
hist(sample_means50, breaks=25)

  • The mean of sample_means50 is between 180000 and 181000: mean_sample_means50 = 180893.96
  • The sampling distribution of sample_means50 is approximately normal and centered at mean_sample_means50
  • The point estimate of the population mean home price is: point_estimate_sample_means50 = 180894
sample_means150<-rep(NA, 5000)
for(i in 1:5000){
  samp<-sample(price, 150)
  sample_means150[i]<-mean(samp)
}
  • The mean of sample_means150 is between 180000 and 181000: mean_sample_means150 = 180708.61
  • The sampling distribution of sample_means150 is approximately normal and centered at mean_sample_means150
  • The point estimate of the population mean home price is: point_estimate_sample_means150 = 180708.6
  • Comparing the two, the larger sample size gives a sampling distribution that is narrower and closer to normal in shape
par(mfrow = c(1, 2))
xlimits <- range(sample_means50)

hist(sample_means50, breaks = 20, xlim = xlimits)
hist(sample_means150, breaks = 20, xlim = xlimits)

xlimits50 <- range(sample_means50)
range50diff <- xlimits50[2] - xlimits50[1]

xlimits150 <- range(sample_means150)
range150diff <- xlimits150[2] - xlimits150[1]
  • The range of sample_means50 is 146487 to 233205.36
  • The range of sample_means150 is 160778.4 to 203120.13
  • range50diff = 86718.36
  • range150diff = 42341.73
  • The spread narrows as the sample size grows: range150diff < range50diff
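
The range depends on the two most extreme sample means, so it can vary a lot between runs; the standard deviation is a steadier way to compare spread. The sketch below is an illustration under assumptions: `pop` is a synthetic right-skewed stand-in for `price`, and `means50`, `means150`, `sd50`, and `sd150` are hypothetical names. Because the standard error scales roughly as 1 / sqrt(n), tripling the sample size from 50 to 150 should shrink the spread by a factor of roughly sqrt(3).

```r
set.seed(5)
# Synthetic right-skewed "prices", a stand-in for `price` (assumption)
pop <- rlnorm(2930, meanlog = 12, sdlog = 0.4)

# Simulated sampling distributions for samples of size 50 and 150
means50  <- replicate(5000, mean(sample(pop, 50)))
means150 <- replicate(5000, mean(sample(pop, 150)))

sd50  <- sd(means50)
sd150 <- sd(means150)

# Compare the observed ratio of spreads with sqrt(3)
print(c(sd50 = sd50, sd150 = sd150, ratio = sd50 / sd150, sqrt3 = sqrt(3)))
```

The observed ratio will not match sqrt(3) exactly, partly from simulation noise and partly because sampling without replacement from a finite population reduces the spread a little more for the larger sample.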