DATA 606 - Lab 4a

knitr::opts_chunk$set(echo = TRUE)
download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
library(DATA606)

## 
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics 
## This package is designed to support this course. The text book used 
## is OpenIntro Statistics, 3rd Edition. You can read this by typing 
## vignette('os3') or visit www.OpenIntro.org. 
##  
## The getLabs() function will return a list of the labs available. 
##  
## The demo(package='DATA606') will list the demos that are available.

## 
## Attaching package: 'DATA606'

## The following object is masked from 'package:utils':
## 
##     demo

library(psych)

Exercise 1 - Describe this population data.

The population data is unimodal and right skewed. You can expect area per house to be around 1442 (median)

area <- ames$Gr.Liv.Area
price <- ames$SalePrice
describe(area)

##    vars    n    mean     sd median trimmed    mad min  max range skew
## X1    1 2930 1499.69 505.51   1442 1452.25 461.09 334 5642  5308 1.27
##    kurtosis   se
## X1     4.12 9.34

qqnorm(area)
qqline(area)

Exercise 2 Describe the distribution of this sample. How does it compare to the distribution of the population?

This sample is also a unimodal right skewed distribution. With an expected area of 1343 (median)

set.seed(100)
samp1 <- sample(area, 50)
describe(samp1)

##    vars  n    mean     sd median trimmed    mad min  max range skew
## X1    1 50 1441.52 448.68   1343 1381.65 360.27 729 2673  1944 1.05
##    kurtosis    se
## X1      0.4 63.45

qqnorm(samp1)
qqline(samp1)

Exercise 3 Take a second sample, also of size 50, and call it samp2. How does the mean of samp2 compare with the mean of samp1? Suppose we took two more samples, one of size 100 and one of size 1000. Which would you think would provide a more accurate estimate of the population mean?

If we keep taking samples of 50, I imagine about 95% of the means to be within two standard deviations of the true mean. As the sample size gets greater, the sample means should be closer to the real mean.

set.seed(200)
samp2 = sample(area,50)
samp3 = sample(area,100)
samp4 = sample(area,1000)
describe(samp2)

##    vars  n   mean     sd median trimmed    mad min  max range skew
## X1    1 50 1486.7 508.94   1446  1466.7 614.54 572 2501  1929 0.19
##    kurtosis    se
## X1    -0.96 71.97

describe(samp3)

##    vars   n   mean     sd median trimmed    mad min  max range skew
## X1    1 100 1547.4 491.44   1498 1509.84 435.88 599 3672  3073 1.01
##    kurtosis    se
## X1     2.44 49.14

describe(samp4)

##    vars    n    mean    sd median trimmed    mad min  max range skew
## X1    1 1000 1492.12 510.1 1427.5 1441.42 454.42 438 4676  4238 1.28
##    kurtosis    se
## X1     3.39 16.13

Exercise 4 How many elements are there in sample_means50? Describe the sampling distribution, and be sure to specifically note its center. Would you expect the distribution to change if we instead collected 50,000 sample means?

There are 5,000 elements in the sample_means50. The sampling distribution is approaching a perfect normality, unimodal without skew. The center is its mean around 1500. The distribution would become more normalized as the sample goes up.

Exercise 5 To make sure you understand what you’ve done in this loop, try running a smaller version. Initialize a vector of 100 zeros called sample_means_small. Run a loop that takes a sample of size 50 from area and stores the sample mean in sample_means_small, but only iterate from 1 to 100. Print the output to your screen (type sample_means_small into the console and press enter). How many elements are there in this object called sample_means_small? What does each element represent?

Each of the 100 elements respresents a sampled mean of area.

sample_means_small = rep(0,100)
for(i in 1:100){
   samp <- sample(area, 50)
   sample_means_small[i] <- mean(samp)
}
length(sample_means_small)

## [1] 100

Exercise 5 When the sample size is larger, what happens to the center? What about the spread?

The center, centers itself as a normal distribution does. That is to say, more results are closer to the mean. The spread becomes unimodal and less skewed.

Question 1 . Take a random sample of size 50 from price. Using this sample, what is your best point estimate of the population mean?

sampq1 = sample(price,50)
mean(sampq1)

## [1] 160984.7

Question 2 . Since you have access to the population, simulate the sampling distribution for x¯pricex¯price by taking 5000 samples from the population of size 50 and computing 5000 sample means. Store these means in a vector called sample_means50. Plot the data, then describe the shape of this sampling distribution. Based on this sampling distribution, what would you guess the mean home price of the population to be? Finally, calculate and report the population mean.

Shape of the sampling distribution is normal. I would guess the mean home price of the population to be 180,000.

set.seed(2500)
sample_means50 = rep(NA,5000)
for(i in 1:5000){
   samp <- sample(price, 50)
   sample_means50[i] <- mean(samp)
}
hist(sample_means50, breaks =30)

qqnorm(sample_means50)
qqline(sample_means50)

describe(sample_means50)

##    vars    n     mean       sd   median  trimmed     mad      min      max
## X1    1 5000 180769.7 11129.42 180149.5 180490.9 11055.2 134775.5 223689.5
##       range skew kurtosis     se
## X1 88913.98 0.25     0.14 157.39

mean(price)

## [1] 180796.1

Question 3 . Change your sample size from 50 to 150, then compute the sampling distribution using the same method as above, and store these means in a new vector called sample_means150. Describe the shape of this sampling distribution, and compare it to the sampling distribution for a sample size of 50. Based on this sampling distribution, what would you guess to be the mean sale price of homes in Ames?

The shape of this sampling distribution looks to keep everything the same as the last, only fewer outliers and all results are closer to the mean which looks again to be around 180,000.

set.seed(2500)
sample_means150 = rep(NA,5000)
for(i in 1:5000){
   samp <- sample(price, 150)
   sample_means150[i] <- mean(samp)
}
hist(sample_means150, breaks =30)

qqnorm(sample_means150)
qqline(sample_means150)

describe(sample_means150)

##    vars    n     mean      sd   median  trimmed     mad      min      max
## X1    1 5000 180854.6 6403.95 180819.7 180791.4 6291.15 158473.6 203272.4
##       range skew kurtosis    se
## X1 44798.75 0.08        0 90.57

mean(price)

## [1] 180796.1

Question 4 . Of the sampling distributions from 2 and 3, which has a smaller spread? If we’re concerned with making estimates that are more often close to the true value, would we prefer a distribution with a large or small spread?

Distribution 3 has a smaller spread, we would use this smaller spread to find closer to true values for almost all conceivable observations on the population data.