CUNY DATA606

Part b of Lab 4

load("C:/Users/ZacharyHerold/Documents/DATA606/Lab4a/more/ames.RData")

head(ames[1:6,1:7])

##   Order       PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## 1     1 526301100          20        RL          141    31770   Pave
## 2     2 526350040          20        RH           80    11622   Pave
## 3     3 526351010          20        RL           81    14267   Pave
## 4     4 526353030          20        RL           93    11160   Pave
## 5     5 527105010          60        RL           74    13830   Pave
## 6     6 527105030          60        RL           78     9978   Pave

length(ames$Gr.Liv.Area)

## [1] 2930

This is the population size (total number of houses).

population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
samp

##  [1] 1576 1124 1764 1838 1344 1694 1992 1002 1056  943 1158 2140 2287 1582
## [15] 1362 1057 1784 1434 1073 1088 1348 1365  641 1208 1172 1656 1663 1150
## [29]  899 1248  879 1552 1992 1141 1187 1456 1173  864 1040 2013 2322 1436
## [43] 1479 1552 1152 1596 1216 1573 1939 1795 1641 1446 1710 1328 2495 2439
## [57] 1034 2060 1768 1582

And this is a random sample of 60 house sizes.

Exercise 1

Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

hist(samp, breaks = 10, main = "House sizes (sample size of 60)")

ANS: When broken into 10 bins, the distribution appears to be skewed to the right. The sample is less than 10% of the population and is reasonably large, so the sampling distribution of the mean should still be approximately normal despite the skew.

summary(samp)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     641    1152    1441    1475    1724    2495

sd(samp)

## [1] 418.4417

The mean (1475) sits above the median (1441), which suggests the density curve is not symmetrical. The maximum is well over 2 SD above the mean, while the minimum is just under 2 SD below it. Both point to a slight rightward skew.

Given the skew, the median is perhaps a better indicator of the “typical” size, taking “typical” to mean the center of the distribution.
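
To make the comparison in standard deviations concrete, the distances can be computed directly from the printed output. This is only a sketch: it retypes the rounded values shown by summary(samp) and sd(samp) instead of using the samp vector, so the results are approximate, and the variable names are illustrative.

```r
# Values retyped from the summary(samp) and sd(samp) output above (rounded)
samp_min          <- 641
samp_max          <- 2495
samp_mean_rounded <- 1475
samp_sd_printed   <- 418.4417

# Distance of each extreme from the mean, in standard deviations
z_min <- (samp_min - samp_mean_rounded) / samp_sd_printed
z_max <- (samp_max - samp_mean_rounded) / samp_sd_printed

round(c(min = z_min, max = z_max), 2)
```

The maximum lands about 2.4 SD above the mean and the minimum about 2 SD below it, which matches the asymmetry described above.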

qqnorm(samp)
qqline(samp)

The Q-Q plot reveals extreme outliers in both tails of the distribution. A step-like pattern also appears, with no obvious explanation.

Exercise 2

Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

ANS: Not identical, but similar. The sample is relatively small, which allows for more discrepancy between our results. Because each sample was drawn randomly with the sample() function, we can be sure the draws differ, leading to differing means and standard deviations, the two parameters of the normal curve.

Exercise 3

For the confidence interval to be valid, the sample mean must be normally distributed and have standard error (sd()/sqrt(n)). What conditions must be met for this to be true?

ANS: The observations must be independent (a random sample of less than 10% of the population), and either the population is nearly normal or the sample is large enough for the CLT to kick in.
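
The sample-size conditions can be checked with a few lines of arithmetic. This is a minimal sketch using the sizes reported earlier in the lab; the threshold of 30 is a common rule of thumb rather than a strict requirement, and the variable names are illustrative.

```r
# Sizes reported earlier in this lab
pop_size  <- 2930   # length(ames$Gr.Liv.Area)
samp_size <- 60     # houses in the sample

sampling_fraction <- samp_size / pop_size
sampling_fraction          # roughly 0.02
sampling_fraction < 0.10   # TRUE when the 10% condition holds
samp_size >= 30            # common rule of thumb for "large enough"
```

Both checks pass here, although neither says anything about whether the draw itself was random.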

Exercise 4

What does “95% confidence” mean?

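ANS: It means that we are 95% confident that the population mean lies within the confidence interval. More precisely, about 95% of intervals built this way from repeated random samples would capture the population mean. The interval is determined by the sample mean, the critical value (set by the confidence level), the sample SD, and the sample size.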

In this case, we calculate the confidence interval, first finding the critical value:

mean.samp <- mean(samp)
sd.samp <- sd(samp)
n <- 60

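conf_level <- .95            # confidence level (avoids masking base::sign)
alpha <- 1 - conf_level      # total area in both tails
tail_area <- alpha/2         # area in each tail
crit <- qnorm(1 - tail_area) # two-sided critical value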
crit

## [1] 1.959964

and then calculating the bounds from that.

lower_vector1 <- mean.samp - crit * sd.samp / sqrt(n) 
upper_vector1 <- mean.samp + crit * sd.samp / sqrt(n)

lower_vector1

## [1] 1369.255

upper_vector1

## [1] 1581.012
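
The same calculation can be wrapped in a small function so that the confidence level is a parameter. The helper z_interval() below is hypothetical (it is not part of the lab materials), and its inputs are retyped from the printed output: the mean is taken as the midpoint of the two bounds above, which is an assumption since mean(samp) is never printed to full precision.

```r
# Hypothetical helper: z-based confidence interval from summary statistics
z_interval <- function(xbar, s, n, level = 0.95) {
  crit <- qnorm(1 - (1 - level) / 2)  # two-sided critical value
  se   <- s / sqrt(n)                 # standard error of the mean
  c(lower = xbar - crit * se, upper = xbar + crit * se)
}

# Assumed inputs: xbar is the midpoint of the printed bounds,
# s is the printed sd(samp), n is the sample size used above
ci_95 <- z_interval(xbar = 1475.1335, s = 418.4417, n = 60)
round(ci_95, 3)
```

Up to rounding, this reproduces the bounds 1369.255 and 1581.012 computed above.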

Exercise 5

Does your confidence interval capture the true average size of houses in Ames?

mean(population)

## [1] 1499.69

ANS: Yes, the population mean of 1499.69 falls safely within the 95% confidence interval (1369.255 to 1581.012).

If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

ANS: It is likely to, but not guaranteed.

Exercise 6

Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?

ANS: One would expect 95% of them to capture the population mean, as that is how confidence intervals are defined.
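
The long-run claim can be illustrated by simulating a large class. The sketch below does not use the Ames data: it builds a synthetic right-skewed population with rgamma() as a stand-in (mean near 1500, SD near 500, both assumptions chosen to resemble the lab), and the seed and the number of simulated students are arbitrary.

```r
set.seed(606)  # arbitrary seed, chosen only for reproducibility

# Synthetic right-skewed stand-in population (NOT the Ames data)
fake_pop  <- rgamma(2930, shape = 9, rate = 9 / 1500)
true_mean <- mean(fake_pop)

n_obs  <- 60     # observations per sample
n_sims <- 2000   # number of simulated "students"
crit   <- qnorm(0.975)

covered <- replicate(n_sims, {
  s  <- sample(fake_pop, n_obs)
  se <- sd(s) / sqrt(n_obs)
  abs(mean(s) - true_mean) <= crit * se  # TRUE if the interval captures the mean
})

coverage <- mean(covered)
coverage
```

With this many intervals the observed coverage should sit close to 0.95, though a skewed population and a z critical value can pull it slightly below.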

On Your Own

Here we draw 50 random samples of 60 observations each, record each sample's mean and standard deviation, and construct intervals at the 95% confidence level.

set.seed(45)
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60

for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

length(lower_vector)

## [1] 50

bounds <- data.frame(lower = lower_vector, upper = upper_vector)
tail(bounds)

##       lower    upper
## 45 1352.998 1566.669
## 46 1360.508 1619.626
## 47 1353.687 1574.546
## 48 1384.866 1711.568
## 49 1471.060 1722.440
## 50 1319.005 1546.695
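
As a quick sanity check, the six intervals printed above can be tested against the population mean by hand. This sketch retypes the printed bounds and the value of mean(population) reported in Exercise 5, so it does not depend on the data file; the names last_six and pop_mean are illustrative.

```r
# The six intervals printed by tail(bounds), retyped by hand
last_six <- data.frame(
  lower = c(1352.998, 1360.508, 1353.687, 1384.866, 1471.060, 1319.005),
  upper = c(1566.669, 1619.626, 1574.546, 1711.568, 1722.440, 1546.695),
  row.names = 45:50
)
pop_mean <- 1499.69  # mean(population) reported in Exercise 5

# An interval captures the mean when lower <= mean <= upper
last_six$captures <- last_six$lower <= pop_mean & pop_mean <= last_six$upper
last_six
mean(last_six$captures)  # proportion of these six that capture the mean
```

All six of these intervals contain 1499.69.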

(1)

Using the following function (which was downloaded with the data set), plot all intervals.

plot_ci(lower_vector, upper_vector, mean(population))

What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.

pop.mean <- mean(population)
pop.mean

## [1] 1499.69

outside <- pop.mean > upper_vector | pop.mean < lower_vector
outside

##  [1] FALSE FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [12] FALSE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE

outliers <- sum(outside)
outliers

## [1] 5

outliers / length(lower_vector)

## [1] 0.1

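Five of the 50 confidence intervals (10%) missed the population mean, so 90% captured it. This is worse than the 5% miss rate predicted by the confidence level, and the proportion is not exactly equal to 95%. The gap is likely due to chance: the confidence level describes the long-run capture rate, and only 50 intervals were drawn here.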
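
How surprising is this result? If each interval independently misses with probability 0.05 (an idealizing assumption), the number of misses in 50 intervals follows a binomial distribution, and the chance of seeing five or more can be computed with pbinom(). The variable names below are illustrative.

```r
n_intervals <- 50     # intervals constructed above
miss_rate   <- 0.05   # long-run miss rate of a 95% interval
observed    <- 5      # misses counted above

expected_misses <- n_intervals * miss_rate

# P(X >= 5) for X ~ Binomial(50, 0.05), assuming independent intervals
p_at_least_observed <- 1 - pbinom(observed - 1, n_intervals, miss_rate)

expected_misses
round(p_at_least_observed, 3)
```

The expected count is 2.5 misses, and five or more should occur in roughly one run in ten, so this outcome is unremarkable.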

(2)

Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?

For this exercise, I choose a confidence level of 75%.

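conf_level <- .75             # confidence level (avoids masking base::sign)

alpha <- 1 - conf_level       # total area in both tails
tail_area <- alpha/2          # area in each tail
crit2 <- qnorm(1 - tail_area) # two-sided critical value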
crit2

## [1] 1.150349

The critical value is 1.15.
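
For comparison, the same formula gives the critical values for several common confidence levels at once, since qnorm() is vectorized. The particular levels listed are just examples.

```r
conf_levels <- c(0.75, 0.90, 0.95, 0.99)

# Two-sided critical value: leave (1 - level) / 2 in each tail
crit_values <- qnorm(1 - (1 - conf_levels) / 2)

data.frame(level = conf_levels, critical_value = round(crit_values, 3))
```

Lower confidence levels give smaller critical values and therefore narrower intervals, which is why more of them miss the population mean.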

(3)

Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean.

lower_vector2 <- samp_mean - crit2 * samp_sd / sqrt(n) 
upper_vector2 <- samp_mean + crit2 * samp_sd / sqrt(n)
plot_ci(lower_vector2, upper_vector2, mean(population))

How does this percentage compare to the confidence level selected for the intervals?

outside2 <- pop.mean > upper_vector2 | pop.mean < lower_vector2
outside2

##  [1]  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE FALSE
## [12]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE FALSE
## [34]  TRUE FALSE FALSE FALSE  TRUE FALSE FALSE FALSE  TRUE  TRUE FALSE
## [45] FALSE FALSE FALSE FALSE  TRUE  TRUE

outliers2 <- sum(outside2)
outliers2

## [1] 16

outliers2 / length(lower_vector2)

## [1] 0.32

In 32% of the simulations the confidence interval did not capture the population mean; in other words, 68% of the intervals did. This is less than the 75% we expected from the confidence level. With only 50 intervals the observed proportion is subject to chance, and a much larger number of iterations would likely bring it very close to 75%.
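
The effect of the number of iterations can be sketched with the standard error of a proportion. Assuming independent intervals whose true coverage is 0.75, the observed coverage over m intervals has standard error sqrt(0.75 * 0.25 / m); the interval counts below are arbitrary examples.

```r
conf_level  <- 0.75
n_intervals <- c(50, 500, 5000, 50000)

# Standard error of the observed coverage proportion,
# assuming independent intervals with true coverage 0.75
se_coverage <- sqrt(conf_level * (1 - conf_level) / n_intervals)

# How far the observed 68% sits from 75%, in standard errors (50 intervals)
z_observed <- (0.68 - conf_level) / se_coverage[1]

data.frame(n_intervals, se_coverage = round(se_coverage, 4))
round(z_observed, 2)
```

With 50 intervals the standard error is about 6 percentage points, so an observed 68% is only a little over one standard error below 75%. The standard error shrinks as the number of intervals grows.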

CUNY DATA606_Lab4b

Zachary Herold

October 22, 2018