load("C:/Users/ZacharyHerold/Documents/DATA606/Lab4a/more/ames.RData")
head(ames[1:6,1:7])
##   Order       PID MS.SubClass MS.Zoning Lot.Frontage Lot.Area Street
## 1     1 526301100          20        RL          141    31770   Pave
## 2     2 526350040          20        RH           80    11622   Pave
## 3     3 526351010          20        RL           81    14267   Pave
## 4     4 526353030          20        RL           93    11160   Pave
## 5     5 527105010          60        RL           74    13830   Pave
## 6     6 527105030          60        RL           78     9978   Pave
length(ames$Gr.Liv.Area)
## [1] 2930
This is the population size (total number of houses).
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
samp
## [1] 1576 1124 1764 1838 1344 1694 1992 1002 1056 943 1158 2140 2287 1582
## [15] 1362 1057 1784 1434 1073 1088 1348 1365 641 1208 1172 1656 1663 1150
## [29] 899 1248 879 1552 1992 1141 1187 1456 1173 864 1040 2013 2322 1436
## [43] 1479 1552 1152 1596 1216 1573 1939 1795 1641 1446 1710 1328 2495 2439
## [57] 1034 2060 1768 1582
And this is a sample taken of 60 house sizes.
Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.
hist(samp, breaks = 10, main = "House sizes (sample size of 60)")
ANS: With 10 bins, the distribution appears right-skewed. The sample is less than 10% of the population and reasonably large (n = 60), so although the sample itself is not normal, the sampling distribution of its mean should be approximately normal.
summary(samp)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##     641    1152    1441    1475    1724    2495
sd(samp)
## [1] 418.4417
The gap between the median (1441) and the mean (1475) suggests that the density curve is not symmetrical. The maximum lies well over 2 SD above the mean, while the minimum lies just under 2 SD below it, which points to a slight rightward skew.
Because the median is less sensitive to the long right tail, it is perhaps the better indicator of the “typical” size.
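The distances from the mean quoted above can be checked from the summary statistics. This is a minimal sketch: the values are copied from the output above (the mean is that of the 60 printed values, which summary() rounds to 1475), and the variable names are illustrative only.

```r
# Summary statistics copied from the output above
samp_mean_val <- 1475.133   # mean of the 60 sampled values
samp_sd_val   <- 418.4417   # sample standard deviation, from sd(samp)
samp_min      <- 641
samp_max      <- 2495

# Distance of each extreme from the mean, in standard deviations
z_min <- (samp_min - samp_mean_val) / samp_sd_val
z_max <- (samp_max - samp_mean_val) / samp_sd_val
round(c(min = z_min, max = z_max), 2)
```

A maximum further from the mean than the minimum, in SD units, is consistent with a longer right tail.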
qqnorm(samp)
qqline(samp)
The Q-Q plot reveals extreme values at both ends of the distribution. A step-like pattern also appears, possibly because living areas are recorded as whole numbers and a few values repeat in the sample.
Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?
ANS: Not identical, but probably similar. The sample is relatively small, which leaves room for discrepancy between results. Because each sample is drawn at random with the sample() function, different students will almost certainly draw different houses, leading to different means and standard deviations, the two parameters of the normal curve.
For the confidence interval to be valid, the sample mean must be normally distributed and have standard error (sd()/sqrt(n)). What conditions must be met for this to be true?
ANS: The observations must be independent (a random sample of less than 10% of the population), and either the population must be nearly normal or the sample must be large enough for the CLT to apply.
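For this lab, the independence and sample-size conditions can be checked with simple arithmetic. The sketch below uses the population size reported earlier; the n >= 30 threshold is a common rule of thumb, not something stated in the lab itself.

```r
N <- 2930  # population size, from length(ames$Gr.Liv.Area)
n <- 60    # sample size

# Independence: the sample should be under 10% of the population
sampling_fraction <- n / N
sampling_fraction
sampling_fraction < 0.10

# Sample size: rule-of-thumb threshold for the CLT (an assumption here)
n >= 30
```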
What does “95% confidence” mean?
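ANS: It means that the method used to build the interval captures the population mean in about 95% of repeated samples; informally, we are 95% confident that the population mean lies within our interval. The interval's width is set by the critical value (which grows with the confidence level), the sample SD, and the sample size.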
In this case, we calculate the confidence interval, first finding the critical value:
mean.samp <- mean(samp)
sd.samp <- sd(samp)
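n <- 60                        # sample size
conf_level <- .95              # confidence level
alpha <- 1 - conf_level        # total probability left in both tails
crit <- qnorm(1 - alpha / 2)   # two-sided critical value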
crit
## [1] 1.959964
and then calculating the bounds from that.
lower_vector1 <- mean.samp - crit * sd.samp / sqrt(n)
upper_vector1 <- mean.samp + crit * sd.samp / sqrt(n)
lower_vector1
## [1] 1369.255
upper_vector1
## [1] 1581.012
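The interval above uses a normal critical value. Because the standard deviation is estimated from the sample, a common alternative is a t critical value with n - 1 degrees of freedom, which gives a slightly wider interval. The sketch below compares the two on the 60 values printed earlier; ci_mean is a hypothetical helper written for this illustration and is not part of the lab materials.

```r
# The 60 sampled living areas printed earlier in this report
samp_values <- c(
  1576, 1124, 1764, 1838, 1344, 1694, 1992, 1002, 1056,  943, 1158, 2140,
  2287, 1582, 1362, 1057, 1784, 1434, 1073, 1088, 1348, 1365,  641, 1208,
  1172, 1656, 1663, 1150,  899, 1248,  879, 1552, 1992, 1141, 1187, 1456,
  1173,  864, 1040, 2013, 2322, 1436, 1479, 1552, 1152, 1596, 1216, 1573,
  1939, 1795, 1641, 1446, 1710, 1328, 2495, 2439, 1034, 2060, 1768, 1582
)

# Hypothetical helper: interval for the mean with a z or t critical value
ci_mean <- function(x, conf_level = 0.95, method = c("z", "t")) {
  method <- match.arg(method)
  n <- length(x)
  se <- sd(x) / sqrt(n)               # estimated standard error
  p <- 1 - (1 - conf_level) / 2       # upper-tail quantile level
  crit <- if (method == "z") qnorm(p) else qt(p, df = n - 1)
  c(lower = mean(x) - crit * se, upper = mean(x) + crit * se)
}

ci_z <- ci_mean(samp_values, method = "z")
ci_t <- ci_mean(samp_values, method = "t")
rbind(z = ci_z, t = ci_t)
```

With n = 60 the two intervals differ only slightly, so the choice matters little here; it matters more for small samples.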
Does your confidence interval capture the true average size of houses in Ames?
mean(population)
## [1] 1499.69
ANS: Yes, the population mean of 1499.69 lies within the 95% confidence interval (1369.26, 1581.01).
If you are working on this lab in a classroom, does your neighbor’s interval capture this value?
ANS: It is likely to, but not guaranteed.
Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?
ANS: One would expect 95% of them to capture the population mean, as that is how confidence intervals are defined.
Next, we draw 50 samples of 60 random observations each, record each sample's mean and standard deviation, and construct intervals at the 95% confidence level.
set.seed(45)
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
for (i in 1:50) {
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
length(lower_vector)
## [1] 50
# Collect both bounds in one data frame, one row per sample
bounds <- data.frame(lower = lower_vector, upper = upper_vector)
tail(bounds)
##       lower    upper
## 45 1352.998 1566.669
## 46 1360.508 1619.626
## 47 1353.687 1574.546
## 48 1384.866 1711.568
## 49 1471.060 1722.440
## 50 1319.005 1546.695
Using the following function (which was downloaded with the data set), plot all intervals.
plot_ci(lower_vector, upper_vector, mean(population))
What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.
pop.mean <- mean(population)
pop.mean
## [1] 1499.69
outside <- pop.mean > upper_vector | pop.mean < lower_vector
outside
## [1] FALSE FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [12] FALSE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [34] FALSE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE FALSE FALSE
## [45] FALSE FALSE FALSE FALSE FALSE FALSE
outliers <- sum(outside)
outliers
## [1] 5
outliers / length(lower_vector)
## [1] 0.1
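Five of the 50 confidence intervals (10%) missed the population mean, so 90% captured it. This miss rate is higher than the 5% implied by the confidence level, but the confidence level describes long-run behavior, and with only 50 intervals a deviation of this size is likely due to chance.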
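One way to judge whether 5 misses out of 50 is surprising is to treat the number of misses as binomial. This is a rough sketch that assumes each interval misses independently with probability exactly 0.05, which holds only approximately here; the variable names are illustrative.

```r
n_intervals <- 50     # number of intervals constructed
miss_rate   <- 0.05   # expected miss rate at 95% confidence

# Probability of 5 or more misses under the binomial assumption
p_five_or_more <- 1 - pbinom(4, size = n_intervals, prob = miss_rate)
p_five_or_more
```

If this probability is not small, a 10% miss rate in 50 intervals is unremarkable.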
Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?
For this exercise, I choose a confidence level of 75%.
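conf_level <- .75               # chosen confidence level
alpha <- 1 - conf_level         # total probability left in both tails
crit2 <- qnorm(1 - alpha / 2)   # two-sided critical value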
crit2
## [1] 1.150349
The critical value is 1.15.
Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean.
lower_vector2 <- samp_mean - crit2 * samp_sd / sqrt(n)
upper_vector2 <- samp_mean + crit2 * samp_sd / sqrt(n)
plot_ci(lower_vector2, upper_vector2, mean(population))
How does this percentage compare to the confidence level selected for the intervals?
outside2 <- pop.mean > upper_vector2 | pop.mean < lower_vector2
outside2
## [1] TRUE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE FALSE
## [12] TRUE TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [23] FALSE FALSE TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE FALSE
## [34] TRUE FALSE FALSE FALSE TRUE FALSE FALSE FALSE TRUE TRUE FALSE
## [45] FALSE FALSE FALSE FALSE TRUE TRUE
outliers2 <- sum(outside2)
outliers2
## [1] 16
outliers2 / length(lower_vector2)
## [1] 0.32
In 32% of the simulations the confidence interval did not capture the population mean; in other words, the intervals were right 68% of the time. This is less than the 75% we expected from the confidence level. With far more iterations, the proportion would likely come very close to 75%.
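The claim about many iterations can be explored with a larger simulation. The sketch below does not use the Ames data: it assumes a synthetic right-skewed population drawn from a gamma distribution as a stand-in, and the seed, the gamma parameters, and the number of repetitions are all arbitrary choices made for illustration.

```r
set.seed(1)  # arbitrary seed, for reproducibility only

# ASSUMPTION: synthetic right-skewed stand-in for the Ames living areas
population_sim <- rgamma(2930, shape = 9, scale = 165)
true_mean <- mean(population_sim)

n <- 60          # sample size, as in the lab
n_reps <- 5000   # far more intervals than the 50 used above
crit75 <- qnorm(1 - (1 - 0.75) / 2)  # critical value for 75% confidence

# For each repetition, record whether the interval captures the true mean
covered <- replicate(n_reps, {
  s <- sample(population_sim, n)
  half_width <- crit75 * sd(s) / sqrt(n)
  abs(mean(s) - true_mean) <= half_width
})

coverage <- mean(covered)
coverage
```

The observed coverage should settle near the nominal 75% as n_reps grows, whereas 50 intervals can easily land several percentage points away.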