Set working directory and source the data.
if (Sys.info()["sysname"] == "Windows") {
  setwd("~/Masters/DATA606/Week4/Lab/lab4b")
} else {
  setwd("~/Documents/Masters/DATA606/Week4/Lab/lab4b")
}
load("more/ames.RData")
library(DATA606)
##
## Welcome to CUNY DATA606 Statistics and Probability for Data Analytics
## This package is designed to support this course. The text book used
## is OpenIntro Statistics, 3rd Edition. You can read this by typing
## vignette('os3') or visit www.OpenIntro.org.
##
## The getLabs() function will return a list of the labs available.
##
## The demo(package='DATA606') will list the demos that are available.
##
## Attaching package: 'DATA606'
## The following object is masked from 'package:utils':
##
## demo
require(ggplot2)
## Loading required package: ggplot2
Answer:
First, I will generate the necessary graphs and summary statistics.
set.seed(160)
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
hist(samp, 15)
qqnorm(samp)
qqline(samp)
summary(samp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 480 1050 1369 1400 1667 2799
table(samp)
## samp
## 480 630 694 704 768 789 833 848 854 894 958 987 1025 1040 1048
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 1050 1072 1074 1143 1144 1178 1179 1204 1218 1235 1261 1299 1304 1308 1430
## 1 1 1 1 1 1 1 1 2 1 1 1 1 1 1
## 1431 1456 1470 1486 1488 1498 1501 1502 1536 1611 1626 1652 1712 1738 1802
## 1 1 1 1 1 1 1 1 1 1 3 1 1 1 1
## 1827 1840 1845 1873 1949 1978 1986 2000 2253 2260 2784 2799
## 1 1 1 1 1 1 1 1 1 1 1 1
The data appear to be right skewed, as indicated by the histogram and the normal probability plot, with a mean of 1400 and a median of 1369. I interpret “typical” to mean the most frequently recurring value; however, that would be the mode, and almost every value in the sample is unique (only 1218 and 1626 appear more than once), so the mode is not a good measure of the “typical” size here. Given the skew, I would guess that the median is the best estimate of a typical size.
Answer:
I would expect it to be similar, but not the same, because of sampling variability: each random sample contains a different set of houses.
Answer:
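The important conditions for the sample mean to be nearly normally distributed and for the standard error to be accurate are: the observations are independent (a random sample of less than 10% of the population), the sample is large (at least 30 observations), and the population distribution is not strongly skewed.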
Answer:
95% confidence means that we are 95% confident that the actual population mean lies within the stated range (sample mean +/- 1.96 * SE). If we took many samples and built a confidence interval from each one, about 95% of those intervals would contain the actual mean.
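For reference, the interval described in this answer has the standard large-sample form below. The symbols are labels chosen here for illustration: \(\bar{x}\) is the sample mean, \(s\) the sample standard deviation, \(n\) the sample size, and \(z^{*}\) the critical value (1.96 for 95% confidence).

\[
\bar{x} \;\pm\; z^{*} \cdot SE, \qquad SE = \frac{s}{\sqrt{n}}
\]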
Answer:
sample_mean <- mean(samp)
se <- sd(samp)/sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1276.855 1523.945
mean(population)
## [1] 1499.69
Yes, the true average size of houses (1499.69) was captured within my confidence interval (1276.855 to 1523.945).
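The same calculation can be wrapped in a small function so that the capture check is done in code rather than by eye. This is only a sketch: `mean_ci` is a hypothetical helper, not part of the lab or the DATA606 package, and `fake_population` is a synthetic right-skewed stand-in used so the example runs without `ames.RData`.

```r
# Hypothetical helper (not from the lab): z-based confidence interval for a mean.
mean_ci <- function(x, level = 0.95) {
  z  <- qnorm(1 - (1 - level) / 2)   # two-sided critical value
  se <- sd(x) / sqrt(length(x))      # standard error of the sample mean
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Synthetic right-skewed population (assumption; NOT the Ames data).
set.seed(1)
fake_population <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
fake_samp <- sample(fake_population, 60)

ci <- mean_ci(fake_samp)
mu <- mean(fake_population)

# TRUE if the interval captures the population mean, FALSE otherwise.
covers <- ci[["lower"]] <= mu && mu <= ci[["upper"]]
ci
covers
```

With the real data, `mean_ci(samp)` should give approximately the interval computed above; it will differ slightly because `qnorm` returns about 1.959964 rather than the rounded 1.96.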
Answer:
About 95% of the intervals would capture the true population mean. As in my previous response, the confidence level describes the proportion of intervals, built from repeated samples, that are expected to contain the true mean. Since we selected a 95% confidence level, we would expect about 5% of the intervals not to contain the actual mean.
set.seed(9384)
population <- ames$Gr.Liv.Area
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
for (i in 1:50) {
  samp <- sample(population, n)  # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)     # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)         # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - 1.96 * samp_sd/sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd/sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))
Answer:
Two intervals do not contain the true mean; therefore, \(\frac{50 - 2}{50} = 0.96\) of the intervals contain the true mean. This value is not exactly equal to 95% because the samples vary from one to the next, so the observed coverage fluctuates around the nominal level. In addition, with only 50 intervals the proportion can only be a multiple of 0.02, so exactly 0.95 is not attainable. The 95% level itself is also approximate, since the intervals rely on a normal model for the sample mean.
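The number of intervals that capture the true mean can also be counted in code instead of being read off the plot. The sketch below repeats the simulation on a synthetic right-skewed population (`fake_population`, an assumption standing in for `ames$Gr.Liv.Area`), so its counts will not match the ones reported here.

```r
# Synthetic stand-in population (assumption); the lab uses ames$Gr.Liv.Area.
set.seed(42)
fake_population <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
mu <- mean(fake_population)
n <- 60
n_intervals <- 50

# One row per sample, with columns "mean" and "sd".
stats <- t(replicate(n_intervals, {
  s <- sample(fake_population, n)
  c(mean = mean(s), sd = sd(s))
}))

lower <- stats[, "mean"] - 1.96 * stats[, "sd"] / sqrt(n)
upper <- stats[, "mean"] + 1.96 * stats[, "sd"] / sqrt(n)

# Logical vector: does each interval contain the true mean?
covered <- lower <= mu & mu <= upper
sum(covered)    # number of intervals containing the true mean
mean(covered)   # observed coverage proportion
```

With the lab's own vectors, the equivalent check would be `mean(lower_vector <= mean(population) & mean(population) <= upper_vector)`.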
Answer:
I will pick a 90% confidence interval. The appropriate critical value is:
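# Cumulative probability below the upper critical value for a two-sided 90% interval.
# (This is a quantile level, not a p-value, so it is named accordingly.)
cum_prob <- 0.9 + (1 - 0.9)/2
cum_prob
## [1] 0.95
z_value <- qnorm(p = cum_prob, mean = 0, sd = 1)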
z_value
## [1] 1.644854
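The same step generalizes to any confidence level. The sketch below uses a hypothetical helper, `critical_z` (a name introduced here, not part of the lab), which leaves half of the excluded probability in each tail.

```r
# Hypothetical helper: two-sided critical value z* for a given confidence level.
critical_z <- function(level) {
  stopifnot(level > 0, level < 1)
  qnorm(1 - (1 - level) / 2)   # (1 - level) / 2 is left in each tail
}

levels <- c(0.90, 0.95, 0.99)
round(sapply(levels, critical_z), 3)
## [1] 1.645 1.960 2.576
```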
Answer:
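set.seed(852)  # no effect here: the 50 samples drawn earlier are reused, nothing random follows
# 1.64 is the 90% critical value found above, truncated to two decimals
lower_vector <- samp_mean - 1.64 * samp_sd/sqrt(n)
upper_vector <- samp_mean + 1.64 * samp_sd/sqrt(n)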
plot_ci(lower_vector, upper_vector, mean(population))
Five intervals did not contain the true mean; therefore, \(\frac{50 - 5}{50} = 0.90\) of the intervals contain the true mean. This proportion happens to match the 90% confidence level exactly.