We begin with descriptive statistics for our population and histograms of both the population and our first sample.
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 334 1126 1442 1500 1743 5642
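The summary above and the histograms it refers to can be produced with base R. This is a minimal sketch, assuming population is the vector of house areas (in ft²) and samp is the 60-house sample used throughout this write-up:

summary(population) # descriptive statistics for the full population
hist(population, main = "Population of house areas", xlab = "area (ft^2)")
hist(samp, main = "Sample of 60 house areas", xlab = "area (ft^2)")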


The mean of the full population is 1499.6904437 ft², while the mean of our sample of 60 houses is 1550.7666667 ft².
head(population, 30) # first 30 house areas in the population
## [1] 1656 896 1329 2110 1629 1604 1338 1280 1616 1804 1655 1187 1465 1341
## [15] 1502 3279 1752 1856 864 2073 1844 1173 1674 1004 1078 1056 882 864
## [29] 1337 987
samp
## [1] 1344 998 2229 1665 1492 2494 1660 2158 1136 1512 854 1721 1105 1668
## [15] 1436 1540 1633 1044 1121 893 1725 1364 1326 1538 1212 1430 1656 1822
## [29] 2110 1370 1983 1646 1296 2318 1612 1632 1002 2450 630 925 1935 1203
## [43] 1127 912 1714 1607 2044 1808 2698 1614 1920 1301 1361 2504 2520 960
## [57] 1363 875 1382 1448
sample_mean <- mean(samp) # mean of the 60-house sample
Our sample is much smaller than the population and its distribution appears more concentrated around the center. The population is more strongly right-skewed, with a large number of values toward the left and some large houses more than twice the median size. A typical house size is roughly 1500-1700 ft²; by typical we mean that a fair number of values fall in this range and the histogram's mode encompasses it.
I would not expect another student's sample distribution to look the same as mine; my own kept changing until I set a random seed. A sample of 60 is not large enough to give a stable picture of the distribution.
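As a sketch of how the sample can be drawn reproducibly (the seed value here is arbitrary and will not recreate the exact sample printed above):

set.seed(1234) # arbitrary seed chosen for illustration, so the draw is repeatable
samp <- sample(population, 60) # simple random sample of 60 house areas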
se <- sd(samp) / sqrt(60) # standard error of the sample mean
lower <- sample_mean - 1.96 * se # lower bound of the 95% confidence interval
upper <- sample_mean + 1.96 * se # upper bound of the 95% confidence interval
c(lower, upper)
## [1] 1430.844 1670.690
For the confidence interval to be valid, the sampling distribution of the sample mean must be approximately normal; when that holds, sample means are centered on the population mean. This requires that the data not be strongly skewed, that the sample be drawn at random, that it make up less than 10% of the population, and that it not be too small.
A confidence interval tells us the range within which we expect the true population mean to lie; we are 95% confident that the mean falls between these values.
Our confidence interval does capture the mean of the population.
Across repeated samples, we would expect intervals constructed this way to capture the true mean 95% of the time, which is another way of stating what a confidence interval means.
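We can check this in R rather than reading it off the numbers; a one-line verification using the bounds computed above:

mean(population) >= lower & mean(population) <= upper # TRUE if the interval captures the population mean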
samp_mean <- rep(NA, 50) # storage for the 50 sample means
samp_sd <- rep(NA, 50) # storage for the 50 sample standard deviations
n <- 60
for (i in 1:50) {
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp) # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp) # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
c(lower_vector[1], upper_vector[1])
## [1] 1293.121 1602.679
plot_ci(lower_vector, upper_vector, mean(population))

In our previous analysis, we used 95% confidence intervals, which correspond to a critical value of 1.96 for a two-sided interval. In the next analysis, we will use two-sided 90% confidence intervals, which correspond to a critical value of 1.645.
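As a quick check (not part of the original lab code), qnorm() returns these critical values from the standard normal distribution:

qnorm(0.975) # two-sided 95% critical value, about 1.96
qnorm(0.95) # two-sided 90% critical value, about 1.645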
lower_vector <- samp_mean - 1.645 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.645 * samp_sd / sqrt(n)
c(lower_vector[1], upper_vector[1])
## [1] 1317.996 1577.804
plot_ci(lower_vector, upper_vector, mean(population))

Based on the 90% confidence intervals, 9 of our 50 samples produced intervals that did not capture the population mean, compared with the 5 we would expect on average. This is more misses than expected, but the coverage of any particular batch of intervals will vary; we cannot guarantee that exactly 90% of our intervals will contain the true mean.
We tried sampling again with a new seed and found 7 intervals missing the mean (at 95%) and 8 (at 90%). Re-running the analysis like this can be dangerous, because it opens the door to hunting for a preferred result, but in this case it simply illustrates the sampling variability in how many intervals miss the true mean.
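Rather than counting misses off the plot, they can be computed directly. A short sketch, assuming lower_vector and upper_vector hold the interval bounds from above:

misses <- sum(lower_vector > mean(population) | upper_vector < mean(population)) # intervals that do not contain the true mean
misses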