load("more/ames.RData")
population <- ames$Gr.Liv.Area
set.seed(100)
samp <- sample(population, 60)
hist(samp)
summary(samp)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 729 1123 1343 1471 1632 2758
The distribution of my sample is skewed to the right, with most values near 1,200 square feet. Because the sample is skewed, the median describes a typical observation better than the mean does. Here the median is 1,343 square feet, so I interpret “typical” to mean the median.
I would not expect another student’s distribution to be identical to mine unless they used the same seed. The chance that another sample contains exactly the same combination of values as mine is extremely small. That said, I would expect their sample to look similar, since both samples draw the same number of observations from the same population.
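The effect of the seed can be checked directly. The sketch below does not use the Ames data: `population_sim` is a simulated, right-skewed stand-in (an assumption made so the example runs on its own), and the seeds and sizes are arbitrary.

```r
# Hypothetical stand-in for the population of living areas (not the Ames data).
set.seed(1)
population_sim <- round(rlnorm(3000, meanlog = 7.25, sdlog = 0.33))

# Same seed twice: the two draws are the same sample.
set.seed(100)
samp_a <- sample(population_sim, 60)
set.seed(100)
samp_b <- sample(population_sim, 60)

# A different seed: almost certainly a different set of 60 values.
set.seed(200)
samp_c <- sample(population_sim, 60)

identical(samp_a, samp_b)        # TRUE: the seed determines the draw
identical(samp_a, samp_c)        # compare the two differently seeded samples
c(median(samp_a), median(samp_c))  # summaries usually differ, but not wildly
```

Two samples drawn with different seeds share the same source population and sample size, which is why their summaries tend to be close even though the individual values differ.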
Each observation must be independent.
Sample size must be at least 30.
The population distribution must not be strongly skewed.
95% confidence means that if we repeatedly drew samples from the population and built a confidence interval from each one, about 95% of those intervals would contain the true value that we are trying to estimate.
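This long-run reading of "95% confidence" can be illustrated with a simulation. The sketch assumes a normal population with a known mean (`true_mean` and `true_sd` are made-up values, not the Ames figures), so the true value is known exactly and each interval can be checked against it.

```r
# Assumed setup: normal population with a known mean and standard deviation.
set.seed(42)
true_mean <- 1500
true_sd <- 500
n <- 60        # sample size per interval
reps <- 10000  # number of simulated intervals

covered <- logical(reps)
for (i in seq_len(reps)) {
  x <- rnorm(n, mean = true_mean, sd = true_sd)
  se <- sd(x) / sqrt(n)
  lower <- mean(x) - 1.96 * se
  upper <- mean(x) + 1.96 * se
  covered[i] <- lower <= true_mean && true_mean <= upper
}

# Proportion of intervals that contain the true mean.
coverage <- mean(covered)
coverage  # should be close to 0.95
```

Any single interval either contains the true mean or does not. The 95% describes how often the procedure succeeds over many repetitions.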
sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1352.864 1588.303
mean(population)
## [1] 1499.69
My confidence interval (1352.864 to 1588.303) captures the true average size of houses in Ames, 1499.69 square feet.
About 95% of them should have captured the true population mean, because each interval was built at a 95% confidence level.
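The capture check can also be done in code rather than by eye. The sketch below wraps the interval calculation in a helper, `ci_mean()`, which is a hypothetical function written for this illustration and not part of the lab. It is applied to simulated stand-in data (`population_sim`), not the Ames values.

```r
# Hypothetical helper: normal-approximation confidence interval for a mean.
ci_mean <- function(x, conf = 0.95) {
  z <- qnorm(1 - (1 - conf) / 2)   # critical value, about 1.96 for conf = 0.95
  se <- sd(x) / sqrt(length(x))    # standard error of the sample mean
  c(lower = mean(x) - z * se, upper = mean(x) + z * se)
}

# Simulated stand-in population and one sample of 60 (assumed values).
set.seed(1)
population_sim <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
samp_sim <- sample(population_sim, 60)

ci <- ci_mean(samp_sim)
mu <- mean(population_sim)

# TRUE if this particular interval contains the population mean.
captured <- unname(ci["lower"] <= mu & mu <= ci["upper"])
ci
captured
```

Using `length(x)` instead of a hard-coded 60 keeps the standard error correct if the sample size changes.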
set.seed(123)
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
for (i in 1:50) {
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
c(lower_vector[1], upper_vector[1])
## [1] 1390.046 1677.388
```r
plot_ci(lower_vector, upper_vector, mean(population))
```
<img src="super_000-confidence_intervals_files/figure-html/plot-ci-1.png" width="672" />
98% of my confidence intervals included the true population mean. This proportion is not exactly equal to the confidence level. Each interval is built from a random sample, so whether it captures the true mean is itself random, and the proportion that do so varies from one set of 50 intervals to the next. It will therefore rarely match the confidence level exactly.
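How unusual is 49 out of 50? As a rough sketch, assume each of the 50 intervals independently captures the true mean with probability 0.95. That is an idealization, since the normal approximation makes the real coverage only approximately 95%. Under it, the number of capturing intervals follows a binomial distribution.

```r
n_intervals <- 50
conf <- 0.95

# Expected number of intervals that capture the mean, and its spread.
expected_hits <- n_intervals * conf                      # 47.5 intervals
sd_hits <- sqrt(n_intervals * conf * (1 - conf))

# Probability of exactly 49 captures (98%), and of 49 or more.
p_exactly_49 <- dbinom(49, size = n_intervals, prob = conf)
p_49_or_more <- pbinom(48, size = n_intervals, prob = conf, lower.tail = FALSE)

c(expected = expected_hits, sd = sd_hits,
  exactly_49 = p_exactly_49, at_least_49 = p_49_or_more)
```

With 50 intervals the observed proportion can only be a multiple of 2%, so it could not equal 95% exactly. The expected count of 47.5 falls between two achievable outcomes.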
The critical value of a 99% confidence interval is 2.58.
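The critical value can be computed rather than looked up. This short sketch uses base R's `qnorm()`. The variable names are chosen here for illustration.

```r
conf_level <- 0.99
alpha <- 1 - conf_level

# Two-sided interval: put alpha / 2 in each tail of the standard normal.
z_star <- qnorm(1 - alpha / 2)
round(z_star, 2)        # 2.58

# Same calculation for the 95% level used earlier.
round(qnorm(0.975), 2)  # 1.96
```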
plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?
lower_vector <- samp_mean - 2.58 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 2.58 * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))
At 99% confidence, all 50 intervals contain the true population mean, whereas at 95% confidence, 98% (49 of 50) did. Raising the confidence level widens each interval, so a larger proportion of the intervals is expected to contain the true value.
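The proportion of capturing intervals can be calculated directly instead of counted from the plot. The sketch below repeats the 50-sample procedure on a simulated stand-in population (`population_sim`, an assumption, not the Ames data) and evaluates the same 50 samples at both critical values. `coverage()` is a hypothetical helper.

```r
# Simulated stand-in population (assumed values).
set.seed(123)
population_sim <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
mu <- mean(population_sim)
n <- 60
n_intervals <- 50

# Draw 50 samples and store each sample's mean and standard deviation.
samp_mean <- rep(NA, n_intervals)
samp_sd <- rep(NA, n_intervals)
for (i in 1:n_intervals) {
  s <- sample(population_sim, n)
  samp_mean[i] <- mean(s)
  samp_sd[i] <- sd(s)
}

# Hypothetical helper: share of intervals containing mu for critical value z.
coverage <- function(z) {
  lower <- samp_mean - z * samp_sd / sqrt(n)
  upper <- samp_mean + z * samp_sd / sqrt(n)
  mean(lower <= mu & mu <= upper)
}

cover_95 <- coverage(1.96)
cover_99 <- coverage(2.58)
c(cover_95, cover_99)
```

Because both sets of intervals share the same centers and the 99% intervals are wider, every sample captured at 95% is also captured at 99%, so `cover_99` can never be smaller than `cover_95`.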