population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
hist(samp, breaks = 10)
The distribution is right-skewed, which makes intuitive sense: living area can't go below 0 but can get quite large. A typical observation falls between roughly 750 and 1750 square feet. The bin counts drop off sharply on either side of that range, so this is where the bulk of the distribution sits.
No one else is going to have an identical distribution. The number of possible samples is astronomically large (N choose 60, where N is the number of houses in the population). Using "similar" loosely, I would expect most samples to look similar, since 60 is double the usual rule-of-thumb minimum of 30. That said, because people don't choose seed numbers at random (favorite numbers and such), an identical sample is possible.
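To get a feel for how many distinct samples of size 60 exist, here is a minimal sketch. The population size of 2930 is an assumption about the Ames data; in the lab session, `length(population)` would give the actual value.

```r
# Assumption: 2930 houses in the population; replace with length(population).
N <- 2930
n_draw <- 60

# Number of distinct samples of size 60 drawn without replacement
n_samples <- choose(N, n_draw)

# Order of magnitude, computed on the log scale to avoid overflow concerns
log10_samples <- lchoose(N, n_draw) / log(10)

n_samples
log10_samples
```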
The sample must be drawn at random and be sufficiently large. How large depends on the skew of the population's distribution: the more skewed the population, the larger the sample needed.
95% confidence means that if we took many samples and built an interval from each one in the same way, about 95% of those intervals would contain the true population mean. It does not mean that 95% of the observations, or 95% of the sample means, fall between these two particular bounds.
sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1491.152 1775.748
mean(population)
## [1] 1499.69
The population mean (1499.69) falls within our confidence interval (1491.152 to 1775.748).
You would expect about 95% of those intervals to capture the true population mean. With 20 samples, it could easily be 2 that miss it, but with 1 million the proportion will be very close to 95%, assuming the conditions for the sampling distribution hold.
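The point about small versus large numbers of samples can be checked with a short simulation. This is only a sketch: `pop_sim` is a hypothetical right-skewed stand-in generated with `rlnorm()`, not the Ames data, and the seed is fixed only so the run is reproducible.

```r
set.seed(1)  # fixed only so the sketch is reproducible

# Hypothetical stand-in population: positive, right-skewed values
pop_sim <- rlnorm(2930, meanlog = 7.25, sdlog = 0.33)
mu_sim <- mean(pop_sim)
n_sim <- 60

# TRUE if a 95% interval built from one fresh sample contains the true mean
covers <- function() {
  s <- sample(pop_sim, n_sim)
  se_s <- sd(s) / sqrt(n_sim)
  abs(mean(s) - mu_sim) <= 1.96 * se_s
}

# Proportion of intervals capturing the true mean: few vs. many samples
coverage_20 <- mean(replicate(20, covers()))
coverage_10000 <- mean(replicate(10000, covers()))
c(coverage_20, coverage_10000)
```

With only 20 intervals the observed proportion can only move in steps of 5%, so it jumps around from run to run. With 10000 it settles near the nominal level, though skew in the population can pull it slightly below 95%.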
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
for (i in 1:50) {
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))
47 of the 50 samples (94%) have the true mean within their 95% confidence interval. If we assume the sampling distribution is exactly normal, so that each interval has exactly a 95% chance of containing the true mean, we can use the binomial distribution to determine the probability of getting exactly 3 intervals that miss the true mean. It is 0.2198748, which isn't that unlikely.
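The binomial probability quoted above can be reproduced with `dbinom()`. The sketch assumes 50 independent intervals, each missing the true mean with probability 0.05. The "3 or more" probability is an extra figure added here for context.

```r
n_intervals <- 50
p_miss <- 0.05  # assumed exact miss rate for a 95% interval

# Probability that exactly 3 of the 50 intervals miss the true mean
p_exactly_3 <- dbinom(3, size = n_intervals, prob = p_miss)

# Probability that 3 or more intervals miss the true mean
p_3_or_more <- 1 - pbinom(2, size = n_intervals, prob = p_miss)

p_exactly_3  # 0.2198748
p_3_or_more
```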
For a 90% confidence level, the critical value is 1.645.
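Rather than looking the critical value up, it can be recovered from the standard normal quantile function. A minimal sketch, where `conf_levels` and `z_star` are names introduced only for illustration:

```r
conf_levels <- c(0.90, 0.95)

# Two-sided critical value: leave (1 - level) / 2 in each tail
z_star <- qnorm(1 - (1 - conf_levels) / 2)

round(z_star, 3)  # 1.645 1.960
```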
lower_vector <- samp_mean - 1.645 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.645 * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))
Here we see 4 intervals that miss the true mean, when we would expect about 5 (10% of 50). The probability of exactly 4 misses is 0.1809045, which again is not that unlikely.
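For the 90% intervals, the same binomial reasoning applies with a miss probability of 0.10. This sketch assumes 50 independent intervals with exactly 90% coverage, and also shows which miss count is the single most likely one.

```r
n_intervals <- 50
p_miss <- 0.10  # assumed exact miss rate for a 90% interval

expected_misses <- n_intervals * p_miss  # 5

# Probability of each possible number of misses from 0 to 10
miss_counts <- 0:10
miss_probs <- dbinom(miss_counts, size = n_intervals, prob = p_miss)

p_exactly_4 <- miss_probs[miss_counts == 4]
most_likely <- miss_counts[which.max(miss_probs)]

p_exactly_4  # 0.1809045
most_likely  # 5
```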