download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
hist(samp)
mean(samp)
## [1] 1557.4
# The distribution is strongly skewed to the right. Taking "typical" to mean the average, the typical living area in this sample is 1,557.4 (the sample mean), although the median is an equally defensible measure of typical size and is less affected by the skew.
No, I would not. Each student draws a different random sample of 60 houses, and the population is large enough relative to n = 60 that the samples can be treated as independent of one another, so no two sample distributions will be identical. One sample may look approximately symmetric while the next is clearly skewed, and this would be true even if the underlying population were normal. I would still expect the samples to be broadly similar, since they all come from the same population. They would be essentially identical only if n approached the size of the population itself.
sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1429.025 1685.775
Other than the usual conditions of random sampling and independent observations, the requirements depend on the size of the sample and the shape of the population distribution. If n is small (roughly n < 15), the population must be close to normally distributed. If n is between 15 and 30, the data should be approximately symmetric with no significant outliers. If n is 30 or more, the population can be non-normal, provided it is not extremely skewed, because the CLT makes the sampling distribution of the mean approximately normal. The t-distribution is typically used when n < 30 and the population standard deviation is unknown; when n is 30 or more, the normal distribution is an adequate approximation, as it is here with n = 60.
95% confidence describes the procedure rather than any single interval: if we took a very large number of random samples of size n and built an interval from each one in the same way, about 95% of those intervals would contain the true population mean. To build the interval from one sample, compute the standard error by dividing the sample standard deviation by the square root of n, multiply it by the critical value that bounds the middle 95% of the reference distribution (from the 2.5th to the 97.5th percentile), and add and subtract the result from the sample mean. For a normal distribution the critical value is 1.96, so the interval extends 1.96 standard errors on either side of the sample mean. For a t-distribution the critical value depends on the degrees of freedom, n - 1, and approaches 1.96 as n grows. The sample size also affects the standard error, so larger samples give narrower intervals, but the long-run coverage stays at 95% regardless of n. In terms of probability, before a sample is drawn there is a 95% chance that the interval it produces will contain the population mean; once a particular interval has been computed, it either contains the population mean or it does not.
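As a rough illustration of why the normal critical value is acceptable for larger n, the sketch below compares t and z critical values at the 95% level. The particular sample sizes are illustrative choices, not values taken from the lab.

```r
# Illustrative sample sizes (assumed for this sketch, not from the lab)
n <- c(10, 15, 30, 60, 1000)

# Two-sided 95% critical values: t with n - 1 degrees of freedom vs. normal
t_star <- qt(0.975, df = n - 1)
z_star <- qnorm(0.975)

# t_star is always larger than z_star and shrinks toward it as n grows
data.frame(n = n, t_star = round(t_star, 3), z_star = round(z_star, 3))
```

For n = 60 the two critical values differ only in the second decimal place, so the choice between them changes the interval width very little.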
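The interval calculation described above can be sketched as a small helper. The function name `ci_mean` and the simulated data are hypothetical, introduced only for illustration; the helper uses the t critical value, and its result is compared with base R's `t.test()`.

```r
# Hypothetical helper: confidence interval for a mean using the t-distribution
ci_mean <- function(x, level = 0.95) {
  n <- length(x)
  se <- sd(x) / sqrt(n)                           # standard error of the mean
  t_star <- qt(1 - (1 - level) / 2, df = n - 1)   # two-sided critical value
  mean(x) + c(-1, 1) * t_star * se                # lower and upper bounds
}

# Simulated data standing in for a sample of 60 (an assumption, not lab data)
set.seed(1)
x <- rnorm(60, mean = 1500, sd = 500)

ci_mean(x)
as.numeric(t.test(x)$conf.int)   # should agree with ci_mean(x)
```

Replacing `qt(...)` with `qnorm(1 - (1 - level) / 2)` would give the z-based interval used elsewhere in this report.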
mean(population)
## [1] 1499.69
Yes, it does. The population mean of 1499.69 is fairly close to the sample mean of 1557.4 and lies well within the confidence interval, which has a lower bound of 1429.03 and an upper bound of 1685.78.
I would expect about 95% of my classmates' 95% confidence intervals to contain the true population mean. Each classmate's sample of size 60 yields one sample mean, and together these sample means are draws from the sampling distribution of the mean. About 95% of sample means fall within 1.96 standard errors of the population mean, and an interval extending 1.96 standard errors on either side of such a sample mean necessarily reaches the population mean. Thus roughly 95% of the intervals will capture the true population mean and about 5% will miss it.
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
for (i in 1:50) {
  samp <- sample(population, n)
  samp_mean[i] <- mean(samp)
  samp_sd[i] <- sd(samp)
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
c(lower_vector[1], upper_vector[1])
## [1] 1456.178 1677.322
plot_ci(lower_vector, upper_vector, mean(population))
94% of the confidence intervals (47 of 50) included the true population mean. This is not exactly equal to the confidence level but is very close. Exact agreement is not expected: the confidence level is a long-run proportion, and with only 50 intervals the observed proportion moves in steps of 2% and varies from one run to the next. 50 is a relatively small number of samples, even with a sample size as large as 60. With many more samples, say several thousand, the proportion would be expected to settle close to 95%.
I chose a 90% confidence level. The appropriate critical value is +/- 1.645, rounded to 1.64 in the code below.
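To see how the observed proportion depends on the number of samples, a simulation along the following lines could be run. It is a minimal sketch: the synthetic right-skewed `population` is a hypothetical stand-in for `ames$Gr.Liv.Area` so that the block runs without the download, and `coverage` is an illustrative helper name.

```r
# Hypothetical stand-in population (right-skewed); NOT the Ames data
set.seed(42)
population <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
mu <- mean(population)
n <- 60

# Proportion of z-based intervals, out of `reps`, that contain mu
coverage <- function(reps, z = 1.96) {
  hits <- replicate(reps, {
    s <- sample(population, n)
    half_width <- z * sd(s) / sqrt(n)
    abs(mean(s) - mu) <= half_width   # TRUE when the interval covers mu
  })
  mean(hits)
}

coverage(50)      # varies noticeably from run to run
coverage(10000)   # much more stable, near the nominal level
```

With the lab's own objects, the proportion for the 50 intervals already computed can be read off directly as `mean(lower_vector <= mean(population) & upper_vector >= mean(population))`.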
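The critical value does not have to be looked up in a table; a short sketch using base R's `qnorm()` is shown here. The variable names are illustrative.

```r
# Two-sided critical value for a given confidence level
level <- 0.90
z_star <- qnorm(1 - (1 - level) / 2)
round(z_star, 3)   # 1.645

# The same calculation for several common levels
levels <- c(0.90, 0.95, 0.99)
round(qnorm(1 - (1 - levels) / 2), 3)   # 1.645 1.960 2.576
```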
lower_vector <- samp_mean - 1.64 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.64 * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))
In this case, 92% of the confidence intervals (46 of 50) contained the true population mean. This is slightly higher than the 90% confidence level selected for the intervals, a difference of a single interval from the expected 45. As with the previous plot at the 95% confidence level, the gap reflects sampling variability in a small number of intervals; with many more samples, the proportion of intervals that include the true population mean would be expected to move close to 90%.
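How much the count of covering intervals can vary across runs may be gauged with a binomial model. This sketch idealizes the situation by assuming each of the 50 intervals covers the mean independently with probability exactly 0.90; the variable names are illustrative.

```r
reps <- 50
level <- 0.90

# Expected number of covering intervals and its standard deviation
expected <- reps * level                        # 45
sd_count <- sqrt(reps * level * (1 - level))    # about 2.12

# Probability of seeing 46 or more covering intervals under this model
p_46_or_more <- pbinom(45, size = reps, prob = level, lower.tail = FALSE)

c(expected = expected, sd = sd_count, p_46_or_more = p_46_or_more)
```

Under this model, 46 of 50 is less than one standard deviation above the expected count, so an observed 92% is consistent with a nominal 90% level.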