The Data

download.file("http://www.openintro.org/stat/data/ames.RData", destfile = "ames.RData")
load("ames.RData")
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)

Exercise 1: Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

hist(samp)

mean(samp)
## [1] 1557.4
# The distribution is skewed to the right. Taking “typical” to mean the mean (average), the typical size in the sample is about 1,557 square feet (the sample mean is 1557.4). The median is an equally reasonable measure of the typical size, and arguably a better one for a skewed distribution because it is less affected by a few very large houses.
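To make the choice between mean and median concrete, here is a minimal sketch on a simulated right-skewed sample. The lognormal parameters and the name `samp_demo` are assumptions chosen only so the snippet runs without ames.RData; in the lab itself the same calls would be applied to `samp`.

```r
# Hypothetical stand-in for the lab's sample: 60 draws from a right-skewed
# lognormal distribution (an assumption, not the Ames data).
set.seed(1)
samp_demo <- rlnorm(60, meanlog = 7.25, sdlog = 0.33)

# Two common readings of "typical": the mean and the median.
center <- c(mean = mean(samp_demo), median = median(samp_demo))
center

# The five-number summary plus the mean also shows the spread and the long right tail.
summary(samp_demo)
```

In right-skewed data the mean usually sits above the median, so it helps to report which one was used.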

Exercise 2: Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

No, I would not expect it to be identical. Each student draws a different random sample of 60 houses, so the individual observations, and therefore the histogram, mean, and standard deviation, will differ from sample to sample. I would expect the distributions to be broadly similar, though: every sample comes from the same population, so most samples of this size should show roughly the same center, spread, and right skew. The samples would only become nearly identical if n approached the size of the underlying population itself.

Confidence Intervals

sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)
## [1] 1429.025 1685.775

Exercise 3: For the confidence interval to be valid, the sample mean must be normally distributed and have standard error s/√n. What conditions must be met for this to be true?

Beyond the usual conditions of random sampling and independent observations, the conditions depend on the size of the sample and the shape of the population distribution. If n is small (roughly n < 15), the population should be approximately normal. If 15 ≤ n < 30, the data should show no strong skew or significant outliers. If n ≥ 30, the population can be non-normal as long as the skew is not extreme, because the Central Limit Theorem (CLT) makes the sample mean approximately normal. The t-distribution is typically used when n < 30; when n ≥ 30 you can invoke the CLT and use a normal distribution with standard error s/√n.
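The role of the sample size can be checked with a short simulation. The sketch below uses a hypothetical right-skewed population (`pop_demo`, an assumption standing in for the Ames data) and compares its skewness with the skewness of 2,000 sample means at n = 60. The helper `skew` is a simple moment-based estimate written for this illustration.

```r
set.seed(7)
# Hypothetical right-skewed population (assumption; not ames$Gr.Liv.Area).
pop_demo <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)

# Simple moment-based skewness: 0 for a symmetric distribution, > 0 for right skew.
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3

# Sampling distribution of the mean for samples of size 60.
means_60 <- replicate(2000, mean(sample(pop_demo, 60)))

c(population = skew(pop_demo), sample_means = skew(means_60))
```

The sample means should come out far less skewed than the population itself, which is what justifies the normal model for the mean at this sample size.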

Exercise 4: What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.

95% confidence means that if we took many samples of size n and built a confidence interval from each one in the same way, about 95% of those intervals would contain the true population mean. When the sampling distribution of the mean is approximately normal, the sample mean falls within 1.96 standard errors of the population mean in 95% of samples, which is why the interval is the sample mean plus or minus 1.96 standard errors. For a t-distribution, the multiplier depends on the degrees of freedom (n - 1) and is larger than 1.96 when n is small. To find the interval for a particular sample, divide the sample standard deviation by the square root of n to get the standard error, multiply it by the z-value or t-value that leaves 2.5% in each tail (the 0.025 and 0.975 quantiles), and add and subtract the result from the sample mean. The 95% describes the long-run success rate of this procedure, regardless of the sample size n. It is not the probability that one particular, already computed interval contains the population mean, since that interval either contains it or does not.
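The long-run reading of "95% confidence" can be illustrated by repeating the interval construction many times. This is a sketch under stated assumptions: `pop_demo` is a hypothetical right-skewed population, and 2,000 repetitions is an arbitrary choice.

```r
set.seed(42)
# Hypothetical population (assumption; stands in for ames$Gr.Liv.Area).
pop_demo <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
true_mean <- mean(pop_demo)
n_demo <- 60
n_intervals <- 2000

# For each repetition: draw a sample, build the 95% interval,
# and record whether it contains the true mean.
covered <- replicate(n_intervals, {
  s <- sample(pop_demo, n_demo)
  se <- sd(s) / sqrt(n_demo)
  lo <- mean(s) - 1.96 * se
  hi <- mean(s) + 1.96 * se
  lo <= true_mean && true_mean <= hi
})

# Observed share of intervals that captured the true mean.
coverage <- mean(covered)
coverage
```

The observed share should land near 0.95, though not exactly on it, because the normal model is an approximation and the simulation is finite.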

Confidence Levels

mean(population)
## [1] 1499.69

Exercise 5: Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

Yes, it does. The population mean of 1499.69 is fairly close to the sample mean of 1557.4 and lies within the confidence interval, which has a lower bound of 1429.03 and an upper bound of 1685.78.
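The check can also be done in code rather than by eye. The sketch below plugs in the values printed in the output earlier in this lab; the variable names are introduced here for illustration only.

```r
# Values copied from the printed output of c(lower, upper) and mean(population).
lower_shown <- 1429.025
upper_shown <- 1685.775
pop_mean_shown <- 1499.69

# TRUE when the interval contains the population mean.
captured <- lower_shown <= pop_mean_shown && pop_mean_shown <= upper_shown
captured
```

With the lab's own objects, the equivalent test is `lower <= mean(population) && mean(population) <= upper`.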

Exercise 6: Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.

I would expect about 95% of my classmates' 95% confidence intervals to contain the true population mean. Each student's sample of size 60 yields one sample mean, and together these means can be regarded as draws from the sampling distribution of the mean. Because that distribution is approximately normal, about 95% of sample means fall within 1.96 standard errors of the true population mean, and an interval built as the sample mean plus or minus 1.96 standard errors captures the population mean exactly when that happens. Thus, about 95% of the intervals should contain the true population mean, although in a single class the observed proportion will vary somewhat by chance.

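samp_mean <- rep(NA, 50)  # storage for the 50 sample means
samp_sd <- rep(NA, 50)    # storage for the 50 sample standard deviations
n <- 60
for (i in 1:50) {
  samp <- sample(population, n)  # draw a new random sample of size n
  samp_mean[i] <- mean(samp)     # save its mean
  samp_sd[i] <- sd(samp)         # save its standard deviation
}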

lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
c(lower_vector[1], upper_vector[1])
## [1] 1456.178 1677.322

On Your Own

1. Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.

plot_ci(lower_vector, upper_vector, mean(population))

94% of the confidence intervals (47 of 50) included the true population mean. This is not exactly equal to the confidence level but is very close. With 50 intervals the proportion can only change in steps of 2%, so exactly 95% is not attainable, and the number of intervals that capture the mean varies by chance from one set of samples to the next. The 95% confidence level describes the long-run proportion over a very large number of samples, so the observed proportion would be expected to settle closer to 95% as the number of intervals grows.
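How much the count of capturing intervals can vary is easy to quantify with a binomial model. The sketch below assumes an idealized setting in which each of the 50 intervals independently captures the mean with probability 0.95; that assumption is only approximately true for the lab's intervals.

```r
n_intervals <- 50

# Proportions that 50 intervals can produce: 0, 0.02, 0.04, ..., 1.
possible_props <- (0:n_intervals) / n_intervals
# Is 0.95 among them? (0.95 * 50 = 47.5 is not a whole number of intervals.)
hits_exactly <- any(abs(possible_props - 0.95) < 1e-9)
hits_exactly

# Probability of each count of capturing intervals under the idealized model.
probs <- dbinom(0:n_intervals, size = n_intervals, prob = 0.95)
names(probs) <- 0:n_intervals
round(probs[as.character(45:50)], 3)
```

Under this model, counts of 46 to 49 out of 50 are all quite plausible, so an observed 94% is consistent with a 95% confidence level.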

2. Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?

I choose a 90% confidence level. The appropriate critical value is approximately ±1.64 (1.645 to three decimal places).
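Critical values for the normal model can be computed with `qnorm` instead of being read from a table. A minimal sketch, with the three confidence levels chosen here only as examples:

```r
conf_levels <- c(0.90, 0.95, 0.99)

# A two-sided interval leaves (1 - level) / 2 in each tail,
# so the critical value is the 1 - (1 - level) / 2 quantile.
crit <- qnorm(1 - (1 - conf_levels) / 2)
names(crit) <- paste0(conf_levels * 100, "%")
round(crit, 3)
# 90%: 1.645, 95%: 1.960, 99%: 2.576
```

For small samples the analogous t-based value would come from `qt` with n - 1 degrees of freedom.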

3. Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?

lower_vector <- samp_mean - 1.64 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.64 * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))

In this case, 92% of the confidence intervals (46 of 50) contained the true population mean. This percentage is slightly larger than the 90% confidence level selected for the intervals. As with the previous plot at a confidence level of 95%, the difference reflects chance variation with only 50 intervals; with many more samples, the proportion of intervals that include the true population mean would be expected to move closer to 90%.
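The proportions read off the plots can also be computed directly. The sketch below reuses one set of 50 samples for both confidence levels, as the exercise does, but draws them from a hypothetical population (`pop_demo`) so it runs without ames.RData. The helper `capture_rate` and all `_demo` names are assumptions made for this illustration.

```r
set.seed(123)
# Hypothetical right-skewed population (assumption; not the Ames data).
pop_demo <- rlnorm(3000, meanlog = 7.25, sdlog = 0.33)
n_demo <- 60

# Draw 50 samples once and keep each mean and standard deviation.
samp_stats <- replicate(50, {
  s <- sample(pop_demo, n_demo)
  c(mean = mean(s), sd = sd(s))
})
means_demo <- samp_stats["mean", ]
se_demo <- samp_stats["sd", ] / sqrt(n_demo)

# Share of intervals (mean +/- z * se) that contain the target value.
capture_rate <- function(z, target) {
  mean(means_demo - z * se_demo <= target & target <= means_demo + z * se_demo)
}

rate_95 <- capture_rate(qnorm(0.975), mean(pop_demo))
rate_90 <- capture_rate(qnorm(0.95), mean(pop_demo))
c(rate_95 = rate_95, rate_90 = rate_90)
```

Because each 90% interval sits inside the 95% interval built from the same sample, the 90% capture rate can never exceed the 95% rate for a given set of samples.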