Lab4b_StephenJones

load("more/ames.RData")

population <- ames$Gr.Liv.Area
set.seed(999999)
samp <- sample(population, 60)

hist(samp)

Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

The distribution is essentially unimodal and roughly bell-shaped. The typical size in the sample is about 1500 square feet, where "typical" means the most frequent value: the tallest bar of the histogram sits near 1500.
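One way to make "typical" precise is to report numeric summaries alongside the histogram. The sketch below runs on a simulated stand-in for the population (the `_demo` objects and their values are hypothetical, not the Ames data); with the lab data loaded, the same calls could be applied to `samp`.

```r
# Hypothetical stand-in for ames$Gr.Liv.Area; replace with the real vector.
set.seed(1)
population_demo <- round(rlnorm(3000, meanlog = 7.27, sdlog = 0.33))
samp_demo <- sample(population_demo, 60)

# Two common readings of "typical": the mean and the median.
typical_mean <- mean(samp_demo)
typical_median <- median(samp_demo)

# A third reading: the midpoint of the tallest histogram bar (modal bin).
h <- hist(samp_demo, plot = FALSE)
modal_bin <- h$mids[which.max(h$counts)]

c(mean = typical_mean, median = typical_median, modal_bin = modal_bin)
```

Reporting which of the three was used removes the ambiguity in "typical"; for a skewed sample the three can differ noticeably.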

Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

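Another student's distribution would not be identical, because each sample is drawn at random. It should be similar in shape and center, though, because both samples come from the same population. I set a seed so that re-running this markdown reproduces the same sample; without it, each run would draw a new sample and therefore produce a different distribution.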
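The effect of the seed can be checked directly. This is a minimal sketch on a hypothetical stand-in vector (`population_demo` is not the Ames data): resetting the same seed reproduces the sample, while drawing again without resetting it gives a new one.

```r
# Hypothetical stand-in for the population vector (not the Ames data).
population_demo <- 1:3000

set.seed(999999)
first_draw <- sample(population_demo, 60)

set.seed(999999)
second_draw <- sample(population_demo, 60)  # same seed, so the same sample

third_draw <- sample(population_demo, 60)   # generator has advanced: new sample

identical(first_draw, second_draw)  # the two seeded draws match
identical(first_draw, third_draw)   # almost surely FALSE
```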

sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)

## [1] 1322.756 1504.977

For the confidence interval to be valid, the sample mean must be normally distributed and have standard error \(s / \sqrt{n}\). What conditions must be met for this to be true?

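The observations must be independent, which holds when they come from a simple random sample of less than 10% of the population. The sample size should be at least 30, and the population distribution should not be strongly skewed.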

What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.

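"95% confidence" describes the method rather than any single interval: if we drew many random samples and built an interval from each one, about 95% of those intervals would contain the population mean. It is not the probability that the population mean lies inside one particular interval, since that interval either contains the mean or it does not.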

mean(population)

## [1] 1499.69

Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

Yes. The population mean, 1499.69, lies between 1322.76 and 1504.98, so the confidence interval captures the true average.

Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.

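I would expect about 95% of the intervals to capture the true population mean. That is what a 95% confidence level means: the procedure yields an interval containing the mean for about 95% of random samples.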

samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n)
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)
c(lower_vector[1], upper_vector[1])

## [1] 1267.564 1495.136

On your own

Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.

plot_ci(lower_vector, upper_vector, mean(population))

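Only 3 of the 50 confidence intervals (6%) miss the population mean, so 94% of them capture it. This is not exactly 95%: the confidence level is a long-run rate, and with only 50 intervals the observed proportion varies from run to run and can only change in steps of 2%.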
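The count read off the plot can also be computed directly. The sketch below rebuilds 50 intervals from a simulated stand-in population so that it runs on its own (all `_demo` names and values are hypothetical); with the lab objects in memory, the same comparison could be applied to `lower_vector`, `upper_vector`, and `mean(population)`.

```r
set.seed(42)
population_demo <- rlnorm(3000, meanlog = 7.27, sdlog = 0.33)  # hypothetical data
n <- 60
reps <- 50

samp_mean_demo <- numeric(reps)
samp_sd_demo <- numeric(reps)
for (i in seq_len(reps)) {
  s <- sample(population_demo, n)
  samp_mean_demo[i] <- mean(s)
  samp_sd_demo[i] <- sd(s)
}

lower_demo <- samp_mean_demo - 1.96 * samp_sd_demo / sqrt(n)
upper_demo <- samp_mean_demo + 1.96 * samp_sd_demo / sqrt(n)

mu <- mean(population_demo)
captured <- lower_demo <= mu & mu <= upper_demo  # TRUE when the interval contains mu

sum(!captured)  # number of intervals that miss the mean
mean(captured)  # proportion of intervals that capture it
```

Computing the proportion this way avoids miscounting bars in the plot, which becomes easy to do once there are hundreds of intervals.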

Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?

For a 99% confidence level, the critical value is z* ≈ 2.576; the code that follows uses the rounded table value 2.575.
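A critical value for any confidence level can be obtained from the standard normal quantile function rather than a table. This is a small sketch; `conf_level`, `alpha`, and `z_star` are illustrative names, not objects from the lab.

```r
conf_level <- 0.99
alpha <- 1 - conf_level

# The interval leaves alpha / 2 in each tail, so the critical value is the
# (1 - alpha / 2) quantile of the standard normal distribution.
z_star <- qnorm(1 - alpha / 2)
z_star

# The same formula recovers the familiar 95% value of about 1.96.
qnorm(1 - 0.05 / 2)
```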

Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?

confidence <- 2.575 # critical value for a 99% confidence level
for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - confidence * samp_sd / sqrt(n)
upper_vector <- samp_mean + confidence * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))

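I chose a 99% confidence level, and the result is consistent with it: all 50 intervals (100%) contain the population mean. With only 50 intervals we would expect about 0.5 misses on average, so seeing none is unsurprising. Let’s try 120 intervals.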

confidence <- 2.575 # critical value for a 99% confidence level
for(i in 1:120){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - confidence * samp_sd / sqrt(n)
upper_vector <- samp_mean + confidence * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))

All 120 intervals still contain the population mean. Let’s try 500.

confidence <- 2.575 # critical value for a 99% confidence level
for(i in 1:500){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - confidence * samp_sd / sqrt(n)
upper_vector <- samp_mean + confidence * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))

Finally, with 500 confidence intervals, 2 miss the population mean, so 99.6% capture it, which is close to the 99% confidence level.
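How far an observed capture rate can drift from the nominal level can be gauged by treating each interval as an independent trial that captures the mean with probability equal to the confidence level. This is an idealization (it assumes the intervals achieve exactly their nominal coverage), and the variable names are illustrative rather than part of the lab.

```r
conf_level <- 0.99

# Batch sizes used in the three runs above.
n_intervals <- c(50, 120, 500)

# Expected number of misses, and the standard deviation of the observed
# capture proportion, under a binomial model.
expected_misses <- n_intervals * (1 - conf_level)
sd_proportion <- sqrt(conf_level * (1 - conf_level) / n_intervals)

# Probability that every interval in a batch captures the mean.
p_all_capture <- conf_level ^ n_intervals

data.frame(n_intervals, expected_misses, sd_proportion, p_all_capture)
```

Under this model a batch of 50 intervals has no misses a little more than half the time, so 100% capture at that size says little; only the larger batches are expected to show misses reliably.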