Data 606 Lab 4b

Exercises

load("more/ames.RData")
set.seed(13)
population <- ames$Gr.Liv.Area
samp <- sample(population, 60)
hist(samp,breaks=10)

Describe the distribution of your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

The distribution of the sample is skewed right, bimodal, and the ‘typical’ size is about 1250. I interpreted ‘typical’ to mean the mode.

Would you expect another student’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

I would not expect another distribution to be identical due to the variation between each sample

For the confidence interval to be valid, the sample mean must be normally distributed and have standard error \(s / \sqrt{n}\). What conditions must be met for this to be true?

The observations must be independant, ideally over 30 observations, and the population distribution not skewed

What does “95% confidence” mean? If you’re not sure, see Section 4.2.2.

“95% Confidence” means that the population mean will be within the confidence interval of the point estimate 95% of the time

sample_mean <- mean(samp)
se <- sd(samp) / sqrt(60)
lower <- sample_mean - 1.96 * se
upper <- sample_mean + 1.96 * se
c(lower, upper)

## [1] 1288.893 1526.940

mean(population)

## [1] 1499.69

Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

In this case the confidence interval does capture the true average.

Each student in your class should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why? If you are working in this lab in a classroom, collect data on the intervals created by other students in the class and calculate the proportion of intervals that capture the true population mean.

We would expect 95% percent of the confidence intervals to capture the true population mean because the interval is based on the value bondaries related to that probability’s z-score.

set.seed(868)
samp_mean <- rep(NA, 50)
samp_sd <- rep(NA, 50)
n <- 60
for(i in 1:50){
  samp <- sample(population, n) # obtain a sample of size n = 60 from the population
  samp_mean[i] <- mean(samp)    # save sample mean in ith element of samp_mean
  samp_sd[i] <- sd(samp)        # save sample sd in ith element of samp_sd
}
lower_vector <- samp_mean - 1.96 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 1.96 * samp_sd / sqrt(n)

On your own

Using the following function (which was downloaded with the data set), plot all intervals. What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why.
```
plot_ci(lower_vector, upper_vector, mean(population))
```

The proportion of confidence intervals that include the true mean are not excactly equal to the confidence level. This is due to chance and as the number of samples increases the percentage does approach the exact confidence level.

Pick a confidence level of your choosing, provided it is not 95%. What is the appropriate critical value?

For a confidence level of 99%, the critical value is 2.58

Calculate 50 confidence intervals at the confidence level you chose in the previous question. You do not need to obtain new samples, simply calculate new intervals based on the sample means and standard deviations you have already collected. Using the plot_ci function, plot all intervals and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals?

With a confidence interval of 99%, in this example, all of the intervals include the true population mean. This is in line with what we would expect with the 99% confidence level, and as the number of samples increases we would see the proportion to get close to 99%

lower_vector <- samp_mean - 2.58 * samp_sd / sqrt(n) 
upper_vector <- samp_mean + 2.58 * samp_sd / sqrt(n)
plot_ci(lower_vector, upper_vector, mean(population))

Data 606 Lab 4b

John Perez

3/12/2019

Exercises

On your own