Lab 4 - Confidence intervals

Lab report

Load packages:

# dplyr provides sample_n(), summarise(), and %>%; ggplot2 provides ggplot()
library(dplyr)
library(ggplot2)
# mosaic provides do(); it is loaded last so that its do() is the one used below
library(mosaic)

Load data:

load(url("https://raw.githubusercontent.com/GarciaRios/govt_3990/gh-pages/Labs/lab4/data/ames.RData"))

Set a seed:

set.seed(9438024)

Exercises:

Exercise 1:

n <- 60
samp <- sample_n(ames, n)

ggplot(data=samp, aes(x = Lot.Area)) + 
  geom_histogram(binwidth = 500) + 
  theme_bw()

samp %>% 
  select(Lot.Area) %>% 
  summarise(mean_area = mean(Lot.Area))

##   mean_area
## 1  10069.13

Describe the distribution of homes in your sample. What would you say is the “typical” size within your sample? Also state precisely what you interpreted “typical” to mean.

It appears from the plot that the distribution is unimodal and roughly normal, with a slight right skew.

The typical size of homes in this sample is 10,069.13 square feet. I interpreted “typical” to mean the average size of the homes in the sample, or more specifically, the mean.

Exercise 2:

Would you expect another classmate’s distribution to be identical to yours? Would you expect it to be similar? Why or why not?

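I would expect a classmate's distribution to be similar but not identical. Each random sample of 60 homes contains different houses, so the histogram and the sample mean will differ somewhat. Every sample is drawn from the same population, however, so the overall shape and center should be similar. The results would be identical only if a classmate set the same seed before drawing the sample.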
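
The role of the seed can be illustrated with a small sketch in base R. The population here is simulated (a hypothetical stand-in, not the Ames data), and `draw_sample_mean` is a helper name made up for this illustration.

```r
# Hypothetical stand-in population; these values are simulated, not the Ames data.
set.seed(1)
population <- rgamma(2000, shape = 2, scale = 5000)

# Set a seed, draw a sample of size n, and return its mean.
draw_sample_mean <- function(seed, n = 60) {
  set.seed(seed)
  mean(sample(population, n))
}

same_seed_a <- draw_sample_mean(seed = 123)
same_seed_b <- draw_sample_mean(seed = 123)
other_seed  <- draw_sample_mean(seed = 456)

identical(same_seed_a, same_seed_b)  # same seed, same sample, same mean
same_seed_a == other_seed            # different seed, almost surely a different mean
```

Two draws with the same seed reproduce each other exactly, while a different seed gives a different sample from the same population.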

Exercise 3:

z_star_95 <- qnorm(0.975)
z_star_95

## [1] 1.959964

Now we can calculate the confidence interval

samp %>%
  summarise(lower = mean(Lot.Area) - z_star_95 * (sd(Lot.Area) / sqrt(n)),
            mean = mean(Lot.Area),
            upper = mean(Lot.Area) + z_star_95 * (sd(Lot.Area) / sqrt(n)))

##     lower     mean    upper
## 1 8390.28 10069.13 11747.99

We are 95% confident that the true average lot area of houses in Ames lies between 8390.28 and 11747.99 square feet.
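
As a check on the arithmetic, the reported interval can be taken apart again. The sketch below uses only the bounds printed above, together with n = 60 and the 95% critical value, to recover the point estimate, the margin of error, the standard error, and the sample standard deviation implied by SE = s/√n.

```r
# Values copied from the report's output above.
lower <- 8390.28
upper <- 11747.99
n <- 60
z_star_95 <- qnorm(0.975)

point_estimate  <- (lower + upper) / 2          # midpoint of the interval
margin_of_error <- (upper - lower) / 2          # half-width, equal to z* times SE
standard_error  <- margin_of_error / z_star_95  # SE recovered from the margin of error
implied_sd      <- standard_error * sqrt(n)     # sample sd implied by SE = s / sqrt(n)

round(c(point_estimate = point_estimate,
        margin_of_error = margin_of_error,
        standard_error = standard_error,
        implied_sd = implied_sd), 2)
```

The midpoint should agree with the sample mean reported above, up to rounding of the printed bounds.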

For the confidence interval to be valid, the sample mean must be normally distributed and have standard error s/√n. What conditions must be met for this to be true?

For this to be true, the sampled observations need to be independent. Independence is more likely if random sampling is used and, if sampling without replacement, the sample size is less than 10% of the population. In addition, either the population distribution should be normal, or n > 30 and the population distribution should not be extremely skewed.

Exercise 4:

What does 95% confidence mean?

This refers to the long-run success rate of the method: if we repeatedly drew samples of the same size and built an interval from each one, about 95% of those intervals would capture the population parameter of interest, in this case the mean Lot Area of homes in Ames.
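
The long-run interpretation can be sketched with a simulation in base R. The population below is simulated and right-skewed (a hypothetical stand-in for Lot.Area, not the Ames data), and the seed and number of repetitions are arbitrary choices for illustration.

```r
set.seed(2024)  # arbitrary seed, used only for this illustration

# Hypothetical right-skewed population (simulated values, not the Ames data).
population <- rgamma(3000, shape = 2, scale = 5000)
mu <- mean(population)

n <- 60
z_star_95 <- qnorm(0.975)

# Build 5000 intervals and record whether each one captures mu.
captures_mu <- replicate(5000, {
  samp <- sample(population, n)   # simple random sample without replacement
  se <- sd(samp) / sqrt(n)
  lower <- mean(samp) - z_star_95 * se
  upper <- mean(samp) + z_star_95 * se
  lower < mu & upper > mu
})

coverage <- mean(captures_mu)  # share of intervals that captured mu
coverage
```

With many repetitions the share of intervals that capture the mean should settle near 0.95. It may fall slightly short, because the population is skewed and the standard error is estimated from each sample.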

Looking at the population

ames %>% 
  select(Lot.Area) %>% 
  summarise(mean_area = mean(Lot.Area))

##   mean_area
## 1  10147.92

Exercise 5:

params <- ames %>%
  summarise(mu = mean(Lot.Area))

Does your confidence interval capture the true average size of houses in Ames? If you are working on this lab in a classroom, does your neighbor’s interval capture this value?

The true average size of houses in Ames is 10147.92. My confidence interval has a lower value of 8390.28 and an upper value of 11747.99, so it does capture the true average size.

Exercise 6:

Each student should have gotten a slightly different confidence interval. What proportion of those intervals would you expect to capture the true population mean? Why?

Because the whole class did not set the same seed before drawing their samples, everyone constructed their confidence intervals from different randomly selected samples. I would expect approximately 95% of the constructed confidence intervals to capture the true population mean, because 95% is the long-run capture rate of this method.

Exercise 7:

ci <- do(50) * ames %>%
  sample_n(n) %>%
  summarise(lower = mean(Lot.Area) - z_star_95 * (sd(Lot.Area) / sqrt(n)),
            upper = mean(Lot.Area) + z_star_95 * (sd(Lot.Area) / sqrt(n)))

ci %>%
  slice(1:5)

##      lower     upper .row .index
## 1 8800.031 10805.569    1      1
## 2 9018.124 11819.843    1      2
## 3 8355.396  9887.704    1      3
## 4 8708.582 10808.218    1      4
## 5 8964.277 10923.823    1      5

ci <- ci %>%
  mutate(capture_mu = ifelse(lower < params$mu & upper > params$mu, "yes", "no"))

ci_data <- data.frame(ci_id = c(1:50, 1:50),
                      ci_bounds = c(ci$lower, ci$upper),
                      capture_mu = c(ci$capture_mu, ci$capture_mu))

ggplot(ci_data, aes(x = ci_bounds, y = ci_id, 
      group = ci_id, color = capture_mu)) +
  geom_point(size = 2) +  # add points at the ends, size = 2
  geom_line() +           # connect with lines
  geom_vline(xintercept = params$mu, color = "darkgray") # draw vertical line

What proportion of your confidence intervals include the true population mean? Is this proportion exactly equal to the confidence level? If not, explain why. Make sure to include your plot in your answer.

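Four confidence intervals out of 50 did not capture the true population mean, which means 46 of 50, or a proportion of 0.92, did include it. This is not exactly equal to the confidence level because the confidence level describes the long-run capture rate over many repeated samples, not a guarantee for any particular set of 50 intervals. With only 50 intervals, the observed proportion varies by chance from one set to the next, and 95% of 50 is 47.5, which cannot be observed exactly. The plot shows that the large majority of confidence intervals did capture the population mean.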
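
The chance variation in the capture count can be sketched with a binomial model. This assumes the 50 intervals are independent and that each one captures the population mean with probability exactly 0.95, which is an idealization of the simulation above.

```r
n_intervals <- 50
conf_level  <- 0.95

# Expected number of captures: 47.5, which is not a whole number.
expected_captures <- n_intervals * conf_level

# Standard deviation of the capture count under the binomial model.
sd_captures <- sqrt(n_intervals * conf_level * (1 - conf_level))

# Probability of 46 or fewer captures if each interval captures mu with probability 0.95.
p_46_or_fewer <- pbinom(46, size = n_intervals, prob = conf_level)

c(expected = expected_captures, sd = sd_captures, p_46_or_fewer = p_46_or_fewer)
```

Under this model, a count of 46 is about one standard deviation below the expected count, so it is well within ordinary sampling variation.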

On your own:

1:

I am choosing a 99% confidence level. Because the interval is two-sided, the critical value is the point that leaves half of alpha, or 0.005, in each tail of the standard normal distribution.

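alpha <- 1 - (99 / 100)           # total area in the two tails
tail_area <- alpha / 2            # area in each tail (avoids masking base R's t())
upper_quantile <- 1 - tail_area   # cumulative probability at the upper cutoff

z_star_99 <- qnorm(upper_quantile)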

Our critical value is approximately 2.58.
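
The same steps work for any confidence level. A minimal sketch, where `z_star` is a hypothetical helper name introduced only for this illustration:

```r
# Hypothetical helper: two-sided standard normal critical value for a confidence level.
z_star <- function(conf_level) {
  stopifnot(conf_level > 0, conf_level < 1)
  alpha <- 1 - conf_level   # total area in the two tails
  qnorm(1 - alpha / 2)      # quantile that leaves alpha / 2 in the upper tail
}

z_star(c(0.90, 0.95, 0.99))
```

Higher confidence levels give larger critical values, which is why the 99% intervals are wider than the 95% intervals.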

2:

ci99 <- do(50) * ames %>%
  sample_n(n) %>%
  summarise(lower = mean(Lot.Area) - z_star_99 * (sd(Lot.Area) / sqrt(n)),
            upper = mean(Lot.Area) + z_star_99 * (sd(Lot.Area) / sqrt(n)))

ci99 %>%
  slice(1:5)

##      lower    upper .row .index
## 1 8238.424 10693.34    1      1
## 2 8830.868 10850.16    1      2
## 3 7868.305 11983.06    1      3
## 4 6954.670 16566.93    1      4
## 5 7844.596 10547.04    1      5

ci99 <- ci99 %>%
  mutate(capture_mu = ifelse(lower < params$mu & upper > params$mu, "yes", "no"))

ci99_data <- data.frame(ci_id = c(1:50, 1:50),
                      ci_bounds = c(ci99$lower, ci99$upper),
                      capture_mu = c(ci99$capture_mu, ci99$capture_mu))

ggplot(ci99_data, aes(x = ci_bounds, y = ci_id, 
      group = ci_id, color = capture_mu)) +
  geom_point(size = 2) +  # add points at the ends, size = 2
  geom_line() +           # connect with lines
  geom_vline(xintercept = params$mu, color = "darkgray") # draw vertical line

Calculate 50 confidence intervals at the confidence level you chose in the previous question, and plot all intervals on one plot, and calculate the proportion of intervals that include the true population mean. How does this percentage compare to the confidence level selected for the intervals? Make sure to include your plot in your answer.

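All 50 of the confidence intervals at the 99% confidence level captured the true population mean, which is a proportion of 1, or 100%. This is slightly higher than, but consistent with, the 99% confidence level selected for the intervals: with only 50 intervals we would expect about 0.5 misses on average, so observing none is not surprising.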
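
The proportion itself, and how likely a clean sweep is, can be sketched as follows. The `capture_mu` vector here is a hypothetical stand-in that mirrors the 50-out-of-50 result above, and the probability calculation assumes independent intervals that each capture the mean with probability exactly 0.99.

```r
# Hypothetical stand-in for ci99$capture_mu, mirroring the 50-of-50 result above.
capture_mu <- rep("yes", 50)
prop_captured <- mean(capture_mu == "yes")  # proportion of intervals that captured mu

n_intervals <- 50
conf_level  <- 0.99

p_all_capture   <- conf_level^n_intervals          # chance that all 50 intervals capture mu
expected_misses <- n_intervals * (1 - conf_level)  # average number of misses in 50 intervals

c(prop_captured = prop_captured,
  p_all_capture = p_all_capture,
  expected_misses = expected_misses)
```

Under these assumptions a perfect 50 out of 50 happens in a majority of such experiments, so it does not conflict with a 99% confidence level.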