Chapter 16 Interactive Notes

Load the libraries and data

library(tidyverse)
library(openintro)
library(tidymodels)
data("resume")

Rename the factor levels for callbacks

resume1 <- resume |>
  mutate(called = ifelse(received_callback == 1, "yes", "no"))

View the counts for applicants receiving callbacks

ggplot(resume1, aes(x = called)) +
  geom_bar()

calculate exact proportion of the sample that responded this way

p_hat_call <- resume1 |>
  summarize(prop_call = mean(called == "yes")) |>
  pull()

p_hat_call

## [1] 0.08049281

The proportion who received a callback in the dataset was 0.080.

The 95% confidence interval can be calculated as the sample proportion plus or minus two standard errors of the sample proportion

The bootstrap is done with the function: specify()

We do this many times to create many bootstrap replicate data sets.

Do this with the function generate()

Next, for each replicate, we calculate the sample statistic, in this case: the proportion of respondents that said “yes” to receiving callbacks.

Do this with the function calculate()

boot_dist_call <- resume1 |>
  specify(response = called, success = "yes") |>
  generate(reps = 500, type = "bootstrap") |>
  calculate(stat = "prop")

boot_dist_call

## Response: called (factor)
## # A tibble: 500 × 2
##    replicate   stat
##        <int>  <dbl>
##  1         1 0.0825
##  2         2 0.0791
##  3         3 0.0844
##  4         4 0.0805
##  5         5 0.0760
##  6         6 0.0838
##  7         7 0.0801
##  8         8 0.0799
##  9         9 0.0830
## 10        10 0.0778
## # ℹ 490 more rows

Plot the density curve of this distribution

ggplot(boot_dist_call, aes(x = stat)) +
  geom_density()

The standard deviation of the stat variable in this data frame (the bootstrap distribution) is the bootstrap standard error and it can be calculated using the summarize() function.

SE_call <- boot_dist_call |>
  summarize(se = sd(stat)) |>
  pull()

SE_call

## [1] 0.003842203

We can use this value, along with our point estimate, to roughly calculate a 95% confidence interval:

\(\hat{p} \pm z^*se\)

c(p_hat_call - 2 * SE_call, p_hat_call + 2 * SE_call)

## [1] 0.07280841 0.08817722

We are 95% confident that the true proportion of applicants receiving callbacks is between 7.28% and 8.81%.

The normal distribution

Another option for calculating the CI is by estimating it by using the Normal Distribution (the bell curve)

observations are independent
n is large

Then

\(\hat{p}\) follows a normal distribution

Steps

Calculate proportion receiving callback

n <- nrow(resume1)
p_hat_call <- resume1|>
  summarize(prop_call = mean(called == "yes")) |>
  pull()

Check conditions (both should be >= 10)

n * p_hat_call

## [1] 392

n * (1 - p_hat_call)

## [1] 4478

Calculate SE

se_call_approx <- sqrt(p_hat_call * (1 - p_hat_call) / n)

Calculate

\(z^*\)

z_star <- qnorm(.975, m=0, sd = 1)

Form 95% CI

c(p_hat_call - z_star * se_call_approx, p_hat_call + z_star * se_call_approx)

## [1] 0.07285200 0.08813363

Hypothesis Test

Use the General Social Survey (gss) data

data("gss")
head(gss)

## # A tibble: 6 × 11
##    year   age sex    college   partyid hompop hours income  class finrela weight
##   <dbl> <dbl> <fct>  <fct>     <fct>    <dbl> <dbl> <ord>   <fct> <fct>    <dbl>
## 1  2014    36 male   degree    ind          3    50 $25000… midd… below …  0.896
## 2  1994    34 female no degree rep          4    31 $20000… work… below …  1.08 
## 3  1998    24 male   degree    ind          1    40 $25000… work… below …  0.550
## 4  1996    42 male   no degree ind          4    40 $25000… work… above …  1.09 
## 5  1994    31 male   degree    rep          2    40 $25000… midd… above …  1.08 
## 6  1996    32 female no degree rep          4    53 $25000… midd… average  1.09

Do 1/3 of Americans have a college degree?

Use a hypothesis test to study another question on the gss. Respondents were asked if have a college degree. You can look at the distribution of answers by forming a bar chart. You see that of the 500 respondents, about 180 have a degree.

ggplot(gss, aes(x=college)) +
  geom_bar()

State the null and alternative hypotheses

\(H_o: p = 1/3\)

\(H_a: p \neq 1/3\)

Calculate the proportion

p_hat <- gss |>
  summarize(mean(college == "degree")) |>
  pull()
p_hat

## [1] 0.348

Now perform the hypothesis test

null <- gss |>
  specify(response = college, success = "degree") |>
  hypothesize(null = "point", p = 1/3) |> 
  generate(reps = 500, type = "draw") |> 
  calculate(stat = "prop")

null

## Response: college (factor)
## Null Hypothesis: point
## # A tibble: 500 × 2
##    replicate  stat
##        <int> <dbl>
##  1         1 0.314
##  2         2 0.322
##  3         3 0.344
##  4         4 0.332
##  5         5 0.344
##  6         6 0.298
##  7         7 0.306
##  8         8 0.36 
##  9         9 0.362
## 10        10 0.346
## # ℹ 490 more rows

Create a density curve of the null distribution

Add a vertical line (geom_vline) to indicate where p-hat lies on the curve.

ggplot(null, aes(x = stat)) +
  geom_density() +
  geom_vline(
    xintercept = p_hat, 
    color = "red"
  )

Calculate a two-tailed p-value (multiply by two)

null |>
  summarize(mean(stat > p_hat)) |>
  pull() * 2

## [1] 0.448

The p-value is very large, at 0.46. We fail to reject the null. There is no evidence that the proportion of US adults with a college degree is different that 1/3 the population.