library(tidyverse)
library(openintro)
library(tidymodels)
data("resume")
resume1 <- resume |>
mutate(called = ifelse(received_callback == 1, "yes", "no"))
ggplot(resume1, aes(x = called)) +
geom_bar()
p_hat_call <- resume1 |>
summarize(prop_call = mean(called == "yes")) |>
pull()
p_hat_call
## [1] 0.08049281
The proportion who received a callback in the dataset was 0.080.
The bootstrap is done with the function: specify()
Do this with the function generate()
Do this with the function calculate()
boot_dist_call <- resume1 |>
specify(response = called, success = "yes") |>
generate(reps = 500, type = "bootstrap") |>
calculate(stat = "prop")
boot_dist_call
## Response: called (factor)
## # A tibble: 500 × 2
## replicate stat
## <int> <dbl>
## 1 1 0.0825
## 2 2 0.0791
## 3 3 0.0844
## 4 4 0.0805
## 5 5 0.0760
## 6 6 0.0838
## 7 7 0.0801
## 8 8 0.0799
## 9 9 0.0830
## 10 10 0.0778
## # ℹ 490 more rows
ggplot(boot_dist_call, aes(x = stat)) +
geom_density()
The standard deviation of the stat variable in this data frame (the bootstrap distribution) is the bootstrap standard error and it can be calculated using the summarize() function.
SE_call <- boot_dist_call |>
summarize(se = sd(stat)) |>
pull()
SE_call
## [1] 0.003842203
\(\hat{p} \pm z^*se\)
c(p_hat_call - 2 * SE_call, p_hat_call + 2 * SE_call)
## [1] 0.07280841 0.08817722
We are 95% confident that the true proportion of applicants receiving callbacks is between 7.28% and 8.81%.
Another option for calculating the CI is by estimating it by using the Normal Distribution (the bell curve)
If
Then
\(\hat{p}\) follows a normal distribution
n <- nrow(resume1)
p_hat_call <- resume1|>
summarize(prop_call = mean(called == "yes")) |>
pull()
n * p_hat_call
## [1] 392
n * (1 - p_hat_call)
## [1] 4478
se_call_approx <- sqrt(p_hat_call * (1 - p_hat_call) / n)
\(z^*\)
z_star <- qnorm(.975, m=0, sd = 1)
c(p_hat_call - z_star * se_call_approx, p_hat_call + z_star * se_call_approx)
## [1] 0.07285200 0.08813363
Use the General Social Survey (gss) data
data("gss")
head(gss)
## # A tibble: 6 × 11
## year age sex college partyid hompop hours income class finrela weight
## <dbl> <dbl> <fct> <fct> <fct> <dbl> <dbl> <ord> <fct> <fct> <dbl>
## 1 2014 36 male degree ind 3 50 $25000… midd… below … 0.896
## 2 1994 34 female no degree rep 4 31 $20000… work… below … 1.08
## 3 1998 24 male degree ind 1 40 $25000… work… below … 0.550
## 4 1996 42 male no degree ind 4 40 $25000… work… above … 1.09
## 5 1994 31 male degree rep 2 40 $25000… midd… above … 1.08
## 6 1996 32 female no degree rep 4 53 $25000… midd… average 1.09
Use a hypothesis test to study another question on the gss. Respondents were asked if have a college degree. You can look at the distribution of answers by forming a bar chart. You see that of the 500 respondents, about 180 have a degree.
ggplot(gss, aes(x=college)) +
geom_bar()
\(H_o: p = 1/3\)
\(H_a: p \neq 1/3\)
p_hat <- gss |>
summarize(mean(college == "degree")) |>
pull()
p_hat
## [1] 0.348
null <- gss |>
specify(response = college, success = "degree") |>
hypothesize(null = "point", p = 1/3) |>
generate(reps = 500, type = "draw") |>
calculate(stat = "prop")
null
## Response: college (factor)
## Null Hypothesis: point
## # A tibble: 500 × 2
## replicate stat
## <int> <dbl>
## 1 1 0.314
## 2 2 0.322
## 3 3 0.344
## 4 4 0.332
## 5 5 0.344
## 6 6 0.298
## 7 7 0.306
## 8 8 0.36
## 9 9 0.362
## 10 10 0.346
## # ℹ 490 more rows
Add a vertical line (geom_vline) to indicate where p-hat lies on the curve.
ggplot(null, aes(x = stat)) +
geom_density() +
geom_vline(
xintercept = p_hat,
color = "red"
)
null |>
summarize(mean(stat > p_hat)) |>
pull() * 2
## [1] 0.448
The p-value is very large, at 0.46. We fail to reject the null. There is no evidence that the proportion of US adults with a college degree is different that 1/3 the population.