Chapter 17 interactive notes
Load the libraries and data
library(tidyverse)
library(openintro)
library(tidymodels)
data("gss")
A question in two variables
Do women and men join political parties at different rates?
Let p be the proportion who identify as Democrats, with \(p_f\) for females and \(p_m\) for males.
\(H_0: p_f = p_m\) \(H_a: p_f \neq p_m\)
We’re curious to know: do men and women join political parties at different rates?
Let’s let p be the true proportion who identify as Democrats.
We can then phrase this question as the null hypothesis that the difference between the proportion of women and the proportion of men who identify as Democrats is zero.
The alternative hypothesis is then that this difference is non-zero.
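Written in terms of the difference, the hypotheses above can be sketched as follows. The symbol \(\hat{d}\) for the sample difference is notation assumed here to match the d_hat object computed later in these notes, and the female-minus-male direction is likewise an assumption chosen to match that computation.

```latex
\[
H_0: p_f - p_m = 0, \qquad H_a: p_f - p_m \neq 0, \qquad \hat{d} = \hat{p}_f - \hat{p}_m
\]
```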
Do women and men have different political affiliations?
ggplot(gss, aes(x = sex, fill = partyid)) +
  geom_bar()
Let’s take a look at how these proportions compare in the gss data set. The data live in two columns, partyid and sex, so we can map sex to the x-axis and party identification to the fill color of the bars. If we add a geom_bar layer, we get a stacked bar chart showing that we have more females than males in our data set and that party identification is split within each group.
ggplot(gss, aes(x = sex, fill = partyid)) +
  geom_bar(position = "fill")
We can convert these counts to proportions by adding the position = "fill" argument. It looks like the proportion of Democrats is a bit lower for men than for women.
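To see what the filled bars encode, here is a minimal base R sketch that computes the same kind of within-group proportions as a table. The data frame toy is hypothetical and is not the gss data; it exists only to make the arithmetic visible.

```r
# Hypothetical toy data (NOT the gss data): six respondents.
toy <- data.frame(
  sex     = c("male", "male", "male", "female", "female", "female"),
  partyid = c("dem", "rep", "ind", "dem", "dem", "rep")
)

# Counts of each party within each sex (the heights a stacked bar chart shows).
counts <- table(toy$sex, toy$partyid)

# Row-wise proportions (what a filled bar chart shows): each row sums to 1.
props <- prop.table(counts, margin = 1)
print(round(props, 2))
```

Each row of props corresponds to one bar in the filled chart, and each cell to the height of one colored segment.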
Create p-hat values
Compute the sample proportions of males and females who identify as Democrats, then store the difference of the two proportions in d_hat.
p_hats <- gss |>
  group_by(sex) |>
  summarize(mean(partyid == "dem", na.rm = TRUE)) |>
  pull()
d_hat <- diff(p_hats)
d_hat
[1] 0.06499174
We can calculate the difference in these proportions by using our normal summarize method of calculating a proportion, but add in a group_by line to indicate we want to calculate that proportion for men and women separately. The result is a vector of two proportions. We take their difference with the diff function and save it as d_hat, which we learn is about 0.065.
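The direction of the subtraction matters when interpreting d_hat. The base R sketch below uses a hypothetical toy data frame (not the gss data) in which the factor levels are set explicitly to male then female, an assumption made here so that diff() returns female minus male.

```r
# Hypothetical toy data (NOT the gss data); level order is set explicitly.
toy <- data.frame(
  sex = factor(
    c("male", "male", "male", "male", "female", "female", "female", "female"),
    levels = c("male", "female")
  ),
  partyid = c("dem", "rep", "ind", "rep", "dem", "dem", "rep", "ind")
)

# Proportion of "dem" within each sex, returned in factor-level order.
p_hats_toy <- sapply(split(toy$partyid == "dem", toy$sex), mean)

# diff() computes second element minus first: female minus male here.
d_hat_toy <- as.numeric(diff(p_hats_toy))
d_hat_toy
```

With the levels reversed, the same code would return male minus female, so it is worth checking the level order before reading the sign of the result.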
Generating data from H0
\(H_0: p_f = p_m\)
There is no association between political party affiliation and the sex of a subject.
The variable partyid is independent from the variable sex.
⇒ Generate data by permutation
Mutate partyid to be a binary variable
gss1 <- gss |>
  mutate(two_party = ifelse(partyid == "dem", "democrat", "not-democrat"))
Do women and men have different political party affiliations?
gss1 |>
  specify(
    two_party ~ sex, # this line is new
    success = "democrat"
  ) |>
  hypothesize(null = "independence") |>
  generate(reps = 1, type = "permute")
Response: two_party (factor)
Explanatory: sex (factor)
Null Hypothesis: independence
# A tibble: 500 × 3
# Groups: replicate [1]
two_party sex replicate
<fct> <fct> <int>
1 not-democrat male 1
2 not-democrat female 1
3 not-democrat male 1
4 democrat male 1
5 not-democrat male 1
6 not-democrat female 1
7 not-democrat female 1
8 not-democrat female 1
9 not-democrat female 1
10 not-democrat female 1
# ℹ 490 more rows
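A single permutation replicate can be imitated by hand in base R. The sketch below is an illustration on hypothetical toy data, not the gss data, and it assumes the response column is the one being shuffled; shuffling either column breaks the pairing in the same way.

```r
set.seed(1)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical toy data (NOT the gss data).
toy <- data.frame(
  two_party = c("democrat", "not-democrat", "not-democrat", "democrat",
                "not-democrat", "democrat", "not-democrat", "not-democrat"),
  sex       = c("male", "male", "male", "male",
                "female", "female", "female", "female")
)

# One permutation: shuffle the response column and leave sex untouched.
permuted <- toy
permuted$two_party <- sample(toy$two_party)

# The shuffle breaks any link between the columns but keeps the totals.
table(toy$two_party)
table(permuted$two_party)
```

Because the column totals are unchanged, any difference between the sexes in the permuted data is due to chance alone, which is exactly what the independence null describes.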
Build up a full null distribution
null <- gss1 |>
  specify(two_party ~ sex, success = "democrat") |>
  hypothesize(null = "independence") |>
  generate(reps = 500, type = "permute") |>
  calculate(stat = "diff in props", order = c("female", "male"))
null
Response: two_party (factor)
Explanatory: sex (factor)
Null Hypothesis: independence
# A tibble: 500 × 2
replicate stat
<int> <dbl>
1 1 -0.0553
2 2 -0.0393
3 3 0.0329
4 4 -0.0553
5 5 -0.0152
6 6 0.00884
7 7 0.0409
8 8 -0.0553
9 9 0.0249
10 10 0.0409
# ℹ 490 more rows
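The same shuffle-then-summarize loop can be sketched in base R to show what each row of the null distribution represents. The data frame toy and the helper diff_in_props() are hypothetical names introduced for this illustration; this is not how the pipeline above is implemented internally.

```r
set.seed(2)  # arbitrary seed, only so this sketch is reproducible

# Hypothetical toy data (NOT the gss data).
toy <- data.frame(
  two_party = c("democrat", "not-democrat", "not-democrat", "democrat",
                "not-democrat", "democrat", "not-democrat", "not-democrat"),
  sex       = c("male", "male", "male", "male",
                "female", "female", "female", "female")
)

# Hypothetical helper: proportion of "democrat" among females minus males.
diff_in_props <- function(response, group) {
  mean(response[group == "female"] == "democrat") -
    mean(response[group == "male"] == "democrat")
}

# Repeat 500 times: shuffle the response, then recompute the statistic.
null_toy <- replicate(500, diff_in_props(sample(toy$two_party), toy$sex))

length(null_toy)
summary(null_toy)
```

Each element of null_toy plays the role of one stat value in the tibble above: a difference in proportions computed from data in which sex and party are unrelated by construction.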
Plot a density curve
ggplot(null, aes(x = stat)) +
  geom_density() +
  geom_vline(xintercept = d_hat, color = "red")
The observed difference (red line) sits toward the right side of the null distribution, which suggests that a higher proportion of females than males may identify as Democrats.
We need a p-value to judge whether a difference this large could plausibly arise by chance under the null hypothesis.
# Compute two-tailed p-value
null |>
  summarize(
    one_tailed_pval = mean(stat >= d_hat),
    two_tailed_pval = 2 * one_tailed_pval
  ) |>
  pull(two_tailed_pval)
[1] 0.136
The two-tailed p-value is 0.136. We fail to reject the null hypothesis: there is no compelling evidence of a difference between the proportions of females and males who identify with the Democratic Party.
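The p-value arithmetic can be checked by hand on a small vector. In the base R sketch below, null_stats and observed are hypothetical values, not results from the gss data. The cap at 1 and the absolute-value variant are additions made for this sketch and are not part of the notes' calculation.

```r
# Hypothetical null statistics and observed difference (NOT from the gss data).
null_stats <- c(-0.08, -0.06, -0.03, -0.01, 0.00, 0.01, 0.02, 0.04, 0.06, 0.09)
observed   <- 0.05

# Upper-tail proportion, then doubled (the approach taken in the notes).
# The min() guard keeps the doubled value from exceeding 1.
one_tailed <- mean(null_stats >= observed)
two_tailed_doubled <- min(1, 2 * one_tailed)

# Alternative (an assumption of this sketch): count replicates at least
# as far from zero as the observed value, in either direction.
two_tailed_abs <- mean(abs(null_stats) >= abs(observed))

c(doubled = two_tailed_doubled, absolute = two_tailed_abs)
```

Doubling the upper tail only makes sense when the observed statistic is positive, as it is here; for a negative observed difference the lower tail would be doubled instead.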
Create a 95% Bootstrap CI
# Create the bootstrap distribution
boot <- gss1 |>
  # Specify the variables and success
  specify(two_party ~ sex, success = "democrat") |>
  # Generate 500 bootstrap reps
  generate(reps = 500, type = "bootstrap") |>
  # Calculate the statistics
  calculate(stat = "diff in props", order = c("female", "male"))
# Compute the standard error
SE <- boot |>
  summarize(se = sd(stat)) |>
  pull()
# Form the CI (lower, upper)
c(d_hat - 2 * SE, d_hat + 2 * SE)
[1] -0.02261768 0.15260116
We are 95% confident that the true difference in the proportions of females and males who identify with the Democratic Party (female minus male) is between -2.26% and 15.26%. Because this interval includes zero, there is no compelling evidence that affiliation with the Democratic Party differs by sex.
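The interval construction above is the standard-error method: the point estimate plus or minus two bootstrap standard errors. The base R sketch below repeats that arithmetic on hypothetical values (boot_stats and estimate are made up, not taken from the gss data). For comparison it also shows a percentile interval, which is an alternative not used in these notes.

```r
# Hypothetical bootstrap statistics and point estimate (NOT from the gss data).
boot_stats <- c(0.01, 0.03, 0.05, 0.06, 0.07, 0.08, 0.10, 0.12)
estimate   <- 0.065

# Standard-error method: estimate plus or minus 2 standard errors.
se <- sd(boot_stats)
ci_se <- c(lower = estimate - 2 * se, upper = estimate + 2 * se)

# Percentile method (alternative): middle 95% of the bootstrap statistics.
ci_pct <- quantile(boot_stats, probs = c(0.025, 0.975))

ci_se
ci_pct
```

The standard-error interval is always symmetric around the estimate, while the percentile interval follows the shape of the bootstrap distribution, so the two can differ when that distribution is skewed.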