The given data file contains a hypothetical scenario that there is a gender divide when it comes to their favorate singer: all male students prefer Imagine Dragons and all female students prefer Ariana Grande.
# Load packages
library(tidyverse)
library(infer)
# Import data
music <- read.csv("~/R/business sat/DATA/music.csv") %>% as_tibble()
music %>%
# Count the rows by singer and sex
count(sex, singer)
## # A tibble: 2 x 3
## sex singer n
## <fct> <fct> <int>
## 1 female Grande 10
## 2 male Dragons 10
Interpretation
# Find proportion of each sex who were Dragons
music %>%
# Group by sex
group_by(sex) %>%
# Calculate proportion that prefer Dragons
summarise(Dragons_prop = mean(singer == "Dragons"))
## # A tibble: 2 x 2
## sex Dragons_prop
## <fct> <dbl>
## 1 female 0
## 2 male 1
# Calculate the observed difference in promotion rate
diff_orig <-
music %>%
# Group by sex
group_by(sex) %>%
# Calculate proportion that prefer Dragons by sex
summarise(prop_prom = mean(singer == "Dragons")) %>%
# Calculate difference
summarise(stat = diff(prop_prom)) %>%
pull()
# See the result
diff_orig # male - female
## [1] 1
Interpretation
Is 100% a large enough difference to conclude that male students are more likely to prefer Imagine Dragons? Or is a 100% difference is something we can see by chance?
Note that the distribution varies each time you run the program because music_perm is randomly permuted.
# Create data frame of permuted differences in promotion rates
music_perm <- music %>%
# Specify variables: singer (response variable) and sex (explanatory variable)
specify(singer ~ sex, success = "Dragons") %>%
# Set null hypothesis as independence: there is no gender musicrimination
hypothesize(null = "independence") %>%
# Shuffle the response variable, singer, one thousand times
generate(reps = 1000, type = "permute") %>%
# Calculate difference in proportion, male then female
calculate(stat = "diff in props", order = c("male", "female")) # male - female
music_perm
## # A tibble: 1,000 x 2
## replicate stat
## <int> <dbl>
## 1 1 0.200
## 2 2 -0.200
## 3 3 0.200
## 4 4 0.200
## 5 5 0
## 6 6 0
## 7 7 0.6
## 8 8 0.200
## 9 9 -0.200
## 10 10 0
## # ... with 990 more rows
# Using permutation data, plot stat
ggplot(music_perm, aes(x = stat)) +
# Add a histogram layer
geom_histogram(binwidth = 0.01) +
# Using original data, add a vertical line at stat
geom_vline(aes(xintercept = diff_orig), color = "red")
Interpretation (no need to revise)
music_perm %>%
summarize(
# Find the 0.9 quantile of diff_perm's stat
q.90 = quantile(stat, p = 0.9),
# ... and the 0.95 quantile
q.95 = quantile(stat, p = 0.95),
# ... and the 0.99 quantile
q.99 = quantile(stat, p = 0.99)
)
## # A tibble: 1 x 3
## q.90 q.95 q.99
## <dbl> <dbl> <dbl>
## 1 0.200 0.40 0.6
Interpretation
A p-value measures the degree of disagreement between the data and the null hypothesis.
# Calculate the p-value for the original dataset
music_perm %>%
get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
Interpretation