The given data file contains a hypothetical scenario that there is a gender divide when it comes to their favorate singer: all male students prefer Imagine Dragons and all female students prefer Ariana Grande.

Summarizing gender split

# Load packages
library(tidyverse)
library(infer)

# Import data
music <- read.csv("~/R/business sat/DATA/music.csv") %>% as_tibble()
music %>%
  # Count the rows by singer and sex
  count(sex, singer)
## # A tibble: 2 x 3
##   sex    singer      n
##   <fct>  <fct>   <int>
## 1 female Grande     10
## 2 male   Dragons    10

Interpretation

# Find proportion of each sex who were Dragons
music %>%
  # Group by sex
  group_by(sex) %>%
  # Calculate proportion that prefer Dragons
  summarise(Dragons_prop = mean(singer == "Dragons"))
## # A tibble: 2 x 2
##   sex    Dragons_prop
##   <fct>         <dbl>
## 1 female            0
## 2 male              1

# Calculate the observed difference in promotion rate
diff_orig <- 
  music %>%
  # Group by sex
  group_by(sex) %>%
  # Calculate proportion that prefer Dragons by sex
  summarise(prop_prom = mean(singer == "Dragons")) %>%
  # Calculate difference
  summarise(stat = diff(prop_prom)) %>% 
  pull()
    
# See the result
diff_orig # male - female
## [1] 1

Interpretation

Randomizing gender differences

Is 100% a large enough difference to conclude that male students are more likely to prefer Imagine Dragons? Or is a 100% difference is something we can see by chance?

Note that the distribution varies each time you run the program because music_perm is randomly permuted.

# Create data frame of permuted differences in promotion rates
music_perm <- music %>%
  # Specify variables: singer (response variable) and sex (explanatory variable)
  specify(singer ~ sex, success = "Dragons") %>%
  # Set null hypothesis as independence: there is no gender musicrimination
  hypothesize(null = "independence") %>%
  # Shuffle the response variable, singer, one thousand times
  generate(reps = 1000, type = "permute") %>% 
  # Calculate difference in proportion, male then female
  calculate(stat = "diff in props", order = c("male", "female")) # male - female
  
music_perm
## # A tibble: 1,000 x 2
##    replicate   stat
##        <int>  <dbl>
##  1         1  0.200
##  2         2 -0.200
##  3         3  0.200
##  4         4  0.200
##  5         5  0    
##  6         6  0    
##  7         7  0.6  
##  8         8  0.200
##  9         9 -0.200
## 10        10  0    
## # ... with 990 more rows

# Using permutation data, plot stat
ggplot(music_perm, aes(x = stat)) + 
  # Add a histogram layer
  geom_histogram(binwidth = 0.01) +
  # Using original data, add a vertical line at stat
  geom_vline(aes(xintercept = diff_orig), color = "red")

Interpretation (no need to revise)

Critical region

music_perm %>% 
  summarize(
    # Find the 0.9 quantile of diff_perm's stat
    q.90 = quantile(stat, p = 0.9),
    # ... and the 0.95 quantile
    q.95 = quantile(stat, p = 0.95),
    # ... and the 0.99 quantile
    q.99 = quantile(stat, p = 0.99)
  )
## # A tibble: 1 x 3
##    q.90  q.95  q.99
##   <dbl> <dbl> <dbl>
## 1 0.200  0.40   0.6

Interpretation

Calculating the p-values

A p-value measures the degree of disagreement between the data and the null hypothesis.

# Calculate the p-value for the original dataset
music_perm %>%
  get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

Interpretation