Hyposthesis Testing

Summarizing gender split
Randomizing gender differences
Critical region
Calculating the p-values

The given data file contains a hypothetical scenario that there is a gender divide when it comes to their favorate singer: all male students prefer Imagine Dragons and all female students prefer Ariana Grande.

Null hypothesis: gender and favorate singer are unrelated variables.
Alternative hypothesis: : male students are more likely to prefer Imagine Dragons.

Summarizing gender split

# Load packages
library(tidyverse)
library(infer)

# Import data
music <- read.csv("~/R/business sat/DATA/music.csv") %>% as_tibble()

music %>%
  # Count the rows by singer and sex
  count(sex, singer)
## # A tibble: 2 x 3
##   sex    singer      n
##   <fct>  <fct>   <int>
## 1 female Grande     10
## 2 male   Dragons    10

Interpretation

The sample includes 10 male students and 10 female students.
All female students prefer Arianne Grande.
All male students prefer Imagine Dragons.

# Find proportion of each sex who were Dragons
music %>%
  # Group by sex
  group_by(sex) %>%
  # Calculate proportion that prefer Dragons
  summarise(Dragons_prop = mean(singer == "Dragons"))
## # A tibble: 2 x 2
##   sex    Dragons_prop
##   <fct>         <dbl>
## 1 female            0
## 2 male              1

# Calculate the observed difference in promotion rate
diff_orig <- 
  music %>%
  # Group by sex
  group_by(sex) %>%
  # Calculate proportion that prefer Dragons by sex
  summarise(prop_prom = mean(singer == "Dragons")) %>%
  # Calculate difference
  summarise(stat = diff(prop_prom)) %>% 
  pull()
    
# See the result
diff_orig # male - female
## [1] 1

Interpretation

0% of female students prefer Imagine Dragons.
100% of male students prefer Imagine Dragons.
The difference in proportions is 100% or 1 (male - female).
It means that male students are more likely to prefer Imangine Dragons than female students do by 100%.

Randomizing gender differences

Is 100% a large enough difference to conclude that male students are more likely to prefer Imagine Dragons? Or is a 100% difference is something we can see by chance?

Note that the distribution varies each time you run the program because music_perm is randomly permuted.

# Create data frame of permuted differences in promotion rates
music_perm <- music %>%
  # Specify variables: singer (response variable) and sex (explanatory variable)
  specify(singer ~ sex, success = "Dragons") %>%
  # Set null hypothesis as independence: there is no gender musicrimination
  hypothesize(null = "independence") %>%
  # Shuffle the response variable, singer, one thousand times
  generate(reps = 1000, type = "permute") %>% 
  # Calculate difference in proportion, male then female
  calculate(stat = "diff in props", order = c("male", "female")) # male - female
  
music_perm
## # A tibble: 1,000 x 2
##    replicate   stat
##        <int>  <dbl>
##  1         1  0.200
##  2         2 -0.200
##  3         3  0.200
##  4         4  0.200
##  5         5  0    
##  6         6  0    
##  7         7  0.6  
##  8         8  0.200
##  9         9 -0.200
## 10        10  0    
## # ... with 990 more rows

# Using permutation data, plot stat
ggplot(music_perm, aes(x = stat)) + 
  # Add a histogram layer
  geom_histogram(binwidth = 0.01) +
  # Using original data, add a vertical line at stat
  geom_vline(aes(xintercept = diff_orig), color = "red")

Interpretation (no need to revise)

The distribution of null statistics (permuted differences) is centered around 0.
For example, the tallest bar in the center of the distribution indicates that difference of approximately 0 is the most likely be seesn by chance (about 330 times of 1,000) if there were no gender difference.

Critical region

music_perm %>% 
  summarize(
    # Find the 0.9 quantile of diff_perm's stat
    q.90 = quantile(stat, p = 0.9),
    # ... and the 0.95 quantile
    q.95 = quantile(stat, p = 0.95),
    # ... and the 0.99 quantile
    q.99 = quantile(stat, p = 0.99)
  )
## # A tibble: 1 x 3
##    q.90  q.95  q.99
##   <dbl> <dbl> <dbl>
## 1 0.200  0.40   0.6

Interpretation

The 95% quantile is 0.40. This means that 95% of null statistics (permuted differences) are at 0.4 or below. It also means that our observed statistic of 1 is larger than 95% of null statistics. That is, we would only see 1 less than 5% of the times by chance if there were no gender difference. Thus, we reject the null hypothesis that gender and singer are unrelated at the significance level of 5% and conclude that men are more likely to prefer Imagine Dragons.
The 99% quantile is 0.40. This means that 99% of null statistics (permuted differences) are at 0.40 or below.

Calculating the p-values

A p-value measures the degree of disagreement between the data and the null hypothesis.

# Calculate the p-value for the original dataset
music_perm %>%
  get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

Interpretation

A value of 0, for example, indicates that only a 0% of the permuted distribution (null statistics) is more extreme than the observed difference. In other wods, it would be extemely unlikely (zero chance in this case) to see the observed difference by chance if there was no difference across gender.

Hyposthesis Testing

Niti Bista

Summarizing gender split

Randomizing gender differences

Critical region

Calculating the p-values