Suppose that you wanted to investigate whether there is a gender difference in musical preference among Plymouth State University students. To investigate this question, you took a small sample of PSU students. The sample data is stored in music.csv.

Import music.csv from Moodle under Date Files, and test the hypothesis as described as below.


# Load packages

library(dplyr)

library(ggplot2)

library(infer)

# Import data

music <- read.csv("/resources/rstudio/Business Statistics/music03.csv")

head(music)
##    singer    sex
## 1 Dragons   male
## 2  Grande   male
## 3  Grande female
## 4  Grande female
## 5  Grande   male
## 6 Dragons   male

Q1 How many male students in the sample reported to prefer Imagine Dragons?


music %>%

  # Count the rows by singer and sex

  count(sex, singer)
## # A tibble: 4 x 3
##   sex    singer      n
##   <fct>  <fct>   <int>
## 1 female Dragons     1
## 2 female Grande      3
## 3 male   Dragons    14
## 4 male   Grande      2

A/ 14 male students prefer imagine dragons 1 Female student prefer imagine dragons Sample includes 16 male students and 4 female students

Interpretation

Q2 What percentage of male students in the sample reported to prefer Imagine Dragons?


# Find proportion of each sex who were Dragons

music %>%

  # Group by sex

  group_by(sex) %>%

  # Calculate proportion Dragons summary stat
  
  summarise(Dragons_prop = mean(singer == "Dragons"))
## # A tibble: 2 x 2
##   sex    Dragons_prop
##   <fct>         <dbl>
## 1 female        0.25 
## 2 male          0.875

A/ 87.5% of the male students prefer Imagine Dragons as their favorite band.

Interpretation

Q3 What is the observed difference in the proportions between male and female (male - female) students in the sample?

A/ According to the sample, there is 16 males and 4 females, the difference observed is there is 12 males more than females

Q4 What does this mean?

A/ There is more males than females in the sample, which could explain why more students prefer imagine dragons, since a 87.5% of the males prefer the band.

Q5 There might be a few negative permuted differences. What would a negative difference mean?

A/The negative difference observed of 0.312, indicates that there is very few chances of the hypothesis to be wrong, the original hypothesis of gender defines music preference, indicating that 312 out of of all the 1000 scenarios placed, were showing a negative result in relation to the hypothesis


# Calculate the observed difference in promotion rate

diff_orig <- music %>%

  # Group by sex

  group_by(sex) %>%

  # Summarize to calculate fraction Dragons

  summarise(prop_prom = mean(singer == "Dragons")) %>%

  # Summarize to calculate difference

  summarise(stat = diff(prop_prom)) %>% 

  pull()

    

# See the result
diff_orig # male - female
## [1] 0.625


# Create data frame of permuted differences in promotion rates

music_perm <- music %>%

  # Specify variables: singer (response variable) and sex (explanatory variable)

  specify(singer ~ sex, success = "Dragons") %>%

  # Set null hypothesis as independence: there is no gender musicrimination

  hypothesize(null = "independence") %>%

  # Shuffle the response variable, singer, one thousand times

  generate(reps = 1000, type = "permute") %>%

  # Calculate difference in proportion, male then female

  calculate(stat = "diff in props", order = c("male", "female")) # male - female

  

music_perm
## # A tibble: 1,000 x 2
##    replicate   stat
##        <int>  <dbl>
##  1         1  0    
##  2         2  0    
##  3         3 -0.312
##  4         4  0    
##  5         5 -0.312
##  6         6  0    
##  7         7  0    
##  8         8  0    
##  9         9  0.312
## 10        10  0    
## # ... with 990 more rows


# Using permutation data, plot stat

ggplot(music_perm, aes(x = stat)) + 

  # Add a histogram layer

  geom_histogram(binwidth = 0.01) +

  # Using original data, add a vertical line at stat

  geom_vline(aes(xintercept = diff_orig), color = "red")




# Calculate the p-value for the original dataset

music_perm %>%

  get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.036



music_perm %>% 
  summarize(
    # Find the 0.9 quantile of diff_perm's stat
    q.90 = quantile(stat, p = 0.9),
    # ... and the 0.95 quantile
    q.95 = quantile(stat, p = 0.95),
    # ... and the 0.99 quantile
    q.99 = quantile(stat, p = 0.99)
  )
## # A tibble: 1 x 3
##    q.90  q.95  q.99
##   <dbl> <dbl> <dbl>
## 1 0.312 0.312 0.625

Interpretation (no need to revise)

A/

Q6 What is the calculated p-value? Interpret.

A/ The value of 0.034, indicates that only 4% of the permuted distribution is more extreme than the other difference, in other words, it would be unlikely to see difference by chance if there was no difference in gender.

Q7 Based on the p-value you interpreted in Q6, would you reject the null hypothesis at the standard 5% significance level and accept the alternative hypothesis that male students are more likely to prefer Imagine Dragons?

A/ The P-value of 0.04, I reject the null hypothesis that gender and singer are unrelated at the significance level of 5% and conclude that men are more likely to prefer Imagine Dragons.


# Calculate the p-value for the original dataset

music_perm %>%

  get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.036

Interpretation