Suppose that you wanted to investigate whether there is a gender difference in musical preference among Plymouth State University students. To investigate this question, you took a small sample of PSU students. The sample data is stored in music.csv.

Import music.csv from Moodle under Data Files, and test the hypothesis as described below.

# Load packages
library(dplyr)
library(ggplot2)
library(infer)

# Import data
music <- read.csv("/resources/rstudio/BusinessStatistics/data/music (1).csv")
head(music)
##    singer  sex
## 1 Dragons male
## 2 Dragons male
## 3 Dragons male
## 4 Dragons male
## 5 Dragons male
## 6 Dragons male

Q1 How many male students in the sample reported to prefer Imagine Dragons?

Twelve male students in the sample reported preferring Imagine Dragons.

music %>%
  # Count the rows for each combination of sex and singer
  count(sex, singer)
## # A tibble: 4 x 3
##   sex    singer      n
##   <fct>  <fct>   <int>
## 1 female Dragons     4
## 2 female Grande     10
## 3 male   Dragons    12
## 4 male   Grande      3

Interpretation

Q2 What percentage of male students in the sample reported to prefer Imagine Dragons?

80% of the male students in the sample (12 of 15) reported preferring Imagine Dragons.

# Find the proportion of each sex who prefer Imagine Dragons
music %>%
  # Group by sex
  group_by(sex) %>%
  # Calculate the proportion preferring Dragons within each group
  summarise(Dragons_prop = mean(singer == "Dragons"))
## # A tibble: 2 x 2
##   sex    Dragons_prop
##   <fct>         <dbl>
## 1 female        0.286
## 2 male          0.8

Interpretation

Q3 What is the observed difference in the proportions between male and female (male - female) students in the sample?

The observed difference (male - female) is 0.514, or 51.4 percentage points (0.800 - 0.286).
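
As a quick arithmetic check, the difference can be recomputed by hand from the four counts reported in the count(sex, singer) output. This is only a sketch: the object names (n_female, n_male, diff_check) are made up for illustration, and the counts are typed in rather than read from music.csv.

```r
# Counts copied from the count(sex, singer) output (typed in by hand, not read from the file)
n_female <- c(Dragons = 4, Grande = 10)
n_male   <- c(Dragons = 12, Grande = 3)

# Proportion of each sex preferring Imagine Dragons
prop_female <- n_female[["Dragons"]] / sum(n_female)  # 4 / 14
prop_male   <- n_male[["Dragons"]] / sum(n_male)      # 12 / 15

# Observed difference in proportions, male - female
diff_check <- prop_male - prop_female
diff_check
## [1] 0.5142857
```

The result agrees with the two proportions reported for Q2 (0.8 and 0.286).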

Q4 What does this mean?

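This means that, in the sample, the proportion of male students who prefer Imagine Dragons (80.0%) is 51.4 percentage points higher than the proportion of female students who do (28.6%). It describes the sample only, not all PSU students.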

Q5 There might be a few negative permuted differences. What would a negative difference mean?

A negative difference would mean that, in that permuted sample, a larger proportion of female students than male students preferred Imagine Dragons.

# Calculate the observed difference in the proportion preferring Dragons
diff_orig <- music %>%
  # Group by sex
  group_by(sex) %>%
  # Summarize to calculate the proportion preferring Dragons
  summarise(prop_dragons = mean(singer == "Dragons")) %>%
  # Summarize to calculate the difference (second row minus first: male - female)
  summarise(stat = diff(prop_dragons)) %>%
  pull()
    
# See the result
diff_orig # male - female
## [1] 0.5142857

# Create data frame of permuted differences in proportions
music_perm <- music %>%
  # Specify variables: singer (response variable) and sex (explanatory variable)
  specify(singer ~ sex, success = "Dragons") %>%
  # Set null hypothesis as independence: singer preference does not depend on sex
  hypothesize(null = "independence") %>%
  # Shuffle the response variable, singer, one thousand times
  generate(reps = 1000, type = "permute") %>%
  # Calculate difference in proportion, male then female
  calculate(stat = "diff in props", order = c("male", "female")) # male - female
  
music_perm
## # A tibble: 1,000 x 2
##    replicate    stat
##        <int>   <dbl>
##  1         1  0.376 
##  2         2 -0.0381
##  3         3 -0.176 
##  4         4 -0.0381
##  5         5  0.100 
##  6         6 -0.0381
##  7         7 -0.314 
##  8         8  0.238 
##  9         9  0.100 
## 10        10  0.376 
## # ... with 990 more rows
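
To see what the permutation step does, the same idea can be sketched in base R without infer. This is an illustration under assumptions, not the infer implementation: the data frame music_sketch is rebuilt from the four counts reported earlier instead of being read from music.csv, the helper diff_in_props() is hypothetical, and the seed is arbitrary, so the individual values will not match the output above.

```r
# Rebuild the sample from the reported counts (assumption: row order does not matter)
cell_n <- c(4, 10, 12, 3)
music_sketch <- data.frame(
  sex    = rep(c("female", "female", "male", "male"), times = cell_n),
  singer = rep(c("Dragons", "Grande", "Dragons", "Grande"), times = cell_n)
)

# Hypothetical helper: difference in the proportion preferring Dragons, male - female
diff_in_props <- function(singer, sex) {
  mean(singer[sex == "male"] == "Dragons") -
    mean(singer[sex == "female"] == "Dragons")
}

set.seed(1)  # arbitrary seed so the sketch is reproducible

# Shuffle the singer column 1,000 times, keeping sex fixed, and recompute the statistic
perm_stats <- replicate(
  1000,
  diff_in_props(sample(music_sketch$singer), music_sketch$sex)
)

# Smallest and largest permuted differences
range(perm_stats)
```

Shuffling singer while leaving sex in place breaks any link between the two variables, which is what the independence null hypothesis assumes.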

# Using permutation data, plot stat
ggplot(music_perm, aes(x = stat)) + 
  # Add a histogram layer
  geom_histogram(binwidth = 0.01) +
  # Using original data, add a vertical line at stat
  geom_vline(aes(xintercept = diff_orig), color = "red")

Interpretation (no need to revise)

Q6 What is the calculated p-value? Interpret.

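The calculated p-value is 0.004. If musical preference were independent of sex, only about 0.4% of the 1,000 permuted samples would show a difference in proportions (male - female) at least as large as the observed 0.514. The exact value varies slightly from run to run because the permutations are random.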

Q7 Based on the p-value you interpreted in Q6, would you reject the null hypothesis at the standard 5% significance level and accept the alternative hypothesis that male students are more likely to prefer Imagine Dragons?

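Yes. Because the p-value is well below 0.05, the null hypothesis of no difference between male and female students is rejected at the 5% significance level, and the data support the alternative that male students are more likely to prefer Imagine Dragons.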

# Calculate the p-value for the original dataset
music_perm %>%
  get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.004
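
A one-sided permutation p-value of this kind is, in essence, the share of permuted statistics that are at least as large as the observed one. The sketch below computes that share by hand in base R. It rests on assumptions: the vectors are rebuilt from the reported counts rather than read from music.csv, the names (diff_props, perm, p_manual) are made up for illustration, and the seed is arbitrary, so the result will be close to, but not necessarily equal to, the value reported above.

```r
# Rebuild the two columns from the reported counts (14 female, 15 male)
sex    <- rep(c("female", "male"), times = c(14, 15))
singer <- c(rep(c("Dragons", "Grande"), times = c(4, 10)),   # female rows
            rep(c("Dragons", "Grande"), times = c(12, 3)))   # male rows

# Hypothetical helper: difference in the proportion preferring Dragons, male - female
diff_props <- function(s) {
  mean(s[sex == "male"] == "Dragons") - mean(s[sex == "female"] == "Dragons")
}

obs <- diff_props(singer)  # observed difference

set.seed(2)  # arbitrary seed
perm <- replicate(1000, diff_props(sample(singer)))

# Share of permuted differences at least as large as the observed one
# (the small tolerance guards against floating-point ties)
p_manual <- mean(perm >= obs - 1e-12)
p_manual
```

Because only 1,000 shuffles are drawn, the estimate moves a little between runs; increasing the number of replicates makes it more stable.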

Interpretation