Suppose that you wanted to investigate whether there is a gender difference in musical preference among Plymouth State University students. To investigate this question, you took a small sample of PSU students. The sample data is stored in music.csv.
Import music.csv from Moodle under Date Files, and test the hypothesis as described as below.
Null hypothesis: gender and singer are unrelated variables.
Alternative hypothesis: male students are more likely to prefer Imagine Dragons.
# Load packages
library(dplyr)
library(ggplot2)
library(infer)
# Import data
music <- read.csv("/resources/rstudio/Business Statistics/music03.csv")
head(music)
## singer sex
## 1 Dragons male
## 2 Grande male
## 3 Grande female
## 4 Grande female
## 5 Grande male
## 6 Dragons male
music %>%
# Count the rows by singer and sex
count(sex, singer)
## # A tibble: 4 x 3
## sex singer n
## <fct> <fct> <int>
## 1 female Dragons 1
## 2 female Grande 3
## 3 male Dragons 14
## 4 male Grande 2
A/ 14 male students prefer imagine dragons 1 Female student prefer imagine dragons Sample includes 16 male students and 4 female students
Interpretation
No female students prefer Imagine Dragons. That is, all female students prefer Arianne Grande.
15 male students prefer Imagine Dragons, while only 3 male students prefer Arianne Grande.
The sample includes 18 male students and 5 female students.
# Find proportion of each sex who were Dragons
music %>%
# Group by sex
group_by(sex) %>%
# Calculate proportion Dragons summary stat
summarise(Dragons_prop = mean(singer == "Dragons"))
## # A tibble: 2 x 2
## sex Dragons_prop
## <fct> <dbl>
## 1 female 0.25
## 2 male 0.875
A/ 87.5% of the male students prefer Imagine Dragons as their favorite band.
Interpretation
No female students prefer Imagine Dragons.
83.3% of male students prefer Imagine Dragons.
The difference in proportions is 0.833 (male - female).
It means that male students are more likely to prefer Imangine Dragons than female students do by 83.3%.
A/ According to the sample, there is 16 males and 4 females, the difference observed is there is 12 males more than females
A/ There is more males than females in the sample, which could explain why more students prefer imagine dragons, since a 87.5% of the males prefer the band.
A/The negative difference observed of 0.312, indicates that there is very few chances of the hypothesis to be wrong, the original hypothesis of gender defines music preference, indicating that 312 out of of all the 1000 scenarios placed, were showing a negative result in relation to the hypothesis
# Calculate the observed difference in promotion rate
diff_orig <- music %>%
# Group by sex
group_by(sex) %>%
# Summarize to calculate fraction Dragons
summarise(prop_prom = mean(singer == "Dragons")) %>%
# Summarize to calculate difference
summarise(stat = diff(prop_prom)) %>%
pull()
# See the result
diff_orig # male - female
## [1] 0.625
# Create data frame of permuted differences in promotion rates
music_perm <- music %>%
# Specify variables: singer (response variable) and sex (explanatory variable)
specify(singer ~ sex, success = "Dragons") %>%
# Set null hypothesis as independence: there is no gender musicrimination
hypothesize(null = "independence") %>%
# Shuffle the response variable, singer, one thousand times
generate(reps = 1000, type = "permute") %>%
# Calculate difference in proportion, male then female
calculate(stat = "diff in props", order = c("male", "female")) # male - female
music_perm
## # A tibble: 1,000 x 2
## replicate stat
## <int> <dbl>
## 1 1 0
## 2 2 0
## 3 3 -0.312
## 4 4 0
## 5 5 -0.312
## 6 6 0
## 7 7 0
## 8 8 0
## 9 9 0.312
## 10 10 0
## # ... with 990 more rows
# Using permutation data, plot stat
ggplot(music_perm, aes(x = stat)) +
# Add a histogram layer
geom_histogram(binwidth = 0.01) +
# Using original data, add a vertical line at stat
geom_vline(aes(xintercept = diff_orig), color = "red")
# Calculate the p-value for the original dataset
music_perm %>%
get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.036
music_perm %>%
summarize(
# Find the 0.9 quantile of diff_perm's stat
q.90 = quantile(stat, p = 0.9),
# ... and the 0.95 quantile
q.95 = quantile(stat, p = 0.95),
# ... and the 0.99 quantile
q.99 = quantile(stat, p = 0.99)
)
## # A tibble: 1 x 3
## q.90 q.95 q.99
## <dbl> <dbl> <dbl>
## 1 0.312 0.312 0.625
Interpretation (no need to revise)
The distribution of null statistics (permuted differences) is centered around 0.
For example, the tallest bar in the center of the distribution indicates that difference of approximately 0 is the most likely be seen by chance (about 380 times of 1,000) if there were no gender difference.
A/
A/ The value of 0.034, indicates that only 4% of the permuted distribution is more extreme than the other difference, in other words, it would be unlikely to see difference by chance if there was no difference in gender.
A/ The P-value of 0.04, I reject the null hypothesis that gender and singer are unrelated at the significance level of 5% and conclude that men are more likely to prefer Imagine Dragons.
# Calculate the p-value for the original dataset
music_perm %>%
get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.036
Interpretation