Suppose that you wanted to investigate whether there is a gender difference in musical preference among Plymouth State University students. To investigate this question, you took a small sample of PSU students. The sample data is stored in music.csv.
Import music.csv from Moodle under Date Files, and test the hypothesis as described as below.
# Load packages
library(dplyr)
library(ggplot2)
library(infer)
# Import data
music <- read.csv("/resources/rstudio/BusinessStatistics/Data/music (1).csv")
head(music)
## singer sex
## 1 Dragons male
## 2 Dragons male
## 3 Dragons male
## 4 Dragons male
## 5 Dragons male
## 6 Dragons male
A: 12 male students reported to prefering Imagine Dragons.
music %>%
# Count the rows by singer and sex
count(sex, singer)
## # A tibble: 4 x 3
## sex singer n
## <fct> <fct> <int>
## 1 female Dragons 4
## 2 female Grande 10
## 3 male Dragons 12
## 4 male Grande 3
Interpretation
A: 80% of male students reported to prefering Imagine Dragons.
# Find proportion of each sex who were Dragons
music %>%
# Group by sex
group_by(sex) %>%
# Calculate proportion Dragons summary stat
summarise(Dragons_prop = mean(singer == "Dragons"))
## # A tibble: 2 x 2
## sex Dragons_prop
## <fct> <dbl>
## 1 female 0.286
## 2 male 0.8
A: The observed difference in the proportions is 80% - 28.6% = 51.4%
A: The difference of proportions means that 51.4% of the combined students prefer Imagine Dragons to Arianna Grande.
A: A negative difference would mean the distribution of null statistics (permuted differences) is centered around 0. For example, the tallest bar in the center of the distribution indicates that difference of approximately 0 is the most likely be seesn by chance (about 380 times of 1,000) if there were no gender difference.
# Calculate the observed difference in promotion rate
diff_orig <- music %>%
# Group by sex
group_by(sex) %>%
# Summarize to calculate fraction Dragons
summarise(prop_prom = mean(singer == "Dragons")) %>%
# Summarize to calculate difference
summarise(stat = diff(prop_prom)) %>%
pull()
# See the result
diff_orig # male - female
## [1] 0.5142857
# Create data frame of permuted differences in promotion rates
music_perm <- music %>%
# Specify variables: singer (response variable) and sex (explanatory variable)
specify(singer ~ sex, success = "Dragons") %>%
# Set null hypothesis as independence: there is no gender musicrimination
hypothesize(null = "independence") %>%
# Shuffle the response variable, singer, one thousand times
generate(reps = 1000, type = "permute") %>%
# Calculate difference in proportion, male then female
calculate(stat = "diff in props", order = c("male", "female")) # male - female
music_perm
## # A tibble: 1,000 x 2
## replicate stat
## <int> <dbl>
## 1 1 -0.0381
## 2 2 -0.0381
## 3 3 -0.0381
## 4 4 -0.0381
## 5 5 0.100
## 6 6 -0.0381
## 7 7 0.514
## 8 8 0.238
## 9 9 -0.176
## 10 10 -0.452
## # ... with 990 more rows
# Using permutation data, plot stat
ggplot(music_perm, aes(x = stat)) +
# Add a histogram layer
geom_histogram(binwidth = 0.01) +
# Using original data, add a vertical line at stat
geom_vline(aes(xintercept = diff_orig), color = "red")
A: The p-value is 0.012 or 1.2%
A: A value of 0.01, for example, indicates that only 1% of the permuted distribution (null statistics) is more extreme than the observed difference. Thus, we reject the null hypothesis that gender and singer are unrelated at the significance level of 5% and conclude that men are more likely to prefer Imagine Dragons.
# Calculate the p-value for the original dataset
music_perm %>%
get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.013