Suppose that you wanted to investigate whether there is a gender difference in musical preference among Plymouth State University students. To investigate this question, you took a small sample of PSU students. The sample data is stored in music.csv.
Import music.csv from Moodle under Date Files, and test the hypothesis as described as below.
# Load packages
library(dplyr)
library(ggplot2)
library(infer)
# Import data
music <- read.csv("/resources/rstudio/Business Statistics/data/music.csv")
head(music)
## singer sex
## 1 Dragons male
## 2 Dragons male
## 3 Dragons male
## 4 Dragons male
## 5 Dragons male
## 6 Dragons male
12 male students preffered imagine dragons, while only 3 male students prefered Ariana Grande
music %>%
# Count the rows by singer and sex
count(sex, singer)
## # A tibble: 4 x 3
## sex singer n
## <fct> <fct> <int>
## 1 female Dragons 4
## 2 female Grande 10
## 3 male Dragons 12
## 4 male Grande 3
Interpretation
80% of male students prefer Imagine Dragons.
# Find proportion of each sex who were Dragons
music %>%
# Group by sex
group_by(sex) %>%
# Calculate proportion Dragons summary stat
summarise(Dragons_prop = mean(singer == "Dragons"))
## # A tibble: 2 x 2
## sex Dragons_prop
## <fct> <dbl>
## 1 female 0.286
## 2 male 0.8
Interpretation
The observed difference in proportions between male and female is .514 or 51.4%
It means that male students are more likely to prefer Imangine Dragons than female students do by 51.4%
The distribution of null statistics is centered around 0. The tallest bar in the center of the distribution indicates that difference of approximately 0 is the most likely be seesn by chance (about 380 times of 1,000) if there were no gender difference.
# Calculate the observed difference in promotion rate
diff_orig <- music %>%
# Group by sex
group_by(sex) %>%
# Summarize to calculate fraction Dragons
summarise(prop_prom = mean(singer == "Dragons")) %>%
# Summarize to calculate difference
summarise(stat = diff(prop_prom)) %>%
pull()
# See the result
diff_orig # male - female
## [1] 0.5142857
# Create data frame of permuted differences in promotion rates
music_perm <- music %>%
# Specify variables: singer (response variable) and sex (explanatory variable)
specify(singer ~ sex, success = "Dragons") %>%
# Set null hypothesis as independence: there is no gender musicrimination
hypothesize(null = "independence") %>%
# Shuffle the response variable, singer, one thousand times
generate(reps = 1000, type = "permute") %>%
# Calculate difference in proportion, male then female
calculate(stat = "diff in props", order = c("male", "female")) # male - female
music_perm
## # A tibble: 1,000 x 2
## replicate stat
## <int> <dbl>
## 1 1 0.100
## 2 2 -0.314
## 3 3 0.376
## 4 4 0.100
## 5 5 -0.0381
## 6 6 -0.0381
## 7 7 -0.176
## 8 8 -0.0381
## 9 9 0.238
## 10 10 -0.0381
## # ... with 990 more rows
# Using permutation data, plot stat
ggplot(music_perm, aes(x = stat)) +
# Add a histogram layer
geom_histogram(binwidth = 0.01) +
# Using original data, add a vertical line at stat
geom_vline(aes(xintercept = diff_orig), color = "red")
Interpretation (no need to revise)
The calculated p-value is .005 ## Q7 Based on the p-value you interpreted in Q6, would you reject the null hypothesis at the standard 5% significance level and accept the alternative hypothesis that male students are more likely to prefer Imagine Dragons?
A value of 0.005, indicates that only a 0.5% of the permuted distribution is more extreme than the observed difference. We reject the null hypothesis that gender and singer are unrelated at the significance level of 5% and conclude that men are more likely to prefer Imagine Dragons because .5% is les than 5%.
# Calculate the p-value for the original dataset
music_perm %>%
get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.005
Interpretation