Quiz on Hypothesis Testing

Q1 How many male students in the sample reported to prefer Imagine Dragons?
Q2 What percentage of male students in the sample reported to prefer Imagine Dragons?
Q3 What is the observed difference in the proportions between male and female (male - female) students in the sample?
Q4 What does this mean?
Q5 There might be a few negative permuted differences. What would a negative difference mean?
Q6 What is the calculated p-value? Interpret.
Q7 Based on the p-value you interpreted in Q6, would you reject the null hypothesis at the standard 5% significance level and accept the alternative hypothesis that male students are more likely to prefer Imagine Dragons?

Suppose that you wanted to investigate whether there is a gender difference in musical preference among Plymouth State University students. To investigate this question, you took a small sample of PSU students. The sample data is stored in music.csv.

Import music.csv from Moodle under Data Files, and test the hypotheses described below.

Null hypothesis: gender and singer are unrelated variables.
Alternative hypothesis: male students are more likely to prefer Imagine Dragons.

# Load packages
library(dplyr)
library(ggplot2)
library(infer)

# Import data
music <- read.csv("/resources/rstudio/business statistics/data/music.csv")
head(music)
##    singer  sex
## 1 Dragons male
## 2 Dragons male
## 3 Dragons male
## 4 Dragons male
## 5 Dragons male
## 6 Dragons male

Q1 How many male students in the sample reported to prefer Imagine Dragons?

music %>%
  # Count the rows by singer and sex
  count(sex, singer)
## # A tibble: 4 x 3
##   sex    singer      n
##   <fct>  <fct>   <int>
## 1 female Dragons     4
## 2 female Grande     10
## 3 male   Dragons    12
## 4 male   Grande      3

Answer: 12 of the 15 male students preferred Imagine Dragons.

Interpretation

4 female students prefer Imagine Dragons, while 10 female students prefer Ariana Grande.
12 male students prefer Imagine Dragons, while only 3 male students prefer Ariana Grande.
The sample includes 15 male students and 14 female students.

Q2 What percentage of male students in the sample reported to prefer Imagine Dragons?

# Find proportion of each sex who were Dragons
music %>%
  # Group by sex
  group_by(sex) %>%
  # Calculate proportion Dragons summary stat
  summarise(Dragons_prop = mean(singer == "Dragons"))
## # A tibble: 2 x 2
##   sex    Dragons_prop
##   <fct>         <dbl>
## 1 female        0.286
## 2 male          0.8

Answer: 0.8, or 80%, of male students preferred Imagine Dragons.

Interpretation

28.6% of female students prefer Imagine Dragons.
80% of male students prefer Imagine Dragons.
The difference in proportions is 0.514 (male - female).
It means that, in this sample, male students are more likely than female students to prefer Imagine Dragons, by 51.4 percentage points.

Q3 What is the observed difference in the proportions between male and female (male - female) students in the sample?

ANSWER: The difference in proportions is 0.514, or 51.4 percentage points (0.8 - 0.286 = 0.514).
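
The arithmetic can be checked directly from the counts reported in Q1. The following is a minimal base R sketch; the vector names (dragons, total, prop, diff_manual) are made up for illustration and the counts are typed in by hand rather than read from music.csv.

```r
# Counts taken from the count(sex, singer) output in Q1
dragons <- c(female = 4, male = 12)    # students preferring Imagine Dragons
total   <- c(female = 14, male = 15)   # all students of each sex (4 + 10, 12 + 3)

# Sample proportion preferring Imagine Dragons within each sex
prop <- dragons / total

# Observed difference in proportions, male minus female
diff_manual <- unname(prop["male"] - prop["female"])
diff_manual
```

The result should agree with the 0.514 computed from the rounded proportions.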

Q4 What does this mean?

In this sample, the proportion of male students who preferred Imagine Dragons (12 of 15) was 51.4 percentage points higher than the proportion of female students who did (4 of 14).

Q5 There might be a few negative permuted differences. What would a negative difference mean?

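ANSWER: A negative difference means that, in that permuted sample, a smaller proportion of male students than female students preferred Imagine Dragons. The distribution of null statistics (permuted differences) is centered around 0. For example, the tallest bar in the center of the distribution indicates that a difference of approximately 0 is the most likely to be seen by chance (about 380 times out of 1,000) if there were no gender difference.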
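
To see where negative differences come from, the shuffle can be sketched in base R without the infer package. This is only an illustration under stated assumptions: the data are rebuilt from the counts in Q1 (assumed to match music.csv), diff_props is a hypothetical helper, and the seed is arbitrary.

```r
# Rebuild the sample from the Q1 counts (assumed to match music.csv)
sex    <- rep(c("female", "male"), times = c(14, 15))
singer <- c(rep(c("Dragons", "Grande"), times = c(4, 10)),   # female rows
            rep(c("Dragons", "Grande"), times = c(12, 3)))   # male rows

# Hypothetical helper: difference in proportions preferring Dragons (male - female)
diff_props <- function(singer, sex) {
  mean(singer[sex == "male"] == "Dragons") -
    mean(singer[sex == "female"] == "Dragons")
}

set.seed(1)  # arbitrary seed, used only so the shuffles are reproducible

# Shuffle the singer labels 1,000 times, which breaks any link with sex
perm_stats <- replicate(1000, diff_props(sample(singer), sex))

# Share of shuffles in which female students came out ahead of male students
mean(perm_stats < 0)
```

Each shuffle hands the 16 "Dragons" labels out at random, so in some shuffles the female group happens to receive the larger share, which gives a negative male - female difference.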

# Calculate the observed difference in the proportion preferring Imagine Dragons
diff_orig <- music %>%
  # Group by sex
  group_by(sex) %>%
  # Summarize to calculate fraction Dragons
  summarise(prop_dragons = mean(singer == "Dragons")) %>%
  # Summarize to calculate difference (rows are ordered female, male, so diff() is male - female)
  summarise(stat = diff(prop_dragons)) %>%
  pull()

# See the result
diff_orig # male - female
## [1] 0.5142857

# Create data frame of permuted differences in the proportion preferring Imagine Dragons
music_perm <- music %>%
  # Specify variables: singer (response variable) and sex (explanatory variable)
  specify(singer ~ sex, success = "Dragons") %>%
  # Set null hypothesis as independence: there is no gender difference in musical preference
  hypothesize(null = "independence") %>%
  # Shuffle the response variable, singer, one thousand times
  generate(reps = 1000, type = "permute") %>%
  # Calculate difference in proportion, male then female
  calculate(stat = "diff in props", order = c("male", "female")) # male - female
  
music_perm
## # A tibble: 1,000 x 2
##    replicate    stat
##        <int>   <dbl>
##  1         1 -0.176 
##  2         2 -0.0381
##  3         3 -0.0381
##  4         4 -0.314 
##  5         5  0.100 
##  6         6 -0.0381
##  7         7  0.100 
##  8         8  0.100 
##  9         9  0.376 
## 10        10  0.100 
## # ... with 990 more rows

# Using permutation data, plot stat
ggplot(music_perm, aes(x = stat)) + 
  # Add a histogram layer
  geom_histogram(binwidth = 0.01) +
  # Using original data, add a vertical line at stat
  geom_vline(aes(xintercept = diff_orig), color = "red")

Interpretation (no need to revise)

The distribution of null statistics (permuted differences) is centered around 0.
For example, the tallest bar in the center of the distribution indicates that a difference of approximately 0 is the most likely to be seen by chance (about 380 times out of 1,000) if there were no gender difference.

Q6 What is the calculated p-value? Interpret.

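ANSWER: 0.012, or 1.2%. The exact value varies from run to run because the permutations are random; the run shown below gives 0.004.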

Q7 Based on the p-value you interpreted in Q6, would you reject the null hypothesis at the standard 5% significance level and accept the alternative hypothesis that male students are more likely to prefer Imagine Dragons?

ANSWER: A p-value of 0.012 indicates that only 1.2% of the permuted differences are at least as large as the observed difference. Because 0.012 is below 0.05, we reject the null hypothesis that gender and singer are unrelated at the 5% significance level and conclude that male students are more likely to prefer Imagine Dragons.

# Calculate the p-value for the original dataset
music_perm %>%
  get_p_value(obs_stat = diff_orig, direction = "greater")
## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1   0.004

Interpretation

A value of 0.002, for example, indicates that only 0.2% of the permuted distribution (null statistics) is at least as extreme as the observed difference. In other words, it would be highly unlikely to see the observed difference by chance if there were no difference across gender. Thus, we reject the null hypothesis that gender and singer are unrelated at the significance level of 5% and conclude that men are more likely to prefer Imagine Dragons.
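
Because the permutations are random, the simulated p-value changes a little from run to run. As a cross-check, the one-sided tail probability can also be computed exactly from the hypergeometric distribution, which is what the shuffles approximate. This is a sketch in base R, not part of the infer workflow above; the variable names are made up for illustration and the counts come from the Q1 output.

```r
# Counts from the count(sex, singer) output in Q1
n_dragons <- 4 + 12          # students preferring Imagine Dragons
n_grande  <- 10 + 3          # students preferring Ariana Grande
n_male    <- 12 + 3          # male students in the sample
obs_male_dragons <- 12       # male students preferring Imagine Dragons

# Under independence, the number of Dragons fans among the 15 male students
# follows a hypergeometric distribution; the one-sided p-value is P(K >= 12)
p_exact <- phyper(obs_male_dragons - 1, n_dragons, n_grande, n_male,
                  lower.tail = FALSE)
p_exact

# Decision at the 5% significance level (TRUE means reject the null hypothesis)
p_exact < 0.05
```

The exact value comes out at roughly 0.007, so simulated p-values from 1,000 shuffles are expected to scatter around that figure, and the decision at the 5% level stays the same.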

Quiz on Hypothesis Testing

Emily French
