Please indicate who you collaborated with on this problem set:
First load the necessary packages:
library(tidyverse)
library(infer)
For this Problem Set you will work with some grade-point-average (GPA) data for college freshman. The following will read in the data:
sat_gpa <- read_csv("https://rudeboybert.github.io/SDS220/static/PS/sat_gpa.csv")
Use the View() function in the console to take a look at the data in the viewer. Each row or case in this data frame is a student. The data includes the (binary) gender of each student; the math, verbal and total SAT scores for each student; the GPA of each student in high school (categorized as “low” or “high”); and the GPA of each student their first year of college on a numeric scale.
We will use hypothesis testing to answer the following questions:
Note, if you get stuck as you are working through this, it will be helpful to go back and read Chapter 10 in ModernDive.
Calculate the mean GPA score for each gender, using the group_by and summarize commands from the dplyr package.
sat_gpa %>%
group_by(sex) %>%
summarize(mean(gpa_fy))
## # A tibble: 2 x 2
## sex `mean(gpa_fy)`
## <chr> <dbl>
## 1 Female 2.54
## 2 Male 2.40
Questions:
Answers:
1.There is a difference of 2.544 - 2.396 = 0.148 1. I don’t think this is a statistically significant difference.
Generate a data visualization that displays the GPAs of the two groups. Be sure to include a title and label your axes.
ggplot(sat_gpa, aes(x = sex, y = gpa_fy))+
geom_boxplot() +
labs(title = 'Male and Female First-year GPAs', x = 'Gender', y = 'First-Year GPA')
We will now test the null hypothesis that there’s no difference in population mean GPA between the genders at the population level. We can write this out in mathematical notation
\[\begin{aligned} H_0:&\mu_{male} = \mu_{female} \\\ \mbox{vs }H_A:& \mu_{male} \neq \mu_{female} \end{aligned}\]
or expressed differently, that the difference is 0 or not:
\[\begin{aligned} H_0:&\mu_{male} - \mu_{female} = 0 \\\ \mbox{vs }H_A:& \mu_{male} - \mu_{female} \neq 0 \end{aligned}\]
Here’s how we use infer to run this hypothesis test:
Note that the order we choose does not matter here (female then male)…but since we used order = c("Female", "Male") here, we should do the same in subsequent calculations!
obs_diff_gpa_sex <- sat_gpa %>%
specify(gpa_fy ~ sex) %>%
calculate(stat = "diff in means", order = c("Female", "Male"))
obs_diff_gpa_sex
## # A tibble: 1 x 1
## stat
## <dbl>
## 1 0.149
Note that this is the difference in the group means we calculated earlier!
2.544587 - 2.396066
## [1] 0.148521
This step involves generating simulated values as if we lived in a world where there’s no difference between the two groups. Going back to the idea of permutation, and tactile sampling, this is akin to shuffling the GPA scores between male and female labels (i.e. removing the structure to the data) just as we could have done with index cards.
gpas_in_null_world <- sat_gpa %>%
specify(gpa_fy ~ sex) %>%
hypothesize(null = "independence") %>%
generate(reps = 5000, type = "permute")
Question:
gpas_in_null_world data frame?Answer:
The following calculates the differences in mean GPA for males and females for “shuffled” (permuted) data.
gpa_diff_under_null <- gpas_in_null_world %>%
calculate(stat = "diff in means", order = c("Female", "Male"))
gpa_diff_under_null %>%
slice(1:5)
## # A tibble: 5 x 2
## replicate stat
## <int> <dbl>
## 1 1 -0.0273
## 2 2 -0.0598
## 3 3 0.0168
## 4 4 0.0161
## 5 5 -0.00800
Question:
Answer:
The following plots the \(\delta\) values we calculated for each of the different “shuffled” replicates. This is the null distribution of \(\delta\). The red line shows the observed difference between male and female scores in the data (-0.1485209) from step 1.
gpa_diff_under_null %>%
visualize(obs_stat = obs_diff_gpa_sex) +
labs(x = "Difference in mean GPA for males and females", y = "Count",
title = "Null distribution of differences in male and female GPAs",
subtitle = "Actual difference observed in the data is marked in red")
Note that zero is the center of this null distribution. The null hypothesis is that there is no difference between males and females in GPA score. In the permutations, zero was the most common value, because all structure was removed from the data…GPA values were sorted into male and female at random. Values as large as ~ 0.1 and -0.1 occurred, but much less frequently, because they are just not as likely when structure is removed from the data.
gpa_diff_under_null %>%
get_pvalue(obs_stat = obs_diff_gpa_sex, direction = "both")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0.0016
This result indicates that there is a 0.1% chance (very low) chance that we would see a difference of 0.15 in GPA scores between males and females (or a bigger difference) if in fact there was truly no difference between the sexes in GPA scores at the population level.
Fill in the blanks below to write up the results & conclusions for this test:
The mean GPA scores for females in our sample (\(\bar{x}\) = 2.544) was greater than that of males (\(\bar{x}\) = 2.396). This difference was statistically significant at \(\alpha = 0.05\), (p = 0.0012). Given this, I would reject the Null hypothesis and conclude that femails have higher GPAs than males at the population level.
The following will allow us to calculate a 95% confidence interval for the difference between mean GPA scores for males and females.
ci_diff_gpa_means <- sat_gpa %>%
specify(gpa_fy ~ sex) %>%
generate(reps = 5000) %>%
calculate(stat = "diff in means", order = c("Female", "Male")) %>%
get_ci(level = 0.95)
ci_diff_gpa_means
## # A tibble: 1 x 2
## `2.5%` `97.5%`
## <dbl> <dbl>
## 1 0.0560 0.240
Note that all the above steps can be done with one line of code if a slew of assumptions like normality and equal variance of the groups are met.
t.test(gpa_fy ~ sex, var.equal = TRUE, data = sat_gpa)
##
## Two Sample t-test
##
## data: gpa_fy by sex
## t = 3.1828, df = 998, p-value = 0.001504
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.05695029 0.24009148
## sample estimates:
## mean in group Female mean in group Male
## 2.544587 2.396066
For this analysis sat_total is the outcome variable, and gpa_hs is the predictor variable, with two levels “low” and “high”.
We can first calculate the mean total SAT score for each group (i.e students with a low and high GPA), using the group_by and summarize commands from the dplyr package.
avg_sat_gpa <- sat_gpa %>%
group_by(gpa_hs) %>%
summarize(sat_total = mean(sat_total))
avg_sat_gpa
## # A tibble: 2 x 2
## gpa_hs sat_total
## <chr> <dbl>
## 1 high 109.
## 2 low 98.2
We will next generate a data visualization that displays the total SAT scores of the two groups. Be sure to include a title and label your axes.
ggplot(sat_gpa, aes(x = gpa_hs, y = sat_total)) +
geom_boxplot(fill = "light blue") +
labs(title = "SAT scores based on high school GPA scores",
x = "GPA ranking", y = "SAT score")
State the null hypothesis that you are testing (using either words or symbols)
Answer:
Calculate the observed difference between the mean total SAT scores of the low and high GPA high-school students.
# you finish this code....
obs_diff_sat_hs_gpa <- sat_gpa %>%
specify(sat_total ~ gpa_hs) %>%
calculate(stat = "diff in means", order = c("high", "low"))
Generate the null distribution of \(\delta\). Here you need to generate simulated values as if we lived in a world where there’s no difference in SAT scores between high school students with low and high GPAs.
# you finish this code....
sat_in_null_world <- sat_gpa %>%
specify(sat_total ~ gpa_hs) %>%
hypothesize(null = "independence") %>%
generate(reps = 5000, type = "permute")
Calculate the differences in mean SAT scores between students with low and high GPA scores under the Null. Note…you should use whatever order you chose above…i.e. order = c("low", "high") or order = c("high", "low").
# you finish this code....
sat_diff_under_null <- sat_in_null_world %>%
calculate(stat = "diff in means", order = c("high", "low"))
sat_diff_under_null %>%
slice(1:5)
## # A tibble: 5 x 2
## replicate stat
## <int> <dbl>
## 1 1 1.65
## 2 2 -0.563
## 3 3 -0.743
## 4 4 0.190
## 5 5 -0.162
Visualize how the observed difference compares to the null distribution of \(\delta\). Generate a histogram of the null distribution, with a vertical red line showing the observed difference in SAT scores between high school students with a high and low GPA.
# you finish this code....
sat_diff_under_null %>%
visualize(obs_stat = obs_diff_sat_hs_gpa) +
labs(x = "Difference in SAT score for students with high and low GPA rankings", y = "Count",
title = "Null distribution of differences in SAT scores",
subtitle = "Actual difference observed in the data is marked in red")
Calculate a p-value
Answer:
sat_diff_under_null %>%
get_pvalue(obs_stat = obs_diff_sat_hs_gpa, direction = "both")
## # A tibble: 1 x 1
## p_value
## <dbl>
## 1 0
Write up the results & conclusions for this hypothesis test. Note, p-values less than 0.001 are often reported as p < 0.001.
Answer:
Calculate a confidence interval for the difference in total SAT scores for students with high and low high-school GPA scores. Note…you should use whatever order you chose above…i.e. order = c("low", "high") or order = c("high", "low").
# you finish this code....
ci_diff_sat_means <- sat_gpa %>%
specify(sat_total ~ gpa_hs) %>%
generate(reps = 5000) %>%
calculate(stat = "diff in means", order = c("high", "low")) %>%
get_ci(level = 0.95)
ci_diff_gpa_means
## # A tibble: 1 x 2
## `2.5%` `97.5%`
## <dbl> <dbl>
## 1 0.0560 0.240
Use a t-test to test the null hypothesis that total SAT scores do not differ between students with high and low high school GPA scores at the population level.
t.test(sat_total ~ gpa_hs, var.equal = TRUE, data = sat_gpa)
##
## Two Sample t-test
##
## data: sat_total by gpa_hs
## t = 12.413, df = 998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 8.79614 12.09948
## sample estimates:
## mean in group high mean in group low
## 108.67828 98.23047