Collaboration

Please indicate who you collaborated with on this problem set:

Background

First load the necessary packages:

library(tidyverse)
library(infer)

For this Problem Set you will work with some grade-point-average (GPA) data for college freshman. The following will read in the data:

sat_gpa <- read_csv("https://rudeboybert.github.io/SDS220/static/PS/sat_gpa.csv")

Use the View() function in the console to take a look at the data in the viewer. Each row or case in this data frame is a student. The data includes the (binary) gender of each student; the math, verbal and total SAT scores for each student; the GPA of each student in high school (categorized as “low” or “high”); and the GPA of each student their first year of college on a numeric scale.

We will use hypothesis testing to answer the following questions:

Is there a difference in male and female freshman GPAs?
Is there a difference in total SAT score for students with a “low” and “high” high-school GPA?

Note, if you get stuck as you are working through this, it will be helpful to go back and read Chapter 10 in ModernDive.

Question 1: Gender differences in first-year GPA?

Exploratory data analysis

Q1.a)

Calculate the mean GPA score for each gender, using the group_by and summarize commands from the dplyr package.

sat_gpa %>% 
  group_by(sex) %>% 
  summarize(mean(gpa_fy))

## # A tibble: 2 x 2
##   sex    `mean(gpa_fy)`
##   <chr>           <dbl>
## 1 Female           2.54
## 2 Male             2.40

Questions:

What is the difference in sample mean GPA’s?
Make a guess: is this difference statistically significant?

Answers:

1.There is a difference of 2.544 - 2.396 = 0.148 1. I don’t think this is a statistically significant difference.

Q1.b)

Generate a data visualization that displays the GPAs of the two groups. Be sure to include a title and label your axes.

ggplot(sat_gpa, aes(x = sex, y = gpa_fy))+ 
  geom_boxplot() +
  labs(title = 'Male and Female First-year GPAs', x = 'Gender', y = 'First-Year GPA')

Stating a null hypothesis

We will now test the null hypothesis that there’s no difference in population mean GPA between the genders at the population level. We can write this out in mathematical notation

\[\begin{aligned} H_0:&\mu_{male} = \mu_{female} \\\ \mbox{vs }H_A:& \mu_{male} \neq \mu_{female} \end{aligned}\]

or expressed differently, that the difference is 0 or not:

\[\begin{aligned} H_0:&\mu_{male} - \mu_{female} = 0 \\\ \mbox{vs }H_A:& \mu_{male} - \mu_{female} \neq 0 \end{aligned}\]

Testing the hypothesis

Here’s how we use infer to run this hypothesis test:

Step 1: Calculate the observed difference

Note that the order we choose does not matter here (female then male)…but since we used order = c("Female", "Male") here, we should do the same in subsequent calculations!

obs_diff_gpa_sex <- sat_gpa %>% 
  specify(gpa_fy ~ sex) %>% 
  calculate(stat = "diff in means", order = c("Female", "Male"))

obs_diff_gpa_sex

## # A tibble: 1 x 1
##    stat
##   <dbl>
## 1 0.149

Note that this is the difference in the group means we calculated earlier!

2.544587 - 2.396066

## [1] 0.148521

Step 2. Generate the null distribution of \(\delta\)

This step involves generating simulated values as if we lived in a world where there’s no difference between the two groups. Going back to the idea of permutation, and tactile sampling, this is akin to shuffling the GPA scores between male and female labels (i.e. removing the structure to the data) just as we could have done with index cards.

gpas_in_null_world <- sat_gpa %>% 
  specify(gpa_fy ~ sex) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 5000, type = "permute")

Q1.c)

Question:

What was the size of the “shuffled” (permuted) sample in each run?
How many times did we do a different “shuffle” (permute) here to the sample? How many rows are in the gpas_in_null_world data frame?

Answer:

1000
5000

Step 3. Calculate the differences between male and females under the null

The following calculates the differences in mean GPA for males and females for “shuffled” (permuted) data.

gpa_diff_under_null <- gpas_in_null_world %>% 
  calculate(stat = "diff in means", order = c("Female", "Male")) 

gpa_diff_under_null %>% 
  slice(1:5)

## # A tibble: 5 x 2
##   replicate     stat
##       <int>    <dbl>
## 1         1 -0.0273 
## 2         2 -0.0598 
## 3         3  0.0168 
## 4         4  0.0161 
## 5         5 -0.00800

Q1.d)

Question:

How many rows are in this data frame? Why?

Answer:

In this data frame, slice() has been used on the 5000 rows to show only the first 5 rows.

Step 4. Visualize how the observed difference compares to the null distribution of \(\delta\)

The following plots the \(\delta\) values we calculated for each of the different “shuffled” replicates. This is the null distribution of \(\delta\). The red line shows the observed difference between male and female scores in the data (-0.1485209) from step 1.

gpa_diff_under_null %>% 
  visualize(obs_stat = obs_diff_gpa_sex) +
  labs(x = "Difference in mean GPA for males and females", y = "Count",
       title = "Null distribution of differences in male and female GPAs",
      subtitle = "Actual difference observed in the data is marked in red")

Note that zero is the center of this null distribution. The null hypothesis is that there is no difference between males and females in GPA score. In the permutations, zero was the most common value, because all structure was removed from the data…GPA values were sorted into male and female at random. Values as large as ~ 0.1 and -0.1 occurred, but much less frequently, because they are just not as likely when structure is removed from the data.

Step 5: Calculate a p-value

gpa_diff_under_null %>% 
  get_pvalue(obs_stat = obs_diff_gpa_sex, direction = "both")

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1  0.0016

This result indicates that there is a 0.1% chance (very low) chance that we would see a difference of 0.15 in GPA scores between males and females (or a bigger difference) if in fact there was truly no difference between the sexes in GPA scores at the population level.

Q1.e)

Fill in the blanks below to write up the results & conclusions for this test:

The mean GPA scores for females in our sample (\(\bar{x}\) = 2.544) was greater than that of males (\(\bar{x}\) = 2.396). This difference was statistically significant at \(\alpha = 0.05\), (p = 0.0012). Given this, I would reject the Null hypothesis and conclude that femails have higher GPAs than males at the population level.

Step 6: Calculate a confidence interval for the difference

The following will allow us to calculate a 95% confidence interval for the difference between mean GPA scores for males and females.

ci_diff_gpa_means <- sat_gpa %>% 
  specify(gpa_fy ~ sex) %>%
  generate(reps = 5000) %>% 
  calculate(stat = "diff in means", order = c("Female", "Male")) %>% 
  get_ci(level = 0.95)
ci_diff_gpa_means

## # A tibble: 1 x 2
##   `2.5%` `97.5%`
##    <dbl>   <dbl>
## 1 0.0560   0.240

Complete all the above tasks with a t-test

Note that all the above steps can be done with one line of code if a slew of assumptions like normality and equal variance of the groups are met.

t.test(gpa_fy ~ sex, var.equal = TRUE, data = sat_gpa)

## 
##  Two Sample t-test
## 
## data:  gpa_fy by sex
## t = 3.1828, df = 998, p-value = 0.001504
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.05695029 0.24009148
## sample estimates:
## mean in group Female   mean in group Male 
##             2.544587             2.396066

Question 2: Relationship between Total SAT Score & High School GPA?

For this analysis sat_total is the outcome variable, and gpa_hs is the predictor variable, with two levels “low” and “high”.

Exploratory data analysis

We can first calculate the mean total SAT score for each group (i.e students with a low and high GPA), using the group_by and summarize commands from the dplyr package.

avg_sat_gpa <- sat_gpa %>% 
  group_by(gpa_hs) %>% 
  summarize(sat_total = mean(sat_total))

avg_sat_gpa

## # A tibble: 2 x 2
##   gpa_hs sat_total
##   <chr>      <dbl>
## 1 high       109. 
## 2 low         98.2

We will next generate a data visualization that displays the total SAT scores of the two groups. Be sure to include a title and label your axes.

ggplot(sat_gpa, aes(x = gpa_hs, y = sat_total)) + 
  geom_boxplot(fill = "light blue") + 
  labs(title = "SAT scores based on high school GPA scores", 
       x = "GPA ranking", y = "SAT score")

Stating a null hypothesis

Q2.a)

State the null hypothesis that you are testing (using either words or symbols)

Answer:

The difference in GPA ranking (whether high or low) does not affect the SAT score

Testing the hypothesis

Q2.b)

Calculate the observed difference between the mean total SAT scores of the low and high GPA high-school students.

# you finish this code....
obs_diff_sat_hs_gpa <- sat_gpa %>% 
  specify(sat_total ~ gpa_hs) %>% 
  calculate(stat = "diff in means", order = c("high", "low"))

Q2.c)

Generate the null distribution of \(\delta\). Here you need to generate simulated values as if we lived in a world where there’s no difference in SAT scores between high school students with low and high GPAs.

# you finish this code....

sat_in_null_world <- sat_gpa %>% 
  specify(sat_total ~ gpa_hs) %>% 
  hypothesize(null = "independence") %>% 
  generate(reps = 5000, type = "permute")

Q2.d)

Calculate the differences in mean SAT scores between students with low and high GPA scores under the Null. Note…you should use whatever order you chose above…i.e. order = c("low", "high") or order = c("high", "low").

# you finish this code....

sat_diff_under_null <- sat_in_null_world %>% 
  calculate(stat = "diff in means", order = c("high", "low")) 

sat_diff_under_null %>% 
  slice(1:5)

## # A tibble: 5 x 2
##   replicate   stat
##       <int>  <dbl>
## 1         1  1.65 
## 2         2 -0.563
## 3         3 -0.743
## 4         4  0.190
## 5         5 -0.162

Q2.e)

Visualize how the observed difference compares to the null distribution of \(\delta\). Generate a histogram of the null distribution, with a vertical red line showing the observed difference in SAT scores between high school students with a high and low GPA.

# you finish this code....

sat_diff_under_null %>% 
  visualize(obs_stat = obs_diff_sat_hs_gpa) +
  labs(x = "Difference in SAT score for students with high and low GPA rankings", y = "Count",
       title = "Null distribution of differences in SAT scores",
      subtitle = "Actual difference observed in the data is marked in red")

Q2.f)

Calculate a p-value

Answer:

sat_diff_under_null %>% 
  get_pvalue(obs_stat = obs_diff_sat_hs_gpa, direction = "both")

## # A tibble: 1 x 1
##   p_value
##     <dbl>
## 1       0

Q2.g)

Write up the results & conclusions for this hypothesis test. Note, p-values less than 0.001 are often reported as p < 0.001.

Answer:

The mean SAT scores for students with high GPA rankings in our sample (\(\bar{x}\) = 108) was greater than those with low GPA rankings (\(\bar{x}\) = 98). This difference was statistically significant at \(\alpha = 0.05\), (p < 0.001). Given this I would reject the Null hypothesis and conclude that those with high GPA rankings have higher SAT scores than those with low GPA rankings at the population level.

Q2.h)

Calculate a confidence interval for the difference in total SAT scores for students with high and low high-school GPA scores. Note…you should use whatever order you chose above…i.e. order = c("low", "high") or order = c("high", "low").

# you finish this code....

ci_diff_sat_means <- sat_gpa %>% 
  specify(sat_total ~ gpa_hs) %>%
  generate(reps = 5000) %>% 
  calculate(stat = "diff in means", order = c("high", "low")) %>% 
  get_ci(level = 0.95)
ci_diff_gpa_means

## # A tibble: 1 x 2
##   `2.5%` `97.5%`
##    <dbl>   <dbl>
## 1 0.0560   0.240

Q2.i)

Use a t-test to test the null hypothesis that total SAT scores do not differ between students with high and low high school GPA scores at the population level.

t.test(sat_total ~ gpa_hs, var.equal = TRUE, data = sat_gpa)

## 
##  Two Sample t-test
## 
## data:  sat_total by gpa_hs
## t = 12.413, df = 998, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   8.79614 12.09948
## sample estimates:
## mean in group high  mean in group low 
##          108.67828           98.23047