The data is stored in an .RData file, so the load() function is sufficient to read it.
load('admissions.RData')
head(admissions)
## admit gre gpa rank
## 1 0 380 3.61 3
## 2 1 660 3.67 3
## 3 1 800 4.00 1
## 4 1 640 3.19 4
## 5 0 520 2.93 4
## 6 1 760 3.00 2
The data represents a sample of 400 applications for a certain Master’s degree.
Description of variables:
admit: Binary variable, 1 means applicant was admitted, 0 means rejected.
gre: The applicant’s score on the Graduate Record Examinations (GRE).
gpa: The applicant’s grade point average.
rank: The rank of the school at which the applicant completed their bachelor’s degree.
Source: Empirical Data Analysis course at WU (2024)
First, let’s get rid of the variables which will not be relevant for the research question later.
# Keep only admit and gpa (columns 1 and 3)
admissions <- admissions[, c('admit', 'gpa')]
Next, let’s convert the admit variable to a factor.
admissions$admit <- factor(admissions$admit,
levels = c(0,1),
labels = c('No','Yes'))
Let’s check whether the two groups contain approximately the same number of observations.
table(admissions$admit)
##
## No Yes
## 273 127
We can see that there are many more rejected applications than accepted ones, so we resample the data to obtain two groups of equal size.
#Splitting the data by factor
admit <- admissions[admissions$admit == 'Yes',]
noadmit <- admissions[admissions$admit == 'No',]
#Setting seed for reproducibility
set.seed(12345)
#Sampling 50 rows from the admit data
df <- admit[sample(nrow(admit), 50, replace = FALSE),]
#Sampling 50 rows from the noadmit data and adding them to the data frame
df <- rbind(df,
noadmit[sample(nrow(noadmit), 50, replace = FALSE),])
#Removing temporary variables
rm(admit, noadmit, admissions)
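The balanced draw above can also be written as a single reusable step. The following is only a sketch: `sim` is a simulated stand-in with the same 273/127 split (its GPA values are made up), and the helper name `balanced_sample` is hypothetical.

```r
# Simulated stand-in for the admissions data (GPA values are made up)
set.seed(12345)
sim <- data.frame(
  admit = factor(rep(c("No", "Yes"), times = c(273, 127))),
  gpa   = round(runif(400, min = 2.2, max = 4), 2)
)

# Hypothetical helper: draw n rows without replacement from each level of `group`
balanced_sample <- function(data, group, n) {
  rows_by_group <- split(seq_len(nrow(data)), data[[group]])
  idx <- unlist(lapply(rows_by_group,
                       function(rows) sample(rows, n, replace = FALSE)))
  data[idx, , drop = FALSE]
}

balanced <- balanced_sample(sim, "admit", 50)
table(balanced$admit)  # group sizes after the draw
```

Splitting the row indices by group keeps the sampling logic in one place, so the group size can be changed with a single argument.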
Now we can show descriptive statistics.
library(psych)
describeBy(df$gpa, df$admit)
##
## Descriptive statistics by group
## group: No
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 3.28 0.4 3.3 3.27 0.48 2.42 4 1.58 0.1 -0.88 0.06
## ------------------------------------------------------------
## group: Yes
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 50 3.53 0.32 3.59 3.55 0.32 2.67 4 1.33 -0.37 -0.46 0.05
The average GPA of applicants who were rejected was 3.28, while that of applicants who were accepted was 3.53.
The standard deviation of GPA is 0.32 for admitted applicants and 0.40 for rejected applicants. Thus, GPA varies more in the not admitted group. (The standard deviations are directly comparable because both are measured in the same unit.)
We can also visualize the data.
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
##
## %+%, alpha
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
df %>% ggplot(aes(x = admit, y = gpa)) +
geom_boxplot()
We can see that the admitted group has an outlier with a much lower GPA than the rest of the group.
The interquartile range of the not admitted applicants’ GPAs is wider than that of the admitted applicants, which again reflects the larger variation in that group.
Research Question:
Is there a difference between the average GPA of admitted and not admitted students?
To investigate this research question we will need to use statistical hypothesis tests about the difference between two population arithmetic means for independent samples.
First, we check whether we should use a parametric or a non-parametric test. Four assumptions have to be fulfilled to use an independent samples t-test:
Variable is numeric
Data must come from two independent populations
Distribution of variable is normal in both populations.
Variable has the same variance in both populations. (If violated, use the Welch correction.)
Assumptions 1 and 2 are fulfilled since GPA is numeric and the admitted and not admitted applicants form two independent groups.
library(rstatix)
##
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
##
## filter
df %>%
group_by(admit) %>%
shapiro_test(gpa)
## # A tibble: 2 × 4
## admit variable statistic p
## <fct> <chr> <dbl> <dbl>
## 1 No gpa 0.971 0.265
## 2 Yes gpa 0.956 0.0626
The p-value is greater than 0.05 for both groups. Thus, we do not reject the null hypothesis of normality and treat GPA as normally distributed in both populations.
Therefore, assumption 3 is considered fulfilled.
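The same normality check can be done with base R alone, and a Q-Q plot per group is a useful visual complement to the test. The sketch below runs on simulated stand-in data (`sim`, made-up values with means and standard deviations loosely based on the descriptive statistics), not on the homework data.

```r
# Simulated stand-in for df (values are made up for illustration)
set.seed(1)
sim <- data.frame(
  admit = factor(rep(c("No", "Yes"), each = 50)),
  gpa   = c(rnorm(50, mean = 3.28, sd = 0.40),
            rnorm(50, mean = 3.53, sd = 0.32))
)

# Shapiro-Wilk p-value for each group using base R only
p_values <- tapply(sim$gpa, sim$admit, function(x) shapiro.test(x)$p.value)
p_values

# One Q-Q plot per group; points close to the line are consistent with normality
op <- par(mfrow = c(1, 2))
for (g in levels(sim$admit)) {
  x <- sim$gpa[sim$admit == g]
  qqnorm(x, main = paste("Q-Q plot:", g))
  qqline(x)
}
par(op)
```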
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
## The following object is masked from 'package:psych':
##
## logit
leveneTest(df$gpa, group = df$admit)
## Levene's Test for Homogeneity of Variance (center = median)
## Df F value Pr(>F)
## group 1 2.9034 0.09156 .
## 98
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The p-value is greater than 0.05. Thus, we do not reject the null hypothesis of equal variances and treat the variance as the same in both populations.
Therefore, assumption 4 is considered fulfilled and we do not need to apply the Welch correction.
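For comparison, this is roughly what the Welch correction would look like had the variances differed. The sketch uses simulated stand-in data (`sim`, made-up values); the only change relative to the classical test is `var.equal = FALSE` in `t.test()`.

```r
# Simulated stand-in for df (values are made up for illustration)
set.seed(2)
sim <- data.frame(
  admit = factor(rep(c("No", "Yes"), each = 50)),
  gpa   = c(rnorm(50, mean = 3.28, sd = 0.40),
            rnorm(50, mean = 3.53, sd = 0.32))
)

# Classical t-test (pooled variance) versus Welch t-test (separate variances)
pooled <- t.test(gpa ~ admit, data = sim, var.equal = TRUE)
welch  <- t.test(gpa ~ admit, data = sim, var.equal = FALSE)

# Compare the degrees of freedom and the t statistics of the two tests
c(pooled_df = unname(pooled$parameter), welch_df = unname(welch$parameter))
c(pooled_t = unname(pooled$statistic), welch_t = unname(welch$statistic))
```

With equal group sizes the two t statistics coincide, so the Welch correction only affects the degrees of freedom and therefore the p-value.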
Next, we state the null and alternative hypotheses:
\(H_{0}\): \(\mu_{\text{Admitted}} = \mu_{\text{Not Admitted}}\)
\(H_{A}\): \(\mu_{\text{Admitted}} \neq \mu_{\text{Not Admitted}}\)
Since the data fulfills all assumptions, we can use the parametric test. However, for the purposes of the homework, both tests will be performed.
First, parametric:
t.test(df$gpa ~ df$admit,
var.equal = TRUE,
alternative = 'two.sided')
##
## Two Sample t-test
##
## data: df$gpa by df$admit
## t = -3.4649, df = 98, p-value = 0.0007884
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
## -0.3947548 -0.1072452
## sample estimates:
## mean in group No mean in group Yes
## 3.2832 3.5342
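To see where these numbers come from, the t statistic and the confidence interval can be recomputed by hand from the summary statistics. This is a sketch that copies the group means from the t-test output and the standard deviations from the descriptive statistics; since the standard deviations are rounded to two decimals, the results agree with the output only approximately.

```r
# Summary values copied from the outputs above (SDs are rounded to two decimals)
mean_no <- 3.2832
mean_yes <- 3.5342
sd_no <- 0.40
sd_yes <- 0.32
n_no <- 50
n_yes <- 50

# Pooled standard deviation and standard error of the mean difference
df_t <- n_no + n_yes - 2
sd_pooled <- sqrt(((n_no - 1) * sd_no^2 + (n_yes - 1) * sd_yes^2) / df_t)
se_diff <- sd_pooled * sqrt(1 / n_no + 1 / n_yes)

# t statistic and 95% confidence interval for the difference (No minus Yes)
mean_diff <- mean_no - mean_yes
t_stat <- mean_diff / se_diff
ci <- mean_diff + c(-1, 1) * qt(0.975, df_t) * se_diff

round(c(t = t_stat, lower = ci[1], upper = ci[2]), 2)
```

Rounded to two decimals, this gives t = -3.46 and an interval from -0.39 to -0.11, in line with the output of t.test().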
Next, non-parametric:
wilcox.test(df$gpa ~ df$admit,
correct = FALSE,
exact = FALSE,
alternative = 'two.sided')
##
## Wilcoxon rank sum test
##
## data: df$gpa by df$admit
## W = 776, p-value = 0.001077
## alternative hypothesis: true location shift is not equal to 0
For the independent samples t-test we will use Cohen’s d to estimate the effect size:
library(effectsize)
##
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
##
## cohens_d, eta_squared
## The following object is masked from 'package:psych':
##
## phi
cohens_d(gpa ~ admit,
data = df,
pooled_sd = FALSE)
## Cohen's d | 95% CI
## --------------------------
## -0.69 | [-1.10, -0.29]
##
## - Estimated using un-pooled SD.
interpret_cohens_d(0.69, rules = 'sawilowsky2009')
## [1] "medium"
## (Rules: sawilowsky2009)
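As a plausibility check, Cohen’s d can be recovered by hand in two ways: from the t statistic and from the group means and standard deviations. The sketch below copies the values from the outputs above. Because both groups have 50 observations, the pooled and un-pooled standard deviations coincide, so both routes should give the same value.

```r
# Values copied from the outputs above (SDs are rounded to two decimals)
t_stat <- -3.4649
mean_no <- 3.2832
mean_yes <- 3.5342
sd_no <- 0.40
sd_yes <- 0.32
n_no <- 50
n_yes <- 50

# Route 1: d from the pooled-variance t statistic
d_from_t <- t_stat * sqrt(1 / n_no + 1 / n_yes)

# Route 2: mean difference divided by the root mean square of the two SDs
d_from_sd <- (mean_no - mean_yes) / sqrt((sd_no^2 + sd_yes^2) / 2)

round(c(from_t = d_from_t, from_sd = d_from_sd), 2)
```

Both routes give -0.69, matching the value reported by cohens_d().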
For the non-parametric test we use the rank-biserial correlation:
effectsize(wilcox.test(df$gpa ~ df$admit,
correct = FALSE,
exact = FALSE,
alternative = 'two.sided'))
## r (rank biserial) | 95% CI
## ----------------------------------
## -0.38 | [-0.56, -0.17]
interpret_rank_biserial(0.38)
## [1] "large"
## (Rules: funder2019)
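The rank-biserial correlation can also be derived directly from the test statistic W. The sketch below uses one common formulation, r = 2W / (n1 * n2) - 1, with W and the group sizes copied from the outputs above. It is an assumption that effectsize computes the value in exactly this way, although the result agrees with its output.

```r
# Values copied from the Wilcoxon output and the sampling step
W <- 776
n_no <- 50
n_yes <- 50

# Rank-biserial correlation from W: rescales W / (n_no * n_yes) to [-1, 1]
r_rb <- 2 * W / (n_no * n_yes) - 1
r_rb
```

This gives -0.3792, which rounds to the reported -0.38.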
Since all assumptions are fulfilled, we use the parametric test to interpret the results. Thus:
Based on the sample data, we find that the average GPA of admitted and not admitted students differs (p < 0.001). Admitted candidates have a higher average GPA (the effect size is medium, |d| = 0.69).