1. Import Data

The data is stored in an .RData file, so the load() function is sufficient.

load('admissions.RData')
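Because load() restores objects under the names they were saved with, it can help to check which objects actually arrived in the workspace. The following is a minimal sketch of that check; the toy data frame and the temporary file are assumptions made for illustration and are not the course data.

```r
# Toy stand-in for the real data; the values are made up for illustration.
toy <- data.frame(admit = c(0, 1), gpa = c(3.1, 3.8))

# Save to a temporary .RData file, then remove the object from the workspace.
path <- tempfile(fileext = ".RData")
save(toy, file = path)
rm(toy)

# load() invisibly returns the names of the objects it restored.
loaded <- load(path)
print(loaded)
str(toy)
```

Capturing the return value of load() this way shows the object names without having to inspect the whole workspace.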

2. Display with head()

head(admissions)
##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2

3. Explain your data

The data represents a sample of 400 applications for a certain Master’s degree.

Description of variables:

4. Name the source of the data

Source: Empirical Data Analysis course at WU (2024)

5. Carry out data manipulation

First, let’s get rid of the variables which will not be relevant for the research question later.

admissions <- admissions[, c('admit', 'gpa')]

Next, let’s factor the admit variable.

admissions$admit <- factor(admissions$admit,
                           levels = c(0,1),
                           labels = c('No','Yes'))
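To make the mapping between levels and labels explicit, here is a small sketch on a hypothetical vector of 0/1 codes (not the admissions data). The first level is paired with the first label, the second level with the second label.

```r
# Hypothetical 0/1 codes standing in for the admit column.
codes <- c(0, 1, 1, 0, 1)

# levels lists the raw values in order; labels gives the names they map to.
admit_f <- factor(codes, levels = c(0, 1), labels = c("No", "Yes"))

print(admit_f)
print(table(admit_f))
```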

Let’s see whether the observations of both groups are approximately equal.

table(admissions$admit)
## 
##  No Yes 
## 273 127

We can see that there are many more denied applications than accepted ones, so we resample the data to obtain two groups of equal size (50 observations each).

#Splitting the data by factor
admit <- admissions[admissions$admit == 'Yes',]
noadmit <- admissions[admissions$admit ==  'No',]

#Setting seed for reproducibility
set.seed(12345)

#Sampling the admit data
df <- admit[sample(nrow(admit), 50, replace = FALSE),]
#Sampling the noadmit data and adding it to the data frame
df <- rbind(df,
            noadmit[sample(nrow(noadmit), 50, replace = FALSE),])

#Removing temporary variables
rm(admit, noadmit, admissions)
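The same balancing idea can be written as one reusable step. The sketch below is only an illustration on synthetic data: sample_per_group() is a hypothetical helper, not part of any package, and because it draws random numbers in a different order it would not reproduce the exact rows selected above.

```r
set.seed(1)

# Synthetic, unbalanced data: 30 "No" rows and 12 "Yes" rows.
toy <- data.frame(
  admit = factor(rep(c("No", "Yes"), times = c(30, 12))),
  gpa   = round(runif(42, min = 2.5, max = 4), 2)
)

# Hypothetical helper: draw n rows without replacement from each group.
sample_per_group <- function(data, group_col, n) {
  parts <- split(data, data[[group_col]])
  drawn <- lapply(parts, function(part) {
    part[sample(nrow(part), n), , drop = FALSE]
  })
  do.call(rbind, drawn)
}

balanced <- sample_per_group(toy, "admit", n = 10)
print(table(balanced$admit))
```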

Now we can show descriptive statistics.

library(psych)

describeBy(df$gpa, df$admit)
## 
##  Descriptive statistics by group 
## group: No
##    vars  n mean  sd median trimmed  mad  min max range skew kurtosis   se
## X1    1 50 3.28 0.4    3.3    3.27 0.48 2.42   4  1.58  0.1    -0.88 0.06
## ------------------------------------------------------------ 
## group: Yes
##    vars  n mean   sd median trimmed  mad  min max range  skew kurtosis   se
## X1    1 50 3.53 0.32   3.59    3.55 0.32 2.67   4  1.33 -0.37    -0.46 0.05

We can also visualize the data.

library(ggplot2)
## 
## Attaching package: 'ggplot2'
## The following objects are masked from 'package:psych':
## 
##     %+%, alpha
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
df %>% ggplot(aes(x = admit, y = gpa)) +
  geom_boxplot()

6. Writing research question and performing statistical hypothesis tests.

Research Question:

Is there a difference between the average GPA of admitted and not admitted students?

To investigate this research question, we need a statistical hypothesis test for the difference between two population means for independent samples.

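First, we check whether to use a parametric or a non-parametric test. Four assumptions have to be fulfilled to use an independent-samples t-test: (1) the variable of interest is numeric, (2) the two groups are independent, (3) the variable is normally distributed in both populations, and (4) the variance is the same in both populations.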

Assumptions 1 and 2 are fulfilled since we are interested in the GPA (numeric) between 2 independent groups.

Assumption 3:

library(rstatix)
## 
## Attaching package: 'rstatix'
## The following object is masked from 'package:stats':
## 
##     filter
df %>%
  group_by(admit) %>%
  shapiro_test(gpa)
## # A tibble: 2 × 4
##   admit variable statistic      p
##   <fct> <chr>        <dbl>  <dbl>
## 1 No    gpa          0.971 0.265 
## 2 Yes   gpa          0.956 0.0626
  • The p-value is > 0.05 for both groups. Thus, we do not reject the null hypothesis and assume that GPA is normally distributed in both populations.

  • Therefore, assumption 3 is fulfilled.
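The same per-group normality check can be run with base R only. This is a sketch on synthetic data (the toy data frame is an assumption, not the sampled admissions data), using shapiro.test() from the stats package and tapply() to apply it within each group.

```r
set.seed(42)

# Synthetic data with two groups of 50 observations each.
toy <- data.frame(
  admit = factor(rep(c("No", "Yes"), each = 50)),
  gpa   = c(rnorm(50, mean = 3.3, sd = 0.4), rnorm(50, mean = 3.5, sd = 0.3))
)

# Shapiro-Wilk p-value for each group separately.
p_values <- tapply(toy$gpa, toy$admit, function(x) shapiro.test(x)$p.value)
print(round(p_values, 3))

# TRUE only if normality is not rejected at the 5% level in either group.
normal_ok <- all(p_values > 0.05)
print(normal_ok)
```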

Assumption 4:

library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
## The following object is masked from 'package:psych':
## 
##     logit
leveneTest(df$gpa, group = df$admit)
## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  1  2.9034 0.09156 .
##       98                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
  • The p-value is > 0.05. Thus, we cannot reject the null hypothesis, and we assume that the variance is the same in both populations.

  • Therefore, assumption 4 is fulfilled and we do not need to apply the Welch correction.
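To see what Levene's test with center = median does, it can be rebuilt as a one-way ANOVA on the absolute deviations of each observation from its group median. The sketch below uses synthetic data (an assumption for illustration) and base R only; on the same data, the F value and p-value should agree with what leveneTest() reports.

```r
set.seed(42)

# Synthetic data with two groups of 50 observations each.
toy <- data.frame(
  admit = factor(rep(c("No", "Yes"), each = 50)),
  gpa   = c(rnorm(50, mean = 3.3, sd = 0.4), rnorm(50, mean = 3.5, sd = 0.3))
)

# Absolute deviation of each observation from its own group's median.
toy$abs_dev <- abs(toy$gpa - ave(toy$gpa, toy$admit, FUN = median))

# One-way ANOVA on those deviations; its F test is the median-centered Levene test.
levene_by_hand <- anova(lm(abs_dev ~ admit, data = toy))
print(levene_by_hand)
```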

Testing the hypothesis

First, let’s state the null and alternative hypotheses:

  • \(H_{0}\): \(\mu_{\text{Admitted}} = \mu_{\text{Not Admitted}}\)

  • \(H_{A}\): \(\mu_{\text{Admitted}} \neq \mu_{\text{Not Admitted}}\)

Because the data fulfill all four assumptions, the parametric test is appropriate. However, for the purposes of the homework, both tests will be performed.

First, parametric:

t.test(df$gpa ~ df$admit,
       var.equal = TRUE,
       alternative = 'two.sided')
## 
##  Two Sample t-test
## 
## data:  df$gpa by df$admit
## t = -3.4649, df = 98, p-value = 0.0007884
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  -0.3947548 -0.1072452
## sample estimates:
##  mean in group No mean in group Yes 
##            3.2832            3.5342
  • We reject the null hypothesis at p-value < 0.001.
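As a plausibility check, the pooled-variance t statistic can be recomputed by hand from the summary values printed above. This is a sketch, not a replacement for t.test(): the standard deviations are taken from the describeBy() output, where they are rounded to two decimals, so the result matches the reported t = -3.4649 and confidence interval only approximately.

```r
# Summary values copied from the outputs above (SDs are rounded).
n_no  <- 50; mean_no  <- 3.2832; sd_no  <- 0.40
n_yes <- 50; mean_yes <- 3.5342; sd_yes <- 0.32

# Pooled variance and standard error of the difference in means.
pooled_var <- ((n_no - 1) * sd_no^2 + (n_yes - 1) * sd_yes^2) / (n_no + n_yes - 2)
se_diff    <- sqrt(pooled_var * (1 / n_no + 1 / n_yes))

# t statistic, degrees of freedom, and two-sided p-value.
t_stat  <- (mean_no - mean_yes) / se_diff
df_t    <- n_no + n_yes - 2
p_value <- 2 * pt(-abs(t_stat), df = df_t)

# 95% confidence interval for the difference (No minus Yes).
ci <- (mean_no - mean_yes) + c(-1, 1) * qt(0.975, df = df_t) * se_diff

print(c(t = t_stat, df = df_t, p = p_value))
print(ci)
```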

Next, non-parametric:

wilcox.test(df$gpa ~ df$admit,
            correct = FALSE,
            exact = FALSE,
            alternative = 'two.sided')
## 
##  Wilcoxon rank sum test
## 
## data:  df$gpa by df$admit
## W = 776, p-value = 0.001077
## alternative hypothesis: true location shift is not equal to 0
  • Here we also reject the null hypothesis, this time with a p-value of 0.0011.

Effect size estimation

For the independent-samples t-test we will use Cohen’s d to estimate the effect size:

library(effectsize)
## 
## Attaching package: 'effectsize'
## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared
## The following object is masked from 'package:psych':
## 
##     phi
cohens_d(gpa ~ admit,
         data = df,
         pooled_sd = FALSE)
## Cohen's d |         95% CI
## --------------------------
## -0.69     | [-1.10, -0.29]
## 
## - Estimated using un-pooled SD.
interpret_cohens_d(0.69, rules = 'sawilowsky2009')
## [1] "medium"
## (Rules: sawilowsky2009)
  • We find a medium effect size for the parametric test. The negative sign only reflects the group order (No minus Yes).
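The reported d can be approximately reproduced from the group summaries. The sketch below assumes that the un-pooled standard deviation is the square root of the average of the two group variances, which is our understanding of what pooled_sd = FALSE does in effectsize; the SDs come from the describeBy() output and are rounded, so the result is approximate.

```r
# Summary values copied from the outputs above (SDs are rounded).
mean_no  <- 3.2832; sd_no  <- 0.40
mean_yes <- 3.5342; sd_yes <- 0.32

# Un-pooled SD (assumed formula): root of the mean of the two variances.
sd_unpooled <- sqrt((sd_no^2 + sd_yes^2) / 2)

# Cohen's d for the difference No minus Yes.
d <- (mean_no - mean_yes) / sd_unpooled
print(round(d, 2))
```

With equal group sizes, as here, this value coincides with the pooled-SD version, so the choice of pooled_sd does not change the point estimate.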

For the non-parametric test we use the rank-biserial correlation:

effectsize(wilcox.test(df$gpa ~ df$admit,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = 'two.sided'))
## r (rank biserial) |         95% CI
## ----------------------------------
## -0.38             | [-0.56, -0.17]
interpret_rank_biserial(0.38)
## [1] "large"
## (Rules: funder2019)
  • For the non-parametric test we find a large difference in the locations of the two distributions.
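The rank-biserial value can be recovered from the W statistic reported by wilcox.test(). In R, W counts the pairs in which a value from the first group exceeds a value from the second group (ties count one half), so r = 2W / (n1 · n2) - 1. The sketch below first checks that reading of W on made-up toy vectors and then applies the formula to the reported W = 776 with 50 observations per group.

```r
# Toy check of what W means (values are made up, no ties).
x <- c(1.2, 3.4, 2.2)
y <- c(2.0, 3.0, 4.1, 5.0)
pairs_x_greater <- sum(outer(x, y, ">"))
w_toy <- unname(wilcox.test(x, y)$statistic)
print(c(pairs = pairs_x_greater, W = w_toy))

# Rank-biserial correlation from the reported test statistic.
w    <- 776
n_no <- 50
n_yes <- 50
r_rb <- 2 * w / (n_no * n_yes) - 1
print(round(r_rb, 2))
```

The negative sign indicates that GPA values in the first group (No) tend to be lower than those in the second group (Yes).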

Interpretation

Since all conditions are fulfilled, we can use the parametric test for interpreting the results. Thus:

Based on the sample data, we find that the average GPA of admitted and not admitted students differs (p < 0.001). Admitted candidates have a higher average GPA, and the effect size is medium (|d| = 0.69).