ADA Homework 2

1. Import Data

The data is stored in an .RData file, so the load() function is sufficient.

load('admissions.RData')

2. Display with head()

head(admissions)

##   admit gre  gpa rank
## 1     0 380 3.61    3
## 2     1 660 3.67    3
## 3     1 800 4.00    1
## 4     1 640 3.19    4
## 5     0 520 2.93    4
## 6     1 760 3.00    2

3. Explain your data

The data represents a sample of 400 applications for a certain Master’s degree.

Description of variables:

admit: Binary variable, 1 means applicant was admitted, 0 means rejected.
gre: The applicant's score on the Graduate Record Examinations (GRE).
gpa: The applicant's grade point average.
rank: The rank of the school at which the applicant completed their bachelor’s degree.

4. Name the source of the data

Source: Empirical Data Analysis course at WU (2024)

5. Carry out data manipulation

First, let’s get rid of the variables which will not be relevant for the research question later.

admissions <- admissions[,c(1,3)]
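Selecting columns by position works, but it breaks silently if the column order ever changes. A minimal sketch of the same selection by name, using a small hypothetical data frame (toy) that only mimics the column layout of admissions:

```r
# Toy stand-in for the admissions data (values are illustrative only)
toy <- data.frame(admit = c(0, 1, 1),
                  gre   = c(380, 660, 800),
                  gpa   = c(3.61, 3.67, 4.00),
                  rank  = c(3, 3, 1))

# Keep only the columns needed for the research question, by name
toy_small <- toy[, c("admit", "gpa")]

# Check that this matches positional indexing with c(1, 3)
identical(toy_small, toy[, c(1, 3)])
```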

Next, let’s factor the admit variable.

admissions$admit <- factor(admissions$admit,
                           levels = c(0,1),
                           labels = c('No','Yes'))

Let’s see whether the observations of both groups are approximately equal.

table(admissions$admit)

## 
##  No Yes 
## 273 127

We can see that there are many more denied applications than accepted ones, so we draw a balanced random subsample of 50 applications from each group.

# Splitting the data by factor level
admit <- admissions[admissions$admit == 'Yes', ]
noadmit <- admissions[admissions$admit == 'No', ]

# Setting seed for reproducibility
set.seed(12345)

# Sampling 50 rows from the admit data
df <- admit[sample(nrow(admit), 50, replace = FALSE), ]
# Sampling 50 rows from the noadmit data and adding them to the data frame
df <- rbind(df,
            noadmit[sample(nrow(noadmit), 50, replace = FALSE), ])

# Removing temporary variables
rm(admit, noadmit, admissions)
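The split, sample, and bind steps above can also be wrapped in one reusable function. The following is only a sketch: balanced_sample() is a hypothetical helper, and sim is simulated data with the same group sizes as the original table, not the real admissions data.

```r
# Simulated stand-in for the factored admissions data (not the real data)
set.seed(1)
sim <- data.frame(
  admit = factor(rep(c("No", "Yes"), times = c(273, 127)),
                 levels = c("No", "Yes")),
  gpa   = round(runif(400, min = 2.2, max = 4), 2)
)

# Hypothetical helper: draw n rows without replacement from each group
balanced_sample <- function(data, group, n) {
  idx <- unlist(lapply(split(seq_len(nrow(data)), data[[group]]),
                       function(rows) sample(rows, n)))
  data[idx, , drop = FALSE]
}

sim_bal <- balanced_sample(sim, "admit", 50)
table(sim_bal$admit)
```

Because the row indices are split by group before sampling, each group contributes exactly n rows and no row can be drawn twice.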

Now we can show descriptive statistics.

library(psych)

describeBy(df$gpa, df$admit)

## 
##  Descriptive statistics by group 
## group: No
##    vars  n mean  sd median trimmed  mad  min max range skew kurtosis   se
## X1    1 50 3.28 0.4    3.3    3.27 0.48 2.42   4  1.58  0.1    -0.88 0.06
## ------------------------------------------------------------ 
## group: Yes
##    vars  n mean   sd median trimmed  mad  min max range  skew kurtosis   se
## X1    1 50 3.53 0.32   3.59    3.55 0.32 2.67   4  1.33 -0.37    -0.46 0.05

The average GPA of rejected applicants was 3.28, while that of accepted applicants was 3.53.
The standard deviation of GPA is 0.32 for admitted applicants and 0.40 for rejected applicants. Thus GPA varies more in the not-admitted group. (The two standard deviations are directly comparable because they share the same unit of measurement.)

We can also visualize the data.

library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

df %>% ggplot(aes(x = admit, y = gpa)) +
  geom_boxplot()

We can notice that the admitted group has an outlier with a much lower GPA than the rest of the group.
The interquartile range of the not-admitted applicants’ GPAs is much wider than that of the admitted applicants, which further shows the difference in variation.

6. Writing research question and performing statistical hypothesis tests.

Research Question:

Is there a difference between the average GPA of admitted and not admitted students?

To investigate this research question we will need to use statistical hypothesis tests about the difference between two population arithmetic means for independent samples.

First, we check whether we should use a parametric or a non-parametric test. Four assumptions have to be fulfilled to use an independent-samples t-test:

Variable is numeric
Data must come from two independent populations
Distribution of variable is normal in both populations.
Variable has the same variance in both populations. (If violated use Welch correction)

Assumptions 1 and 2 are fulfilled since we are interested in the GPA (numeric) between 2 independent groups.

Assumption 3:

library(rstatix)

## 
## Attaching package: 'rstatix'

## The following object is masked from 'package:stats':
## 
##     filter

df %>%
  group_by(admit) %>%
  shapiro_test(gpa)

## # A tibble: 2 × 4
##   admit variable statistic      p
##   <fct> <chr>        <dbl>  <dbl>
## 1 No    gpa          0.971 0.265 
## 2 Yes   gpa          0.956 0.0626

The p-value is above 0.05 in both groups. Thus, we do not reject the null hypothesis of normality and proceed as if GPA is normally distributed in both populations.
Therefore, assumption 3 is fulfilled.

Assumption 4:

library(car)

## Loading required package: carData

## 
## Attaching package: 'car'

## The following object is masked from 'package:dplyr':
## 
##     recode

## The following object is masked from 'package:psych':
## 
##     logit

leveneTest(df$gpa, group = df$admit)

## Levene's Test for Homogeneity of Variance (center = median)
##       Df F value  Pr(>F)  
## group  1  2.9034 0.09156 .
##       98                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The p-value (0.092) is above 0.05. Thus, we cannot reject the null hypothesis of equal variances and proceed as if the variance is the same in both populations.
Therefore, assumption 4 is fulfilled and we do not need to apply the Welch correction.
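Had Levene's test rejected equal variances, the Welch correction would be the fallback. A minimal sketch of what that call looks like, run on simulated data with deliberately unequal spread (sim is illustrative only and not the homework data):

```r
# Simulated two-group data with unequal spread (illustrative only)
set.seed(42)
sim <- data.frame(
  admit = factor(rep(c("No", "Yes"), each = 50)),
  gpa   = c(rnorm(50, mean = 3.3, sd = 0.5),
            rnorm(50, mean = 3.5, sd = 0.2))
)

# var.equal = FALSE (the default of t.test) applies the Welch correction
welch <- t.test(gpa ~ admit, data = sim, var.equal = FALSE)
welch$method      # "Welch Two Sample t-test"
welch$parameter   # Welch-adjusted degrees of freedom, usually non-integer
```

The Welch version does not pool the variances and adjusts the degrees of freedom downward, so it remains valid when the group variances differ.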

Testing the hypothesis

First, let’s consider the null and alternative hypotheses:

\(H_{0}\): \(\mu_{\text{Admitted}} = \mu_{\text{Not Admitted}}\)
\(H_{A}\): \(\mu_{\text{Admitted}} \neq \mu_{\text{Not Admitted}}\)

From the fact that the data fulfills all assumptions we know that we can use the parametric test. However, for the purposes of the homework both tests will be performed.

First, parametric:

t.test(df$gpa ~ df$admit,
       var.equal = TRUE,
       alternative = 'two.sided')

## 
##  Two Sample t-test
## 
## data:  df$gpa by df$admit
## t = -3.4649, df = 98, p-value = 0.0007884
## alternative hypothesis: true difference in means between group No and group Yes is not equal to 0
## 95 percent confidence interval:
##  -0.3947548 -0.1072452
## sample estimates:
##  mean in group No mean in group Yes 
##            3.2832            3.5342

We reject the null hypothesis (p < 0.001).
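The reported t statistic can be checked by hand from the descriptive statistics. The following is a sketch that uses the rounded standard deviations (0.40 and 0.32) and the pooled-variance formula, so it matches the output only up to rounding:

\[ s_p = \sqrt{\frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}} = \sqrt{\frac{49 \cdot 0.40^2 + 49 \cdot 0.32^2}{98}} \approx 0.362 \]

\[ t = \frac{\bar{x}_{\text{No}} - \bar{x}_{\text{Yes}}}{s_p \sqrt{\frac{1}{n_1} + \frac{1}{n_2}}} = \frac{3.2832 - 3.5342}{0.362 \cdot \sqrt{2/50}} \approx -3.47 \]

This agrees with the reported t = -3.4649 on df = 50 + 50 - 2 = 98. The negative sign only reflects the group order (No minus Yes).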

Next, non-parametric:

wilcox.test(df$gpa ~ df$admit,
            correct = FALSE,
            exact = FALSE,
            alternative = 'two.sided')

## 
##  Wilcoxon rank sum test
## 
## data:  df$gpa by df$admit
## W = 776, p-value = 0.001077
## alternative hypothesis: true location shift is not equal to 0

Here we also reject the null hypothesis, although with a slightly larger p-value (p = 0.0011).

Effect size estimation

For the independent-samples t-test we will use Cohen's d to estimate the effect size:

library(effectsize)

## 
## Attaching package: 'effectsize'

## The following objects are masked from 'package:rstatix':
## 
##     cohens_d, eta_squared

## The following object is masked from 'package:psych':
## 
##     phi

cohens_d(gpa ~ admit,
         data = df,
         pooled_sd = F)

## Cohen's d |         95% CI
## --------------------------
## -0.69     | [-1.10, -0.29]
## 
## - Estimated using un-pooled SD.

interpret_cohens_d(0.69, rules = 'sawilowsky2009')

## [1] "medium"
## (Rules: sawilowsky2009)

We find a medium effect size for the parametric test (|d| = 0.69).
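As a rough check, Cohen's d can be recomputed from the group means and the rounded standard deviations. This sketch assumes the un-pooled standardizer is the root mean of the two group variances, which coincides with the pooled standard deviation when both groups have the same size:

\[ d = \frac{\bar{x}_{\text{No}} - \bar{x}_{\text{Yes}}}{\sqrt{(s_{\text{No}}^2 + s_{\text{Yes}}^2)/2}} = \frac{3.2832 - 3.5342}{\sqrt{(0.40^2 + 0.32^2)/2}} \approx \frac{-0.251}{0.362} \approx -0.69 \]

The result matches the reported value of -0.69. The sign is negative because the not-admitted group is listed first.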

For the non-parametric test we use the rank-biserial correlation:

effectsize(wilcox.test(df$gpa ~ df$admit,
                       correct = FALSE,
                       exact = FALSE,
                       alternative = 'two.sided'))

## r (rank biserial) |         95% CI
## ----------------------------------
## -0.38             | [-0.56, -0.17]

interpret_rank_biserial(0.38)

## [1] "large"
## (Rules: funder2019)

For the non-parametric test we find a large difference in the locations of the two distributions (|r| = 0.38).
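The rank-biserial correlation can be related to the Wilcoxon statistic W reported earlier. One common formula, given here as an assumption about how the value arises rather than as the package's documented computation, reproduces the output:

\[ r = \frac{2W}{n_1 n_2} - 1 = \frac{2 \cdot 776}{50 \cdot 50} - 1 = 0.6208 - 1 \approx -0.38 \]

Here W / (n_1 n_2) is the share of (No, Yes) pairs in which the not-admitted applicant has the higher GPA, about 31%. A negative r therefore means that admitted applicants tend to have higher GPAs.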

Interpretation

Since all conditions are fulfilled, we can use the parametric test for interpreting the results. Thus:

Based on the sample data, we find that the average GPAs of admitted and not admitted students differ (p < 0.001). Admitted candidates have a higher average GPA (3.53 vs. 3.28), and the effect size is medium (|d| = 0.69).