Introduction

In this markdown, we’ll take a look at the Graduate Admissions dataset from Kaggle, which contains data on the various factors affecting admission to a graduate program from an Indian applicant’s perspective. Using the data, we’d like to answer two main questions:

  1. Which factors are important contributors to graduate admission?
  2. Can we predict an applicant’s chance of admission based on the application?

Let’s get started

Data cleaning & wrangling

First, we’ll import and look at the data

library(tidyverse)
library(MLmetrics)
library(knitr)
library(GGally)
library(lmtest)
library(car)
library(kableExtra)
data <- read_csv('Admission_Predict_Ver1.1.csv')
data %>% 
    head() %>% 
    kable() %>% 
    kable_styling()
Serial No.  GRE Score  TOEFL Score  University Rating  SOP  LOR  CGPA  Research  Chance of Admit
         1        337          118                  4  4.5  4.5  9.65         1             0.92
         2        324          107                  4  4.0  4.5  8.87         1             0.76
         3        316          104                  3  3.0  3.5  8.00         1             0.72
         4        322          110                  3  3.5  2.5  8.67         1             0.80
         5        314          103                  2  2.0  3.0  8.21         0             0.65
         6        330          115                  5  4.5  3.0  9.34         1             0.90

The data contains 500 rows and 9 columns:

  • Serial No. : row ID of the data
  • GRE Score : GRE score (dbl, out of 340)
  • TOEFL Score : TOEFL score (dbl, out of 120)
  • University Rating : rating of the university (integer, 1 - 5)
  • SOP : strength of the Statement of Purpose (dbl, 1 - 5)
  • LOR : strength of the Letter of Recommendation (dbl, 1 - 5)
  • CGPA : undergraduate GPA (dbl, out of 10)
  • Research : research experience (integer; 0 : none, 1 : yes)
  • Chance of Admit : chance of admission (dbl, 0 - 1)

Before continuing, let’s see if there are any NA values or duplicated rows

colSums(is.na(data))
##        Serial No.         GRE Score       TOEFL Score University Rating 
##                 0                 0                 0                 0 
##               SOP               LOR              CGPA          Research 
##                 0                 0                 0                 0 
##   Chance of Admit 
##                 0

There are no NA values

data[duplicated(data), ]
## # A tibble: 0 x 9
## # ... with 9 variables: `Serial No.` <dbl>, `GRE Score` <dbl>, `TOEFL
## #   Score` <dbl>, `University Rating` <dbl>, SOP <dbl>, LOR <dbl>, CGPA <dbl>,
## #   Research <dbl>, `Chance of Admit` <dbl>

and no duplicated data.

We do not need the Serial No. / row ID, so we can drop it from the data set. We’ll also rename the columns to make them easier to reference.

data.clean <- data %>% select(-c(`Serial No.`)) %>% 
    rename(
        GRE = `GRE Score`,
        TOEFL = `TOEFL Score`,
        Rating = `University Rating`,
        chance = `Chance of Admit`
    )

We’ll then split the data into train and test sets to validate the model, holding out 20% of the data as the test set.

idx <- sample(nrow(data.clean), nrow(data.clean) * 0.8)

data.train <- data.clean[idx, ]
data.test <- data.clean[-idx, ]
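
Note that sample() draws a random subset, so the exact split (and the numbers reported below) will vary between runs. A minimal sketch of how the split could be made reproducible; the seed value here is arbitrary and not the one behind the outputs in this post:

# Arbitrary seed; fixing it makes the 80/20 split reproducible across runs
set.seed(100)
idx <- sample(nrow(data.clean), nrow(data.clean) * 0.8)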

Exploratory data analysis

Before jumping into modeling, let’s explore the data. All the predictor variables are numeric, and so is the target. We can use a correlation matrix to see how strong the correlations are between the predictors and the target.

ggcorr(data.clean, label = T) +
    labs(title = 'Correlation Matrix')

As we can see from the matrix, all of the factors correlate strongly with the chance of admission. To better differentiate the factors, let’s fit a linear model and look at the strength of each one.
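
As a numeric complement to the matrix, we can also list the correlations with chance directly; a small sketch using base R’s cor() (all columns of data.clean are numeric):

# Correlation of each predictor with the target, sorted from strongest to weakest
chance.cor <- cor(data.clean)[, 'chance']
sort(chance.cor[names(chance.cor) != 'chance'], decreasing = TRUE)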

Model formulation

model.all <- lm(chance ~ ., data = data.train)
summary(model.all)
## 
## Call:
## lm(formula = chance ~ ., data = data.train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.243436 -0.022308  0.005847  0.032502  0.154815 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.3886284  0.1156276 -12.009  < 2e-16 ***
## GRE          0.0024488  0.0005427   4.513 8.48e-06 ***
## TOEFL        0.0026660  0.0009260   2.879  0.00421 ** 
## Rating       0.0086929  0.0041558   2.092  0.03710 *  
## SOP         -0.0028817  0.0049498  -0.582  0.56077    
## LOR          0.0135090  0.0044576   3.031  0.00260 ** 
## CGPA         0.1138610  0.0104041  10.944  < 2e-16 ***
## Research     0.0205398  0.0072401   2.837  0.00479 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05698 on 392 degrees of freedom
## Multiple R-squared:  0.8348, Adjusted R-squared:  0.8319 
## F-statistic:   283 on 7 and 392 DF,  p-value: < 2.2e-16

Judging by the coefficients and their significance, the strongest contributors to admission are CGPA, Research experience, the Letter of Recommendation and GRE. Six of the seven predictors are significant at the 0.05 level (all except SOP), and the model achieves an adjusted R-squared of 0.8319.

As a comparison, let’s build another model using stepwise regression. Stepwise regression picks out the best features for the model by minimizing the Akaike Information Criterion (AIC), a metric that estimates information loss while penalizing model complexity. The lower the estimated information loss, the lower the AIC value, and the better the trade-off between fit and the number of parameters.

model.step <- step(model.all, chance ~ . , direction= 'backward', trace = 0)
summary(model.step)
## 
## Call:
## lm(formula = chance ~ GRE + TOEFL + Rating + LOR + CGPA + Research, 
##     data = data.train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.242402 -0.021858  0.005769  0.032531  0.154220 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -1.3792579  0.1144056 -12.056  < 2e-16 ***
## GRE          0.0024540  0.0005421   4.526 7.96e-06 ***
## TOEFL        0.0025992  0.0009181   2.831  0.00488 ** 
## Rating       0.0078189  0.0038719   2.019  0.04413 *  
## LOR          0.0127221  0.0042442   2.998  0.00289 ** 
## CGPA         0.1129167  0.0102682  10.997  < 2e-16 ***
## Research     0.0204675  0.0072329   2.830  0.00490 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05693 on 393 degrees of freedom
## Multiple R-squared:  0.8347, Adjusted R-squared:  0.8322 
## F-statistic: 330.7 on 6 and 393 DF,  p-value: < 2.2e-16
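
Since step() selects features by minimizing AIC, we can also compare the two candidate models directly with base R’s AIC() function (lower is better); a quick sanity check rather than a formal test:

# Lower AIC indicates less estimated information loss on the same data
AIC(model.all)
AIC(model.step)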

Despite dropping SOP, model.step achieves a marginally higher adjusted R-squared (0.8322, versus 0.8319 for model.all).

Testing

To test the performance of the two models, both are run against the test dataset. We’ll use two metrics, Root Mean Squared Error (RMSE) and Mean Absolute Percentage Error (MAPE), computed with functions from the MLmetrics library.
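
For reference, a minimal sketch of what the two metrics compute, using their usual definitions (the helper names here are just for illustration and are not part of MLmetrics):

# RMSE: square root of the mean squared error
rmse_manual <- function(y_pred, y_true) sqrt(mean((y_pred - y_true)^2))

# MAPE: mean of the absolute errors relative to the actual values
mape_manual <- function(y_pred, y_true) mean(abs((y_true - y_pred) / y_true))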

Testing model.all

# Predictions from `model.all`, which uses all 7 predictors
result.all <- predict(model.all, newdata = data.test)
model.all.mape <- MAPE(result.all, data.test$chance)
model.all.rmse <- RMSE(result.all, data.test$chance)

Testing model.step

# Predictions from `model.step`, which uses 6 predictors
result.step <- predict(model.step, newdata = data.test)
model.step.mape <- MAPE(result.step, data.test$chance)
model.step.rmse <- RMSE(result.step, data.test$chance)
comp <- rbind(
    c('model', 'MAPE', 'RMSE'),
    c('model.all', model.all.mape, model.all.rmse),
    c('model.step', model.step.mape, model.step.rmse)
)

kable(comp, caption = 'Model comparison') %>% 
    kable_styling()
Model comparison

model       MAPE                RMSE
model.all   0.0896746956626329  0.0719893938359623
model.step  0.0890540578113787  0.0716849012511888

Evaluation

From the testing results, we can see that model.step gives slightly better predictions than model.all while using one fewer predictor. Therefore, if we were to deploy a model, model.step would be the better choice. Even though we have produced a seemingly good model, we still have to evaluate the assumptions behind linear regression. One way to check whether the model has captured the structure in the data is to make a residual plot. For this section, we’ll use model.step.

1. Residual Plot

# Residuals vs fitted values for model.step
plot(model.step$fitted.values, model.step$residuals, xlab = 'Fitted values (chance)', ylab = 'Residuals')
abline(h = 0, col = 'red')

The residuals are scattered around the 0 line, but their spread narrows as the fitted values increase. This suggests that the variance of the errors is not constant across the range of fitted values. To quantify and test this hypothesis, we move to the next check: the test for homoscedasticity.

Homoscedasticity

The homoscedasticity assumption states that the variability of the residuals is constant across the range of fitted values. As we have seen from the plot above, the residuals tend to converge as the predicted chance approaches 1. To test this objectively, we use the Breusch-Pagan test.

bptest(model.step)
## 
##  studentized Breusch-Pagan test
## 
## data:  model.step
## BP = 20.65, df = 6, p-value = 0.00212

The p-value from the Breusch-Pagan test is used to decide whether to reject the null hypothesis:

H0 : Variation of the residuals is constant (homoscedastic)
H1 : Variation of the residuals is not constant (heteroscedastic)

From the test, we get a p-value of 0.0021199, which is less than 0.05. This means we reject the null hypothesis: the variation of the residuals is not constant across the range of values.

Normality

The normality assumption states that the residuals of the model are normally distributed around zero. To test this objectively, we use the Shapiro-Wilk normality test.

shapiro.test(model.step$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model.step$residuals
## W = 0.91986, p-value = 9.069e-14

The p-value from the Shapiro-Wilk normality test is used to decide whether to reject the null hypothesis:

H0 : Residuals are normally distributed
H1 : Residuals are not normally distributed

The p-value is below 0.05, therefore the null hypothesis is rejected. This means that the residuals are not normally distributed.
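
As a visual complement to the Shapiro-Wilk test, a normal Q-Q plot of the residuals (a base R sketch) would show the same departure from normality, especially in the tails:

# Normal Q-Q plot of the residuals; points drifting away from the line indicate non-normality
qqnorm(model.step$residuals)
qqline(model.step$residuals, col = 'red')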

2. Multicollinearity

Strongly correlated predictors can inflate the variance of the coefficient estimates and make individual effects hard to interpret. A good model has independent predictors. To check this, we can calculate the Variance Inflation Factor (VIF) using the vif function from the car library.

vif(model.step)
##      GRE    TOEFL   Rating      LOR     CGPA Research 
## 4.567094 3.781739 2.377519 1.870140 4.766208 1.580362

As a rule of thumb, values above 5 indicate strong multicollinearity. From the test, we can see that all values fall below 5, but CGPA, GRE and TOEFL are more correlated than the other predictors. This makes sense, as the three are academic scores which usually go hand in hand.
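
To back up that intuition, we can look at the plain pairwise correlations of the three scores on the training data (a small sketch):

# Pairwise correlations between the three academic scores
cor(data.train[, c('CGPA', 'GRE', 'TOEFL')])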

Model improvement

Our evaluation above shows that the model still has room for improvement. The VIF results hinted at possible improvements by showing that CGPA, GRE and TOEFL are correlated. The residual plot also shows that the residuals converge towards roughly -0.02 as the predicted chance approaches 1. The plot below shows the various columns of the data plotted against chance.

data.clean %>%
    pivot_longer(cols = -chance, names_to='variable') %>% 
    ggplot() +
    aes(x = value, y = chance) +
    geom_point() +
    facet_wrap(~variable, scales = 'free_x') + 
    labs(title = 'Plots of parameters vs chance of admission for different variables', x = 'Parameter', y = 'Chance')

As we can see from the plots of CGPA, GRE and TOEFL, chance tapers towards 1 as the scores increase. This causes the residuals to deviate from randomness as the model hits its upper limit.

Feature engineering

Let’s try to engineer our features to build a better model. One thing we can do is transform the numeric features, taking the logarithm of continuous variables such as CGPA, GRE and TOEFL.

data.log <- data.clean %>% 
    mutate(
        CGPA = log(CGPA),
        GRE = log(GRE),
        TOEFL = log(TOEFL)
    )
head(data.log)
## # A tibble: 6 x 8
##     GRE TOEFL Rating   SOP   LOR  CGPA Research chance
##   <dbl> <dbl>  <dbl> <dbl> <dbl> <dbl>    <dbl>  <dbl>
## 1  5.82  4.77      4   4.5   4.5  2.27        1   0.92
## 2  5.78  4.67      4   4     4.5  2.18        1   0.76
## 3  5.76  4.64      3   3     3.5  2.08        1   0.72
## 4  5.77  4.70      3   3.5   2.5  2.16        1   0.8 
## 5  5.75  4.63      2   2     3    2.11        0   0.65
## 6  5.80  4.74      5   4.5   3    2.23        1   0.9
# Splitting log data using the same index
data.log.train <- data.log[idx, ]
data.log.test <- data.log[-idx, ]

Let’s train our model again using stepwise regression and evaluate it

model.log <- step(lm(chance ~ ., data = data.log.train), chance ~ ., direction = 'backward', trace = 0)
summary(model.log)
## 
## Call:
## lm(formula = chance ~ GRE + TOEFL + Rating + LOR + CGPA + Research, 
##     data = data.log.train)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.24524 -0.02209  0.00625  0.03207  0.15713 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -7.299445   0.778768  -9.373  < 2e-16 ***
## GRE          0.788565   0.169183   4.661 4.32e-06 ***
## TOEFL        0.291865   0.097391   2.997  0.00290 ** 
## Rating       0.008301   0.003853   2.155  0.03180 *  
## LOR          0.012688   0.004233   2.998  0.00289 ** 
## CGPA         0.950047   0.085982  11.049  < 2e-16 ***
## Research     0.020104   0.007219   2.785  0.00562 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05681 on 393 degrees of freedom
## Multiple R-squared:  0.8354, Adjusted R-squared:  0.8329 
## F-statistic: 332.4 on 6 and 393 DF,  p-value: < 2.2e-16

As we can see, stepwise selection keeps the same predictors as model.step (everything except SOP). Evaluating the model on the test data

result.log <- predict(model.log, data.log.test)
model.log.mape <- MAPE(result.log, data.log.test$chance)
model.log.rmse <- RMSE(result.log, data.log.test$chance)

model.log, which uses log-transformed predictors, achieves the best MAPE of the three models, but its RMSE is marginally worse than model.step’s (though still better than model.all’s).

comp <- rbind(
    comp,
    c('model.log', model.log.mape, model.log.rmse )
)

kable(comp, caption = 'Model comparison') %>% 
    kable_styling()
Model comparison

model       MAPE                RMSE
model.all   0.0896746956626329  0.0719893938359623
model.step  0.0890540578113787  0.0716849012511888
model.log   0.0881938366067705  0.0718630232949641

Which ones matter?

As we have seen, University Rating contributes only weakly to the chance of admission in the overall model. This raises an interesting question: do universities with different ratings weigh each admission factor differently?

To answer that question, let’s separate the data by university rating and fit a model for each rating bracket.

# Separate universities by rating
univ <- list()
for(i in c(1:5)) {
 
    univ.data <- data.clean %>% 
    filter(Rating == i) %>% 
    select(-c(Rating))

    univ[[i]] <- univ.data
 
    #Clean up
    rm(univ.data)
}

# Create a model for each rating
compare.univs <- list()
for(i in (1:5)) {
    univ.model <- step(lm(chance ~ ., univ[[i]]), chance ~ ., trace = 0)
    compare.univs[[i]] <- univ.model
    
    # Clean up
    rm(univ.model)
}

# Plot coefficients
univ.coefs <- matrix(0, nrow = length(compare.univs), ncol = 6)
colnames(univ.coefs) <- c('SOP', 'LOR', 'CGPA', 'Research', 'GRE', 'TOEFL')

for(i in 1:length(compare.univs)) {
    
    # Get all coefficients, without intercept
    coefs <- compare.univs[[i]]$coefficients[-1]
    
    # Assign each coefficients to matrix
    for(name in names(coefs)) {
        
        univ.coefs[i, name] <- coefs[name]
    }
    
    # clean up
    rm(coefs)
    
}
univ.coefs <- as_tibble(univ.coefs, rownames = 'Rating')
univ.coefs %>%
    pivot_longer(c(-Rating)) %>% 
    ggplot(aes(x = Rating, y = value)) +
    geom_col(aes(fill = Rating)) +
    facet_wrap(~name) +
        labs(x = 'Univ. Rating', y = 'Coeff. Value', title = 'Coefficient Weights for Universities in Different Rating Brackets')

From the graph, we can see that the factors contribute differently across university rating brackets:

  • CGPA is the highest contributor to the chance of success for most universities, but its weight decreases for universities with better ratings.
  • GRE and TOEFL contribute little on their own. As our model analysis showed, this may be due to the strong correlation between CGPA, GRE and TOEFL, so most of the variance in GRE and TOEFL is already captured by CGPA.
  • Research experience and the Statement of Purpose (SOP) become more important for higher-rated universities.
  • The Letter of Recommendation (LOR) is a big factor in admission to lower-rated universities, but less so for higher-rated ones.

Conclusion

In this publication, three linear regression models have been trained to answer the two questions posed at the beginning. We have seen that model.step gives slightly better test predictions than model.all while using one fewer predictor, and model.log improves the MAPE a little further. Therefore, if deployed, model.step would require less processing while producing very similar, in fact slightly better, results.

model.step captures much of the variance, with an adjusted R-squared of 0.8321517. Although the model does not fully satisfy the assumptions behind linear regression, the deviations are explainable and reasonably acceptable.

We have also discovered that universities in different rating brackets evaluate an applicant’s parameters differently. All universities consider CGPA to be the most important factor in admission. Aside from CGPA, high-rating universities (4-5) put more weight on the Statement of Purpose (SOP) and Research experience, while a strong Letter of Recommendation (LOR) is very important when applying to lower-rating universities (1). According to the data, GRE and TOEFL scores are not strong predictors of admission on their own; however, a minimum GRE or TOEFL score might still be required to pass the initial screening.