Background

The purpose of this R Markdown document is to find out which parameters are most significant in predicting the result of graduate admissions from an Indian perspective. It can also help predict which applicants are most likely to be accepted to graduate school and which are not. This can be achieved with linear regression.

The dataset that I used here was obtained from Kaggle and was created by Mohan S Acharya, Asfia Armaan, and Aneeta S Antony. Citation: "A Comparison of Regression Models for Prediction of Graduate Admissions", IEEE International Conference on Computational Intelligence in Data Science 2019.

Import packages and .csv file

library(tidyverse)
library(GGally) # to visualize the correlation among variables
library(performance) # to compare performances from different regression models
library(MLmetrics) # to check error rate such as RMSE
library(lmtest) # Breusch-Pagan assumption check
library(car) # will be used in assumption check using VIF
admission <- read_csv("Graduate Admission 2/Admission_Predict_Ver1.1.csv")
admission

Columns description:

  1. Serial No.: Applicant identifier
  2. GRE Score: GRE score (out of 340)
  3. TOEFL Score: TOEFL score (out of 120)
  4. University Rating: University rating (out of 5)
  5. SOP: Statement of Purpose strength (out of 5)
  6. LOR: Letter of Recommendation strength (out of 5)
  7. CGPA: Undergraduate GPA (out of 10)
  8. Research: Research experience (either 0 or 1)
  9. Chance of Admit: Chance of admission (ranging from 0 to 1)

Data wrangling

admission <- admission %>% 
  mutate(`University Rating` = as.factor(`University Rating`)) %>% 
  select(-`Serial No.`)

admission
anyNA(admission)
## [1] FALSE

All columns are now stored in the correct data types, and the data has no missing values. We are ready to build our linear model from the admission data frame.
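Converting University Rating to a factor matters because lm() then treats it as a set of categories rather than a single numeric slope. The following is a minimal sketch on hypothetical toy values (`rating_num` is made up for illustration and is not the admission data) showing how a 5-level factor expands into dummy columns.

```r
# Hypothetical toy ratings, stored as plain numbers
rating_num <- c(1, 2, 3, 4, 5, 3)

# As a numeric column, the rating contributes one column (one slope)
X_num <- model.matrix(~ rating_num)

# As a factor, it expands into 4 dummy columns;
# level 1 becomes the reference level absorbed by the intercept
rating_fct <- as.factor(rating_num)
X_fct <- model.matrix(~ rating_fct)

colnames(X_num)
colnames(X_fct)
```

With a factor, each dummy coefficient is read as the difference from the reference level rather than as a per-point effect.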

Linear Regression

Linear regression formula:

\[ \hat{y}=\beta_0+\beta_1.x_1+...+\beta_n.x_n \]

y = b0 + b1x

  • y : target variable / prediction result
  • b0 : intercept
  • b1 : slope
  • \(x_1,...,x_n\) : the predictor variables that we’d use
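To make the formula concrete, here is a small sketch on simulated data. The variables `x1`, `x2` and the true coefficients 2, 0.5 and -1.5 are assumptions chosen for illustration; they have nothing to do with the admission data. It shows that lm() estimates the intercept and slopes, and that a prediction is just the formula evaluated with those estimates.

```r
set.seed(1)

# Simulated predictors and a target built from known coefficients plus noise
x1 <- runif(200, 0, 10)
x2 <- runif(200, 0, 5)
y  <- 2 + 0.5 * x1 - 1.5 * x2 + rnorm(200, sd = 0.1)
toy <- data.frame(y, x1, x2)

# Fit the linear model and look at the estimated b0, b1, b2
fit <- lm(y ~ x1 + x2, data = toy)
coef(fit)

# Prediction by hand for x1 = 4, x2 = 1: b0 + b1 * x1 + b2 * x2
b <- coef(fit)
manual <- b[["(Intercept)"]] + b[["x1"]] * 4 + b[["x2"]] * 1

# The same prediction through predict()
auto <- unname(predict(fit, newdata = data.frame(x1 = 4, x2 = 1)))
```

The estimated coefficients should land close to the values used to simulate the data, and `manual` and `auto` agree.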

Choosing our target & predictor variables:

  • Target variable: Chance of Admit
  • Predictor variables: All variables

Check the correlation between the target variable and the predictor variables with the ggcorr function from the GGally package.

# from 'GGally' package

ggcorr(admission, label = T)
## Warning in ggcorr(admission, label = T): data in column(s) 'University Rating'
## are not numeric and were ignored

Among the predictor variables, CGPA has the strongest positive correlation with our target variable, followed closely by TOEFL Score and GRE Score. University Rating is not shown because it is now a factor, which ggcorr ignores.
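If we want the correlations as numbers rather than a plot, base R can rank them. This is only a sketch on a simulated data frame (the columns `cgpa`, `gre`, `rating` and `chance` are hypothetical stand-ins, not the real admission columns). It keeps the numeric columns only, much like ggcorr does when it skips a factor.

```r
set.seed(42)
n <- 100

# Hypothetical stand-in data: two numeric predictors, one factor, one target
toy <- data.frame(
  cgpa   = rnorm(n, mean = 8.5, sd = 0.6),
  rating = factor(sample(1:5, n, replace = TRUE))
)
toy$gre    <- 250 + 8 * toy$cgpa + rnorm(n, sd = 8)
toy$chance <- 0.1 * toy$cgpa + rnorm(n, sd = 0.05)

# Keep numeric columns only; factors cannot enter a Pearson correlation
num_cols <- sapply(toy, is.numeric)
cors <- cor(toy[, num_cols])[, "chance"]

# Rank the predictors by their correlation with the target
sort(cors[names(cors) != "chance"], decreasing = TRUE)
```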

Cross-validation

Split the data into 80% training data and 20% test data. Note that this is a single hold-out split rather than k-fold cross-validation.

RNGkind(sample.kind = "Rounding")
set.seed(205)

index <- sample(x = nrow(admission), size = nrow(admission)*0.8)

train <- admission[index,]
test <- admission[-index,]
nrow(train)
## [1] 400
nrow(test)
## [1] 100
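For comparison with the single split above, k-fold cross-validation repeats the train/test idea k times so that every row is used for testing exactly once. Below is a minimal base R sketch. It uses the built-in mtcars data and the formula `mpg ~ wt + hp` purely as placeholders, and k = 5 is an arbitrary choice.

```r
set.seed(205)

k <- 5
n <- nrow(mtcars)

# Assign every row to one of k folds at random
folds <- sample(rep(1:k, length.out = n))

# For each fold: train on the other folds, test on this one
rmse_per_fold <- sapply(1:k, function(i) {
  train_i <- mtcars[folds != i, ]
  test_i  <- mtcars[folds == i, ]
  fit  <- lm(mpg ~ wt + hp, data = train_i)
  pred <- predict(fit, newdata = test_i)
  sqrt(mean((test_i$mpg - pred)^2))
})

# Average out-of-sample error across the folds
mean(rmse_per_fold)
```

The averaged error is usually a steadier estimate than the error from one split, at the cost of fitting the model k times.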

Create model

adm_model <- lm(`Chance of Admit`~., train)
summary(adm_model)
## 
## Call:
## lm(formula = `Chance of Admit` ~ ., data = train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.265434 -0.023774  0.006304  0.031860  0.159554 
## 
## Coefficients:
##                        Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          -1.2199756  0.1180922 -10.331  < 2e-16 ***
## `GRE Score`           0.0016277  0.0005598   2.908 0.003849 ** 
## `TOEFL Score`         0.0029342  0.0009698   3.026 0.002646 ** 
## `University Rating`2 -0.0178475  0.0138356  -1.290 0.197827    
## `University Rating`3 -0.0071599  0.0147563  -0.485 0.627803    
## `University Rating`4 -0.0033960  0.0172852  -0.196 0.844344    
## `University Rating`5  0.0130942  0.0197289   0.664 0.507270    
## SOP                   0.0016861  0.0051836   0.325 0.745145    
## LOR                   0.0166625  0.0045547   3.658 0.000289 ***
## CGPA                  0.1214008  0.0108234  11.217  < 2e-16 ***
## Research              0.0219704  0.0074641   2.943 0.003440 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.05975 on 389 degrees of freedom
## Multiple R-squared:   0.83,  Adjusted R-squared:  0.8256 
## F-statistic: 189.9 on 10 and 389 DF,  p-value: < 2.2e-16

Linear regression is a highly interpretable model, meaning we are able to interpret the model’s result. We can plug the intercept and the slopes into the linear regression formula.

\[ \hat{y}=\beta_0+\beta_1.x_1+...+\beta_n.x_n \]

y = b0 + b1x1 + b2x2 + ….

y = -1.2199756 + 0.0016277*GRE Score + ….

  • The intercept / b0 = -1.2199756
  • Slope for GRE Score = 0.0016277
  • Slope for Research = 0.0219704
  • etc.

Model interpretation:

  1. Holding the other predictors constant, a 1-point increase in GRE Score increases the predicted Chance of Admit by 0.0016277.
  2. Judging from the p-values, GRE Score, TOEFL Score, LOR, CGPA, and Research are significantly related to the target variable at the 0.05 level. SOP and the University Rating dummy variables are not.
  3. The University Rating column, a factor with five levels, produces 4 dummy variables, with rating 1 as the reference level.
  4. All numeric predictors have positive coefficients. The University Rating dummies for levels 2 to 4 have small negative coefficients relative to rating 1, but none of them is significant.
  5. The adjusted R-squared for adm_model is 0.8256. This means adm_model explains about 82.56% of the variance in the target variable.
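As a worked example of reading the coefficients, the sketch below plugs the estimates from summary(adm_model) into the regression formula. The applicant's values are hypothetical, chosen only for illustration, and the applicant is assumed to have University Rating 1 (the reference level), so all rating dummies are 0.

```r
# Coefficients copied from the summary(adm_model) output
b <- c(intercept = -1.2199756, gre = 0.0016277, toefl = 0.0029342,
       sop = 0.0016861, lor = 0.0166625, cgpa = 0.1214008,
       research = 0.0219704)

# Hypothetical applicant; intercept = 1 multiplies b0
applicant <- c(intercept = 1, gre = 320, toefl = 110,
               sop = 4, lor = 4, cgpa = 9, research = 1)

# y-hat = b0 + b1 * x1 + ... + bn * xn
pred <- sum(b * applicant[names(b)])
pred  # about 0.81

# Effect of one extra GRE point, everything else unchanged
applicant_plus <- applicant
applicant_plus["gre"] <- 321
gre_effect <- sum(b * applicant_plus[names(b)]) - pred
gre_effect  # equals the GRE slope, 0.0016277
```

The second calculation is what "holding the other predictors constant" means in practice: only the GRE term changes, so the prediction moves by exactly the GRE coefficient.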

We will now do some feature selection to decide which predictor columns to keep, by looking at the AIC (information loss) of each candidate model. This can be done using step-wise regression.

model_backward <- step(object = adm_model, direction = "backward")
## Start:  AIC=-2243.27
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + `University Rating` + 
##     SOP + LOR + CGPA + Research
## 
##                       Df Sum of Sq    RSS     AIC
## - SOP                  1   0.00038 1.3890 -2245.2
## - `University Rating`  4   0.02431 1.4129 -2244.3
## <none>                             1.3886 -2243.3
## - `GRE Score`          1   0.03018 1.4188 -2236.7
## - Research             1   0.03093 1.4195 -2236.5
## - `TOEFL Score`        1   0.03268 1.4213 -2236.0
## - LOR                  1   0.04777 1.4364 -2231.7
## - CGPA                 1   0.44909 1.8377 -2133.2
## 
## Step:  AIC=-2245.16
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + `University Rating` + 
##     LOR + CGPA + Research
## 
##                       Df Sum of Sq    RSS     AIC
## - `University Rating`  4   0.02630 1.4152 -2245.7
## <none>                             1.3890 -2245.2
## - `GRE Score`          1   0.03003 1.4190 -2238.6
## - Research             1   0.03102 1.4200 -2238.3
## - `TOEFL Score`        1   0.03383 1.4228 -2237.5
## - LOR                  1   0.05429 1.4432 -2231.8
## - CGPA                 1   0.47220 1.8612 -2130.1
## 
## Step:  AIC=-2245.66
## `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + LOR + CGPA + 
##     Research
## 
##                 Df Sum of Sq    RSS     AIC
## <none>                       1.4152 -2245.7
## - `GRE Score`    1   0.03433 1.4496 -2238.1
## - Research       1   0.03463 1.4499 -2238.0
## - `TOEFL Score`  1   0.03948 1.4547 -2236.7
## - LOR            1   0.07277 1.4880 -2227.6
## - CGPA           1   0.53332 1.9486 -2119.8

Through the process of step-wise regression, SOP and University Rating are removed from our model. Let’s now compare the performance of both models.
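The AIC values printed by step() can be reproduced from the RSS column. For a linear model, step() relies on extractAIC(), which computes n * log(RSS / n) + 2 * p, where p is the number of estimated coefficients. The helper name `step_aic` below is made up for this sketch. The RSS values are the rounded ones from the output above, so the results match only to about two decimals.

```r
# AIC as reported by step() for lm objects (up to an additive constant)
step_aic <- function(rss, n, p) n * log(rss / n) + 2 * p

# Full model: 400 training rows, 11 coefficients (intercept + 10 slopes)
aic_full <- step_aic(rss = 1.3886, n = 400, p = 11)

# Final model: intercept + 5 slopes
aic_final <- step_aic(rss = 1.4152, n = 400, p = 6)

c(full = aic_full, final = aic_final)

# Sanity check against extractAIC() on a built-in data set
fit <- lm(mpg ~ wt, data = mtcars)
check <- c(manual = step_aic(deviance(fit), nrow(mtcars), 2),
           extractAIC = extractAIC(fit)[2])
check
```

This also shows why the AIC is negative here: RSS / n is far below 1, so its logarithm is a large negative number.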

# from 'performance' package

compare_performance(adm_model, model_backward)

It turns out that ‘adm_model’ has a slightly higher adjusted R-squared than ‘model_backward’, but its AIC (information loss) is also slightly higher. The Akaike information criterion (AIC) is a metric used to compare the fit of different regression models, and lower is better. An AIC can be negative; what matters is the difference between the models.

The available options are to go with model_backward, which has the lower AIC, or with adm_model, which has the higher adjusted R-squared. At this stage, we will use model_backward.

model_backward$call
## lm(formula = `Chance of Admit` ~ `GRE Score` + `TOEFL Score` + 
##     LOR + CGPA + Research, data = train)
# check our prediction

model_backward$fitted.values[1:5]
##         1         2         3         4         5 
## 0.6696896 0.6254814 0.9090086 0.8247680 0.8561895

Error/Residual

Error/residual is the difference between our prediction and the actual values.

Error check (RMSE)

# from 'MLmetrics' package

# the fitted values cover only the 400 training rows, so compare them with the training targets
RMSE(y_pred = model_backward$fitted.values, y_true = train$`Chance of Admit`)
range(admission$`Chance of Admit`)
## [1] 0.34 0.97
hist(admission$`Chance of Admit`)
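RMSE is the square root of the mean squared residual, so it can be written in one line of base R. The `rmse` helper below is a sketch, not the MLmetrics implementation. It adds a length check because the fitted values of a model only line up with the rows it was trained on, and R would otherwise recycle the shorter vector and return a misleading number.

```r
# Hand-written RMSE with a guard against mismatched lengths
rmse <- function(y_true, y_pred) {
  stopifnot(length(y_true) == length(y_pred))
  sqrt(mean((y_true - y_pred)^2))
}

# Toy check: errors of -0.1, 0 and 0.1
toy_rmse <- rmse(y_true = c(0.5, 0.7, 0.9), y_pred = c(0.6, 0.7, 0.8))
toy_rmse

# Training RMSE implied by the step-wise output:
# RSS = 1.4152 for the final model on 400 training rows
train_rmse <- sqrt(1.4152 / 400)
train_rmse  # about 0.0595
```

A training RMSE of roughly 0.06 is small relative to the 0.34 to 0.97 range of Chance of Admit.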

Assumption check

(Normality of residuals, linearity, multicollinearity, homoscedasticity via Breusch-Pagan)

Normality of Residual

hist(model_backward$residuals)

Most of the residuals are concentrated around 0, giving the histogram a roughly bell-shaped (normal) appearance.

Homoscedasticity of Residual (Breusch-Pagan test)

# from 'lmtest' package

bptest(model_backward)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_backward
## BP = 22.189, df = 5, p-value = 0.000482

With the Breusch-Pagan test, we got a p-value of 0.000482, which is far lower than alpha (0.05). P-value < alpha means we reject H0. In the Breusch-Pagan test, H0 states that the error variance is constant and H1 that it is not. Therefore, the model shows heteroscedasticity. Let’s check again with a scatter plot.

plot(x = model_backward$fitted.values, # prediction results
     y = model_backward$residuals) # residuals
abline(h = 0, col = "red", lty = 2)

Our plot leans towards heteroscedasticity, particularly in its fan-shaped pattern. We don’t want the residuals to show any pattern; we want them scattered randomly around 0. This pattern indicates that the error variance in model_backward is not constant.
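To see what the studentized Breusch-Pagan test does under the hood, here is a sketch in base R on simulated data where the noise is deliberately built to grow with the predictor. It regresses the squared residuals on the predictors and uses n * R-squared as the test statistic, compared against a chi-square distribution with one degree of freedom per predictor. The last line recomputes the p-value from the statistic and degrees of freedom printed by bptest above.

```r
set.seed(7)
n <- 300

# Simulated data with a fan shape: the noise sd is proportional to x
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = 0.5 * x)
fit <- lm(y ~ x)

# Auxiliary regression: do the predictors explain the squared residuals?
aux <- lm(residuals(fit)^2 ~ x)

# Studentized (Koenker) statistic and its chi-square p-value
bp <- n * summary(aux)$r.squared
p_value <- pchisq(bp, df = 1, lower.tail = FALSE)
c(BP = bp, p = p_value)

# p-value for the statistic reported above: BP = 22.189, df = 5
reported_p <- pchisq(22.189, df = 5, lower.tail = FALSE)
reported_p
```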

Multicollinearity

# from `car` package
vif(model_backward)
##   `GRE Score` `TOEFL Score`           LOR          CGPA      Research 
##      4.569588      3.892302      1.733001      4.455922      1.512236

None of our predictor variables has a VIF above 10. This is good: a VIF above 10 would indicate that one predictor is strongly correlated with the others (multicollinearity), which we don’t want.
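The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R, using three columns of the built-in mtcars data as placeholder predictors, and cross-checks the result against the diagonal of the inverse correlation matrix, which gives the same numbers for purely numeric predictors.

```r
# Placeholder predictors from a built-in data set
X <- mtcars[, c("wt", "hp", "disp")]

# VIF via one auxiliary regression per predictor
vif_manual <- sapply(names(X), function(v) {
  others <- setdiff(names(X), v)
  r2 <- summary(lm(reformulate(others, response = v), data = X))$r.squared
  1 / (1 - r2)
})

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_alt <- diag(solve(cor(X)))

rbind(vif_manual, vif_alt)
```

A VIF of 1 means a predictor is uncorrelated with the others, and the value grows as the others explain more of it.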

Check all assumptions at once

By using the check_model function, we can check all of the assumptions for our model at once.

# from 'performance' package

check_model(model_backward)

Predict values with the test data set

# predict with model_backward (the model selected above)
adm_test <- predict(model_backward, newdata = test)

# add the predictions as a new column in the test data set
test$predict.model_backward <- adm_test

Compare the predictions with the actual values.

test <- test %>% 
  rename("Predicted Values" = "predict.model_backward")

test %>% 
  select(c(`Predicted Values`, `Chance of Admit`)) %>% 
  head()

Plot observed and predicted values from the test dataset.

test %>% 
  ggplot(aes(x = `Chance of Admit`, y = `Predicted Values`)) +
  geom_point(alpha = 0.8) +
  geom_abline(intercept = 0,
              slope = 1,
              color = "blue",
              size = 1.5) +
  theme_minimal()

Conclusion

At this stage, the model that we use to predict graduate admissions is model_backward. Through the process of feature selection, we get five variables that are considered important in predicting the chance of admission of a graduate school applicant: 1) GRE Score 2) TOEFL Score 3) LOR (Letter of Recommendation Strength) 4) CGPA (Undergraduate GPA) and 5) Research experience.

In terms of model performance, it passed two of the four assumption checks: normality of residuals and multicollinearity. However, a linear regression model should pass all four assumption checks to qualify as a “good” model, and model_backward fails that requirement; in particular, the Breusch-Pagan test indicates heteroscedasticity. Therefore, we can say that linear regression is not the best choice for predicting graduate admissions in future use. It is not sufficient to reliably predict the chance of admission of a grad school applicant.

Predicting with regression models other than linear regression would hopefully produce a better model than the one we just made.