Graduate Admissions
Introduction
In this article, we build a linear regression model on the graduate admission dataset. The main goals are to understand the relationships between the variables and to predict the chance of admission.
The graduate admission dataset contains several parameters that are considered important when applying for Masters programs. The parameters included are:
GRE Scores (out of 340)
TOEFL Scores (out of 120)
University Rating (out of 5)
Statement of Purpose and Letter of Recommendation Strength (out of 5)
Undergraduate GPA (out of 10)
Research Experience (either 0 or 1)
Chance of Admit (ranging from 0 to 1)
Library
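This section presumably loads the packages used throughout the analysis. The chunk below is a reconstruction from the functions that appear later, not the original code:

library(dplyr)    # data wrangling and the %>% pipe
library(ggplot2)  # residual plots
library(GGally)   # ggcorr() correlation plot
library(plotly)   # interactive 3D scatter plot
library(caret)    # confusionMatrix()
library(lmtest)   # bptest() Breusch-Pagan test
library(car)      # durbinWatsonTest() and vif()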
Load Dataset
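The data is read from a CSV file; the file name below follows the public Kaggle Graduate Admission dataset and is an assumption, so adjust the path to match your copy.

a <- read.csv("Admission_Predict_Ver1.1.csv")  # file name is an assumption
a <- a %>% select(-Serial.No.)                 # drop the unique identifier
str(a)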
## 'data.frame': 500 obs. of 8 variables:
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
The data has 500 observations (rows) with 8 variables (columns). The Serial.No. variable is dropped as it is a unique identifier. The target variable is Chance of Admit.
Exploratory Data Analysis
Before exploring the data, it is important to inspect the classes of the variables and change them where needed. In this case, the Research and University.Rating variables need to be converted into the character class so they are treated as categorical rather than numeric.
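A minimal sketch of the conversion, using the dplyr pipeline style used elsewhere in this analysis:

a <- a %>%
  mutate(University.Rating = as.character(University.Rating),
         Research = as.character(Research))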
Correlation
To see the relationships between the variables, we can compute the Pearson correlations between the numeric features.
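The correlation matrix is plotted with ggcorr() from the GGally package; the exact call can be recovered from the warning message below.

ggcorr(a, label = T, label_size = 2.9)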
## Warning in ggcorr(a, label = T, label_size = 2.9): data in column(s)
## 'University.Rating', 'Research' are not numeric and were ignored
The plot indicates that the Chance of Admit variable correlates strongly with each of the numeric predictors.
Data Visualization
plot_ly(a, x = ~GRE.Score, y = ~TOEFL.Score, z = ~CGPA, color = ~Chance.of.Admit,
        type = "scatter3d", mode = "markers") %>%
  layout(scene = list(xaxis = list(title = "GRE Score"),
                      yaxis = list(title = "TOEFL Score"),
                      zaxis = list(title = "CGPA")))

From the 3D scatter plot, we can see that the chance of admission is higher when the CGPA, TOEFL, and GRE scores are higher.
Modeling
The data needs to be separated into two datasets: a train dataset and a test dataset. The train dataset is used to fit the linear regression model, while the test dataset is held out for evaluation. The train dataset contains 70% of the data.
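A minimal sketch of the split, assuming a simple random sample; the seed value is an assumption.

set.seed(123)                                 # seed value is an assumption
idx <- sample(nrow(a), size = 0.7 * nrow(a))  # 70% of 500 = 350 rows
a.train <- a[idx, ]
a.test  <- a[-idx, ]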
We fit a linear regression model using Chance of Admit as the target variable and all remaining variables as predictors.
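The full model m regresses Chance.of.Admit on every other column; the call matches the Call shown in the summary output below.

m <- lm(Chance.of.Admit ~ ., data = a.train)
summary(m)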
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = a.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.226543 -0.023759 0.007652 0.034031 0.164322
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3233948 0.1213051 -10.910 < 0.0000000000000002 ***
## GRE.Score 0.0024465 0.0005654 4.327 0.0000199 ***
## TOEFL.Score 0.0022961 0.0010320 2.225 0.026749 *
## University.Rating2 -0.0129424 0.0138680 -0.933 0.351354
## University.Rating3 -0.0104983 0.0147933 -0.710 0.478400
## University.Rating4 -0.0134790 0.0175686 -0.767 0.443485
## University.Rating5 0.0037659 0.0202704 0.186 0.852726
## SOP 0.0065106 0.0055111 1.181 0.238290
## LOR 0.0182460 0.0047745 3.822 0.000158 ***
## CGPA 0.1088601 0.0112626 9.666 < 0.0000000000000002 ***
## Research1 0.0269591 0.0079592 3.387 0.000789 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05836 on 339 degrees of freedom
## Multiple R-squared: 0.8338, Adjusted R-squared: 0.8289
## F-statistic: 170.1 on 10 and 339 DF, p-value: < 0.00000000000000022
The summary of model m shows an adjusted R-squared of 0.8289. Two predictors, University.Rating and SOP, have Pr(>|t|) > 0.05, which indicates that they have no significant effect in the model.
For comparison, we apply stepwise regression with the backward elimination method.
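Backward elimination starts from the full model m and, at each step, removes the predictor whose deletion most improves the AIC; the call below reproduces the trace that follows.

step(m, direction = "backward")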
## Start: AIC=-1977.99
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## SOP + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## - University.Rating 4 0.01186 1.1664 -1982.4
## - SOP 1 0.00475 1.1593 -1978.5
## <none> 1.1545 -1978.0
## - TOEFL.Score 1 0.01686 1.1714 -1974.9
## - Research 1 0.03907 1.1936 -1968.3
## - LOR 1 0.04974 1.2042 -1965.2
## - GRE.Score 1 0.06376 1.2183 -1961.2
## - CGPA 1 0.31817 1.4727 -1894.8
##
## Step: AIC=-1982.41
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + SOP + LOR + CGPA +
## Research
##
## Df Sum of Sq RSS AIC
## - SOP 1 0.00568 1.1721 -1982.7
## <none> 1.1664 -1982.4
## - TOEFL.Score 1 0.01703 1.1834 -1979.3
## - Research 1 0.04164 1.2080 -1972.1
## - LOR 1 0.05207 1.2184 -1969.1
## - GRE.Score 1 0.06607 1.2324 -1965.1
## - CGPA 1 0.32988 1.4962 -1897.2
##
## Step: AIC=-1982.71
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
##
## Df Sum of Sq RSS AIC
## <none> 1.1721 -1982.7
## - TOEFL.Score 1 0.02192 1.1940 -1978.2
## - Research 1 0.04385 1.2159 -1971.9
## - GRE.Score 1 0.06508 1.2371 -1965.8
## - LOR 1 0.07173 1.2438 -1963.9
## - CGPA 1 0.37421 1.5462 -1887.7
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = a.train)
##
## Coefficients:
## (Intercept) GRE.Score TOEFL.Score LOR CGPA Research1
## -1.389682 0.002461 0.002552 0.020523 0.113357 0.028009
m1 <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
           CGPA + Research, data = a.train)
summary(m1)

##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, data = a.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.231469 -0.024395 0.007353 0.035280 0.163887
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.3896819 0.1125191 -12.351 < 0.0000000000000002 ***
## GRE.Score 0.0024611 0.0005631 4.370 0.00001645 ***
## TOEFL.Score 0.0025517 0.0010059 2.537 0.011631 *
## LOR 0.0205232 0.0044728 4.588 0.00000627 ***
## CGPA 0.1133571 0.0108165 10.480 < 0.0000000000000002 ***
## Research1 0.0280088 0.0078072 3.588 0.000382 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.05837 on 344 degrees of freedom
## Multiple R-squared: 0.8313, Adjusted R-squared: 0.8288
## F-statistic: 339 on 5 and 344 DF, p-value: < 0.00000000000000022
The stepwise regression drops the two variables that have no significant effect on the model. To see whether this affects the fit, we compare the adjusted R-squared values of both models: the first model shows 0.8289, while the second shows a nearly identical 0.8288. This shows that it is safe to remove the non-significant variables, so we use the second model, m1, as the main candidate model.
Evaluation
Model Performance
To see how well the model predicts the target variable, we use the root mean squared error (RMSE).
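A minimal sketch of the computation using a hand-rolled helper (the original may have used a package function instead):

rmse <- function(pred, actual) sqrt(mean((pred - actual)^2))
rmse(predict(m1, a.train), a.train$Chance.of.Admit)  # train RMSE
rmse(predict(m1, a.test), a.test$Chance.of.Admit)    # test RMSE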
## [1] 0.05786801
## [1] 0.06423879
As the RMSE values for the train (0.0579) and test (0.0642) datasets are similar, we can assume that the model is not overfitting.
Assumptions
1. Linearity
lin <- data.frame(residual = m1$residuals, fitted = m1$fitted.values)
lin %>% ggplot(aes(fitted, residual)) +
  geom_point() +
  geom_smooth() +
  geom_hline(aes(yintercept = 0)) +
  theme(panel.grid = element_blank(), panel.background = element_blank())

## `geom_smooth()` using method = 'loess' and formula 'y ~ x'
The plot shows no visible pattern in the residuals, which indicates that the linearity assumption holds.
2. Normality Test
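Normality of the residuals is checked with the Shapiro-Wilk test, as the output below confirms.

shapiro.test(m1$residuals)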
##
## Shapiro-Wilk normality test
##
## data: m1$residuals
## W = 0.94155, p-value = 0.0000000001612
With a p-value < 0.05, the null hypothesis is rejected, which means the residuals do not follow a normal distribution.
3. Heteroscedasticity
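The Breusch-Pagan test is available as bptest() in the lmtest package.

bptest(m1)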
##
## studentized Breusch-Pagan test
##
## data: m1
## BP = 17.041, df = 5, p-value = 0.004423
Using the Breusch-Pagan test, the model shows a p-value below 0.05, so it can be concluded that heteroscedasticity is present in our model.
4. Autocorrelation
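The output format below matches durbinWatsonTest() from the car package.

durbinWatsonTest(m1)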
## lag Autocorrelation D-W Statistic p-value
## 1 0.05308529 1.893537 0.332
## Alternative hypothesis: rho != 0
Autocorrelation can be detected using the Durbin-Watson test, whose null hypothesis is that there is no autocorrelation. The result (p-value = 0.332) shows that the null hypothesis is not rejected, meaning our residuals have no autocorrelation.
5. Multicollinearity
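The variance inflation factors are computed with vif() from the car package.

vif(m1)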
## GRE.Score TOEFL.Score LOR CGPA Research
## 4.248963 3.861199 1.815835 4.411289 1.540605
Multicollinearity indicates correlation between the independent variables (predictors). A VIF value that exceeds 5 or 10 signals a problematic amount of collinearity. In this model, all VIF values are under 5, so the correlation between predictors is acceptably weak.
Model Improvement
Model Tuning
library(dplyr)
a2 <- a %>%
  mutate(chance = ifelse(Chance.of.Admit > 0.5, 1, 0)) %>%
  select(-Chance.of.Admit)
a2$chance <- as.factor(a2$chance)

As the linear regression model does not satisfy several of the classical assumptions, we try a different approach: a logistic regression model. We create a new target variable, chance, by applying a cutoff of 0.5 to the previous target, Chance.of.Admit, and drop the original column.
Modeling
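The glm call is confirmed by the summary below, but the split code is an assumption: the null deviance has 399 degrees of freedom (400 training rows) and the test confusion matrix has 100 rows, which implies an 80/20 split here rather than the earlier 70/30 split.

set.seed(123)                                    # seed value is an assumption
idx2 <- sample(nrow(a2), size = 0.8 * nrow(a2))  # 80% of 500 = 400 rows
a2.train <- a2[idx2, ]
a2.test  <- a2[-idx2, ]

m2 <- glm(chance ~ ., family = "binomial", data = a2.train)
summary(m2)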
##
## Call:
## glm(formula = chance ~ ., family = "binomial", data = a2.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.62630 0.00406 0.05386 0.21851 1.64962
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -57.01801 11.83404 -4.818 0.00000145 ***
## GRE.Score 0.04581 0.03908 1.172 0.2410
## TOEFL.Score 0.13289 0.08483 1.567 0.1172
## University.Rating2 -1.02121 0.74887 -1.364 0.1727
## University.Rating3 -0.44938 0.91825 -0.489 0.6246
## University.Rating4 0.20034 1.34131 0.149 0.8813
## University.Rating5 11.26151 1169.46461 0.010 0.9923
## SOP -0.70378 0.38625 -1.822 0.0684 .
## LOR 0.95513 0.39320 2.429 0.0151 *
## CGPA 3.92543 0.99701 3.937 0.00008243 ***
## Research1 -0.30313 0.63527 -0.477 0.6332
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 232.65 on 399 degrees of freedom
## Residual deviance: 112.82 on 389 degrees of freedom
## AIC: 134.82
##
## Number of Fisher Scoring iterations: 18
Performance
library(caret)
a2.train$pred.chance <- predict(m2, a2.train, type = "response")
a2.train$pred.label <- ifelse(a2.train$pred.chance < 0.5, "0", "1") %>% as.factor()
confusionMatrix(a2.train$pred.label, a2.train$chance, positive = "1")

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 16 7
## 1 18 359
##
## Accuracy : 0.9375
## 95% CI : (0.9091, 0.9591)
## No Information Rate : 0.915
## P-Value [Acc > NIR] : 0.0591
##
## Kappa : 0.5291
##
## Mcnemar's Test P-Value : 0.0455
##
## Sensitivity : 0.9809
## Specificity : 0.4706
## Pos Pred Value : 0.9523
## Neg Pred Value : 0.6957
## Prevalence : 0.9150
## Detection Rate : 0.8975
## Detection Prevalence : 0.9425
## Balanced Accuracy : 0.7257
##
## 'Positive' Class : 1
##
library(caret)
a2.test$pred.chance <- predict(m2, a2.test, type = "response")
a2.test$pred.label <- ifelse(a2.test$pred.chance < 0.5, "0", "1") %>% as.factor()
confusionMatrix(a2.test$pred.label, a2.test$chance, positive = "1")

## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 1 1
## 1 4 94
##
## Accuracy : 0.95
## 95% CI : (0.8872, 0.9836)
## No Information Rate : 0.95
## P-Value [Acc > NIR] : 0.6160
##
## Kappa : 0.2647
##
## Mcnemar's Test P-Value : 0.3711
##
## Sensitivity : 0.9895
## Specificity : 0.2000
## Pos Pred Value : 0.9592
## Neg Pred Value : 0.5000
## Prevalence : 0.9500
## Detection Rate : 0.9400
## Detection Prevalence : 0.9800
## Balanced Accuracy : 0.5947
##
## 'Positive' Class : 1
##
The model has an accuracy of 93.75% on the train dataset and 95% on the test dataset. Note, however, that the classes are heavily imbalanced (the no-information rate is 0.915 on train and 0.95 on test), so the accuracy is driven largely by the majority class and the specificity for the rejected class is low.
Conclusion
To predict a student's chance of admission to the university, the useful variables are GRE Score, TOEFL Score, CGPA, Research, and Letter of Recommendation Strength. The adjusted R-squared of the model is fairly high at 82.88%. The RMSE is 0.058 on the training data and 0.064 on the test data, which shows that the model fits without overfitting. Unfortunately, the model does not satisfy all of the classical assumptions: the residuals are not normally distributed and heteroscedasticity is present. When we tried a logistic regression model with a cutoff of 0.5, the training data showed an accuracy of 93.75% and the test data an accuracy of 95%. The logistic regression model is therefore the better model for the graduate admission data.