1 Explanation

Getting accepted to your favorite university is everyone's dream, but there are many factors to consider. In this analysis, we use linear regression to examine which factors influence the chance of admission to graduate programs. The data was downloaded from Kaggle and covers graduate admissions from an Indian perspective.

Some relevant columns in the data:
- GRE.Score: GRE Scores (out of 340)
- TOEFL.Score: TOEFL Scores (out of 120)
- University.Rating: University Rating (out of 5)
- SOP: Statement of Purpose (out of 5)
- LOR: Letter of Recommendation Strength (out of 5)
- CGPA: Undergraduate GPA (out of 10)
- Research: Research Experience (either 0 or 1)
- Chance.of.Admit: Chance of Admit (ranging from 0 to 1)

1.1 Attaching Packages

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(ggplot2)
library(GGally)
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(leaps)
## Warning: package 'leaps' was built under R version 4.0.5
library(lmtest)
## Warning: package 'lmtest' was built under R version 4.0.5
## Loading required package: zoo
## Warning: package 'zoo' was built under R version 4.0.5
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode
library(MLmetrics)
## Warning: package 'MLmetrics' was built under R version 4.0.5
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
library(manipulate)

2 Data Exploration

2.1 Input Data

admission <- read.csv("data/Admission_Predict_Ver1.1.csv")

2.2 Data Inspection

head(admission)
##   Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 1          1       337         118                 4 4.5 4.5 9.65        1
## 2          2       324         107                 4 4.0 4.5 8.87        1
## 3          3       316         104                 3 3.0 3.5 8.00        1
## 4          4       322         110                 3 3.5 2.5 8.67        1
## 5          5       314         103                 2 2.0 3.0 8.21        0
## 6          6       330         115                 5 4.5 3.0 9.34        1
##   Chance.of.Admit
## 1            0.92
## 2            0.76
## 3            0.72
## 4            0.80
## 5            0.65
## 6            0.90
tail(admission)
##     Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 495        495       301          99                 3 2.5 2.0 8.45        1
## 496        496       332         108                 5 4.5 4.0 9.02        1
## 497        497       337         117                 5 5.0 5.0 9.87        1
## 498        498       330         120                 5 4.5 5.0 9.56        1
## 499        499       312         103                 4 4.0 5.0 8.43        0
## 500        500       327         113                 4 4.5 4.5 9.04        0
##     Chance.of.Admit
## 495            0.68
## 496            0.87
## 497            0.96
## 498            0.93
## 499            0.73
## 500            0.84

2.3 Subsetting Data

We only use observations with a chance of admit greater than 75%, and we drop the Serial.No. column since it is just a row identifier.

adm <- admission %>% 
  filter(Chance.of.Admit > 0.75) %>% 
  dplyr::select(-Serial.No.)

2.4 Data Cleansing

Check whether there are any missing values in the data.

anyNA(adm)
## [1] FALSE
colSums(is.na(adm))
##         GRE.Score       TOEFL.Score University.Rating               SOP 
##                 0                 0                 0                 0 
##               LOR              CGPA          Research   Chance.of.Admit 
##                 0                 0                 0                 0

There are no missing values in this data.

Check the data type of each column.

glimpse(adm)
## Rows: 210
## Columns: 8
## $ GRE.Score         <int> 337, 324, 322, 330, 327, 328, 328, 334, 336, 340, 32~
## $ TOEFL.Score       <int> 118, 107, 110, 115, 111, 112, 116, 119, 119, 120, 10~
## $ University.Rating <int> 4, 4, 3, 5, 4, 4, 5, 5, 5, 5, 5, 4, 5, 5, 5, 4, 5, 5~
## $ SOP               <dbl> 4.5, 4.0, 3.5, 4.5, 4.0, 4.0, 5.0, 5.0, 4.0, 4.5, 4.~
## $ LOR               <dbl> 4.5, 4.5, 2.5, 3.0, 4.5, 4.5, 5.0, 4.5, 3.5, 4.5, 3.~
## $ CGPA              <dbl> 9.65, 8.87, 8.67, 9.34, 9.00, 9.10, 9.50, 9.70, 9.80~
## $ Research          <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1~
## $ Chance.of.Admit   <dbl> 0.92, 0.76, 0.80, 0.90, 0.84, 0.78, 0.94, 0.95, 0.97~
ggcorr(adm, label = T, label_size = 3, hjust = 0.9)

All variables are positively correlated with Chance.of.Admit; the strongest correlations are with CGPA, TOEFL.Score, and GRE.Score.
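To see the exact numbers behind the plot, we can also sort the correlations with Chance.of.Admit directly (a quick check; output omitted):

# Correlation of every column with Chance.of.Admit, strongest first;
# the 1.00 for Chance.of.Admit itself is expected.
sort(cor(adm)[, "Chance.of.Admit"], decreasing = TRUE)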

3 Linear Regression Model

3.1 Strongest Correlation

First, we build a linear regression model that uses CGPA, TOEFL.Score, and GRE.Score as predictors.

model_one <- lm(Chance.of.Admit ~ CGPA + TOEFL.Score + GRE.Score, adm)
summary(model_one)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA + TOEFL.Score + GRE.Score, 
##     data = adm)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.091157 -0.016801  0.002086  0.020893  0.077820 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.8480784  0.0929162  -9.127  < 2e-16 ***
## CGPA         0.1115444  0.0083104  13.422  < 2e-16 ***
## TOEFL.Score  0.0030727  0.0006916   4.442 1.45e-05 ***
## GRE.Score    0.0010480  0.0004403   2.380   0.0182 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02977 on 206 degrees of freedom
## Multiple R-squared:  0.7917, Adjusted R-squared:  0.7887 
## F-statistic:   261 on 3 and 206 DF,  p-value: < 2.2e-16

The first model gives an Adjusted R-squared of 0.7887.
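Since we will compare this value against the second model later, note that it can also be extracted programmatically:

# Pull the Adjusted R-squared straight out of the model summary.
summary(model_one)$adj.r.squared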

3.2 Step-wise Regression

Next, we select predictors using stepwise regression with backward elimination to build the second model.

model_step <- lm(Chance.of.Admit ~ ., adm)
step(model_step, direction = "backward")
## Start:  AIC=-1506.92
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     SOP + LOR + CGPA + Research
## 
##                     Df Sum of Sq     RSS     AIC
## - LOR                1  0.000786 0.14962 -1507.8
## - SOP                1  0.001166 0.15000 -1507.3
## - GRE.Score          1  0.001300 0.15013 -1507.1
## <none>                           0.14883 -1506.9
## - Research           1  0.007299 0.15613 -1498.9
## - University.Rating  1  0.009019 0.15785 -1496.6
## - TOEFL.Score        1  0.017296 0.16613 -1485.8
## - CGPA               1  0.090177 0.23901 -1409.5
## 
## Step:  AIC=-1507.82
## Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     SOP + CGPA + Research
## 
##                     Df Sum of Sq     RSS     AIC
## - GRE.Score          1  0.001103 0.15072 -1508.3
## <none>                           0.14962 -1507.8
## - SOP                1  0.001936 0.15156 -1507.1
## - Research           1  0.007283 0.15690 -1499.8
## - University.Rating  1  0.011135 0.16076 -1494.7
## - TOEFL.Score        1  0.016821 0.16644 -1487.4
## - CGPA               1  0.097553 0.24718 -1404.4
## 
## Step:  AIC=-1508.28
## Chance.of.Admit ~ TOEFL.Score + University.Rating + SOP + CGPA + 
##     Research
## 
##                     Df Sum of Sq     RSS     AIC
## <none>                           0.15072 -1508.3
## - SOP                1  0.001751 0.15248 -1507.8
## - Research           1  0.009631 0.16036 -1497.3
## - University.Rating  1  0.011987 0.16271 -1494.2
## - TOEFL.Score        1  0.027459 0.17818 -1475.1
## - CGPA               1  0.119510 0.27024 -1387.7
## 
## Call:
## lm(formula = Chance.of.Admit ~ TOEFL.Score + University.Rating + 
##     SOP + CGPA + Research, data = adm)
## 
## Coefficients:
##       (Intercept)        TOEFL.Score  University.Rating                SOP  
##         -0.505701           0.003387           0.011247           0.005708  
##              CGPA           Research  
##          0.098242           0.020271
model_two <- lm(formula = Chance.of.Admit ~ TOEFL.Score + University.Rating + 
    SOP + CGPA + Research, data = adm)
summary(model_two)
## 
## Call:
## lm(formula = Chance.of.Admit ~ TOEFL.Score + University.Rating + 
##     SOP + CGPA + Research, data = adm)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.09466 -0.01305  0.00267  0.01702  0.09020 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -0.5057014  0.0558446  -9.056  < 2e-16 ***
## TOEFL.Score        0.0033872  0.0005556   6.096 5.35e-09 ***
## University.Rating  0.0112466  0.0027921   4.028 7.94e-05 ***
## SOP                0.0057084  0.0037084   1.539 0.125281    
## CGPA               0.0982415  0.0077245  12.718  < 2e-16 ***
## Research           0.0202706  0.0056145   3.610 0.000385 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.02718 on 204 degrees of freedom
## Multiple R-squared:  0.828,  Adjusted R-squared:  0.8238 
## F-statistic: 196.4 on 5 and 204 DF,  p-value: < 2.2e-16

Stepwise regression builds a model using the most appropriate predictors based on the AIC value: the lower the AIC, the less information the model loses. The second model gives an Adjusted R-squared of 0.8238.
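As a cross-check, AIC() can compare the two fitted models directly. The absolute numbers differ from the step trace above because AIC() keeps constant terms that extractAIC() drops, but the between-model comparison is still valid (lower is better):

# Direct AIC comparison of the two candidate models.
AIC(model_one, model_two)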

The second model therefore has a higher Adjusted R-squared than the first (0.8238 versus 0.7887); other things being equal, the higher the Adjusted R-squared, the better the model explains the variation in the response.

Models: \[Chance.of.Admit = -0.848078 + 0.111544(CGPA) + 0.003073(TOEFL.Score) + 0.001048(GRE.Score)\] \[Chance.of.Admit = -0.505701 + 0.003387(TOEFL.Score) + 0.011247(University.Rating) + 0.005708(SOP) + 0.098242(CGPA) + 0.020271(Research)\]
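As a sanity check on the first equation, plugging in the first applicant's values (CGPA = 9.65, TOEFL.Score = 118, GRE.Score = 337) by hand should reproduce the fitted value that predict() returns below, up to coefficient rounding:

# Manually evaluate the first model's equation for applicant 1;
# the result is about 0.944, matching predict(model_one, ...) below.
-0.848078 + 0.111544 * 9.65 + 0.003073 * 118 + 0.001048 * 337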

4 Prediction and Error

We now predict the chance of admit for the first applicant (actual value 0.92) with both models and measure how far each prediction is from the actual value.

predict(model_one, data.frame(Chance.of.Admit = 0.92,
                              CGPA = 9.65,
                              TOEFL.Score = 118,
                              GRE.Score = 337),
        interval = "confidence", level = 0.95)
##         fit      lwr       upr
## 1 0.9440649 0.936391 0.9517387
predict(model_two, data.frame(Chance.of.Admit = 0.92,
                              CGPA = 9.65,
                              TOEFL.Score = 118,
                              University.Rating = 4,
                              SOP = 4.5,
                              Research = 1),
        interval = "confidence", level = 0.95)
##         fit       lwr       upr
## 1 0.9329629 0.9255154 0.9404104
abs(0.92 - 0.9440649)
## [1] 0.0240649
abs(0.92 - 0.9329629)
## [1] 0.0129629

The smaller the prediction error the better, and for this applicant the second model's prediction is closer to the actual value (absolute error 0.0130 versus 0.0241), so by this check the second model performs better.
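A single applicant is a thin basis for comparison, so as an extra check (a sketch; output omitted) we can compare the RMSE of both models over the whole adm data with RMSE() from the MLmetrics package loaded earlier:

# Root mean squared error of each model on the full training data.
RMSE(y_pred = predict(model_one, adm), y_true = adm$Chance.of.Admit)
RMSE(y_pred = predict(model_two, adm), y_true = adm$Chance.of.Admit)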

5 Model Evaluation

5.1 Normality of Residuals

hist(model_one$residuals, breaks = 20)

shapiro.test(model_one$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_one$residuals
## W = 0.9897, p-value = 0.138
hist(model_two$residuals, breaks = 20)

shapiro.test(model_two$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model_two$residuals
## W = 0.97511, p-value = 0.0008914

The p-value for the first model is greater than 0.05, while the p-value for the second model is less than 0.05. So, the residuals of the second model are not normally distributed. To bring the residuals of the second model closer to normal, we could transform the data or collect more data.
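As a sketch of the transformation idea (the resulting p-value is not verified here), we could refit the second model on a log-transformed response and re-test the residuals:

# Refit model_two with a log-transformed response, then re-check normality.
model_two_log <- lm(log(Chance.of.Admit) ~ TOEFL.Score + University.Rating +
                      SOP + CGPA + Research, data = adm)
shapiro.test(model_two_log$residuals)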

5.2 Homoscedasticity of Residuals

plot(adm$Chance.of.Admit, model_one$residuals)
abline(h = 0, col = "red")

bptest(model_one)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_one
## BP = 7.6526, df = 3, p-value = 0.05376
plot(adm$Chance.of.Admit, model_two$residuals)
abline(h = 0, col = "red")

bptest(model_two)
## 
##  studentized Breusch-Pagan test
## 
## data:  model_two
## BP = 10.754, df = 5, p-value = 0.05648

The p-values of both models are greater than 0.05, so we fail to reject the null hypothesis of constant residual variance: the residuals show no pattern (homoscedasticity rather than heteroscedasticity).
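Base R's diagnostic plots give a complementary visual check: the scale-location plot should show a roughly even spread of points when the residual variance is constant:

# Scale-location plot: sqrt(|standardized residuals|) vs fitted values.
plot(model_two, which = 3)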

5.3 No Multicollinearity

vif(model_one)
##        CGPA TOEFL.Score   GRE.Score 
##    2.185335    2.407613    2.557778
vif(model_two)
##       TOEFL.Score University.Rating               SOP              CGPA 
##          1.863324          1.863578          1.910092          2.264315 
##          Research 
##          1.127349

No VIF value in either model is greater than 10, so no predictor is strongly linearly dependent on the others (no multicollinearity).
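By definition, the VIF of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors in the model. As an illustration, the CGPA value above can be reproduced by hand:

# Reproduce vif(model_one)["CGPA"] from its definition.
r2_cgpa <- summary(lm(CGPA ~ TOEFL.Score + GRE.Score, data = adm))$r.squared
1 / (1 - r2_cgpa)  # should be about 2.185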

6 Conclusion

If we only look at the Adjusted R-squared and the prediction error, the second model is better than the first: its Adjusted R-squared is 0.8238 versus 0.7887 for the first model. However, the second model fails the normality-of-residuals check, while the first model passes all of the evaluation checks.

So, I think the first model is still the better one. This means the factors that most influence the chance of admission to graduate programs are undergraduate GPA (CGPA), GRE score, and TOEFL score.