1 Intro

We will use a linear regression model to predict a student's chance of admission to a particular university master's program. The predictions should give students a fair idea of how likely they are to be admitted.

2 Setup and Data Import

library(scales)
library(dplyr)
library(GGally)
library(performance)
library(car)
library(MLmetrics)
library(lmtest)
options(scipen = 9999)
admission <- read.csv('AdmissionTrain.csv')
head(admission)

2.1 Data Dictionary

  • GRE Scores ( out of 340 )
  • TOEFL Scores ( out of 120 )
  • University Rating ( out of 5 )
  • Statement of Purpose and Letter of Recommendation Strength ( out of 5 )
  • Undergraduate GPA ( out of 10 )
  • Research Experience ( either 0 or 1 )
  • Chance of Admit ( ranging from 0 to 1 )
summary(admission)
#>    Serial.No.      GRE.Score      TOEFL.Score    University.Rating
#>  Min.   :  1.0   Min.   :290.0   Min.   : 92.0   Min.   :1.000    
#>  1st Qu.:100.8   1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000    
#>  Median :200.5   Median :317.0   Median :107.0   Median :3.000    
#>  Mean   :200.5   Mean   :316.8   Mean   :107.4   Mean   :3.087    
#>  3rd Qu.:300.2   3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000    
#>  Max.   :400.0   Max.   :340.0   Max.   :120.0   Max.   :5.000    
#>       SOP           LOR             CGPA          Research     
#>  Min.   :1.0   Min.   :1.000   Min.   :6.800   Min.   :0.0000  
#>  1st Qu.:2.5   1st Qu.:3.000   1st Qu.:8.170   1st Qu.:0.0000  
#>  Median :3.5   Median :3.500   Median :8.610   Median :1.0000  
#>  Mean   :3.4   Mean   :3.453   Mean   :8.599   Mean   :0.5475  
#>  3rd Qu.:4.0   3rd Qu.:4.000   3rd Qu.:9.062   3rd Qu.:1.0000  
#>  Max.   :5.0   Max.   :5.000   Max.   :9.920   Max.   :1.0000  
#>  Chance.of.Admit 
#>  Min.   :0.3400  
#>  1st Qu.:0.6400  
#>  Median :0.7300  
#>  Mean   :0.7244  
#>  3rd Qu.:0.8300  
#>  Max.   :0.9700

3 Exploratory Data Analysis

ggcorr(admission, label = TRUE, hjust = 0.9)

The correlation plot shows that the candidate predictors are moderately to strongly correlated with the target variable (Chance.of.Admit). The features are also strongly correlated with each other; we will examine this later in the multicollinearity assumption test.

anyNA(admission)
#> [1] FALSE

No missing values were found in the data, so we can move on to modelling.

4 Creating Model and Feature Selection

We start by building a model that uses every available predictor.

init_model_all <- lm(Chance.of.Admit~., admission)

Then we use the step() function to eliminate predictors one at a time. At each step it drops the predictor whose removal lowers the AIC the most, and it stops when no further removal lowers the AIC.

backstep_model <- step(init_model_all, direction = 'backward')
#> Start:  AIC=-2224.4
#> Chance.of.Admit ~ Serial.No. + GRE.Score + TOEFL.Score + University.Rating + 
#>     SOP + LOR + CGPA + Research
#> 
#>                     Df Sum of Sq    RSS     AIC
#> - SOP                1  0.000001 1.4703 -2226.4
#> <none>                           1.4703 -2224.4
#> - University.Rating  1  0.013613 1.4839 -2222.7
#> - GRE.Score          1  0.036819 1.5071 -2216.5
#> - Research           1  0.038157 1.5084 -2216.2
#> - TOEFL.Score        1  0.045734 1.5160 -2214.2
#> - LOR                1  0.061407 1.5317 -2210.0
#> - Serial.No.         1  0.124472 1.5948 -2193.9
#> - CGPA               1  0.290273 1.7606 -2154.3
#> 
#> Step:  AIC=-2226.4
#> Chance.of.Admit ~ Serial.No. + GRE.Score + TOEFL.Score + University.Rating + 
#>     LOR + CGPA + Research
#> 
#>                     Df Sum of Sq    RSS     AIC
#> <none>                           1.4703 -2226.4
#> - University.Rating  1  0.014924 1.4852 -2224.4
#> - GRE.Score          1  0.037011 1.5073 -2218.5
#> - Research           1  0.038427 1.5087 -2218.1
#> - TOEFL.Score        1  0.046338 1.5166 -2216.0
#> - LOR                1  0.073463 1.5437 -2208.9
#> - Serial.No.         1  0.125908 1.5962 -2195.5
#> - CGPA               1  0.298134 1.7684 -2154.6
summary(backstep_model)
#> 
#> Call:
#> lm(formula = Chance.of.Admit ~ Serial.No. + GRE.Score + TOEFL.Score + 
#>     University.Rating + LOR + CGPA + Research, data = admission)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.233547 -0.026629  0.006155  0.038221  0.140281 
#> 
#> Coefficients:
#>                      Estimate  Std. Error t value             Pr(>|t|)    
#> (Intercept)       -1.29378909  0.11967201 -10.811 < 0.0000000000000002 ***
#> Serial.No.         0.00015926  0.00002749   5.794         0.0000000141 ***
#> GRE.Score          0.00179818  0.00057244   3.141             0.001810 ** 
#> TOEFL.Score        0.00368430  0.00104820   3.515             0.000491 ***
#> University.Rating  0.00880952  0.00441633   1.995             0.046761 *  
#> LOR                0.02157648  0.00487533   4.426         0.0000124796 ***
#> CGPA               0.10533353  0.01181460   8.916 < 0.0000000000000002 ***
#> Research           0.02438836  0.00761939   3.201             0.001482 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.06124 on 392 degrees of freedom
#> Multiple R-squared:  0.8188, Adjusted R-squared:  0.8156 
#> F-statistic: 253.1 on 7 and 392 DF,  p-value: < 0.00000000000000022
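
To see where the AIC values in the step() trace come from, here is a minimal sketch. It assumes the formula that step() applies to lm fits, n * log(RSS / n) + 2 * p, and plugs in the RSS values printed in the trace with n = 400. The helper aic_lm and the simulated data frame sim are made up for illustration and are not part of the original analysis.

```r
# AIC as used by step() for lm fits: n * log(RSS / n) + 2 * p,
# where p counts the estimated coefficients (intercept included).
aic_lm <- function(rss, n, p) n * log(rss / n) + 2 * p

n <- 400  # number of rows in the training data
aic_full   <- aic_lm(rss = 1.4703, n = n, p = 9)  # 8 predictors + intercept
aic_no_sop <- aic_lm(rss = 1.4703, n = n, p = 8)  # SOP dropped, RSS unchanged to 4 decimals
print(round(c(full = aic_full, no_sop = aic_no_sop), 1))

# Cross-check the helper against extractAIC() on simulated data
# (hypothetical data, not the admission data).
set.seed(1)
sim <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
sim$y <- 1 + 2 * sim$x1 + rnorm(50)
fit <- lm(y ~ x1 + x2, data = sim)
manual <- aic_lm(sum(residuals(fit)^2), n = nrow(sim), p = length(coef(fit)))
print(all.equal(manual, extractAIC(fit)[2]))
```

Dropping SOP leaves the RSS essentially unchanged while removing one parameter, which is why the AIC falls by about 2.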

The model ends up using Serial.No. as a predictor. Serial.No. is only a row identifier and carries no information about an applicant, so we should remove it.
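
One way to keep an identifier column out of the selection altogether is to exclude it in the formula before calling step(). The sketch below is only an illustration on a small simulated data frame (toy, with a made-up Noise column), not the admission data.

```r
# Simulated stand-in for the admission data (hypothetical values).
set.seed(42)
n <- 200
toy <- data.frame(
  Serial.No. = seq_len(n),                                  # row identifier
  CGPA       = runif(n, 6.8, 9.9),
  LOR        = sample(seq(1, 5, by = 0.5), n, replace = TRUE),
  Noise      = rnorm(n)                                     # unrelated to the target
)
toy$Chance.of.Admit <- -0.4 + 0.12 * toy$CGPA + 0.02 * toy$LOR +
  rnorm(n, sd = 0.05)

# "." expands to every column except the response; "- Serial.No." removes the id,
# so backward elimination can never keep it.
full <- lm(Chance.of.Admit ~ . - Serial.No., data = toy)
selected <- step(full, direction = "backward", trace = 0)
print(names(coef(selected)))
```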

admission_model <- lm(Chance.of.Admit ~ GRE.Score + University.Rating + TOEFL.Score + LOR + CGPA + Research, admission)
summary(admission_model)
#> 
#> Call:
#> lm(formula = Chance.of.Admit ~ GRE.Score + University.Rating + 
#>     TOEFL.Score + LOR + CGPA + Research, data = admission)
#> 
#> Residuals:
#>      Min       1Q   Median       3Q      Max 
#> -0.26391 -0.02197  0.01008  0.03621  0.15851 
#> 
#> Coefficients:
#>                     Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept)       -1.2543094  0.1243300 -10.089 < 0.0000000000000002 ***
#> GRE.Score          0.0017646  0.0005957   2.962              0.00324 ** 
#> University.Rating  0.0048540  0.0045404   1.069              0.28570    
#> TOEFL.Score        0.0028389  0.0010801   2.628              0.00892 ** 
#> LOR                0.0210327  0.0050724   4.147            0.0000414 ***
#> CGPA               0.1179066  0.0120852   9.756 < 0.0000000000000002 ***
#> Research           0.0241542  0.0079287   3.046              0.00247 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.06373 on 393 degrees of freedom
#> Multiple R-squared:  0.8033, Adjusted R-squared:  0.8003 
#> F-statistic: 267.5 on 6 and 393 DF,  p-value: < 0.00000000000000022

The model now has a slightly lower adjusted R-squared (0.8003 versus 0.8156) but is still a good fit. University.Rating is no longer a significant predictor (p = 0.286), so it is safe to remove it.

Regarding SOP: after several attempts, I decided not to include the SOP variable in the model. Stepwise selection in both directions dropped SOP, its role is similar to that of LOR, and adding SOP to the model also reduced model performance.

Our regression model is : Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research

admission_model <- lm(Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research, admission)
summary(admission_model)
#> 
#> Call:
#> lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
#>     CGPA + Research, data = admission)
#> 
#> Residuals:
#>       Min        1Q    Median        3Q       Max 
#> -0.263542 -0.023297  0.009879  0.038078  0.159897 
#> 
#> Coefficients:
#>               Estimate Std. Error t value             Pr(>|t|)    
#> (Intercept) -1.2984636  0.1172905 -11.070 < 0.0000000000000002 ***
#> GRE.Score    0.0017820  0.0005955   2.992              0.00294 ** 
#> TOEFL.Score  0.0030320  0.0010651   2.847              0.00465 ** 
#> LOR          0.0227762  0.0048039   4.741           0.00000297 ***
#> CGPA         0.1210042  0.0117349  10.312 < 0.0000000000000002 ***
#> Research     0.0245769  0.0079203   3.103              0.00205 ** 
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> Residual standard error: 0.06374 on 394 degrees of freedom
#> Multiple R-squared:  0.8027, Adjusted R-squared:  0.8002 
#> F-statistic: 320.6 on 5 and 394 DF,  p-value: < 0.00000000000000022

Interpretation (each effect assumes the other predictors are held constant) :

  • For every 1-point increase in GRE score, the predicted chance of admission increases by 0.0017820.
  • For every 1-point increase in TOEFL score, the predicted chance of admission increases by 0.0030320.
  • For every 1-point increase in LOR, the predicted chance of admission increases by 0.0227762.
  • For every 1-point increase in CGPA, the predicted chance of admission increases by 0.1210042.
  • Applicants with research experience (Research = 1) have a predicted chance of admission 0.0245769 higher than applicants without it.
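
As a small worked example of this interpretation, the sketch below applies the coefficients from the final model summary to a hypothetical applicant (GRE 320, TOEFL 110, LOR 4, Research 1) and raises CGPA from 8 to 9. The applicant and the helper predict_chance are assumptions made for illustration.

```r
# Coefficients copied from summary(admission_model).
coefs <- c(
  "(Intercept)" = -1.2984636, GRE.Score = 0.0017820, TOEFL.Score = 0.0030320,
  LOR = 0.0227762, CGPA = 0.1210042, Research = 0.0245769
)

# Linear predictor: intercept plus the sum of coefficient * value.
predict_chance <- function(gre, toefl, lor, cgpa, research) {
  sum(coefs * c(1, gre, toefl, lor, cgpa, research))
}

base_case <- predict_chance(gre = 320, toefl = 110, lor = 4, cgpa = 8, research = 1)
higher    <- predict_chance(gre = 320, toefl = 110, lor = 4, cgpa = 9, research = 1)

print(round(c(cgpa_8 = base_case, cgpa_9 = higher, difference = higher - base_case), 4))
```

The difference between the two predictions is exactly the CGPA coefficient, because nothing else about the applicant changed.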

5 Assumption Test

Our linear regression model needs to fulfil four assumptions before we actually use it, so that the model is not biased.

5.1 Linearity

cor.test(admission$Chance.of.Admit, admission$GRE.Score)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  admission$Chance.of.Admit and admission$GRE.Score
#> t = 26.843, df = 398, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.7647419 0.8349536
#> sample estimates:
#>       cor 
#> 0.8026105
cor.test(admission$Chance.of.Admit, admission$TOEFL.Score)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  admission$Chance.of.Admit and admission$TOEFL.Score
#> t = 25.845, df = 398, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.7519028 0.8255675
#> sample estimates:
#>      cor 
#> 0.791594
cor.test(admission$Chance.of.Admit, admission$LOR)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  admission$Chance.of.Admit and admission$LOR
#> t = 18, df = 398, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.6120380 0.7206083
#> sample estimates:
#>       cor 
#> 0.6698888
cor.test(admission$Chance.of.Admit, admission$CGPA)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  admission$Chance.of.Admit and admission$CGPA
#> t = 35.759, df = 398, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.8478354 0.8947275
#> sample estimates:
#>       cor 
#> 0.8732891
cor.test(admission$Chance.of.Admit, admission$Research)
#> 
#>  Pearson's product-moment correlation
#> 
#> data:  admission$Chance.of.Admit and admission$Research
#> t = 13.248, df = 398, p-value < 0.00000000000000022
#> alternative hypothesis: true correlation is not equal to 0
#> 95 percent confidence interval:
#>  0.4812548 0.6177458
#> sample estimates:
#>       cor 
#> 0.5532021

Based on the correlation tests above, all five predictors show a significant correlation with the target (p-value < 0.05).
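
A significant correlation shows that a linear association exists, but it does not rule out curvature. A complementary check is a residuals-versus-fitted plot with a smoothed trend line. The sketch below uses simulated data (the data frame toy and its columns are hypothetical) to show what a linear and a curved relationship look like in such a plot.

```r
# Simulated example: one truly linear response and one curved response.
set.seed(7)
x <- runif(300, 0, 10)
toy <- data.frame(
  x        = x,
  y_lin    = 2 + 0.5 * x + rnorm(300),
  y_curved = 2 + 0.05 * x^2 + rnorm(300)
)

fit_lin    <- lm(y_lin ~ x, data = toy)
fit_curved <- lm(y_curved ~ x, data = toy)

# Residuals vs fitted values; the blue lowess line should stay near zero
# when the linear form is adequate and bend when it is not.
op <- par(mfrow = c(1, 2))
for (fit in list(fit_lin, fit_curved)) {
  plot(fitted(fit), residuals(fit), xlab = "Fitted values", ylab = "Residuals")
  abline(h = 0, col = "red")
  lines(lowess(fitted(fit), residuals(fit)), col = "blue")
}
par(op)
```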

5.2 Normality

hist(admission_model$residuals)

shapiro.test(admission_model$residuals)
#> 
#>  Shapiro-Wilk normality test
#> 
#> data:  admission_model$residuals
#> W = 0.92193, p-value = 0.0000000000001443

H0 = errors are distributed normally

H1 = errors are not distributed normally

As expected, the model did not pass the statistical test (p-value < 0.05, so H0 is rejected). However, the histogram shows a bell curve with a longer tail on the left side, which means we have outliers in the data. Because the number of outliers is small, I decided to leave them as they are and to treat the normality assumption as met on visual grounds.
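
A normal Q-Q plot is another visual check that makes a long tail easier to see than a histogram does. The sketch below uses simulated residuals (res_normal and res_skewed are made up, not the model's residuals) to compare a roughly normal sample with a left-skewed one.

```r
# Simulated residuals: one roughly normal, one with a long left tail.
set.seed(3)
res_normal <- rnorm(400, sd = 0.06)
res_skewed <- -(rexp(400, rate = 1 / 0.06) - 0.06)

# Points that follow the red reference line are consistent with normality;
# a long left tail shows up as points falling below the line on the left.
op <- par(mfrow = c(1, 2))
qqnorm(res_normal, main = "Roughly normal")
qqline(res_normal, col = "red")
qqnorm(res_skewed, main = "Left-skewed")
qqline(res_skewed, col = "red")
par(op)

p_normal <- shapiro.test(res_normal)$p.value
p_skewed <- shapiro.test(res_skewed)$p.value
print(c(normal = p_normal, skewed = p_skewed))
```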

5.3 Homoscedasticity

plot(admission_model$fitted.values, admission_model$residuals)
abline(h = 0, col = 'red')

bptest(admission_model)
#> 
#>  studentized Breusch-Pagan test
#> 
#> data:  admission_model
#> BP = 22.428, df = 5, p-value = 0.0004341

With p-value < 0.05, we can conclude that heteroscedasticity is present. The model did not pass the homoscedasticity test, probably because of the outliers. However, just as with the normality test, the residual plot does not show a clear pattern, so visually the assumption is acceptable.
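
To make the test less of a black box, here is a sketch of how a studentized Breusch-Pagan statistic can be computed by hand: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by n. To my understanding this is what bptest() does by default, but treat that as an assumption. The data are simulated, and the last line only recomputes the p-value from the BP statistic and degrees of freedom printed above.

```r
# Simulated data whose error spread grows with x1 (hypothetical).
set.seed(11)
n  <- 400
x1 <- runif(n)
x2 <- runif(n)
y  <- 1 + 2 * x1 + x2 + rnorm(n, sd = 0.2 + x1)

fit <- lm(y ~ x1 + x2)

# Auxiliary regression of squared residuals on the same predictors.
aux     <- lm(residuals(fit)^2 ~ x1 + x2)
bp_stat <- n * summary(aux)$r.squared
bp_df   <- length(coef(aux)) - 1          # number of predictors
bp_p    <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
print(c(BP = bp_stat, df = bp_df, p.value = bp_p))

# p-value implied by the statistic reported for the admission model.
p_reported <- pchisq(22.428, df = 5, lower.tail = FALSE)
print(p_reported)
```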

5.4 Multi Colinearity

vif(admission_model)
#>   GRE.Score TOEFL.Score         LOR        CGPA    Research 
#>    4.585053    4.104255    1.829491    4.808767    1.530007

No severe multicollinearity found: all VIF values are below 10 (the largest is 4.81, for CGPA).
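
For intuition, the VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on simulated data (the columns gre, toefl and lor are hypothetical stand-ins), then converts the reported VIF of CGPA back into an R².

```r
# Simulated predictors: gre and toefl are correlated, lor is independent.
set.seed(5)
n     <- 400
gre   <- rnorm(n)
toefl <- 0.8 * gre + rnorm(n, sd = 0.6)
lor   <- rnorm(n)
toy   <- data.frame(gre, toefl, lor)

# VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing predictor j on the rest.
vif_manual <- sapply(names(toy), function(v) {
  others <- setdiff(names(toy), v)
  r2 <- summary(lm(reformulate(others, response = v), data = toy))$r.squared
  1 / (1 - r2)
})
print(round(vif_manual, 2))

# The reported VIF of 4.808767 for CGPA corresponds to this R^2:
r2_cgpa <- 1 - 1 / 4.808767
print(round(r2_cgpa, 3))
```

So a VIF of about 4.8 means that roughly 79% of the variance in CGPA is explained by the other four predictors.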

Assumption Test Result :

  • Linearity : all predictors have p-value < 0.05.
  • Normality : the Shapiro-Wilk test rejects normality, but the histogram of the errors is roughly bell-shaped; transforming the target variable is one possible remedy. (Because the data contain several outliers, I judged it visually.)
  • Homoscedasticity : the Breusch-Pagan test detects heteroscedasticity, but the errors look randomly scattered. (Because the data contain several outliers, I judged it visually.)
  • Multicollinearity : all VIF values are below 10, so the correlation among the predictors is not severe.

The model fails the formal normality and homoscedasticity tests because the data contain outliers, but since there are only a few of them, I accept both assumptions based on the visual checks.

hist(scale(admission$Chance.of.Admit))

6 Evaluation

test <- read.csv('AdmissionTest.csv')
test$prediction <- predict(admission_model, newdata = test)
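
A linear model is not bounded, so a prediction for a very strong applicant can exceed 1. As a rough sketch, the code below applies the coefficients from the final model summary to a hypothetical applicant who has the maximum value of every predictor seen in summary(admission), and then clamps the raw prediction to the valid range. Clamping is my own suggestion here, not part of the original analysis.

```r
# Coefficients from summary(admission_model), in the order:
# intercept, GRE.Score, TOEFL.Score, LOR, CGPA, Research.
coefs <- c(-1.2984636, 0.0017820, 0.0030320, 0.0227762, 0.1210042, 0.0245769)

# Hypothetical applicant at the observed maximum of every predictor.
top_applicant <- c(1, 340, 120, 5, 9.92, 1)

raw_prediction <- sum(coefs * top_applicant)

# Keep the prediction inside [0, 1] so it can be read as a chance.
clamped_prediction <- pmin(pmax(raw_prediction, 0), 1)
print(c(raw = raw_prediction, clamped = clamped_prediction))
```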

6.1 MAPE Train and Test

# MAPE TRAIN
MAPE(y_pred = admission_model$fitted.values, y_true=admission$Chance.of.Admit)
#> [1] 0.07317295
# MAPE TEST
MAPE(y_pred = test$prediction, y_true=test$Chance.of.Admit)
#> [1] 0.05130283

MAPE on the test set (5.1%) is smaller than on the train set (7.3%).

6.2 RMSE

# RMSE TRAIN
RMSE(y_pred = admission_model$fitted.values, y_true=admission$Chance.of.Admit)
#> [1] 0.06326207
# RMSE TEST
RMSE(y_pred = test$prediction, y_true=test$Chance.of.Admit)
#> [1] 0.04304102

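The error values are close to each other, and the test error is even lower than the train error, so the model does not appear to overfit. An RMSE of about 0.04 to 0.06 on a 0-to-1 scale means the predictions are close to the actual values.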
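
For reference, both metrics are easy to write by hand. The sketch below defines them using the standard formulas, which I assume match MLmetrics, and applies them to three made-up values. The last lines show that the train RMSE can be recovered from the residual standard error in the model summary, since the two differ only in dividing by n instead of n - p.

```r
# Standard definitions (assumed to match MLmetrics::MAPE and MLmetrics::RMSE).
mape <- function(y_true, y_pred) mean(abs((y_true - y_pred) / y_true))
rmse <- function(y_true, y_pred) sqrt(mean((y_true - y_pred)^2))

# Made-up actual and predicted chances for three applicants.
y_true <- c(0.50, 0.80, 0.90)
y_pred <- c(0.55, 0.76, 0.90)

mape_toy <- mape(y_true, y_pred)
rmse_toy <- rmse(y_true, y_pred)
print(c(MAPE = mape_toy, RMSE = rmse_toy))

# Residual standard error = sqrt(RSS / 394); RMSE = sqrt(RSS / 400).
rmse_from_rse <- 0.06374 * sqrt(394 / 400)
print(round(rmse_from_rse, 5))
```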

7 Conclusion and Recommendation

Conclusion :

  • Our model is a good fit (Adj. R-squared = 0.8).
  • Train and test error values are small and close to each other.

Recommendation : Students who are planning to continue their studies should work hard to get the highest GPA possible (0-10), gain research experience, and try to acquire a strong letter of recommendation, followed by improving their TOEFL and GRE scores, to boost their chance of admission.