Intro

Greetings

Hi! Welcome!
I am exploring a dataset from an external source (Kaggle). I hope you'll enjoy it.

Content

This dataset is about predicting graduate admissions. Our goal is to find out which parameters are important for university admission. In other words, we will predict the chance of admission from those parameters.
The dataset was built to help students shortlist universities based on their profiles. The predicted output gives them a fair idea of their chances at a particular university.

The dataset contains several parameters that are considered important when applying to Master's programs. The parameters included are:
1. GRE Scores (out of 340)
2. TOEFL Scores (out of 120)
3. University Rating (out of 5)
4. Statement of Purpose (out of 5)
5. Letter of Recommendation Strength (out of 5)
6. Undergraduate GPA (out of 10)
7. Research Experience (either 0 or 1)
8. Chance of Admit (ranging from 0 to 1)

Load the Required Libraries

library(dplyr)
library(caret)
library(plotly)
library(GGally)
library(scales)
library(lmtest)
library(MLmetrics)
library(car)

options(scipen = 9999)

Load Dataset

admission <- read.csv("Input Data/Admission_Predict_Ver1.1.csv")
head(admission,20)

  Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
1          1       337         118                 4 4.5 4.5 9.65        1
2          2       324         107                 4 4.0 4.5 8.87        1
3          3       316         104                 3 3.0 3.5 8.00        1
4          4       322         110                 3 3.5 2.5 8.67        1
5          5       314         103                 2 2.0 3.0 8.21        0
6          6       330         115                 5 4.5 3.0 9.34        1
7          7       321         109                 3 3.0 4.0 8.20        1
8          8       308         101                 2 3.0 4.0 7.90        0
  Chance.of.Admit
1            0.92
2            0.76
3            0.72
4            0.80
5            0.65
6            0.90
7            0.75
8            0.68
 [ reached 'max' / getOption("max.print") -- omitted 12 rows ]

Our target variable is ‘Chance.of.Admit’.

Delete the unnecessary ‘Serial.No.’ column:

# Drop the row-index column by name rather than by position
admission <- admission %>% 
  select(-Serial.No.)

Data Preparation

Exploratory Data Analysis

Explore all variables and look for correlations or patterns between them.

glimpse(admission)

Observations: 500
Variables: 8
$ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302,...
$ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102,...
$ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3,...
$ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0,...
$ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5,...
$ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7....
$ Research          <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,...
$ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0....

admission <- admission %>% 
  mutate(Research = as.factor(Research))

All variable types now look appropriate.

Next, check for missing values:

colSums(is.na(admission))

        GRE.Score       TOEFL.Score University.Rating               SOP 
                0                 0                 0                 0 
              LOR              CGPA          Research   Chance.of.Admit 
                0                 0                 0                 0

Cool! There are no missing values in this dataset.

ggcorr(admission, label= TRUE, hjust = 0.8, layout.exp = 2)

Interpretation:
From the graph above, all variables are positively correlated with the target variable. CGPA has the strongest correlation with the target (0.9), followed by TOEFL.Score and GRE.Score at 0.8.
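
The ranking read off the correlation plot can also be computed directly. The sketch below is only an illustration: it uses the 8 rows visible in the head() output above rather than the full 500-row dataset, so the numbers it prints will differ from those in the plot. The names `sample_rows` and `target_cor` are made up for this example.

```r
# First 8 rows of the dataset, copied from the head() output above.
# With so few rows the values are illustrative only.
sample_rows <- data.frame(
  GRE.Score         = c(337, 324, 316, 322, 314, 330, 321, 308),
  TOEFL.Score       = c(118, 107, 104, 110, 103, 115, 109, 101),
  University.Rating = c(4, 4, 3, 3, 2, 5, 3, 2),
  SOP               = c(4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0),
  LOR               = c(4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0),
  CGPA              = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90),
  Chance.of.Admit   = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68)
)

# Pearson correlation of every predictor with the target, strongest first.
predictors <- setdiff(names(sample_rows), "Chance.of.Admit")
target_cor <- sapply(sample_rows[predictors], cor, y = sample_rows$Chance.of.Admit)
target_cor <- sort(target_cor, decreasing = TRUE)
print(round(target_cor, 2))
```

On the real data, the same two lines can be run on the numeric columns of `admission` to get the exact values behind the plot labels.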

Check Data Distributions

Check the distributions of these parameters: University Rating, SOP, LOR, CGPA, and Research

boxplot(admission[,3:7])

There is one outlier in the ‘LOR’ column. Let's check it:

boxplot(admission$LOR, plot = FALSE)$out

[1] 1

admission[admission$LOR==1,]

    GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
348       299          94                 1   1   1 7.34        0
    Chance.of.Admit
348            0.42

After checking, the outlier in LOR (value = 1) belongs to an applicant with a Chance of Admit of 0.42 (a low chance of admission). A single point like this is unlikely to change the model, so I will keep this outlier.
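
For reference, this is roughly how boxplot() decides what counts as an outlier. The sketch uses a made-up vector `lor_demo` (not the real LOR column) and reproduces the default 1.5 × box-height rule by hand, then compares it with boxplot.stats().

```r
# Hypothetical LOR-like ratings (not the real column): mostly 2.5-5, one low value.
lor_demo <- c(1, 2.5, 3, 3, 3.5, 3.5, 4, 4, 4.5, 5)

# boxplot() uses the hinges from fivenum() as the box edges.
hinges <- fivenum(lor_demo)[c(2, 4)]
spread <- diff(hinges)

# Whisker limits: 1.5 times the box height beyond each hinge.
lower_limit <- hinges[1] - 1.5 * spread
upper_limit <- hinges[2] + 1.5 * spread

manual_out <- lor_demo[lor_demo < lower_limit | lor_demo > upper_limit]
print(manual_out)

# The same points are reported by boxplot.stats(), which boxplot() calls internally.
print(boxplot.stats(lor_demo)$out)
```

A point flagged by this rule is only unusual relative to the rest of the column; whether to drop it is still a judgment call, as in the decision above.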

Check the distributions of these parameters: GRE and TOEFL

boxplot(admission[,1:2])

No outliers.

Modeling

Using all Parameters

Target Variable “Chance.of.Admit”

m1 <- lm(Chance.of.Admit~., admission)
m1


Call:
lm(formula = Chance.of.Admit ~ ., data = admission)

Coefficients:
      (Intercept)          GRE.Score        TOEFL.Score  
        -1.275725           0.001859           0.002778  
University.Rating                SOP                LOR  
         0.005941           0.001586           0.016859  
             CGPA          Research1  
         0.118385           0.024307

summary(m1)


Call:
lm(formula = Chance.of.Admit ~ ., data = admission)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.266657 -0.023327  0.009191  0.033714  0.156818 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       -1.2757251  0.1042962 -12.232 < 0.0000000000000002 ***
GRE.Score          0.0018585  0.0005023   3.700             0.000240 ***
TOEFL.Score        0.0027780  0.0008724   3.184             0.001544 ** 
University.Rating  0.0059414  0.0038019   1.563             0.118753    
SOP                0.0015861  0.0045627   0.348             0.728263    
LOR                0.0168587  0.0041379   4.074            0.0000538 ***
CGPA               0.1183851  0.0097051  12.198 < 0.0000000000000002 ***
Research1          0.0243075  0.0066057   3.680             0.000259 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.05999 on 492 degrees of freedom
Multiple R-squared:  0.8219,    Adjusted R-squared:  0.8194 
F-statistic: 324.4 on 7 and 492 DF,  p-value: < 0.00000000000000022

y = -1.2757251 + 0.0018585 GRE.Score + 0.0027780 TOEFL.Score + 0.0059414 University.Rating + 0.0015861 SOP + 0.0168587 LOR + 0.1183851 CGPA + 0.0243075 Research
Adj. R-Squared: 0.8194
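
As a worked example, the equation can be applied by hand to the first applicant in the head() output (GRE 337, TOEFL 118, University Rating 4, SOP 4.5, LOR 4.5, CGPA 9.65, Research 1). The coefficient values are copied from summary(m1); the names `coef_m1` and `applicant` are made up for this sketch.

```r
# Coefficients copied from summary(m1) above.
coef_m1 <- c(
  intercept = -1.2757251, GRE.Score = 0.0018585, TOEFL.Score = 0.0027780,
  University.Rating = 0.0059414, SOP = 0.0015861, LOR = 0.0168587,
  CGPA = 0.1183851, Research = 0.0243075
)

# First applicant from the head() output; the leading 1 multiplies the intercept.
applicant <- c(1, 337, 118, 4, 4.5, 4.5, 9.65, 1)

predicted_chance <- sum(coef_m1 * applicant)
print(round(predicted_chance, 3))  # about 0.952; the observed value is 0.92
```

The Research term is only added for applicants with research experience, because the model codes it as a 0/1 factor.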

Using 3 Parameters

Target Variable “Chance.of.Admit”

In this part we will use 3 predictors: CGPA, TOEFL, and GRE.
Based on the ggcorr result, these are the 3 predictors with the highest correlation with the target (0.8-0.9).

m2 <- lm(Chance.of.Admit~ CGPA+GRE.Score+TOEFL.Score, admission)
m2


Call:
lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + TOEFL.Score, 
    data = admission)

Coefficients:
(Intercept)         CGPA    GRE.Score  TOEFL.Score  
  -1.596809     0.143574     0.002352     0.003199

summary(m2)


Call:
lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + TOEFL.Score, 
    data = admission)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.293061 -0.020722  0.008274  0.036718  0.141429 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -1.5968093  0.0909032 -17.566 < 0.0000000000000002 ***
CGPA         0.1435741  0.0089717  16.003 < 0.0000000000000002 ***
GRE.Score    0.0023519  0.0005007   4.697           0.00000342 ***
TOEFL.Score  0.0031986  0.0008953   3.573             0.000388 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06258 on 496 degrees of freedom
Multiple R-squared:  0.8046,    Adjusted R-squared:  0.8034 
F-statistic: 680.9 on 3 and 496 DF,  p-value: < 0.00000000000000022

y = -1.5968093 + 0.1435741 CGPA + 0.0023519 GRE.Score + 0.0031986 TOEFL.Score
Adj. R-Squared: 0.8034

Stepwise Regression - Backward

Target Variable “Chance.of.Admit”

m3 <- step(m1, direction = "backward")

Start:  AIC=-2805.71
Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
    SOP + LOR + CGPA + Research

                    Df Sum of Sq    RSS     AIC
- SOP                1   0.00043 1.7708 -2807.6
<none>                           1.7704 -2805.7
- University.Rating  1   0.00879 1.7792 -2805.2
- TOEFL.Score        1   0.03648 1.8069 -2797.5
- Research           1   0.04872 1.8191 -2794.1
- GRE.Score          1   0.04926 1.8196 -2794.0
- LOR                1   0.05973 1.8301 -2791.1
- CGPA               1   0.53542 2.3058 -2675.6

Step:  AIC=-2807.59
Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
    LOR + CGPA + Research

                    Df Sum of Sq    RSS     AIC
<none>                           1.7708 -2807.6
- University.Rating  1   0.01190 1.7827 -2806.2
- TOEFL.Score        1   0.03760 1.8084 -2799.1
- Research           1   0.04893 1.8197 -2796.0
- GRE.Score          1   0.04901 1.8198 -2795.9
- LOR                1   0.06892 1.8397 -2790.5
- CGPA               1   0.55954 2.3304 -2672.3
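
The AIC values in this trace can be reproduced from the RSS column. For linear models, step() reports n * log(RSS / n) + 2 * k, where k is the number of estimated coefficients including the intercept. The helper name `step_aic` is made up for this sketch, and the RSS values are the rounded ones printed above, so the results agree only to about one decimal.

```r
n <- 500  # observations, from glimpse(admission)

# AIC as reported by step() for lm fits: n * log(RSS / n) + 2 * k,
# where k counts the estimated coefficients including the intercept.
step_aic <- function(rss, k, n) n * log(rss / n) + 2 * k

aic_full   <- step_aic(rss = 1.7704, k = 8, n = n)  # all 7 predictors
aic_no_sop <- step_aic(rss = 1.7708, k = 7, n = n)  # SOP dropped

print(round(c(full = aic_full, without_SOP = aic_no_sop), 1))
```

Dropping SOP barely changes the RSS but saves one coefficient, which is why the AIC improves and step() removes it. Removing any further variable would raise the AIC, so the search stops.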

summary(m3)


Call:
lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
    LOR + CGPA + Research, data = admission)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.26617 -0.02321  0.00946  0.03345  0.15713 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       -1.2800138  0.1034717 -12.371 < 0.0000000000000002 ***
GRE.Score          0.0018528  0.0005016   3.694             0.000246 ***
TOEFL.Score        0.0028072  0.0008676   3.236             0.001295 ** 
University.Rating  0.0064279  0.0035318   1.820             0.069363 .  
LOR                0.0172873  0.0039464   4.380            0.0000145 ***
CGPA               0.1189994  0.0095344  12.481 < 0.0000000000000002 ***
Research1          0.0243538  0.0065985   3.691             0.000248 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.05993 on 493 degrees of freedom
Multiple R-squared:  0.8219,    Adjusted R-squared:  0.8197 
F-statistic: 379.1 on 6 and 493 DF,  p-value: < 0.00000000000000022

y = -1.2800138 + 0.0018528 GRE.Score + 0.0028072 TOEFL.Score + 0.0064279 University.Rating + 0.0172873 LOR + 0.1189994 CGPA + 0.0243538 Research
Adj. R-Squared: 0.8197

Model Prediction & Error

We now have three models:

m1 - using all parameters
y = -1.2757251 + 0.0018585 GRE.Score + 0.0027780 TOEFL.Score + 0.0059414 University.Rating + 0.0015861 SOP + 0.0168587 LOR + 0.1183851 CGPA + 0.0243075 Research
Adj. R-Squared: 0.8194
m2 - using 3 parameters (CGPA + GRE.Score + TOEFL.Score)
y = -1.5968093 + 0.1435741 CGPA + 0.0023519 GRE.Score + 0.0031986 TOEFL.Score
Adj. R-Squared: 0.8034
m3 - model from backward stepwise regression
y = -1.2800138 + 0.0018528 GRE.Score + 0.0028072 TOEFL.Score + 0.0064279 University.Rating + 0.0172873 LOR + 0.1189994 CGPA + 0.0243538 Research
Adj. R-Squared: 0.8197

Prediction

# Passing the full data frame is enough: predict() uses only the
# columns that each model's formula refers to.
predict_m1 <- predict(m1, newdata = admission)

predict_m2 <- predict(m2, newdata = admission)

predict_m3 <- predict(m3, newdata = admission)

MSE

MSE(m1$fitted.values, admission$Chance.of.Admit)

[1] 0.003540751

MSE(m2$fitted.values, admission$Chance.of.Admit)

[1] 0.003884371

MSE(m3$fitted.values, admission$Chance.of.Admit)

[1] 0.003541621
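
MSE here is simply the mean of the squared differences between predicted and actual values. The sketch below writes it out in base R (the name `mse_manual` is made up for this example) and then checks it against the RSS values printed by step() earlier, since for in-sample fitted values MSE equals RSS / n.

```r
# Hand-rolled MSE: mean of squared differences between predictions and actual values.
mse_manual <- function(predicted, actual) mean((predicted - actual)^2)

# Toy check with made-up numbers (not from the dataset).
print(mse_manual(c(0.90, 0.70, 0.60), c(0.92, 0.76, 0.65)))

# For in-sample fitted values, MSE is the residual sum of squares divided by n.
# Using the RSS values printed by step() above (n = 500):
print(1.7704 / 500)  # full model, close to the 0.003540751 reported for m1
print(1.7708 / 500)  # model without SOP, close to the 0.003541621 reported for m3
```

Because these are in-sample errors, the model with more predictors can never have a higher MSE. That is one reason to also look at the adjusted R-squared.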

adj.R-Squared

paste("m1 adj.R-Squared:", summary(m1)$adj.r.squared)

[1] "m1 adj.R-Squared: 0.819366806950006"

paste("m2 adj.R-Squared:", summary(m2)$adj.r.squared)

[1] "m2 adj.R-Squared: 0.803434925701783"

paste("m3 adj.R-Squared:", summary(m3)$adj.r.squared)

[1] "m3 adj.R-Squared: 0.819688923884365"
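
The adjusted R-squared values can be recovered from the plain R-squared with the usual formula 1 - (1 - R²)(n - 1) / (n - p - 1), where p is the number of predictors. The helper name `adj_r2` is made up for this sketch, and the inputs are the rounded R-squared values printed by summary().

```r
# Adjusted R-squared: penalizes R-squared for the number of predictors p.
adj_r2 <- function(r2, n, p) 1 - (1 - r2) * (n - 1) / (n - p - 1)

n <- 500
# R-squared values are the rounded ones printed by summary(), so results
# agree with the reported adjusted values only to about three decimals.
print(adj_r2(0.8219, n, p = 7))  # m1
print(adj_r2(0.8046, n, p = 3))  # m2
print(adj_r2(0.8219, n, p = 6))  # m3
```

m1 and m3 have the same rounded R-squared, but m3 uses one fewer predictor, so its penalty is smaller and its adjusted value is slightly higher.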

From the comparison above, m3 is the preferred model: it has the highest adjusted R-squared (0.8197), and its MSE (0.003541621) is practically identical to that of m1 (0.003540751) while using one fewer predictor.

Evaluation

Linearity Check

names(admission)

[1] "GRE.Score"         "TOEFL.Score"       "University.Rating"
[4] "SOP"               "LOR"               "CGPA"             
[7] "Research"          "Chance.of.Admit"

cor.test(admission$GRE.Score, admission$Chance.of.Admit)


    Pearson's product-moment correlation

data:  admission$GRE.Score and admission$Chance.of.Admit
t = 30.862, df = 498, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7779406 0.8384601
sample estimates:
      cor 
0.8103506

cor.test(admission$TOEFL.Score, admission$Chance.of.Admit)


    Pearson's product-moment correlation

data:  admission$TOEFL.Score and admission$Chance.of.Admit
t = 28.972, df = 498, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7571359 0.8227603
sample estimates:
      cor 
0.7922276

cor.test(admission$University.Rating, admission$Chance.of.Admit)


    Pearson's product-moment correlation

data:  admission$University.Rating and admission$Chance.of.Admit
t = 21.281, df = 498, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6412490 0.7334367
sample estimates:
      cor 
0.6901324

cor.test(admission$SOP, admission$Chance.of.Admit)


    Pearson's product-moment correlation

data:  admission$SOP and admission$Chance.of.Admit
t = 20.932, df = 498, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6345118 0.7281441
sample estimates:
      cor 
0.6841365

cor.test(admission$LOR, admission$Chance.of.Admit)


    Pearson's product-moment correlation

data:  admission$LOR and admission$Chance.of.Admit
t = 18.854, df = 498, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5911272 0.6937918
sample estimates:
      cor 
0.6453645

cor.test(admission$CGPA, admission$Chance.of.Admit)


    Pearson's product-moment correlation

data:  admission$CGPA and admission$Chance.of.Admit
t = 41.855, df = 498, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8613745 0.9004286
sample estimates:
      cor 
0.8824126

All p-values are < 0.05, so every numeric predictor has a significant linear correlation with the target.
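
The t statistic that cor.test() reports follows directly from the correlation and the sample size: t = r * sqrt(n - 2) / sqrt(1 - r²). The sketch below reproduces the GRE.Score result from the numbers printed above. Note that a significant correlation indicates a linear association but does not by itself rule out curvature; a residuals-versus-fitted plot is a useful complementary check.

```r
r <- 0.8103506  # correlation between GRE.Score and Chance.of.Admit, from above
n <- 500

# Test statistic used by cor.test() for a Pearson correlation.
t_stat  <- r * sqrt(n - 2) / sqrt(1 - r^2)
p_value <- 2 * pt(-abs(t_stat), df = n - 2)  # two-sided p-value

print(round(t_stat, 3))
print(p_value)
```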

Normality of Residual

plot(density(m1$residuals))

plot(density(m2$residuals))

plot(density(m3$residuals))

From the graphs above, the models have the potential to produce underpredicted results (the median residual is slightly above zero in all three model summaries).

shapiro.test(m1$residuals)


    Shapiro-Wilk normality test

data:  m1$residuals
W = 0.92549, p-value = 0.000000000000004824

shapiro.test(m2$residuals)


    Shapiro-Wilk normality test

data:  m2$residuals
W = 0.92927, p-value = 0.0000000000000128

shapiro.test(m3$residuals)


    Shapiro-Wilk normality test

data:  m3$residuals
W = 0.92626, p-value = 0.000000000000005868

All p-values are < 0.05, so the residuals are not normally distributed.
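
The residual summaries above hint at why normality is rejected: the most negative residual (about -0.27 for m1) is much further from zero than the most positive one (about 0.16), which suggests a long left tail. The sketch below illustrates this on synthetic data only (not the model residuals), using a hand-written sample skewness; `skewed_resid` and `skewness` are made-up names.

```r
set.seed(123)  # synthetic residuals for illustration only

# Left-skewed values: a long tail of large negative errors, centered at zero.
skewed_resid <- -(rexp(500) - 1)

# Sample skewness: third standardized moment (negative = long left tail).
skewness <- function(x) mean((x - mean(x))^3) / mean((x - mean(x))^2)^1.5

print(skewness(skewed_resid))
print(shapiro.test(skewed_resid)$p.value)

# The same statistic on roughly symmetric noise, for comparison.
print(skewness(rnorm(500)))
```

Applying `skewness()` to `m1$residuals`, `m2$residuals`, and `m3$residuals` would show whether the real residuals share this shape.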

Homoscedasticity

# Residuals against fitted values; a funnel shape suggests non-constant variance
plot(m1$fitted.values, m1$residuals)

plot(m2$fitted.values, m2$residuals)

plot(m3$fitted.values, m3$residuals)

plot(m1)

plot(m2)

plot(m3)

bptest(m1)


    studentized Breusch-Pagan test

data:  m1
BP = 30.516, df = 7, p-value = 0.00007634

bptest(m2)


    studentized Breusch-Pagan test

data:  m2
BP = 15.34, df = 3, p-value = 0.001548

bptest(m3)


    studentized Breusch-Pagan test

data:  m3
BP = 27.599, df = 6, p-value = 0.0001118

All p-values are < 0.05, so the residuals are heteroscedastic: the error variance is not constant and the errors show a pattern.
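
To show what the test measures, here is a minimal base-R sketch of the studentized Breusch-Pagan statistic on synthetic data (not the admission dataset). It regresses the squared residuals on the predictors and uses n times the R-squared of that auxiliary regression. This should match what bptest() reports with its default settings, but treat that as an assumption to verify rather than a guarantee.

```r
set.seed(42)  # synthetic data for illustration only, not the admission dataset

n <- 500
x <- runif(n, 1, 10)
y <- 2 + 0.5 * x + rnorm(n, sd = 0.3 * x)  # noise grows with x: heteroscedastic

fit <- lm(y ~ x)

# Auxiliary regression: squared residuals on the same predictor(s).
aux <- lm(residuals(fit)^2 ~ x)

# Studentized (Koenker) Breusch-Pagan statistic: n times the auxiliary R-squared,
# compared against a chi-squared distribution with one df per predictor.
bp_stat <- n * summary(aux)$r.squared
bp_p    <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

print(c(BP = bp_stat, p.value = bp_p))
```

A small p-value means the squared residuals can be predicted from the predictors, which is exactly the pattern the test above detected in m1, m2, and m3.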

Little to no Multicollinearity

vif(m1)

        GRE.Score       TOEFL.Score University.Rating               SOP 
         4.464249          3.904213          2.621036          2.835210 
              LOR              CGPA          Research 
         2.033555          4.777992          1.494008

vif(m2)

       CGPA   GRE.Score TOEFL.Score 
   3.752142    4.075746    3.778121

vif(m3)

        GRE.Score       TOEFL.Score University.Rating               LOR 
         4.459541          3.867976          2.265898          1.853078 
             CGPA          Research 
         4.619554          1.493400

All VIF values are < 10, so there is no serious multicollinearity.
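
For a numeric predictor, the VIF is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand on synthetic predictors (`x1`, `x2`, `x3` are made up and unrelated to the admission data), so it only illustrates the calculation behind vif().

```r
set.seed(1)  # synthetic predictors for illustration only

n  <- 200
x1 <- rnorm(n)
x2 <- 0.8 * x1 + rnorm(n, sd = 0.6)  # deliberately correlated with x1
x3 <- rnorm(n)                       # independent of the others
predictors <- data.frame(x1, x2, x3)

# VIF for one predictor: regress it on all the others, then 1 / (1 - R^2).
vif_manual <- sapply(names(predictors), function(name) {
  others  <- setdiff(names(predictors), name)
  aux_fit <- lm(reformulate(others, response = name), data = predictors)
  1 / (1 - summary(aux_fit)$r.squared)
})

print(round(vif_manual, 2))
```

The correlated pair gets a clearly higher VIF than the independent predictor. In the models above, the largest values (around 4 to 5 for CGPA and GRE.Score) reflect how strongly those scores move together.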

Conclusion

In conclusion, of the three models (all parameters, 3 parameters, and backward stepwise), m3 has the highest adjusted R-squared. It was built with backward stepwise regression, uses University Rating, TOEFL, Research, GRE, LOR, and CGPA as parameters, and has an MSE of 0.003541621.

Model validation shows a mixed picture. All three models pass the linearity check, meaning there is a linear relationship between the parameters and the target variable (chance of admit). In addition, all three models have VIF values below 10, meaning there is no high correlation among the parameters used.

However, all three models fail the normality test and the homoscedasticity test, which means the errors are not normally distributed and show a pattern. In this situation we have some options. First, we may continue using the model with some caveats. Second, we may transform the non-normal dependent variable toward a normal shape (for example, with a Box-Cox transformation). If we keep and use these models, we need to remember that they may produce underpredicted results (see the density plots in the Evaluation section), and that we could improve the model by adding new data in the future.
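
As a rough illustration of the Box-Cox option, the sketch below picks the transformation parameter lambda by maximizing the profile log-likelihood over a grid, using base R only. The data are synthetic (not the admission dataset), and `box_cox` and `profile_loglik` are made-up helper names; the MASS package offers a packaged version of the same idea in its boxcox() function. The response must be strictly positive, which holds for Chance.of.Admit.

```r
set.seed(7)  # synthetic, strictly positive response for illustration only

n <- 500
x <- runif(n, 0, 2)
y <- exp(1 + 0.8 * x + rnorm(n, sd = 0.3))  # right-skewed; linear on the log scale

# Box-Cox transform: log(y) when lambda is 0, otherwise (y^lambda - 1) / lambda.
box_cox <- function(y, lambda) {
  if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
}

# Profile log-likelihood of lambda for the regression y ~ x.
profile_loglik <- function(lambda) {
  rss <- sum(residuals(lm(box_cox(y, lambda) ~ x))^2)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

lambdas     <- seq(-2, 2, by = 0.1)
loglik      <- sapply(lambdas, profile_loglik)
best_lambda <- lambdas[which.max(loglik)]
print(best_lambda)

# Refit on the transformed response and re-check residual normality.
fit_bc <- lm(box_cox(y, best_lambda) ~ x)
print(shapiro.test(residuals(fit_bc))$p.value)
```

After refitting on the transformed target, the normality and homoscedasticity checks from the Evaluation section would need to be repeated, and predictions converted back to the original scale.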

Graduate Admission

Widya Kania Rahayu

November 18, 2019
