Intro

Greetings

Hi, and welcome!
I am using a dataset from an external source (Kaggle). I hope you enjoy it.

Content

This dataset is about predicting graduate admissions. Our goal is to find out which parameters matter most for university admission. In other words, we will predict the chance of admission from those parameters.
The dataset was built to help students shortlist universities based on their profiles. The predicted output gives them a fair idea of their chances at a particular university.

The dataset contains several parameters that are considered important when applying to Masters programs. The parameters included are:
1. GRE Scores (out of 340)
2. TOEFL Scores (out of 120)
3. University Rating (out of 5)
4. Statement of Purpose (out of 5)
5. Letter of Recommendation Strength (out of 5)
6. Undergraduate GPA (out of 10)
7. Research Experience (either 0 or 1)
8. Chance of Admit (ranging from 0 to 1)

Load the Required Libraries

Load Dataset

  Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
1          1       337         118                 4 4.5 4.5 9.65        1
2          2       324         107                 4 4.0 4.5 8.87        1
3          3       316         104                 3 3.0 3.5 8.00        1
4          4       322         110                 3 3.5 2.5 8.67        1
5          5       314         103                 2 2.0 3.0 8.21        0
6          6       330         115                 5 4.5 3.0 9.34        1
7          7       321         109                 3 3.0 4.0 8.20        1
8          8       308         101                 2 3.0 4.0 7.90        0
  Chance.of.Admit
1            0.92
2            0.76
3            0.72
4            0.80
5            0.65
6            0.90
7            0.75
8            0.68
 [ reached 'max' / getOption("max.print") -- omitted 12 rows ]

Our target variable is ‘Chance.of.Admit’.

Delete the unnecessary ‘Serial.No.’ column.
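A minimal sketch of this step. It assumes the data frame is called admission (the name used in the lm() calls later) and rebuilds only the first three printed rows with a subset of columns, so it can run on its own:

```r
# First three rows of the printed data, with a subset of columns for brevity.
admission <- data.frame(
  Serial.No.      = 1:3,
  GRE.Score       = c(337L, 324L, 316L),
  CGPA            = c(9.65, 8.87, 8.00),
  Chance.of.Admit = c(0.92, 0.76, 0.72)
)

# Serial.No. is only a row identifier, so drop it before modeling.
admission$Serial.No. <- NULL

names(admission)
```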

Data Preparation

Exploratory Data Analysis

Explore all variables and look for correlations and patterns between them.

Observations: 500
Variables: 8
$ GRE.Score         <int> 337, 324, 316, 322, 314, 330, 321, 308, 302,...
$ TOEFL.Score       <int> 118, 107, 104, 110, 103, 115, 109, 101, 102,...
$ University.Rating <int> 4, 4, 3, 3, 2, 5, 3, 2, 1, 3, 3, 4, 4, 3, 3,...
$ SOP               <dbl> 4.5, 4.0, 3.0, 3.5, 2.0, 4.5, 3.0, 3.0, 2.0,...
$ LOR               <dbl> 4.5, 4.5, 3.5, 2.5, 3.0, 3.0, 4.0, 4.0, 1.5,...
$ CGPA              <dbl> 9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7....
$ Research          <int> 1, 1, 1, 1, 0, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,...
$ Chance.of.Admit   <dbl> 0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0....

All variable types look appropriate.

Next, check for missing values:

        GRE.Score       TOEFL.Score University.Rating               SOP 
                0                 0                 0                 0 
              LOR              CGPA          Research   Chance.of.Admit 
                0                 0                 0                 0 

Cool! There are no missing values in this dataset.
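Per-column counts like the ones above can be produced with colSums(is.na(...)). The sketch below is only an illustration: it uses a tiny hypothetical data frame with one missing value added on purpose, so the count is visibly non-zero (the real dataset has none).

```r
# Hypothetical toy data; the NA is inserted deliberately for illustration.
toy <- data.frame(
  GRE.Score = c(337L, 324L, NA),
  CGPA      = c(9.65, 8.87, 8.00)
)

# is.na() marks missing cells; colSums() counts them per column.
na_counts <- colSums(is.na(toy))
na_counts
```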

Interpretation:
From the graph above, all variables are positively correlated with the target variable. CGPA has the strongest correlation with the target (about 0.9), followed by TOEFL.Score and GRE.Score (about 0.8).
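The same kind of ranking can be read from a plain correlation matrix. The sketch below uses base R's cor() as a stand-in for the correlation plot and only the eight rows printed earlier, so its numbers will not match the full 500-row result:

```r
# The eight printed rows, restricted to four columns.
admission <- data.frame(
  GRE.Score       = c(337, 324, 316, 322, 314, 330, 321, 308),
  TOEFL.Score     = c(118, 107, 104, 110, 103, 115, 109, 101),
  CGPA            = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34, 8.20, 7.90),
  Chance.of.Admit = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68)
)

# Pearson correlation of every column with the target, strongest first.
cor_matrix <- cor(admission)
target_cor <- sort(cor_matrix[, "Chance.of.Admit"], decreasing = TRUE)
round(target_cor, 2)
```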

Check Data Distributions

Check the distributions of these parameters: University Rating, LOR, SOP, CGPA, Research

There is 1 outlier in the ‘LOR’ column. Let's check it:

[1] 1
    GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
348       299          94                 1   1   1 7.34        0
    Chance.of.Admit
348            0.42

The outlier in LOR (value = 1) has a Chance of Admit of 0.42, which suggests the applicant would not be admitted. This single observation should not change the model, so I will keep it.
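One common way to flag such a point is the boxplot rule (values more than 1.5 times the interquartile range beyond the hinges). The sketch below assumes that rule and uses a short hypothetical vector of LOR-like scores, not the real column:

```r
# Hypothetical LOR-like scores; only the first value is unusually low.
lor <- c(1, 2.5, 3, 3, 3.5, 3.5, 3.5, 4, 4, 4.5, 4.5, 5)

# boxplot.stats() returns the points that fall outside the whiskers in $out.
out <- boxplot.stats(lor)$out
out

# Positions of the flagged values, useful for looking up the full rows.
which(lor %in% out)
```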

Check the distributions of these parameters: GRE and TOEFL

No outliers.

Modeling

Using all Parameters

Target Variable “Chance.of.Admit”


Call:
lm(formula = Chance.of.Admit ~ ., data = admission)

Coefficients:
      (Intercept)          GRE.Score        TOEFL.Score  
        -1.275725           0.001859           0.002778  
University.Rating                SOP                LOR  
         0.005941           0.001586           0.016859  
             CGPA          Research1  
         0.118385           0.024307  

Call:
lm(formula = Chance.of.Admit ~ ., data = admission)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.266657 -0.023327  0.009191  0.033714  0.156818 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       -1.2757251  0.1042962 -12.232 < 0.0000000000000002 ***
GRE.Score          0.0018585  0.0005023   3.700             0.000240 ***
TOEFL.Score        0.0027780  0.0008724   3.184             0.001544 ** 
University.Rating  0.0059414  0.0038019   1.563             0.118753    
SOP                0.0015861  0.0045627   0.348             0.728263    
LOR                0.0168587  0.0041379   4.074            0.0000538 ***
CGPA               0.1183851  0.0097051  12.198 < 0.0000000000000002 ***
Research1          0.0243075  0.0066057   3.680             0.000259 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.05999 on 492 degrees of freedom
Multiple R-squared:  0.8219,    Adjusted R-squared:  0.8194 
F-statistic: 324.4 on 7 and 492 DF,  p-value: < 0.00000000000000022

y = -1.2757251 + 0.0018585 GRE.Score + 0.0027780 TOEFL.Score + 0.0059414 University.Rating + 0.0015861 SOP + 0.0168587 LOR + 0.1183851 CGPA + 0.0243075 Research
Adj. R-squared: 0.8194
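A minimal sketch of fitting a model of this form. The data are synthetic (all values are made up, only the column names follow the dataset), and Research is stored as a factor, which is an assumption inferred from the Research1 coefficient in the output above:

```r
set.seed(1)  # reproducible synthetic data
n <- 200

# Synthetic stand-in for the admissions data; every value is made up.
admission <- data.frame(
  GRE.Score         = round(rnorm(n, 316, 11)),
  TOEFL.Score       = round(rnorm(n, 107, 6)),
  University.Rating = sample(1:5, n, replace = TRUE),
  SOP               = sample(seq(1, 5, by = 0.5), n, replace = TRUE),
  LOR               = sample(seq(1, 5, by = 0.5), n, replace = TRUE),
  CGPA              = round(rnorm(n, 8.6, 0.6), 2),
  Research          = factor(sample(0:1, n, replace = TRUE))
)
admission$Chance.of.Admit <- with(
  admission,
  -1.3 + 0.002 * GRE.Score + 0.003 * TOEFL.Score + 0.12 * CGPA +
    rnorm(n, sd = 0.06)
)

# "~ ." regresses the target on every other column in the data frame.
m1 <- lm(Chance.of.Admit ~ ., data = admission)

coef(m1)
summary(m1)$adj.r.squared
```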

Using 3 Parameters

Target Variable “Chance.of.Admit”

In this part we use 3 predictors: CGPA, TOEFL and GRE.
Based on the ggcorr result, these are the 3 predictors with the highest correlation with the target (0.8-0.9).


Call:
lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + TOEFL.Score, 
    data = admission)

Coefficients:
(Intercept)         CGPA    GRE.Score  TOEFL.Score  
  -1.596809     0.143574     0.002352     0.003199  

Call:
lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + TOEFL.Score, 
    data = admission)

Residuals:
      Min        1Q    Median        3Q       Max 
-0.293061 -0.020722  0.008274  0.036718  0.141429 

Coefficients:
              Estimate Std. Error t value             Pr(>|t|)    
(Intercept) -1.5968093  0.0909032 -17.566 < 0.0000000000000002 ***
CGPA         0.1435741  0.0089717  16.003 < 0.0000000000000002 ***
GRE.Score    0.0023519  0.0005007   4.697           0.00000342 ***
TOEFL.Score  0.0031986  0.0008953   3.573             0.000388 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.06258 on 496 degrees of freedom
Multiple R-squared:  0.8046,    Adjusted R-squared:  0.8034 
F-statistic: 680.9 on 3 and 496 DF,  p-value: < 0.00000000000000022

y = -1.5968093 + 0.1435741 CGPA + 0.0023519 GRE.Score + 0.0031986 TOEFL.Score
Adj. R-squared: 0.8034
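As a worked example, the fitted equation can be applied by hand to the first applicant in the printed data (CGPA 9.65, GRE 337, TOEFL 118, observed Chance of Admit 0.92). The vector names below are illustrative, not taken from the original code:

```r
# Coefficients copied from the 3-predictor model output.
coefs <- c(intercept = -1.5968093, CGPA = 0.1435741,
           GRE.Score = 0.0023519, TOEFL.Score = 0.0031986)

# Row 1 of the printed data.
applicant <- c(CGPA = 9.65, GRE.Score = 337, TOEFL.Score = 118)

# Intercept plus the sum of coefficient * value for each predictor.
pred <- coefs[["intercept"]] + sum(coefs[names(applicant)] * applicant)
round(pred, 3)  # about 0.959, against an observed value of 0.92
```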

Stepwise Regression - Backward

Target Variable “Chance.of.Admit”

Start:  AIC=-2805.71
Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
    SOP + LOR + CGPA + Research

                    Df Sum of Sq    RSS     AIC
- SOP                1   0.00043 1.7708 -2807.6
<none>                           1.7704 -2805.7
- University.Rating  1   0.00879 1.7792 -2805.2
- TOEFL.Score        1   0.03648 1.8069 -2797.5
- Research           1   0.04872 1.8191 -2794.1
- GRE.Score          1   0.04926 1.8196 -2794.0
- LOR                1   0.05973 1.8301 -2791.1
- CGPA               1   0.53542 2.3058 -2675.6

Step:  AIC=-2807.59
Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
    LOR + CGPA + Research

                    Df Sum of Sq    RSS     AIC
<none>                           1.7708 -2807.6
- University.Rating  1   0.01190 1.7827 -2806.2
- TOEFL.Score        1   0.03760 1.8084 -2799.1
- Research           1   0.04893 1.8197 -2796.0
- GRE.Score          1   0.04901 1.8198 -2795.9
- LOR                1   0.06892 1.8397 -2790.5
- CGPA               1   0.55954 2.3304 -2672.3

Call:
lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
    LOR + CGPA + Research, data = admission)

Residuals:
     Min       1Q   Median       3Q      Max 
-0.26617 -0.02321  0.00946  0.03345  0.15713 

Coefficients:
                    Estimate Std. Error t value             Pr(>|t|)    
(Intercept)       -1.2800138  0.1034717 -12.371 < 0.0000000000000002 ***
GRE.Score          0.0018528  0.0005016   3.694             0.000246 ***
TOEFL.Score        0.0028072  0.0008676   3.236             0.001295 ** 
University.Rating  0.0064279  0.0035318   1.820             0.069363 .  
LOR                0.0172873  0.0039464   4.380            0.0000145 ***
CGPA               0.1189994  0.0095344  12.481 < 0.0000000000000002 ***
Research1          0.0243538  0.0065985   3.691             0.000248 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.05993 on 493 degrees of freedom
Multiple R-squared:  0.8219,    Adjusted R-squared:  0.8197 
F-statistic: 379.1 on 6 and 493 DF,  p-value: < 0.00000000000000022

y = -1.2800138 + 0.0018528 GRE.Score + 0.0028072 TOEFL.Score + 0.0064279 University.Rating + 0.0172873 LOR + 0.1189994 CGPA + 0.0243538 Research
Adj. R-squared: 0.8197
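A minimal sketch of backward elimination with step() from base R, which removes one term at a time while the AIC keeps improving. The data are synthetic (values are made up, SOP is pure noise by construction), so which terms survive here says nothing about the real dataset:

```r
set.seed(2)
n <- 200

# Synthetic data: SOP has no effect on the target by construction.
admission <- data.frame(
  GRE.Score = round(rnorm(n, 316, 11)),
  SOP       = sample(seq(1, 5, by = 0.5), n, replace = TRUE),
  LOR       = sample(seq(1, 5, by = 0.5), n, replace = TRUE),
  CGPA      = round(rnorm(n, 8.6, 0.6), 2)
)
admission$Chance.of.Admit <- with(
  admission,
  -1.3 + 0.003 * GRE.Score + 0.02 * LOR + 0.12 * CGPA + rnorm(n, sd = 0.06)
)

# Start from the full model and let step() drop terms by AIC.
full <- lm(Chance.of.Admit ~ ., data = admission)
m3 <- step(full, direction = "backward", trace = 0)

formula(m3)
c(full = AIC(full), backward = AIC(m3))
```

Setting trace = 0 only hides the step-by-step table; leaving it at the default prints tables like the ones shown above.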

Model Prediction & Error

We now have 3 models:

  1. m1 - using all parameters
    y = -1.2757251 + 0.0018585 GRE.Score + 0.0027780 TOEFL.Score + 0.0059414 University.Rating + 0.0015861 SOP + 0.0168587 LOR + 0.1183851 CGPA + 0.0243075 Research

  2. m2 - using 3 parameters (CGPA + GRE.Score + TOEFL.Score)
    y = -1.5968093 + 0.1435741 CGPA + 0.0023519 GRE.Score + 0.0031986 TOEFL.Score

  3. m3 - model from stepwise backward regression
    y = -1.2800138 + 0.0018528 GRE.Score + 0.0028072 TOEFL.Score + 0.0064279 University.Rating + 0.0172873 LOR + 0.1189994 CGPA + 0.0243538 Research

adj.R-Squared

[1] "m1 adj.R-Squared: 0.819366806950006"
[1] "m2 adj.R-Squared: 0.803434925701783"
[1] "m3 adj.R-Squared: 0.819688923884365"

From the comparison above, m3 is the best of the three models, with the highest adj. R-squared (0.8197, only marginally above m1 at 0.8194) and an MSE of 0.00354162.
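The reported error is consistent with the mean squared residual on the training data, that is RSS / n. The sketch below checks that arithmetic with the RSS printed in the stepwise output (1.7708 over 500 observations) and defines a small helper, mse(), which is a hypothetical name. The helper is demonstrated on R's built-in cars data, which is unrelated to admissions:

```r
# Mean squared error of a fitted model on its own training data.
mse <- function(model) mean(residuals(model)^2)

# Arithmetic check: RSS of the final stepwise model divided by n.
rss_m3 <- 1.7708
n_obs  <- 500
mse_m3 <- rss_m3 / n_obs
mse_m3  # 0.0035416, matching the reported value up to rounding

# The helper applied to an unrelated built-in dataset.
fit <- lm(dist ~ speed, data = cars)
mse(fit)
```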

Evaluation

Linearity Check

[1] "GRE.Score"         "TOEFL.Score"       "University.Rating"
[4] "SOP"               "LOR"               "CGPA"             
[7] "Research"          "Chance.of.Admit"  

    Pearson's product-moment correlation

data:  admission$GRE.Score and admission$Chance.of.Admit
t = 30.862, df = 498, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7779406 0.8384601
sample estimates:
      cor 
0.8103506 

    Pearson's product-moment correlation

data:  admission$TOEFL.Score and admission$Chance.of.Admit
t = 28.972, df = 498, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.7571359 0.8227603
sample estimates:
      cor 
0.7922276 

    Pearson's product-moment correlation

data:  admission$University.Rating and admission$Chance.of.Admit
t = 21.281, df = 498, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6412490 0.7334367
sample estimates:
      cor 
0.6901324 

    Pearson's product-moment correlation

data:  admission$SOP and admission$Chance.of.Admit
t = 20.932, df = 498, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.6345118 0.7281441
sample estimates:
      cor 
0.6841365 

    Pearson's product-moment correlation

data:  admission$LOR and admission$Chance.of.Admit
t = 18.854, df = 498, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5911272 0.6937918
sample estimates:
      cor 
0.6453645 

    Pearson's product-moment correlation

data:  admission$CGPA and admission$Chance.of.Admit
t = 41.855, df = 498, p-value < 0.00000000000000022
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.8613745 0.9004286
sample estimates:
      cor 
0.8824126 

All p-values are < 0.05, so each predictor tested has a significant linear correlation with the target.
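Each block of output above has the shape returned by cor.test(). A minimal sketch on the eight printed rows only, so the estimate and p-value will differ from the full-data results:

```r
# GRE score and chance of admit for the eight printed rows.
gre    <- c(337, 324, 316, 322, 314, 330, 321, 308)
chance <- c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90, 0.75, 0.68)

# Pearson test of the null hypothesis that the true correlation is 0.
ct <- cor.test(gre, chance, method = "pearson")

ct$estimate   # sample correlation
ct$p.value    # small values indicate a significant linear association
```

A significant Pearson correlation supports a linear association, but a scatter plot or a residuals-versus-fitted plot is still worth a look, since the test alone does not rule out curvature.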

Normality of Residual

The residual density plot above suggests that the models have the potential to underpredict.


    Shapiro-Wilk normality test

data:  m1$residuals
W = 0.92549, p-value = 0.000000000000004824

    Shapiro-Wilk normality test

data:  m2$residuals
W = 0.92927, p-value = 0.0000000000000128

    Shapiro-Wilk normality test

data:  m3$residuals
W = 0.92626, p-value = 0.000000000000005868

*p-value < 0.05: the residuals are not normally distributed.
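A minimal sketch of the same test on a model whose errors are skewed on purpose. The data are synthetic and the variable names are illustrative:

```r
set.seed(3)
n <- 200

x     <- runif(n, 6.8, 9.9)           # hypothetical predictor on a CGPA-like scale
noise <- 0.05 - rexp(n, rate = 20)    # mean-zero noise with a long left tail
y     <- -0.3 + 0.12 * x + noise

fit <- lm(y ~ x)

# Shapiro-Wilk test on the residuals; the null hypothesis is normality.
sw <- shapiro.test(residuals(fit))
sw$statistic
sw$p.value   # below 0.05 rejects normality of the residuals
```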

Homoscedasticity


    studentized Breusch-Pagan test

data:  m1
BP = 30.516, df = 7, p-value = 0.00007634

    studentized Breusch-Pagan test

data:  m2
BP = 15.34, df = 3, p-value = 0.001548

    studentized Breusch-Pagan test

data:  m3
BP = 27.599, df = 6, p-value = 0.0001118

*p-value < 0.05: the residuals are heteroscedastic, so the errors show a pattern.
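To show what the studentized Breusch-Pagan statistic measures, here is a base-R sketch of the usual construction: regress the squared residuals on the predictors and compare n times the R-squared of that auxiliary regression with a chi-squared distribution. The function name bp_studentized is hypothetical, the data are synthetic, and this is an illustration of the idea rather than the code behind the output above:

```r
# Studentized (Koenker) Breusch-Pagan statistic for a fitted lm.
bp_studentized <- function(model) {
  e2 <- residuals(model)^2
  X  <- model.matrix(model)[, -1, drop = FALSE]   # predictors without intercept
  aux  <- lm(e2 ~ X)                              # auxiliary regression
  stat <- length(e2) * summary(aux)$r.squared     # n * R^2
  df   <- ncol(X)
  c(BP = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

set.seed(4)
x <- runif(300, 1, 10)
y <- 2 + 0.5 * x + rnorm(300, sd = 0.3 * x)   # noise grows with x on purpose

fit <- lm(y ~ x)
res <- bp_studentized(fit)
res
```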

Little to no Multicollinearity

        GRE.Score       TOEFL.Score University.Rating               SOP 
         4.464249          3.904213          2.621036          2.835210 
              LOR              CGPA          Research 
         2.033555          4.777992          1.494008 
       CGPA   GRE.Score TOEFL.Score 
   3.752142    4.075746    3.778121 
        GRE.Score       TOEFL.Score University.Rating               LOR 
         4.459541          3.867976          2.265898          1.853078 
             CGPA          Research 
         4.619554          1.493400 

VIF < 10 for every predictor: no multicollinearity.
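The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the others. A base-R sketch of that definition follows; vif_manual is a hypothetical helper name and the data are synthetic, with the three predictors correlated by construction:

```r
# VIF for each column of the design matrix (intercept excluded).
vif_manual <- function(model) {
  X <- model.matrix(model)[, -1, drop = FALSE]
  sapply(colnames(X), function(j) {
    others <- X[, colnames(X) != j, drop = FALSE]
    r2 <- summary(lm(X[, j] ~ others))$r.squared
    1 / (1 - r2)
  })
}

set.seed(5)
n <- 200
cgpa <- rnorm(n, 8.6, 0.6)
d <- data.frame(
  CGPA        = cgpa,
  GRE.Score   = 316 + 12 * (cgpa - 8.6) + rnorm(n, sd = 8),
  TOEFL.Score = 107 + 7 * (cgpa - 8.6) + rnorm(n, sd = 4)
)
d$Chance.of.Admit <- with(
  d,
  -1.6 + 0.14 * CGPA + 0.002 * GRE.Score + 0.003 * TOEFL.Score +
    rnorm(n, sd = 0.06)
)

m2 <- lm(Chance.of.Admit ~ CGPA + GRE.Score + TOEFL.Score, data = d)
v <- vif_manual(m2)
v
```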

Conclusion

Of the 3 models (all parameters, 3 parameters, and stepwise backward), m3 has the highest adj. R-squared. It is the stepwise backward model, which uses University Rating, TOEFL, Research, GRE, LOR and CGPA, and it has an MSE of 0.003541621.

Model validation shows a special case. All 3 models pass the linearity check and the multicollinearity test: the parameters have a linear relationship with the target variable (chance of admit), and every VIF value is below 10, meaning there is no high correlation among the parameters used.

However, all 3 models fail the normality test and the homoscedasticity test, which means the errors are not normally distributed and show a pattern. In this situation we have some options. First, we may continue using the model with some caveats. Second, we may transform the non-normal dependent variable toward a normal shape (for example, with a Box-Cox transformation). If we keep these models and use them, we need to remember that they may produce underpredicted results (see the density plot in the Evaluation section), and that we could improve the model by adding new data in the future.
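As a rough illustration of the Box-Cox option, the sketch below picks the power λ by maximizing the profile log-likelihood over a grid, using base R only. The helper name boxcox_loglik and the data are hypothetical; the synthetic response is log-normal by construction, so it says nothing about which λ would suit Chance.of.Admit. The response must be strictly positive for the transformation to be defined.

```r
# Profile log-likelihood of the Box-Cox parameter for response y and design X.
boxcox_loglik <- function(lambda, y, X) {
  yt  <- if (abs(lambda) < 1e-8) log(y) else (y^lambda - 1) / lambda
  rss <- sum(lm.fit(X, yt)$residuals^2)
  n   <- length(y)
  -n / 2 * log(rss / n) + (lambda - 1) * sum(log(y))
}

set.seed(6)
n <- 200
x <- runif(n, 0, 5)
y <- exp(1 + 0.3 * x + rnorm(n, sd = 0.2))   # positive, right-skewed response
X <- cbind(1, x)                              # intercept plus one predictor

# Evaluate the likelihood on a grid and keep the best lambda.
lambdas <- seq(-2, 2, by = 0.05)
loglik  <- sapply(lambdas, boxcox_loglik, y = y, X = X)
lambda_hat <- lambdas[which.max(loglik)]
lambda_hat
```

A λ near 1 would mean no transformation is needed, while a λ near 0 points to a log transform. After refitting on the transformed response, the normality and homoscedasticity tests should be rerun.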