Business Question Identification

Estimating an applicant's chance of being admitted

Data Preparation

Importing the admission file

admission <- read.csv("F:/Documents/DATA SCIENCE/II. Machine Learning/Regression Model/LBB regression model/Admission_Predict.csv")
head(admission)
##   Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 1          1       337         118                 4 4.5 4.5 9.65        1
## 2          2       324         107                 4 4.0 4.5 8.87        1
## 3          3       316         104                 3 3.0 3.5 8.00        1
## 4          4       322         110                 3 3.5 2.5 8.67        1
## 5          5       314         103                 2 2.0 3.0 8.21        0
## 6          6       330         115                 5 4.5 3.0 9.34        1
##   Chance.of.Admit
## 1            0.92
## 2            0.76
## 3            0.72
## 4            0.80
## 5            0.65
## 6            0.90
str(admission)
## 'data.frame':    400 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

Target variable (y): Chance.of.Admit

Exploration of Independent Variables

boxplot(admission$GRE.Score)

boxplot(admission$TOEFL.Score)

Neither boxplot shows any outliers
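The visual check can be backed by a numeric one. A minimal sketch, assuming base R's boxplot.stats(), which applies the same 1.5 x IQR whisker rule that boxplot() uses by default. The helper name count_outliers and the toy vectors are illustrative and are not part of the original analysis:

```r
# Count points that fall beyond the boxplot whiskers (1.5 * IQR rule).
count_outliers <- function(x) {
  length(boxplot.stats(x)$out)
}

# Toy data for illustration only (not the admission data).
scores_clean <- c(300, 305, 310, 315, 320, 325, 330)
scores_spiky <- c(scores_clean, 500)  # one extreme value appended

count_outliers(scores_clean)
count_outliers(scores_spiky)
```

On the data in this report the same helper could be called as count_outliers(admission$GRE.Score) and count_outliers(admission$TOEFL.Score).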

Building the Model

Linearity Test

library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
ggcorr(admission,label = T, hjust = 0.9,  label_size = 3, layout.exp = 3)

All variables show a decent correlation with the chance of being admitted
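The correlation plot can be complemented by listing the correlations with the target as numbers. A minimal sketch using Pearson correlation (the default of base R's cor()). The helper name cor_with_target and the simulated columns x1, x2, y are assumptions made for illustration:

```r
# Correlation of every column with a chosen target column, sorted from
# strongest positive to weakest. All columns must be numeric.
cor_with_target <- function(df, target) {
  cors <- cor(df)[, target]
  sort(cors[names(cors) != target], decreasing = TRUE)
}

# Toy data for illustration only: y depends strongly on x1, not on x2.
set.seed(1)
toy <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
toy$y <- 2 * toy$x1 + rnorm(50, sd = 0.1)

cor_with_target(toy, "y")
```

For this report the call would be cor_with_target(admission, "Chance.of.Admit"), which makes it easy to see how each column, including Serial.No., relates to the target.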

Model generation

mdl_all <- lm(formula = Chance.of.Admit ~ ., data = admission)
mdl_1 <- lm(formula = Chance.of.Admit ~ 1, data = admission)
step(object = mdl_all,direction = "backward", trace = 0)
## 
## Call:
## lm(formula = Chance.of.Admit ~ Serial.No. + GRE.Score + TOEFL.Score + 
##     University.Rating + LOR + CGPA + Research, data = admission)
## 
## Coefficients:
##       (Intercept)         Serial.No.          GRE.Score        TOEFL.Score  
##        -1.2937891          0.0001593          0.0017982          0.0036843  
## University.Rating                LOR               CGPA           Research  
##         0.0088095          0.0215765          0.1053335          0.0243884
step(object = mdl_1,scope = list(lower = mdl_1, upper =mdl_all),direction = "forward", trace = 0)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + LOR + Serial.No. + 
##     TOEFL.Score + Research + University.Rating, data = admission)
## 
## Coefficients:
##       (Intercept)               CGPA          GRE.Score                LOR  
##        -1.2937891          0.1053335          0.0017982          0.0215765  
##        Serial.No.        TOEFL.Score           Research  University.Rating  
##         0.0001593          0.0036843          0.0243884          0.0088095
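Serial.No. is a row identifier, yet it appears in both selected models because the full model was built with `~ .`. One way to keep an identifier out of the search is to subtract it from the full formula before calling step(). A minimal sketch on simulated data, where the column names id, x1, x2 and y are hypothetical:

```r
# Simulated data: 'id' is a row counter with no real relation to y.
set.seed(42)
n <- 100
toy <- data.frame(id = seq_len(n), x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 0.5 * toy$x1 + rnorm(n, sd = 0.2)

# The full model uses every column except the identifier.
mdl_full <- lm(y ~ . - id, data = toy)
mdl_back <- step(mdl_full, direction = "backward", trace = 0)

# The identifier cannot be selected because it was never a candidate.
names(coef(mdl_back))
```

Applied to this report, the equivalent full model would be lm(Chance.of.Admit ~ . - Serial.No., data = admission).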

Using regsubsets

library(leaps)
reg <- regsubsets(Chance.of.Admit ~ ., data = admission, nbest = 2)
plot(reg, scale="adjr", main="All possible regression: ranked by Adjusted R-squared")

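Both feature selection methods recommend the same variables. Serial.No. is only a row identifier and carries no information about an applicant, so it is left out, and the remaining variables are assigned to the mdl_select object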

mdl_select <- lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score + Research + University.Rating, data = admission)
# Serial.No. is omitted because it is only a row identifier
summary(mdl_select)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score + 
##     Research + University.Rating, data = admission)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -0.26391 -0.02197  0.01008  0.03621  0.15851 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.2543094  0.1243300 -10.089  < 2e-16 ***
## CGPA               0.1179066  0.0120852   9.756  < 2e-16 ***
## GRE.Score          0.0017646  0.0005957   2.962  0.00324 ** 
## LOR                0.0210327  0.0050724   4.147 4.14e-05 ***
## TOEFL.Score        0.0028389  0.0010801   2.628  0.00892 ** 
## Research           0.0241542  0.0079287   3.046  0.00247 ** 
## University.Rating  0.0048540  0.0045404   1.069  0.28570    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06373 on 393 degrees of freedom
## Multiple R-squared:  0.8033, Adjusted R-squared:  0.8003 
## F-statistic: 267.5 on 6 and 393 DF,  p-value: < 2.2e-16

Key information from the summary:
- Residuals range from -0.26391 to 0.15851
- University.Rating is the only predictor that is not statistically significant (p-value = 0.2857)
- Multiple R-squared is 80.33% and adjusted R-squared is 80.03%
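The two R-squared figures in the summary differ because the adjusted version penalises the number of predictors: adjusted R-squared = 1 - (1 - R-squared) * (n - 1) / (n - p - 1). With the values above, 1 - (1 - 0.8033) * 399 / 393 is approximately 0.8003. A minimal sketch of reading both values from a fitted model and checking the formula, using simulated data (the names toy, x1, x2 and y are illustrative):

```r
# Toy data for illustration only.
set.seed(7)
toy <- data.frame(x1 = rnorm(60), x2 = rnorm(60))
toy$y <- 3 + 1.5 * toy$x1 + rnorm(60)

fit <- lm(y ~ x1 + x2, data = toy)
s <- summary(fit)

r2 <- s$r.squared          # Multiple R-squared
adj_r2 <- s$adj.r.squared  # Adjusted R-squared

# Adjusted R-squared recomputed from its definition.
n <- nrow(toy)
p <- length(coef(fit)) - 1  # number of predictors, excluding the intercept
adj_manual <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r.squared = r2, adj.r.squared = adj_r2, manual = adj_manual)
```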

Checking the Assumptions

Normality

shapiro.test(x = mdl_select$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  mdl_select$residuals
## W = 0.91977, p-value = 8.901e-14

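p-value (8.901e-14) < 0.05, so the null hypothesis that the residuals are normally distributed is rejected: the normality assumption is not met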
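The decision rule for the Shapiro-Wilk test compares the p-value with a chosen significance level, not with W. A minimal sketch, assuming a 0.05 level. The helper name looks_normal is hypothetical, and the two samples are built from theoretical quantiles so the result does not depend on a random seed:

```r
# TRUE when Shapiro-Wilk does NOT reject normality at level alpha.
looks_normal <- function(x, alpha = 0.05) {
  shapiro.test(x)$p.value >= alpha
}

# Idealised samples for illustration only.
normal_sample <- qnorm(ppoints(200))  # evenly spaced normal quantiles
skewed_sample <- qexp(ppoints(200))   # evenly spaced exponential quantiles

looks_normal(normal_sample)
looks_normal(skewed_sample)
```

For the model in this report the call would be looks_normal(mdl_select$residuals).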

Homoscedasticity

library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
bptest(mdl_select)
## 
##  studentized Breusch-Pagan test
## 
## data:  mdl_select
## BP = 23.863, df = 6, p-value = 0.0005534
plot(mdl_select$fitted.values, mdl_select$residuals)
abline(h = 0, col = "red")

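The residual plot does not show an obvious pattern, but the Breusch-Pagan test gives a p-value of 0.0005534 (< 0.05), so the null hypothesis of constant error variance is rejected: the homoscedasticity assumption is not met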
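To make the test less of a black box, the studentized Breusch-Pagan statistic can be computed by hand: regress the squared residuals on the model's predictors and multiply the R-squared of that auxiliary regression by n. The sketch below uses only base R. The helper name bp_manual and the simulated data are assumptions for illustration, and it is meant as a cross-check rather than a replacement for bptest():

```r
# Studentized (Koenker) Breusch-Pagan statistic computed by hand.
bp_manual <- function(fit) {
  X <- model.matrix(fit)          # intercept plus predictors
  e2 <- residuals(fit)^2          # squared residuals
  aux <- lm.fit(X, e2)            # auxiliary regression of e^2 on X
  r2 <- 1 - sum(aux$residuals^2) / sum((e2 - mean(e2))^2)
  stat <- nrow(X) * r2            # BP = n * R^2
  df <- ncol(X) - 1               # number of predictors
  c(BP = stat, df = df, p.value = pchisq(stat, df, lower.tail = FALSE))
}

# Toy data for illustration only.
set.seed(99)
x <- runif(300, 1, 10)
y_const <- 2 + 3 * x + rnorm(300, sd = 1)  # constant error variance
y_fan   <- 2 + 3 * x + rnorm(300, sd = x)  # error variance grows with x

bp_manual(lm(y_const ~ x))
bp_manual(lm(y_fan ~ x))
```

A p-value below the chosen level (0.05 here) indicates heteroscedasticity, as with the fan-shaped toy data.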

Multicollinearity

library(car)
## Loading required package: carData
vif(mdl_select)
##              CGPA         GRE.Score               LOR       TOEFL.Score 
##          5.102055          4.588481          2.040413          4.222318 
##          Research University.Rating 
##          1.533822          2.649239

All variables have a VIF below 10, so there is no serious multicollinearity among the independent variables
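The VIF of a predictor is 1 / (1 - R-squared), where the R-squared comes from regressing that predictor on all the other predictors. A minimal base R sketch of this definition for numeric predictors. The helper name vif_manual and the simulated columns are illustrative assumptions:

```r
# VIF for each column of a data frame of numeric predictors.
vif_manual <- function(predictors) {
  sapply(names(predictors), function(v) {
    others <- setdiff(names(predictors), v)
    aux <- lm(reformulate(others, response = v), data = predictors)
    1 / (1 - summary(aux)$r.squared)
  })
}

# Toy data for illustration only.
set.seed(5)
base <- rnorm(200)
toy <- data.frame(x1 = base,
                  x2 = base + rnorm(200, sd = 0.1),  # nearly a copy of x1
                  x3 = rnorm(200))                   # unrelated to x1, x2

vif_manual(toy)
```

Here x1 and x2 receive very large values because each is almost fully explained by the other, while x3 stays close to 1.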

Comparing MSE on two different datasets

library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
pred_adm <- predict(object = mdl_select,newdata = admission)
MSE(y_pred = pred_adm, y_true = admission$Chance.of.Admit)
## [1] 0.003990485
adm_2 <- read.csv("F:/Documents/DATA SCIENCE/II. Machine Learning/Regression Model/LBB regression model/Admission_Predict_Ver1.1.csv")
head(adm_2)
##   Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 1          1       337         118                 4 4.5 4.5 9.65        1
## 2          2       324         107                 4 4.0 4.5 8.87        1
## 3          3       316         104                 3 3.0 3.5 8.00        1
## 4          4       322         110                 3 3.5 2.5 8.67        1
## 5          5       314         103                 2 2.0 3.0 8.21        0
## 6          6       330         115                 5 4.5 3.0 9.34        1
##   Chance.of.Admit
## 1            0.92
## 2            0.76
## 3            0.72
## 4            0.80
## 5            0.65
## 6            0.90
pred_adm2 <- predict(object = mdl_select,newdata = adm_2)
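MSE(y_pred = pred_adm2, y_true = adm_2$Chance.of.Admit)

Conclusion: on the data used for fitting, the MSE is 0.003990485. The MSE for the second dataset must be computed from pred_adm2 and adm_2$Chance.of.Admit; scoring pred_adm against admission a second time only repeats the first value.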
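Comparing the error on the fitting data with the error on a separate file is only meaningful when each MSE call pairs predictions with the matching observed values. A minimal base R sketch on simulated data. The names mse, make_data, train and fresh are hypothetical and not part of the original analysis:

```r
# Mean squared error between observed and predicted values.
mse <- function(y_true, y_pred) mean((y_true - y_pred)^2)

# Toy data generator for illustration only.
set.seed(2024)
make_data <- function(n) {
  x <- rnorm(n)
  data.frame(x = x, y = 1 + 2 * x + rnorm(n, sd = 0.5))
}
train <- make_data(200)  # data used to fit the model
fresh <- make_data(200)  # separate data the model has not seen

fit <- lm(y ~ x, data = train)

# Each call pairs a data set's own predictions with its own target.
mse_train <- mse(train$y, predict(fit, newdata = train))
mse_fresh <- mse(fresh$y, predict(fit, newdata = fresh))
c(train = mse_train, fresh = mse_fresh)
```

Two identical values from different files are a hint that the same predictions and targets were scored twice.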