Identifying the chance of an applicant being admitted
Importing the admission file
admission <- read.csv("F:/Documents/DATA SCIENCE/II. Machine Learning/Regression Model/LBB regression model/Admission_Predict.csv")
head(admission)
## Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 1 1 337 118 4 4.5 4.5 9.65 1
## 2 2 324 107 4 4.0 4.5 8.87 1
## 3 3 316 104 3 3.0 3.5 8.00 1
## 4 4 322 110 3 3.5 2.5 8.67 1
## 5 5 314 103 2 2.0 3.0 8.21 0
## 6 6 330 115 5 4.5 3.0 9.34 1
## Chance.of.Admit
## 1 0.92
## 2 0.76
## 3 0.72
## 4 0.80
## 5 0.65
## 6 0.90
str(admission)
## 'data.frame': 400 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
Target variable (y): Chance.of.Admit
boxplot(admission$GRE.Score)
boxplot(admission$TOEFL.Score)
Neither boxplot shows any outliers in GRE.Score or TOEFL.Score.
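The visual check can be backed by a numeric one. The sketch below is illustrative only: it applies base R's boxplot.stats(), which uses the same 1.5 * IQR whisker rule as boxplot(), to the first ten GRE.Score values printed by str(admission) above. The name gre_sample is hypothetical; on the real data one would pass admission$GRE.Score instead.

```r
# First ten GRE.Score values as shown in the str() output (not the full column)
gre_sample <- c(337, 324, 316, 322, 314, 330, 321, 308, 302, 323)

# boxplot.stats() returns the points lying beyond the whiskers in $out
gre_outliers <- boxplot.stats(gre_sample)$out

# An empty vector means no point lies beyond 1.5 * IQR from the hinges
length(gre_outliers)
```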
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
ggcorr(admission, label = TRUE, hjust = 0.9, label_size = 3, layout.exp = 3)
All variables have a decent correlation with the chance of being admitted.
mdl_all <- lm(formula = Chance.of.Admit ~ ., data = admission)
mdl_1 <- lm(formula = Chance.of.Admit ~ 1, data = admission)
step(object = mdl_all, direction = "backward", trace = 0)
##
## Call:
## lm(formula = Chance.of.Admit ~ Serial.No. + GRE.Score + TOEFL.Score +
## University.Rating + LOR + CGPA + Research, data = admission)
##
## Coefficients:
## (Intercept) Serial.No. GRE.Score TOEFL.Score
## -1.2937891 0.0001593 0.0017982 0.0036843
## University.Rating LOR CGPA Research
## 0.0088095 0.0215765 0.1053335 0.0243884
step(object = mdl_1, scope = list(lower = mdl_1, upper = mdl_all), direction = "forward", trace = 0)
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + LOR + Serial.No. +
## TOEFL.Score + Research + University.Rating, data = admission)
##
## Coefficients:
## (Intercept) CGPA GRE.Score LOR
## -1.2937891 0.1053335 0.0017982 0.0215765
## Serial.No. TOEFL.Score Research University.Rating
## 0.0001593 0.0036843 0.0243884 0.0088095
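Both searches keep Serial.No., which is only a row counter. One way to stop an identifier from entering the search is to drop the column before fitting the full model. The sketch below shows the idea on simulated data; the data frame toy and its columns id, x1, x2 and y are hypothetical and stand in for the admission data.

```r
set.seed(42)
n <- 100

# Simulated data: id is a row counter, x1 drives y, x2 is pure noise
toy <- data.frame(id = seq_len(n), x1 = rnorm(n), x2 = rnorm(n))
toy$y <- 1 + 2 * toy$x1 + rnorm(n, sd = 0.5)

# Remove the identifier so step() can never select it
toy_no_id <- subset(toy, select = -id)

full <- lm(y ~ ., data = toy_no_id)
selected <- step(full, direction = "backward", trace = 0)

# Names of the terms that survive the backward search
names(coef(selected))
```

With the real data the same pattern would be subset(admission, select = -Serial.No.) followed by the lm() and step() calls used above.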
Using regsubsets
library(leaps)
reg <- regsubsets(Chance.of.Admit ~ ., data = admission, nbest = 2)
plot(reg, scale = "adjr2", main = "All possible regression: ranked by Adjusted R-squared")
Both feature-selection methods recommend the same variables, so we assign them to the mdl_select object.
mdl_select <- lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score + Research + University.Rating, data = admission)
# Serial.No. is omitted: it is a row identifier, not an applicant attribute
summary(mdl_select)
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + LOR + TOEFL.Score +
## Research + University.Rating, data = admission)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.26391 -0.02197 0.01008 0.03621 0.15851
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2543094 0.1243300 -10.089 < 2e-16 ***
## CGPA 0.1179066 0.0120852 9.756 < 2e-16 ***
## GRE.Score 0.0017646 0.0005957 2.962 0.00324 **
## LOR 0.0210327 0.0050724 4.147 4.14e-05 ***
## TOEFL.Score 0.0028389 0.0010801 2.628 0.00892 **
## Research 0.0241542 0.0079287 3.046 0.00247 **
## University.Rating 0.0048540 0.0045404 1.069 0.28570
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06373 on 393 degrees of freedom
## Multiple R-squared: 0.8033, Adjusted R-squared: 0.8003
## F-statistic: 267.5 on 6 and 393 DF, p-value: < 2.2e-16
Key information from the summary:
- Residuals range from -0.26391 to 0.15851.
- University.Rating is the weakest predictor and the only one that is not significant at the 5% level (p = 0.2857).
- Adjusted R-squared of the model: 0.8003 (80.03%); multiple R-squared: 0.8033.
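The adjusted R-squared can be reproduced from the multiple R-squared with the usual formula 1 - (1 - R^2)(n - 1)/(n - p - 1). The short check below plugs in the numbers printed in the summary (n = 400 observations, p = 6 predictors); the variable names are only illustrative.

```r
r2 <- 0.8033   # Multiple R-squared from the summary output
n  <- 400      # number of observations
p  <- 6        # number of predictors in mdl_select

# Adjusted R-squared penalises R-squared for the number of predictors
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)
round(adj_r2, 4)
```

On a fitted model the same two values are available as summary(mdl_select)$r.squared and summary(mdl_select)$adj.r.squared.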
Normality
shapiro.test(x = mdl_select$residuals)
##
## Shapiro-Wilk normality test
##
## data: mdl_select$residuals
## W = 0.91977, p-value = 8.901e-14
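The p-value (8.901e-14) is far below 0.05, so the Shapiro-Wilk test rejects the null hypothesis that the residuals are normally distributed; the normality assumption is not met.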
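The Shapiro-Wilk result is read by comparing the p-value with a chosen significance level, not with W. The helper below is a minimal sketch of that decision rule; check_normality() is a hypothetical name, the 0.05 threshold is an assumed convention, and the built-in cars data stands in for the admission model.

```r
# Hypothetical helper: run Shapiro-Wilk on a model's residuals and
# report whether normality is rejected at the given significance level
check_normality <- function(model, alpha = 0.05) {
  res <- shapiro.test(residuals(model))
  list(
    W = unname(res$statistic),
    p_value = res$p.value,
    reject_normality = res$p.value < alpha
  )
}

# Stand-in model on a built-in dataset
fit <- lm(dist ~ speed, data = cars)
out <- check_normality(fit)
out$reject_normality

# The p-value reported for mdl_select, read with the same rule
8.901e-14 < 0.05
```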
Homoscedasticity
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
bptest(mdl_select)
##
## studentized Breusch-Pagan test
##
## data: mdl_select
## BP = 23.863, df = 6, p-value = 0.0005534
plot(mdl_select$fitted.values, mdl_select$residuals)
abline(h = 0, col = "red")
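The residuals do not form a distinct pattern in the plot, but the Breusch-Pagan p-value (0.0005534) is below 0.05, so the test rejects constant variance and indicates heteroscedasticity.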
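To make the test less of a black box, the studentized Breusch-Pagan statistic can be computed by hand: regress the squared residuals on the same predictors and multiply that auxiliary R-squared by n. The sketch below does this in base R on the built-in mtcars data, which stands in for the admission model; the object names are illustrative, and the result should agree with bptest() run on the same fit.

```r
# Stand-in model on a built-in dataset
fit <- lm(mpg ~ wt + hp, data = mtcars)

# Auxiliary regression: squared residuals on the same predictors
aux_data <- transform(mtcars, sq_resid = residuals(fit)^2)
aux <- lm(sq_resid ~ wt + hp, data = aux_data)

# Studentized (Koenker) statistic: n * R-squared of the auxiliary model
bp_stat <- nrow(mtcars) * summary(aux)$r.squared
bp_df <- length(coef(aux)) - 1

# Upper-tail chi-squared p-value; a small value suggests heteroscedasticity
bp_p <- pchisq(bp_stat, df = bp_df, lower.tail = FALSE)
c(BP = bp_stat, df = bp_df, p_value = bp_p)
```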
Multicollinearity
library(car)
## Loading required package: carData
vif(mdl_select)
## CGPA GRE.Score LOR TOEFL.Score
## 5.102055 4.588481 2.040413 4.222318
## Research University.Rating
## 1.533822 2.649239
All VIF values are below 10, so there is no serious multicollinearity among the independent (predictor) variables.
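Each VIF is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on the other predictors. The sketch below computes this by hand in base R on the built-in mtcars data; the predictor set is an arbitrary stand-in for the admission variables and vif_manual is a hypothetical name.

```r
predictors <- c("wt", "hp", "disp")

# For each predictor, regress it on the remaining ones and convert
# the resulting R-squared into a variance inflation factor
vif_manual <- sapply(predictors, function(v) {
  others <- setdiff(predictors, v)
  aux <- lm(reformulate(others, response = v), data = mtcars)
  1 / (1 - summary(aux)$r.squared)
})

vif_manual
```

A value of 1 means the predictor is uncorrelated with the others; the larger the value, the more of its variance the other predictors already explain.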
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
pred_adm <- predict(object = mdl_select,newdata = admission)
MSE(y_pred = pred_adm, y_true = admission$Chance.of.Admit)
## [1] 0.003990485
adm_2 <- read.csv("F:/Documents/DATA SCIENCE/II. Machine Learning/Regression Model/LBB regression model/Admission_Predict_Ver1.1.csv")
head(adm_2)
## Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 1 1 337 118 4 4.5 4.5 9.65 1
## 2 2 324 107 4 4.0 4.5 8.87 1
## 3 3 316 104 3 3.0 3.5 8.00 1
## 4 4 322 110 3 3.5 2.5 8.67 1
## 5 5 314 103 2 2.0 3.0 8.21 0
## 6 6 330 115 5 4.5 3.0 9.34 1
## Chance.of.Admit
## 1 0.92
## 2 0.76
## 3 0.72
## 4 0.80
## 5 0.65
## 6 0.90
pred_adm2 <- predict(object = mdl_select,newdata = adm_2)
MSE(y_pred = pred_adm, y_true = admission$Chance.of.Admit)
## [1] 0.003990485
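Conclusion: The two MSE values are identical because the second call again scores pred_adm against admission$Chance.of.Admit, so pred_adm2 is never evaluated. To measure the error on the second dataset, the call should be MSE(y_pred = pred_adm2, y_true = adm_2$Chance.of.Admit).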
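When scoring a model on a second data frame, each prediction vector has to be paired with the truth column from the same data frame. The sketch below illustrates this with a hand-written mse() (the mean of squared differences, the same quantity MLmetrics::MSE reports) and an arbitrary split of the built-in mtcars data; all names here are hypothetical stand-ins for admission and adm_2.

```r
# Mean squared error: average of squared prediction errors
mse <- function(y_pred, y_true) mean((y_true - y_pred)^2)

# Arbitrary split standing in for the two admission files
train <- mtcars[1:20, ]
new_data <- mtcars[21:32, ]

fit <- lm(mpg ~ wt + hp, data = train)

# Pair each set of predictions with the matching true values
pred_train <- predict(fit, newdata = train)
pred_new <- predict(fit, newdata = new_data)

mse_train <- mse(pred_train, train$mpg)
mse_new <- mse(pred_new, new_data$mpg)
c(train = mse_train, new = mse_new)
```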