To predict the chance of admission
Load the required packages.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
## method from
## +.gg ggplot2
##
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
##
## nasa
library(MLmetrics)
##
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
##
## Recall
library(lmtest)
## Loading required package: zoo
##
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
##
## as.Date, as.Date.numeric
library(car)
## Loading required package: carData
##
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
##
## recode
Let’s start by importing the dataset into our workspace.
admission <- read.csv("data_input/Admission_Predict_Ver1.1.csv")
head(admission)
Check the structure of the data.
str(admission)
## 'data.frame': 500 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
Range of the variables:
- GRE Score (out of 340)
- TOEFL Score (out of 120)
- University Rating (out of 5)
- SOP / Statement of Purpose (out of 5)
- LOR / Letter of Recommendation Strength (out of 5)
- CGPA / Undergraduate GPA (out of 10)
- Research Experience (either 0 or 1)
- Chance of Admit (ranging from 0 to 1)
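The documented ranges can double as a sanity check on imported data. The sketch below is a minimal illustration, not part of the original analysis: `toy` holds the first three rows shown in the `str()` output, while the `limits` vector and the helper `out_of_range()` are hypothetical names introduced here, and a lower bound of 0 is an assumption.

```r
# First three rows of four columns, as shown in the str() output above.
toy <- data.frame(
  GRE.Score   = c(337, 324, 316),
  TOEFL.Score = c(118, 107, 104),
  CGPA        = c(9.65, 8.87, 8.00),
  Research    = c(1, 1, 1)
)

# Documented upper bounds from the list above (lower bound of 0 is assumed).
limits <- c(GRE.Score = 340, TOEFL.Score = 120, CGPA = 10, Research = 1)

# Hypothetical helper: count values outside [0, upper] for each listed column.
out_of_range <- function(df, upper) {
  sapply(names(upper), function(col) {
    sum(df[[col]] < 0 | df[[col]] > upper[[col]])
  })
}

violations <- out_of_range(toy, limits)
violations
```

A column with a non-zero count would point to a data-entry or import problem worth checking before modeling.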
summary(admission)
## Serial.No. GRE.Score TOEFL.Score University.Rating
## Min. : 1.0 Min. :290.0 Min. : 92.0 Min. :1.000
## 1st Qu.:125.8 1st Qu.:308.0 1st Qu.:103.0 1st Qu.:2.000
## Median :250.5 Median :317.0 Median :107.0 Median :3.000
## Mean :250.5 Mean :316.5 Mean :107.2 Mean :3.114
## 3rd Qu.:375.2 3rd Qu.:325.0 3rd Qu.:112.0 3rd Qu.:4.000
## Max. :500.0 Max. :340.0 Max. :120.0 Max. :5.000
## SOP LOR CGPA Research
## Min. :1.000 Min. :1.000 Min. :6.800 Min. :0.00
## 1st Qu.:2.500 1st Qu.:3.000 1st Qu.:8.127 1st Qu.:0.00
## Median :3.500 Median :3.500 Median :8.560 Median :1.00
## Mean :3.374 Mean :3.484 Mean :8.576 Mean :0.56
## 3rd Qu.:4.000 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:1.00
## Max. :5.000 Max. :5.000 Max. :9.920 Max. :1.00
## Chance.of.Admit
## Min. :0.3400
## 1st Qu.:0.6300
## Median :0.7200
## Mean :0.7217
## 3rd Qu.:0.8200
## Max. :0.9700
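For every variable except Research, the mean and median are close, which suggests the distributions are roughly symmetric; on its own this does not establish that the data are normally distributed. Next, let’s convert Research to a factor, check its distribution, and drop Serial.No., which is only a row identifier.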
admission <- admission %>%
  mutate(Research = as.factor(Research)) %>%
  select(-Serial.No.)
table(admission$Research)
##
## 0 1
## 220 280
The two classes of the Research variable are reasonably balanced (220 without research experience, 280 with).
Next, check whether there are any missing values.
anyNA(admission)
## [1] FALSE
FALSE means there are no missing values in the data.
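When `anyNA()` does return TRUE, a per-column count shows where the gaps are. The sketch below uses a made-up data frame (`toy`, with one value deliberately set to NA), so it only illustrates the idea and does not reflect the admission data.

```r
# Hypothetical toy data with one missing value, for illustration only.
toy <- data.frame(
  GRE.Score = c(337, NA, 316),
  CGPA      = c(9.65, 8.87, 8.00)
)

anyNA(toy)                             # TRUE: at least one value is missing

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))
na_per_column
```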
Now let’s check the correlation between the target and the predictors.
ggcorr(admission, label = TRUE, label_size = 3.5, hjust = 0.75, layout.exp = 3.5)
## Warning in ggcorr(admission, label = TRUE, label_size = 3.5, hjust = 0.75, :
## data in column(s) 'Research' are not numeric and were ignored
From the plot above, we can conclude that all of the numeric predictors have a strong positive correlation with the chance of admission. Research is not part of the plot because it is now a factor, as the warning notes.
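Because `ggcorr()` skips factor columns, Research does not appear in the correlation plot. One way to include it is to convert the factor back to 0/1 before computing correlations. The sketch below is an illustration under assumptions: `toy` contains only the first six rows shown in the `str()` output, so the coefficients it prints say nothing reliable about the full data.

```r
# First six rows of three columns, as shown in the str() output above.
toy <- data.frame(
  CGPA            = c(9.65, 8.87, 8.00, 8.67, 8.21, 9.34),
  Research        = factor(c(1, 1, 1, 1, 0, 1)),
  Chance.of.Admit = c(0.92, 0.76, 0.72, 0.80, 0.65, 0.90)
)

# Convert the factor back to numeric 0/1 (via character, to keep the labels).
toy_numeric <- toy
toy_numeric$Research <- as.numeric(as.character(toy$Research))

# Pearson correlation of every column with the target.
cors <- cor(toy_numeric)[, "Chance.of.Admit"]
cors
```

For a binary 0/1 variable, this Pearson coefficient is the point-biserial correlation.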
Before we start modeling, we need to split the data into training data (80%) and test data (20%). The training data will be used to fit the linear regression model.
set.seed(88)
index <- sample(nrow(admission), nrow(admission)*0.8)
admission.train <- admission[index,]
admission.test <- admission[-index,]
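A quick check that the split behaves as intended can catch indexing mistakes early. The sketch below repeats the same sampling call on a stand-in data frame (`toy`, a hypothetical object with 500 rows and only an `id` column), since only the row count matters here.

```r
# Stand-in for the admission data: only the row count (500) matters.
toy <- data.frame(id = 1:500)

set.seed(88)
index <- sample(nrow(toy), nrow(toy) * 0.8)
toy.train <- toy[index, , drop = FALSE]
toy.test  <- toy[-index, , drop = FALSE]

nrow(toy.train)                                # 400 rows for training
nrow(toy.test)                                 # 100 rows for testing
length(intersect(toy.train$id, toy.test$id))   # 0: no row is in both sets
```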
Let’s start with a model that includes all variables.
admission.model <- lm(Chance.of.Admit~., data = admission.train)
summary(admission.model)
##
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admission.train)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.267009 -0.023378 0.009299 0.032683 0.157721
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.2574817 0.1155317 -10.884 < 2e-16 ***
## GRE.Score 0.0020065 0.0005545 3.618 0.000335 ***
## TOEFL.Score 0.0022177 0.0009987 2.221 0.026946 *
## University.Rating 0.0069350 0.0042658 1.626 0.104815
## SOP 0.0026155 0.0052468 0.498 0.618417
## LOR 0.0168895 0.0045775 3.690 0.000256 ***
## CGPA 0.1170066 0.0108412 10.793 < 2e-16 ***
## Research1 0.0244389 0.0073723 3.315 0.001002 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06047 on 392 degrees of freedom
## Multiple R-squared: 0.8159, Adjusted R-squared: 0.8126
## F-statistic: 248.2 on 7 and 392 DF, p-value: < 2.2e-16
The adjusted R-squared is 0.813, which means this full model explains about 81.3% of the variance in the target variable (Chance.of.Admit); the rest is due to factors not in the model. Even though two variables (University.Rating and SOP) are not statistically significant at the 0.05 level, we can still include them as candidates in the feature selection process.
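The adjusted R-squared penalizes the plain R-squared for the number of predictors: adjusted R² = 1 - (1 - R²)(n - 1)/(n - p - 1). As a small worked check, the sketch below plugs in the rounded values from the summary output; the variable names are chosen here for illustration.

```r
r_squared <- 0.8159   # Multiple R-squared from the summary (rounded)
n <- 400              # training rows: 392 residual df + 8 coefficients
p <- 7                # number of predictors, excluding the intercept

# Adjusted R-squared: shrink R-squared according to model size.
adj_r_squared <- 1 - (1 - r_squared) * (n - 1) / (n - p - 1)
round(adj_r_squared, 4)
```

The result agrees with the 0.8126 reported by `summary()`.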
# model without any predictor (intercept only)
model.none <- lm(Chance.of.Admit~1, data = admission.train)
# backward elimination
step(object = admission.model, direction = "backward", trace = 0)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## LOR + CGPA + Research, data = admission.train)
##
## Coefficients:
## (Intercept) GRE.Score TOEFL.Score University.Rating
## -1.264484 0.001997 0.002269 0.007725
## LOR CGPA Research1
## 0.017579 0.117981 0.024672
model.back <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
LOR + CGPA + Research, data = admission.train)
# forward selection
step(object = model.none, scope = list(lower = model.none, upper = admission.model), direction = "forward", trace = 0)
##
## Call:
## lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + LOR + Research +
## TOEFL.Score + University.Rating, data = admission.train)
##
## Coefficients:
## (Intercept) CGPA GRE.Score LOR
## -1.264484 0.117981 0.001997 0.017579
## Research1 TOEFL.Score University.Rating
## 0.024672 0.002269 0.007725
model.forward <- lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + LOR + Research +
TOEFL.Score + University.Rating, data = admission.train)
# stepwise selection in both directions
step(object = admission.model, scope = list(lower = model.none, upper = admission.model), direction = "both", trace = 0)
##
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
## LOR + CGPA + Research, data = admission.train)
##
## Coefficients:
## (Intercept) GRE.Score TOEFL.Score University.Rating
## -1.264484 0.001997 0.002269 0.007725
## LOR CGPA Research1
## 0.017579 0.117981 0.024672
model.both <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating +
LOR + CGPA + Research, data = admission.train)
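Since `step()` returns the selected model as an `lm` object, its result can be assigned directly instead of retyping the formula by hand. The sketch below shows this on simulated data; `sim`, `full`, `none`, `sel.back`, and `sel.forward` are hypothetical names, and the data are generated so that `x1` and `x2` matter while `x3` is pure noise.

```r
set.seed(1)
n <- 200
# Simulated data: y depends on x1 and x2 only; x3 is unrelated noise.
sim <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
sim$y <- 1 + 2 * sim$x1 - 1.5 * sim$x2 + rnorm(n, sd = 0.5)

full <- lm(y ~ ., data = sim)   # all predictors
none <- lm(y ~ 1, data = sim)   # intercept only

# Keep the fitted model that step() returns.
sel.back <- step(full, direction = "backward", trace = 0)
sel.forward <- step(none,
                    scope = list(lower = formula(none), upper = formula(full)),
                    direction = "forward", trace = 0)

formula(sel.back)
formula(sel.forward)
```

Assigning the result avoids copy errors and keeps the selected model in sync if the data or the starting model change.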
Because all three procedures select the same set of variables, their adjusted R-squared values should be identical. We can confirm this below.
summary(model.back)$adj.r.squared
## [1] 0.8129988
summary(model.forward)$adj.r.squared
## [1] 0.8129988
summary(model.both)$adj.r.squared
## [1] 0.8129988
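Model candidate: Chance.of.Admit = -1.264484 + 0.001997(GRE.Score) + 0.002269(TOEFL.Score) + 0.007725(University.Rating) + 0.017579(LOR) + 0.117981(CGPA) + 0.024672(Research1), where Research1 is 1 for applicants with research experience and 0 otherwise.
Because this model is only marginally better than the model with all predictors (adjusted R-squared 0.8130 vs. 0.8126), either could serve as our final model. We continue with the reduced model (model.back) and use it to predict on the test data.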
pred_test <- predict(object = model.back,newdata = admission.test)
Error checking
# MSE train model
MSE(y_pred = model.back$fitted.values, y_true = admission.train$Chance.of.Admit)
## [1] 0.003585992
# MSE test model
MSE(y_pred = pred_test, y_true = admission.test$Chance.of.Admit)
## [1] 0.003384234
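MSE is expressed in squared units, so taking the square root (RMSE) puts the error back on the 0 to 1 scale of Chance.of.Admit. The sketch below simply reuses the two MSE values printed above; the variable names are chosen here for illustration.

```r
mse_train <- 0.003585992   # MSE on the training data, from the output above
mse_test  <- 0.003384234   # MSE on the test data, from the output above

# RMSE is on the same scale as the target.
rmse_train <- sqrt(mse_train)
rmse_test  <- sqrt(mse_test)

round(c(train = rmse_train, test = rmse_test), 3)
```

Both values are close to 0.06, and the test error is no higher than the training error, so these numbers give no indication of overfitting.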
hist(model.back$residuals)
shapiro.test(x = model.back$residuals)
##
## Shapiro-Wilk normality test
##
## data: model.back$residuals
## W = 0.91595, p-value = 3.856e-14
This result is unexpected. The p-value is less than 0.05, so we reject H0, which states that the residuals are normally distributed. Before judging whether we can use this model, let’s move on to the other tests.
plot(model.back$fitted.values, model.back$residuals)
abline(h = 0, col = "red")
bptest(model.back)
##
## studentized Breusch-Pagan test
##
## data: model.back
## BP = 18.6, df = 6, p-value = 0.004895
Again the p-value is below 0.05, so we reject H0 of homoscedasticity: the variance of the residuals is not constant. Let’s test the multicollinearity assumption to help decide whether this model needs tuning.
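To see what the studentized Breusch-Pagan statistic measures, it can be computed by hand: regress the squared residuals on the predictors and multiply the R-squared of that auxiliary regression by the sample size. The sketch below does this in base R on simulated data whose noise grows with `x`; all object names are hypothetical, and it is an illustration of the idea rather than a replacement for `bptest()`.

```r
set.seed(42)
n <- 400
x <- runif(n, 1, 10)
y <- 1 + 2 * x + rnorm(n, sd = x)   # noise standard deviation grows with x

fit <- lm(y ~ x)

# Auxiliary regression: do the predictors explain the squared residuals?
aux <- lm(residuals(fit)^2 ~ x)

# Studentized BP statistic: n * R-squared, compared with chi-squared (df = 1,
# one predictor in the auxiliary regression).
bp_stat <- n * summary(aux)$r.squared
bp_pval <- pchisq(bp_stat, df = 1, lower.tail = FALSE)

c(statistic = bp_stat, p.value = bp_pval)
```

A small p-value here means the residual spread changes with the predictor, which is the pattern the test is designed to flag.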
vif(model.back)
## GRE.Score TOEFL.Score University.Rating LOR
## 4.254761 3.882034 2.233527 1.849029
## CGPA Research
## 4.440739 1.460715
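No variable has a VIF above 10, so there is no sign of severe multicollinearity. The predictors are still correlated to some degree (GRE.Score, TOEFL.Score, and CGPA have VIFs of about 4), but not enough to violate the assumption.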
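The VIF of a predictor is 1 / (1 - R²), where R² comes from regressing that predictor on all the other predictors. The sketch below computes it by hand in base R on simulated data; `sim` and `vif_manual` are hypothetical names, with `x1` and `x2` built to share a common signal and `x3` independent of both.

```r
set.seed(7)
n <- 300
a <- rnorm(n)
sim <- data.frame(
  x1 = a + rnorm(n, sd = 0.3),   # x1 and x2 share the common signal `a`
  x2 = a + rnorm(n, sd = 0.3),
  x3 = rnorm(n)                  # independent of the other two
)

# For each predictor: regress it on the others, then apply 1 / (1 - R^2).
vif_manual <- sapply(names(sim), function(v) {
  others <- setdiff(names(sim), v)
  r2 <- summary(lm(reformulate(others, response = v), data = sim))$r.squared
  1 / (1 - r2)
})

round(vif_manual, 2)
```

The correlated pair gets clearly higher values than the independent predictor, whose VIF stays near 1.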