1. Objective

To predict an applicant's chance of admission from their academic profile (GRE, TOEFL, CGPA, and related variables) using a linear regression model.

2. Data Wrangling

Load the required packages

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(GGally)
## Loading required package: ggplot2
## Registered S3 method overwritten by 'GGally':
##   method from   
##   +.gg   ggplot2
## 
## Attaching package: 'GGally'
## The following object is masked from 'package:dplyr':
## 
##     nasa
library(MLmetrics)
## 
## Attaching package: 'MLmetrics'
## The following object is masked from 'package:base':
## 
##     Recall
library(lmtest)
## Loading required package: zoo
## 
## Attaching package: 'zoo'
## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric
library(car)
## Loading required package: carData
## 
## Attaching package: 'car'
## The following object is masked from 'package:dplyr':
## 
##     recode

Let’s start by importing the dataset into our workspace.

admission <- read.csv("data_input/Admission_Predict_Ver1.1.csv")
head(admission)

Check the structure of the data

str(admission)
## 'data.frame':    500 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

Ranges of the variables:
- GRE Score (out of 340)
- TOEFL Score (out of 120)
- University Rating (out of 5)
- SOP / Statement of Purpose (out of 5)
- LOR / Letter of Recommendation Strength (out of 5)
- Undergraduate GPA / CGPA (out of 10)
- Research Experience (either 0 or 1)
- Chance of Admit (ranging from 0 to 1)

summary(admission)
##    Serial.No.      GRE.Score      TOEFL.Score    University.Rating
##  Min.   :  1.0   Min.   :290.0   Min.   : 92.0   Min.   :1.000    
##  1st Qu.:125.8   1st Qu.:308.0   1st Qu.:103.0   1st Qu.:2.000    
##  Median :250.5   Median :317.0   Median :107.0   Median :3.000    
##  Mean   :250.5   Mean   :316.5   Mean   :107.2   Mean   :3.114    
##  3rd Qu.:375.2   3rd Qu.:325.0   3rd Qu.:112.0   3rd Qu.:4.000    
##  Max.   :500.0   Max.   :340.0   Max.   :120.0   Max.   :5.000    
##       SOP             LOR             CGPA          Research   
##  Min.   :1.000   Min.   :1.000   Min.   :6.800   Min.   :0.00  
##  1st Qu.:2.500   1st Qu.:3.000   1st Qu.:8.127   1st Qu.:0.00  
##  Median :3.500   Median :3.500   Median :8.560   Median :1.00  
##  Mean   :3.374   Mean   :3.484   Mean   :8.576   Mean   :0.56  
##  3rd Qu.:4.000   3rd Qu.:4.000   3rd Qu.:9.040   3rd Qu.:1.00  
##  Max.   :5.000   Max.   :5.000   Max.   :9.920   Max.   :1.00  
##  Chance.of.Admit 
##  Min.   :0.3400  
##  1st Qu.:0.6300  
##  Median :0.7200  
##  Mean   :0.7217  
##  3rd Qu.:0.8200  
##  Max.   :0.9700
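
To make the centers easier to compare, here is a minimal sketch that lists each column's mean and median side by side (at this point every column, including Research and Serial.No., is still numeric):

# Side-by-side mean and median of every column
data.frame(mean   = round(sapply(admission, mean), 3),
           median = sapply(admission, median))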

The medians and means of the variables (except the Research variable) are close to each other, which suggests the data are roughly symmetrically distributed rather than heavily skewed. Next, let’s convert Research to a factor, check its distribution, and drop the Serial.No. column, which is only a row identifier.

admission <- admission %>% 
  mutate(Research = as.factor(Research)) %>% 
  select(-Serial.No.)

table(admission$Research)
## 
##   0   1 
## 220 280

The Research variable has a reasonably balanced proportion (220 applicants without research experience vs. 280 with).
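
For a relative view, a quick sketch of the same table expressed as proportions:

# Proportion of applicants without (0) and with (1) research experience
prop.table(table(admission$Research))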

Next, check whether there are any missing values.

anyNA(admission)
## [1] FALSE

FALSE means there are no missing values in the data.
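
As an extra sanity check (a minimal sketch), the missing values can also be counted per column:

# Per-column count of missing values
colSums(is.na(admission))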

3. Exploratory Data Analysis

In this step, let’s check the correlation between the target and the predictors.

ggcorr(admission, label = TRUE, label_size = 3.5, hjust = 0.75, layout.exp = 3.5)
## Warning in ggcorr(admission, label = TRUE, label_size = 3.5, hjust = 0.75, :
## data in column(s) 'Research' are not numeric and were ignored

From the plot above, we can see that all of the numeric predictors have a strong positive correlation with Chance.of.Admit (Research is skipped by ggcorr because it is a factor).
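
To complement the plot with numbers, a minimal sketch that computes the correlation of each numeric column with the target (Research is again excluded since it is a factor):

# Numeric view of the same correlations, sorted from strongest to weakest
num_vars <- admission[, sapply(admission, is.numeric)]
sort(cor(num_vars)[, "Chance.of.Admit"], decreasing = TRUE)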

4. Modeling

Before we start modeling, we need to split the data into a training set and a test set. The training set will be used to fit the linear regression model, and the test set to evaluate its predictions.

set.seed(88)
index <- sample(nrow(admission), nrow(admission)*0.8)
admission.train <- admission[index,]
admission.test <- admission[-index,]
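
A quick sketch to confirm that the split produced the expected 80/20 row counts:

# Confirm the split sizes: 400 training rows and 100 test rows expected
nrow(admission.train)
nrow(admission.test)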

Let’s start with a model that includes all predictor variables.

admission.model <- lm(Chance.of.Admit~., data = admission.train)
summary(admission.model)
## 
## Call:
## lm(formula = Chance.of.Admit ~ ., data = admission.train)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.267009 -0.023378  0.009299  0.032683  0.157721 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       -1.2574817  0.1155317 -10.884  < 2e-16 ***
## GRE.Score          0.0020065  0.0005545   3.618 0.000335 ***
## TOEFL.Score        0.0022177  0.0009987   2.221 0.026946 *  
## University.Rating  0.0069350  0.0042658   1.626 0.104815    
## SOP                0.0026155  0.0052468   0.498 0.618417    
## LOR                0.0168895  0.0045775   3.690 0.000256 ***
## CGPA               0.1170066  0.0108412  10.793  < 2e-16 ***
## Research1          0.0244389  0.0073723   3.315 0.001002 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.06047 on 392 degrees of freedom
## Multiple R-squared:  0.8159, Adjusted R-squared:  0.8126 
## F-statistic: 248.2 on 7 and 392 DF,  p-value: < 2.2e-16

The adjusted R-squared is 0.813, which means this full model explains about 81.3% of the variation in the target variable (Chance.of.Admit); the rest is explained by factors outside the model. Even though two predictors (University.Rating and SOP) do not have a significant effect, we can still feed all of the variables into the feature-selection process.
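
As a side note, the adjusted R-squared reported above can be reproduced by hand from the multiple R-squared, the number of training observations, and the number of predictor terms (a minimal sketch using the figures from the summary):

# Adjusted R-squared = 1 - (1 - R^2) * (n - 1) / (n - p - 1)
r2 <- summary(admission.model)$r.squared   # multiple R-squared (0.8159 above)
n  <- nrow(admission.train)                # 400 training observations
p  <- 7                                    # predictor terms (Research enters as one dummy)
1 - (1 - r2) * (n - 1) / (n - p - 1)       # should reproduce the reported 0.8126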

# intercept-only model with no predictors (lower bound for stepwise selection)
model.none <- lm(Chance.of.Admit ~ 1, data = admission.train)
# backward elimination
step(object = admission.model, direction = "backward", trace = 0)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     LOR + CGPA + Research, data = admission.train)
## 
## Coefficients:
##       (Intercept)          GRE.Score        TOEFL.Score  University.Rating  
##         -1.264484           0.001997           0.002269           0.007725  
##               LOR               CGPA          Research1  
##          0.017579           0.117981           0.024672
model.back <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
    LOR + CGPA + Research, data = admission.train)
# forward selection
step(object = model.none, scope = list(lower = model.none, upper = admission.model), direction = "forward", trace = 0)
## 
## Call:
## lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + LOR + Research + 
##     TOEFL.Score + University.Rating, data = admission.train)
## 
## Coefficients:
##       (Intercept)               CGPA          GRE.Score                LOR  
##         -1.264484           0.117981           0.001997           0.017579  
##         Research1        TOEFL.Score  University.Rating  
##          0.024672           0.002269           0.007725
model.forward <- lm(formula = Chance.of.Admit ~ CGPA + GRE.Score + LOR + Research + 
    TOEFL.Score + University.Rating, data = admission.train)
# stepwise selection in both directions
step(object = admission.model, scope = list(lower = model.none, upper = admission.model), direction = "both", trace = 0)
## 
## Call:
## lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
##     LOR + CGPA + Research, data = admission.train)
## 
## Coefficients:
##       (Intercept)          GRE.Score        TOEFL.Score  University.Rating  
##         -1.264484           0.001997           0.002269           0.007725  
##               LOR               CGPA          Research1  
##          0.017579           0.117981           0.024672
model.both <- lm(formula = Chance.of.Admit ~ GRE.Score + TOEFL.Score + University.Rating + 
    LOR + CGPA + Research, data = admission.train)

Since the three models ended up with the same set of predictors, their adjusted R-squared values should be identical. We can verify this below.

summary(model.back)$adj.r.squared
## [1] 0.8129988
summary(model.forward)$adj.r.squared
## [1] 0.8129988
summary(model.both)$adj.r.squared
## [1] 0.8129988

Model candidate: Chance.of.Admit = -1.264484 + 0.001997(GRE.Score) + 0.002269(TOEFL.Score) + 0.007725(University.Rating) + 0.017579(LOR) + 0.117981(CGPA) + 0.024672(Research = 1)
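
As a sanity check (a minimal sketch), the fitted coefficients can be applied by hand to one test row and compared with predict():

# Verify the candidate equation against predict() for the first test row
new_row <- admission.test[1, ]
x <- c(1,                                   # intercept
       new_row$GRE.Score, new_row$TOEFL.Score, new_row$University.Rating,
       new_row$LOR, new_row$CGPA,
       as.numeric(as.character(new_row$Research)))  # Research dummy: 1 if Research == "1"
sum(coef(model.back) * x)                   # manual computation from the equation
predict(model.back, newdata = new_row)      # should return the same value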

Since this model achieves a slightly higher adjusted R-squared (0.8130 vs. 0.8126) with one fewer predictor than the full model, we will use model.back as our final model.

5. Model Prediction & Error

pred_test <- predict(object = model.back, newdata = admission.test)

Check the prediction error on both the training and the test data.

# MSE on the training data
MSE(y_pred = model.back$fitted.values, y_true = admission.train$Chance.of.Admit)
## [1] 0.003585992
# MSE on the test data
MSE(y_pred = pred_test, y_true = admission.test$Chance.of.Admit)
## [1] 0.003384234
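
Since MSE is expressed in squared units, a quick sketch of the RMSE puts the test error back on the same 0-1 scale as Chance.of.Admit:

# RMSE on the test set, in the same units as Chance.of.Admit
sqrt(MSE(y_pred = pred_test, y_true = admission.test$Chance.of.Admit))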

6. Model Evaluation

6.1 Normality

hist(model.back$residuals)

shapiro.test(x = model.back$residuals)
## 
##  Shapiro-Wilk normality test
## 
## data:  model.back$residuals
## W = 0.91595, p-value = 3.856e-14

This is an unexpected result. The p-value is below 0.05, so we reject H0, which states that the residuals are normally distributed. Before judging whether the model is still usable, let’s move on to the other assumption tests.
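
Besides the histogram, a Q-Q plot (a minimal base-R sketch) helps judge how far the residuals deviate from a normal distribution:

# Q-Q plot of the residuals against a theoretical normal distribution
qqnorm(model.back$residuals)
qqline(model.back$residuals, col = "red")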

6.2 Homoscedasticity

plot(model.back$fitted.values, model.back$residuals)
abline(h = 0, col = "red")

bptest(model.back)
## 
##  studentized Breusch-Pagan test
## 
## data:  model.back
## BP = 18.6, df = 6, p-value = 0.004895

Again we reject H0: the Breusch-Pagan test indicates that the residuals are heteroscedastic rather than homoscedastic. Let’s check the multicollinearity assumption before deciding whether the model needs further tuning.

6.3 Multicollinearity

vif(model.back)
##         GRE.Score       TOEFL.Score University.Rating               LOR 
##          4.254761          3.882034          2.233527          1.849029 
##              CGPA          Research 
##          4.440739          1.460715

No variable has a VIF above 10, so there is no indication of severe multicollinearity, i.e. no strong correlation among the predictors.

7. Conclusion

  • Even though the normality and homoscedasticity tests did not give the expected results, we can tolerate these violations based on visual inspection of the plots: for normality, most of the residuals in the histogram are concentrated near 0, and for homoscedasticity, the residuals-vs-fitted plot shows no clear pattern even though the Breusch-Pagan test indicates heteroscedasticity. The multicollinearity check also shows no strong correlation among the predictors.
  • model.back can be used to predict the chance of admission. The difference in MSE between the training and test data is relatively small, which suggests the model generalizes well and is not overfitting.