Predicting School Admission

Objective

The objective of this model is to create model to determine whether someone is admitted to Master Degree Program or not, based on several predictors including:

GRE Scores ( out of 340 )
TOEFL Scores ( out of 120 )
University Rating ( out of 5 )
Statement of Purpose and Letter of Recommendation Strength ( out of 5 )
Undergraduate GPA ( out of 10 )
Research Experience ( either 0 or 1 )

Data Source

The data is obtained from this link https://www.kaggle.com/mohansacharya/graduate-admissions

Library

Add several libraries

library(tidyverse)

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --

## v ggplot2 3.3.2     v purrr   0.3.4
## v tibble  3.0.4     v dplyr   1.0.2
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.4.0     v forcats 0.5.0

## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()

library(e1071)
library(partykit)

## Loading required package: grid

## Loading required package: libcoin

## Loading required package: mvtnorm

library(caret)

## Loading required package: lattice

## 
## Attaching package: 'caret'

## The following object is masked from 'package:purrr':
## 
##     lift

library(ROCR)

Data Preprocessing

Data Input

db <- read.csv("Admission_Predict_Ver1.1.csv")
head(db)

##   Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 1          1       337         118                 4 4.5 4.5 9.65        1
## 2          2       324         107                 4 4.0 4.5 8.87        1
## 3          3       316         104                 3 3.0 3.5 8.00        1
## 4          4       322         110                 3 3.5 2.5 8.67        1
## 5          5       314         103                 2 2.0 3.0 8.21        0
## 6          6       330         115                 5 4.5 3.0 9.34        1
##   Chance.of.Admit
## 1            0.92
## 2            0.76
## 3            0.72
## 4            0.80
## 5            0.65
## 6            0.90

Remove Unecessary Column

As serial number is not relevant with chance of admit at all, that column will be removed

db <- db %>%
  select(-Serial.No.)

Check Missing Value

colSums(is.na(db))

##         GRE.Score       TOEFL.Score University.Rating               SOP 
##                 0                 0                 0                 0 
##               LOR              CGPA          Research   Chance.of.Admit 
##                 0                 0                 0                 0

Check Data Type

str(db)

## 'data.frame':    500 obs. of  8 variables:
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

db <- db %>%
  mutate(University.Rating = as.factor(University.Rating))%>%
  mutate(Research = ifelse(Research == 0, "No", "Yes"))%>%
  mutate(Research = as.factor(Research))
summary(db)

##    GRE.Score      TOEFL.Score    University.Rating      SOP       
##  Min.   :290.0   Min.   : 92.0   1: 34             Min.   :1.000  
##  1st Qu.:308.0   1st Qu.:103.0   2:126             1st Qu.:2.500  
##  Median :317.0   Median :107.0   3:162             Median :3.500  
##  Mean   :316.5   Mean   :107.2   4:105             Mean   :3.374  
##  3rd Qu.:325.0   3rd Qu.:112.0   5: 73             3rd Qu.:4.000  
##  Max.   :340.0   Max.   :120.0                     Max.   :5.000  
##       LOR             CGPA       Research  Chance.of.Admit 
##  Min.   :1.000   Min.   :6.800   No :220   Min.   :0.3400  
##  1st Qu.:3.000   1st Qu.:8.127   Yes:280   1st Qu.:0.6300  
##  Median :3.500   Median :8.560             Median :0.7200  
##  Mean   :3.484   Mean   :8.576             Mean   :0.7217  
##  3rd Qu.:4.000   3rd Qu.:9.040             3rd Qu.:0.8200  
##  Max.   :5.000   Max.   :9.920             Max.   :0.9700

Transforming Target Variable

Transforming target variable to Yes or No, if chance of admit > 75% (passing grade), meaning YES

db <- db %>%
  mutate(Chance.of.Admit = ifelse(Chance.of.Admit <= 0.75, "No", "Yes"))%>%
  mutate(Chance.of.Admit = as.factor(Chance.of.Admit))

Near Zero Var Check

nearZeroVar(db)

## integer(0)

Data Final Check

str(db)

## 'data.frame':    500 obs. of  8 variables:
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: Factor w/ 5 levels "1","2","3","4",..: 4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 1 1 1 ...
##  $ Chance.of.Admit  : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 1 1 1 1 ...

summary(db)

##    GRE.Score      TOEFL.Score    University.Rating      SOP       
##  Min.   :290.0   Min.   : 92.0   1: 34             Min.   :1.000  
##  1st Qu.:308.0   1st Qu.:103.0   2:126             1st Qu.:2.500  
##  Median :317.0   Median :107.0   3:162             Median :3.500  
##  Mean   :316.5   Mean   :107.2   4:105             Mean   :3.374  
##  3rd Qu.:325.0   3rd Qu.:112.0   5: 73             3rd Qu.:4.000  
##  Max.   :340.0   Max.   :120.0                     Max.   :5.000  
##       LOR             CGPA       Research  Chance.of.Admit
##  Min.   :1.000   Min.   :6.800   No :220   No :290        
##  1st Qu.:3.000   1st Qu.:8.127   Yes:280   Yes:210        
##  Median :3.500   Median :8.560                            
##  Mean   :3.484   Mean   :8.576                            
##  3rd Qu.:4.000   3rd Qu.:9.040                            
##  Max.   :5.000   Max.   :9.920

head(db)

##   GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research Chance.of.Admit
## 1       337         118                 4 4.5 4.5 9.65      Yes             Yes
## 2       324         107                 4 4.0 4.5 8.87      Yes             Yes
## 3       316         104                 3 3.0 3.5 8.00      Yes              No
## 4       322         110                 3 3.5 2.5 8.67      Yes             Yes
## 5       314         103                 2 2.0 3.0 8.21       No              No
## 6       330         115                 5 4.5 3.0 9.34      Yes             Yes

Cross Validation Scheme

Splitting training set and test set, training set will be 80% from total data

RNGkind(sample.kind = "Rounding")

## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used

set.seed(100)

intrain <- sample(nrow(db), size = nrow(db) * 0.8)
train_set <-db[intrain,]
test_set <- db[-intrain,]

Data Proportion

prop.table(table(db$Chance.of.Admit))

## 
##   No  Yes 
## 0.58 0.42

Data Proportion is still considered as Okay

Modelling with Naive Bayes

model_nb <- naiveBayes(x = train_set %>% select(-Chance.of.Admit),
                       y = train_set$Chance.of.Admit, 
                       laplace = 1)

Modelling with Decision Tree

model_dtree <- ctree(formula = Chance.of.Admit ~ ., data = train_set)

Plot

plot(model_dtree, type = 'simple')

## Modelling with Decision Tree

ctrl <- trainControl(method = "repeatedcv",
                     number = 5, # k-fold
                     repeats = 3) # repetisi

model_rf <- train(Chance.of.Admit ~ ., data = train_set, method = "rf", trControl = ctrl)

Prediction with Test Set

prediction_nb <- predict(model_nb, newdata = test_set)

prediction_dtree <- predict(model_dtree, newdata = test_set)
  
prediction_rf <- predict(model_rf, newdata = test_set)

Model Evaluation using Confusion Matrix

Goal

Minimise False Negative, therefore the model has to have the best Recall

Naive Bayes Model

confusionMatrix(prediction_nb, reference = test_set$Chance.of.Admit, positive =  "Yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  53   5
##        Yes  8  34
##                                          
##                Accuracy : 0.87           
##                  95% CI : (0.788, 0.9289)
##     No Information Rate : 0.61           
##     P-Value [Acc > NIR] : 9.34e-09       
##                                          
##                   Kappa : 0.7305         
##                                          
##  Mcnemar's Test P-Value : 0.5791         
##                                          
##             Sensitivity : 0.8718         
##             Specificity : 0.8689         
##          Pos Pred Value : 0.8095         
##          Neg Pred Value : 0.9138         
##              Prevalence : 0.3900         
##          Detection Rate : 0.3400         
##    Detection Prevalence : 0.4200         
##       Balanced Accuracy : 0.8703         
##                                          
##        'Positive' Class : Yes            
##

Decision Tree Model

confusionMatrix(prediction_dtree, reference = test_set$Chance.of.Admit, positive =  "Yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  53   9
##        Yes  8  30
##                                           
##                Accuracy : 0.83            
##                  95% CI : (0.7418, 0.8977)
##     No Information Rate : 0.61            
##     P-Value [Acc > NIR] : 1.632e-06       
##                                           
##                   Kappa : 0.641           
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.7692          
##             Specificity : 0.8689          
##          Pos Pred Value : 0.7895          
##          Neg Pred Value : 0.8548          
##              Prevalence : 0.3900          
##          Detection Rate : 0.3000          
##    Detection Prevalence : 0.3800          
##       Balanced Accuracy : 0.8190          
##                                           
##        'Positive' Class : Yes             
##

Random Forest Model

confusionMatrix(prediction_rf, reference = test_set$Chance.of.Admit, positive =  "Yes")

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction No Yes
##        No  54   7
##        Yes  7  32
##                                           
##                Accuracy : 0.86            
##                  95% CI : (0.7763, 0.9213)
##     No Information Rate : 0.61            
##     P-Value [Acc > NIR] : 3.809e-08       
##                                           
##                   Kappa : 0.7058          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.8205          
##             Specificity : 0.8852          
##          Pos Pred Value : 0.8205          
##          Neg Pred Value : 0.8852          
##              Prevalence : 0.3900          
##          Detection Rate : 0.3200          
##    Detection Prevalence : 0.3900          
##       Balanced Accuracy : 0.8529          
##                                           
##        'Positive' Class : Yes             
##

The best model is the one that have highest sensitivity, which in this case is Naive Bayes Model

Model Evaluation Using AUC and ROC

ROC Plot

default_prob <- predict(object = model_nb, newdata = test_set, type = "raw")
head(default_prob)

##                No          Yes
## [1,] 1.744591e-09 1.000000e+00
## [2,] 6.243815e-04 9.993756e-01
## [3,] 9.996182e-01 3.818286e-04
## [4,] 9.999997e-01 2.547033e-07
## [5,] 8.745990e-01 1.254010e-01
## [6,] 2.568884e-05 9.999743e-01

data_roc <- data.frame(pred_prob = default_prob[,"Yes"],
                       actual = ifelse(test_set$Chance.of.Admit == 'Yes', 1, 0))
head(data_roc)

##      pred_prob actual
## 1 1.000000e+00      1
## 2 9.993756e-01      1
## 3 3.818286e-04      0
## 4 2.547033e-07      0
## 5 1.254010e-01      0
## 6 9.999743e-01      1

default_roc <- prediction(predictions = data_roc$pred_prob,
                      labels = data_roc$actual)
# ROC curve

plot(performance(default_roc, "tpr", "fpr"))

### AUC Value

default_auc <- performance(default_roc, measure = "auc")
default_auc@y.values[[1]]

## [1] 0.962169

AUC is close to 1, meaning the model is good at determining positive and negative class

Conclusion

In this dataset the Naive Bayes is better in predicting the Master Degree Program Admission with the sensitivity 87% compared to decision tree model 76% and random forest model 82%. In terms of accuracy Naive Bayes model also has higher value of 87%, compared to decision tree model 83% and random forest model 86%.

Evaluation using ROC and AUC also shown that the model is good ad determining positive and negative class