The objective of this model is to create model to determine whether someone is admitted to Master Degree Program or not, based on several predictors including:
The data is obtained from this link https://www.kaggle.com/mohansacharya/graduate-admissions
Add several libraries
## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.2 v purrr 0.3.4
## v tibble 3.0.4 v dplyr 1.0.2
## v tidyr 1.1.2 v stringr 1.4.0
## v readr 1.4.0 v forcats 0.5.0
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
## Loading required package: grid
## Loading required package: libcoin
## Loading required package: mvtnorm
## Loading required package: lattice
##
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
##
## lift
## Serial.No. GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research
## 1 1 337 118 4 4.5 4.5 9.65 1
## 2 2 324 107 4 4.0 4.5 8.87 1
## 3 3 316 104 3 3.0 3.5 8.00 1
## 4 4 322 110 3 3.5 2.5 8.67 1
## 5 5 314 103 2 2.0 3.0 8.21 0
## 6 6 330 115 5 4.5 3.0 9.34 1
## Chance.of.Admit
## 1 0.92
## 2 0.76
## 3 0.72
## 4 0.80
## 5 0.65
## 6 0.90
As serial number is not relevant with chance of admit at all, that column will be removed
## GRE.Score TOEFL.Score University.Rating SOP
## 0 0 0 0
## LOR CGPA Research Chance.of.Admit
## 0 0 0 0
## 'data.frame': 500 obs. of 8 variables:
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
db <- db %>%
mutate(University.Rating = as.factor(University.Rating))%>%
mutate(Research = ifelse(Research == 0, "No", "Yes"))%>%
mutate(Research = as.factor(Research))
summary(db)## GRE.Score TOEFL.Score University.Rating SOP
## Min. :290.0 Min. : 92.0 1: 34 Min. :1.000
## 1st Qu.:308.0 1st Qu.:103.0 2:126 1st Qu.:2.500
## Median :317.0 Median :107.0 3:162 Median :3.500
## Mean :316.5 Mean :107.2 4:105 Mean :3.374
## 3rd Qu.:325.0 3rd Qu.:112.0 5: 73 3rd Qu.:4.000
## Max. :340.0 Max. :120.0 Max. :5.000
## LOR CGPA Research Chance.of.Admit
## Min. :1.000 Min. :6.800 No :220 Min. :0.3400
## 1st Qu.:3.000 1st Qu.:8.127 Yes:280 1st Qu.:0.6300
## Median :3.500 Median :8.560 Median :0.7200
## Mean :3.484 Mean :8.576 Mean :0.7217
## 3rd Qu.:4.000 3rd Qu.:9.040 3rd Qu.:0.8200
## Max. :5.000 Max. :9.920 Max. :0.9700
Transforming target variable to Yes or No, if chance of admit > 75% (passing grade), meaning YES
## 'data.frame': 500 obs. of 8 variables:
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: Factor w/ 5 levels "1","2","3","4",..: 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : Factor w/ 2 levels "No","Yes": 2 2 2 2 1 2 2 1 1 1 ...
## $ Chance.of.Admit : Factor w/ 2 levels "No","Yes": 2 2 1 2 1 2 1 1 1 1 ...
## GRE.Score TOEFL.Score University.Rating SOP
## Min. :290.0 Min. : 92.0 1: 34 Min. :1.000
## 1st Qu.:308.0 1st Qu.:103.0 2:126 1st Qu.:2.500
## Median :317.0 Median :107.0 3:162 Median :3.500
## Mean :316.5 Mean :107.2 4:105 Mean :3.374
## 3rd Qu.:325.0 3rd Qu.:112.0 5: 73 3rd Qu.:4.000
## Max. :340.0 Max. :120.0 Max. :5.000
## LOR CGPA Research Chance.of.Admit
## Min. :1.000 Min. :6.800 No :220 No :290
## 1st Qu.:3.000 1st Qu.:8.127 Yes:280 Yes:210
## Median :3.500 Median :8.560
## Mean :3.484 Mean :8.576
## 3rd Qu.:4.000 3rd Qu.:9.040
## Max. :5.000 Max. :9.920
## GRE.Score TOEFL.Score University.Rating SOP LOR CGPA Research Chance.of.Admit
## 1 337 118 4 4.5 4.5 9.65 Yes Yes
## 2 324 107 4 4.0 4.5 8.87 Yes Yes
## 3 316 104 3 3.0 3.5 8.00 Yes No
## 4 322 110 3 3.5 2.5 8.67 Yes Yes
## 5 314 103 2 2.0 3.0 8.21 No No
## 6 330 115 5 4.5 3.0 9.34 Yes Yes
Splitting training set and test set, training set will be 80% from total data
## Warning in RNGkind(sample.kind = "Rounding"): non-uniform 'Rounding' sampler
## used
##
## No Yes
## 0.58 0.42
Data Proportion is still considered as Okay
Goal
Minimise False Negative, therefore the model has to have the best Recall
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 53 5
## Yes 8 34
##
## Accuracy : 0.87
## 95% CI : (0.788, 0.9289)
## No Information Rate : 0.61
## P-Value [Acc > NIR] : 9.34e-09
##
## Kappa : 0.7305
##
## Mcnemar's Test P-Value : 0.5791
##
## Sensitivity : 0.8718
## Specificity : 0.8689
## Pos Pred Value : 0.8095
## Neg Pred Value : 0.9138
## Prevalence : 0.3900
## Detection Rate : 0.3400
## Detection Prevalence : 0.4200
## Balanced Accuracy : 0.8703
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 53 9
## Yes 8 30
##
## Accuracy : 0.83
## 95% CI : (0.7418, 0.8977)
## No Information Rate : 0.61
## P-Value [Acc > NIR] : 1.632e-06
##
## Kappa : 0.641
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.7692
## Specificity : 0.8689
## Pos Pred Value : 0.7895
## Neg Pred Value : 0.8548
## Prevalence : 0.3900
## Detection Rate : 0.3000
## Detection Prevalence : 0.3800
## Balanced Accuracy : 0.8190
##
## 'Positive' Class : Yes
##
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 54 7
## Yes 7 32
##
## Accuracy : 0.86
## 95% CI : (0.7763, 0.9213)
## No Information Rate : 0.61
## P-Value [Acc > NIR] : 3.809e-08
##
## Kappa : 0.7058
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.8205
## Specificity : 0.8852
## Pos Pred Value : 0.8205
## Neg Pred Value : 0.8852
## Prevalence : 0.3900
## Detection Rate : 0.3200
## Detection Prevalence : 0.3900
## Balanced Accuracy : 0.8529
##
## 'Positive' Class : Yes
##
The best model is the one that have highest sensitivity, which in this case is Naive Bayes Model
## No Yes
## [1,] 1.744591e-09 1.000000e+00
## [2,] 6.243815e-04 9.993756e-01
## [3,] 9.996182e-01 3.818286e-04
## [4,] 9.999997e-01 2.547033e-07
## [5,] 8.745990e-01 1.254010e-01
## [6,] 2.568884e-05 9.999743e-01
data_roc <- data.frame(pred_prob = default_prob[,"Yes"],
actual = ifelse(test_set$Chance.of.Admit == 'Yes', 1, 0))
head(data_roc)## pred_prob actual
## 1 1.000000e+00 1
## 2 9.993756e-01 1
## 3 3.818286e-04 0
## 4 2.547033e-07 0
## 5 1.254010e-01 0
## 6 9.999743e-01 1
default_roc <- prediction(predictions = data_roc$pred_prob,
labels = data_roc$actual)
# ROC curve
plot(performance(default_roc, "tpr", "fpr")) ### AUC Value
## [1] 0.962169
AUC is close to 1, meaning the model is good at determining positive and negative class
In this dataset the Naive Bayes is better in predicting the Master Degree Program Admission with the sensitivity 87% compared to decision tree model 76% and random forest model 82%. In terms of accuracy Naive Bayes model also has higher value of 87%, compared to decision tree model 83% and random forest model 86%.
Evaluation using ROC and AUC also shown that the model is good ad determining positive and negative class