The analysis loads the packages tidyverse, plotly, GGally, funModeling, lmtest, car, MLmetrics, and caret; their dependencies (Hmisc, zoo, and carData, among others) are attached automatically.
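
A minimal setup sketch based on the packages attached above:

```r
# Packages used throughout the analysis
library(tidyverse)   # data manipulation and ggplot2
library(plotly)      # interactive plots
library(GGally)      # correlation / pair plots
library(funModeling) # EDA helpers such as plot_num()
library(lmtest)      # model diagnostics
library(car)         # variance inflation factors
library(MLmetrics)   # evaluation metrics
library(caret)       # confusionMatrix() and modeling utilities
```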

1 Background

This is a learn-by-building project to predict the chance of a student being admitted to a university Masters program, based on several academic performance measures, using Logistic Regression and K-Nearest Neighbor methods.

2 Source of Dataset

The analysis uses the following dataset:

Mohan S Acharya, Asfia Armaan, Aneeta S Antony: A Comparison of Regression Models for Prediction of Graduate Admissions, IEEE International Conference on Computational Intelligence in Data Science, 2019.

https://www.kaggle.com/mohansacharya/graduate-admissions#Admission_Predict.csv

3 Importing Dataset
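
A minimal import sketch; the CSV name follows the Kaggle download, and the object name admission is an assumption:

```r
# Read the dataset downloaded from Kaggle and inspect its structure
admission <- read.csv("Admission_Predict.csv")
str(admission)
```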

## 'data.frame':    400 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

3.1 Context

This dataset was created for predicting graduate admissions from an Indian perspective.

3.2 Content

The dataset contains several parameters that are considered important when applying to Masters programs.

The parameters included are as follows:
1. GRE Scores (out of 340)
2. TOEFL Scores (out of 120)
3. University Rating (out of 5)
4. Statement of Purpose / SOP (out of 5)
5. Letter of Recommendation Strength / LOR (out of 5)
6. Undergraduate GPA (out of 10)
7. Research Experience (either 0 or 1)
8. Chance of Admit (ranging from 0 to 1)

4 Inspecting Dataset
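
The missing-value check below counts NA values per column (object name admission as assumed above):

```r
# Number of missing values in each column
colSums(is.na(admission))
```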

##        Serial.No.         GRE.Score       TOEFL.Score University.Rating 
##                 0                 0                 0                 0 
##               SOP               LOR              CGPA          Research 
##                 0                 0                 0                 0 
##   Chance.of.Admit 
##                 0

There are no NA or missing values in any column of the dataset.

The variable Serial.No. is excluded from the predictors because it is a mere index with no relationship to the other variables.

The target variable Label.of.Admit takes the category "1" (admitted) or "0" (not admitted).

The variable Chance.of.Admit is excluded from the predictors because Label.of.Admit is derived directly from it.
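
A sketch of this preparation step; the 0.5 cutoff on Chance.of.Admit is an assumption, since the report does not state the exact threshold:

```r
# Derive the binary target, fix factor types, and drop the excluded columns
admission <- admission %>%
  mutate(Label.of.Admit    = as.factor(ifelse(Chance.of.Admit > 0.5, "1", "0")),
         University.Rating = as.factor(University.Rating),
         Research          = as.factor(Research)) %>%
  select(-Serial.No., -Chance.of.Admit)
```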

## 'data.frame':    400 obs. of  8 variables:
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: Factor w/ 5 levels "1","2","3","4",..: 4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
##  $ Label.of.Admit   : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 1 ...

The variables University.Rating and Research are converted from integer to factor, and Label.of.Admit from character to factor.

5 Solving Business Problem

The business problem is to predict the chance of a student being admitted to a university Masters program based on several academic performance measures. The model will be developed with the following variables:

  • Target variable: Label.of.Admit

  • Predictor variables: GRE.Score, TOEFL.Score, University.Rating, Research, SOP, LOR, CGPA

6 Exploratory Data Analysis

6.1 Checking data distribution
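
The summary below comes from summary(); the histograms mentioned afterwards can be drawn with funModeling::plot_num() (the exact plotting call is an assumption):

```r
# Five-number summaries for numeric variables, counts for factors
summary(admission)
# Histograms of all numeric variables
plot_num(admission)
```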

##    GRE.Score      TOEFL.Score    University.Rating      SOP     
##  Min.   :290.0   Min.   : 92.0   1: 26             Min.   :1.0  
##  1st Qu.:308.0   1st Qu.:103.0   2:107             1st Qu.:2.5  
##  Median :317.0   Median :107.0   3:133             Median :3.5  
##  Mean   :316.8   Mean   :107.4   4: 74             Mean   :3.4  
##  3rd Qu.:325.0   3rd Qu.:112.0   5: 60             3rd Qu.:4.0  
##  Max.   :340.0   Max.   :120.0                     Max.   :5.0  
##       LOR             CGPA       Research Label.of.Admit
##  Min.   :1.000   Min.   :6.800   0:181    0: 35         
##  1st Qu.:3.000   1st Qu.:8.170   1:219    1:365         
##  Median :3.500   Median :8.610                          
##  Mean   :3.453   Mean   :8.599                          
##  3rd Qu.:4.000   3rd Qu.:9.062                          
##  Max.   :5.000   Max.   :9.920

The variables GRE.Score, TOEFL.Score, SOP, LOR, and CGPA appear evenly distributed: each median lies roughly midway between the minimum and maximum. The histograms likewise show approximately normal distributions.

6.2 Checking class-imbalance
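
The class proportions below come from a call such as:

```r
# Proportion of each class of the target variable
prop.table(table(admission$Label.of.Admit))
```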

## 
##    0    1 
## 0.09 0.91

The classes are imbalanced: about 91% of the observations fall into the positive class ("1") of the target variable Label.of.Admit.
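
7 Splitting Data

The dataset is split into training data admission.train (320 observations, used to fit the models) and test data admission.test (80 observations, used for evaluation), an 80/20 split. A minimal sketch, assuming a simple random split (the seed is illustrative):

```r
# 80/20 random train/test split; the seed is for illustration only
set.seed(100)
idx <- sample(nrow(admission), size = 0.8 * nrow(admission))
admission.train <- admission[idx, ]
admission.test  <- admission[-idx, ]
```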

8 Developing Model

The model for the business problem is developed using Logistic Regression, with the same five predictors used in the Ordinary Least Squares (OLS) model: GRE.Score, TOEFL.Score, LOR, CGPA, and Research.
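
A sketch of the fit; the object name model.logit is an assumption, and the formula matches the Call shown below:

```r
# Logistic regression on the training data
model.logit <- glm(Label.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research,
                   family = "binomial", data = admission.train)
summary(model.logit)
```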

## 
## Call:
## glm(formula = Label.of.Admit ~ GRE.Score + TOEFL.Score + LOR + 
##     CGPA + Research, family = "binomial", data = admission.train)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.97161   0.03122   0.11767   0.26511   1.81518  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -39.74380   10.27267  -3.869 0.000109 ***
## GRE.Score     0.02323    0.04159   0.558 0.576517    
## TOEFL.Score   0.10104    0.10443   0.968 0.333274    
## LOR           0.75639    0.44758   1.690 0.091037 .  
## CGPA          2.72822    0.92169   2.960 0.003076 ** 
## Research1     0.14456    0.69873   0.207 0.836096    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 180.360  on 319  degrees of freedom
## Residual deviance:  98.596  on 314  degrees of freedom
## AIC: 110.6
## 
## Number of Fisher Scoring iterations: 7
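
The model is then reduced by backward stepwise elimination on AIC, which drops Research and GRE.Score (object names assumed):

```r
# Backward elimination by AIC; keeps TOEFL.Score, LOR, and CGPA
model.step <- step(model.logit, direction = "backward")
```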
## Start:  AIC=110.6
## Label.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
## 
##               Df Deviance    AIC
## - Research     1   98.639 108.64
## - GRE.Score    1   98.909 108.91
## - TOEFL.Score  1   99.549 109.55
## <none>             98.596 110.60
## - LOR          1  101.559 111.56
## - CGPA         1  108.010 118.01
## 
## Step:  AIC=108.64
## Label.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA
## 
##               Df Deviance    AIC
## - GRE.Score    1   98.995 107.00
## - TOEFL.Score  1   99.560 107.56
## <none>             98.639 108.64
## - LOR          1  101.922 109.92
## - CGPA         1  108.391 116.39
## 
## Step:  AIC=107
## Label.of.Admit ~ TOEFL.Score + LOR + CGPA
## 
##               Df Deviance    AIC
## <none>             98.995 107.00
## - TOEFL.Score  1  101.246 107.25
## - LOR          1  102.414 108.41
## - CGPA         1  110.104 116.10

9 Testing Model

9.1 Check linearity of predictor & log of odds


The three predictors retained in the model, TOEFL.Score, LOR, and CGPA, show an approximately linear relationship with the log of odds of the target variable Label.of.Admit.

9.2 Check multicollinearity
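
The variance inflation factors below come from car::vif() applied to the full five-predictor model (model name as assumed earlier):

```r
# VIF for each predictor in the logistic model
car::vif(model.logit)
```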

##   GRE.Score TOEFL.Score         LOR        CGPA    Research 
##    1.688381    2.094007    1.182333    1.634875    1.111747

All VIF values are below 10, indicating no multicollinearity among the predictor variables; the no-multicollinearity assumption is fulfilled.

10 Predicting Target

10.1 Logistic Regression

##    9   14   21   23   24   26   33   38   39   41   55   59   63   64   65   70 
## 0.66 0.94 0.80 1.00 1.00 1.00 1.00 0.70 0.41 0.95 0.96 0.06 0.94 0.98 1.00 1.00 
##   73   83   84   92   94   99  102  109  111  116  118  120  126  129  139  151 
## 1.00 1.00 1.00 0.65 0.70 1.00 0.93 1.00 0.98 1.00 0.54 0.99 0.97 1.00 1.00 1.00 
##  152  161  167  171  173  174  181  185  189  200  203  204  206  237  244  249 
## 1.00 0.68 0.99 0.90 1.00 1.00 0.93 0.98 1.00 1.00 1.00 1.00 0.61 1.00 1.00 1.00 
##  262  265  271  272  274  276  277  283  292  293  297  326  327  331  333  337 
## 0.96 0.99 0.94 0.47 0.57 1.00 1.00 0.99 0.66 0.64 0.99 1.00 0.70 0.99 0.93 0.99 
##  340  342  347  353  360  369  377  379  381  382  386  388  389  390  395  398 
## 0.99 1.00 0.35 0.89 0.93 0.36 0.21 0.48 0.99 0.99 1.00 0.95 0.46 0.99 1.00 1.00

The predicted probabilities of Label.of.Admit for the test data (admission.test) are saved in the new variable pred.Admit.

The test observations classified from pred.Admit are saved in the new variable pred.Label.
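
A sketch of the prediction step; the use of the stepwise model and the 0.5 classification cutoff are assumptions:

```r
# Predicted admission probabilities for the test set
pred.Admit <- predict(model.step, newdata = admission.test, type = "response")
# Classify at an assumed 0.5 cutoff
pred.Label <- as.factor(ifelse(pred.Admit > 0.5, "1", "0"))
```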

10.2 K-Nearest Neighbor

10.2.1 Preparing predictor data
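
KNN works on numeric predictors measured on a common scale. A sketch, assuming the numeric predictors are z-scaled with the training-set parameters:

```r
# Scale the training predictors, then apply the same centering/scaling to the test set
num_vars <- c("GRE.Score", "TOEFL.Score", "SOP", "LOR", "CGPA")
train_x <- scale(admission.train[, num_vars])
test_x  <- scale(admission.test[, num_vars],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
```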

10.2.2 Preparing label data
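
The class labels are kept separately; the name test_label matches the output in section 11.2, while train_label is an assumption:

```r
# Target labels for the training and test sets
train_label <- admission.train$Label.of.Admit
test_label  <- admission.test$Label.of.Admit
```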

10.2.3 Finding optimum k
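
A common starting point is a k near the square root of the training-set size:

```r
# Square root of the number of training observations
round(sqrt(nrow(admission.train)))
```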

## [1] 18

The square root of the training-set size rounds to 18; since an odd k avoids ties in the two-class majority vote, we use k = 17 as the optimum for the target: 1 = ‘Admitted’ and 0 = ‘Not Admitted’.
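
The prediction itself can be made with class::knn (the class package and object names are assumptions consistent with the outputs in section 11.2):

```r
# KNN classification with k = 17
library(class)
knn_pred <- knn(train = train_x, test = test_x, cl = train_label, k = 17)
```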

11 Evaluating Model

11.1 Logistic Regression
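
The evaluation below uses a raw contingency table followed by caret::confusionMatrix() (a sketch):

```r
# Cross-tabulation of predicted vs. actual labels
table(predicted = pred.Label, actual = admission.test$Label.of.Admit)
# Full evaluation with the positive class set to "1"
confusionMatrix(pred.Label, admission.test$Label.of.Admit, positive = "1")
```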

##          actual
## predicted  0  1
##         0  4  4
##         1  5 67
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  4  4
##          1  5 67
##                                           
##                Accuracy : 0.8875          
##                  95% CI : (0.7972, 0.9472)
##     No Information Rate : 0.8875          
##     P-Value [Acc > NIR] : 0.5876          
##                                           
##                   Kappa : 0.4079          
##                                           
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.9437          
##             Specificity : 0.4444          
##          Pos Pred Value : 0.9306          
##          Neg Pred Value : 0.5000          
##              Prevalence : 0.8875          
##          Detection Rate : 0.8375          
##    Detection Prevalence : 0.9000          
##       Balanced Accuracy : 0.6941          
##                                           
##        'Positive' Class : 1               
## 

11.2 K-Nearest Neighbor

## 'data.frame':    80 obs. of  1 variable:
##  $ knn_pred: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## 'data.frame':    80 obs. of  1 variable:
##  $ test_label: Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 2 1 ...
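
The same evaluation applied to the KNN predictions (a sketch, treating knn_pred and test_label as factor vectors; the report stores them in one-column data frames):

```r
# Confusion matrix for KNN with the positive class set to "1"
confusionMatrix(knn_pred, test_label, positive = "1")
```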
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction  0  1
##          0  4  2
##          1  5 69
##                                          
##                Accuracy : 0.9125         
##                  95% CI : (0.828, 0.9641)
##     No Information Rate : 0.8875         
##     P-Value [Acc > NIR] : 0.3098         
##                                          
##                   Kappa : 0.4872         
##                                          
##  Mcnemar's Test P-Value : 0.4497         
##                                          
##             Sensitivity : 0.9718         
##             Specificity : 0.4444         
##          Pos Pred Value : 0.9324         
##          Neg Pred Value : 0.6667         
##              Prevalence : 0.8875         
##          Detection Rate : 0.8625         
##    Detection Prevalence : 0.9250         
##       Balanced Accuracy : 0.7081         
##                                          
##        'Positive' Class : 1              
## 

12 Summary

The evaluation of the logistic regression confusion matrix on the test data is as follows:
1. Accuracy is 0.8875, meaning that 88.75% of the test observations are correctly classified.
2. Sensitivity / Recall is 0.9437, meaning that false negatives (FN) are few: 94.37% of the actual positives are correctly classified.
3. Pos Pred Value / Precision is 0.9306, meaning that false positives (FP) are few: 93.06% of the positive predictions are correct.

The evaluation of the KNN confusion matrix on the test data is as follows:
1. Accuracy is 0.9125, meaning that 91.25% of the test observations are correctly classified.
2. Sensitivity / Recall is 0.9718, meaning that false negatives (FN) are few: 97.18% of the actual positives are correctly classified.
3. Pos Pred Value / Precision is 0.9324, meaning that false positives (FP) are few: 93.24% of the positive predictions are correct.

Comparing the confusion matrices above, the K-Nearest Neighbor model predicts better than the Logistic Regression model, with higher accuracy, recall, and precision on the test data.