Setup
The analysis loads the following packages: tidyverse, plotly, GGally, funModeling (which attaches Hmisc), lmtest (which attaches zoo), car, MLmetrics, and caret. Once attached, dplyr::filter() and dplyr::lag() mask the stats versions, and caret masks MLmetrics' MAE and RMSE.
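A minimal setup chunk that reproduces this environment (the package-version warnings from the original run are suppressed here for readability):

```r
# Load the packages used throughout the analysis
library(tidyverse)   # data wrangling and ggplot2
library(plotly)      # interactive plots
library(GGally)      # pairwise plots
library(funModeling) # quick EDA helpers (attaches Hmisc)
library(lmtest)      # model diagnostics (attaches zoo)
library(car)         # vif() for multicollinearity checks
library(MLmetrics)   # evaluation metrics
library(caret)       # confusionMatrix() and other utilities
```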
Background
This is a learn-by-building project that predicts a student's chance of admission to a university Master's program based on several academic performance measurements, using Logistic Regression and K-Nearest Neighbor (KNN) methods.
Importing Dataset
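The data can be read with read.csv(); a minimal sketch, assuming the Kaggle file is named Admission_Predict.csv and the object is called admission:

```r
# Read the 400-row Graduate Admissions dataset (file name is an assumption)
admission <- read.csv("Admission_Predict.csv")
str(admission)
```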
## 'data.frame': 400 obs. of 9 variables:
## $ Serial.No. : int 1 2 3 4 5 6 7 8 9 10 ...
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: int 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : int 1 1 1 1 0 1 1 0 0 0 ...
## $ Chance.of.Admit : num 0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...
Context
This dataset was created to predict graduate admissions from an Indian perspective.
Content
The dataset contains several parameters that are considered important when applying for Master's programs.
The parameters included are as follows:
1. GRE Scores (out of 340)
2. TOEFL Scores (out of 120)
3. University Rating (out of 5)
4. Statement of Purpose / SOP (out of 5)
5. Letter of Recommendation Strength / LOR (out of 5)
6. Undergraduate GPA (out of 10)
7. Research Experience (either 0 or 1)
8. Chance of Admit (ranging from 0 to 1)
Inspecting Dataset
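A sketch of the missing-value check, assuming the admission object from above:

```r
# Count NA values in every column
colSums(is.na(admission))
```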
## Serial.No. GRE.Score TOEFL.Score University.Rating
## 0 0 0 0
## SOP LOR CGPA Research
## 0 0 0 0
## Chance.of.Admit
## 0
There are no NA or missing values in any column of the dataset.
The variable Serial.No. is excluded from the predictors because it is a row index with no relationship to the other variables.
The target Chance.of.Admit is discretized into a new variable Label.of.Admit with categories "1" (admitted) and "0" (not admitted).
The variable Chance.of.Admit is then excluded from the predictors because Label.of.Admit is derived directly from it.
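A preprocessing sketch, assuming a 0.5 admission-chance threshold (inferred from the output below, where a chance of 0.50 maps to "0"):

```r
# Convert types, derive the class label, and drop unused columns
admission <- admission %>%
  mutate(
    University.Rating = as.factor(University.Rating),
    Research          = as.factor(Research),
    Label.of.Admit    = as.factor(ifelse(Chance.of.Admit > 0.5, "1", "0"))
  ) %>%
  select(-Serial.No., -Chance.of.Admit)
str(admission)
```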
## 'data.frame': 400 obs. of 8 variables:
## $ GRE.Score : int 337 324 316 322 314 330 321 308 302 323 ...
## $ TOEFL.Score : int 118 107 104 110 103 115 109 101 102 108 ...
## $ University.Rating: Factor w/ 5 levels "1","2","3","4",..: 4 4 3 3 2 5 3 2 1 3 ...
## $ SOP : num 4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
## $ LOR : num 4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
## $ CGPA : num 9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
## $ Research : Factor w/ 2 levels "0","1": 2 2 2 2 1 2 2 1 1 1 ...
## $ Label.of.Admit : Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 1 1 ...
The variables University.Rating and Research are converted from integer to factor, and the new Label.of.Admit variable is converted from character to factor.
Solving Business Problem
The model to solve the business problem predicts the chance of a student being admitted to a university Master's program based on several academic performance measurements. The model is developed with the following variables:
Target variable: Label.of.Admit
Predictor variables: GRE.Score, TOEFL.Score, University.Rating, Research, SOP, LOR, CGPA
Exploratory Data Analysis
Checking data distribution
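A sketch of the distribution check; plot_num() from funModeling is one way to draw the histograms referred to below:

```r
# Five-number summaries for every variable
summary(admission)
# Histograms of the numeric variables
plot_num(admission)
```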
## GRE.Score TOEFL.Score University.Rating SOP
## Min. :290.0 Min. : 92.0 1: 26 Min. :1.0
## 1st Qu.:308.0 1st Qu.:103.0 2:107 1st Qu.:2.5
## Median :317.0 Median :107.0 3:133 Median :3.5
## Mean :316.8 Mean :107.4 4: 74 Mean :3.4
## 3rd Qu.:325.0 3rd Qu.:112.0 5: 60 3rd Qu.:4.0
## Max. :340.0 Max. :120.0 Max. :5.0
## LOR CGPA Research Label.of.Admit
## Min. :1.000 Min. :6.800 0:181 0: 35
## 1st Qu.:3.000 1st Qu.:8.170 1:219 1:365
## Median :3.500 Median :8.610
## Mean :3.453 Mean :8.599
## 3rd Qu.:4.000 3rd Qu.:9.062
## Max. :5.000 Max. :9.920
The variables GRE.Score, TOEFL.Score, SOP, LOR, and CGPA appear reasonably well distributed, with each median sitting near the middle of its min-max range; the histograms look approximately normal.
Checking class-imbalance
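A sketch of the class-proportion check:

```r
# Proportion of each target class (rounded to two digits, as in the output below)
round(prop.table(table(admission$Label.of.Admit)), 2)
```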
##
## 0 1
## 0.09 0.91
The classes are imbalanced: about 91% of the observations fall in the positive class ("1") of Label.of.Admit.
Preparing Train and Test Dataset (Cross Validation)
For Logistic Regression
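An 80/20 split sketch sized to match the 320 training and 80 test rows implied by the model output; the seed and object names are assumptions:

```r
set.seed(100)  # seed value is an assumption
idx <- sample(nrow(admission), size = 0.8 * nrow(admission))
admission.train <- admission[idx, ]   # 320 rows for training
admission.test  <- admission[-idx, ]  # 80 rows for testing
round(prop.table(table(admission.test$Label.of.Admit)), 2)
```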
##
## 0 1
## 0.09 0.91
The test set shows roughly the same class proportions as the training data.
Developing Model
The model for solving the business problem is developed using Logistic Regression with 5 (five) predictors: GRE.Score, TOEFL.Score, LOR, CGPA, and Research.
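A sketch of the fit and the backward stepwise selection whose outputs follow; the model object names are assumptions:

```r
# Fit the logistic regression on the training data
model.glm <- glm(Label.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research,
                 data = admission.train, family = "binomial")
summary(model.glm)
# Reduce the model by AIC with backward elimination
model.step <- step(model.glm, direction = "backward")
```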
##
## Call:
## glm(formula = Label.of.Admit ~ GRE.Score + TOEFL.Score + LOR +
## CGPA + Research, family = "binomial", data = admission.train)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -2.97161 0.03122 0.11767 0.26511 1.81518
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -39.74380 10.27267 -3.869 0.000109 ***
## GRE.Score 0.02323 0.04159 0.558 0.576517
## TOEFL.Score 0.10104 0.10443 0.968 0.333274
## LOR 0.75639 0.44758 1.690 0.091037 .
## CGPA 2.72822 0.92169 2.960 0.003076 **
## Research1 0.14456 0.69873 0.207 0.836096
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 180.360 on 319 degrees of freedom
## Residual deviance: 98.596 on 314 degrees of freedom
## AIC: 110.6
##
## Number of Fisher Scoring iterations: 7
## Start: AIC=110.6
## Label.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA + Research
##
## Df Deviance AIC
## - Research 1 98.639 108.64
## - GRE.Score 1 98.909 108.91
## - TOEFL.Score 1 99.549 109.55
## <none> 98.596 110.60
## - LOR 1 101.559 111.56
## - CGPA 1 108.010 118.01
##
## Step: AIC=108.64
## Label.of.Admit ~ GRE.Score + TOEFL.Score + LOR + CGPA
##
## Df Deviance AIC
## - GRE.Score 1 98.995 107.00
## - TOEFL.Score 1 99.560 107.56
## <none> 98.639 108.64
## - LOR 1 101.922 109.92
## - CGPA 1 108.391 116.39
##
## Step: AIC=107
## Label.of.Admit ~ TOEFL.Score + LOR + CGPA
##
## Df Deviance AIC
## <none> 98.995 107.00
## - TOEFL.Score 1 101.246 107.25
## - LOR 1 102.414 108.41
## - CGPA 1 110.104 116.10
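Backward elimination drops Research and then GRE.Score, leaving the final model Label.of.Admit ~ TOEFL.Score + LOR + CGPA with AIC = 107.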
Testing Model
Check linearity of predictor & log of odds
(Figure: per-predictor plots checking linearity against the log of odds.)
The 3 (three) predictors retained in the model (TOEFL.Score, LOR, and CGPA) show an approximately linear relationship with the log of odds of the target Label.of.Admit.
Check multicollinearity
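A sketch of the VIF check with car::vif(), applied to the full five-predictor model (the object name is an assumption):

```r
# Variance inflation factors; values above 10 would signal multicollinearity
car::vif(model.glm)
```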
## GRE.Score TOEFL.Score LOR CGPA Research
## 1.688381 2.094007 1.182333 1.634875 1.111747
All VIF values are below 10, meaning there is no multicollinearity among the predictor variables; this assumption is fulfilled.
Predicting Target
Logistic Regression
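A sketch of the prediction step, assuming the stepwise model object model.step and the conventional 0.5 cutoff:

```r
# Predicted admission probabilities for the test set
pred.Admit <- predict(model.step, newdata = admission.test, type = "response")
# Classify at a 0.5 cutoff (cutoff value is an assumption)
pred.Label <- as.factor(ifelse(pred.Admit > 0.5, "1", "0"))
round(pred.Admit, 2)
```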
## 9 14 21 23 24 26 33 38 39 41 55 59 63 64 65 70
## 0.66 0.94 0.80 1.00 1.00 1.00 1.00 0.70 0.41 0.95 0.96 0.06 0.94 0.98 1.00 1.00
## 73 83 84 92 94 99 102 109 111 116 118 120 126 129 139 151
## 1.00 1.00 1.00 0.65 0.70 1.00 0.93 1.00 0.98 1.00 0.54 0.99 0.97 1.00 1.00 1.00
## 152 161 167 171 173 174 181 185 189 200 203 204 206 237 244 249
## 1.00 0.68 0.99 0.90 1.00 1.00 0.93 0.98 1.00 1.00 1.00 1.00 0.61 1.00 1.00 1.00
## 262 265 271 272 274 276 277 283 292 293 297 326 327 331 333 337
## 0.96 0.99 0.94 0.47 0.57 1.00 1.00 0.99 0.66 0.64 0.99 1.00 0.70 0.99 0.93 0.99
## 340 342 347 353 360 369 377 379 381 382 386 388 389 390 395 398
## 0.99 1.00 0.35 0.89 0.93 0.36 0.21 0.48 0.99 0.99 1.00 0.95 0.46 0.99 1.00 1.00
The predicted probabilities of Label.of.Admit for the test data (admission.test) are saved in the new variable pred.Admit.
The test observations classified from pred.Admit are saved in the new variable pred.Label.
K-Nearest Neighbor
Finding optimum k
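A common starting point is k ≈ √(number of training rows); a sketch:

```r
# Square root of the training-set size as a candidate k
round(sqrt(nrow(admission.train)))
```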
## [1] 18
The square-root rule suggests k = 18; we use the nearest odd number, k = 17, as the optimum k so that majority votes cannot tie, with target classes 1 = 'Admitted' and 0 = 'Not Admitted'.
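A minimal KNN sketch using class::knn; scaling the numeric predictors and the exact predictor set are assumptions:

```r
library(class)
# Scale numeric predictors with the training set's center and spread
num_cols <- c("GRE.Score", "TOEFL.Score", "SOP", "LOR", "CGPA")
train_x <- scale(admission.train[, num_cols])
test_x  <- scale(admission.test[, num_cols],
                 center = attr(train_x, "scaled:center"),
                 scale  = attr(train_x, "scaled:scale"))
# Predict test labels from the 17 nearest neighbours
knn_pred <- knn(train = train_x, test = test_x,
                cl = admission.train$Label.of.Admit, k = 17)
```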
Evaluating Model
Logistic Regression
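A sketch of the evaluation calls that produce the output below:

```r
# Cross-tabulate predictions against actual labels
table(predicted = pred.Label, actual = admission.test$Label.of.Admit)
# Full metrics from caret, with "1" as the positive class
caret::confusionMatrix(pred.Label, admission.test$Label.of.Admit, positive = "1")
```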
## actual
## predicted 0 1
## 0 4 4
## 1 5 67
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4 4
## 1 5 67
##
## Accuracy : 0.8875
## 95% CI : (0.7972, 0.9472)
## No Information Rate : 0.8875
## P-Value [Acc > NIR] : 0.5876
##
## Kappa : 0.4079
##
## Mcnemar's Test P-Value : 1.0000
##
## Sensitivity : 0.9437
## Specificity : 0.4444
## Pos Pred Value : 0.9306
## Neg Pred Value : 0.5000
## Prevalence : 0.8875
## Detection Rate : 0.8375
## Detection Prevalence : 0.9000
## Balanced Accuracy : 0.6941
##
## 'Positive' Class : 1
##
K-Nearest Neighbor
## 'data.frame': 80 obs. of 1 variable:
## $ knn_pred: Factor w/ 2 levels "0","1": 2 2 2 2 2 2 2 2 2 2 ...
## 'data.frame': 80 obs. of 1 variable:
## $ test_label: Factor w/ 2 levels "0","1": 1 2 2 2 2 2 2 2 2 1 ...
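A sketch of the KNN evaluation, assuming knn_pred from the sketch above:

```r
# True labels of the test set
test_label <- admission.test$Label.of.Admit
# Compare KNN predictions with the truth, "1" as the positive class
caret::confusionMatrix(knn_pred, test_label, positive = "1")
```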
## Confusion Matrix and Statistics
##
## Reference
## Prediction 0 1
## 0 4 2
## 1 5 69
##
## Accuracy : 0.9125
## 95% CI : (0.828, 0.9641)
## No Information Rate : 0.8875
## P-Value [Acc > NIR] : 0.3098
##
## Kappa : 0.4872
##
## Mcnemar's Test P-Value : 0.4497
##
## Sensitivity : 0.9718
## Specificity : 0.4444
## Pos Pred Value : 0.9324
## Neg Pred Value : 0.6667
## Prevalence : 0.8875
## Detection Rate : 0.8625
## Detection Prevalence : 0.9250
## Balanced Accuracy : 0.7081
##
## 'Positive' Class : 1
##
Summary
The confusion-matrix evaluation of the logistic regression model on the test data is as follows:
1. Accuracy is 0.8875, meaning 88.75% of the test observations are correctly classified.
2. Sensitivity / Recall is 0.9437, meaning false negatives are few: 94.37% of the actual positives are correctly classified.
3. Pos Pred Value / Precision is 0.9306, meaning false positives are few: 93.06% of the positive predictions are correct.
The confusion-matrix evaluation of the KNN model on the test data is as follows:
1. Accuracy is 0.9125, meaning 91.25% of the test observations are correctly classified.
2. Sensitivity / Recall is 0.9718, meaning false negatives are few: 97.18% of the actual positives are correctly classified.
3. Pos Pred Value / Precision is 0.9324, meaning false positives are few: 93.24% of the positive predictions are correct.
Comparing the confusion matrices above, K-Nearest Neighbor predicts slightly better than Logistic Regression on accuracy, recall, and precision. Both models, however, share a low specificity (0.4444), reflecting the class imbalance in the target.