load("knn data.RData")
library(class)
## Warning: package 'class' was built under R version 3.5.2

Today I’m going to build a basic predictive model using the kNN method, or k-nearest neighbors algorithm. Think of it as an extension of clustering, used to help a statistical model perform better than your usual linear regression. Wiki article here for the details: https://en.wikipedia.org/wiki/K-nearest_neighbors_algorithm

I’ll be using a dataset on graduate admissions chances in India that I found on Kaggle at this link: https://www.kaggle.com/mohansacharya/graduate-admissions Using information such as GRE scores, CGPA, and TOEFL scores, we’ll try to see if we can produce an accurate model that can show whether your chances are “Bad”, “Good”, or “Best” based on this information.

Let’s begin.

First, let’s load the data set and take a peek at the variables.

Admission_Predict_Ver1.1 <- read.csv("C:/Users/Alexander Chang/Downloads/graduate-admissions/Admission_Predict_Ver1.1.csv")
View(Admission_Predict_Ver1.1)
str(Admission_Predict_Ver1.1)
## 'data.frame':    500 obs. of  9 variables:
##  $ Serial.No.       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ GRE.Score        : int  337 324 316 322 314 330 321 308 302 323 ...
##  $ TOEFL.Score      : int  118 107 104 110 103 115 109 101 102 108 ...
##  $ University.Rating: int  4 4 3 3 2 5 3 2 1 3 ...
##  $ SOP              : num  4.5 4 3 3.5 2 4.5 3 3 2 3.5 ...
##  $ LOR              : num  4.5 4.5 3.5 2.5 3 3 4 4 1.5 3 ...
##  $ CGPA             : num  9.65 8.87 8 8.67 8.21 9.34 8.2 7.9 8 8.6 ...
##  $ Research         : int  1 1 1 1 0 1 1 0 0 0 ...
##  $ Chance.of.Admit  : num  0.92 0.76 0.72 0.8 0.65 0.9 0.75 0.68 0.5 0.45 ...

Let’s clean this up a bit by removing the variable “Serial.No.” as it is unimportant for our model.

Admin <- Admission_Predict_Ver1.1[-1]

Typically, it’s best to normalize all of your data before proceeding. In layperson’s terms, that means rescaling something like the GRE score onto a scale between 0 and 1. I’m following an excellent guide found here: https://www.analyticsvidhya.com/blog/2015/08/learning-concept-knn-algorithms-programming/

# Create a min-max normalizing function
normalize <- function(x) {
  return((x - min(x)) / (max(x) - min(x)))
}
# Normalize the seven predictor columns
admin_n <- as.data.frame(lapply(Admin[1:7], normalize))
# Check the normalization using GRE scores
summary(admin_n$GRE.Score)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  0.3600  0.5400  0.5294  0.7000  1.0000
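
As an extra sanity check (my own addition, not part of the original guide), one line confirms that every normalized column now spans 0 to 1:

# min and max of each normalized column
sapply(admin_n, range)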

Let’s separate the training and testing data sets for our inputs. I’m going by a 2:3 ratio of train to test data.

admin_train <- admin_n[1:200,]
admin_test <- admin_n[201:500,]
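
One thing worth flagging: slicing rows 1:200 and 201:500 assumes the rows are in no meaningful order. If you’d rather not rely on that, a randomized split is a safer default. A minimal sketch, assuming you also rebuild the outcome vectors further down with the same index:

# Hypothetical randomized 200/300 split; the seed and index name are my own
set.seed(42)
train_idx <- sample(nrow(admin_n), 200)
admin_train <- admin_n[train_idx, ]
admin_test  <- admin_n[-train_idx, ]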

Before doing the same with the output, I want to convert “Chance of Admit” from a 0-to-1 number into percentile-based thirds: the bottom third (0–33rd percentile) is “Bad”, the middle third (33rd–67th) is “Good”, and the top third (67th–100th) is “Best”. This is so that the model output can be categorical, and thereby representable in a cross-tabulation, which is much more interpretable.

Percentile_00  = min(Admin$Chance.of.Admit)
Percentile_33  = quantile(Admin$Chance.of.Admit, 0.33333)
Percentile_67  = quantile(Admin$Chance.of.Admit, 0.66667)
Percentile_100 = max(Admin$Chance.of.Admit)


RB = rbind(Percentile_00, Percentile_33, Percentile_67, Percentile_100)

dimnames(RB)[[2]] = "Value"

RB
##                    Value
## Percentile_00  0.3400000
## Percentile_33  0.6633167
## Percentile_67  0.7900000
## Percentile_100 0.9700000
Admin$Group[Admin$Chance.of.Admit >= Percentile_00 & Admin$Chance.of.Admit <  Percentile_33]  = "Bad"
Admin$Group[Admin$Chance.of.Admit >= Percentile_33 & Admin$Chance.of.Admit <  Percentile_67]  = "Good"
Admin$Group[Admin$Chance.of.Admit >= Percentile_67 & Admin$Chance.of.Admit <= Percentile_100] = "Best"
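
As an aside, the same bucketing can be done in one call with base R’s cut(). A sketch matching the >=/< logic above (right = FALSE makes the intervals closed on the left, and include.lowest closes the top one); note it returns a factor rather than a character vector, which knn() handles fine:

# Equivalent tercile bucketing with cut()
Admin$Group <- cut(Admin$Chance.of.Admit,
                   breaks = c(Percentile_00, Percentile_33, Percentile_67, Percentile_100),
                   labels = c("Bad", "Good", "Best"),
                   right = FALSE, include.lowest = TRUE)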

Now to split the training and test output data.

admin_out_train <- Admin[1:200,9]
admin_out_test <- Admin[201:500,9]

To use the kNN function, I need to first install and load the requisite package; the most popular one I found is “class”, which I loaded at the top of this post.

Let’s build the model.

admin_test_pred <- knn(train = admin_train, test = admin_test, cl = admin_out_train, k = 22)
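
A note on k = 22: it roughly follows the common sqrt(n) rule of thumb, since sqrt(500) ≈ 22.4. If you want to check how sensitive the model is to that choice, a quick sweep like this works (the candidate k values are arbitrary picks of mine):

# Compare test accuracy across a handful of k values
set.seed(1)  # knn() breaks ties at random, so fix the seed for reproducibility
for (k in c(5, 11, 15, 22, 31)) {
  pred <- knn(train = admin_train, test = admin_test, cl = admin_out_train, k = k)
  cat("k =", k, "accuracy =", round(mean(pred == admin_out_test), 3), "\n")
}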

Now to validate it and see its performance.

library(gmodels)
## Warning: package 'gmodels' was built under R version 3.5.2
CrossTable(x = admin_out_test, y = admin_test_pred,chisq= FALSE)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  300 
## 
##  
##                | admin_test_pred 
## admin_out_test |       Bad |      Best |      Good | Row Total | 
## ---------------|-----------|-----------|-----------|-----------|
##            Bad |        84 |         0 |        10 |        94 | 
##                |    38.422 |    31.020 |     4.806 |           | 
##                |     0.894 |     0.000 |     0.106 |     0.313 | 
##                |     0.609 |     0.000 |     0.159 |           | 
##                |     0.280 |     0.000 |     0.033 |           | 
## ---------------|-----------|-----------|-----------|-----------|
##           Best |         3 |        80 |        15 |        98 | 
##                |    39.280 |    70.237 |     1.513 |           | 
##                |     0.031 |     0.816 |     0.153 |     0.327 | 
##                |     0.022 |     0.808 |     0.238 |           | 
##                |     0.010 |     0.267 |     0.050 |           | 
## ---------------|-----------|-----------|-----------|-----------|
##           Good |        51 |        19 |        38 |       108 | 
##                |     0.035 |     7.769 |    10.348 |           | 
##                |     0.472 |     0.176 |     0.352 |     0.360 | 
##                |     0.370 |     0.192 |     0.603 |           | 
##                |     0.170 |     0.063 |     0.127 |           | 
## ---------------|-----------|-----------|-----------|-----------|
##   Column Total |       138 |        99 |        63 |       300 | 
##                |     0.460 |     0.330 |     0.210 |           | 
## ---------------|-----------|-----------|-----------|-----------|
## 
## 

Interpreting this table is quite simple: the higher the value in the cells with identical column and row labels, the better the performance. For example, our model excelled at predicting “Bad” admission chances, correctly classifying 84 of the 94 “Bad” cases and mislabeling only 10 of them as “Good”. The “Best” category did nearly as well, with 80 correct out of 98, the rest mistaken for “Good” (15) or “Bad” (3).

Where the model struggled is the “Good” category, which is understandable given that this is the grey area of the data where most of the statistical noise lives. On the bright side, the model erred on the conservative side, more often calling your chances “Bad” (51 cases) than “Best” (19) when they were in fact “Good”.
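
If you want a single summary number, overall accuracy falls straight out of the diagonal of the table: (84 + 80 + 38) / 300 ≈ 67%. The same figure comes directly from the predictions:

# Fraction of test cases where the predicted group matches the actual group
mean(admin_test_pred == admin_out_test)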

I went ahead and put together a small visual that shows where admission chances fall between the two variables “GRE Score” and “Cumulative GPA”.

library(ggvis)
## Warning: package 'ggvis' was built under R version 3.5.2
Admin %>% ggvis(~GRE.Score, ~CGPA, fill = ~Group) %>% layer_points()
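
ggvis renders interactively through the viewer; if you prefer a static plot, an equivalent ggplot2 sketch would be:

library(ggplot2)
ggplot(Admin, aes(x = GRE.Score, y = CGPA, colour = Group)) +
  geom_point()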

For the most part they fall right where one would expect them to. Notice, however, that toward the middle of each variable’s range you get a healthy mix of “Good” and “Bad”.

Well, that’s it for today. I’ll try something a little more challenging next time.