Instructions

Please find kNN.R

This R script requires a dataset, a labelcol, and K (the number of nearest neighbors to consider).

The dataset MUST be numeric, except for the labelcol. The labelcol must be the last column in the data.frame; all other columns must come before it.

To Do:

Please find icu.csv. The formula to fit is “STA ~ TYP + COMA + AGE + INF”.

Read icu.csv and subset it to the five features in the formula, with STA as the labelcol.

Split the icu data 70/30 into train/test sets and run kNN.R for K = (3, 5, 7, 15, 25, 50).

Submit the resulting confusion matrix and accuracy for each K.

Plot Accuracy vs K.

write a short summary of your findings.

Grade (40 points): changing the code (10), running for different values of K (10), plotting Accuracy vs. K (10), summary (10).

Work

The exercise is to take the KNN script provided below, written to classify the iris dataset, and repurpose it for the data in icu.csv. Let's load that now.

icu_df <- read.csv("icu.csv")

Some preliminary EDA before we go about altering the KNN script developed for the iris data:

str(icu_df)
## 'data.frame':    200 obs. of  21 variables:
##  $ ID  : int  8 12 14 28 32 38 40 41 42 50 ...
##  $ STA : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ AGE : int  27 59 77 54 87 69 63 30 35 70 ...
##  $ SEX : int  1 0 0 0 1 0 0 1 0 1 ...
##  $ RACE: int  1 1 1 1 1 1 1 1 2 1 ...
##  $ SER : int  0 0 1 0 1 0 1 0 0 1 ...
##  $ CAN : int  0 0 0 0 0 0 0 0 0 1 ...
##  $ CRN : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ INF : int  1 0 0 1 1 1 0 0 0 0 ...
##  $ CPR : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ SYS : int  142 112 100 142 110 110 104 144 108 138 ...
##  $ HRA : int  88 80 70 103 154 132 66 110 60 103 ...
##  $ PRE : int  0 1 0 0 1 0 0 0 0 0 ...
##  $ TYP : int  1 1 0 1 1 1 0 1 1 0 ...
##  $ FRA : int  0 0 0 1 0 0 0 0 0 0 ...
##  $ PO2 : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ PH  : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ PCO : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ BIC : int  0 0 0 0 0 1 0 0 0 0 ...
##  $ CRE : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ LOC : int  0 0 0 0 0 0 0 0 0 0 ...

Based on the blackboard instructions, we need to add a derived column:

Add a variable COMA to the dataset; it is derived from the LOC variable as follows:

If LOC is 2, set COMA to 1; otherwise set COMA to 0.

icu_df$COMA <- ifelse(icu_df$LOC ==2, 1, 0)

Verify COMA was added as expected.

head(icu_df)
##   ID STA AGE SEX RACE SER CAN CRN INF CPR SYS HRA PRE TYP FRA PO2 PH PCO
## 1  8   0  27   1    1   0   0   0   1   0 142  88   0   1   0   0  0   0
## 2 12   0  59   0    1   0   0   0   0   0 112  80   1   1   0   0  0   0
## 3 14   0  77   0    1   1   0   0   0   0 100  70   0   0   0   0  0   0
## 4 28   0  54   0    1   0   0   0   1   0 142 103   0   1   1   0  0   0
## 5 32   0  87   1    1   1   0   0   1   0 110 154   1   1   0   0  0   0
## 6 38   0  69   0    1   0   0   0   1   0 110 132   0   1   0   1  0   0
##   BIC CRE LOC COMA
## 1   0   0   0    0
## 2   0   0   0    0
## 3   0   0   0    0
## 4   0   0   0    0
## 5   0   0   0    0
## 6   1   0   0    0
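As an extra sanity check (not part of the original output), a quick cross-tabulation of LOC against COMA would confirm that COMA is 1 exactly when LOC is 2:

# Cross-tabulate LOC against the derived COMA flag; every row with LOC == 2
# should fall in the COMA == 1 column, and all other rows in COMA == 0.
table(LOC = icu_df$LOC, COMA = icu_df$COMA)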

Per the instructions, we subset this data frame to include only the five variables of interest, ensuring that STA is the rightmost column.

#"STA ~ TYP + COMA + AGE + INF"
icu <- icu_df[c("TYP", "COMA", "AGE", "INF", "STA")]
head(icu)
##   TYP COMA AGE INF STA
## 1   1    0  27   1   0
## 2   1    0  59   0   0
## 3   0    0  77   0   0
## 4   1    0  54   1   0
## 5   1    0  87   1   0
## 6   1    0  69   1   0
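Before running the script, it is worth confirming the two requirements stated in the instructions: every predictor column is numeric and the label column comes last. A minimal sketch of that check (not part of the provided script):

# STA must be the last column, and all other columns must be numeric.
stopifnot(names(icu)[ncol(icu)] == "STA",
          all(sapply(icu[-ncol(icu)], is.numeric)))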

Provided KNN.R script

# Euclidean distance between two rows (all columns assumed numeric)
euclideanDist <- function(a, b){
  d = 0
  for(i in c(1:(length(a)) ))
  {
    d = d + (a[[i]]-b[[i]])^2
  }
  d = sqrt(d)
  return(d)
}
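The loop above computes the usual closed form. A vectorized equivalent (a sketch with a hypothetical name, not part of the provided script) avoids the explicit loop:

# Same result as euclideanDist(); unlist() flattens one-row data frames
# into plain numeric vectors before the subtraction.
euclideanDistVec <- function(a, b){
  sqrt(sum((unlist(a) - unlist(b))^2))
}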




knn_predict2 <- function(test_data, train_data, k_value, labelcol){
  pred <- c()  #empty pred vector 
  #LOOP-1
  for(i in c(1:nrow(test_data))){   #looping over each record of test data
    eu_dist =c()          #eu_dist & eu_char empty  vector
    eu_char = c()
    good = 0              #good & bad variable initialization with 0 value
    bad = 0
    
    #LOOP-2-looping over train data 
    for(j in c(1:nrow(train_data))){
 
      #adding euclidean distance b/w test data point and train data to eu_dist vector
      eu_dist <- c(eu_dist, euclideanDist(test_data[i,-c(labelcol)], train_data[j,-c(labelcol)]))
 
      #adding class variable of training data in eu_char
      eu_char <- c(eu_char, as.character(train_data[j,][[labelcol]]))
    }
    
    eu <- data.frame(eu_char, eu_dist) #eu dataframe created with eu_char & eu_dist columns
 
    eu <- eu[order(eu$eu_dist),]       #sorting eu dataframe to get top K neighbors
    eu <- eu[1:k_value,]               #eu dataframe with top K neighbors
 
    tbl.sm.df<-table(eu$eu_char)
    cl_label<-  names(tbl.sm.df)[[as.integer(which.max(tbl.sm.df))]]
    
    pred <- c(pred, cl_label)
  }
  return(pred) #return pred vector
}
  

accuracy <- function(test_data,labelcol,predcol){
  correct = 0
  for(i in c(1:nrow(test_data))){
    if(test_data[i,labelcol] == test_data[i,predcol]){ 
      correct = correct+1
    }
  }
  accu = (correct/nrow(test_data)) * 100  
  return(accu)
}

#load data
knn.df<-iris
labelcol <- 5 # for iris it is the fifth col 
predictioncol<-labelcol+1
# create train/test partitions
set.seed(2)
n<-nrow(knn.df)
knn.df<- knn.df[sample(n),]

train.df <- knn.df[1:as.integer(0.7*n),]

K = 3 # number of neighbors to determine the class
table(train.df[,labelcol])
## 
##     setosa versicolor  virginica 
##         36         37         32
test.df <- knn.df[as.integer(0.7*n +1):n,]
table(test.df[,labelcol])
## 
##     setosa versicolor  virginica 
##         14         13         18
predictions <- knn_predict2(test.df, train.df, K,labelcol) #calling knn_predict()

test.df[,predictioncol] <- predictions #Adding predictions in test data as 6th column
print(accuracy(test.df,labelcol,predictioncol))
## [1] 95.55556
table(test.df[[predictioncol]],test.df[[labelcol]])
##             
##              setosa versicolor virginica
##   setosa         14          0         0
##   versicolor      0         12         1
##   virginica       0          1        17
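Adapting the same driver code to the ICU subset is mostly a matter of swapping in the icu data frame and looping over the requested values of K. A minimal sketch of that loop (assuming the icu data frame and the functions above are already in the session; exact numbers depend on the seed and the split):

#repeat the pipeline on the icu subset for several values of K
set.seed(2)
labelcol <- 5                      # STA is the 5th (last) column
predictioncol <- labelcol + 1
n <- nrow(icu)
icu <- icu[sample(n),]             # shuffle before the 70/30 split

train.df <- icu[1:as.integer(0.7*n),]
test.df  <- icu[as.integer(0.7*n + 1):n,]

k_values <- c(3, 5, 7, 15, 25, 50)
acc <- numeric(length(k_values))

for(i in seq_along(k_values)){
  predictions <- knn_predict2(test.df, train.df, k_values[i], labelcol)
  test.df[,predictioncol] <- predictions
  acc[i] <- accuracy(test.df, labelcol, predictioncol)
  cat("K =", k_values[i], "accuracy =", acc[i], "\n")
  print(table(test.df[[predictioncol]], test.df[[labelcol]])) # confusion matrix
}

plot(k_values, acc, type = "b", xlab = "K", ylab = "Accuracy (%)",
     main = "Accuracy vs K (icu)")

The accuracies collected in acc feed directly into the Accuracy-vs-K plot required by the assignment, and the printed tables give the confusion matrix for each K.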

This is interesting. K = 5 generates an accuracy of .85, which actually outperforms K = 7. K = 15 yields an accuracy of .883, and accuracy does not improve as K is increased to 25 or 50. Based on the data provided, we may want to use K = 15 for classification, or perhaps check whether K = 10 maintains the same accuracy. That said, selecting K = 15 solely because it maximized accuracy on this single 70/30 split risks overfitting to the test set; K = 5 may ultimately offer the better mix of bias and variance.

The fewer neighbors the algorithm uses, the less computationally expensive it is at prediction time, which is another argument in K = 5's favor.

Overall, KNN is a simple, lazy (not necessarily a bad thing) classifier. We don't need to train a model, we make no parametric assumptions about the underlying data, and, as we saw, it is capable of accurate predictions.

KNN does have drawbacks, however. It requires significant CPU and memory at prediction time, since every test point must be compared against the entire stored training set. With Euclidean distance, the algorithm is also sensitive to outliers and to differences in feature scale.