Homework 2

Consider a dataset as shown below:

#importing required libraries
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
#importing the dataset from GitHub
data <- read.csv("https://raw.githubusercontent.com/maharjansudhan/DATA622/master/data.csv", header=TRUE, sep=",")
head(data)
##   X Y label
## 1 5 a  BLUE
## 2 5 b BLACK
## 3 5 c  BLUE
## 4 5 d BLACK
## 5 5 e BLACK
## 6 5 f BLACK
dim(data)
## [1] 36  3
#to check the structure of the dataset and its properties
glimpse(data)
## Rows: 36
## Columns: 3
## $ X     <int> 5, 5, 5, 5, 5, 5, 19, 19, 19, 19, 19, 19, 35, 35, 35, 35, 35, 3…
## $ Y     <fct> a, b, c, d, e, f, a, b, c, d, e, f, a, b, c, d, e, f, a, b, c, …
## $ label <fct> BLUE, BLACK, BLUE, BLACK, BLACK, BLACK, BLUE, BLUE, BLUE, BLUE,…

There are 36 observations and 3 variables in this dataset. All of the variables are factors except X, so we need to convert X into a factor as well.

#convert all columns (including X) to factors
data[] <- lapply(data, as.factor)
str(data)
## 'data.frame':    36 obs. of  3 variables:
##  $ X    : Factor w/ 6 levels "5","19","35",..: 1 1 1 1 1 1 2 2 2 2 ...
##  $ Y    : Factor w/ 6 levels "a","b","c","d",..: 1 2 3 4 5 6 1 2 3 4 ...
##  $ label: Factor w/ 2 levels "BLACK","BLUE": 2 1 2 1 1 1 2 2 2 2 ...
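
Before modeling, it is worth a quick look at the class balance of the target. A minimal check (the counts match the positive/negative tallies MLeval reports later):

#quick sanity check: class balance of the target
table(data$label)
#expect 22 BLACK and 14 BLUE, matching MLeval's counts below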

Now that all the variables have been converted to factors, let’s build our models.

Run kNN, Tree, NB, LDA, LR, and SVM with an RBF kernel (60%), and

determine the AUC, accuracy, TPR, and FPR for each algorithm, creating a table as shown below:

ALGO   AUC   ACC   TPR   FPR
LR
LDA
NB
SVM
kNN
TREE

Summarize and provide an explanatory commentary on the observed performance of these classifiers:

Model Building

Since we are building several different models, MLeval is a very handy tool to work with. I came across

https://www.r-bloggers.com/how-to-easily-make-a-roc-curve-in-r/

while searching for how to evaluate different machine learning models. This tool can produce ROC curves, precision-recall (PR) curves, PR-gain curves, and so on.

MLeval can be used directly on a data frame of predicted probabilities, or on the output of caret’s ‘train’ function, which performs cross-validation to avoid overfitting. Several models can be compared at the same time with MLeval.
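
As a minimal sketch of the data-frame input path (not used in this homework): the column layout is assumed from the MLeval vignette, with one probability column per class, named after the class, plus an ‘obs’ column of true labels.

#hypothetical illustration: evaluating a data frame of predicted probabilities
library(MLeval)
probs <- data.frame(BLACK = c(0.8, 0.3, 0.6, 0.1),
                    BLUE  = c(0.2, 0.7, 0.4, 0.9),
                    obs   = factor(c("BLACK", "BLUE", "BLACK", "BLUE")))
res_df <- evalm(probs, plots='r')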

#import required libraries
library(caret)
## Loading required package: lattice
## Loading required package: ggplot2
set.seed(123)

#set up 10-fold cross-validation; classProbs and savePredictions must be
#enabled so that MLeval can use the held-out class probabilities later
ctrl <- trainControl(method="cv", classProbs=TRUE, savePredictions=TRUE)

# train kNN model
fit1 <- train(label ~ ., data=data, method="knn", trControl=ctrl)
fit1
## k-Nearest Neighbors 
## 
## 36 samples
##  2 predictor
##  2 classes: 'BLACK', 'BLUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 32, 32, 32, 32, 32, 33, ... 
## Resampling results across tuning parameters:
## 
##   k  Accuracy   Kappa
##   5  0.7416667  0.49 
##   7  0.6833333  0.29 
##   9  0.7083333  0.34 
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
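
By default caret tried only k = 5, 7, and 9 here. If a wider search were wanted, a tuneGrid can be passed explicitly; a minimal sketch (fit1_wide and the k range are my choices, not part of the assignment):

#optional sketch: search a wider range of k values explicitly
fit1_wide <- train(label ~ ., data=data, method="knn",
                   tuneGrid=data.frame(k=seq(3, 15, by=2)), trControl=ctrl)
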
# training Random Forest model (used here in place of a single decision tree)
fit2 <- train(label ~ ., data=data, method="rf", trControl=ctrl)
fit2
## Random Forest 
## 
## 36 samples
##  2 predictor
##  2 classes: 'BLACK', 'BLUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 33, 31, 32, 33, 33, 33, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa
##    2    0.7166667  0.32 
##    6    0.7833333  0.52 
##   10    0.7333333  0.42 
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 6.
# training Linear Discriminant Analysis model
fit3 <- train(label ~ ., data=data, method="lda", trControl=ctrl)
fit3
## Linear Discriminant Analysis 
## 
## 36 samples
##  2 predictor
##  2 classes: 'BLACK', 'BLUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 32, 33, 32, 33, 32, 33, ... 
## Resampling results:
## 
##   Accuracy   Kappa
##   0.7666667  0.51
# training Naive Bayes model
# for this model we have to set up a tuning grid; fL = 0 produced NaN values, so 0 is omitted from the grid
grid <- data.frame(fL=c(0.5,1.0), usekernel = TRUE, adjust=c(0.5,1.0))
fit4 <- train(label ~ ., data=data, tuneGrid=grid, method="nb", trControl=ctrl)
fit4
## Naive Bayes 
## 
## 36 samples
##  2 predictor
##  2 classes: 'BLACK', 'BLUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 32, 33, 33, 32, 32, 33, ... 
## Resampling results across tuning parameters:
## 
##   fL   adjust  Accuracy   Kappa
##   0.5  0.5     0.6433333  0.1  
##   1.0  1.0     0.6433333  0.1  
## 
## Tuning parameter 'usekernel' was held constant at a value of TRUE
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0.5, usekernel = TRUE and
##  adjust = 0.5.
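
Note that data.frame() pairs the values row-wise, so the grid above tries only two (fL, adjust) combinations, which is why the output lists two rows. If the full cross of settings were wanted, expand.grid() would give all four; a sketch (grid_full is a hypothetical name):

#alternative sketch: the full fL x adjust cross (four combinations)
grid_full <- expand.grid(fL=c(0.5,1.0), usekernel=TRUE, adjust=c(0.5,1.0))
#fit4_full <- train(label ~ ., data=data, tuneGrid=grid_full, method="nb", trControl=ctrl)
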
# training Generalized Linear Model (logistic regression, since the outcome is binary)
fit5 <- train(label ~ ., data=data, method="glm", trControl=ctrl)
fit5
## Generalized Linear Model 
## 
## 36 samples
##  2 predictor
##  2 classes: 'BLACK', 'BLUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 33, 32, 32, 32, 32, 33, ... 
## Resampling results:
## 
##   Accuracy  Kappa
##   0.8       0.55
# training SVM model with RBF kernel
fit6 <- train(label ~ ., data=data, method="svmRadial", trControl=ctrl)
fit6
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 36 samples
##  2 predictor
##  2 classes: 'BLACK', 'BLUE' 
## 
## No pre-processing
## Resampling: Cross-Validated (10 fold) 
## Summary of sample sizes: 33, 32, 33, 32, 32, 32, ... 
## Resampling results across tuning parameters:
## 
##   C     Accuracy   Kappa
##   0.25  0.8333333  0.64 
##   0.50  0.8583333  0.69 
##   1.00  0.7500000  0.44 
## 
## Tuning parameter 'sigma' was held constant at a value of 0.05987395
## Accuracy was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 0.05987395 and C = 0.5.

Now, let’s run MLeval on the fitted models:

#import required library
library(MLeval)
## 
## Attaching package: 'MLeval'
## The following objects are masked _by_ '.GlobalEnv':
## 
##     fit1, fit2, fit3
#evaluate all six models and plot their ROC curves
res <-evalm(list(fit1,fit2,fit3,fit4,fit5,fit6),
            gnames=c('knn','rf','lda','nb','glm','svmRadial'), 
            plots='r')
## ***MLeval: Machine Learning Model Evaluation***
## Input: caret train function object
## Not averaging probs.
## Group 1 type: cv
## Group 2 type: cv
## Group 3 type: cv
## Group 4 type: cv
## Group 5 type: cv
## Group 6 type: cv
## Observations: 216
## Number of groups: 6
## Observations per group: 36
## Positive: BLUE
## Negative: BLACK
## Group: knn
## Positive: 14
## Negative: 22
## Group: rf
## Positive: 14
## Negative: 22
## Group: lda
## Positive: 14
## Negative: 22
## Group: nb
## Positive: 14
## Negative: 22
## Group: glm
## Positive: 14
## Negative: 22
## Group: svmRadial
## Positive: 14
## Negative: 22
## ***Performance Metrics***
## knn Optimal Informedness = 0.655844155844156
## rf Optimal Informedness = 0.701298701298701
## lda Optimal Informedness = 0.701298701298701
## nb Optimal Informedness = 0.487012987012987
## glm Optimal Informedness = 0.558441558441559
## svmRadial Optimal Informedness = 0.792207792207792
## knn AUC-ROC = 0.8
## rf AUC-ROC = 0.79
## lda AUC-ROC = 0.82
## nb AUC-ROC = 0.7
## glm AUC-ROC = 0.72
## svmRadial AUC-ROC = 0.88
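
Each model’s full metric table can also be inspected directly before assembling the summary; for example (output omitted):

#inspect MLeval's standard metrics for one model
res$stdres$knn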

Let’s prepare the final table. For each model we pull AUC, FPR, TP, and FN from res$stdres (the values sit in its Score column) and average the cross-validated accuracy across tuning settings from the corresponding caret fit:

library(kableExtra)
## 
## Attaching package: 'kableExtra'
## The following object is masked from 'package:dplyr':
## 
##     group_rows
#let's collect all the required data for the final output

# kNN (TPR is computed as TP / (TP + FN), i.e. sensitivity; the same pattern repeats below)
algo_knn <- cbind(AUC = res$stdres$knn['AUC-ROC','Score'], 
                  ACC = mean(fit1$results[ ,'Accuracy']), 
                  FPR = res$stdres$knn['FPR','Score'], 
                  TPR = res$stdres$knn['TP','Score']/(res$stdres$knn['TP','Score']+res$stdres$knn['FN','Score']))

# RF
algo_rf <- cbind(AUC = res$stdres$rf['AUC-ROC','Score'], 
                 ACC = mean(fit2$results[ ,'Accuracy']),
                 FPR = res$stdres$rf['FPR','Score'],
                 TPR = res$stdres$rf['TP','Score']/(res$stdres$rf['TP','Score']+res$stdres$rf['FN','Score']))

# LDA
algo_lda <- cbind(AUC = res$stdres$lda['AUC-ROC','Score'], 
                  ACC = mean(fit3$results[ ,'Accuracy']), 
                  FPR = res$stdres$lda['FPR','Score'], 
                  TPR = res$stdres$lda['TP','Score']/(res$stdres$lda['TP','Score']+res$stdres$lda['FN','Score']))

# NB
algo_nb <- cbind(AUC = res$stdres$nb['AUC-ROC','Score'], 
                 ACC = mean(fit4$results[ ,'Accuracy']),
                 FPR = res$stdres$nb['FPR','Score'],
                 TPR = res$stdres$nb['TP','Score']/(res$stdres$nb['TP','Score']+res$stdres$nb['FN','Score']))

# GLM
algo_glm <- cbind(AUC = res$stdres$glm['AUC-ROC','Score'], 
                  ACC = mean(fit5$results[ ,'Accuracy']), 
                  FPR = res$stdres$glm['FPR','Score'], 
                  TPR = res$stdres$glm['TP','Score']/(res$stdres$glm['TP','Score']+res$stdres$glm['FN','Score']))

# SVM (the MLeval group name is 'svmRadial', so use the full name rather than relying on partial matching)
algo_svm <- cbind(AUC = res$stdres$svmRadial['AUC-ROC','Score'], 
                  ACC = mean(fit6$results[ ,'Accuracy']), 
                  FPR = res$stdres$svmRadial['FPR','Score'], 
                  TPR = res$stdres$svmRadial['TP','Score']/(res$stdres$svmRadial['TP','Score']+res$stdres$svmRadial['FN','Score']))
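
The six nearly identical cbind() calls above could also be collapsed into a small helper; a sketch (get_metrics is my name, not from the assignment), reading the same rows of res$stdres:

#sketch: the same metric extraction as a reusable helper
get_metrics <- function(std, fit) {
  cbind(AUC = std['AUC-ROC','Score'],
        ACC = mean(fit$results[ ,'Accuracy']),
        FPR = std['FPR','Score'],
        TPR = std['TP','Score']/(std['TP','Score']+std['FN','Score']))
}
#e.g. algo_knn <- get_metrics(res$stdres$knn, fit1)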



final_table <- rbind(algo_knn, algo_rf, algo_lda, algo_nb, algo_glm, algo_svm)

#name the rows
rownames(final_table) <- c('k-Nearest Neighbors','Random Forest','Linear Discriminant Analysis',
                           'Naive Bayes','Generalized Linear Model','Support Vector Machines')


#final table
final_table %>%
  kable() %>%
  kable_styling()
                               AUC   ACC        FPR    TPR
k-Nearest Neighbors            0.80  0.7111111  0.273  0.7857143
Random Forest                  0.79  0.7444444  0.136  0.6428571
Linear Discriminant Analysis   0.82  0.7666667  0.227  0.7857143
Naive Bayes                    0.70  0.6433333  0.045  0.1428571
Generalized Linear Model       0.72  0.8000000  0.227  0.7857143
Support Vector Machines        0.88  0.8138889  0.136  0.8571429
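
As a quick programmatic check of the conclusions below (a lower FPR is better, so that column uses which.min):

#best model per metric: maximize AUC, ACC, and TPR; minimize FPR
best <- c(AUC = which.max(final_table[,'AUC']),
          ACC = which.max(final_table[,'ACC']),
          TPR = which.max(final_table[,'TPR']),
          FPR = which.min(final_table[,'FPR']))
rownames(final_table)[best]
#SVM tops AUC, ACC, and TPR; Naive Bayes has the lowest FPR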

Conclusion

The SVM model scores highest across the board, with an AUC of 0.88, whereas Naive Bayes scores the lowest of all the algorithms, with an AUC of 0.70. The one exception is FPR, where Naive Bayes has the lowest (best) value, 0.045, but only because it rarely predicts the positive class, as its very low TPR of 0.14 shows.

LDA, kNN, and Random Forest also perform well, whereas GLM trails closer to Naive Bayes in AUC despite its high accuracy.

The AUC, ACC, FPR, and TPR scores for LDA, kNN, and Random Forest sit between the highs and the lows, making them middle-of-the-road performers. They are the safe choices: whether you are working with a large or a small dataset, these are solid algorithms to start with.

Accuracy is also reasonably high for all algorithms except Naive Bayes, which is the lowest on the list.

SVM models are a popular choice for small datasets such as this one, whereas Naive Bayes is typically used for very large datasets with multi-class predictions.