Run the Model Exercise

Consider the dataset shown below; run kNN, Tree, NB, LDA, LR, and SVM with an RBF kernel, and determine the AUC, accuracy, TPR, and FPR for each algorithm.

Data Exploration

library(caret)
#  reading data 
data <- read.csv("/Users/Olga/Desktop/data622/HW2/data", header = TRUE)
head(data)
##   x y label
## 1 5 a  BLUE
## 2 5 b BLACK
## 3 5 c  BLUE
## 4 5 d BLACK
## 5 5 e BLACK
## 6 5 f BLACK
dim(data)
## [1] 36  3
# checking data structure
str(data)
## 'data.frame':    36 obs. of  3 variables:
##  $ x    : int  5 5 5 5 5 5 19 19 19 19 ...
##  $ y    : Factor w/ 6 levels "a","b","c","d",..: 1 2 3 4 5 6 1 2 3 4 ...
##  $ label: Factor w/ 2 levels "BLACK","BLUE": 2 1 2 1 1 1 2 2 2 2 ...
summary(data)
##        x      y       label   
##  Min.   : 5   a:6   BLACK:22  
##  1st Qu.:19   b:6   BLUE :14  
##  Median :43   c:6             
##  Mean   :38   d:6             
##  3rd Qu.:55   e:6             
##  Max.   :63   f:6

The data set is very small: 36 observations with 3 variables. All variables look categorical; variable “x” should be converted to a factor.

The data set is moderately imbalanced: the “BLACK” to “BLUE” ratio is roughly 60:40 (22 vs. 14 observations).

Data Preparation

# changing type of all variables to factor
data[] <- lapply(data, as.factor)

Model Building

As the data set is very small, I will use the bootstrap resampling technique with 100 repetitions instead of a basic train/test split. However, the bootstrap does not magically grant us extra power: with a small sample, we still have little power.

The bootstrap works by sampling with replacement from the original data and using the “not chosen” (out-of-bag) data points as test cases. This is repeated many times (n = 100), and the average score is taken as an estimate of each model’s performance.
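As a minimal sketch of what a single bootstrap repetition looks like under the hood (caret handles this internally through trainControl; the code below is only an illustration):

# one bootstrap resample: sample row indices with replacement,
# and treat the rows never drawn ("out-of-bag") as the test cases
set.seed(1)
n <- nrow(data)
in_bag  <- sample(seq_len(n), size = n, replace = TRUE)
out_bag <- setdiff(seq_len(n), in_bag)

train_boot <- data[in_bag, ]   # the model is fit on this resample
test_boot  <- data[out_bag, ]  # performance is scored on these held-out rows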

Random Forest, kNN, GLM, Naive Bayes, LDA, and SVM with an RBF kernel will be trained, using basic tuning parameters for each model. The caret and MLeval packages will be used.

library(caret)

set.seed(300)

#  bootstrap resampling with 100 repetitions; class probabilities and predictions are saved for MLeval
ctrl <- trainControl(method="boot", number=100, classProbs=T,  savePredictions = T)

# training Random Forest
fit1 <- train(label ~ .,data=data, method="rf", trControl=ctrl)


# training kNN model 
fit2 <- train(label ~ .,data=data, method="knn", trControl=ctrl)


#  training GLM model
fit3 <- train(label ~ .,data=data, method="glm",family="binomial", trControl=ctrl)


#  training Naive Bayes model
grid <- data.frame(fL=c(0,0.5,1.0), usekernel = TRUE, adjust=c(0,0.5,1.0))
fit4 <- train(label ~ .,data=data, method="nb", trControl=ctrl, tuneGrid=grid)


#  training SVM model with RBF kernel
fit5 <- train(label ~ .,data=data, method="svmRadial", trControl=ctrl)


# training LDA model
fit6 <- train(label ~ .,data=data, method="lda", trControl=ctrl)

Performance Summary

# https://cran.r-project.org/web/packages/MLeval/vignettes/introduction.pdf

library(MLeval)

res <- evalm(list(fit1,fit2,fit3, fit4,fit5, fit6), gnames=c('rf','knn', 'glm', 'nb','svmRadial','lda'), rlinethick=0.5, fsize=10, plots='r')

m1<-cbind(AUC=res$stdres$rf['AUC-ROC','Score'], Accuracy = mean(fit1$results[,'Accuracy']), FPR=res$stdres$rf['FPR','Score'], TPR = res$stdres$rf['TP','Score']/(res$stdres$rf['TP','Score']+res$stdres$rf['FN','Score']))

m2<-cbind(AUC=res$stdres$knn['AUC-ROC','Score'],Accuracy = mean(fit2$results[,'Accuracy']), FPR=res$stdres$knn['FPR','Score'],TPR = res$stdres$knn['TP','Score']/(res$stdres$knn['TP','Score']+res$stdres$knn['FN','Score']))

m3<-cbind(AUC=res$stdres$glm['AUC-ROC','Score'], Accuracy = mean(fit3$results[,'Accuracy']), FPR=res$stdres$glm['FPR','Score'],TPR = res$stdres$glm['TP','Score']/(res$stdres$glm['TP','Score']+res$stdres$glm['FN','Score']))

m4<-cbind(AUC=res$stdres$nb['AUC-ROC','Score'], Accuracy = mean(fit4$results[2:3,'Accuracy']), FPR=res$stdres$nb['FPR','Score'],TPR = res$stdres$nb['TP','Score']/(res$stdres$nb['TP','Score']+res$stdres$nb['FN','Score']))

m5<-cbind(AUC=res$stdres$svmRadial['AUC-ROC','Score'], Accuracy = mean(fit5$results[,'Accuracy']), FPR=res$stdres$svmRadial['FPR','Score'],TPR = res$stdres$svmRadial['TP','Score']/(res$stdres$svmRadial['TP','Score']+res$stdres$svmRadial['FN','Score']))

m6<-cbind(AUC=res$stdres$lda['AUC-ROC','Score'], Accuracy = mean(fit6$results[,'Accuracy']), FPR=res$stdres$lda['FPR','Score'],TPR = res$stdres$lda['TP','Score']/(res$stdres$lda['TP','Score']+res$stdres$lda['FN','Score']))

summary <- rbind(m1, m2, m3, m4, m5, m6)
rownames(summary) <- c("RF", "kNN", "GLM", "NB", "svmRadial", "LDA")
summary
##            AUC  Accuracy   FPR       TPR
## RF        0.82 0.7300402 0.136 0.6428571
## kNN       0.84 0.6915328 0.136 0.7142857
## GLM       0.82 0.7328654 0.273 0.8571429
## NB        0.71 0.6289359 0.091 0.1428571
## svmRadial 0.87 0.7309892 0.136 0.9285714
## LDA       0.83 0.7286528 0.136 0.7142857

Conclusions

The SVM model with an RBF kernel provides the best overall performance among the algorithms: it has the highest AUC, accuracy comparable to the best, a low FPR, and the highest TPR (close to 1). LDA, kNN, and GLM also showed quite good results. Naive Bayes performed weakest across the board, with low AUC, accuracy, and TPR. Random Forest showed average performance.

It is not surprising that SVM with an RBF kernel outperformed the other models: it usually works well even on small data sets, when the number of samples is small but still larger than the number of features, as in our case.

Naive Bayes was the weakest model. All variables in our data set are categorical, and if a categorical variable has a level in the test set that was not observed in the training set, the model assigns it zero probability and cannot make a prediction. This is known as the “zero frequency” problem. To address it we can use smoothing; one of the simplest smoothing techniques is Laplace estimation. Naive Bayes also needs a larger data set to estimate the class probabilities reliably; otherwise precision and recall stay very low. Finally, its assumption of independence among predictors makes the model hard to use, because in real life it is almost impossible to get a set of predictors that are completely independent.
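To make the zero-frequency point concrete, here is a small illustrative sketch of Laplace smoothing (the counts below are hypothetical, not taken from the data above); in caret’s nb method this correction is controlled by the fL tuning parameter used in the grid earlier:

# Laplace-smoothed conditional probabilities for a categorical predictor;
# the counts are hypothetical and only illustrate the idea
laplace_prob <- function(counts, fL = 1) {
  (counts + fL) / (sum(counts) + fL * length(counts))
}

counts <- c("5" = 4, "19" = 6, "43" = 0, "55" = 3, "63" = 1)  # hypothetical counts within one class
laplace_prob(counts, fL = 0)  # the unseen level "43" gets probability 0
laplace_prob(counts, fL = 1)  # with smoothing it gets a small positive probability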

AUC would be a better measure of classifier performance than accuracy here, for the following reasons:

  • AUC is not biased by the size of the evaluation data.

  • Accuracy depends on choosing a probability cut-off (typically 0.5). For balanced data this is fine, but in the imbalanced case all of the minority-class probabilities may fall below 0.5; AUC instead considers how well positives are ranked above negatives by their predicted probabilities. A small illustration follows this list.

  • A metric like accuracy is computed under the class distribution of the test set or cross-validation folds, but that ratio may change when the classifier is applied to real-life data, where the underlying class distribution may have shifted or is unknown. The TP rate and FP rate used to construct the ROC curve (and hence AUC) are computed within each class, so they are not affected by shifts in the class distribution.
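A minimal sketch of the cut-off point, using assumed toy scores (not outputs of the models above) and the pROC package:

# Toy example (assumed scores): with imbalanced data, every minority-class
# ("BLUE") probability can sit below 0.5, so accuracy at the default cut-off
# looks poor even though the ranking is perfect and AUC = 1.
library(pROC)

truth  <- factor(c("BLUE", "BLUE", "BLUE", "BLACK", "BLACK", "BLACK", "BLACK"))
scores <- c(0.45, 0.40, 0.35, 0.30, 0.20, 0.15, 0.10)   # all below 0.5

mean(ifelse(scores > 0.5, "BLUE", "BLACK") == truth)    # accuracy at cut-off 0.5 is only 4/7
mean(ifelse(scores > 0.3, "BLUE", "BLACK") == truth)    # accuracy at cut-off 0.3 is 1.00
auc(roc(truth, scores, levels = c("BLACK", "BLUE"), direction = "<"))  # rank-based AUC = 1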