A: Data summary

First we load and review the data. This is a small dataset, with 36 observations of:

  • 2 predictor variables
    • X: numeric values ranging from 5 to 63
    • Y: categorical values with six levels, “a” through “f”
  • Target variable label: classes “BLACK” and “BLUE”.

In this analysis, we take “BLUE” to be the “positive” class and “BLACK” to be the reference class. Note that there are no missing values or outliers.
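
The code in this report assumes a setup chunk (not shown here) has already loaded the required libraries. A minimal sketch of that setup, inferred from the functions used below:

library(tidyverse)   # read_csv, glimpse, ggplot, tibble
library(GGally)      # ggpairs
library(caret)       # train, trainControl, confusionMatrix, resamples
library(lattice)     # bwplot (also attached by caret)
library(rpart)       # rpart.control
library(partykit)    # as.party, for plotting the fitted tree
library(pROC)        # roc, auc
library(knitr)       # kable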

# read the data: col_types "iff" = integer (X), factor (Y), factor (label)
df <- read_csv("hw2.txt",  
               col_types = "iff") 
glimpse(df)
## Observations: 36
## Variables: 3
## $ X     <int> 5, 5, 5, 5, 5, 5, 19, 19, 19, 19, 19, 19, 35, 35, 35, 35...
## $ Y     <fct> a, b, c, d, e, f, a, b, c, d, e, f, a, b, c, d, e, f, a,...
## $ label <fct> BLUE, BLACK, BLUE, BLACK, BLACK, BLACK, BLUE, BLUE, BLUE...
summary(df)
##        X      Y       label   
##  Min.   : 5   a:6   BLUE :14  
##  1st Qu.:19   b:6   BLACK:22  
##  Median :43   c:6             
##  Mean   :38   d:6             
##  3rd Qu.:55   e:6             
##  Max.   :63   f:6
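
As a quick check of the earlier claim that there are no missing values (a minimal sketch; anyNA() is base R and returns TRUE if any value is missing):

anyNA(df)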

We plot the data below. We see from the scatter plot that the classes are not linearly separable in the predictor X-Y space. Also from the pairs plot, we see that the classes have distinct conditional distributions, for instance:

  • the distribution of X is shifted toward higher values for “BLACK” compared to “BLUE”
  • the distribution of Y is weighted more to category “c” for “BLUE” compared to “BLACK”.
# scatter plot
ggplot(df) + geom_point(aes(x = X, y = Y, shape = label, col = label), size = 10) + 
  labs(title = "Scatter plot of dataset", shape = "Class", col = "Class") +
  theme(legend.position = "top", legend.title = element_text(face = "bold"))

# pairs plot
ggpairs(df, aes(fill = label, alpha = 0.5), 
        title = "Pairs plot of dataset")

B: Modeling

In this section we apply 6 learning algorithms to the dataset: logistic regression, linear discriminant analysis (LDA), naive Bayes, k-nearest neighbors (kNN), support vector machine (SVM), and classification tree. We use the training and evaluation framework provided in the caret library, and for each algorithm, we produce:

  • Training summary including resampled statistics
  • Model summary
  • Confusion matrix based on model predictions for the dataset.

In the caret framework, we implement the pre-processing, re-sampling, and tuning methods below by using the preProcess, trainControl, and other arguments of the train function:

  • Pre-processing of predictors: standardize variables (i.e., center and scale), chiefly for the benefit of scale-sensitive algorithms such as kNN and SVM
  • Re-sampling method: 10-fold cross-validation repeated 10 times, chosen because of the small sample size
  • Selection criterion for parameter tuning: maximize sensitivity on the validation (holdout) samples, since classification of the “BLUE” instances seems to be problematic for some algorithms. Alternative choices include optimizing other metrics such as accuracy, AUC, or specificity.

We fit and tune all models using the train function from the caret library, where we specify the chosen algorithm and tuning parameters (if any). In this exercise we tune only key parameters for each model; other parameters are held at their default values or are chosen automatically by R. We use the following algorithms and tuning parameters:

  • Logistic regression (glm with family = “binomial” from base R): no parameter tuning
  • Linear discriminant analysis (lda from MASS library): no parameter tuning
  • Naive Bayes (nb from klaR library): automatic tuning of usekernel parameter for normal density (FALSE) or kernel density estimate (TRUE); default values of fL (Laplace correction) and adjust (bandwidth adjustment) parameters
  • K-nearest neighbors (knn method in caret): tuning of k parameter
  • Support vector machine with radial basis function kernel (svmRadial from kernlab library): tuning of cost parameter C with automatic estimation of sigma (inverse width) parameter
  • Classification tree / CART (rpart from rpart library): tuning of cp (complexity) parameter.
# define variables for modeling
df$label <- factor(df$label, levels = c("BLACK", "BLUE"))    # BLUE = positive case
xvars <- df[1:2]
xvarsnum <- xvars
xvarsnum$Y <- as.numeric(xvars$Y)   # numeric version of factor variable
yvar <- df[[3]]
# center & scale predictors
prep <- c("center", "scale")
# repeated 10-fold cross-validation; wrap twoClassSummary so that "BLUE" is the
# first (event) level, making Sens the sensitivity for the "BLUE" class
summaryfcn <- function(data, lev = NULL, model = NULL) {
  twoClassSummary(data = data, lev = c("BLUE", "BLACK"), model = model)
}
ctrl <- trainControl(method = "repeatedcv", 
                     number = 10, 
                     repeats = 10, 
                     classProbs = TRUE, 
                     savePredictions = TRUE, 
                     summaryFunction = summaryfcn)
#ctrl <- trainControl(method = "LOOCV", 
#                     classProbs = TRUE, 
#                     summaryFunction = twoClassSummary)

##############################################################################
# fit logistic regression model; no tuning 
##############################################################################
#tunepars = data.frame()
set.seed(1012)
lrfit <- train(label ~ ., data = df, 
               #x = xvars, 
               #y = yvar, 
               method = "glm", family = "binomial",
               preProcess = prep, 
               #tuneGrid = tunepars, 
               trControl = ctrl, 
               metric = "Sens")
lrfit
## Generalized Linear Model 
## 
## 36 samples
##  2 predictor
##  2 classes: 'BLACK', 'BLUE' 
## 
## Pre-processing: centered (6), scaled (6) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 33, 32, 32, 33, 33, 32, ... 
## Resampling results:
## 
##   ROC        Sens  Spec     
##   0.5283333  0.37  0.8416667
summary(lrfit)
## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8321  -0.8572  -0.6662   0.8365   1.9931  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept) -4.656e-01  3.887e-01  -1.198   0.2310  
## X           -1.823e-01  3.829e-01  -0.476   0.6339  
## Yb          -4.054e-16  4.646e-01   0.000   1.0000  
## Yc           8.763e-01  5.298e-01   1.654   0.0981 .
## Yd          -4.393e-16  4.646e-01   0.000   1.0000  
## Ye          -3.484e-01  5.294e-01  -0.658   0.5104  
## Yf          -3.095e-16  4.646e-01   0.000   1.0000  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 48.114  on 35  degrees of freedom
## Residual deviance: 41.139  on 29  degrees of freedom
## AIC: 55.139
## 
## Number of Fisher Scoring iterations: 4
#plot(lrfit, main = paste0("Tuning performance profile: ", lrfit$modelInfo$label))
#plot(lrfit$finalModel)

# training predictions & metrics
lrpred <- predict(lrfit)
lrprob <- predict(lrfit, type = "prob") 
(lrconf <- confusionMatrix(lrpred, yvar, positive = "BLUE"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction BLACK BLUE
##      BLACK    21    9
##      BLUE      1    5
##                                          
##                Accuracy : 0.7222         
##                  95% CI : (0.5481, 0.858)
##     No Information Rate : 0.6111         
##     P-Value [Acc > NIR] : 0.11434        
##                                          
##                   Kappa : 0.3478         
##                                          
##  Mcnemar's Test P-Value : 0.02686        
##                                          
##             Sensitivity : 0.3571         
##             Specificity : 0.9545         
##          Pos Pred Value : 0.8333         
##          Neg Pred Value : 0.7000         
##              Prevalence : 0.3889         
##          Detection Rate : 0.1389         
##    Detection Prevalence : 0.1667         
##       Balanced Accuracy : 0.6558         
##                                          
##        'Positive' Class : BLUE           
## 
lrroc <- roc(response = yvar, predictor = lrprob[['BLUE']])
#plot(lrroc, main = paste0("ROC curve: ", lrfit$modelInfo$label))

##############################################################################
# fit linear discriminant analysis model; no tuning
##############################################################################
#tunepars = data.frame()
set.seed(1012)
# use numeric predictors
ldafit <- train(label ~ ., data = df, 
                #x = xvarsnum, 
                #y = yvar, 
                method = "lda", 
                preProcess = prep, 
                #tuneGrid = tunepars, 
                trControl = ctrl, 
                metric = "Sens")
ldafit
## Linear Discriminant Analysis 
## 
## 36 samples
##  2 predictor
##  2 classes: 'BLACK', 'BLUE' 
## 
## Pre-processing: centered (6), scaled (6) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 33, 32, 32, 33, 33, 32, ... 
## Resampling results:
## 
##   ROC        Sens  Spec     
##   0.5345833  0.37  0.8783333
#summary(ldafit)
#plot(ldafit, main = paste0("Tuning performance profile: ", ldafit$modelInfo$label))
#plot(ldafit$finalModel)

# training predictions & metrics
ldapred <- predict(ldafit)
ldaprob <- predict(ldafit, type = "prob") 
(ldaconf <- confusionMatrix(ldapred, yvar, positive = "BLUE"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction BLACK BLUE
##      BLACK    21    9
##      BLUE      1    5
##                                          
##                Accuracy : 0.7222         
##                  95% CI : (0.5481, 0.858)
##     No Information Rate : 0.6111         
##     P-Value [Acc > NIR] : 0.11434        
##                                          
##                   Kappa : 0.3478         
##                                          
##  Mcnemar's Test P-Value : 0.02686        
##                                          
##             Sensitivity : 0.3571         
##             Specificity : 0.9545         
##          Pos Pred Value : 0.8333         
##          Neg Pred Value : 0.7000         
##              Prevalence : 0.3889         
##          Detection Rate : 0.1389         
##    Detection Prevalence : 0.1667         
##       Balanced Accuracy : 0.6558         
##                                          
##        'Positive' Class : BLUE           
## 
ldaroc <- roc(response = yvar, predictor = ldaprob[['BLUE']])
#plot(ldaroc, main = paste0("ROC curve: ", ldafit$modelInfo$label))

##############################################################################
# fit naive bayes model; no tuning (other than automatic tuning of usekernel)
##############################################################################
#tunepars = data.frame()
set.seed(1012)
nbfit <- train(#label ~ ., data = df, 
               x = xvars, 
               y = yvar, 
               method = "nb", 
               preProcess = prep, 
               #tuneGrid = tunepars, 
               trControl = ctrl, 
               metric = "Sens")
nbfit
## Naive Bayes 
## 
## 36 samples
##  2 predictor
##  2 classes: 'BLACK', 'BLUE' 
## 
## Pre-processing: centered (1), scaled (1), ignore (1) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 33, 32, 32, 33, 33, 32, ... 
## Resampling results across tuning parameters:
## 
##   usekernel  ROC        Sens   Spec     
##   FALSE      0.5012500  0.370  0.8350000
##    TRUE      0.3354167  0.415  0.8766667
## 
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
##  parameter 'adjust' was held constant at a value of 1
## Sens was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE
##  and adjust = 1.
#summary(nbfit)
plot(nbfit, highlight = TRUE, 
     main = paste0("Tuning performance profile: ", nbfit$modelInfo$label))

par(mfrow = c(1,2))
plot(nbfit$finalModel)

par(mfrow = c(1,1))

# training predictions & metrics
nbpred <- predict(nbfit)
nbprob <- predict(nbfit, type = "prob") 
(nbconf <- confusionMatrix(nbpred, yvar, positive = "BLUE"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction BLACK BLUE
##      BLACK    21    5
##      BLUE      1    9
##                                           
##                Accuracy : 0.8333          
##                  95% CI : (0.6719, 0.9363)
##     No Information Rate : 0.6111          
##     P-Value [Acc > NIR] : 0.003604        
##                                           
##                   Kappa : 0.6301          
##                                           
##  Mcnemar's Test P-Value : 0.220671        
##                                           
##             Sensitivity : 0.6429          
##             Specificity : 0.9545          
##          Pos Pred Value : 0.9000          
##          Neg Pred Value : 0.8077          
##              Prevalence : 0.3889          
##          Detection Rate : 0.2500          
##    Detection Prevalence : 0.2778          
##       Balanced Accuracy : 0.7987          
##                                           
##        'Positive' Class : BLUE            
## 
nbroc <- roc(response = yvar, predictor = nbprob[['BLUE']])
#plot(nbroc, main = paste0("ROC curve: ", nbfit$modelInfo$label))

##############################################################################
# fit knn model; tune k 
##############################################################################
tunepars = data.frame(k = seq(1,19,2))
set.seed(1012)
# use numeric predictors
knnfit <- train(label ~ ., data = df,
                #x = xvarsnum, 
                #y = yvar, 
                method = "knn", 
                preProcess = prep, 
                tuneGrid = tunepars, 
                #tuneLength = 8,
                trControl = ctrl, 
                metric = "Sens")
knnfit
## k-Nearest Neighbors 
## 
## 36 samples
##  2 predictor
##  2 classes: 'BLACK', 'BLUE' 
## 
## Pre-processing: centered (6), scaled (6) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 33, 32, 32, 33, 33, 32, ... 
## Resampling results across tuning parameters:
## 
##   k   ROC        Sens   Spec     
##    1  0.4587500  0.365  0.7066667
##    3  0.4612500  0.370  0.9183333
##    5  0.4416667  0.395  0.8950000
##    7  0.3770833  0.355  0.8600000
##    9  0.3179167  0.285  0.9616667
##   11  0.2425000  0.260  0.9800000
##   13  0.3200000  0.135  0.9816667
##   15  0.3454167  0.080  0.9916667
##   17  0.3816667  0.045  0.9950000
##   19  0.4025000  0.020  1.0000000
## 
## Sens was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
#summary(knnfit)
ggplot(knnfit, highlight = TRUE) + 
  labs(title = paste0("Tuning performance profile: ", knnfit$modelInfo$label))

#plot(knnfit$finalModel)

# training predictions & metrics
knnpred <- predict(knnfit)
knnprob <- predict(knnfit, type = "prob") 
(knnconf <- confusionMatrix(knnpred, yvar, positive = "BLUE"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction BLACK BLUE
##      BLACK    21    9
##      BLUE      1    5
##                                          
##                Accuracy : 0.7222         
##                  95% CI : (0.5481, 0.858)
##     No Information Rate : 0.6111         
##     P-Value [Acc > NIR] : 0.11434        
##                                          
##                   Kappa : 0.3478         
##                                          
##  Mcnemar's Test P-Value : 0.02686        
##                                          
##             Sensitivity : 0.3571         
##             Specificity : 0.9545         
##          Pos Pred Value : 0.8333         
##          Neg Pred Value : 0.7000         
##              Prevalence : 0.3889         
##          Detection Rate : 0.1389         
##    Detection Prevalence : 0.1667         
##       Balanced Accuracy : 0.6558         
##                                          
##        'Positive' Class : BLUE           
## 
knnroc <- roc(response = yvar, predictor = knnprob[['BLUE']])
#plot(knnroc, main = paste0("ROC curve: ", knnfit$modelInfo$label))

##############################################################################
# fit svm model; tune C (use automatic estimate of sigma)
##############################################################################
#tunepars = expand.grid(sigma = c(0.1, 0.5, 1, 1.5, 2), C = c(0.5, 1:10))
set.seed(1012)
# use numeric predictors
svmfit <- train(#label ~ ., data = df,
                x = xvarsnum,
                y = yvar,  
                method = "svmRadial", prob.model = TRUE,
                preProcess = prep, 
                tuneLength = 15,
                #tuneGrid = tunepars, 
                trControl = ctrl, 
                metric = "Sens")
svmfit
## Support Vector Machines with Radial Basis Function Kernel 
## 
## 36 samples
##  2 predictor
##  2 classes: 'BLACK', 'BLUE' 
## 
## Pre-processing: centered (2), scaled (2) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 33, 32, 32, 33, 33, 32, ... 
## Resampling results across tuning parameters:
## 
##   C        ROC        Sens   Spec     
##      0.25  0.4183333  0.095  0.8966667
##      0.50  0.4300000  0.090  0.9000000
##      1.00  0.3591667  0.155  0.9200000
##      2.00  0.3291667  0.230  0.8983333
##      4.00  0.2866667  0.240  0.8533333
##      8.00  0.2675000  0.280  0.8416667
##     16.00  0.2658333  0.310  0.8316667
##     32.00  0.2708333  0.280  0.8583333
##     64.00  0.3591667  0.200  0.8166667
##    128.00  0.3725000  0.185  0.8483333
##    256.00  0.4691667  0.140  0.8500000
##    512.00  0.4216667  0.140  0.8400000
##   1024.00  0.4166667  0.175  0.8483333
##   2048.00  0.4550000  0.170  0.8233333
##   4096.00  0.4083333  0.150  0.8383333
## 
## Tuning parameter 'sigma' was held constant at a value of 1.034474
## Sens was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 1.034474 and C = 16.
#summary(svmfit)
ggplot(svmfit, highlight = TRUE) + 
  labs(title = paste0("Tuning performance profile: ", svmfit$modelInfo$label))

plot(svmfit$finalModel)

# training predictions & metrics
svmpred <- predict(svmfit)
svmprob <- predict(svmfit, type = "prob") 
(svmconf <- confusionMatrix(svmpred, yvar, positive = "BLUE"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction BLACK BLUE
##      BLACK    22    1
##      BLUE      0   13
##                                           
##                Accuracy : 0.9722          
##                  95% CI : (0.8547, 0.9993)
##     No Information Rate : 0.6111          
##     P-Value [Acc > NIR] : 4.774e-07       
##                                           
##                   Kappa : 0.9408          
##                                           
##  Mcnemar's Test P-Value : 1               
##                                           
##             Sensitivity : 0.9286          
##             Specificity : 1.0000          
##          Pos Pred Value : 1.0000          
##          Neg Pred Value : 0.9565          
##              Prevalence : 0.3889          
##          Detection Rate : 0.3611          
##    Detection Prevalence : 0.3611          
##       Balanced Accuracy : 0.9643          
##                                           
##        'Positive' Class : BLUE            
## 
svmroc <- roc(response = yvar, predictor = svmprob[['BLUE']])
#plot(svmroc, main = paste0("ROC curve: ", svmfit$modelInfo$label))

##############################################################################
# fit tree model; tune cp
##############################################################################
#tunepars = data.frame(cp =)
set.seed(1012)
treefit <- train(label ~ ., data = df,
                 #x = xvars, 
                 #y = yvar, 
                 method = "rpart", control = rpart.control(minsplit = 6), 
                 preProcess = prep, 
                 tuneLength = 15,
                 #tuneGrid = tunepars, 
                 trControl = ctrl,  
                 metric = "Sens")
treefit
## CART 
## 
## 36 samples
##  2 predictor
##  2 classes: 'BLACK', 'BLUE' 
## 
## Pre-processing: centered (6), scaled (6) 
## Resampling: Cross-Validated (10 fold, repeated 10 times) 
## Summary of sample sizes: 33, 32, 32, 33, 33, 32, ... 
## Resampling results across tuning parameters:
## 
##   cp          ROC        Sens   Spec     
##   0.00000000  0.2058333  0.760  0.8116667
##   0.01020408  0.2058333  0.760  0.8116667
##   0.02040816  0.2058333  0.760  0.8116667
##   0.03061224  0.2058333  0.760  0.8116667
##   0.04081633  0.2108333  0.760  0.8016667
##   0.05102041  0.2158333  0.740  0.8116667
##   0.06122449  0.2158333  0.740  0.8116667
##   0.07142857  0.2158333  0.740  0.8116667
##   0.08163265  0.2708333  0.650  0.7833333
##   0.09183673  0.3087500  0.560  0.7783333
##   0.10204082  0.3087500  0.560  0.7783333
##   0.11224490  0.3087500  0.560  0.7783333
##   0.12244898  0.3287500  0.500  0.8050000
##   0.13265306  0.3450000  0.465  0.8100000
##   0.14285714  0.3450000  0.465  0.8100000
## 
## Sens was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.04081633.
#summary(treefit)
ggplot(treefit, highlight = TRUE) + 
  labs(title = paste0("Tuning performance profile: ", treefit$modelInfo$label))

#plot(treefit$finalModel, main = "Classification tree", margin = 0.2)
#text(treefit$finalModel, all = TRUE, pretty = TRUE, use.n = TRUE, fancy = TRUE)
plot(as.party(treefit$finalModel), main = "Classification tree")

# training predictions & metrics
treepred <- predict(treefit)
treeprob <- predict(treefit, type = "prob") 
(treeconf <- confusionMatrix(treepred, yvar, positive = "BLUE"))
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction BLACK BLUE
##      BLACK    20    2
##      BLUE      2   12
##                                           
##                Accuracy : 0.8889          
##                  95% CI : (0.7394, 0.9689)
##     No Information Rate : 0.6111          
##     P-Value [Acc > NIR] : 0.0002352       
##                                           
##                   Kappa : 0.7662          
##                                           
##  Mcnemar's Test P-Value : 1.0000000       
##                                           
##             Sensitivity : 0.8571          
##             Specificity : 0.9091          
##          Pos Pred Value : 0.8571          
##          Neg Pred Value : 0.9091          
##              Prevalence : 0.3889          
##          Detection Rate : 0.3333          
##    Detection Prevalence : 0.3889          
##       Balanced Accuracy : 0.8831          
##                                           
##        'Positive' Class : BLUE            
## 
treeroc <- roc(response = yvar, predictor = treeprob[['BLUE']])
#plot(treeroc, main = paste0("ROC curve: ", treefit$modelInfo$label))

C: Model evaluation

In this section we summarize performance metrics for the 6 algorithms. It’s important to distinguish two types of performance measures: training metrics and test metrics. Training metrics are computed from the final model’s predictions on the full training set, whereas test metrics are computed from intermediate models fitted on the cross-validation training folds and evaluated on the corresponding holdout folds. Since we use repeated cross-validation as our re-sampling method, the test metrics are averaged across all of the holdout folds.

In general we expect the test metrics to be weaker than the training metrics, which is what the comparison table below shows. In this table we summarize accuracy, AUC, sensitivity (the true positive rate, TPR), and specificity (one minus the false positive rate, 1 - FPR), where the prefix “Tr_” marks training metrics and “Res_” marks test / resampled metrics. In addition we compare ROC curves and the resampled metrics from the cross-validation procedure.
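
For reference, with “BLUE” as the positive class these are the usual two-class definitions:

\[
\text{Sensitivity} = \mathrm{TPR} = \frac{TP}{TP + FN},
\qquad
\text{Specificity} = \frac{TN}{TN + FP} = 1 - \mathrm{FPR}.
\]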

# summary metrics
modnames <-  c(lrfit$modelInfo$label, 
               ldafit$modelInfo$label, 
               nbfit$modelInfo$label, 
               knnfit$modelInfo$label, 
               svmfit$modelInfo$label,
               treefit$modelInfo$label)
modnames2 <- c("GLM", "LDA", "NB", "kNN", "SVM", "RPART")
auc <- c(auc(lrroc), 
         auc(ldaroc),
         auc(nbroc),
         auc(knnroc),
         auc(svmroc),
         auc(treeroc))
acc <- c(lrconf$overall['Accuracy'], 
         ldaconf$overall['Accuracy'], 
         nbconf$overall['Accuracy'], 
         knnconf$overall['Accuracy'], 
         svmconf$overall['Accuracy'], 
         treeconf$overall['Accuracy']) 
sens <- c(lrconf$byClass['Sensitivity'], 
          ldaconf$byClass['Sensitivity'], 
          nbconf$byClass['Sensitivity'], 
          knnconf$byClass['Sensitivity'], 
          svmconf$byClass['Sensitivity'], 
          treeconf$byClass['Sensitivity'])
spec <- c(lrconf$byClass['Specificity'], 
          ldaconf$byClass['Specificity'], 
          nbconf$byClass['Specificity'], 
          knnconf$byClass['Specificity'], 
          svmconf$byClass['Specificity'], 
          treeconf$byClass['Specificity'])          
# locate the rows of the resampling results that correspond to the selected
# tuning parameters
nb_res_idx <- match(nbfit$bestTune[['usekernel']], nbfit$results[['usekernel']])
knn_res_idx <- match(knnfit$bestTune[['k']], knnfit$results[['k']])
svm_res_idx <- match(svmfit$bestTune[['C']], svmfit$results[['C']])
tree_res_idx <- match(treefit$bestTune[['cp']], treefit$results[['cp']])
auc_res <- c(lrfit$results$ROC, 
             ldafit$results$ROC, 
             nbfit$results[nb_res_idx, 'ROC'],       # adjust for tuning 
             knnfit$results[knn_res_idx, 'ROC'],     # adjust for tuning
             svmfit$results[svm_res_idx, 'ROC'],     # adjust for tuning
             treefit$results[tree_res_idx, 'ROC'])   # adjust for tuning
sens_res <- c(lrfit$results$Sens, 
              ldafit$results$Sens, 
              nbfit$results[nb_res_idx, 'Sens'],  
              knnfit$results[knn_res_idx, 'Sens'], 
              svmfit$results[svm_res_idx, 'Sens'], 
              treefit$results[tree_res_idx, 'Sens'])             
spec_res <- c(lrfit$results$Spec, 
              ldafit$results$Spec, 
              nbfit$results[nb_res_idx, 'Spec'], 
              knnfit$results[knn_res_idx, 'Spec'], 
              svmfit$results[svm_res_idx, 'Spec'], 
              treefit$results[tree_res_idx, 'Spec'])             
sum_df <- tibble(Model = modnames, 
                 Tr_Accuracy = acc,
                 Tr_AUC = auc,
                 Tr_Sensitivity = sens,
                 Tr_Specificity = spec,
                 Res_AUC = auc_res, 
                 Res_Sensitivity = sens_res,
                 Res_Specificity = spec_res)
kable(sum_df, 
      digits = 3, 
      caption = "Comparison of model performance: Training and resampled metrics")
Comparison of model performance: Training and resampled metrics

Model                                                       Tr_Accuracy  Tr_AUC  Tr_Sensitivity  Tr_Specificity  Res_AUC  Res_Sensitivity  Res_Specificity
Generalized Linear Model                                          0.722   0.705           0.357           0.955    0.528            0.370            0.842
Linear Discriminant Analysis                                      0.722   0.708           0.357           0.955    0.535            0.370            0.878
Naive Bayes                                                       0.833   0.857           0.643           0.955    0.335            0.415            0.877
k-Nearest Neighbors                                               0.722   0.708           0.357           0.955    0.442            0.395            0.895
Support Vector Machines with Radial Basis Function Kernel         0.972   0.994           0.929           1.000    0.266            0.310            0.832
CART                                                              0.889   0.959           0.857           0.909    0.211            0.760            0.802
# ROC plot
par(mfrow = c(1,1))
plot(lrroc, col = 1, legacy.axes = TRUE, main = "Comparison of ROC curves")
plot(ldaroc, add = TRUE, col = 2, legacy.axes = TRUE)
plot(nbroc, add = TRUE, col = 3, legacy.axes = TRUE)
plot(knnroc, add = TRUE, col = 4, legacy.axes = TRUE)
plot(svmroc, add = TRUE, col = 5, legacy.axes = TRUE)
plot(treeroc, add = TRUE, col = 6, legacy.axes = TRUE)
legend("bottomright", legend = modnames2,
       col = 1:6, lwd=2)

# plot resampled metrics
train_list <- list(lrfit, ldafit, nbfit, knnfit, svmfit, treefit)
names(train_list) <- modnames2
resamps <- resamples(train_list)
#resamps
#summary(resamps)
bwplot(resamps,
       layout = c(3,1), 
       between = list(x = 1), 
       main = "Comparison of resampled performance metrics")

D: Summary

We summarize our findings below:

  • Based on training metrics, SVM with the RBF kernel is the clear winner, exhibiting the strongest metrics across the board (highest accuracy, AUC, sensitivity, and specificity). The classification tree is the runner-up, followed by naive Bayes; both show stronger training performance than the remaining models.
  • However, based on test / resampled metrics, the best model is not so clear. By AUC, logistic regression (0.53) and LDA (0.54) appear strongest; by sensitivity / specificity, the classification tree (0.76 / 0.80), naive Bayes (0.42 / 0.88), and kNN (0.40 / 0.90) appear strongest.
  • Test / resampled metrics are clearly weaker than the training metrics, consistent with expectation.
  • All models had stronger performance at classifying “BLACK” instances (specificity) than “BLUE” instances (sensitivity). For instance, across the resampled test sets, specificity ranged from roughly 80% to 90%, whereas sensitivity ranged from roughly 31% to 42% for every model except the classification tree (76%).
  • The sample size (36 observations) is very small, and even with 10-fold cross-validation repeated 10 times, any statistical conclusions should be validated on a larger dataset.
  • Logistic regression and LDA have similar performance profiles, which is to be expected since they both produce linear separation boundaries in the predictor X-Y space. As we saw in the scatter plot of the data, the classes are not linearly separable, which explains the mediocre accuracy on the training set (~72%).
  • Naive Bayes performs reasonably well on the training set (accuracy of 83% and AUC of 86%) and does a better job than most of the other classifiers (except SVM and the classification tree) at classifying the “BLUE” instances (sensitivity of 64%); however, on the resampled test sets its performance is much weaker. For this model, cross-validation was used to choose the assumed predictor distribution, and a kernel density estimate was selected over the normal distribution.
  • K-nearest neighbors performs in line with the linear classifiers (logistic regression and LDA) on both the training set and the test samples. The model was tuned over k, and k = 5 was selected. Perhaps with a larger dataset, the kNN classifier’s performance would be more clearly distinguishable from that of the linear classifiers.
  • SVM with the radial basis function kernel shows signs of over-fitting: its training metrics are extremely strong (97% accuracy, 99% AUC, and 93% sensitivity), whereas its resampled test metrics are among the weakest of all the models considered (27% AUC and the lowest sensitivity, at 31%). With a larger dataset it would be interesting to investigate this behavior (see the decision-region sketch after this list), since in theory the cross-validation procedure should have mitigated the tendency to over-fit the training data. This model was tuned to determine the optimal choice of the cost parameter, which was \(C = 16\).
  • The classification tree shows strong performance on the training set as well as on the test samples, with the exception of the test AUC. In particular, it out-performs the other models by far at classifying “BLUE” instances in the test samples (test sensitivity of 76% versus roughly 31% to 42% for the other models). Given the difficulty the other models have with this, it might tip our assessment in favor of the classification tree as the best model overall. This model was tuned over the complexity parameter, resulting in \(cp = 0.04\).
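
As a possible follow-up, the claims about linear boundaries and over-fitting could be examined by drawing each model's decision regions over a grid of predictor values. The sketch below is not part of the original analysis; it assumes the fitted caret objects from section B that were trained via the formula interface on df (e.g. treefit, knnfit, lrfit) are still in the workspace. For svmfit, which was trained on the numeric recoding of Y, the grid's Y column would first need as.numeric().

# build a grid over the predictor space: a fine sequence of X crossed with the
# six levels of Y
grid <- expand.grid(X = seq(min(df$X), max(df$X), length.out = 200),
                    Y = levels(df$Y))
grid$Y <- factor(grid$Y, levels = levels(df$Y))

# predicted class at each grid point (swap in knnfit, lrfit, ... to compare)
grid$pred <- predict(treefit, newdata = grid)

# shade the predicted regions and overlay the observed points
ggplot() +
  geom_tile(data = grid, aes(x = X, y = Y, fill = pred), alpha = 0.3) +
  geom_point(data = df, aes(x = X, y = Y, col = label, shape = label), size = 4) +
  labs(title = "Decision regions: classification tree (sketch)",
       fill = "Predicted", col = "Actual", shape = "Actual")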