First we load and review the data. This is a small dataset, with 36 observations of:
- X: numeric values ranging from 5 to 63
- Y: categorical values ranging from “a” to “f”
- label: classes “BLACK” and “BLUE”.

In this analysis, we take “BLUE” to be the “positive” class and “BLACK” to be the reference class. Note that there are no missing values or outliers.
df <- read_csv("hw2.txt",
col_types = "iff")
glimpse(df)
## Observations: 36
## Variables: 3
## $ X <int> 5, 5, 5, 5, 5, 5, 19, 19, 19, 19, 19, 19, 35, 35, 35, 35...
## $ Y <fct> a, b, c, d, e, f, a, b, c, d, e, f, a, b, c, d, e, f, a,...
## $ label <fct> BLUE, BLACK, BLUE, BLACK, BLACK, BLACK, BLUE, BLUE, BLUE...
summary(df)
## X Y label
## Min. : 5 a:6 BLUE :14
## 1st Qu.:19 b:6 BLACK:22
## Median :43 c:6
## Mean :38 d:6
## 3rd Qu.:55 e:6
## Max. :63 f:6
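As a quick check of the claim above that there are no missing values, and of the class balance shown in the summary, a minimal sketch along these lines could be run on the df object just loaded:
# confirm there are no missing values anywhere in the data frame
sum(is.na(df))    # expected to be 0
# re-check the class balance (14 BLUE vs 22 BLACK per the summary above)
table(df$label)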
We plot the data below. We see from the scatter plot that the classes are not linearly separable in the predictor X-Y space. Also from the pairs plot, we see that the classes have distinct conditional distributions, for instance:
- X is skewed to higher values for “BLACK” compared to “BLUE”
- Y is weighted more to category “c” for “BLUE” compared to “BLACK”.

# scatter plot
ggplot(df) + geom_point(aes(x = X, y = Y, shape = label, col = label), size = 10) +
labs(title = "Scatter plot of dataset", shape = "Class", col = "Class") +
theme(legend.position = "top", legend.title = element_text(face = "bold"))
# pairs plot
ggpairs(df, aes(fill = label, alpha = 0.5),
title = "Pairs plot of dataset")
In this section we apply 6 learning algorithms to the dataset: logistic regression, linear discriminant analysis (LDA), naive Bayes, k-nearest neighbors (kNN), support vector machine (SVM), and classification tree. We use the training and evaluation framework provided by the caret library, and for each algorithm we produce the fitted model with its resampled performance, a tuning performance profile (where parameters are tuned), and training-set predictions with a confusion matrix and ROC curve.
In the caret framework, we implement the pre-processing, re-sampling, and tuning methods using the preProcess, trainControl, and other arguments of the train function: predictors are centered and scaled (preProcess), models are evaluated with repeated 10-fold cross-validation and class probabilities retained (trainControl), and tuning parameters are supplied through tuneGrid or tuneLength.
We fit and tune all models using the train function from the caret library, where we specify the chosen algorithm and tuning parameters (if any). In this exercise we tune only key parameters for each model; other parameters are held at their default values or are chosen automatically by R. We use the following algorithms and tuning parameters:
- logistic regression (glm with family = “binomial” from base R): no parameter tuning
- LDA (lda from the MASS library): no parameter tuning
- naive Bayes (nb from the klaR library): automatic tuning of the usekernel parameter for normal density (FALSE) or kernel density estimate (TRUE); default values of the fL (Laplace correction) and adjust (bandwidth adjustment) parameters
- kNN (knn from base R): tuning of the k parameter
- SVM (svmRadial from the kernlab library): tuning of the cost parameter C with automatic estimation of the sigma (inverse width) parameter
- classification tree (rpart from the rpart library): tuning of the cp (complexity) parameter.

# define variables for modeling
df$label <- factor(df$label, levels = c("BLACK", "BLUE")) # BLUE = positive case
xvars <- df[1:2]
xvarsnum <- xvars
xvarsnum$Y <- as.numeric(xvars$Y) # numeric version of factor variable
yvar <- df[[3]]
# center & scale predictors
prep <- c("center", "scale")
# repeated 10-fold cross validation; set BLUE = positive case
# twoClassSummary treats the first element of lev as the event of interest,
# so passing lev = c("BLUE", "BLACK") reports ROC and Sens for the BLUE class
summaryfcn <- function(data, lev = NULL, model = NULL) {
  twoClassSummary(data = data, lev = c("BLUE", "BLACK"), model = model)
}
ctrl <- trainControl(method = "repeatedcv",
number = 10,
repeats = 10,
classProbs = TRUE,
savePredictions = TRUE,
summaryFunction = summaryfcn)
#ctrl <- trainControl(method = "LOOCV",
# classProbs = TRUE,
# summaryFunction = twoClassSummary)
##############################################################################
# fit logistic regression model; no tuning
##############################################################################
#tunepars = data.frame()
set.seed(1012)
lrfit <- train(label ~ ., data = df,
#x = xvars,
#y = yvar,
method = "glm", family = "binomial",
preProcess = prep,
#tuneGrid = tunepars,
trControl = ctrl,
metric = "Sens")
lrfit
## Generalized Linear Model
##
## 36 samples
## 2 predictor
## 2 classes: 'BLACK', 'BLUE'
##
## Pre-processing: centered (6), scaled (6)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 33, 32, 32, 33, 33, 32, ...
## Resampling results:
##
## ROC Sens Spec
## 0.5283333 0.37 0.8416667
summary(lrfit)
##
## Call:
## NULL
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.8321 -0.8572 -0.6662 0.8365 1.9931
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.656e-01 3.887e-01 -1.198 0.2310
## X -1.823e-01 3.829e-01 -0.476 0.6339
## Yb -4.054e-16 4.646e-01 0.000 1.0000
## Yc 8.763e-01 5.298e-01 1.654 0.0981 .
## Yd -4.393e-16 4.646e-01 0.000 1.0000
## Ye -3.484e-01 5.294e-01 -0.658 0.5104
## Yf -3.095e-16 4.646e-01 0.000 1.0000
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 48.114 on 35 degrees of freedom
## Residual deviance: 41.139 on 29 degrees of freedom
## AIC: 55.139
##
## Number of Fisher Scoring iterations: 4
#plot(lrfit, main = paste0("Tuning performance profile: ", lrfit$modelInfo$label))
#plot(lrfit$finalModel)
# training predictions & metrics
lrpred <- predict(lrfit)
lrprob <- predict(lrfit, type = "prob")
(lrconf <- confusionMatrix(lrpred, yvar, positive = "BLUE"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction BLACK BLUE
## BLACK 21 9
## BLUE 1 5
##
## Accuracy : 0.7222
## 95% CI : (0.5481, 0.858)
## No Information Rate : 0.6111
## P-Value [Acc > NIR] : 0.11434
##
## Kappa : 0.3478
##
## Mcnemar's Test P-Value : 0.02686
##
## Sensitivity : 0.3571
## Specificity : 0.9545
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.7000
## Prevalence : 0.3889
## Detection Rate : 0.1389
## Detection Prevalence : 0.1667
## Balanced Accuracy : 0.6558
##
## 'Positive' Class : BLUE
##
lrroc <- roc(response = yvar, predictor = lrprob[['BLUE']])
#plot(lrroc, main = paste0("ROC curve: ", lrfit$modelInfo$label))
##############################################################################
# fit linear discriminant analysis model; no tuning
##############################################################################
#tunepars = data.frame()
set.seed(1012)
# formula interface: factor Y is expanded to dummy variables (numeric version xvarsnum left commented out)
ldafit <- train(label ~ ., data = df,
#x = xvarsnum,
#y = yvar,
method = "lda",
preProcess = prep,
#tuneGrid = tunepars,
trControl = ctrl,
metric = "Sens")
ldafit
## Linear Discriminant Analysis
##
## 36 samples
## 2 predictor
## 2 classes: 'BLACK', 'BLUE'
##
## Pre-processing: centered (6), scaled (6)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 33, 32, 32, 33, 33, 32, ...
## Resampling results:
##
## ROC Sens Spec
## 0.5345833 0.37 0.8783333
#summary(ldafit)
#plot(ldafit, main = paste0("Tuning performance profile: ", ldafit$modelInfo$label))
#plot(ldafit$finalModel)
# training predictions & metrics
ldapred <- predict(ldafit)
ldaprob <- predict(ldafit, type = "prob")
(ldaconf <- confusionMatrix(ldapred, yvar, positive = "BLUE"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction BLACK BLUE
## BLACK 21 9
## BLUE 1 5
##
## Accuracy : 0.7222
## 95% CI : (0.5481, 0.858)
## No Information Rate : 0.6111
## P-Value [Acc > NIR] : 0.11434
##
## Kappa : 0.3478
##
## Mcnemar's Test P-Value : 0.02686
##
## Sensitivity : 0.3571
## Specificity : 0.9545
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.7000
## Prevalence : 0.3889
## Detection Rate : 0.1389
## Detection Prevalence : 0.1667
## Balanced Accuracy : 0.6558
##
## 'Positive' Class : BLUE
##
ldaroc <- roc(response = yvar, predictor = ldaprob[['BLUE']])
#plot(ldaroc, main = paste0("ROC curve: ", ldafit$modelInfo$label))
##############################################################################
# fit naive bayes model; no tuning (other than automatic tuning of usekernel)
##############################################################################
#tunepars = data.frame()
set.seed(1012)
nbfit <- train(#label ~ ., data = df,
x = xvars,
y = yvar,
method = "nb",
preProcess = prep,
#tuneGrid = tunepars,
trControl = ctrl,
metric = "Sens")
nbfit
## Naive Bayes
##
## 36 samples
## 2 predictor
## 2 classes: 'BLACK', 'BLUE'
##
## Pre-processing: centered (1), scaled (1), ignore (1)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 33, 32, 32, 33, 33, 32, ...
## Resampling results across tuning parameters:
##
## usekernel ROC Sens Spec
## FALSE 0.5012500 0.370 0.8350000
## TRUE 0.3354167 0.415 0.8766667
##
## Tuning parameter 'fL' was held constant at a value of 0
## Tuning
## parameter 'adjust' was held constant at a value of 1
## Sens was used to select the optimal model using the largest value.
## The final values used for the model were fL = 0, usekernel = TRUE
## and adjust = 1.
#summary(nbfit)
plot(nbfit, highlight = TRUE,
main = paste0("Tuning performance profile: ", nbfit$modelInfo$label))
par(mfrow = c(1,2))
plot(nbfit$finalModel)
par(mfrow = c(1,1))
# training predictions & metrics
nbpred <- predict(nbfit)
nbprob <- predict(nbfit, type = "prob")
(nbconf <- confusionMatrix(nbpred, yvar, positive = "BLUE"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction BLACK BLUE
## BLACK 21 5
## BLUE 1 9
##
## Accuracy : 0.8333
## 95% CI : (0.6719, 0.9363)
## No Information Rate : 0.6111
## P-Value [Acc > NIR] : 0.003604
##
## Kappa : 0.6301
##
## Mcnemar's Test P-Value : 0.220671
##
## Sensitivity : 0.6429
## Specificity : 0.9545
## Pos Pred Value : 0.9000
## Neg Pred Value : 0.8077
## Prevalence : 0.3889
## Detection Rate : 0.2500
## Detection Prevalence : 0.2778
## Balanced Accuracy : 0.7987
##
## 'Positive' Class : BLUE
##
nbroc <- roc(response = yvar, predictor = nbprob[['BLUE']])
#plot(nbroc, main = paste0("ROC curve: ", nbfit$modelInfo$label))
##############################################################################
# fit knn model; tune k
##############################################################################
tunepars = data.frame(k = seq(1,19,2))
set.seed(1012)
# formula interface: factor Y is expanded to dummy variables (numeric version xvarsnum left commented out)
knnfit <- train(label ~ ., data = df,
#x = xvarsnum,
#y = yvar,
method = "knn",
preProcess = prep,
tuneGrid = tunepars,
#tuneLength = 8,
trControl = ctrl,
metric = "Sens")
knnfit
## k-Nearest Neighbors
##
## 36 samples
## 2 predictor
## 2 classes: 'BLACK', 'BLUE'
##
## Pre-processing: centered (6), scaled (6)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 33, 32, 32, 33, 33, 32, ...
## Resampling results across tuning parameters:
##
## k ROC Sens Spec
## 1 0.4587500 0.365 0.7066667
## 3 0.4612500 0.370 0.9183333
## 5 0.4416667 0.395 0.8950000
## 7 0.3770833 0.355 0.8600000
## 9 0.3179167 0.285 0.9616667
## 11 0.2425000 0.260 0.9800000
## 13 0.3200000 0.135 0.9816667
## 15 0.3454167 0.080 0.9916667
## 17 0.3816667 0.045 0.9950000
## 19 0.4025000 0.020 1.0000000
##
## Sens was used to select the optimal model using the largest value.
## The final value used for the model was k = 5.
#summary(knnfit)
ggplot(knnfit, highlight = TRUE) +
labs(title = paste0("Tuning performance profile: ", knnfit$modelInfo$label))
#plot(knnfit$finalModel)
# training predictions & metrics
knnpred <- predict(knnfit)
knnprob <- predict(knnfit, type = "prob")
(knnconf <- confusionMatrix(knnpred, yvar, positive = "BLUE"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction BLACK BLUE
## BLACK 21 9
## BLUE 1 5
##
## Accuracy : 0.7222
## 95% CI : (0.5481, 0.858)
## No Information Rate : 0.6111
## P-Value [Acc > NIR] : 0.11434
##
## Kappa : 0.3478
##
## Mcnemar's Test P-Value : 0.02686
##
## Sensitivity : 0.3571
## Specificity : 0.9545
## Pos Pred Value : 0.8333
## Neg Pred Value : 0.7000
## Prevalence : 0.3889
## Detection Rate : 0.1389
## Detection Prevalence : 0.1667
## Balanced Accuracy : 0.6558
##
## 'Positive' Class : BLUE
##
knnroc <- roc(response = yvar, predictor = knnprob[['BLUE']])
#plot(knnroc, main = paste0("ROC curve: ", knnfit$modelInfo$label))
##############################################################################
# fit svm model; tune C (use automatic estimate of sigma)
##############################################################################
#tunepars = expand.grid(sigma = c(0.1, 0.5, 1, 1.5, 2), C = c(0.5, 1:10))
set.seed(1012)
# use numeric predictors
svmfit <- train(#label ~ ., data = df,
x = xvarsnum,
y = yvar,
method = "svmRadial", prob.model = TRUE,
preProcess = prep,
tuneLength = 15,
#tuneGrid = tunepars,
trControl = ctrl,
metric = "Sens")
svmfit
## Support Vector Machines with Radial Basis Function Kernel
##
## 36 samples
## 2 predictor
## 2 classes: 'BLACK', 'BLUE'
##
## Pre-processing: centered (2), scaled (2)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 33, 32, 32, 33, 33, 32, ...
## Resampling results across tuning parameters:
##
## C ROC Sens Spec
## 0.25 0.4183333 0.095 0.8966667
## 0.50 0.4300000 0.090 0.9000000
## 1.00 0.3591667 0.155 0.9200000
## 2.00 0.3291667 0.230 0.8983333
## 4.00 0.2866667 0.240 0.8533333
## 8.00 0.2675000 0.280 0.8416667
## 16.00 0.2658333 0.310 0.8316667
## 32.00 0.2708333 0.280 0.8583333
## 64.00 0.3591667 0.200 0.8166667
## 128.00 0.3725000 0.185 0.8483333
## 256.00 0.4691667 0.140 0.8500000
## 512.00 0.4216667 0.140 0.8400000
## 1024.00 0.4166667 0.175 0.8483333
## 2048.00 0.4550000 0.170 0.8233333
## 4096.00 0.4083333 0.150 0.8383333
##
## Tuning parameter 'sigma' was held constant at a value of 1.034474
## Sens was used to select the optimal model using the largest value.
## The final values used for the model were sigma = 1.034474 and C = 16.
#summary(svmfit)
ggplot(svmfit, highlight = TRUE) +
labs(title = paste0("Tuning performance profile: ", svmfit$modelInfo$label))
plot(svmfit$finalModel)
# training predictions & metrics
svmpred <- predict(svmfit)
svmprob <- predict(svmfit, type = "prob")
(svmconf <- confusionMatrix(svmpred, yvar, positive = "BLUE"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction BLACK BLUE
## BLACK 22 1
## BLUE 0 13
##
## Accuracy : 0.9722
## 95% CI : (0.8547, 0.9993)
## No Information Rate : 0.6111
## P-Value [Acc > NIR] : 4.774e-07
##
## Kappa : 0.9408
##
## Mcnemar's Test P-Value : 1
##
## Sensitivity : 0.9286
## Specificity : 1.0000
## Pos Pred Value : 1.0000
## Neg Pred Value : 0.9565
## Prevalence : 0.3889
## Detection Rate : 0.3611
## Detection Prevalence : 0.3611
## Balanced Accuracy : 0.9643
##
## 'Positive' Class : BLUE
##
svmroc <- roc(response = yvar, predictor = svmprob[['BLUE']])
#plot(svmroc, main = paste0("ROC curve: ", svmfit$modelInfo$label))
##############################################################################
# fit tree model; tune cp
##############################################################################
#tunepars = data.frame(cp =)
set.seed(1012)
treefit <- train(label ~ ., data = df,
#x = xvars,
#y = yvar,
method = "rpart", control = rpart.control(minsplit = 6),
preProcess = prep,
tuneLength = 15,
#tuneGrid = tunepars,
trControl = ctrl,
metric = "Sens")
treefit
## CART
##
## 36 samples
## 2 predictor
## 2 classes: 'BLACK', 'BLUE'
##
## Pre-processing: centered (6), scaled (6)
## Resampling: Cross-Validated (10 fold, repeated 10 times)
## Summary of sample sizes: 33, 32, 32, 33, 33, 32, ...
## Resampling results across tuning parameters:
##
## cp ROC Sens Spec
## 0.00000000 0.2058333 0.760 0.8116667
## 0.01020408 0.2058333 0.760 0.8116667
## 0.02040816 0.2058333 0.760 0.8116667
## 0.03061224 0.2058333 0.760 0.8116667
## 0.04081633 0.2108333 0.760 0.8016667
## 0.05102041 0.2158333 0.740 0.8116667
## 0.06122449 0.2158333 0.740 0.8116667
## 0.07142857 0.2158333 0.740 0.8116667
## 0.08163265 0.2708333 0.650 0.7833333
## 0.09183673 0.3087500 0.560 0.7783333
## 0.10204082 0.3087500 0.560 0.7783333
## 0.11224490 0.3087500 0.560 0.7783333
## 0.12244898 0.3287500 0.500 0.8050000
## 0.13265306 0.3450000 0.465 0.8100000
## 0.14285714 0.3450000 0.465 0.8100000
##
## Sens was used to select the optimal model using the largest value.
## The final value used for the model was cp = 0.04081633.
#summary(treefit)
ggplot(treefit, highlight = TRUE) +
labs(title = paste0("Tuning performance profile: ", treefit$modelInfo$label))
#plot(treefit$finalModel, main = "Classification tree", margin = 0.2)
#text(treefit$finalModel, all = TRUE, pretty = TRUE, use.n = TRUE, fancy = TRUE)
plot(as.party(treefit$finalModel), main = "Classification tree")
# training predictions & metrics
treepred <- predict(treefit)
treeprob <- predict(treefit, type = "prob")
(treeconf <- confusionMatrix(treepred, yvar, positive = "BLUE"))
## Confusion Matrix and Statistics
##
## Reference
## Prediction BLACK BLUE
## BLACK 20 2
## BLUE 2 12
##
## Accuracy : 0.8889
## 95% CI : (0.7394, 0.9689)
## No Information Rate : 0.6111
## P-Value [Acc > NIR] : 0.0002352
##
## Kappa : 0.7662
##
## Mcnemar's Test P-Value : 1.0000000
##
## Sensitivity : 0.8571
## Specificity : 0.9091
## Pos Pred Value : 0.8571
## Neg Pred Value : 0.9091
## Prevalence : 0.3889
## Detection Rate : 0.3333
## Detection Prevalence : 0.3889
## Balanced Accuracy : 0.8831
##
## 'Positive' Class : BLUE
##
treeroc <- roc(response = yvar, predictor = treeprob[['BLUE']])
#plot(treeroc, main = paste0("ROC curve: ", treefit$modelInfo$label))
In this section we summarize performance metrics for the 6 algorithms. It’s important to distinguish two types of performance measures: training metrics and test metrics. Training metrics are computed from the final model’s predictions on the full training set, whereas test metrics come from the intermediate models fitted on each cross-validation training fold and evaluated on the corresponding held-out fold. Since we use repeated cross-validation as our re-sampling method, the test metrics are averaged across all the held-out folds.
In general we expect the test metrics to be weaker than the training metrics, and this is borne out in the comparison table below. The table reports accuracy, AUC, sensitivity (equivalent to the true positive rate, TPR), and specificity (equivalent to one minus the false positive rate, 1 - FPR); the “Tr_” prefix marks training metrics and the “Res_” prefix marks test / resampled metrics. We also show a comparison of ROC curves and of the resampled metrics from the cross-validation procedure.
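To make the confusion-matrix metrics concrete, the short sketch below reproduces the training sensitivity, specificity, and accuracy reported for the logistic regression model from its confusion-matrix counts above; the names tp, fn, tn, and fp are ours, introduced only for this illustration.
# counts from the logistic regression training confusion matrix (BLUE = positive class)
tp <- 5                            # predicted BLUE, actually BLUE
fn <- 9                            # predicted BLACK, actually BLUE
tn <- 21                           # predicted BLACK, actually BLACK
fp <- 1                            # predicted BLUE, actually BLACK
tp / (tp + fn)                     # sensitivity (TPR) = 0.357
tn / (tn + fp)                     # specificity (1 - FPR) = 0.955
(tp + tn) / (tp + fn + tn + fp)    # accuracy = 0.722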
# summary metrics
modnames <- c(lrfit$modelInfo$label,
ldafit$modelInfo$label,
nbfit$modelInfo$label,
knnfit$modelInfo$label,
svmfit$modelInfo$label,
treefit$modelInfo$label)
modnames2 <- c("GLM", "LDA", "NB", "kNN", "SVM", "RPART")
auc <- c(auc(lrroc),
auc(ldaroc),
auc(nbroc),
auc(knnroc),
auc(svmroc),
auc(treeroc))
acc <- c(lrconf$overall['Accuracy'],
ldaconf$overall['Accuracy'],
nbconf$overall['Accuracy'],
knnconf$overall['Accuracy'],
svmconf$overall['Accuracy'],
treeconf$overall['Accuracy'])
sens <- c(lrconf$byClass['Sensitivity'],
ldaconf$byClass['Sensitivity'],
nbconf$byClass['Sensitivity'],
knnconf$byClass['Sensitivity'],
svmconf$byClass['Sensitivity'],
treeconf$byClass['Sensitivity'])
spec <- c(lrconf$byClass['Specificity'],
ldaconf$byClass['Specificity'],
nbconf$byClass['Specificity'],
knnconf$byClass['Specificity'],
svmconf$byClass['Specificity'],
treeconf$byClass['Specificity'])
nb_res_idx <- match(nbfit$bestTune['usekernel'], nbfit$results[['usekernel']])
knn_res_idx <- match(knnfit$bestTune['k'], knnfit$results[['k']])
svm_res_idx <- match(svmfit$bestTune['C'], svmfit$results[['C']])
tree_res_idx <- match(treefit$bestTune['cp'], treefit$results[['cp']])
auc_res <- c(lrfit$results$ROC,
ldafit$results$ROC,
nbfit$results[nb_res_idx, 'ROC'], # adjust for tuning
knnfit$results[knn_res_idx, 'ROC'], # adjust for tuning
svmfit$results[svm_res_idx, 'ROC'], # adjust for tuning
treefit$results[tree_res_idx, 'ROC']) # adjust for tuning
sens_res <- c(lrfit$results$Sens,
ldafit$results$Sens,
nbfit$results[nb_res_idx, 'Sens'],
knnfit$results[knn_res_idx, 'Sens'],
svmfit$results[svm_res_idx, 'Sens'],
treefit$results[tree_res_idx, 'Sens'])
spec_res <- c(lrfit$results$Spec,
ldafit$results$Spec,
nbfit$results[nb_res_idx, 'Spec'],
knnfit$results[knn_res_idx, 'Spec'],
svmfit$results[svm_res_idx, 'Spec'],
treefit$results[tree_res_idx, 'Spec'])
sum_df <- tibble(Model = modnames,
Tr_Accuracy = acc,
Tr_AUC = auc,
Tr_Sensitivity = sens,
Tr_Specificity = spec,
Res_AUC = auc_res,
Res_Sensitivity = sens_res,
Res_Specificity = spec_res)
kable(sum_df,
digits = 3,
caption = "Comparison of model performance: Training and resampled metrics")
| Model | Tr_Accuracy | Tr_AUC | Tr_Sensitivity | Tr_Specificity | Res_AUC | Res_Sensitivity | Res_Specificity |
|---|---|---|---|---|---|---|---|
| Generalized Linear Model | 0.722 | 0.705 | 0.357 | 0.955 | 0.528 | 0.370 | 0.842 |
| Linear Discriminant Analysis | 0.722 | 0.708 | 0.357 | 0.955 | 0.535 | 0.370 | 0.878 |
| Naive Bayes | 0.833 | 0.857 | 0.643 | 0.955 | 0.335 | 0.415 | 0.877 |
| k-Nearest Neighbors | 0.722 | 0.708 | 0.357 | 0.955 | 0.442 | 0.395 | 0.895 |
| Support Vector Machines with Radial Basis Function Kernel | 0.972 | 0.994 | 0.929 | 1.000 | 0.266 | 0.310 | 0.832 |
| CART | 0.889 | 0.959 | 0.857 | 0.909 | 0.211 | 0.760 | 0.802 |
# ROC plot
par(mfrow = c(1,1))
plot(lrroc, col = 1, legacy.axes = TRUE, main = "Comparison of ROC curves")
plot(ldaroc, add = TRUE, col = 2, legacy.axes = TRUE)
plot(nbroc, add = TRUE, col = 3, legacy.axes = TRUE)
plot(knnroc, add = TRUE, col = 4, legacy.axes = TRUE)
plot(svmroc, add = TRUE, col = 5, legacy.axes = TRUE)
plot(treeroc, add = TRUE, col = 6, legacy.axes = TRUE)
legend("bottomright", legend = modnames2,
col = 1:6, lwd=2)
# plot resampled metrics
train_list <- list(lrfit, ldafit, nbfit, knnfit, svmfit, treefit)
names(train_list) <- modnames2
resamps <- resamples(train_list)
#resamps
#summary(resamps)
bwplot(resamps,
layout = c(3,1),
between = list(x = 1),
main = "Comparison of resampled performance metrics")
We summarize our findings below:
k, which was chosen to be 5. Perhaps with a larger dataset, performance of the kNN classifier would be more clearly distinguishable versus the linear classifiers.
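One way to probe whether the resampled differences between kNN and the linear classifiers are meaningful, rather than noise from the small sample, is caret’s diff method for resamples objects; a minimal sketch using the resamps object created above (difvals is a name of our choosing):
# paired differences in resampled ROC / Sens / Spec between each pair of models
difvals <- diff(resamps)
summary(difvals)    # estimates and p-values for the pairwise differences
dotplot(difvals)    # visualize the pairwise differences with confidence intervals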