CLASSIFICATION MODEL COMPARISONS

The accuracies of four classification methods are compared across several data sets using cross-validation (CV) performed by bespoke functions.

Classification methods compared are:

- glm(family = "binomial") logistic regression, referred to as glm for convenience
- lda() linear discriminant analysis
- qda() quadratic discriminant analysis
- knn() k nearest neighbours

My CV approach is the validation set method, but instead of a 50/50 split I use a 70/30 train/test (hold-out) split by default. Each model is run n_trials times with a different random train/test split each time, so splits overlap across trials. The mean accuracy and a boxplot of each model's accuracy distribution are output, as sketched below.
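For context, each comparator boils down to a loop like the following sketch (the function and its interface here are hypothetical; the real implementations, with documentation, live in the sourced helper file):

# Sketch of repeated validation-set CV for a single model. fit_predict is
# any function that trains on its first argument and returns predicted
# classes for its second (a hypothetical interface).
validation_cv <- function(df, response, fit_predict, n_trials = 100, train_frac = 0.7) {
  replicate(n_trials, {
    idx   <- sample(nrow(df), round(train_frac * nrow(df)))  # random 70/30 split
    preds <- fit_predict(df[idx, ], df[-idx, ])
    mean(preds == df[[response]][-idx])                      # hold-out accuracy
  })
}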

Comparator functions are loaded from classification_model_helpers.R, which also contains their documentation. Make sure this file is in the same directory as this document.

Data sets used:

- ISLR::Auto
- ISLR::Weekly
- mlbench::Sonar
- mlbench::PimaIndiansDiabetes2

library(tidyverse)
library(MASS)
select <- dplyr::select  # MASS masks dplyr::select(), so restore dplyr's version
library(boot)
library(ISLR)
library(mlbench)
library(GGally)
source(file.path(getwd(), 'classification_model_helpers.R'))

The primary objective here is a brief overview of comparing classification methods, so optimising k for knn is not considered and a nominal k = 5 is used for all examples. A method for optimising k can be found in knn.Rmd.

k_test <- 5

ISLR::Auto

auto_preds is derived from the Auto data set (a larger data set in the same vein as mtcars) and adds a new feature, mpg01, which is 1 when a car's mpg is above the median and 0 otherwise.

auto_preds <- Auto %>%
  mutate(mpg01 = ifelse(mpg > median(mpg), 1, 0) %>% as.factor()) %>% 
  select(-name, -mpg)

All remaining predictors are used to classify mpg01 (mpg and name having been dropped above). At around n_trials = 500, the glm (logistic regression), lda and qda methods display similar accuracy distributions, which is to be expected given all three classify via estimated probabilities. knn appears to give slightly better accuracy.

compare_class_method(auto_preds, "mpg01", ".", n_trials=500, k=k_test)
## [[1]]
## # A tibble: 4 x 2
##   method mean_acc
##   <chr>     <dbl>
## 1 glm       0.900
## 2 knn       0.913
## 3 lda       0.910
## 4 qda       0.905
## 
## [[2]]

Next, my CV approach is compared with leave-one-out cross-validation (LOOCV) and k-fold CV on the Auto data set, classifying mpg01 from cylinders only.

compare_class_method(auto_preds, "mpg01", "cylinders", n_trials=200, k=k_test)
## [[1]]
## # A tibble: 4 x 2
##   method mean_acc
##   <chr>     <dbl>
## 1 glm       0.906
## 2 knn       0.914
## 3 lda       0.902
## 4 qda       0.906
## 
## [[2]]

# LOOCV approximate accuracy rate (see the note on cv.glm's default cost below)
1 - cv.glm(auto_preds, glm(mpg01 ~ cylinders, "binomial", auto_preds))$delta[1] 
## [1] 0.9107309
# 10-fold CV approximate accuracy rate
1 - cv.glm(auto_preds, glm(mpg01 ~ cylinders, "binomial", auto_preds), K=10)$delta[1] 
## [1] 0.9107579
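Note that cv.glm's default cost is the average squared error between the 0/1 response and the fitted probabilities, so the values above only approximate an accuracy rate. For a true misclassification rate, pass a cost function like the one suggested in ?cv.glm:

# Count a prediction as wrong when the fitted probability falls on the
# wrong side of 0.5 (cost function from the cv.glm help page)
cost_misclass <- function(r, pi) mean(abs(r - pi) > 0.5)
glm_fit <- glm(mpg01 ~ cylinders, family = "binomial", data = auto_preds)
1 - cv.glm(auto_preds, glm_fit, cost = cost_misclass, K = 10)$delta[1]  # 10-fold accuracy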

ISLR::Weekly

The Weekly data set consists of "Weekly percentage returns for the S&P 500 stock index between 1990 and 2010". Direction is Up or Down according to whether the market had a positive or negative return in a given week; here it is classified using only Lag2, the percentage return two weeks previous.

compare_class_method(Weekly, "Direction", "Lag2", n_trials=100, k=k_test)
## [[1]]
## # A tibble: 4 x 2
##   method mean_acc
##   <chr>     <dbl>
## 1 glm       0.552
## 2 knn       0.883
## 3 lda       0.555
## 4 qda       0.549
## 
## [[2]]

# null accuracy: the proportion of weeks where Direction is Up
Weekly %>% mutate(Up = "Up", Compare = Direction == Up) %>% summarise(n(),mean(Compare))
##    n() mean(Compare)
## 1 1089     0.5555556

glm, lda and qda perform no better than the null accuracy of always predicting Up, since the distributions of Lag2 for the two classes overlap almost completely, as shown below.

ggplot(Weekly, aes(Lag2, fill=Direction)) + geom_density(alpha=0.5)

knn performs much better: as the jitter plot below illustrates, pockets of like Direction values occur along Lag2, so test points can be classified by their proximity to similar neighbours.

ggplot(Weekly, aes(x=0, Lag2, col=Direction)) + geom_jitter()
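For reference, a single 70/30 hold-out run of knn on Lag2 alone looks roughly like the sketch below (using class::knn directly; the comparator automates this over n_trials random splits, so a single split's accuracy will vary):

library(class)  # provides knn()
set.seed(1)
train_idx <- sample(nrow(Weekly), round(0.7 * nrow(Weekly)))
train_x   <- Weekly[train_idx, "Lag2", drop = FALSE]
test_x    <- Weekly[-train_idx, "Lag2", drop = FALSE]
pred <- knn(train_x, test_x, cl = Weekly$Direction[train_idx], k = k_test)
mean(pred == Weekly$Direction[-train_idx])  # accuracy for this one split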

MASS::Boston

# Boston_df <- Boston %>% mutate(chas=as.factor(chas), rad=as.factor(rad))
# compare_class_method(Boston_df, "chas", "indus", n_trials=100, k=k_test)

mlbench::Sonar

Class in the Sonar data set records whether an object is metal (M) or rock (R), given 60 numeric sonar readings (each between 0 and 1) across different energy bands.

data("Sonar")
colnames(Sonar)
##  [1] "V1"    "V2"    "V3"    "V4"    "V5"    "V6"    "V7"    "V8"    "V9"   
## [10] "V10"   "V11"   "V12"   "V13"   "V14"   "V15"   "V16"   "V17"   "V18"  
## [19] "V19"   "V20"   "V21"   "V22"   "V23"   "V24"   "V25"   "V26"   "V27"  
## [28] "V28"   "V29"   "V30"   "V31"   "V32"   "V33"   "V34"   "V35"   "V36"  
## [37] "V37"   "V38"   "V39"   "V40"   "V41"   "V42"   "V43"   "V44"   "V45"  
## [46] "V46"   "V47"   "V48"   "V49"   "V50"   "V51"   "V52"   "V53"   "V54"  
## [55] "V55"   "V56"   "V57"   "V58"   "V59"   "V60"   "Class"

knn performs better than glm or lda. qda is excluded because some training classes are too small: estimating a separate covariance matrix per class requires more observations in each class than the 60 predictors.
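A quick check of the class sizes makes this concrete: a 70% training split leaves each class only marginally above 60 observations, so some random splits fail for qda.

table(Sonar$Class)  # class counts in the full data (208 rows in total)
ncol(Sonar) - 1     # 60 predictors: qda needs more than 60 observations per class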

compare_class_method(Sonar, "Class",  methods=c("glm", "lda", "knn"), n_trials = 100)
## [[1]]
## # A tibble: 3 x 2
##   method mean_acc
##   <chr>     <dbl>
## 1 glm       0.705
## 2 knn       0.855
## 3 lda       0.727
## 
## [[2]]

The pairplot below illustrates why knn performs better, using two example predictors (V4 and V37). The density plots on the diagonal show how much the classes overlap in each predictor alone, whilst the scatter plot shows that clusters of metal and rock points are nevertheless distinguishable.

ggpairs(Sonar, aes(col=Class, alpha=0.5), columns = c("V4", "V37"))

mlbench::PimaIndiansDiabetes2

This data set classifies whether a person has diabetes using various measures such as age and body mass index (bmi, coded as mass).

data("PimaIndiansDiabetes2")
compare_class_method(PimaIndiansDiabetes2 %>% na.omit(), "diabetes", k=k_test, n_trials = 200)
## [[1]]
## # A tibble: 4 x 2
##   method mean_acc
##   <chr>     <dbl>
## 1 glm       0.777
## 2 knn       0.754
## 3 lda       0.779
## 4 qda       0.772
## 
## [[2]]

On this data set, glm, lda and qda perform slightly better than knn. This is likely due to the influence of glucose, shown below, which on its own enables a decent prediction: the higher the plasma glucose concentration, the more likely a patient is to test positive.

ggpairs(PimaIndiansDiabetes2 %>% na.omit(), aes(col=diabetes, alpha=0.5))
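The full pairplot is busy, so isolating glucose (the same data, just one panel of the above) makes the separation between the classes clearer:

ggplot(na.omit(PimaIndiansDiabetes2), aes(glucose, fill = diabetes)) +
  geom_density(alpha = 0.5)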

The summary of the glm (logistic regression) fit indeed shows that glucose is the most significant predictor.

summary(glm(diabetes ~ ., family = "binomial", PimaIndiansDiabetes2))
## 
## Call:
## glm(formula = diabetes ~ ., family = "binomial", data = PimaIndiansDiabetes2)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7823  -0.6603  -0.3642   0.6409   2.5612  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -1.004e+01  1.218e+00  -8.246  < 2e-16 ***
## pregnant     8.216e-02  5.543e-02   1.482  0.13825    
## glucose      3.827e-02  5.768e-03   6.635 3.24e-11 ***
## pressure    -1.420e-03  1.183e-02  -0.120  0.90446    
## triceps      1.122e-02  1.708e-02   0.657  0.51128    
## insulin     -8.253e-04  1.306e-03  -0.632  0.52757    
## mass         7.054e-02  2.734e-02   2.580  0.00989 ** 
## pedigree     1.141e+00  4.274e-01   2.669  0.00760 ** 
## age          3.395e-02  1.838e-02   1.847  0.06474 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 498.10  on 391  degrees of freedom
## Residual deviance: 344.02  on 383  degrees of freedom
##   (376 observations deleted due to missingness)
## AIC: 362.02
## 
## Number of Fisher Scoring iterations: 5

Future considerations:

- This is a very brief look at various classification methods. For a start, k, the number of neighbours in knn, could be optimised, as could the probability cut-off for logistic regression, lda and qda using the area under the ROC curve; a sketch of the cut-off idea follows.
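As a sketch of the cut-off idea, assuming the pROC package (not loaded above): fit the Pima logistic model, then pick the threshold on the fitted probabilities that maximises Youden's J rather than defaulting to 0.5.

library(pROC)  # assumption: pROC is installed; it is not used elsewhere here
pima <- na.omit(PimaIndiansDiabetes2)
fit  <- glm(diabetes ~ ., family = "binomial", data = pima)
roc_obj <- roc(response = pima$diabetes, predictor = fitted(fit))
# "best" picks the threshold maximising sensitivity + specificity - 1 (Youden's J)
coords(roc_obj, "best", ret = c("threshold", "sensitivity", "specificity"))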