Out of the many supervised learning classification methods, LDA (Linear Discriminant Analysis), QDA (Quadratic Discriminant Analysis), NB (Naive Bayes), and KNN (K-Nearest Neighbors) are studied in this project using the classic Palmer Penguins dataset.
LDA and its cousin QDA both assume that observations from each class are drawn from a Gaussian distribution. LDA approximates the Bayes classifier by estimating \(\pi_k\), \(\mu_k\), and a shared \(\sigma^2\), and assigns an observation to the class for which the discriminant function is largest (which, as the name implies, is a linear function of x, resulting in a linear decision boundary). In the case of p > 1 predictors, the LDA classifier assumes that the observations in the kth class are drawn from a multivariate Gaussian distribution \(N(\mu_k, \Sigma)\) with a covariance matrix \(\Sigma\) common to all classes. Unlike LDA, QDA assumes that each class has its own covariance matrix; in other words, an observation from the kth class is assumed to be drawn from \(N(\mu_k, \Sigma_k)\). With QDA, x appears quadratically in the discriminant function, hence the decision boundary is quadratic. LDA tends to perform better than QDA when there are relatively few observations and reducing variance is crucial. In contrast, QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern, or if the assumption of a common covariance matrix for the K classes is clearly untenable.
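For reference, with p predictors and a shared covariance matrix \(\Sigma\), the discriminant function that LDA maximizes over classes k is linear in x:

\[
\delta_k(x) = x^{T} \Sigma^{-1} \mu_k - \frac{1}{2} \mu_k^{T} \Sigma^{-1} \mu_k + \log \pi_k
\]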
Naive Bayes (NB) is a fast algorithm that performs well with small amounts of training data and scales well to large datasets. It relies on the often-faulty assumption of equally important and independent features, which results in biased posterior probabilities. What is usually needed, however, is not a propensity (exact posterior probability) for each record that is accurate in absolute terms but just a reasonably accurate rank ordering of propensities. NB can also be expressed as a generalized additive model.
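Concretely, with \(f_{kj}\) denoting the within-class density of the jth predictor in class k, the independence assumption factors the posterior into a product of one-dimensional densities:

\[
\Pr(Y = k \mid X = x) = \frac{\pi_k \prod_{j=1}^{p} f_{kj}(x_j)}{\sum_{l=1}^{K} \pi_l \prod_{j=1}^{p} f_{lj}(x_j)}
\]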
K-Nearest Neighbors (KNN) classifies an observation by assigning it to the majority class among its K nearest neighbors, as formalized below. KNN dominates LDA when the decision boundary is highly non-linear, but because it is non-parametric and estimates no coefficients, it does not tell us which predictors are important for prediction. This method also requires standardization of quantitative predictors in order to avoid scale dominance.
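Given a positive integer K and a test observation \(x_0\), KNN estimates the conditional probability of class j as the fraction of the K nearest neighbors \(\mathcal{N}_0\) whose response equals j, and classifies \(x_0\) to the class with the largest estimated probability:

\[
\Pr(Y = j \mid X = x_0) = \frac{1}{K} \sum_{i \in \mathcal{N}_0} I(y_i = j)
\]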
The penguin dataset comprises 344 observations of 8 variables, 5 of which are numeric and 3 of which are qualitative. The dataset is mostly complete, with just a few observations containing missing values that will need to be handled.
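The summary below appears to come from the skimr package; a call like the following would reproduce it (an assumption based on the output format):

# Assumed call producing the data summary shown below
library(skimr)
skim(data)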
| Data summary | |
|---|---|
| Name | data |
| Number of rows | 344 |
| Number of columns | 8 |
| Column type frequency: | |
| factor | 3 |
| numeric | 5 |
| Group variables | None |
Variable type: factor
| skim_variable | n_missing | complete_rate | ordered | n_unique | top_counts |
|---|---|---|---|---|---|
| species | 0 | 1.00 | FALSE | 3 | Ade: 152, Gen: 124, Chi: 68 |
| island | 0 | 1.00 | FALSE | 3 | Bis: 168, Dre: 124, Tor: 52 |
| sex | 11 | 0.97 | FALSE | 2 | mal: 168, fem: 165 |
Variable type: numeric
| skim_variable | n_missing | complete_rate | mean | sd | p0 | p25 | p50 | p75 | p100 | hist |
|---|---|---|---|---|---|---|---|---|---|---|
| bill_length_mm | 2 | 0.99 | 43.92 | 5.46 | 32.1 | 39.23 | 44.45 | 48.5 | 59.6 | ▃▇▇▆▁ |
| bill_depth_mm | 2 | 0.99 | 17.15 | 1.97 | 13.1 | 15.60 | 17.30 | 18.7 | 21.5 | ▅▅▇▇▂ |
| flipper_length_mm | 2 | 0.99 | 200.92 | 14.06 | 172.0 | 190.00 | 197.00 | 213.0 | 231.0 | ▂▇▃▅▂ |
| body_mass_g | 2 | 0.99 | 4201.75 | 801.95 | 2700.0 | 3550.00 | 4050.00 | 4750.0 | 6300.0 | ▃▇▆▃▂ |
| year | 0 | 1.00 | 2008.03 | 0.82 | 2007.0 | 2007.00 | 2008.00 | 2009.0 | 2009.0 | ▇▁▇▁▇ |
The target variable of interest is the species of penguins, which are categorized into three groups: Adelie, Gentoo and Chinstrap penguins.
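The enumeration below can be obtained with a call such as the following (a sketch; the exact call is not shown in the source):

unique(data$species)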
## [1] Adelie Gentoo Chinstrap
## Levels: Adelie Chinstrap Gentoo
From this plot, we can make a few key observations: Gentoo penguins appear only on Biscoe Island and Chinstrap penguins only on Dream Island, while Adelie penguins are found on all three islands. These island observations are valuable information for differentiating penguin species.
However, the sex of the penguins does not offer much information, as the sex ratio is roughly even across all species. We can also note a few missing observations labeled as NA.
We noted from the data summary above that 11 observations were missing the sex variable. There is also no reason to believe that the year an observation was recorded would have any impact on penguin morphology, and we are not attempting any time-series modeling, so we drop year from our predictor variables. Finally, two observations are missing all body measurements, so those rows are dropped as well.
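A minimal cleaning sketch consistent with the counts used below (dropping all incomplete rows leaves 344 - 11 = 333 observations; excluding island as well as year is an assumption made here to match the six-column data the models use):

library(palmerpenguins)
library(dplyr)
library(tidyr)
data <- penguins %>%
  tidyr::drop_na() %>%            # removes all 11 incomplete rows, including the 2 missing every measurement
  dplyr::select(-year, -island)   # island exclusion assumed, to match the models below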
When looking at body measurements, we see that Adelie and Chinstrap penguins largely overlap except in bill_length, where Adelie stands out from the other two species. This suggests that we might be able to use bill_depth, body_mass, and flipper_length to differentiate the Gentoo penguins from the other species, and bill_length to separate the Adelie penguins.
Using the featurePlot function from the caret package, we can easily display data distributions, such as a scatter plot matrix similar to the one above.
We see in the univariate feature plots below that the data is approximately normally distributed. This is an important assumption of LDA and QDA.
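A call along these lines would produce such univariate density plots (a sketch; the exact options used are not shown in the source):

library(caret)
# Per-class density plots for each numeric predictor
featurePlot(x = data[, c("bill_length_mm", "bill_depth_mm",
                         "flipper_length_mm", "body_mass_g")],
            y = data$species,
            plot = "density",
            scales = list(x = list(relation = "free"),
                          y = list(relation = "free")),
            auto.key = list(columns = 3))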
Taking a look at the correlation matrix below, we can make a few observations, notably that flipper_length is highly positively correlated with body_mass, which makes sense given that larger penguins should have larger flippers. The other correlations are less obvious to interpret. Given that the dataset only contains a few predictors, we choose not to exclude any variables based on multicollinearity at this time.
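The matrix itself can be computed directly (a sketch):

# Pairwise correlations among the numeric predictors
round(cor(data[, c("bill_length_mm", "bill_depth_mm",
                   "flipper_length_mm", "body_mass_g")]), 2)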
The data is split 80%/20% into training and testing sets. The test set contains 65 observations.
set.seed(622)
trainIndex <- createDataPartition(data$species, p = .8, list = FALSE, times = 1)
training <- data[ trainIndex,]
testing <- data[-trainIndex,]
dim(testing)
## [1] 65 6
LDA assumes that the predictors follow a multivariate Gaussian distribution with a common covariance matrix. Based on the distribution plots above, LDA seems to be a good fit.
By the proportion of trace, we can see that the first linear discriminant, LD1, achieves 87.2% of the separation and the second, LD2, 12.8%.
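The fit below corresponds to the Call shown in the output; the object names are assumptions carried through the later chunks:

library(MASS)
lda.fit <- lda(species ~ ., data = training)
lda.pred <- predict(lda.fit, testing)  # lda.pred$class used for the confusion matrix below
lda.fit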
## Call:
## lda(species ~ ., data = training)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Group means:
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sexmale
## Adelie 38.76325 18.37521 189.4786 3688.675 0.4957265
## Chinstrap 48.84545 18.44182 196.0364 3757.727 0.4727273
## Gentoo 47.73125 15.03125 217.5208 5104.948 0.5104167
##
## Coefficients of linear discriminants:
## LD1 LD2
## bill_length_mm 0.132134680 -0.413367283
## bill_depth_mm -0.937321597 -0.203363198
## flipper_length_mm 0.088727372 0.011625232
## body_mass_g 0.001439478 0.001402053
## sexmale -0.600470011 0.867736702
##
## Proportion of trace:
## LD1 LD2
## 0.8724 0.1276
The LDA model performed very well, with 100% accuracy on the test set. Another iteration of this model that omitted the sex variable yielded worse performance, with 96.9% accuracy and 2 misclassified penguins. This shows that sex has some predictive power and should be included in the model.
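The sex-omitted variant mentioned above could be fit as follows (a sketch; lda.nosex is a hypothetical name):

lda.nosex <- lda(species ~ . - sex, data = training)
mean(predict(lda.nosex, testing)$class == testing$species)  # the 96.9% figure quoted above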
scores <- data.frame()
cm.lda <- confusionMatrix(lda.pred$class, testing$species)
lda.acc <- cm.lda[[3]][1]
scores <-rbind(scores, data.frame(model="LDA", accuracy=lda.acc))
cm.lda
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 29 0 0
## Chinstrap 0 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9448, 1)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 1.0 1.0000
## Specificity 1.0000 1.0 1.0000
## Pos Pred Value 1.0000 1.0 1.0000
## Neg Pred Value 1.0000 1.0 1.0000
## Prevalence 0.4462 0.2 0.3538
## Detection Rate 0.4462 0.2 0.3538
## Detection Prevalence 0.4462 0.2 0.3538
## Balanced Accuracy 1.0000 1.0 1.0000
The partition plot below helps to visualize LDA. The observations that end up misclassified are shown in red. In the case of LDA, the decision boundary is linear.
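A partition plot like the one described can be drawn with the klaR package (a sketch; partimat accepts only numeric predictors, which yields the six pairwise panels referenced later):

library(klaR)
partimat(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
         data = training, method = "lda")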
Unlike LDA, QDA assumes that each class has its own covariance matrix. As a result, the decision boundary is quadratic.
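As with LDA, the fit matches the Call shown in the output below:

qda.fit <- qda(species ~ ., data = training)
qda.fit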
## Call:
## qda(species ~ ., data = training)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Group means:
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sexmale
## Adelie 38.76325 18.37521 189.4786 3688.675 0.4957265
## Chinstrap 48.84545 18.44182 196.0364 3757.727 0.4727273
## Gentoo 47.73125 15.03125 217.5208 5104.948 0.5104167
The QDA model also performed well, with 98.5% accuracy and only 1 misclassified observation. The QDA model performed the same when the categorical variable sex was omitted.
qda.pred <- predict(qda.fit, testing)
cm.qda <- confusionMatrix(qda.pred$class, testing$species)
qda.acc <- cm.qda[[3]][1]
scores <-rbind(scores, data.frame(model="QDA", accuracy=qda.acc))
cm.qda
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 0 0
## Chinstrap 1 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9846
## 95% CI : (0.9172, 0.9996)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9759
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 1.0000 1.0000
## Specificity 1.0000 0.9808 1.0000
## Pos Pred Value 1.0000 0.9286 1.0000
## Neg Pred Value 0.9730 1.0000 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.2000 0.3538
## Detection Prevalence 0.4308 0.2154 0.3538
## Balanced Accuracy 0.9828 0.9904 1.0000
Unlike the LDA partition plot, we see below that the decision boundaries are quadratic. Note in particular that the curvature of the pink region in the sixth subplot limits the number of Chinstrap penguins misclassified as Gentoo better than LDA did in the corresponding plot. Even so, most QDA decision boundaries remain nearly linear.
The Naive Bayes classifier assumes that all features are equally important and independent, which is often not the case and may introduce some bias. However, the independence assumption simplifies the computations by turning conditional probabilities into products of probabilities. Here we do not need the exact posterior probabilities, only which class is most likely.
The classification result for Naive Bayes on the test set yielded 96.9% accuracy and two misclassified penguins.
features <- setdiff(names(training), "species")
x <- training[, features]
y <- training$species
# naiveBayes() comes from the e1071 package
library(e1071)
model.naive <- naiveBayes(x = x, y = y, laplace = 1)
result.naive <- predict(model.naive, testing %>% dplyr::select(-species))
# Make confusion matrix
cm.naive <- confusionMatrix(result.naive, testing$species)
nb.acc <- cm.naive[[3]][1]
scores <- rbind(scores, data.frame(model = "NB", accuracy = nb.acc))
cm.naive
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 1 0
## Chinstrap 1 12 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9692
## 95% CI : (0.8932, 0.9963)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9516
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 0.9231 1.0000
## Specificity 0.9722 0.9808 1.0000
## Pos Pred Value 0.9655 0.9231 1.0000
## Neg Pred Value 0.9722 0.9808 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.1846 0.3538
## Detection Prevalence 0.4462 0.2000 0.3538
## Balanced Accuracy 0.9689 0.9519 1.0000
The KNN algorithm requires minor data preprocessing. First, factor predictors should be converted to numeric. Second, because KNN uses the distance between points to determine their classification, the predictors must be on comparable scales; here we pass c("center", "scale") to the preProcess argument of caret's train function to standardize each variable. The data is then split 80%/20% into training and testing sets; the test set contains 65 observations and the training set 268.
# Processing
penguins_knn <- data
penguins_knn$sex <- as.numeric(penguins_knn$sex)-1 # recode as 1 or 0
# Data Partitioning
knn_training <- penguins_knn[trainIndex,]
knn_testing <- penguins_knn[-trainIndex,]

10-fold cross-validation is performed on the training data to determine the optimal parameter k for our model. The resulting accuracy for each value of k is displayed and plotted below. The maximum accuracy is reached at k = 3, which gains up to a full percentage point in cross-validation accuracy over models with slightly more or fewer neighbors.
trControl <- trainControl(method = "cv",
number = 10)
knn.fit <- train(species ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:10),
trControl = trControl,
preProcess = c("center","scale"),
metric = "Accuracy",
      data = knn_training)
## k-Nearest Neighbors
##
## 268 samples
## 5 predictor
## 3 classes: 'Adelie', 'Chinstrap', 'Gentoo'
##
## Pre-processing: centered (5), scaled (5)
## Resampling: Cross-Validated (10 fold)
## Summary of sample sizes: 243, 241, 242, 240, 242, 241, ...
## Resampling results across tuning parameters:
##
## k Accuracy Kappa
## 1 0.9886040 0.9820817
## 2 0.9813289 0.9706151
## 3 0.9924501 0.9880238
## 4 0.9851750 0.9764033
## 5 0.9813289 0.9703987
## 6 0.9850326 0.9763967
## 7 0.9851750 0.9764033
## 8 0.9851750 0.9764033
## 9 0.9851750 0.9764033
## 10 0.9851750 0.9764033
##
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was k = 3.
The evaluation of the tuned KNN model on the testing data reveals that the model was able to classify species with perfect accuracy. However, 100% prediction accuracy is rare in practice; this model benefited from fairly clean class separations and limited overlap in the original dataset.
knnPredict <- predict(knn.fit, newdata = knn_testing)
cm.knn <- confusionMatrix(knnPredict, knn_testing$species)
knn.acc <- cm.knn[[3]][1]
scores <-rbind(scores, data.frame(model="KNN", accuracy=knn.acc))
cm.knn
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 29 0 0
## Chinstrap 0 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9448, 1)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 1.0 1.0000
## Specificity 1.0000 1.0 1.0000
## Pos Pred Value 1.0000 1.0 1.0000
## Neg Pred Value 1.0000 1.0 1.0000
## Prevalence 0.4462 0.2 0.3538
## Detection Rate 0.4462 0.2 0.3538
## Detection Prevalence 0.4462 0.2 0.3538
## Balanced Accuracy 1.0000 1.0 1.0000
The data frame below summarizes the model accuracies from above. LDA and KNN were the best-performing models on the test set. However, since the test set comprised only 65 observations, we can expand the evaluation with alternative train/test splits.
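| model | accuracy |
|---|---|
| LDA | 1.0000 |
| QDA | 0.9846 |
| NB | 0.9692 |
| KNN | 1.0000 |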
The function below is created to test our models from above on more observations. By iterating over 100 train/test splits of the data, we can evaluate the performance of the models on more unseen data. Note that new models are fitted to each new split of the data.
set.seed(25)
sim_test_set <- function() {
sim_scores <- data.frame()
features <- setdiff(names(training), "species")
for (i in 1:100) {
trainIndex <- createDataPartition(data$species, p = .8, list = FALSE, times = 1)
training <- data[ trainIndex,]
testing <- data[-trainIndex,]
# LDA
lda.fit <- lda(species ~ ., data=training)
lda.pred <- predict(lda.fit, testing)
cm.lda <- confusionMatrix(lda.pred$class, testing$species)
lda.acc <- cm.lda[[3]][1]
# QDA
qda.fit <- qda(species ~ ., data=training)
qda.pred <- predict(qda.fit, testing)
cm.qda <- confusionMatrix(qda.pred$class, testing$species)
qda.acc <- cm.qda[[3]][1]
# Naive Bayes
x <- training[,features]
y <- training$species
model.naive <- naiveBayes(x = x,y = y, laplace = 1)
result.naive <- predict(model.naive, testing %>% dplyr::select(-species))
cm.naive <- confusionMatrix(result.naive, testing$species)
nb.acc <- cm.naive[[3]][1]
# KNN
training$sex <- as.numeric(training$sex)-1 # recode as 1 or 0
testing$sex <- as.numeric(testing$sex)-1 # recode as 1 or 0
knn.fit <- train(species ~ .,
method = "knn",
tuneGrid = expand.grid(k = 1:10),
trControl = trControl,
preProcess = c("center","scale"),
metric = "Accuracy",
data = training)
knn.pred <- predict(knn.fit, newdata = testing)
cm.knn <- confusionMatrix(knn.pred, testing$species)  # evaluate this split's predictions (knn.pred), not the earlier knnPredict
knn.acc <- cm.knn[[3]][1]
sim_scores <-rbind(sim_scores, data.frame(lda=lda.acc, qda=qda.acc, nb=nb.acc, knn=knn.acc))
}
return(sim_scores)
}

The performance is summarized in the boxplot above. While KNN and LDA performed best on the original test set, we see here that over 100 data splits KNN results in perfect classification on every iteration and that the LDA and QDA models have nearly identical performance. The Naive Bayes model tends to have lower accuracy and a longer lower tail than LDA/QDA.
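For completeness, a run producing that summary might look like the following sketch (the boxplot itself is not reproduced here; the labels are assumptions):

sim_scores <- sim_test_set()
boxplot(sim_scores, names = c("LDA", "QDA", "NB", "KNN"),
        xlab = "Model", ylab = "Test accuracy")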
In general, LDA models are more stable than logistic regression (not studied here) when the classes are well separated and the features are approximately normal, which is the case here. However, by assuming that the predictors are multivariate Gaussian with a common covariance matrix, we limit the flexibility of the fit. When comparing partition plots between LDA and QDA, we saw that most QDA boundaries were nearly linear, showing that the LDA assumptions nearly hold. The correlation between bill depth and flipper length, as well as between body mass and flipper length, was an early indication that QDA might be a better fit; however, QDA provided no meaningful gain in performance over LDA. Between the two discriminant methods we therefore prefer the LDA model, which is more parsimonious since it has fewer covariance parameters to estimate.
While the Naive Bayes model is easy to implement, it likely suffers in our case from the assumption of feature independence. As stated above, some of the predictor variables were correlated, violating this assumption and introducing bias.
As stated previously, KNN is non-parametric and estimates no coefficients; its non-linear boundary affords it the most flexible fit, which in this case resulted in the best overall performance.