The penguin dataset contains 344 observations of 8 variables, 5 of which are numeric and 3 of which are qualitative. The dataset is mostly complete; only a few observations have missing values, and we remove those records before modeling.
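A minimal sketch of this cleaning step (the palmerpenguins package is assumed, and the year column appears to have been dropped before summarizing):
library(palmerpenguins)
library(dplyr)
penguins_df <- penguins %>%
  dplyr::select(-year) %>%  # keep species, island, sex and the four measurements
  na.omit()                 # remove the few records with missing values
summary(penguins_df)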
## species island bill_length_mm bill_depth_mm
## Adelie :146 Biscoe :163 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :123 1st Qu.:39.50 1st Qu.:15.60
## Gentoo :119 Torgersen: 47 Median :44.50 Median :17.30
## Mean :43.99 Mean :17.16
## 3rd Qu.:48.60 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## flipper_length_mm body_mass_g sex
## Min. :172 Min. :2700 female:165
## 1st Qu.:190 1st Qu.:3550 male :168
## Median :197 Median :4050
## Mean :201 Mean :4207
## 3rd Qu.:213 3rd Qu.:4775
## Max. :231 Max. :6300
The target variable of interest is the penguin species, which falls into three classes: Adelie, Chinstrap, and Gentoo.
## [1] Adelie Gentoo Chinstrap
## Levels: Adelie Chinstrap Gentoo
We have two categorical variables, island and sex, and four continuous variables: bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.
The island feature shows some interesting patterns: penguins on Torgersen island are always Adelie, Dream island has only Adelie and Chinstrap penguins, and Biscoe island has only Adelie and Gentoo penguins. So island is informative about species. However, we drop the categorical variables island and sex, since discriminant analysis requires continuous predictors.
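One quick way to check this breakdown is a cross-tabulation of island against species:
table(penguins_df$island, penguins_df$species)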
Now let's examine the continuous features by plotting them against each other.
Looking at the feature plot, bill_length plays an important role in distinguishing the species; bill_depth, flipper_length, and body_mass also contribute to the separation.
featurePlot(x=penguins_df[,2:5], y=penguins_df$species, plot="box",
            scales = list(y = list(relation="free"),
                          x = list(rot = 90)),
            layout = c(4, 1),
            auto.key = list(columns = 3))
From the boxplots above it looks like flipper_length and body_mass are probably correlated.
featurePlot(x=penguins_df[,2:5], y=penguins_df$species, plot="density",
            ## Pass in options to xyplot() to
            ## make it prettier
            scales = list(x = list(relation="free"),
                          y = list(relation="free")),
            adjust = 1.5,
            pch = "|",
            layout = c(4, 1),
            auto.key = list(columns = 3))
From the density plots, the features appear approximately normally distributed within each species. This is an important assumption of LDA and QDA.
Let’s find the correlation between the features.
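The matrix below can be computed directly from the four numeric columns, for example:
cor(penguins_df[, c("bill_length_mm", "bill_depth_mm",
                    "flipper_length_mm", "body_mass_g")])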
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm 1.0000000 -0.2286256 0.6530956 0.5894511
## bill_depth_mm -0.2286256 1.0000000 -0.5777917 -0.4720157
## flipper_length_mm 0.6530956 -0.5777917 1.0000000 0.8729789
## body_mass_g 0.5894511 -0.4720157 0.8729789 1.0000000
As observed above, flipper_length and body_mass are highly correlated (r ≈ 0.87). We keep body_mass, which is less correlated with the remaining predictors than flipper_length is, and drop flipper_length from the model.
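A sketch of that selection (penguins_final is the data frame used for the split below; dplyr assumed):
penguins_final <- penguins_df %>%
  dplyr::select(species, bill_length_mm, bill_depth_mm, body_mass_g)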
The data is split into training (80%) and testing (20%) sets; the test set contains 65 observations.
set.seed(622)
trainIndex <- createDataPartition(penguins_final$species, p = .8, list = FALSE, times = 1)
training <- penguins_final[ trainIndex,]
testing <- penguins_final[-trainIndex,]
dim(testing)
## [1] 65 4
LDA assumes that the predictors are multivariate Gaussian with a common covariance matrix across classes. From the distribution plots above, LDA seems to be a reasonable fit.
By the proportion of trace, the first linear discriminant LD1 achieves 83.74% of the separation and the second, LD2, 16.26%.
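The fit summarized below is produced by MASS::lda; a sketch of the call, using the object name model_lda that the prediction code further down assumes:
library(MASS)
model_lda <- lda(species ~ ., data = training)
model_lda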
## Call:
## lda(species ~ ., data = training)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Group means:
## bill_length_mm bill_depth_mm body_mass_g
## Adelie 38.81880 18.36752 3707.692
## Chinstrap 48.73091 18.51091 3738.182
## Gentoo 47.56562 15.01979 5088.281
##
## Coefficients of linear discriminants:
## LD1 LD2
## bill_length_mm 0.141524248 -0.408190182
## bill_depth_mm -1.044885950 -0.034633623
## body_mass_g 0.001989374 0.001903976
##
## Proportion of trace:
## LD1 LD2
## 0.8374 0.1626
The misclassification error rates on the training and test sets, respectively:
## [1] 0.007462687
## [1] 0.01538462
#table(predicted = model_lda.pred$class, actual = testing$species)
predictions_test <- predict(model_lda, testing)  # LDA class predictions on the test set
cm.lda <- confusionMatrix(predictions_test$class, testing$species)
cm.lda
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 0 0
## Chinstrap 1 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9846
## 95% CI : (0.9172, 0.9996)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9759
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 1.0000 1.0000
## Specificity 1.0000 0.9808 1.0000
## Pos Pred Value 1.0000 0.9286 1.0000
## Neg Pred Value 0.9730 1.0000 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.2000 0.3538
## Detection Prevalence 0.4308 0.2154 0.3538
## Balanced Accuracy 0.9828 0.9904 1.0000
From the confusion matrix we see that the LDA model reaches 98.46% accuracy on the test set; only one Adelie penguin is misclassified (as Chinstrap).
The partition plot below helps to visualize LDA. The observations that end up misclassified are shown in red. In the case of LDA, the decision boundary is linear.
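One way to draw such partition plots is a sketch like the following, assuming the klaR package:
library(klaR)
partimat(species ~ ., data = training, method = "lda")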
Unlike LDA, QDA assumes that each class has its own covariance matrix. As a result, the decision boundary is quadratic.
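The QDA fit summarized below uses MASS::qda in the same way, with the object name model_qda that the prediction code uses:
model_qda <- qda(species ~ ., data = training)
model_qda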
## Call:
## qda(species ~ ., data = training)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4365672 0.2052239 0.3582090
##
## Group means:
## bill_length_mm bill_depth_mm body_mass_g
## Adelie 38.81880 18.36752 3707.692
## Chinstrap 48.73091 18.51091 3738.182
## Gentoo 47.56562 15.01979 5088.281
qda_predictions <- predict(model_qda, testing)
cm.qda <- confusionMatrix(qda_predictions$class, testing$species)
#qda.acc <- cm.qda[[3]][1]
#scores <- rbind(scores, data.frame(model="QDA", accuracy=qda.acc))
cm.qda
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 28 0 0
## Chinstrap 1 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9846
## 95% CI : (0.9172, 0.9996)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9759
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9655 1.0000 1.0000
## Specificity 1.0000 0.9808 1.0000
## Pos Pred Value 1.0000 0.9286 1.0000
## Neg Pred Value 0.9730 1.0000 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4308 0.2000 0.3538
## Detection Prevalence 0.4308 0.2154 0.3538
## Balanced Accuracy 0.9828 0.9904 1.0000
The QDA model achieves the same 98.46% test accuracy as the LDA model. It seems that even with QDA, most decision boundaries remain nearly linear.
The Naive Bayes classifier assumes that the features are conditionally independent given the class (and, in this simple form, treats each feature as equally important), which is often not the case and may introduce some bias. However, the independence assumption simplifies computation by turning the joint conditional probability into a product of per-feature probabilities. We do not need the exact posterior probability, only which class is most likely.
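Concretely, for a class k and predictors x1, x2, x3, the classifier scores each class in proportion to P(k) * P(x1 | k) * P(x2 | k) * P(x3 | k) and picks the class with the largest score, so only the per-feature conditional densities need to be estimated.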
features <- setdiff(names(training), "species")
x <- training[,features]
y <- training$species
model.naive <- naiveBayes(x = x,y = y, laplace = 1)
result.naive <- predict(model.naive, testing %>% dplyr::select(-species))
# Make confusion matrix
cm.naive <- confusionMatrix(result.naive, testing$species)
nb.acc <- cm.naive[[3]][1]
cm.naive
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 27 0 0
## Chinstrap 2 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9692
## 95% CI : (0.8932, 0.9963)
## No Information Rate : 0.4462
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9522
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 0.9310 1.0000 1.0000
## Specificity 1.0000 0.9615 1.0000
## Pos Pred Value 1.0000 0.8667 1.0000
## Neg Pred Value 0.9474 1.0000 1.0000
## Prevalence 0.4462 0.2000 0.3538
## Detection Rate 0.4154 0.2000 0.3538
## Detection Prevalence 0.4154 0.2308 0.3538
## Balanced Accuracy 0.9655 0.9808 1.0000
The Naive Bayes classifier yields 96.9% accuracy on the test set, with two Adelie penguins misclassified as Chinstrap.
The LDA and QDA models are nearly identical and achieved the same accuracy. The Naive Bayes model tends to have lower accuracy and a longer lower tail than LDA.
In general, LDA models are more stable than logistic regression when the classes are well separated and the features are approximately normal, which is the case here. However, by assuming that the predictors are multivariate Gaussian with a common covariance matrix, we limit the flexibility of the fit. Comparing the partition plots for LDA and QDA, most QDA boundaries were nearly linear, showing that the LDA assumptions nearly hold. The correlation between bill depth and flipper length, as well as between body mass and flipper length, was an early indication that QDA might be a better fit. However, QDA provided no meaningful gain in performance over LDA, so we prefer the more parsimonious LDA model, which has fewer covariance parameters to estimate.
While the Naive Bayes model is easy to implement it likely suffers in our case from the assumption of feature independence. As stated above, some of the predictor variables were correlated, thereby violating the assumption and introducing bias.
Comments about our LDA model:
Prior probabilities of groups: the proportion of training observations in each group. For example, 43.7% of the training observations are in the Adelie group.
Group means: the group center of gravity, i.e. the mean of each variable within each group.
Coefficients of linear discriminants: the linear combinations of the predictor variables used to form the LDA decision rule. For example,
LD1 = 0.1415 * bill_length_mm - 1.0449 * bill_depth_mm + 0.00199 * body_mass_g
Similarly, LD2 = -0.4082 * bill_length_mm - 0.0346 * bill_depth_mm + 0.0019 * body_mass_g
The “proportion of trace” is the percentage of between-group separation achieved by each discriminant function. In our case LD1 provides 83.74% and LD2 provides 16.26% of the separation.
Using the function plot() produces plots of the linear discriminants, obtained by computing LD1 and LD2 for each of the training observations.
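For example, with the fitted LDA object named model_lda as above:
plot(model_lda)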