Introduction

This assignment analyzes the Palmer Penguins dataset using linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and a naive Bayes classifier. The dataset contains physical measurements and traits of penguins collected over the three-year period 2007-2009 from three closely related species, Gentoo, Chinstrap, and Adelie, in the Palmer Archipelago, Antarctica. In Section 1 we apply LDA, examine feature plots, and partition the data into training and test sets. In Section 2 we apply QDA and take a short digression into covariance comparison. In Section 3 we apply Naive Bayes. In Section 4 we compare all three methodologies and six models in the context of this dataset. We show that LDA outperforms the other two methods.

1 Linear Discriminant Analysis

1.1 Data Selection

Because this study uses the same dataset for purposes related to Assignment 1, I refer the reader to my exploratory data analysis, which is contained here. Minimal data wrangling was required, as only a handful of observations lacked a value for the feature sex.

Our dataset omits observations where the penguin's sex is unknown. It also omits the features year and island, because previous data analysis showed little time variation in the variables of interest.

## [1] 333   6

The final dataset has \(N=\) 333 observations and \(M=\) 6 columns. All variables are non-null across all \(N\) observations.

1.2 FeaturePlot Analysis

Three feature plots produced with the caret package provide the key observational insights that guide our modeling with discriminant analysis and Naive Bayes classifiers. We first use featurePlot to examine box plots of the quantitative variables across the three species.

We clearly observe that the medians and interquartile ranges of the three penguin species are well separated, though by different features for each species. By separation, I mean that the median and interquartile range of at least one class \(k\) are distinct and do not overlap those of the other classes. Specifically (the call producing these box plots is reproduced after the list):

  1. flipper_length_mm separates Gentoo from the other two species Adelie and Chinstrap.
  2. body_mass_g separates Gentoo from Adelie and Chinstrap.
  3. bill_length_mm separates Adelie from Chinstrap and Gentoo.
  4. bill_depth_mm separates Gentoo from Adelie and Chinstrap.
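For reference, the featurePlot call that produces these box plots (also listed in the Appendix) is:

caret::featurePlot(
  x = pc[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")],
  y = pc$species,
  plot = "box",
  scales = list(y = list(relation = "free"))  # free y-axis scale for each feature
)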

A density plot of the quantitative features tells essentially the same story as the box plots.

The biggest challenge revealed by the marginal distributions of the four features is the masking of Chinstrap penguins: no single variable separates Chinstrap from both of the other species.

An ellipse featurePlot shows, for each pair of input variables, the ellipse obtained by drawing the bivariate normal distribution with the same correlation as the subgroup. Each ellipse below is drawn at the 95 percent confidence level. The mathematical details, surprisingly, are undocumented in the caret reference and require checking the R source code.
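For reference, the ellipse plot is produced with the following call (also listed in the Appendix):

caret::featurePlot(
  x = pc[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")],
  y = pc$species,
  plot = "ellipse",
  auto.key = list(columns = 3)  # legend laid out in 3 columns, one per species
)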

In the ellipse plot, note the excellent separation of the ellipse centers for bill_length_mm versus bill_depth_mm. By contrast, the Adelie and Chinstrap ellipses almost coincide for body_mass_g versus bill_depth_mm.

The above feature plots suggest LDA and QDA ought to produce reasonably effective predictive models.

1.3 Partition the Data Sample

Before constructing the LDA models, we partition the data into training and test sets and pre-process the training data to center and standardize the quantitative features. We allocate 80% of the sample to the training set and 20% to the test set using the caret createDataPartition method. We then use the training data to estimate sample means and standard deviations and convert each training and test observation to its \(Z\)-score equivalent. These transformations improve model performance only slightly but are considered good practice for LDA, QDA, and Naive Bayes classification.

We chose the 80-20 split to be consistent with standard practice.
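The partitioning and standardization steps are sketched below; the full code appears in the Appendix. The key point is that preProcess is estimated on the training data only and then applied to both sets.

set.seed(123)  # reproducible split

# Stratified 80/20 split on species
training.individuals <- createDataPartition(pc$species, p = 0.8, list = FALSE)
train.data <- pc[ training.individuals, ]
test.data  <- pc[-training.individuals, ]

# Center and scale using training-set means and standard deviations only
preproc.parameter <- preProcess(train.data, method = c("center", "scale"))
train.transform   <- predict(preproc.parameter, train.data)
test.transform    <- predict(preproc.parameter, test.data)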

1.4 Model Building

We consider two LDA models. Both do an effective job of classifying penguins by species.

The first model, which I call lda_comp, uses all 5 features to predict the species of the penguin. The main reason I am uncomfortable relying fully on this model is the absence of clear documentation in the MASS package of how the linear discriminant analysis function lda handles categorical variables like sex.

The LDA methodology assumes that all input variables are continuous, Gaussian-distributed variables with a common covariance matrix across all classes. It is unclear how lda handles the categorical variable sex, as no documentation on this matter exists in the MASS reference manual.
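For reference, under these assumptions LDA assigns an observation \(x\) to the class \(k\) that maximizes the linear discriminant score

\[
\delta_k(x) \;=\; x^{T}\Sigma^{-1}\mu_k \;-\; \tfrac{1}{2}\,\mu_k^{T}\Sigma^{-1}\mu_k \;+\; \log \pi_k ,
\]

where \(\mu_k\) is the class mean vector, \(\Sigma\) is the common covariance matrix, and \(\pi_k\) is the prior probability of class \(k\). A 0/1 dummy variable such as sexmale does not satisfy the Gaussian assumption, which is the source of our discomfort with lda_comp.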

## Call:
## lda(species ~ ., data = train.transform)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4365672 0.2052239 0.3582090 
## 
## Group means:
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g   sexmale
## Adelie        -0.9555507     0.6009140        -0.7707877  -0.6092842 0.5213675
## Chinstrap      0.9088301     0.6564651        -0.3716006  -0.5901947 0.5454545
## Gentoo         0.6438935    -1.1084638         1.1522937   1.0806975 0.5104167
## 
## Coefficients of linear discriminants:
##                          LD1        LD2
## bill_length_mm     0.7462651 -2.5071107
## bill_depth_mm     -1.8835293 -0.3692413
## flipper_length_mm  1.0638644  0.4674442
## body_mass_g        1.2362143  0.9725483
## sexmale           -0.7182874  0.9342117
## 
## Proportion of trace:
##   LD1   LD2 
## 0.841 0.159

Consequently, we also consider a smaller model, lda4, with only the 4 continuous variables bill_length_mm, bill_depth_mm, flipper_length_mm, and body_mass_g.

## Call:
## lda(species ~ bill_length_mm + bill_depth_mm + body_mass_g + 
##     flipper_length_mm, data = train.transform)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4365672 0.2052239 0.3582090 
## 
## Group means:
##           bill_length_mm bill_depth_mm body_mass_g flipper_length_mm
## Adelie        -0.9555507     0.6009140  -0.6092842        -0.7707877
## Chinstrap      0.9088301     0.6564651  -0.5901947        -0.3716006
## Gentoo         0.6438935    -1.1084638   1.0806975         1.1522937
## 
## Coefficients of linear discriminants:
##                          LD1          LD2
## bill_length_mm     0.5415983 -2.408094172
## bill_depth_mm     -2.1000729  0.004472972
## body_mass_g        1.0280418  1.274291326
## flipper_length_mm  1.0940111  0.404513143
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8462 0.1538

We briefly comment on the model structure. Both models produce similar coefficients on the quantitative variables. LD1 places the greatest absolute magnitude on bill_depth_mm, with an opposing weight on bill_length_mm, while LD2 places the greatest absolute magnitude on bill_length_mm with near-zero weight on bill_depth_mm. This suggests that the dominant feature for discriminating penguin species is the contrast between bill_depth_mm and bill_length_mm. Body mass and flipper length play a secondary role but retain the same sign in both LD1 and LD2. Therefore, consistent with our visual inspection of the ellipse featurePlot, the models regard bill_depth_mm and bill_length_mm as the most helpful features for distinguishing species. Next, let's look at the model accuracy and fit statistics.
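To check this interpretation, one can inspect the discriminant scores directly; predict applied to an lda fit returns them in the component x. A minimal sketch, using the objects defined in the Appendix:

# Discriminant scores (LD1, LD2) for the test observations
lda_scores <- predict(model_lda_comp, test.transform)$x
head(lda_scores)

# Scatter plot of the species clusters in the (LD1, LD2) plane
plot(lda_scores, col = as.integer(test.transform$species), pch = 19,
     main = "Test data projected onto LD1 and LD2")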

1.5 Fit Statistics and Accuracy Rates

We first report the confusion matrix for lda_comp on the test data, followed by the one for lda4.

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        29         1      0
##   Chinstrap      0        12      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9172, 0.9996)
##     No Information Rate : 0.4462          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9757          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           0.9231        1.0000
## Specificity                 0.9722           1.0000        1.0000
## Pos Pred Value              0.9667           1.0000        1.0000
## Neg Pred Value              1.0000           0.9811        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4462           0.1846        0.3538
## Detection Prevalence        0.4615           0.1846        0.3538
## Balanced Accuracy           0.9861           0.9615        1.0000

The corresponding output for the lda4 model follows.

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        28         1      0
##   Chinstrap      1        12      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9692          
##                  95% CI : (0.8932, 0.9963)
##     No Information Rate : 0.4462          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9516          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9655           0.9231        1.0000
## Specificity                 0.9722           0.9808        1.0000
## Pos Pred Value              0.9655           0.9231        1.0000
## Neg Pred Value              0.9722           0.9808        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4308           0.1846        0.3538
## Detection Prevalence        0.4462           0.2000        0.3538
## Balanced Accuracy           0.9689           0.9519        1.0000

The models appear to be quite accurate. lda_comp achieves 98.5% accuracy and a kappa of 97.6% on the test data; lda4 comes close with 96.9% accuracy and a kappa of 95.2%. We conclude that adding the categorical variable sex improves model performance slightly. As the confusion matrices show, the only confusion arises in separating Adelie and Chinstrap penguins.
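The accuracy and kappa figures quoted above are taken from the overall component of the caret confusionMatrix objects, for example:

# Overall accuracy and kappa for the complete LDA model
cm_lda_comp$overall[c("Accuracy", "Kappa")]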

Next, we compare diagnostic charts of the lda_comp and lda4 models side by side below. When the classes are plotted on the LD1 and LD2 axes, the left-hand plot (lda_comp) shows slightly better separation of the Adelie and Chinstrap clusters than the right-hand plot (lda4).

Even though we remain slightly suspicious of lda_comp due to possible issues with the categorical variable, the larger model outperforms the smaller one.

Lastly, we examine the partition plot of the lda4 model.
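The partition plot is produced with klaR::partimat, which draws the pairwise LDA classification regions for the four continuous features (the same call appears in the Appendix):

partimat(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g,
         data = train.transform, method = "lda", main = "Partition Plot for LDA 4")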

2 Quadratic Discriminant Analysis

As we have already completed the preliminary steps of data cleansing and splitting the data into training and test sets, the same data are applied to the quadratic discriminant analysis.

Two questions that arise in using QDA deserve consideration:

  1. Do group-wise covariances differ sufficiently to justify the use of QDA over LDA? Do the \(4 \times 4\) covariance matrices differ by penguin species?

  2. Does MASS::qda handle categorical feature variables like sex correctly in estimating the discriminant? From a theoretical perspective, QDA does not support such variables.

We address the first question in the next section. Because we have no satisfactory answer to the second question, we build two QDA models, with and without sex as a feature.
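For reference, and relevant to the first question, QDA replaces the common covariance \(\Sigma\) with a class-specific \(\Sigma_k\) and assigns \(x\) to the class maximizing the quadratic discriminant score

\[
\delta_k(x) \;=\; -\tfrac{1}{2}\log\lvert \Sigma_k \rvert \;-\; \tfrac{1}{2}\,(x-\mu_k)^{T}\Sigma_k^{-1}(x-\mu_k) \;+\; \log \pi_k .
\]

If the \(\Sigma_k\) are essentially equal, QDA spends its extra parameters for little gain over LDA.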

2.1 Exploring Penguin Covariance

In this section, we argue that, while the species means differ, the species-level covariances of the features are broadly similar. We apply a visual method that explores all the pairwise covariance ellipses in 2D.

A paper by Friendly and Sigal argues that one can visualize a multidimensional covariance matrix by analyzing all of its pairwise covariance ellipses, because a \(p \times p\) covariance matrix corresponds to a \(p\)-dimensional ellipsoid. The R package heplots implements visualizations of these pairwise ellipse plots. We then decide by visual means whether the covariance matrices of the penguin species differ. See https://mattsigal.github.io/eqcov_supp/iris-ex.html for more details.

The first plot shows, for each pair of features, the ellipses for each species in a different color. Each ellipse is centered at its species mean and shows the covariance between the two variables. We see significant differences in the species means, but the thickness and length of the ellipses appear comparable.

The second plot below shows each ellipse re-centered at the pooled mean. This is the more important plot, since it is where problematic covariances would be detected. If any ellipse failed to overlap well with the pooled ellipse, that would suggest a violation of the LDA homoscedasticity assumption. Looking at the plot below, we see no evidence that the ellipses fail to overlap the pooled ellipse. Therefore, we argue that QDA is not needed to correct a violation of the LDA homoscedasticity assumption.
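For reference, the pooled-center comparison plot is produced with heplots::covEllipses as sketched below (the full call is in the Appendix). A formal test of equal covariance matrices, such as Box's M test via heplots::boxM, could complement the visual check, though we do not rely on it here.

# Covariance ellipses for all feature pairs, re-centered at the pooled mean
covEllipses(train.transform[, 2:5], train.transform$species,
            fill = c(rep(FALSE, 3), TRUE), fill.alpha = 0.1,
            variables = 1:4, center = TRUE, pooled = TRUE)

# Optional formal check (not run here):
# heplots::boxM(train.transform[, 2:5], train.transform$species)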

2.2 Model Building

The function qda from the MASS package allows us to fit QDA models. The two QDA models we build are qda_comp, the complete model with the 4 continuous variables and the categorical variable sex, and qda4, the model with just the 4 continuous variables. The approach follows the same lines as the LDA models. The qda_comp model specification is shown below:

## Call:
## qda(species ~ ., data = train.transform)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4365672 0.2052239 0.3582090 
## 
## Group means:
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g   sexmale
## Adelie        -0.9555507     0.6009140        -0.7707877  -0.6092842 0.5213675
## Chinstrap      0.9088301     0.6564651        -0.3716006  -0.5901947 0.5454545
## Gentoo         0.6438935    -1.1084638         1.1522937   1.0806975 0.5104167

The qda4 model is specified in the output below:

## Call:
## qda(species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + 
##     body_mass_g, data = train.transform)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4365672 0.2052239 0.3582090 
## 
## Group means:
##           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## Adelie        -0.9555507     0.6009140        -0.7707877  -0.6092842
## Chinstrap      0.9088301     0.6564651        -0.3716006  -0.5901947
## Gentoo         0.6438935    -1.1084638         1.1522937   1.0806975

Unfortunately, QDA model coefficients are not as straightforward to report.

2.3 QDA Fit Statistics and Accuracy Rates

The confusion matrix for qda_comp gives an accuracy of 96.92% and a kappa of 95.1%. By comparison, qda4 gives an accuracy of 95.38% and a kappa of 92.7%. The tables below show the statistics. These results imply that the larger model qda_comp has slightly stronger predictive power than qda4.

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        29         2      0
##   Chinstrap      0        11      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9692          
##                  95% CI : (0.8932, 0.9963)
##     No Information Rate : 0.4462          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.951           
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000           0.8462        1.0000
## Specificity                 0.9444           1.0000        1.0000
## Pos Pred Value              0.9355           1.0000        1.0000
## Neg Pred Value              1.0000           0.9630        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4462           0.1692        0.3538
## Detection Prevalence        0.4769           0.1692        0.3538
## Balanced Accuracy           0.9722           0.9231        1.0000

The qda4 model statistics are shown below.

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        28         2      0
##   Chinstrap      1        11      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9538         
##                  95% CI : (0.871, 0.9904)
##     No Information Rate : 0.4462         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.927          
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9655           0.8462        1.0000
## Specificity                 0.9444           0.9808        1.0000
## Pos Pred Value              0.9333           0.9167        1.0000
## Neg Pred Value              0.9714           0.9623        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4308           0.1692        0.3538
## Detection Prevalence        0.4615           0.1846        0.3538
## Balanced Accuracy           0.9550           0.9135        1.0000

We also produce partition plots to see where the QDA class boundaries fall in 2D. For some feature pairs, the classification clearly has a high error rate. For example, in the top rightmost plot below, showing flipper_length_mm versus bill_depth_mm, the red points mark observations that would be misclassified if we relied on only those two variables.

We omit the partition plot for the qda_comp model because it is identical to that of qda4 except for the categorical variable sex. For that variable, partition plots are uninformative because the data points are bunched up at the two values, male and female.

3 Naive Bayes

For the Naive Bayes classifiers, we consider two simple models using the klaR package implementation. We use the same transformed data sets and data splits as in the LDA and QDA models above. The Laplace correction is not really necessary here, since the only categorical variable, sex, has observations at both levels for every species; the correction can nevertheless be applied by setting the parameter fL to a nonzero number.

As before, we specify a 4-feature model nb4 that uses only the continuous feature variables, and a second model nb_comp that uses all variables, i.e., the 4 continuous features and the categorical variable sex. The model-fitting calls are sketched below.
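A sketch of the model-fitting calls, mirroring the code in the Appendix; the fL argument is left at its default of 0, and setting it to a positive value would apply the Laplace correction.

# Naive Bayes with the four continuous features only
nb4 <- klaR::NaiveBayes(species ~ bill_length_mm + bill_depth_mm +
                          body_mass_g + flipper_length_mm,
                        data = train.transform)

# Complete model: the four continuous features plus sex
nb_comp <- klaR::NaiveBayes(species ~ bill_length_mm + bill_depth_mm +
                              body_mass_g + flipper_length_mm + sex,
                            data = train.transform)

# Predictions on the held-out test set
pred_nb4     <- predict(nb4,     newdata = test.transform)
pred_nb_comp <- predict(nb_comp, newdata = test.transform)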

Here are the confusion matrix results for Naive Bayes on the nb4 model:

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        28         2      0
##   Chinstrap      1        11      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9538         
##                  95% CI : (0.871, 0.9904)
##     No Information Rate : 0.4462         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.927          
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9655           0.8462        1.0000
## Specificity                 0.9444           0.9808        1.0000
## Pos Pred Value              0.9333           0.9167        1.0000
## Neg Pred Value              0.9714           0.9623        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4308           0.1692        0.3538
## Detection Prevalence        0.4615           0.1846        0.3538
## Balanced Accuracy           0.9550           0.9135        1.0000

Here are the confusion matrix results for Naive Bayes on the nb_comp model:

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        28         2      0
##   Chinstrap      1        11      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                          
##                Accuracy : 0.9538         
##                  95% CI : (0.871, 0.9904)
##     No Information Rate : 0.4462         
##     P-Value [Acc > NIR] : < 2.2e-16      
##                                          
##                   Kappa : 0.927          
##                                          
##  Mcnemar's Test P-Value : NA             
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9655           0.8462        1.0000
## Specificity                 0.9444           0.9808        1.0000
## Pos Pred Value              0.9333           0.9167        1.0000
## Neg Pred Value              0.9714           0.9623        1.0000
## Prevalence                  0.4462           0.2000        0.3538
## Detection Rate              0.4308           0.1692        0.3538
## Detection Prevalence        0.4615           0.1846        0.3538
## Balanced Accuracy           0.9550           0.9135        1.0000

We conclude that the Naive Bayes models are quite good. nb4 has an accuracy of 95.4% and a kappa of 92.7%. The nb_comp model produces the same confusion matrix, so all of its performance statistics are identical. Both models fall short on a few Chinstrap penguins, which are misclassified as Adelie penguins.

4 Model Comparison and Concluding Remarks

We summarize all of the study's models in a single table below showing their methodology, model name, number of variables, accuracy, and kappa. On the basis of kappa, the LDA methodology uniformly beats the QDA and Naive Bayes methods. The models all fail on the same data issue: the confusion of a small handful of Chinstrap penguins with Adelie penguins. We explain this confusion as masking of the Chinstrap penguins, a common problem in classification.

The methodologies all perform reasonably well in absolute terms given the limited number of variables. However, the lda_comp model wins with 98.5% accuracy and 97.6% kappa. The only reason not to choose it is uncertainty about how the software implementation handles the categorical variable.

As I discussed earlier, the MASS::lda function does not adequately document how categorical variables are handled by the software. Clearly, it is making sensible decisions behind the scenes. If model transparency is a concern, I would recommend the smaller model lda4 which has nearly as good performance but omits the categorical variable.

Model Performance on Palmer Penguins

model      method        numVars   accuracy   kappa
lda_comp   LDA                 5      0.985   0.976
lda4       LDA                 4      0.969   0.952
qda_comp   QDA                 5      0.969   0.951
qda4       QDA                 4      0.954   0.927
nb_comp    Naive Bayes         5      0.954   0.927
nb4        Naive Bayes         4      0.954   0.927

Note: All models used the same data with an 80-20% (268/65 observations) split of training to test data.

In comparing methodologies, I find little to recommend the use of QDA for this particular data set. Its increased flexibility in handling curved classification boundaries is offset by the greater number of parameters to estimate. In the Palmer Penguins case, we learned that the species-level covariances were not significantly different by using the R package heplots to systematically compare ellipses.

Naive Bayes gave a reasonable performance. It is able to handle both categorical and continuous variables in the same model. It makes the limiting assumption of independence of the features, which is usually untrue. It has the added benefit of much faster computation time, but for small data sets like palmerpenguins there is no real advantage. Rather, it probably serves best as a simple, low-cost reference model when analyzing more complex models or data studies.

I also raised some technical questions which remain unresolved. The MASS::lda and MASS::qda functions handle mixed variable sets quite effectively through undocumented means, whereas textbook treatments of LDA and QDA suggest this is challenging. Furthermore, certain graphical tools like partimat do not describe the mathematical assumptions behind how classification boundaries are drawn. Curiously, numerous data science bloggers repeat the use of these charts without addressing this concern. So there is ample room for improving the tools of our trade.

5 Appendices

5.2 Code

We summarize all the R code used in this project in this appendix for ease of reading.

library(tidyverse)
library(ggplot2)
library(knitr)
library(kableExtra)
library(GGally)
library(caret)
library(e1071)
library(klaR)
knitr::opts_chunk$set(echo = FALSE, message=FALSE, warning=FALSE)

library(palmerpenguins)
library(skimr)


# The final dataset excludes observations with missing sex and drops the island and year variables.

pc = penguins %>% filter( is.na(sex) == FALSE) %>% dplyr::select( -one_of("island", "year") )

head(pc)
dim(pc)

# Feature Plot of box plots
caret::featurePlot(x = pc[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")], 
                   y = pc$species ,
                   plot = 'box' , 
                   scales = list(y = list( relation="free")))



# Feature Plot of density

caret::featurePlot(x = pc[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")], 
                   y = pc$species ,
                   plot = 'density' , 
                   scales = list(x = list( relation="free"), y = list(relation="free")) ,
                   auto.key = list( columns = 3 )
                   )



# Feature Plot of ellipses

caret::featurePlot(
  x = pc[, c("bill_length_mm",  "bill_depth_mm", "flipper_length_mm", "body_mass_g")],
  y = pc$species,  
  plot = "ellipse",
  auto.key = list(columns = 3)  # lattice argument:  The number of columns to display for group labels at top of plot
)
# This code block partitions the cleansed data into a training and test set.
# It then pre-processes the training data so the model is stabilized with respect to the influence of 
# all variables.

library(MASS)


set.seed(123)

training.individuals <- pc$species %>% 
  createDataPartition(p = 0.8 , list = FALSE)

train.data  <- pc[  training.individuals, ]
test.data   <- pc[ -training.individuals, ]

# Estimate preprocessing parameters
preproc.parameter <- train.data %>% 
    preProcess(method = c("center", "scale"))

# Transform the data using the estimated parameters
train.transform <- preproc.parameter %>% predict( train.data)
test.transform  <- preproc.parameter %>% predict( test.data )

# Fit the model
model_lda_comp <- lda( species~ ., data = train.transform)

# Make predictions
predictions_lda_comp <- model_lda_comp %>% predict( test.transform)

model_lda_comp


# Fit the model LDA
model_lda4 <- lda( species~ bill_length_mm + bill_depth_mm + body_mass_g + flipper_length_mm, 
               data = train.transform)

# Make predictions
predictions_lda4 <- model_lda4 %>% predict( test.transform)

model_lda4


# Confusion Matrix for the Complete LDA model

(cm_lda_comp = confusionMatrix(predictions_lda_comp$class, test.transform$species))


# Confusion Matrix for the 4 variable LDA model

(cm_lda4 = confusionMatrix( predictions_lda4$class, test.transform$species) )

# Component Plots of the LDA models

par(mfrow=c(1,2))

plot(model_lda_comp, col = as.integer(train.transform$species), main='LDA Comp')


plot(model_lda4, col = as.integer(train.transform$species), main='LDA4' )

# Partition Plots
partimat( species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g  , data= train.transform, method = 'lda', main="Partition Plot for LDA 4")


# Exploring covariance
library(heplots)

covEllipses(train.transform[,2:5], train.transform$species, fill=c(c(rep(FALSE,3)), TRUE), variables=1:4, fill.alpha=.1 , center = FALSE, pooled=TRUE , label.pos = c( 2,  1 , 0, 3) )


# Explore the Covariance Matrices with overlapping ellipses over the Pooled covariances

covEllipses(train.transform[,2:5], train.transform$species, fill=c(c(rep(FALSE,3)), TRUE), variables=1:4, fill.alpha=.1 , center = TRUE, pooled=TRUE , label.pos = c( 2,  1 , 0, 3) )

# Fit the model for QDA Complete
model_qda_comp <- qda( species~ ., data = train.transform)

# Make predictions
predictions_qda_comp <- model_qda_comp %>% predict( test.transform)

model_qda_comp

# Fit the model for QDA 4 variables

model_qda4 <- qda( species~ bill_length_mm + bill_depth_mm + 
                     flipper_length_mm + body_mass_g , data = train.transform)

# Make predictions
predictions_qda4 <- model_qda4  %>% predict( test.transform)

model_qda4


# Confusion Matrix for QDA comp
(cm_qda_comp = confusionMatrix(predictions_qda_comp$class, test.transform$species) )

# Confusion Matrix for QDA 4 variable
(cm_qda4 = confusionMatrix(predictions_qda4$class, test.transform$species) )

# Partition plot for QDA 4 model

partimat( species ~ bill_length_mm + bill_depth_mm + flipper_length_mm + body_mass_g  , data= train.transform, method = 'qda', main = 'Partition Plot for qda4 model')


nb4 = klaR::NaiveBayes( species ~ bill_length_mm + bill_depth_mm + body_mass_g + flipper_length_mm ,
                        data = train.transform )

pred_nb4 = predict( nb4, newdata = test.transform )



nb_comp = klaR::NaiveBayes( species ~ bill_length_mm + bill_depth_mm + body_mass_g 
                            + flipper_length_mm + sex ,
                        data = train.transform )

pred_nb_comp = predict( nb_comp, newdata = test.transform )



# Confusion matrix of Naive Bayes 4 variables
(cm_nb4 = confusionMatrix(pred_nb4$class, test.transform$species ) )


# Confusion matrix of Naive Bayes -complete model
(cm_nbcomp = confusionMatrix(pred_nb_comp$class, test.transform$species ) )

library(knitr)
library(kableExtra)
model_report = data.frame(model = c("lda_comp", "lda4", "qda_comp", "qda4", "nb_comp", "nb4") ,
                          method = c("LDA", "LDA", "QDA", "QDA", "Naive Bayes", "Naive Bayes") ,
                          numVars = c( 5, 4, 5, 4, 5, 4) ,
                          accuracy = c( cm_lda_comp$overall[1], 
                                        cm_lda4$overall[1],
                                        cm_qda_comp$overall[1],
                                        cm_qda4$overall[1],
                                        cm_nbcomp$overall[1],
                                        cm_nb4$overall[1] ) ,
                          kappa = c(  cm_lda_comp$overall[2], 
                                        cm_lda4$overall[2],
                                        cm_qda_comp$overall[2],
                                        cm_qda4$overall[2],
                                        cm_nbcomp$overall[2],
                                        cm_nb4$overall[2]
                            
                            )
                          )

kable(model_report, digits = 3 , caption = "Model Performance on Palmer Penguins") %>% kable_styling(bootstrap_options = c("striped", "hover") ) %>%
   footnote( general = "All models used the same data with 80-20% (268/65 observations) split of training to test data."
             )