Data Manipulation

## -- Attaching packages --------------------------------------- tidyverse 1.3.0 --
## v ggplot2 3.3.3     v purrr   0.3.4
## v tibble  3.0.6     v dplyr   1.0.4
## v tidyr   1.1.2     v stringr 1.4.0
## v readr   1.3.1     v forcats 0.5.0
## Warning: package 'ggplot2' was built under R version 4.0.3
## Warning: package 'tibble' was built under R version 4.0.3
## Warning: package 'tidyr' was built under R version 4.0.3
## Warning: package 'dplyr' was built under R version 4.0.3
## -- Conflicts ------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag()    masks stats::lag()
## Warning: package 'caret' was built under R version 4.0.2
## Loading required package: lattice
## 
## Attaching package: 'caret'
## The following object is masked from 'package:purrr':
## 
##     lift
## Warning: package 'e1071' was built under R version 4.0.4
## Warning: package 'caTools' was built under R version 4.0.4

2. Normalize the data. Categorical variables are automatically ignored.

LDA assumes that the predictors are normally distributed within each class, with class-specific means and a variance/covariance matrix that is shared by all classes. Centering and scaling put the predictors on a common scale but do not change the shape of their distributions, so the normality assumption still has to be checked separately.
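The normalization step can be sketched in base R. This is a minimal illustration, not the code used in this analysis: it assumes base R's `scale()` in place of whatever preprocessing call produced `train.transformed`, and it uses the built-in `iris` data as a stand-in for the penguin training set. Only the numeric columns are transformed; the factor column passes through unchanged.

```r
# Stand-in data: iris has four numeric predictors and one factor (Species).
train <- iris

# Identify the numeric columns; factor columns are left untouched.
is_num <- vapply(train, is.numeric, logical(1))

# Estimate the centering and scaling parameters from the training data.
centers <- colMeans(train[is_num])
scales  <- vapply(train[is_num], sd, numeric(1))

# Apply (x - mean) / sd to each numeric column.
train_transformed <- train
train_transformed[is_num] <- as.data.frame(
  scale(train[is_num], center = centers, scale = scales)
)

# Each numeric column now has mean 0 and standard deviation 1
# (up to floating-point error).
round(colMeans(train_transformed[is_num]), 10)
vapply(train_transformed[is_num], sd, numeric(1))
```

The same `centers` and `scales` would be reused to transform any test set, so that both sets are expressed on the scale learned from the training data.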

Linear discriminant analysis - LDA

## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
## [1] 1

Compute LDA:

## Call:
## lda(species ~ . - year, data = train.transformed)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4365672 0.2052239 0.3582090 
## 
## Group means:
##             island1   island2 bill_length_mm bill_depth_mm flipper_length_mm
## Adelie    0.3931624 0.3247863     -0.9555862     0.6041178        -0.7633521
## Chinstrap 1.0000000 0.0000000      0.8349499     0.6116721        -0.4203050
## Gentoo    0.0000000 0.0000000      0.6862640    -1.0867057         1.1711351
##           body_mass_g      sex1
## Adelie     -0.6210872 0.4529915
## Chinstrap  -0.6096684 0.4363636
## Gentoo      1.1062392 0.5520833
## 
## Coefficients of linear discriminants:
##                          LD1         LD2
## island1           -1.0710715 -1.77820348
## island2           -1.0600222 -0.21538615
## bill_length_mm     0.5311923 -2.11473056
## bill_depth_mm     -1.6661173 -0.17620747
## flipper_length_mm  1.2482407 -0.01937768
## body_mass_g        1.1955640  0.67876715
## sex1              -0.5513973  1.03207058
## 
## Proportion of trace:
##    LD1    LD2 
## 0.8381 0.1619
  1. Prior probabilities of groups: the proportion of training observations in each group.

For example, about 44% of the training observations are in the Adelie group, 21% are in the Chinstrap group, and 36% are in the Gentoo group.

  2. Group means: the center of gravity of each group, that is, the mean of each variable within each group.

  3. Coefficients of linear discriminants: the linear combinations of the predictor variables that are used to form the LDA decision rule. For example, LD1 = -1.07 x island1 - 1.06 x island2 + 0.53 x bill_length_mm - 1.67 x bill_depth_mm + 1.25 x flipper_length_mm + 1.20 x body_mass_g - 0.55 x sex1.

Similarly, LD2 = -1.78 x island1 - 0.22 x island2 - 2.11 x bill_length_mm - 0.18 x bill_depth_mm - 0.02 x flipper_length_mm + 0.68 x body_mass_g + 1.03 x sex1.
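The pieces of the LDA printout can be reproduced by hand, which makes their meaning concrete. The sketch below is illustrative only: it fits `MASS::lda()` on the built-in `iris` data as a stand-in for the penguin data, and every variable name in it is hypothetical. It shows that the priors are class proportions, that the group means are per-class averages, and that the discriminant scores are the centered predictors multiplied by the coefficient matrix.

```r
library(MASS)  # provides lda(); shipped with standard R distributions

fit <- lda(Species ~ ., data = iris)

# Prior probabilities: the share of training observations in each class.
priors <- prop.table(table(iris$Species))

# Group means: the average of each predictor within each class.
group_means <- aggregate(. ~ Species, data = iris, FUN = mean)

# Discriminant scores: MASS centers the predictors at the prior-weighted
# average of the group means, then multiplies by the coefficients.
X <- as.matrix(iris[, 1:4])
center <- colSums(fit$prior * fit$means)
scores_manual <- scale(X, center = center, scale = FALSE) %*% fit$scaling

# The hand-computed scores agree with what predict() returns.
scores_predict <- predict(fit)$x
all.equal(unname(scores_manual), unname(scores_predict))

# Proportion of trace: the share of between-class separation
# captured by each discriminant.
prop_trace <- fit$svd^2 / sum(fit$svd^2)
prop_trace
```

Because of the centering step, the coefficients give each score up to a constant offset; the relative positions of the observations along LD1 and LD2 are what matter for classification.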

This plot shows the linear discriminants of the model.

Model accuracy:

You can compute the model accuracy as follows:

## [1] 1

Our model correctly classified 100% of the observations, which is very good.
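The accuracy calculation itself is a one-liner: the share of observations whose predicted class equals the observed class. A minimal sketch, assuming an 80/20 train/test split and the built-in `iris` data as a stand-in for the penguin data (the seed, the split ratio, and the variable names are all illustrative choices, not taken from this analysis):

```r
library(MASS)

set.seed(123)  # arbitrary seed so the split is reproducible
n <- nrow(iris)
train_idx <- sample(n, size = round(0.8 * n))
train <- iris[train_idx, ]
test  <- iris[-train_idx, ]

# Fit on the training rows, predict the held-out rows.
fit <- lda(Species ~ ., data = train)
predicted <- predict(fit, newdata = test)$class

# Accuracy: proportion of test rows where prediction and truth agree.
accuracy <- mean(predicted == test$Species)
accuracy
```

Accuracy measured on held-out rows is usually a more honest estimate than accuracy measured on the rows used for fitting.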

Quadratic discriminant analysis - QDA

QDA is more flexible than LDA because it allows a different covariance matrix for each class. LDA tends to do a bit better than QDA when the training set is small, because it has fewer parameters to estimate. QDA is often recommended when the training set is very large, so that the variance of the classifier is not a major concern.
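The difference between the two assumptions can be made concrete by computing the covariance matrices directly. The sketch below is illustrative and uses the built-in `iris` data as a stand-in; the variable names are hypothetical. QDA works with one covariance matrix per class, while LDA pools them into a single matrix.

```r
library(MASS)

predictors <- iris[, 1:4]
classes <- iris$Species

# QDA's view: a separate covariance matrix for every class.
class_cov <- lapply(split(predictors, classes), cov)

# LDA's view: one pooled within-class covariance matrix, i.e. the
# class matrices averaged with weights (n_k - 1).
n_k <- as.numeric(table(classes))
weighted <- Map(function(S, n) (n - 1) * S, class_cov, n_k)
pooled_cov <- Reduce(`+`, weighted) / (nrow(iris) - nlevels(classes))

lda_fit <- lda(Species ~ ., data = iris)
qda_fit <- qda(Species ~ ., data = iris)

# lda() stores a single coefficient matrix, whereas qda() stores
# one transformation per class.
dim(lda_fit$scaling)
dim(qda_fit$scaling)
```

With p predictors and K classes, QDA estimates K covariance matrices of p(p+1)/2 parameters each, which is why it needs more data than LDA to be estimated reliably.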

## Call:
## qda(species ~ bill_length_mm + bill_length_mm + flipper_length_mm + 
##     body_mass_g + sex, data = train.transformed)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4365672 0.2052239 0.3582090 
## 
## Group means:
##           bill_length_mm flipper_length_mm body_mass_g      sex1
## Adelie        -0.9555862        -0.7633521  -0.6210872 0.4529915
## Chinstrap      0.8349499        -0.4203050  -0.6096684 0.4363636
## Gentoo         0.6862640         1.1711351   1.1062392 0.5520833
## [1] 1

About 44% of the training observations are in the Adelie group, 21% are in the Chinstrap group, and 36% are in the Gentoo group.

This model's accuracy is 100%.

Naive Bayes Classifier

## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##    Adelie Chinstrap    Gentoo 
## 0.4365672 0.2052239 0.3582090 
## 
## Conditional probabilities:
##            island
## Y                   0         1         2
##   Adelie    0.3247863 0.3760684 0.2991453
##   Chinstrap 0.0000000 1.0000000 0.0000000
##   Gentoo    1.0000000 0.0000000 0.0000000
## 
##            bill_length_mm
## Y               [,1]     [,2]
##   Adelie    38.83932 2.476196
##   Chinstrap 49.18727 3.426479
##   Gentoo    47.58854 3.195717
## 
##            bill_depth_mm
## Y               [,1]      [,2]
##   Adelie    18.33504 1.1555948
##   Chinstrap 18.46545 1.1686045
##   Gentoo    14.98646 0.9857532
## 
##            flipper_length_mm
## Y               [,1]     [,2]
##   Adelie    190.3846 6.508411
##   Chinstrap 196.3273 6.947034
##   Gentoo    217.1875 6.737659
## 
##            body_mass_g
## Y               [,1]     [,2]
##   Adelie    3706.838 464.4776
##   Chinstrap 3781.364 372.0320
##   Gentoo    5088.542 502.8722
## 
##            sex
## Y                   0         1
##   Adelie    0.4786325 0.5213675
##   Chinstrap 0.4363636 0.5636364
##   Gentoo    0.4791667 0.5208333
##            y_pred
##             Adelie Chinstrap Gentoo
##   Adelie        29         0      0
##   Chinstrap      1        12      0
##   Gentoo         0         0     23
## Confusion Matrix and Statistics
## 
##            y_pred
##             Adelie Chinstrap Gentoo
##   Adelie        29         0      0
##   Chinstrap      1        12      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9846          
##                  95% CI : (0.9172, 0.9996)
##     No Information Rate : 0.4615          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9757          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 0.9667           1.0000        1.0000
## Specificity                 1.0000           0.9811        1.0000
## Pos Pred Value              1.0000           0.9231        1.0000
## Neg Pred Value              0.9722           1.0000        1.0000
## Prevalence                  0.4615           0.1846        0.3538
## Detection Rate              0.4462           0.1846        0.3538
## Detection Prevalence        0.4462           0.2000        0.3538
## Balanced Accuracy           0.9833           0.9906        1.0000

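The model builds a conditional probability table for each feature separately. For a categorical feature, the table gives the proportion of each level within each species; for a numeric feature, it gives the mean ([,1]) and standard deviation ([,2]) within each species. The a-priori probabilities are also calculated; they show how the training observations are distributed across the species.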
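These tables are simple summaries and can be rebuilt with base R. The sketch below is a hand-rolled illustration, not the `e1071` implementation: it uses the built-in `iris` data as a stand-in, and the categorical predictor `petal_size` is a hypothetical column created only for this example.

```r
# Stand-in data with one made-up categorical predictor.
d <- iris
d$petal_size <- cut(d$Petal.Length,
                    breaks = c(0, 2.5, 5, Inf),
                    labels = c("short", "medium", "long"))

# A-priori probabilities: class proportions in the training data.
apriori <- prop.table(table(d$Species))

# Numeric predictor: per-class mean and standard deviation, which
# parameterize a Gaussian density for each class.
numeric_table <- t(sapply(split(d$Sepal.Length, d$Species),
                          function(v) c(mean = mean(v), sd = sd(v))))

# Categorical predictor: proportion of each level within each class,
# so every row sums to 1.
categorical_table <- prop.table(table(d$Species, d$petal_size), margin = 1)

apriori
numeric_table
categorical_table
```

Each feature gets its own table because naive Bayes treats the features as conditionally independent given the class.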

Reading the rows of the confusion matrix as the actual species and the columns (y_pred) as the predicted species, all 29 Adelie and all 23 Gentoo are classified correctly. Of the 13 Chinstrap, 12 are classified correctly and 1 is misclassified as Adelie.

The model achieved 98.5% accuracy, which is significantly higher than the no-information rate of 46% (p-value < 2.2e-16). The sensitivity, specificity, and balanced accuracy are at least 0.96 for every class, so the model performs well.
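The headline statistics can be re-derived from the confusion matrix by hand. The sketch below re-enters the counts printed above and uses only base R; the variable names are illustrative. The per-class values match the printed ones when the column totals are used as the denominator for sensitivity, which is how the printed statistics appear to have been computed.

```r
species <- c("Adelie", "Chinstrap", "Gentoo")

# Counts copied from the printed confusion matrix (filled column by column).
cm <- matrix(c(29, 1, 0,
               0, 12, 0,
               0, 0, 23),
             nrow = 3,
             dimnames = list(row = species, y_pred = species))

total <- sum(cm)
hits  <- diag(cm)

# Overall accuracy: correctly classified / all observations (64 / 65).
accuracy <- sum(hits) / total

# Per-class statistics, using the column totals as the reference counts.
sensitivity <- hits / colSums(cm)
pos_pred    <- hits / rowSums(cm)
true_neg    <- total - rowSums(cm) - colSums(cm) + hits
specificity <- true_neg / (total - colSums(cm))
balanced    <- (sensitivity + specificity) / 2

round(accuracy, 4)     # 0.9846
round(sensitivity, 4)  # 0.9667 1.0000 1.0000
round(specificity, 4)  # 1.0000 0.9811 1.0000
round(balanced, 4)     # 0.9833 0.9906 1.0000
```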

Conclusion

Comparing the accuracy of the three models, LDA and QDA perform best: each correctly classifies 100% of the observations, compared with 98% for the Naive Bayes classifier. However, for QDA I had to remove the island variable because it caused a rank deficiency in the Chinstrap group. As a result, QDA does not use as much of the data as LDA does. I therefore believe that LDA is the best method for this dataset, as it reached 100% accuracy and was the most inclusive of the data among the three models. I also ran the model without the two other variables, island and sex; the other statistics remained the same, but the accuracy was slightly lower.
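The rank deficiency mentioned above arises when a predictor has no variation inside one class, because that class's covariance matrix then cannot be inverted. The sketch below reproduces the situation on the built-in `iris` data as a stand-in; the column `not_setosa` is a hypothetical indicator constructed to be constant within each species, playing the role that island plays for Chinstrap.

```r
library(MASS)

d <- iris
# Hypothetical predictor that never varies within a species.
d$not_setosa <- as.numeric(d$Species != "setosa")

# qda() needs a full-rank covariance matrix in every class, so the
# fit stops with an error; tryCatch() captures it instead of halting.
result <- tryCatch(
  qda(Species ~ ., data = d),
  error = function(e) e
)
inherits(result, "error")
if (inherits(result, "error")) conditionMessage(result)

# Dropping the offending predictor from the formula lets the fit succeed.
fit <- qda(Species ~ . - not_setosa, data = d)
class(fit)
```

Removing the variable is the simplest fix, at the cost of discarding whatever information it carried for the other classes.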

Reference

http://www.sthda.com/english/articles/36-classification-methods-essentials/146-discriminant-analysis-essentials-in-r/

https://www.geeksforgeeks.org/naive-bayes-classifier-in-r-programming/