DATA 622 - Assignment 2

Author: Romerl Elizes

Submitted Date: March 17, 2021

Overall Summary

Per requirements of Assignment 2, the following results have been determined:

I carried out some data preparation activities prior to executing the three models required for Assignment 2. First, I filled the missing values of the data using a data imputation function called mice. Second, I executed the same correlation function from Assignment 1 to identify the highly correlated variables and eliminate them from the data set. Unlike Assignment 1, I actually removed the variables flipper_length_mm and body_mass_g. I had to transform the categorical data into numerical equivalents to execute the correlation effectively. Third, I separated the data into training and testing sub-sets. I preprocessed the training data set by standardizing it with centering and scaling. I applied this transformation to produce the train1.transform and test1.transform data sets that I used for my 3 model predictions. Per Professor Khan’s suggestion, the year variable was eliminated from the data sets.
Linear Discriminant Analysis. Upon execution of the LDA function, the final model was species ~ island + bill_length_mm + bill_depth_mm + sex. Accuracy Rate is: 100%, Precision is 100%, F1-Score is 100%.
Quadratic Discriminant Analysis. Upon execution of the QDA function, the final model was species ~ bill_length_mm + bill_depth_mm + sex. Accuracy Rate is: 100%, Precision is 100%, F1-Score is 100%.
Naïve Bayes. Upon execution of Naïve Bayes function using a 10-fold cross validation procedure the final model is species ~ island + bill_length_mm + bill_depth_mm + sex. Accuracy Rate is: 98.5%, Precision is 96.8%, F1-Score is 98.4%.

b. Findings

I really tried to remove any form of gut instinct from the calculations for the models. I resisted the temptation to use any of the highly correlated variables such as flipper_length_mm and body_mass_g. In Assignment 1, I retained flipper_length_mm and body_mass_g even though mathematically they were found to be highly correlated variables, and this may have distorted my observations.
If I had retained the highly correlated variables in the models, all of the models would have generated an accuracy of 1. I am uncomfortable with a perfect accuracy score because, in reality, it does not exist. However, I still ended up with models with an accuracy of 1.
I only standardized the remaining data set, using centering and scaling, prior to the execution of the three models. Perhaps it would have been more beneficial to apply a squared, log, or other transformation to the variables to make them fit the models better. However, the resulting accuracies of the models seem to indicate that the variables by themselves are good predictors of species.
Initially, I was a little concerned that the sex variable was still in all of the models that I used for this assignment even though the male and female sub-sets are nearly identical in size. During last-minute experimentation, the sex variable turned out to be irrelevant in all of the final models. It did not affect the accuracy of the models whether I used it or not. I left the variable in the final models due to time constraints.
I performed a feature plot analysis for my selected variables in the QDA section of the assignment. In hindsight, the feature plot analysis could have been performed in the Data Preparation section of the assignment.
Despite my recent conversation with you on March 17, I had to resort to using the numerical versions of the two categorical variables species and island to determine multicollinearity. This may produce bad results, but I made sure to use the VIF function to help me with the variable selection process.

0. Data Preparation

a. Imputing Missing Values

The following variables had missing data and I used the MICE function to remediate the missing values. Bear in mind, SEX had only 11 missing values and the other variables had only 2 missing values each. Perhaps it would have been a lot easier to omit the rows with missing data, but I felt that it would be a more truthful data set if we had imputed data values to work with.

BILL_LENGTH_MM
BILL_DEPTH_MM
FLIPPER_LENGTH_MM
BODY_MASS_G
SEX

## 
##  iter imp variable
##   1   1  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   1   2  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   1   3  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   1   4  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   1   5  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   2   1  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   2   2  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   2   3  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   2   4  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   2   5  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   3   1  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   3   2  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   3   3  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   3   4  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   3   5  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   4   1  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   4   2  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   4   3  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   4   4  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   4   5  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   5   1  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   5   2  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   5   3  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   5   4  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex
##   5   5  bill_length_mm  bill_depth_mm  flipper_length_mm  body_mass_g  sex

##       species          island    bill_length_mm  bill_depth_mm  
##  Adelie   :152   Biscoe   :168   Min.   :32.10   Min.   :13.10  
##  Chinstrap: 68   Dream    :124   1st Qu.:39.27   1st Qu.:15.57  
##  Gentoo   :124   Torgersen: 52   Median :44.45   Median :17.30  
##                                  Mean   :43.92   Mean   :17.14  
##                                  3rd Qu.:48.50   3rd Qu.:18.70  
##                                  Max.   :59.60   Max.   :21.50  
##  flipper_length_mm  body_mass_g       sex     
##  Min.   :172.0     Min.   :2700   female:174  
##  1st Qu.:190.0     1st Qu.:3550   male  :170  
##  Median :197.0     Median :4050               
##  Mean   :200.9     Mean   :4202               
##  3rd Qu.:213.2     3rd Qu.:4756               
##  Max.   :231.0     Max.   :6300
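
The imputation step can be sketched roughly as follows. This is only an illustration under stated assumptions: the data frame `toy` and its values are made up as a stand-in for the penguin data, the seed is arbitrary, and the `mice()` call uses m = 5 and maxit = 5 because that matches the 5 x 5 iteration log shown above. The imputation itself runs only if the mice package is installed.

```r
# Hypothetical stand-in for the raw penguin data (all values are simulated).
set.seed(123)
n <- 60
toy <- data.frame(
  bill_length_mm = rnorm(n, mean = 44, sd = 5),
  bill_depth_mm  = rnorm(n, mean = 17, sd = 2),
  sex            = factor(sample(c("female", "male"), n, replace = TRUE))
)

# Knock out a few values: 2 per numeric column and 11 for sex.
toy$bill_length_mm[c(4, 20)] <- NA
toy$bill_depth_mm[c(4, 20)]  <- NA
toy$sex[c(4, 9, 10, 11, 12, 20, 30, 40, 45, 50, 55)] <- NA

# Count missing values per column before deciding how to treat them.
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Impute with mice if it is available (assumed settings: 5 imputations, 5 iterations).
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)
  imp <- mice(toy, m = 5, maxit = 5, seed = 123, printFlag = FALSE)
  toy_imputed <- complete(imp, 1)  # keep the first of the 5 completed data sets
  print(colSums(is.na(toy_imputed)))
}
```

Keeping only the first completed data set is a simplification; the other four completed sets could be inspected the same way by changing the second argument of `complete()`.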

b. Determining Highly-Correlated Variables

Upon reviewing my notes from my previous course, DATA 621 Business Analytics and Data Mining, I found that there was a possibility that I had highly correlated variables in my model. Identifying and eliminating a highly correlated variable may improve the fit of the models.

Prior to executing the cor function with the Pearson method, I had to transform the categorical variables in a temporary copy of the data into numerical equivalents. For species, I set 0, 1, and 2 for Adelie, Chinstrap, and Gentoo respectively. For island, I set 0, 1, and 2 for Biscoe, Dream, and Torgersen respectively. For sex, I set 0 and 1 for male and female respectively.

library(dplyr)  # provides case_when()

# Work on a temporary copy so the imputed factors stay intact for modelling
pengtemp1 <- pengtempimputed
pengtemp1$species <- case_when(pengtemp1$species == 'Adelie' ~ 0,
                               pengtemp1$species == 'Chinstrap' ~ 1,
                               pengtemp1$species == 'Gentoo' ~ 2)

pengtemp1$island <- case_when(pengtemp1$island == 'Biscoe' ~ 0,
                              pengtemp1$island == 'Dream' ~ 1,
                              pengtemp1$island == 'Torgersen' ~ 2)

pengtemp1$sex <- case_when(pengtemp1$sex == 'male' ~ 0,
                           pengtemp1$sex == 'female' ~ 1)

summary(pengtemp1)

##     species           island       bill_length_mm  bill_depth_mm  
##  Min.   :0.0000   Min.   :0.0000   Min.   :32.10   Min.   :13.10  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:39.27   1st Qu.:15.57  
##  Median :1.0000   Median :1.0000   Median :44.45   Median :17.30  
##  Mean   :0.9186   Mean   :0.6628   Mean   :43.92   Mean   :17.14  
##  3rd Qu.:2.0000   3rd Qu.:1.0000   3rd Qu.:48.50   3rd Qu.:18.70  
##  Max.   :2.0000   Max.   :2.0000   Max.   :59.60   Max.   :21.50  
##  flipper_length_mm  body_mass_g        sex        
##  Min.   :172.0     Min.   :2700   Min.   :0.0000  
##  1st Qu.:190.0     1st Qu.:3550   1st Qu.:0.0000  
##  Median :197.0     Median :4050   Median :1.0000  
##  Mean   :200.9     Mean   :4202   Mean   :0.5058  
##  3rd Qu.:213.2     3rd Qu.:4756   3rd Qu.:1.0000  
##  Max.   :231.0     Max.   :6300   Max.   :1.0000

I ran the cor function using the Pearson method to find any correlations in the data. It was determined that bill_length_mm, flipper_length_mm, and body_mass_g had high positive correlations with species.

	species	island	bill_length_mm	bill_depth_mm	flipper_length_mm	body_mass_g	sex
species	1.0000000	-0.6356590	0.7327216	-0.7436032	0.8551531	0.7525558	-0.0119753
island	-0.6356590	1.0000000	-0.3573415	0.5689644	-0.5708034	-0.5653960	0.0214442
bill_length_mm	0.7327216	-0.3573415	1.0000000	-0.2368275	0.6577704	0.5973979	-0.3495081
bill_depth_mm	-0.7436032	0.5689644	-0.2368275	1.0000000	-0.5832227	-0.4728635	-0.3731686
flipper_length_mm	0.8551531	-0.5708034	0.6577704	-0.5832227	1.0000000	0.8725170	-0.2536850
body_mass_g	0.7525558	-0.5653960	0.5973979	-0.4728635	0.8725170	1.0000000	-0.4266063
sex	-0.0119753	0.0214442	-0.3495081	-0.3731686	-0.2536850	-0.4266063	1.0000000

Positive Correlated Variables

##                     species
## bill_length_mm    0.7327216
## flipper_length_mm 0.8551531
## body_mass_g       0.7525558

Negative Correlated Variables

##                  species
## island        -0.6356590
## bill_depth_mm -0.7436032
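
The split into positively and negatively correlated variables can be sketched as below. This is a minimal illustration, assuming the built-in `iris` data as a stand-in for the penguin data and an arbitrary cut-off of 0.7 for "high" correlation (neither comes from the assignment). Note that coding a nominal variable as 0, 1, 2 imposes an ordering on it, so the resulting Pearson values depend on the order chosen for the levels.

```r
# Stand-in data: code the three-level response as 0, 1, 2, as done in the text.
dat <- iris
dat$Species <- as.numeric(dat$Species) - 1

# Full Pearson correlation matrix of the numerically coded data.
cm <- cor(dat, method = "pearson")

# Correlation of every predictor with the coded response.
with_response <- cm[rownames(cm) != "Species", "Species"]

threshold <- 0.7  # assumed cut-off for a "high" correlation
positive <- with_response[with_response >= threshold]
negative <- with_response[with_response < 0]

print(round(positive, 4))
print(round(negative, 4))
```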

Following feedback from Professor Khan on Assignment 1, I kept the bill_length_mm variable in the model. Based on a lecture from Professor Chris Mack, one way to determine which variables to eliminate from the model is to find the variables with a high Variance Inflation Factor (VIF) [MAC].

lmmodel <- lm(species~.,data=pengtemp1)
vif(lmmodel)

##            island    bill_length_mm     bill_depth_mm flipper_length_mm 
##          1.782707          1.906783          3.252636          5.968078 
##       body_mass_g               sex 
##          5.850087          2.588557

Upon running the vif function, you will see that flipper_length_mm and body_mass_g have the highest inflation factor values at 5.97 and 5.85 respectively. Also observe in the correlation map above that flipper_length_mm and body_mass_g are within the blue color range, indicating that they have multicollinearity. According to Professor Mack, any variable with a VIF value greater than 5 should be a candidate for elimination. In keeping with Professor Khan’s suggestion to keep the bill_length_mm variable, I executed the lm and vif functions again without the body_mass_g and flipper_length_mm variables and came up with much better results.

lmmodel <- lm(species~island + bill_depth_mm + sex + bill_length_mm,data=pengtemp1)
vif(lmmodel)

##         island  bill_depth_mm            sex bill_length_mm 
##       1.672551       2.034484       1.586528       1.421243

The resulting output indicates that none of the variables has a variance inflation factor greater than 5. For simplicity, I removed the body_mass_g and flipper_length_mm variables from the data sets after determining their high correlation with each other. The final model will be: species ~ island + bill_length_mm + bill_depth_mm + sex.

pengtempimputed <- pengtempimputed[,!(names(pengtempimputed) %in% c("body_mass_g","flipper_length_mm"))]
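
For readers who want to see what the VIF measures, the quantity for predictor j is 1 / (1 - R_j^2), where R_j^2 comes from regressing predictor j on all the other predictors. The sketch below computes it by hand in base R. It is an illustration only: the helper `vif_manual` is a hypothetical name, the built-in `mtcars` data stands in for the penguin data, and the cut-off of 5 is the rule of thumb quoted above.

```r
# Hypothetical helper: VIF_j = 1 / (1 - R_j^2), with R_j^2 taken from
# regressing predictor j on the remaining predictors.
vif_manual <- function(data, predictors) {
  sapply(predictors, function(p) {
    others <- setdiff(predictors, p)
    fit <- lm(reformulate(others, response = p), data = data)
    1 / (1 - summary(fit)$r.squared)
  })
}

# Stand-in data and predictors (not the penguin variables).
predictors <- c("disp", "hp", "wt", "drat")
vifs <- vif_manual(mtcars, predictors)
print(round(vifs, 3))

# Candidates for elimination under the "VIF greater than 5" rule of thumb.
to_drop <- names(vifs)[vifs > 5]
print(to_drop)
```

For models that contain only numeric predictors this should agree with what a VIF function reports; models with multi-level factors are usually summarized with a generalized VIF instead, which this sketch does not cover.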

c. Partition of Training/Test Data and Data Transformation

I used createDataPartition to develop the Training and Test data sets. I separated the imputed data into training and test data sets using an 80/20 partition.

set.seed(123)
inTrain1 <- createDataPartition(y = pengtempimputed$species, p=0.80, list = FALSE)
training1 <- pengtempimputed[inTrain1,]
testing1 <- pengtempimputed[-inTrain1,]

Next, I preprocessed the data by standardizing it with centering and scaling. The centering and scaling parameters were estimated on the training data and then applied to both the training and test data sets that will be used for the model predictions.

preproc.parameter <- training1 %>%  
  preProcess(method = c("center", "scale")) 
  
# Transform the data using the estimated parameters 
train1.transform <- preproc.parameter %>% predict(training1) 
test1.transform <- preproc.parameter %>% predict(testing1)

All required data preparation tasks are completed. I now have train1.transform and test1.transform data sets ready for execution and prediction of the three required models.
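
The idea behind this step, estimating the centering and scaling parameters on the training rows only and reusing them on the test rows, can be sketched in base R as follows. This is a simplified illustration: it uses the built-in `iris` data as a stand-in and a plain random split rather than the stratified split that createDataPartition performs.

```r
set.seed(123)

# Simple 80/20 random split of the numeric columns (not stratified by class).
idx   <- sample(nrow(iris), size = round(0.8 * nrow(iris)))
train <- iris[idx, 1:4]
test  <- iris[-idx, 1:4]

# Estimate the parameters on the training data only.
mu    <- colMeans(train)
sigma <- apply(train, 2, sd)

# Apply the same parameters to both sets so no test information leaks in.
train_std <- scale(train, center = mu, scale = sigma)
test_std  <- scale(test,  center = mu, scale = sigma)

print(round(colMeans(train_std), 10))  # training means are zero by construction
print(round(colMeans(test_std), 3))    # test means are close to, but not exactly, zero
```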

1. Linear Discriminant Analysis

a. Execute LDA

I executed the LDA function with the final model: species ~ island + bill_length_mm + bill_depth_mm + sex. I did not remove any variables from the final data set nor did I need to transform the variables. I did not remove any of the independent variables because doing so would reduce the accuracy of the prediction when executing the confusion matrix afterwards.

## Call:
## lda(species ~ bill_length_mm + bill_depth_mm + island + sex, 
##     data = train1.transform)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4404332 0.1985560 0.3610108 
## 
## Group means:
##           bill_length_mm bill_depth_mm islandDream islandTorgersen   sexmale
## Adelie        -0.9400123     0.6093378   0.4180328       0.3278689 0.5163934
## Chinstrap      0.9046259     0.6472918   1.0000000       0.0000000 0.4909091
## Gentoo         0.6492707    -1.0994026   0.0000000       0.0000000 0.5100000
## 
## Coefficients of linear discriminants:
##                        LD1        LD2
## bill_length_mm   1.5386966 -1.6947394
## bill_depth_mm   -1.7533280 -0.7601105
## islandDream     -1.4379457 -2.1955665
## islandTorgersen -1.3641784 -0.6341571
## sexmale          0.2388337  1.7261911
## 
## Proportion of trace:
##    LD1    LD2 
## 0.7536 0.2464

b. LDA Plot

I ran ggplot on the lda model and you will notice that the species are distinctly separate from each other. This gives a good indication that the independent variables are accurately predicting the species type.

c. Perform Predictions

I performed a prediction of the model using the test1.transform data. Based on the prediction, the accuracy of the model is 100%. I do not like models of 100% accuracy but I ran this several times and still came up with the same conclusion.

##            
##             Adelie Chinstrap Gentoo
##   Adelie        30         0      0
##   Chinstrap      0        13      0
##   Gentoo         0         0     24

## [1] 1
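
The fit, predict, and tabulate sequence used here can be sketched as below. This is a minimal example under stated assumptions: it uses `lda()` from the MASS package on the built-in `iris` data as a stand-in for the penguin data, with a plain random 80/20 split and an arbitrary seed, so the numbers it produces are not the assignment's results.

```r
library(MASS)  # provides lda()

set.seed(123)
idx   <- sample(nrow(iris), size = round(0.8 * nrow(iris)))
train <- iris[idx, ]
test  <- iris[-idx, ]

# Fit the linear discriminant model on the training rows.
fit <- lda(Species ~ ., data = train)

# Predicted class labels for the held-out rows.
pred <- predict(fit, newdata = test)$class

# Cross-tabulate predictions against the true labels.
conf <- table(Prediction = pred, Reference = test$Species)
print(conf)

# Accuracy is the share of test rows on the diagonal of the table.
accuracy <- sum(diag(conf)) / sum(conf)
print(accuracy)
```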

d. Create Confusion Matrix

Upon execution of the confusion matrix for the LDA, the accuracy of the model is still 100%. The p-value for the test that accuracy exceeds the no-information rate is far below 0.05 (< 2.2e-16), indicating strong evidence that the independent variables predict species better than always guessing the most common class.

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        30         0      0
##   Chinstrap      0        13      0
##   Gentoo         0         0     24
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9464, 1)
##     No Information Rate : 0.4478     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000            1.000        1.0000
## Specificity                 1.0000            1.000        1.0000
## Pos Pred Value              1.0000            1.000        1.0000
## Neg Pred Value              1.0000            1.000        1.0000
## Prevalence                  0.4478            0.194        0.3582
## Detection Rate              0.4478            0.194        0.3582
## Detection Prevalence        0.4478            0.194        0.3582
## Balanced Accuracy           1.0000            1.000        1.0000

2. Quadratic Discriminant Analysis

For the development of the QDA model, I used a conversion of the variables on a scale from 0 to 1 to help with defining a feature plot of the data. A Feature Plot will help us visually see the data and determine what final variable(s) would be used to execute the QDA model. The base model is species ~ island + bill_length_mm + bill_depth_mm + sex. Are there any variables that will not be used in the final QDA model?

a. Prep Data for Feature Plot

To prepare for the feature plot analysis, I performed the following prep activities: 1) created dummy variables by converting any categorical variables to binary variables, 2) created a matrix of predicted values for the dummy independent variables, and 3) rescaled them to values between 0 and 1 [KEI].

## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
## object$lvls): variable 'species' is not a factor

##     island.Biscoe island.Dream island.Torgersen bill_length_mm bill_depth_mm
## min             0            0                0              0             0
## max             1            1                1              1             1

b. Feature Plot

The first row of plots clearly shows that bill_length_mm and bill_depth_mm are valid predictors of species. This is evidenced by the differing medians and Q1 and Q3 ranges for the Adelie, Chinstrap, and Gentoo species types. A cause for concern may be the Adelie and Chinstrap medians for bill_length_mm, as they are similar. However, the range of data values is different. Adelie has a slightly but visibly larger range of data than Chinstrap for bill_length_mm.

You will notice that the sex variable is conspicuously absent from the feature plot. This may indicate that the sex variable is inconsequential in the prediction of species.

The island variable is of concern based on the visual observation of the feature plot. It is split into three plots each representing an island. Adelie is present in all three islands. However, Gentoo exists only on the island of Biscoe and Chinstrap exists only on the island of Dream.

c. Execute QDA

I executed the QDA function with the final model: species ~ bill_length_mm + bill_depth_mm + sex. I removed island from the final data set because it was giving me rank deficiency errors with the species level Chinstrap. Recall that Chinstrap in the feature plot analysis exists only on the island of Dream. I could have investigated further, but since the data was already transformed, I did not want to make island numerical to see if I could eliminate the rank deficiency problem. Per Professor Khan’s suggestion from Assignment 1, I cannot just eliminate a level from a variable. Either I leave the variable in or I eliminate the variable from the model. I thus eliminated island from the QDA model. I did not remove any more independent variables because doing so would reduce the accuracy of the prediction when executing the confusion matrix afterwards.

## Call:
## qda(species ~ bill_length_mm + bill_depth_mm + sex, data = train1.transform)
## 
## Prior probabilities of groups:
##    Adelie Chinstrap    Gentoo 
## 0.4404332 0.1985560 0.3610108 
## 
## Group means:
##           bill_length_mm bill_depth_mm   sexmale
## Adelie        -0.9400123     0.6093378 0.5163934
## Chinstrap      0.9046259     0.6472918 0.4909091
## Gentoo         0.6492707    -1.0994026 0.5100000
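
The rank deficiency described above can be illustrated with a small base R sketch. QDA estimates a separate covariance matrix for each class, and a predictor that is constant within a class (such as an island dummy for a species found on only one island) makes that class's covariance matrix singular. The data frame `toy` below is entirely made up for illustration and is not the penguin data.

```r
# Hypothetical toy data: every Chinstrap row sits on Dream,
# while Adelie rows are spread over Dream and other islands.
toy <- data.frame(
  species        = factor(rep(c("Adelie", "Chinstrap"), each = 6)),
  bill_length_mm = c(38.1, 39.5, 40.2, 37.8, 41.0, 36.9,
                     48.7, 50.1, 49.3, 51.2, 47.9, 52.0),
  island_dream   = c(1, 0, 1, 0, 0, 1,
                     1, 1, 1, 1, 1, 1)
)

# Rank of the per-class covariance matrix of the two predictors.
group_rank <- sapply(
  split(toy[, c("bill_length_mm", "island_dream")], toy$species),
  function(g) qr(cov(g))$rank
)
print(group_rank)
# Adelie has full rank (2); Chinstrap has rank 1 because island_dream never varies.
```

A class whose covariance matrix is not full rank cannot be inverted, which is why dropping the whole island variable (rather than a single level) lets the QDA fit go through.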

d. Make predictions

##            
##             Adelie Chinstrap Gentoo
##   Adelie        30         0      0
##   Chinstrap      0        13      0
##   Gentoo         0         0     24

## [1] 1

e. Create Confusion Matrix

Upon execution of the confusion matrix for the QDA, the accuracy of the model is still 100%. The p-value for the test that accuracy exceeds the no-information rate is far below 0.05 (< 2.2e-16), indicating strong evidence that the independent variables predict species better than always guessing the most common class.

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        30         0      0
##   Chinstrap      0        13      0
##   Gentoo         0         0     24
## 
## Overall Statistics
##                                      
##                Accuracy : 1          
##                  95% CI : (0.9464, 1)
##     No Information Rate : 0.4478     
##     P-Value [Acc > NIR] : < 2.2e-16  
##                                      
##                   Kappa : 1          
##                                      
##  Mcnemar's Test P-Value : NA         
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000            1.000        1.0000
## Specificity                 1.0000            1.000        1.0000
## Pos Pred Value              1.0000            1.000        1.0000
## Neg Pred Value              1.0000            1.000        1.0000
## Prevalence                  0.4478            0.194        0.3582
## Detection Rate              0.4478            0.194        0.3582
## Detection Prevalence        0.4478            0.194        0.3582
## Balanced Accuracy           1.0000            1.000        1.0000

3. Naïve Bayes

a. Set up 10-fold cross validation procedure for Naïve Bayes calculation

I executed the Naïve Bayes function using 10-fold cross validation procedure [NAI] with the final model: species ~ island + bill_length_mm + bill_depth_mm + sex. I did not remove any variables from the final data set nor did I need to transform the variables. I did not remove any of the independent variables because doing so would reduce the accuracy of the prediction when executing the confusion matrix afterwards.

# create response and feature data for ALL fields
x = train1.transform[, 2:5]
y = train1.transform$species

# set up 10-fold cross validation procedure
train_control <- trainControl(
  method = "cv", 
  number = 10
  )

# train model
nb.m1 <- train(
  x = x,
  y = y,
  method = "nb",
  trControl = train_control
  )
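
To make the 10-fold procedure concrete, the sketch below spells out the fold assignment and the averaging of per-fold accuracy in base R. It is an illustration under stated assumptions, not a reproduction of what caret does internally: the helpers `gnb_fit` and `gnb_predict` are hypothetical hand-written Gaussian Naïve Bayes functions for numeric predictors, and the built-in `iris` data stands in for the penguin data.

```r
# Hypothetical Gaussian Naive Bayes: per-class priors, means, and standard deviations.
gnb_fit <- function(x, y) {
  classes <- levels(y)
  list(
    classes = classes,
    prior   = as.numeric(table(y)[classes]) / length(y),
    mu      = sapply(classes, function(k) colMeans(x[y == k, , drop = FALSE])),
    sd      = sapply(classes, function(k) apply(x[y == k, , drop = FALSE], 2, sd))
  )
}

# Pick the class with the largest log prior plus summed log densities.
gnb_predict <- function(fit, x) {
  x <- as.matrix(x)
  scores <- sapply(seq_along(fit$classes), function(j) {
    # Features in rows, observations in columns, so the per-feature
    # means and sds recycle correctly down each column.
    ll <- matrix(dnorm(t(x), mean = fit$mu[, j], sd = fit$sd[, j], log = TRUE),
                 nrow = ncol(x))
    log(fit$prior[j]) + colSums(ll)
  })
  scores <- matrix(scores, ncol = length(fit$classes))
  factor(fit$classes[max.col(scores, ties.method = "first")], levels = fit$classes)
}

set.seed(123)
k <- 10
# Assign every row to one of k folds at random.
folds <- sample(rep(seq_len(k), length.out = nrow(iris)))

# Train on k - 1 folds, score on the held-out fold, repeat for each fold.
fold_acc <- sapply(seq_len(k), function(i) {
  train <- iris[folds != i, ]
  test  <- iris[folds == i, ]
  fit   <- gnb_fit(train[, 1:4], train$Species)
  mean(gnb_predict(fit, test[, 1:4]) == test$Species)
})

cv_accuracy <- mean(fold_acc)
print(round(fold_acc, 3))
print(cv_accuracy)
```

The averaged figure is an estimate from the training data only, so it can differ from the accuracy later measured on a separate test set.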

b. Make Predictions

I performed a prediction of the model using the test1.transform data. The cross-validated confusion matrix below, which averages over the 10 folds of the training data, gives an accuracy of 98.2%. It is interesting that the accuracy of the model is less than that of both the LDA and QDA models.

## Cross-Validated (10 fold) Confusion Matrix 
## 
## (entries are percentual average cell counts across resamples)
##  
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie      43.7       1.4    0.0
##   Chinstrap    0.4      18.4    0.0
##   Gentoo       0.0       0.0   36.1
##                             
##  Accuracy (average) : 0.9819

c. Create Confusion Matrix

Upon execution of the confusion matrix for the Naïve Bayes model on the test data, the accuracy of the model is 98.5%. The p-value for the test that accuracy exceeds the no-information rate is far below 0.05 (< 2.2e-16), indicating strong evidence that the independent variables predict species better than always guessing the most common class.

## Confusion Matrix and Statistics
## 
##            Reference
## Prediction  Adelie Chinstrap Gentoo
##   Adelie        30         0      1
##   Chinstrap      0        13      0
##   Gentoo         0         0     23
## 
## Overall Statistics
##                                           
##                Accuracy : 0.9851          
##                  95% CI : (0.9196, 0.9996)
##     No Information Rate : 0.4478          
##     P-Value [Acc > NIR] : < 2.2e-16       
##                                           
##                   Kappa : 0.9764          
##                                           
##  Mcnemar's Test P-Value : NA              
## 
## Statistics by Class:
## 
##                      Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity                 1.0000            1.000        0.9583
## Specificity                 0.9730            1.000        1.0000
## Pos Pred Value              0.9677            1.000        1.0000
## Neg Pred Value              1.0000            1.000        0.9773
## Prevalence                  0.4478            0.194        0.3582
## Detection Rate              0.4478            0.194        0.3433
## Detection Prevalence        0.4627            0.194        0.3433
## Balanced Accuracy           0.9865            1.000        0.9792

4. Discussion on the Three Models’ Fit, Strengths, Weakness, and Accuracy

Model	Accuracy	Recall	Specificity	Precision	F1Score	TPR	TNR	FPR
LDA	1.0000000	1	1.000000	1.0000000	1.0000000	1	1.000000	0.000000
QDA	1.0000000	1	1.000000	1.0000000	1.0000000	1	1.000000	0.000000
Naive Bayes	0.9850746	1	0.972973	0.9677419	0.9836066	1	0.972973	0.027027
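
As a worked check, the Naïve Bayes row of the table can be recomputed from the test confusion matrix reported in section 3c. The sketch below treats Adelie as the positive class, which is the class whose by-class statistics match the table values; the helper `class_metrics` is a hypothetical name introduced for illustration.

```r
# Naive Bayes test confusion matrix from section 3c (rows = prediction, columns = reference).
lv <- c("Adelie", "Chinstrap", "Gentoo")
conf <- matrix(c(30,  0,  1,
                  0, 13,  0,
                  0,  0, 23),
               nrow = 3, byrow = TRUE,
               dimnames = list(Prediction = lv, Reference = lv))

# Hypothetical helper: one-vs-rest metrics for a single class.
class_metrics <- function(conf, cls) {
  tp <- conf[cls, cls]
  fp <- sum(conf[cls, ]) - tp   # predicted cls, actually another class
  fn <- sum(conf[, cls]) - tp   # actually cls, predicted another class
  tn <- sum(conf) - tp - fp - fn
  precision <- tp / (tp + fp)
  recall    <- tp / (tp + fn)
  c(Accuracy    = sum(diag(conf)) / sum(conf),
    Recall      = recall,
    Specificity = tn / (tn + fp),
    Precision   = precision,
    F1Score     = 2 * precision * recall / (precision + recall),
    FPR         = fp / (fp + tn))
}

nb_adelie <- class_metrics(conf, "Adelie")
print(round(nb_adelie, 4))
# Accuracy 0.9851, Recall 1, Specificity 0.9730, Precision 0.9677, F1Score 0.9836, FPR 0.0270
```

Recall and TPR are the same quantity, as are Specificity and TNR, which is why those columns repeat in the table.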

I will discuss some findings from this assignment and how they compare to the literature [HOA] [LIN]:

According to the text, LDA and QDA are reliant on the independent variables being normalized prior to executing their respective models. In this assignment I transformed the data by centering and scaling it prior to executing the LDA and QDA. While this produced two models with a perfect accuracy of 100%, it leaves me with great trepidation about whether to rely on these findings. I don’t trust 100% accuracy, and achieving it in this exercise is somewhat disconcerting. By no means am I an expert data scientist, and I cannot simply certify that these models are correct.
As an alternative to the Naïve Bayes function with the 10-fold cross validation procedure that I implemented in this assignment, there is the naiveBayes function from the e1071 package. However, upon executing this function, the accuracy for the model is 100%. Again, this leaves me a bit concerned about my model data selection process.
It has become apparent to me that the feature selection process prior to the execution of the models is probably the most important requirement. I felt it was the most time consuming and required an understanding of the nuances of the data. Unfortunately, there are so many feature selection techniques out there that it is difficult to determine what feature(s) to use in the models for this assignment. Based on my Assignment 1 assessment, it was best to remove the year variable; sex was also a candidate for removal, although it stayed in the final models. It was best to remove highly multicollinear variables such as flipper_length_mm and body_mass_g. It is also assumed that the data had to be normalized prior to execution of the LDA and QDA models. The data was normalized using centering and scaling. Therefore, the data has been prepped thoroughly for the LDA, QDA, and Naïve Bayes models, which may explain the high accuracy rate and other attributes.
Based on the fit statistics I have available at this time, I cannot accurately determine which model is better than the rest. For example, there is no clear distinction between the performance of LDA and QDA. Both models have the same Accuracy, F1-Score, TPR, TNR, and FPR (see table above). How can I accurately determine which model is the best? As for the Naïve Bayes, with the exception of Recall and TPR (which are the same measure and both equal 1), its values are slightly less than 1, and its FPR is slightly above 0, compared with the LDA and QDA models. This would mean that the Naïve Bayes is competitive with the LDA and QDA models. According to the literature, Naïve Bayes works well even with less than perfect data [HOA].
I may be wrong in this assumption, but I don’t think the data set is big enough to adequately determine the strengths and weaknesses of the models presented in this assignment. Perhaps a data set with more variables and more rows would be helpful in determining the relevance of each of the models in this assignment.

References

[HOA] Discriminant Analysis & Naive Bayes. Retrieved from website: http://jennguyen1.github.io/nhuyhoa/statistics/Discriminant-Analysis-Naive-Bayes.html#logistic-regression-vs-discriminant-analysis-vs-naive-bayes

[KEI] A. Kei. R Tutorial 12: LDA, QDA, and KNN for Classification. Retrieved from website: https://www.youtube.com/watch?v=kBFor2dkZxE

[LIN] Linear Discriminant Analysis vs Naive Bayes. Retrieved from website: https://stackoverflow.com/questions/46396552/linear-discriminant-analysis-vs-naive-bayes

[MOH] J. Mohajen. Confusion Matrix for Your Multi-Class Machine Learning Model. Retrieved from website: https://towardsdatascience.com/confusion-matrix-for-your-multi-class-machine-learning-model-ff9aa3bf7826

[MAC] C. Mack. Lecture52 (Data2Decision) Detecting Multicollinearity in R. Retrieved from website: https://www.youtube.com/watch?v=QruEcbgfhzo

[NAI] Naïve Bayes Classifier. UC Business Analytics R Programming Guide. Retrieved from website: https://uc-r.github.io/naive_bayes

[PRA] S. Prabhakaran. Caret Package – A Practical Guide to Machine Learning in R. Retrieved from website: https://www.machinelearningplus.com/machine-learning/caret-package/

[WHA1] What is rank deficiency, and how to deal with it? Retrieved from website: https://stats.stackexchange.com/questions/35071/what-is-rank-deficiency-and-how-to-deal-with-it