Per requirements of Assignment 2, the following results have been determined:
I carried out several data preparation steps before fitting the three models required for Assignment 2. First, I filled in the missing values using the data imputation function mice. Second, I ran the same correlation function from Assignment 1 to identify highly correlated variables and eliminate them from the data set. Unlike in Assignment 1, I actually removed the variables flipper_length_mm and body_mass_g. I had to convert the categorical data into numerical equivalents to run the correlation effectively. Third, I separated the data into training and testing subsets. I preprocessed the training data set by standardizing it with centering and scaling. I then applied this transformation to produce the train1.transform and test1.transform data sets used for my three model predictions. Per Professor Khan’s suggestion, the year variable was eliminated from the data sets.
Linear Discriminant Analysis. Upon execution of the LDA function, the final model was species ~ island + bill_length_mm + bill_depth_mm + sex. Accuracy Rate is: 100%, Precision is 100%, F1-Score is 100%.
Quadratic Discriminant Analysis. Upon execution of the QDA function, the final model was species ~ bill_length_mm + bill_depth_mm + sex. Accuracy Rate is: 100%, Precision is 100%, F1-Score is 100%.
Naïve Bayes. Upon execution of the Naïve Bayes function using a 10-fold cross-validation procedure, the final model is species ~ island + bill_length_mm + bill_depth_mm + sex. Accuracy Rate is: 98.5%, Precision is 96.8%, F1-Score is 98.4%.
I tried to remove any form of gut instinct from the calculations for the models. I resisted the temptation to use any of the highly correlated variables such as flipper_length_mm and body_mass_g. In Assignment 1, I retained flipper_length_mm and body_mass_g even though they were found to be highly correlated, which may have distorted my observations.
If I had retained the highly correlated variables in the models, all of the models would have generated an accuracy of 1. I am uncomfortable with a perfect accuracy score because, in reality, perfect models rarely exist. However, I still ended up with two models (LDA and QDA) with an accuracy of 1.
The only transformation I applied to the remaining data set before running the three models was standardization by centering and scaling. Perhaps it would have been more beneficial to square the variables, take their logarithms, or apply some other transformation to make them better suited to the models presented. However, the resulting accuracies of the models seem to indicate that the variables by themselves are good predictors of species.
Initially, I was a little concerned that the sex variable was still in all of the models that I used for this assignment even though the split between the male and female subsets is nearly even. Last-minute experimentation showed that the sex variable was irrelevant to the final models: it did not affect their accuracy whether I used it or not. I left the variable in the final models due to time constraints.
I performed a feature plot analysis for my selected variables in the QDA section of the assignment. In hindsight, the feature plot analysis could have been performed in the Data Preparation section of the assignment.
Despite my recent conversation with you on March 17, I had to resort to using numerical versions of the two categorical variables, species and island, to assess multicollinearity. This may produce misleading results, but I made sure to use the VIF function to help me with the variable selection process.
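As an illustration of what a VIF measures, here is a minimal base R sketch. It uses the built-in iris data as a stand-in (an assumption, since the penguin data frame is not reproduced here), and vif_manual is a hypothetical helper rather than the VIF function used in the assignment. Each predictor is regressed on the others, and the VIF is 1 / (1 - R^2) of that regression.

```r
# Hypothetical helper: variance inflation factor for each numeric column.
# VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing column j
# on all of the other columns.
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    others <- setdiff(names(df), v)
    fit <- lm(reformulate(others, response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

# iris is only a stand-in for the penguin measurements
vifs <- vif_manual(iris[, 1:4])
print(round(vifs, 2))
```

A value close to 1 suggests little linear overlap with the other predictors, while large values flag candidates for removal.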
The following variables had missing data and I used the MICE function to remediate the missing values. Bear in mind, SEX has only 11 missing values and the other variables had only 2 missing values. Perhaps, it would have been a lot easier to omit the missing data, but I felt that it would be a more truthful data set if we had imputed data values to play with.
BILL_LENGTH_MM
BILL_DEPTH_MM
FLIPPER_LENGTH_MM
BODY_MASS_G
SEX
##
## iter imp variable
## 1 1 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 1 2 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 1 3 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 1 4 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 1 5 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 2 1 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 2 2 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 2 3 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 2 4 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 2 5 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 3 1 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 3 2 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 3 3 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 3 4 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 3 5 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 4 1 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 4 2 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 4 3 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 4 4 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 4 5 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 5 1 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 5 2 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 5 3 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 5 4 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## 5 5 bill_length_mm bill_depth_mm flipper_length_mm body_mass_g sex
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.27 1st Qu.:15.57
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.14
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## flipper_length_mm body_mass_g sex
## Min. :172.0 Min. :2700 female:174
## 1st Qu.:190.0 1st Qu.:3550 male :170
## Median :197.0 Median :4050
## Mean :200.9 Mean :4202
## 3rd Qu.:213.2 3rd Qu.:4756
## Max. :231.0 Max. :6300
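The imputation step can be sketched roughly as follows. This is a minimal example on a toy data frame with made-up values (not the real penguin data), and the object names toy, imp, and toy_imputed are hypothetical. The log above shows 5 iterations of 5 imputations, which is consistent with the mice defaults m = 5 and maxit = 5; picking the first completed data set is an assumption made here for illustration.

```r
# Toy stand-in for the penguin data (made-up values)
set.seed(123)
toy <- data.frame(
  bill_length_mm = rnorm(40, mean = 44, sd = 5),
  bill_depth_mm  = rnorm(40, mean = 17, sd = 2),
  sex            = factor(sample(c("female", "male"), 40, replace = TRUE))
)
toy$bill_length_mm[c(3, 17)] <- NA
toy$sex[c(5, 9, 21)] <- NA

# Count the missing values per column before imputing
missing_counts <- colSums(is.na(toy))
print(missing_counts)

# Impute only if the mice package is installed
if (requireNamespace("mice", quietly = TRUE)) {
  library(mice)
  imp <- mice(toy, m = 5, maxit = 5, seed = 123, printFlag = FALSE)
  # Assumption: use the first of the five completed data sets
  toy_imputed <- mice::complete(imp, 1)
  print(colSums(is.na(toy_imputed)))
}
```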
I used createDataPartition to develop the Training and Test data sets. I separated the training and test data set using 80/20 partition of the imputed data.
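library(caret)  # provides createDataPartition
# Fix the random seed so the split is reproducible
set.seed(123)
# 80/20 split, stratified on species
inTrain1 <- createDataPartition(y = pengtempimputed$species, p = 0.80, list = FALSE)
training1 <- pengtempimputed[inTrain1, ]
testing1 <- pengtempimputed[-inTrain1, ]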
Next, I preprocessed the data by standardizing it with centering and scaling. The parameters were estimated from the training data and then applied to both the training and test data sets that are used for the model predictions.
library(dplyr)  # provides the %>% pipe
# Estimate the centering and scaling parameters from the training data only
preproc.parameter <- training1 %>%
  preProcess(method = c("center", "scale"))
# Transform the data using the estimated parameters
train1.transform <- preproc.parameter %>% predict(training1)
test1.transform <- preproc.parameter %>% predict(testing1)
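To make explicit what centering and scaling does, the following base R sketch reproduces the idea on made-up numbers (the data frames and values are assumptions for illustration only). The means and standard deviations are computed from the training rows and then reused unchanged on the test rows, so the test set never influences the transformation.

```r
# Made-up measurements standing in for the penguin data
set.seed(123)
train <- data.frame(bill_length_mm = rnorm(50, 44, 5),
                    bill_depth_mm  = rnorm(50, 17, 2))
test  <- data.frame(bill_length_mm = rnorm(10, 44, 5),
                    bill_depth_mm  = rnorm(10, 17, 2))

# Parameters are estimated from the training data only
train_means <- colMeans(train)
train_sds   <- apply(train, 2, sd)

# The same parameters are applied to both sets
train_scaled <- scale(train, center = train_means, scale = train_sds)
test_scaled  <- scale(test,  center = train_means, scale = train_sds)

# Training columns now have mean 0 and standard deviation 1;
# test columns are close to, but not exactly, 0 and 1
print(round(colMeans(train_scaled), 10))
print(round(colMeans(test_scaled), 3))
```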
All required data preparation tasks are completed. I now have train1.transform and test1.transform data sets ready for execution and prediction of the three required models.
I executed the LDA function with the final model: species ~ island + bill_length_mm + bill_depth_mm + sex. I did not remove any variables from the final data set nor did I need to transform the variables. I did not remove any of the independent variables because doing so would reduce the accuracy of the prediction when executing the confusion matrix afterwards.
## Call:
## lda(species ~ bill_length_mm + bill_depth_mm + island + sex,
## data = train1.transform)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4404332 0.1985560 0.3610108
##
## Group means:
## bill_length_mm bill_depth_mm islandDream islandTorgersen sexmale
## Adelie -0.9400123 0.6093378 0.4180328 0.3278689 0.5163934
## Chinstrap 0.9046259 0.6472918 1.0000000 0.0000000 0.4909091
## Gentoo 0.6492707 -1.0994026 0.0000000 0.0000000 0.5100000
##
## Coefficients of linear discriminants:
## LD1 LD2
## bill_length_mm 1.5386966 -1.6947394
## bill_depth_mm -1.7533280 -0.7601105
## islandDream -1.4379457 -2.1955665
## islandTorgersen -1.3641784 -0.6341571
## sexmale 0.2388337 1.7261911
##
## Proportion of trace:
## LD1 LD2
## 0.7536 0.2464
I ran ggplot on the lda model and you will notice that the species are distinctly separate from each other. This gives a good indication that the independent variables are accurately predicting the species type.
I performed a prediction of the model using the test1.transform data. Based on the prediction, the accuracy of the model is 100%. I do not like models of 100% accuracy but I ran this several times and still came up with the same conclusion.
##
## Adelie Chinstrap Gentoo
## Adelie 30 0 0
## Chinstrap 0 13 0
## Gentoo 0 0 24
## [1] 1
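The fit, predict, and tabulate steps behind this output can be sketched as below. This is a rough illustration on the built-in iris data rather than the penguin data (an assumption made so the example is self-contained), using lda from the MASS package; swapping lda for qda gives the corresponding QDA sketch.

```r
library(MASS)  # provides lda() and qda()

# Hold out 10 rows per class as a test set (iris is only a stand-in)
set.seed(123)
test_idx <- unlist(lapply(split(seq_len(nrow(iris)), iris$Species),
                          function(i) sample(i, 10)))
train <- iris[-test_idx, ]
test  <- iris[test_idx, ]

# Fit on the training rows, predict the held-out rows
fit  <- lda(Species ~ ., data = train)
pred <- predict(fit, newdata = test)$class

# Cross-tabulate predictions against the true classes
conf <- table(Predicted = pred, Reference = test$Species)
accuracy <- sum(diag(conf)) / sum(conf)
print(conf)
print(accuracy)
```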
Upon execution of the confusion matrix for the LDA, the accuracy of the model is still 100%. The p-value for the test that accuracy exceeds the no-information rate (P-Value [Acc > NIR]) is far below 0.05, which is strong evidence that the model predicts species better than always guessing the most common class.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 30 0 0
## Chinstrap 0 13 0
## Gentoo 0 0 24
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9464, 1)
## No Information Rate : 0.4478
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 1.000 1.0000
## Specificity 1.0000 1.000 1.0000
## Pos Pred Value 1.0000 1.000 1.0000
## Neg Pred Value 1.0000 1.000 1.0000
## Prevalence 0.4478 0.194 0.3582
## Detection Rate 0.4478 0.194 0.3582
## Detection Prevalence 0.4478 0.194 0.3582
## Balanced Accuracy 1.0000 1.000 1.0000
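The overall statistics above can be checked by hand with base R. The sketch below rebuilds the confusion matrix from the printed counts and uses binom.test; treating the reported interval and p-value as an exact binomial interval and a one-sided binomial test against the no-information rate is my reading of the output, and the numbers agree with it.

```r
# Confusion matrix copied from the output above (rows = prediction)
classes <- c("Adelie", "Chinstrap", "Gentoo")
conf <- matrix(c(30,  0,  0,
                  0, 13,  0,
                  0,  0, 24),
               nrow = 3, byrow = TRUE,
               dimnames = list(Prediction = classes, Reference = classes))

n_correct <- sum(diag(conf))
n_total   <- sum(conf)

# No-information rate: share of the most common reference class
nir <- max(colSums(conf)) / n_total

# Exact binomial 95% confidence interval for the accuracy
ci <- binom.test(n_correct, n_total)$conf.int

# One-sided test of accuracy > NIR
p_value <- binom.test(n_correct, n_total, p = nir,
                      alternative = "greater")$p.value

print(round(nir, 4))   # 0.4478
print(round(ci, 4))    # 0.9464 1.0000
print(p_value)
```

With all 67 test penguins classified correctly, the p-value is far smaller than the 2.2e-16 threshold that R prints.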
For the development of the QDA model, I used a conversion of the variables on a scale from 0 to 1 to help with defining a feature plot of the data. A Feature Plot will help us visually see the data and determine what final variable(s) would be used to execute the QDA model. The base model is species ~ island + bill_length_mm + bill_depth_mm + sex. Are there any variables that will not be used in the final QDA model?
To prepare for the feature plot analysis, I performed the following prep activities: 1) created dummy variables by converting any categorical variables to binary variables, 2) created a matrix of predicted values for the dummy independent variables, and 3) rescaled them to values between 0 and 1 [KEI].
## Warning in model.frame.default(Terms, newdata, na.action = na.action, xlev =
## object$lvls): variable 'species' is not a factor
## island.Biscoe island.Dream island.Torgersen bill_length_mm bill_depth_mm
## min 0 0 0 0 0
## max 1 1 1 1 1
The first row of plots clearly shows that bill_length_mm and bill_depth_mm are valid predictors of species. This is evidenced by the differing medians and Q1 to Q3 ranges for the Adelie, Chinstrap, and Gentoo species. A cause for concern may be the Adelie and Chinstrap medians for bill_length_mm, as they are similar. However, the range of data values is different: Adelie has a slightly but visibly larger range than Chinstrap for bill_length_mm.
You will notice that the sex variable is conspicuously absent from the feature plot. This may indicate that the sex variable is inconsequential in the prediction of species.
The island variable is of concern based on the visual observation of the feature plot. It is split into three plots each representing an island. Adelie is present in all three islands. However, Gentoo exists only on the island of Biscoe and Chinstrap exists only on the island of Dream.
I executed the QDA function with the final model: species ~ bill_length_mm + bill_depth_mm + sex. I removed island from the final data set because it was giving me rank deficiency errors for the Chinstrap species. Recall from the feature plot analysis that Chinstrap exists only on the island of Dream. I could have investigated further, but since the data was already transformed, I did not want to make island numerical to see if I could eliminate the rank deficiency problem. Per Professor Khan’s suggestion from Assignment 1, I cannot just eliminate a level from a variable: either I leave the variable in or I eliminate it from the model. I thus eliminated island from the QDA model. I did not remove any more independent variables because doing so would reduce the accuracy of the prediction when executing the confusion matrix afterwards.
## Call:
## qda(species ~ bill_length_mm + bill_depth_mm + sex, data = train1.transform)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4404332 0.1985560 0.3610108
##
## Group means:
## bill_length_mm bill_depth_mm sexmale
## Adelie -0.9400123 0.6093378 0.5163934
## Chinstrap 0.9046259 0.6472918 0.4909091
## Gentoo 0.6492707 -1.0994026 0.5100000
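The rank deficiency that led to dropping island can be illustrated with a small base R sketch. The toy data frame below uses made-up values (an assumption for illustration), but it mirrors the pattern seen in the feature plot: every Chinstrap row sits on Dream. QDA estimates a separate covariance matrix per class, and a predictor that is constant within a class makes that class's matrix singular.

```r
# Toy data: Adelie on all three islands, Chinstrap only on Dream
islands <- c("Biscoe", "Dream", "Torgersen")
toy <- data.frame(
  species = factor(rep(c("Adelie", "Chinstrap"), each = 6)),
  island  = factor(c(rep(islands, 2), rep("Dream", 6)), levels = islands),
  bill_length_mm = c(38.1, 39.5, 37.2, 40.3, 36.7, 41.1,
                     48.7, 50.2, 46.9, 51.3, 49.5, 47.6)
)

# Dummy-code island the way a model formula would (drop the intercept)
X <- model.matrix(~ island + bill_length_mm, data = toy)[, -1]

# Within Chinstrap, the island dummies never vary
chin <- X[toy$species == "Chinstrap", ]
print(apply(chin, 2, var))

# Rank of each class covariance matrix
rank_chinstrap <- qr(cov(chin))$rank
rank_adelie    <- qr(cov(X[toy$species == "Adelie", ]))$rank
print(c(chinstrap = rank_chinstrap, adelie = rank_adelie, needed = ncol(X)))
```

LDA avoids this problem because it pools a single covariance matrix across all classes, which is consistent with island being usable in the LDA model but not in the QDA model.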
I performed a prediction of the model using the test1.transform data. Based on the prediction, the accuracy of the model is 100%. I do not like models of 100% accuracy but I ran this several times and still came up with the same conclusion.
##
## Adelie Chinstrap Gentoo
## Adelie 30 0 0
## Chinstrap 0 13 0
## Gentoo 0 0 24
## [1] 1
Upon execution of the confusion matrix for the QDA, the accuracy of the model is still 100%. The p-value for the test that accuracy exceeds the no-information rate (P-Value [Acc > NIR]) is far below 0.05, which is strong evidence that the model predicts species better than always guessing the most common class.
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 30 0 0
## Chinstrap 0 13 0
## Gentoo 0 0 24
##
## Overall Statistics
##
## Accuracy : 1
## 95% CI : (0.9464, 1)
## No Information Rate : 0.4478
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 1
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 1.000 1.0000
## Specificity 1.0000 1.000 1.0000
## Pos Pred Value 1.0000 1.000 1.0000
## Neg Pred Value 1.0000 1.000 1.0000
## Prevalence 0.4478 0.194 0.3582
## Detection Rate 0.4478 0.194 0.3582
## Detection Prevalence 0.4478 0.194 0.3582
## Balanced Accuracy 1.0000 1.000 1.0000
I executed the Naïve Bayes function using 10-fold cross validation procedure [NAI] with the final model: species ~ island + bill_length_mm + bill_depth_mm + sex. I did not remove any variables from the final data set nor did I need to transform the variables. I did not remove any of the independent variables because doing so would reduce the accuracy of the prediction when executing the confusion matrix afterwards.
library(caret)  # method = "nb" also requires the klaR package
# create response and feature data for ALL fields
x <- train1.transform[, 2:5]
y <- train1.transform$species
# set up 10-fold cross validation procedure
train_control <- trainControl(
  method = "cv",
  number = 10
)
# train model
nb.m1 <- train(
  x = x,
  y = y,
  method = "nb",
  trControl = train_control
)
The 10-fold cross-validated confusion matrix below, computed on the training data, gives an average accuracy of 98.2%. It is interesting that this is lower than the accuracy of both the LDA and QDA models.
## Cross-Validated (10 fold) Confusion Matrix
##
## (entries are percentual average cell counts across resamples)
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 43.7 1.4 0.0
## Chinstrap 0.4 18.4 0.0
## Gentoo 0.0 0.0 36.1
##
## Accuracy (average) : 0.9819
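To show what the 10-fold procedure is doing, here is a hand-rolled sketch in base R. It is not the caret implementation: fit_gnb and predict_gnb are hypothetical helpers for a plain Gaussian Naïve Bayes that handles numeric predictors only, and the built-in iris data stands in for the penguin data. Each fold is held out once, the model is fit on the other nine folds, and the ten held-out accuracies are averaged.

```r
# Hypothetical helper: per-class priors, means, and standard deviations
fit_gnb <- function(x, y) {
  classes <- levels(y)
  list(
    classes = classes,
    prior = as.numeric(table(y)[classes]) / length(y),
    mu    = sapply(classes, function(k) colMeans(x[y == k, , drop = FALSE])),
    sigma = sapply(classes, function(k) apply(x[y == k, , drop = FALSE], 2, sd))
  )
}

# Hypothetical helper: pick the class with the largest log posterior,
# treating the predictors as independent normals within each class
predict_gnb <- function(model, x) {
  scores <- sapply(seq_along(model$classes), function(j) {
    loglik <- sapply(seq_len(ncol(x)), function(p) {
      dnorm(x[, p], mean = model$mu[p, j], sd = model$sigma[p, j], log = TRUE)
    })
    log(model$prior[j]) + rowSums(loglik)
  })
  factor(model$classes[max.col(scores, ties.method = "first")],
         levels = model$classes)
}

# iris is only a stand-in for the penguin data
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# Randomly assign each row to one of 10 folds
set.seed(123)
folds <- sample(rep(1:10, length.out = nrow(x)))

# Hold out each fold in turn and score the model on it
fold_accuracy <- sapply(1:10, function(k) {
  held_out <- folds == k
  model <- fit_gnb(x[!held_out, , drop = FALSE], y[!held_out])
  mean(predict_gnb(model, x[held_out, , drop = FALSE]) == y[held_out])
})

cv_accuracy <- mean(fold_accuracy)
print(cv_accuracy)
```

Because every row is scored by a model that never saw it, the averaged accuracy is usually a more cautious estimate than a single train/test split.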
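Upon execution of the confusion matrix for the Naïve Bayes model on the test1.transform data, the accuracy of the model is 98.5%, with one Gentoo penguin misclassified as Adelie. The p-value for the test that accuracy exceeds the no-information rate (P-Value [Acc > NIR]) is far below 0.05, which is strong evidence that the model predicts species better than always guessing the most common class.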
## Confusion Matrix and Statistics
##
## Reference
## Prediction Adelie Chinstrap Gentoo
## Adelie 30 0 1
## Chinstrap 0 13 0
## Gentoo 0 0 23
##
## Overall Statistics
##
## Accuracy : 0.9851
## 95% CI : (0.9196, 0.9996)
## No Information Rate : 0.4478
## P-Value [Acc > NIR] : < 2.2e-16
##
## Kappa : 0.9764
##
## Mcnemar's Test P-Value : NA
##
## Statistics by Class:
##
## Class: Adelie Class: Chinstrap Class: Gentoo
## Sensitivity 1.0000 1.000 0.9583
## Specificity 0.9730 1.000 1.0000
## Pos Pred Value 0.9677 1.000 1.0000
## Neg Pred Value 1.0000 1.000 0.9773
## Prevalence 0.4478 0.194 0.3582
## Detection Rate 0.4478 0.194 0.3433
## Detection Prevalence 0.4627 0.194 0.3433
## Balanced Accuracy 0.9865 1.000 0.9792
Model | Accuracy | Recall | Specificity | Precision | F1Score | TPR | TNR | FPR | FNR |
---|---|---|---|---|---|---|---|---|---|
LDA | 1.0000000 | 1 | 1.000000 | 1.0000000 | 1.0000000 | 1 | 1.000000 | 0.000000 | 0 |
QDA | 1.0000000 | 1 | 1.000000 | 1.0000000 | 1.0000000 | 1 | 1.000000 | 0.000000 | 0 |
Naive Bayes | 0.9850746 | 1 | 0.972973 | 0.9677419 | 0.9836066 | 1 | 0.972973 | 0.027027 | 0 |
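The Naive Bayes row of the table matches the Adelie class statistics in the confusion matrix output, and the values can be reproduced by hand. The sketch below rebuilds the Naïve Bayes test confusion matrix from the printed counts and computes the metrics for the Adelie class; the variable names are my own.

```r
# Naive Bayes test confusion matrix copied from the output above
classes <- c("Adelie", "Chinstrap", "Gentoo")
conf <- matrix(c(30,  0,  1,
                  0, 13,  0,
                  0,  0, 23),
               nrow = 3, byrow = TRUE,
               dimnames = list(Prediction = classes, Reference = classes))

# One-vs-rest counts for the Adelie class
tp <- conf["Adelie", "Adelie"]
fp <- sum(conf["Adelie", ]) - tp   # predicted Adelie, actually another species
fn <- sum(conf[, "Adelie"]) - tp   # actually Adelie, predicted another species
tn <- sum(conf) - tp - fp - fn

accuracy    <- sum(diag(conf)) / sum(conf)
recall      <- tp / (tp + fn)      # same as TPR
specificity <- tn / (tn + fp)      # same as TNR
precision   <- tp / (tp + fp)
f1          <- 2 * precision * recall / (precision + recall)

print(round(c(accuracy = accuracy, recall = recall, specificity = specificity,
              precision = precision, f1 = f1, fpr = 1 - specificity,
              fnr = 1 - recall), 7))
```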
I will discuss some findings from this assignment and how they compare with the literature [HOA] [LIN]:
According to the text, LDA and QDA assume that the independent variables are approximately normally distributed within each class. In this assignment I standardized the data by centering and scaling it before running the LDA and QDA; this puts the variables on a common scale but does not by itself make them normally distributed. While this produced two models with a perfect accuracy of 100%, I am hesitant to rely on these findings. I don’t trust 100% accuracy, and achieving it in this exercise is somewhat disconcerting. By no means am I an expert data scientist, and I cannot simply declare that these models are correct.
As an alternative to the Naïve Bayes function with the 10-fold cross-validation procedure that I implemented in this assignment, there is the naiveBayes function from the e1071 package. However, when I ran this function, the accuracy of the model was 100%. Again, this leaves me a bit concerned about my feature selection process.
Before the models are run, it has become apparent to me that the feature selection process is probably the most important requirement. I felt it was the most time-consuming step and required an understanding of the nuances of the data. Unfortunately, there are so many feature selection techniques that it is difficult to determine which feature(s) to use in the models for this assignment. Based on my Assignment 1 assessment, it was best to remove the year variable (sex, by contrast, was retained in the final models, as noted above). It was also best to remove highly multicollinear variables such as flipper_length_mm and body_mass_g. I also assumed that the data had to be standardized before running the LDA and QDA models, which I did using centering and scaling. The data has therefore been prepped thoroughly for the LDA, QDA, and Naïve Bayes models, which likely explains the high accuracy rates and other metrics.
Based on the fit statistics I have available at this time, I cannot accurately determine which model is better than the rest. For example, there is no clear distinction between the performance of LDA and QDA. Both models have the same Accuracy, F1-Score, TPR, FPR, TNR, FNR, etc. (see table above). How can I accurately determine which model is the best? As for Naïve Bayes, with the exception of Recall, TPR, and FNR (which match the LDA and QDA values), its metrics are only slightly worse than those of the LDA and QDA models: Accuracy, Specificity, Precision, F1-Score, and TNR fall slightly below 1, and FPR rises slightly above 0. This means that Naïve Bayes is competitive with the LDA and QDA models. According to the literature, Naïve Bayes works well even with less than perfect data [HOA].
I may be wrong in this assumption, but I don’t think the data set is big enough to adequately determine the strengths and weaknesses of the models presented in this assignment. Perhaps a data set with more variables and more rows would be helpful in determining the relevance of each of the models in this assignment.
[HOA] Discriminant Analysis & Naive Bayes. Retrieved from website: http://jennguyen1.github.io/nhuyhoa/statistics/Discriminant-Analysis-Naive-Bayes.html#logistic-regression-vs-discriminant-analysis-vs-naive-bayes
[KEI] A. Kei. R Tutorial 12: LDA, QDA, and KNN for Classification. Retrieved from website: https://www.youtube.com/watch?v=kBFor2dkZxE
[LIN] Linear Discriminant Analysis vs Naive Bayes. Retrieved from website: https://stackoverflow.com/questions/46396552/linear-discriminant-analysis-vs-naive-bayes
[MOH] J. Mohajen. Confusion Matrix for Your Multi-Class Machine Learning Model. Retrieved from website: https://towardsdatascience.com/confusion-matrix-for-your-multi-class-machine-learning-model-ff9aa3bf7826
[MAC] C. Mack. Lecture52 (Data2Decision) Detecting Multicollinearity in R. Retrieved from website: https://www.youtube.com/watch?v=QruEcbgfhzo
[NAI] Naïve Bayes Classifier. UC Business Analytics R Programming Guide. Retrieved from website: https://uc-r.github.io/naive_bayes
[PRA] S. Prabhakaran. Caret Package – A Practical Guide to Machine Learning in R. Retrieved from website: https://www.machinelearningplus.com/machine-learning/caret-package/
[WHA1] What is rank deficiency, and how to deal with it? Retrieved from website: https://stats.stackexchange.com/questions/35071/what-is-rank-deficiency-and-how-to-deal-with-it