You want to evaluate all the ‘features’ (predictor variables) and see which should be in your model. Please comment on your choices. Just a suggestion: you might want to consider exploring featurePlot() in the caret package. Basically, you look at each of the features and see how they differ by species. Simply eyeballing this might give you an idea of which would be strong ‘classifiers’ (i.e. predictors).
First, we’ll save the penguin data as a dataframe so we can manipulate it more easily in the next steps. We’ll also omit any rows with missing data, follow the recommendation to remove the year variable, and create a factor version of our species variable:
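A minimal sketch of that preparation step, assuming the data come from the palmerpenguins package and that caret, MASS, and klaR are loaded here for the functions used later:
library(palmerpenguins)
library(caret)   # featurePlot(), createDataPartition(), confusionMatrix()
library(MASS)    # lda(), qda()
library(klaR)    # NaiveBayes(), partimat()

# convert to a plain dataframe, drop rows with missing values,
# remove the year column, and add a factor copy of species
penguins <- as.data.frame(palmerpenguins::penguins)
penguins <- na.omit(penguins)
penguins <- penguins[, names(penguins) != "year"]
penguins$species_factor <- factor(penguins$species)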
Next, we will use the caret package and the featurePlot() function to take a look at all of the available continuous features in the penguins dataset:
featurePlot(
  x = penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")],
  y = penguins$species_factor,
  plot = "density",
  scales = list(
    x = list(relation = "free"),
    y = list(relation = "free")
  ),
  adjust = 1.5,
  pch = "|",
  layout = c(2, 2),
  auto.key = list(columns = 3)
)
featurePlot(
x = penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")],
y = penguins$species_factor,
plot = "ellipse",
auto.key = list(columns = 3)
)
featurePlot(
  x = penguins[, c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")],
  y = penguins$species_factor,
  plot = "box",
  scales = list(y = list(relation = "free"),
                x = list(rot = 90)),
  layout = c(4, 1)
)
Interestingly, when we break out the density plots by species, we can see that the distributions of bill_depth_mm and body_mass_g overlap considerably between Adelie and Chinstrap penguins (blue and green), while Gentoo penguins stand apart from the other two species on those two features. For bill_length_mm and flipper_length_mm, however, there is less overlap between the distribution curves, suggesting that they may serve as better predictors of our species variable. The ellipse and box plots tell the same story: less overlap for flipper_length_mm and bill_length_mm, and more clustering of Adelie and Chinstrap for bill_depth_mm and body_mass_g. Since the goal is to build a good classifier, we want features that effectively distinguish the species. Therefore, for our LDA, QDA, and Naive Bayes models, I’ve chosen bill_length_mm and flipper_length_mm as my predictors for species.
Splitting into training and testing datasets
In our first step, we’ll run an LDA model. Before fitting it, however, we’ll split our penguins dataset into training and testing sets so that we can validate our models and see how well their predictions hold up on a hold-out test set:
set.seed(12345)
# utilizing one dataset for all four models
penguins_partition <- createDataPartition(penguins$species, p=0.7, list=FALSE)
penguins_training <- penguins[penguins_partition,]
penguins_testing <- penguins[-penguins_partition,]
Fit your LDA model using whatever predictor variables you deem appropriate. Feel free to split the data into training and test sets before fitting the model.
Now, with our dataset split into a training and test set, we’ll use the training set to run our initial LDA:
penguins_lda_training <- lda(species ~ bill_length_mm + flipper_length_mm, data = penguins_training)
penguins_lda_training
## Call:
## lda(species ~ bill_length_mm + flipper_length_mm, data = penguins_training)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4382979 0.2042553 0.3574468
##
## Group means:
## bill_length_mm flipper_length_mm
## Adelie 38.91456 190.4175
## Chinstrap 48.76042 195.4167
## Gentoo 47.74881 217.2143
##
## Coefficients of linear discriminants:
## LD1 LD2
## bill_length_mm 0.08245312 -0.3647561
## flipper_length_mm 0.12687541 0.1113311
##
## Proportion of trace:
## LD1 LD2
## 0.7067 0.2933
From the output of our LDA fit, at the top we can see the prior probabilities, i.e. the proportion of each species in the training set, followed by the group means for each predictor. Below those are the coefficients of the linear discriminants for bill_length_mm and flipper_length_mm; these coefficients define the discriminant functions that form the decision boundaries. Since we have K = 3 classes, LDA produces K - 1 = 2 linear discriminants, so we can plot the observations in two-dimensional LD1/LD2 space, as shown below:
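One way to produce an LD1/LD2 scatterplot along these lines (a sketch of the plotting step; the colour and legend choices are assumptions):
lda_scores <- predict(penguins_lda_training)$x   # LD1/LD2 scores for the training observations
plot(lda_scores[, 1], lda_scores[, 2],
     col = as.numeric(factor(penguins_training$species)), pch = 19,
     xlab = "LD1", ylab = "LD2")
legend("topleft", legend = levels(factor(penguins_training$species)), col = 1:3, pch = 19)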
As we can see from the plots of LD1 against LD2, there is fairly distinct clustering of the three species: Adelie predictions fall in the upper left, Chinstrap in the lower middle, and Gentoo in the upper right.
Look at fit statistics/accuracy rates.
We’ll now use this model to generate predictions on our training and testing datasets. First, we can take a look at the first ten posterior probabilities from the fitted model on the training data:
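These posteriors can be obtained roughly as follows (rounding to four digits is an assumption made to match the printed output):
penguins_lda_posterior <- predict(penguins_lda_training)$posterior   # posteriors on the training data
head(round(penguins_lda_posterior, 4), 10)                           # first ten rows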
## Adelie Chinstrap Gentoo
## 1 0.9913 0.0087 0.0000
## 2 0.9934 0.0066 0.0000
## 3 0.9929 0.0044 0.0028
## 5 0.9997 0.0001 0.0002
## 6 0.9971 0.0028 0.0002
## 8 0.9971 0.0011 0.0017
## 14 0.9988 0.0010 0.0002
## 15 0.9989 0.0000 0.0011
## 16 0.9998 0.0002 0.0000
## 17 0.9980 0.0006 0.0014
From the above, we can see that the model provides an estimated posterior probability for each individual and each species; the class with the highest probability is the predicted class. Looking at the first ten predictions in the training dataset, all 10 are predicted to be Adelie based on their bill_length_mm and flipper_length_mm values and the resulting probabilities calculated by the LDA model.
We can now use our model to make predictions on our holdout test dataset in order to correctly evaluate its performance:
penguins_predictions_testing <- predict(penguins_lda_training, penguins_testing)$class
Then, after generating predictions on our hold-out testing dataset, we can determine how many penguins our LDA model misclassified by comparing the predicted classes to the actual species column in the testing set.
To do this, we’ll first create a function, misclassified_penguins(), which accepts two arguments, actual and predicted, and compares them element-wise to determine whether each prediction matches the true species.
misclassified_penguins <- function(actual, predicted) {
  misclass_rate <- mean(actual != predicted)   # proportion of mismatches
  print(table(actual != predicted))            # FALSE = classified correctly, TRUE = misclassified
  print(paste0('Percent misclassified: ', round(misclass_rate, 2) * 100, '%'))
  print(paste0('Percent classified correctly: ', round(1 - misclass_rate, 2) * 100, '%'))
}
misclassified_penguins(actual = penguins_testing$species, predicted = penguins_predictions_testing)
##
## FALSE TRUE
## 95 3
## [1] "Percent misclassified: 3%"
## [1] "Percent classified correctly: 97%"
We can see above that only about 3% (3 of 98 instances) were misclassified when using the LDA model on our testing set! Therefore, our overall accuracy is about 97%, not bad!
Next, we’ll create a confusion matrix and take a look at other fit statistics to further evaluate our LDA model.
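A sketch of how the confusion matrix and summary statistics could be computed with caret's confusionMatrix(); macro-averaging the per-class values is an assumption about how the single F1, sensitivity, and specificity figures in the table were derived:
lda_cm <- confusionMatrix(penguins_predictions_testing, factor(penguins_testing$species))
lda_cm$table                                                        # confusion matrix
lda_cm$overall["Accuracy"]                                          # overall accuracy
colMeans(lda_cm$byClass[, c("Sensitivity", "Specificity", "F1")])   # macro-averaged per-class statistics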
| Analysis | Accuracy | F1 | Sensitivity | Specificity |
|---|---|---|---|---|
| Linear Discriminant Analysis (LDA) | 0.9693878 | 0.9654449 | 0.9589147 | 0.9833574 |
From the fit statistics above, we can see that the accuracy of our LDA model was around 97% on our testing dataset. Additionally, we had a high F1 score of 0.97, and the sensitivity and specificity values were also quite high at 0.96 and 0.98 respectively. Overall, we can see that this model performed quite well given these fit statistics.
When looking at the confusion matrix below, we can see that the model classified all Chinstrap individuals correctly, misidentified only one Adelie individual as a Chinstrap, and misidentified two Gentoo individuals (one of them as an Adelie). The slightly higher misclassification rate for Gentoo individuals would be something to keep in mind, especially if this model were applied to a larger dataset. The slight class imbalance between the three species in our training dataset could also account for a few of these discrepancies.
Given that we’ve already done a fair amount of exploratory analysis using the featurePlot() function in the caret package, and we’ve split our penguins dataset into a viable training and testing dataset, we can move along to fitting a quadratic discriminant analysis model to our data.
In order to do this, we’ll continue to use the MASS package and the qda() function:
penguins_qda_training <- qda(species ~ bill_length_mm + flipper_length_mm, data = penguins_training)
penguins_qda_training
## Call:
## qda(species ~ bill_length_mm + flipper_length_mm, data = penguins_training)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4382979 0.2042553 0.3574468
##
## Group means:
## bill_length_mm flipper_length_mm
## Adelie 38.91456 190.4175
## Chinstrap 48.76042 195.4167
## Gentoo 47.74881 217.2143
After fitting our QDA model, we can take a look at the partitions created below:
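The partition plots referenced in this section can be drawn with klaR's partimat(), mirroring the call used later for the Naive Bayes model (the colour arguments are carried over from that call):
# QDA partition plot
partimat(species ~ bill_length_mm + flipper_length_mm, data = penguins_training,
         method = "qda", col.correct = 'gray', col.wrong = 'red')

# LDA partition plot, for comparison
partimat(species ~ bill_length_mm + flipper_length_mm, data = penguins_training,
         method = "lda", col.correct = 'gray', col.wrong = 'red')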
As we can see from the QDA partition plot, the model creates distinct decision boundaries. Comparing them with the LDA partition plot, the shape of the boundaries is quite different: where the LDA model produces linear decision boundaries, the QDA model produces quadratic ones.
Now, in order for us to evaluate our QDA model, we’ll have to generate predictions on our holdout test set again:
penguins_predictions_testing_qda <- predict(penguins_qda_training, penguins_testing)$class
Similar to our LDA process, we can now determine how many penguins were misclassified from our QDA model by looking at the class predictions generated from the model and comparing them to the actual species column present in our testing set.
In order to do this, we’ll use our misclassified_penguins() function again, which accepts two arguments of “actual” and “predicted”.
misclassified_penguins(actual = penguins_testing$species, predicted = penguins_predictions_testing_qda)
##
## FALSE TRUE
## 96 2
## [1] "Percent misclassified: 2%"
## [1] "Percent classified correctly: 98%"
We can see above that only about 2% (2 of 98 instances) were misclassified when using the QDA model on our testing set! Therefore, our overall accuracy is about 98%, even better than our LDA model.
Next, we’ll create a confusion matrix and take a look at other fit statistics to further evaluate our QDA model.
| Analysis | Accuracy | F1 | Sensitivity | Specificity |
|---|---|---|---|---|
| Linear Discriminant Analysis (LDA) | 0.9693878 | 0.9654449 | 0.9589147 | 0.9833574 |
| Quadratic Discriminant Analysis (QDA) | 0.9795918 | 0.9790062 | 0.9755814 | 0.9886484 |
From the fit statistics above, we can see that the accuracy of our QDA model was around 98% on our testing dataset, which was slightly higher than the accuracy on our LDA model. Additionally, we had a high F1 score of 0.98, and the sensitivity and specificity values were also quite high at 0.98 and 0.99 respectively. Overall, we can see that this model performed quite well given these fit statistics, and slightly better than our LDA model.
When looking at the confusion matrix below, we can see that the model classified all Chinstrap individuals correctly, only misidentified one Adelie individual as a Chinstrap species, and misidentified one Gentoo individual as an Adelie species.
So far, out of our two models, it looks like our QDA model performed slightly better on our testing dataset.
In a similar process to our LDA and QDA analysis, we’ll do a final model using Naive Bayes classification on our penguins dataset. In order to do this, we can use the NaiveBayes() function in the klaR package.
penguins_nb_training <- NaiveBayes(species ~ bill_length_mm + flipper_length_mm, data = penguins_training)
penguins_nb_training$apriori
## grouping
## Adelie Chinstrap Gentoo
## 0.4382979 0.2042553 0.3574468
penguins_nb_training$tables
## $bill_length_mm
## [,1] [,2]
## Adelie 38.91456 2.706796
## Chinstrap 48.76042 3.420323
## Gentoo 47.74881 3.204007
##
## $flipper_length_mm
## [,1] [,2]
## Adelie 190.4175 6.461840
## Chinstrap 195.4167 7.124526
## Gentoo 217.2143 6.858416
Although I initially thought the means and standard deviations in the Naive Bayes output might differ from those in the LDA and QDA outputs, the group means (the first column of each table above) are identical to those reported by the other two models; the second column gives each group’s standard deviation for that feature.
Additionally, the partition plot for this model shows that Naive Bayes creates decision boundaries that aren’t quite as drastic as the QDA boundaries, but that still have a slight quadratic shape to them.
partimat(species ~ bill_length_mm + flipper_length_mm, data = penguins_training,
         method = "naiveBayes", col.correct = 'gray', col.wrong = 'red')
Similar to the above processes, in order for us to evaluate our Naive Bayes model and compare it against our other two, we’ll have to generate predictions on our holdout test set again:
penguins_predictions_testing_nb <- predict(penguins_nb_training, penguins_testing)$class
Similar to our LDA and QDA process, we can now determine how many penguins were misclassified from our Naive Bayes model by looking at the class predictions generated from the model and comparing them to the actual species column present in our testing set.
In order to do this, we’ll use our misclassified_penguins() function one last time, which accepts two arguments of “actual” and “predicted”.
misclassified_penguins(actual = penguins_testing$species, predicted = penguins_predictions_testing_nb)
##
## FALSE TRUE
## 93 5
## [1] "Percent misclassified: 5%"
## [1] "Percent classified correctly: 95%"
We can see above that about 5% (5 of 98 instances) were misclassified when using the Naive Bayes model on our testing set. Therefore, our overall accuracy is about 95%, which is worse than both our LDA and QDA models.
Next, we’ll create a confusion matrix and take a look at other fit statistics to further evaluate our Naive Bayes model.
| Analysis | Accuracy | F1 | Sensitivity | Specificity |
|---|---|---|---|---|
| Linear Discriminant Analysis (LDA) | 0.9693878 | 0.9654449 | 0.9589147 | 0.9833574 |
| Quadratic Discriminant Analysis (QDA) | 0.9795918 | 0.9790062 | 0.9755814 | 0.9886484 |
| Naive Bayes (NB) | 0.9489796 | 0.9394489 | 0.9344961 | 0.9737929 |
From the fit statistics above, we can see that the accuracy of our Naive Bayes model was around 95% on our testing dataset, which was slightly lower than the accuracy on our LDA and QDA models. Additionally, we had an F1 score of 0.94, and the sensitivity and specificity values were at 0.93 and 0.97 respectively. Overall, we can see that this model performed worse than our other two previous models given these fit statistics.
When looking at the confusion matrix below, we can see that the model classified all but one Chinstrap individual correctly (that one was misclassified as an Adelie), misidentified one Adelie individual as a Chinstrap, and misidentified three Gentoo individuals (two as Chinstrap and one as an Adelie).
After running our LDA, QDA and Naive Bayes analysis on our penguins dataset, we can discuss a few different items pertaining to each model’s performance. When we look at a final table below that displays the accuracy and fit statistics of the three models, we can see slight differences across all three:
| Analysis | Accuracy | F1 | Sensitivity | Specificity |
|---|---|---|---|---|
| Linear Discriminant Analysis (LDA) | 0.9693878 | 0.9654449 | 0.9589147 | 0.9833574 |
| Quadratic Discriminant Analysis (QDA) | 0.9795918 | 0.9790062 | 0.9755814 | 0.9886484 |
| Naive Bayes (NB) | 0.9489796 | 0.9394489 | 0.9344961 | 0.9737929 |
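For completeness, a comparison table like the one above could be assembled as follows, assuming confusionMatrix objects lda_cm, qda_cm, and nb_cm (the latter two names are hypothetical) were created for each model in the same way as the LDA sketch earlier:
# macro-averaged summary row for a single confusionMatrix object
summarise_cm <- function(cm, label) {
  data.frame(Analysis    = label,
             Accuracy    = unname(cm$overall["Accuracy"]),
             F1          = mean(cm$byClass[, "F1"], na.rm = TRUE),
             Sensitivity = mean(cm$byClass[, "Sensitivity"], na.rm = TRUE),
             Specificity = mean(cm$byClass[, "Specificity"], na.rm = TRUE))
}

rbind(summarise_cm(lda_cm, "Linear Discriminant Analysis (LDA)"),
      summarise_cm(qda_cm, "Quadratic Discriminant Analysis (QDA)"),
      summarise_cm(nb_cm,  "Naive Bayes (NB)"))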
Naive Bayes didn’t perform well in this assignment, relative to our other models
When looking at the accuracy of all three models, our Naive Bayes model underperformed on every fit statistic compared to our LDA and QDA analyses. Additionally, when we mapped out the predictions in a confusion matrix, there were slightly more misclassified instances in our Naive Bayes predictions than in our LDA or QDA predictions. This follows a trend in the literature suggesting that Naive Bayes is often better suited to very high-dimensional datasets. Given that we used only two predictors in this assignment, we weren’t able to harness the full power of Naive Bayes. However, if we were choosing between Naive Bayes, LDA, and QDA for a problem with a much sparser, higher-dimensional feature space (e.g. text classification over a large corpus), Naive Bayes would be a much more competitive choice.
QDA had a slight edge in performance over LDA
With this in mind, both LDA and QDA performed quite well on our training and testing datasets in this assignment. Although QDA had a slight edge in accuracy and the other fit statistics over our LDA model, it’s important to note that on another hold-out dataset the simplicity of the LDA model might provide more stable predictions than the QDA model. Regardless, either the QDA or the LDA model would be a good pick!