The purpose of the assignment was to explore linear discriminant analysis (LDA), quadratic discriminant analysis (QDA), and naive Bayes as applied to the Palmer penguin dataset.
……………………………………………………………………..
Before we explore LDA, QDA, and naive Bayes as applied to the Palmer penguin data, we first have to re-familiarize ourselves with the dataset and evaluate all ‘features’ to see what should be in our model.
We’ll perform EDA, select our features based on pertinence and correlation, and then check that selection against the output of the caret package’s featurePlot.
First, we load in the data and re-familiarize ourselves with the penguins dataset:
#We're going to deal with the Palmer penguins dataset:
library(palmerpenguins)
library(tidyverse) #provides dplyr, tidyr, and ggplot2, used throughout
data(penguins)
#Perform 'light EDA' to familiarize ourselves with the dataset:
glimpse(penguins)
summary(penguins)
We utilize the built-in glimpse() and summary() functions to gain insight into the dimensions, variable characteristics, and value range for our penguin dataset.
We observe a non-pertinent ‘year’ variable as well as NA values in varying quantities across a number of our variables. We also see that we’re dealing with (3) factor and (5) numeric variables. As a next step, we’ll visualize, explore further, and elect which features to carry forward and which to drop.
We visualize and explore our dataset with a scatter plot, a simple count() table, and a correlation matrix:
#Scatter plot of all explanatory variables vs. response
#(gather() stacks every predictor into a metric/value pair; mixed factor and numeric columns are coerced)
tall <- gather(penguins, metric, value, -species)
ggplot(tall, aes(x = value, y = species)) +
  geom_point(alpha = .2) +
  geom_smooth(method = 'lm', se = FALSE) +
  facet_wrap(~metric, scales = 'free')
The scatter plot above provides an early indication of which variables should be in our model; notably, flipper_length_mm and body_mass_g display very similar patterns across species (we return to this below), while year appears to offer little separation.
Since we already planned to exclude the year variable, we’ll explore whether there’s merit to excluding sex as well.
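The count table below groups the penguins by species and sex; a minimal way to produce it (assuming the tidyverse is loaded, as above) would be something like:
#Count penguins by species and sex
penguins %>%
  group_by(species, sex) %>%
  count()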
## # A tibble: 8 x 3
## # Groups:   species, sex [8]
##   species   sex        n
##   <fct>     <fct>  <int>
## 1 Adelie    female    73
## 2 Adelie    male      73
## 3 Adelie    <NA>       6
## 4 Chinstrap female    34
## 5 Chinstrap male      34
## 6 Gentoo    female    58
## 7 Gentoo    male      61
## 8 Gentoo    <NA>       5
It appears that there is: sex follows a roughly 50/50 split within every species, so it tells us little about which species a penguin belongs to, and dropping it simplifies the model without giving up predictive information.
Next we explore a correlation matrix to dig a little deeper into the similarity between flipper length and body mass noted in the scatter plots above:
#Correlation matrix
library(corrplot)
pen_corr <- penguins %>%
  select_if(is.numeric) %>%
  drop_na() %>%
  cor()
pen_corr
##                   bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm        1.00000000   -0.23505287         0.6561813  0.59510982
## bill_depth_mm        -0.23505287    1.00000000        -0.5838512 -0.47191562
## flipper_length_mm     0.65618134   -0.58385122         1.0000000  0.87120177
## body_mass_g           0.59510982   -0.47191562         0.8712018  1.00000000
## year                  0.05454458   -0.06035364         0.1696751  0.04220939
##                          year
## bill_length_mm     0.05454458
## bill_depth_mm     -0.06035364
## flipper_length_mm  0.16967511
## body_mass_g        0.04220939
## year               1.00000000
flipper_length_mm is strongly correlated with body_mass_g (r = 0.87), so only one of the two should be carried forward in our model. We elect body_mass_g since it has weaker correlations with the remaining variables than flipper_length_mm does.
Thus, we’ll drop flipper_length_mm, sex, and year from consideration:
#Drop non-pertinent variables
skinny_penguins <- dplyr::select(penguins, -c(sex, year, flipper_length_mm))
#head(skinny_penguins) #verification
As a final step in pre-processing our data, we have to deal with missing values.
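The NA counts shown below can be produced with the same per-column check we use to verify the imputation later on:
#Count NA values per column
colSums(is.na(skinny_penguins))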
##        species         island bill_length_mm  bill_depth_mm    body_mass_g
##              0              0              2              2              2
From above, we see that we’re missing 2 values in each of 3 separate features and that these features are all numeric. Rather than dropping the affected rows, we can impute using the pmm method (predictive mean matching):
Predictive mean matching calculates a predicted value of the target variable for every case and, for each missing value, forms a small set of “candidate donors” from the complete cases whose predicted values are closest to the prediction for the missing entry. A donor is then randomly chosen from the candidates and its observed value is imputed where the value was missing. To apply pmm we assume that the distribution of the missing cells is the same as that of the observed data, so the approach becomes less reliable as the percentage of missing values grows.
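As a rough illustration of the donor idea, here is a toy sketch for a single missing cell (an illustration only, not the mice internals; the function name pmm_one is made up):
#Toy sketch of pmm: find the k complete cases whose predicted values are
#closest to the prediction for the missing cell, then borrow one of their
#observed values at random.
pmm_one <- function(pred_missing, pred_obs, y_obs, k = 5) {
  donors <- order(abs(pred_obs - pred_missing))[1:k] #k closest candidate donors
  y_obs[sample(donors, 1)]                           #randomly chosen donor value is imputed
}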
Once we’ve imputed missing values into our training dataset and returned the data in proper form, we verify whether our operation was successful:
#Impute missing values
library(mice)
skinny_penguins <- mice(data = skinny_penguins, m = 1, method = "pmm", seed = 500)
##
## iter imp variable
## 1 1 bill_length_mm bill_depth_mm body_mass_g
## 2 1 bill_length_mm bill_depth_mm body_mass_g
## 3 1 bill_length_mm bill_depth_mm body_mass_g
## 4 1 bill_length_mm bill_depth_mm body_mass_g
## 5 1 bill_length_mm bill_depth_mm body_mass_g
skinny_penguins <- mice::complete(skinny_penguins, 1)
#verify absence of NA values in the dataset
colSums(is.na(skinny_penguins))
##        species         island bill_length_mm  bill_depth_mm    body_mass_g
##              0              0              0              0              0
The presence of all 0’s above confirms the success of imputation.
At this point we’ve dealt with NA values and non-pertinent variables. Our model has simplified from 7 independent variables to the 4 noted below:
Before we train-test split our data and move on to exploring each generative model, we’ll verify our variable selection using the caret package’s featurePlot:
library(caret)
featurePlot(x = penguins[,-1],
            y = penguins$species,
            plot = "pairs",
            ## Add a key at the top
            auto.key = list(columns = 3))
The resulting scatter plot matrix, while a bit hard to read and digest, appears to confirm our variable selection, so we proceed with a train-test split:
#Split the data into 80% training and 20% test sets
set.seed(123)
partition <- skinny_penguins$species %>%
  createDataPartition(p = 0.8, list = FALSE)
train <- skinny_penguins[partition, ]
test <- skinny_penguins[-partition, ]
With 80% of our data in the training set and 20% of our data in the testing set, we’re prepared to fit our data with the generative models at hand.
……………………………………………………………………..
LDA uses Bayes’ Theorem to estimate class probabilities: for each observation, the posterior probability of membership in each class is computed, and the class with the highest posterior probability is cast as the prediction.
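Concretely, for a class k with prior probability P(k) and class-conditional density f_k(x), Bayes’ Theorem gives the posterior
P(class = k | x) = P(k) * f_k(x) / [ P(1) * f_1(x) + ... + P(K) * f_K(x) ]
and LDA models each f_k as a normal density with a class-specific mean but a single covariance matrix shared across all classes.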
We fit our model using the built-in lda() function, pass our training and testing data through the model to generate class predictions (taking the class), store those predictions, compare them with the actual labels, and tabulate the results.
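The fitted object summarized below is named pen_lda here (the same name used by the prediction code further down); assuming the MASS package is loaded for lda(), the fit is:
library(MASS)
#Fit LDA on the four remaining predictors
pen_lda <- lda(species ~ ., data = train)
pen_lda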
## Call:
## lda(species ~ ., data = train)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4404332 0.1985560 0.3610108
##
## Group means:
##           islandDream islandTorgersen bill_length_mm bill_depth_mm body_mass_g
## Adelie      0.4180328       0.3278689       38.79180      18.35738    3715.574
## Chinstrap   1.0000000       0.0000000       48.80364      18.41636    3748.182
## Gentoo      0.0000000       0.0000000       47.42200      14.98500    5069.000
##
## Coefficients of linear discriminants:
##                          LD1          LD2
## islandDream     -1.347018837 -1.424579794
## islandTorgersen -0.943921338 -0.063966723
## bill_length_mm   0.104035334 -0.384034721
## bill_depth_mm   -0.994305097  0.100652724
## body_mass_g      0.001918717  0.001339994
##
## Proportion of trace:
## LD1 LD2
## 0.819 0.181
pen_lda_trn_pred = predict(pen_lda, train)$class
pen_lda_tst_pred = predict(pen_lda, test)$class
calc_class_err = function(actual, predicted) {
  mean(actual != predicted)
}
calc_class_err(predicted = pen_lda_trn_pred, actual = train$species)
## [1] 0
calc_class_err(predicted = pen_lda_tst_pred, actual = test$species)
## [1] 0
table(predicted = pen_lda_tst_pred, actual = test$species)
##            actual
## predicted   Adelie Chinstrap Gentoo
##   Adelie        30         0      0
##   Chinstrap      0        13      0
##   Gentoo         0         0     24
Based on the error calculations above, our results are as follows: the LDA model misclassified 0% of the training observations and 0% of the test observations.
Based on the Confusion Matrix for our test set, all 30 Adelie, 13 Chinstrap, and 24 Gentoo test penguins were classified correctly.
We then take the sum of the diagonal of our Confusion Matrix (our true classifications) and put that over the sum of all values. As you may have expected, the output is (30 + 13 + 24) / 67 = 1.0, or 100% test accuracy.
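As a quick sanity check, that same accuracy can be computed directly in R from the confusion matrix; lda_tbl below is just an illustrative name for the table shown above:
#Test accuracy = sum of diagonal / sum of all cells
lda_tbl <- table(predicted = pen_lda_tst_pred, actual = test$species)
sum(diag(lda_tbl)) / sum(lda_tbl) #(30 + 13 + 24) / 67 = 1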
QDA is a more general version of LDA. It’s important to note the nominal difference: linear vs. quadratic. LDA assumes that every class shares the same covariance matrix, which forces the decision boundary to be linear. QDA, by comparison, allows each class its own covariance matrix, which permits a nonlinear, or quadratic, decision boundary.
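One standard way to see this is through the discriminant score each method maximizes over the classes, where P(k) is the class prior, mu_k the class mean vector, and Sigma (or Sigma_k) the covariance matrix:
QDA: d_k(x) = -1/2 * log|Sigma_k| - 1/2 * (x - mu_k)' * Sigma_k^-1 * (x - mu_k) + log P(k)
LDA: d_k(x) = x' * Sigma^-1 * mu_k - 1/2 * mu_k' * Sigma^-1 * mu_k + log P(k)
When every class shares the same Sigma, the term that is quadratic in x is identical across classes and drops out of the comparison, which is what leaves LDA with a linear boundary.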
We fit our model using the built-in qda() function, pass our training and testing data through the model to generate class predictions (taking the class), store those predictions, compare them with the actual labels, and tabulate the results:
# Fit the model based on 2 variables
pen_qda = qda(species ~ bill_length_mm + bill_depth_mm, data = train)
pen_qda
## Call:
## qda(species ~ bill_length_mm + bill_depth_mm, data = train)
##
## Prior probabilities of groups:
## Adelie Chinstrap Gentoo
## 0.4404332 0.1985560 0.3610108
##
## Group means:
## bill_length_mm bill_depth_mm
## Adelie 38.79180 18.35738
## Chinstrap 48.80364 18.41636
## Gentoo 47.42200 14.98500
pen_qda_trn_pred = predict(pen_qda, train)$class
pen_qda_tst_pred = predict(pen_qda, test)$class
calc_class_err(predicted = pen_qda_trn_pred, actual = train$species)
## [1] 0.0433213
calc_class_err(predicted = pen_qda_tst_pred, actual = test$species)
## [1] 0
table(predicted = pen_qda_tst_pred, actual = test$species)
##            actual
## predicted   Adelie Chinstrap Gentoo
##   Adelie        30         0      0
##   Chinstrap      0        13      0
##   Gentoo         0         0     24
Based on the error calculations above, our results are as follows: the QDA model misclassified roughly 4.3% of the training observations (95.7% training accuracy) and 0% of the test observations.
Based on the Confusion Matrix for our test set, all 30 Adelie, 13 Chinstrap, and 24 Gentoo test penguins were classified correctly.
We then take the sum of the diagonal of our Confusion Matrix (our true classifications) and put that over the sum of all values. As you may have expected, the output is (30 + 13 + 24) / 67 = 1.0, or 100% test accuracy.
With two fewer variables to consider, the QDA model saved on computational cost and model complexity while still achieving 100% test accuracy.
Bayes’ Theorem provides a means of calculating conditional probability that, while deceptively simple, is incredibly powerful:
P(A|B) = (P(B|A) * P(A)) / P(B)
LDA, QDA, and naive Bayes all utilize Bayes’ Theorem to estimate their probabilities. Where naive Bayes differs is that it assumes the features are conditionally independent given the class. Where there is strong correlation between features this assumption can create problems, but where the correlation is limited, the predictions are quite accurate.
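In other words, the ‘naive’ assumption lets the class posterior factor into one-dimensional pieces:
P(class = k | x_1, ..., x_p) ∝ P(k) * P(x_1 | k) * P(x_2 | k) * ... * P(x_p | k)
which is exactly what the conditional probability tables in the model output below report: a probability table for island and a per-class mean and standard deviation for each numeric predictor.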
We fit our model using the built-in naiveBayes() function, pass our training and testing data through the model to generate class predictions, store those predictions, compare them with the actual labels, and tabulate the results.
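The model object below is named pen_nb here (matching the prediction code further down); the naiveBayes.default call shown in the output is consistent with the e1071 package’s naiveBayes(), so, assuming that package, the fit would be:
library(e1071)
#Fit naive Bayes on the four remaining predictors
pen_nb <- naiveBayes(species ~ ., data = train)
pen_nb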
##
## Naive Bayes Classifier for Discrete Predictors
##
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
##
## A-priori probabilities:
## Y
## Adelie Chinstrap Gentoo
## 0.4404332 0.1985560 0.3610108
##
## Conditional probabilities:
## island
## Y Biscoe Dream Torgersen
## Adelie 0.2540984 0.4180328 0.3278689
## Chinstrap 0.0000000 1.0000000 0.0000000
## Gentoo 1.0000000 0.0000000 0.0000000
##
## bill_length_mm
## Y [,1] [,2]
## Adelie 38.79180 2.660069
## Chinstrap 48.80364 3.582621
## Gentoo 47.42200 2.896587
##
## bill_depth_mm
## Y [,1] [,2]
## Adelie 18.35738 1.1690158
## Chinstrap 18.41636 1.1674226
## Gentoo 14.98500 0.9768404
##
## body_mass_g
## Y [,1] [,2]
## Adelie 3715.574 440.5613
## Chinstrap 3748.182 418.9066
## Gentoo 5069.000 485.3162
pen_nb_trn_pred = predict(pen_nb, train)
pen_nb_tst_pred = predict(pen_nb, test)
calc_class_err(predicted = pen_nb_trn_pred, actual = train$species)
## [1] 0.02166065
calc_class_err(predicted = pen_nb_tst_pred, actual = test$species)
## [1] 0.01492537
table(predicted = pen_nb_tst_pred, actual = test$species)
##            actual
## predicted   Adelie Chinstrap Gentoo
##   Adelie        29         0      0
##   Chinstrap      1        13      0
##   Gentoo         0         0     24
Based on the error calculations above, our results are as follows: the naive Bayes model misclassified roughly 2.2% of the training observations and 1.5% of the test observations.
Based on the Confusion Matrix for our test set, one Adelie penguin was misclassified as a Chinstrap; the remaining 29 Adelie, 13 Chinstrap, and 24 Gentoo test penguins were classified correctly.
We then take the sum of the diagonal of our Confusion Matrix (our true classifications) and put that over the sum of all values. The output is (29 + 13 + 24) / 67 ≈ 0.985, or 98.5% test accuracy.
The naive Bayes classifier performed quite well, with just one error (an Adelie mislabeled as a Chinstrap).
Without even visualizing AUC (area under the ROC curve), we can tell from the statistics above that the LDA and QDA models performed best. The naive Bayes model, while still performing well, reached 98.5% test accuracy as compared to 100% for the LDA and QDA models. Naive Bayes did not get to show its main strength, the ability to handle a large number of predictors (even with a limited sample size), since LDA and QDA already performed well with a relatively small number of predictors.
With this said, the QDA model accounted for fewer variables and was thus less complex and less computationally expensive, while the LDA model achieved 100% accuracy on both training and test data as compared to the QDA model’s 100% test and 95.7% training accuracy. This creates a bit of a ‘toss up’ in deciding between the models: QDA would be recommended as the simpler model, while LDA would be recommended as the more consistent model since it performed perfectly on both training and test data.
……………………………………………………………………..
The process and analysis executed above were carried out with reference to the following resources: