penguins_df <- palmerpenguins::penguins %>%
dplyr::select(-year)
str(penguins_df)
## tibble [344 x 7] (S3: tbl_df/tbl/data.frame)
## $ species : Factor w/ 3 levels "Adelie","Chinstrap",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ island : Factor w/ 3 levels "Biscoe","Dream",..: 3 3 3 3 3 3 3 3 3 3 ...
## $ bill_length_mm : num [1:344] 39.1 39.5 40.3 NA 36.7 39.3 38.9 39.2 34.1 42 ...
## $ bill_depth_mm : num [1:344] 18.7 17.4 18 NA 19.3 20.6 17.8 19.6 18.1 20.2 ...
## $ flipper_length_mm: int [1:344] 181 186 195 NA 193 190 181 195 193 190 ...
## $ body_mass_g : int [1:344] 3750 3800 3250 NA 3450 3650 3625 4675 3475 4250 ...
## $ sex : Factor w/ 2 levels "female","male": 2 1 1 NA 1 2 1 2 NA NA ...
The penguins dataset is composed of 344 datapoints and has one response variable (species) and seven explanatory variables (island, bill_length_mm, bill_depth_mm, flipper_length_mm, body_mass_g, sex, and year).
Intuitively, we know that the year variable shouldn’t make a difference in the penguin species, so we will remove it from the dataset. We are now left with two categorical variables (island and sex) and four numeric variables (bill_length_mm, bill_depth_mm, flipper_length_mm andbody_mass_g).
summary(penguins_df)
## species island bill_length_mm bill_depth_mm
## Adelie :152 Biscoe :168 Min. :32.10 Min. :13.10
## Chinstrap: 68 Dream :124 1st Qu.:39.23 1st Qu.:15.60
## Gentoo :124 Torgersen: 52 Median :44.45 Median :17.30
## Mean :43.92 Mean :17.15
## 3rd Qu.:48.50 3rd Qu.:18.70
## Max. :59.60 Max. :21.50
## NA's :2 NA's :2
## flipper_length_mm body_mass_g sex
## Min. :172.0 Min. :2700 female:165
## 1st Qu.:190.0 1st Qu.:3550 male :168
## Median :197.0 Median :4050 NA's : 11
## Mean :200.9 Mean :4202
## 3rd Qu.:213.0 3rd Qu.:4750
## Max. :231.0 Max. :6300
## NA's :2 NA's :2
From the summary statistics, we can see that 5 out of 6 of the explanatory variables have at least some null values. It doesn’t make sense to impute the null records of our categorical variable (sex), so we will remove them from our dataset. We will impute the missing data of the numerical values using the Multivariate imputation by chained equations (MICE) method. Multiple imputation involves creating multiple predictions for each missing value, helping to account for the uncertainty in the individual imputations.
# remove records with null sex values
penguins_df <- penguins_df %>% filter(!is.na(sex))
# impute null values using MICE method
penguins_df <- complete(mice(data = penguins_df,
method = "pmm", print = FALSE), 3)
One of the assumptions for LDA, QDA, and Naive Bayes is that that each predictor variable is normally distributed. This implies non-categorical variables, so we will eliminate them from the dataset.
penguins_df <- penguins_df %>%
dplyr::select(-island, -sex)
The final dataset contains 333 records with 4 explanatory variables.
In order for LDA, QDA and Naive Bayes methods to work optimally, the data must be approximately normally distributed for each variable in each class. Let’s take a look:
For the most part, the data appear to be normally distributed for each class, so we are set to move forward.
In LDA, we assume that the covariance matrices are the same between classes. We can use Box’s M test to test if the covariance matrices are equal. The null hypothesis is that the covariances are equal across all groups.
boxm <- heplots::boxM(penguins_df %>% select_if(is.numeric), penguins_df$species)
boxm
##
## Box's M-test for Homogeneity of Covariance Matrices
##
## data: penguins_df %>% select_if(is.numeric)
## Chi-Sq (approx.) = 74.731, df = 20, p-value = 3.02e-08
Since the p-value is less than 0.05, we can reject the null hypothesis and we know that the covariance matrices are different for at least one group. This means that an LDA model will likely not perform as well as a QDA model.
We can visualize the covariances of the variables against the species to identify which ones are the most different:
heplots::covEllipses(penguins_df %>% dplyr::select_if(is.numeric),
penguins_df$species,
fill = TRUE,
pooled = FALSE,
col = c("blue", "red", "orange"),
variables = c(1:ncol(penguins_df %>% dplyr::select_if(is.numeric))),
fill.alpha = 0.05)
It is evident by the differing shapes of the ellipses that the variance is not equal for some of our variables (ex: bill_depth).
In LDA and QDA, there are no assumptions about the relationships between the variables. However, in Naive Bayes, we assume that all features are independent of one another, ie, no correlations between variables in a class. Let’s take a look at the relationships of the numeric variables against each other for each species.
First, we’ll take a look at the Adelie species:
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm NA 0.3858132 0.3322740 0.5442764
## bill_depth_mm NA NA 0.3108974 0.5801560
## flipper_length_mm NA NA NA 0.4648539
## body_mass_g NA NA NA NA
Next, we’ll take a look at the Chinstrap species:
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm NA 0.6535362 0.4716073 0.5136383
## bill_depth_mm NA NA 0.5801429 0.6044983
## flipper_length_mm NA NA NA 0.6415594
## body_mass_g NA NA NA NA
Finally, we’ll take a look at the Gentoo species:
## bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
## bill_length_mm NA 0.6540233 0.6642052 0.6667302
## bill_depth_mm NA NA 0.7106422 0.7229672
## flipper_length_mm NA NA NA 0.7113053
## body_mass_g NA NA NA NA
We can see that there is a stronger relationship between many of the variables for the Gentoo species. We know off the bat if we use all variables in the model that the Naive Bayes method might not perform as well as the other methods.
We can also visualize these relationships:
caret::featurePlot(x = penguins_df %>% select_if(is.numeric),
y = penguins_df$species,
plot = "pairs",
auto.key = list(columns = 3)) # add key for species at the top
We can see that the Gentoo species is more easily separable from the other two species. Gentoo penguins tend to have a smaller bill depth, larger flipper length, and larger body mass than the other species.
Since discriminant analysis can be affected by the scale and unit of the predictor variables, we will normalize the continuous predictors for the analysis. We will use the scale function, which will standardize the data so that we have a mean of 0 and a standard deviation of 1.
final_df <- penguins_df %>%
mutate_at(c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g"),
~(scale(.) %>% as.vector))
First, we’ll train an LDA model with 10-fold cross validation. LDA is more stable than Logistic Regression, and is more often used in instances where there are more than 2 outcome classes. It assumes that the predictor variables are (1) normally distributed and (2) that the classes have identical covariance matrices. We know from our EDA that the data is approximately normally distributed by class, but that the covariance matrices are not the same.
lda_model <- train(data=final_df,
species ~ .,
method="lda",
trControl=trainControl(method="cv", number=10))
We will also create an LDA model that eliminates the bill_depth variable, as this was the feature that appeared to have the most unequal variance across species.
lda_model2 <- train(data=final_df,
species ~ bill_length_mm + flipper_length_mm + body_mass_g,
method="lda",
trControl=trainControl(method="cv", number=10))
Quadratic Discriminant Analysis is similar to LDA, but does not assume that the classes have identical covariance matrices. Our conditions are met for QDA, so we will keep all variables in the model.
qda_model <- train(data=final_df,
species ~ .,
method="qda",
trControl=trainControl(method="cv", number=10))
A Naive Bayes classifier assumes that the presence of a particular feature in a class is unrelated to the presence of any other feature. We know from our EDA that some of the variables for the Gentoo species had a high correlation, so we anticipate that the Naive Bayes model won’t perform as well as the other two.
nb_model <- train(data=final_df,
species ~ .,
method="nb",
trControl=trainControl(method="cv", number=10))
Finally, we can compare the accuracy and kappa of our 3 models.
In both cases, the higher the value, the better the classification.
nms <- c('LDA - All Variables',
'LDA - No Bill Depth',
'QDA',
'Naive Bayes')
acc <- c(lda_model$results$Accuracy,
lda_model2$results$Accuracy,
qda_model$results$Accuracy,
nb_model$results$Accuracy[1])
kap <- c(lda_model$results$Kappa,
lda_model2$results$Kappa,
qda_model$results$Kappa,
nb_model$results$Kappa[1])
data.frame(nms, acc, kap)
## nms acc kap
## 1 LDA - All Variables 0.9851103 0.9765707
## 2 LDA - No Bill Depth 0.9729947 0.9574153
## 3 QDA 0.9879679 0.9811025
## 4 Naive Bayes 0.9700423 0.9530511
The results are pretty similar high the board. However, the QDA model wins out at \(98.8%\) accuracy and \(98.1%\) kappa. The final ranking aligns with that we had expected from our preliminary analysis of the data: QDA, LDA, and finally Naive Bayes.
In summary: