Questions:
LDA assumes equal variance-covariance matrices across groups (homoscedasticity) and multivariate normality. Based on a quick plot, do the species’ “clouds” look like they have roughly the same shape and size?
Use the caret package to create an 80% training set and a 20% test set (use seed 33).
Why do we use createDataPartition instead of a simple random sample?
Fit the LDA model using only the training data.
Since we have 3 groups (species), how many Linear Discriminants (LDs) will the model produce?
Look at the “Proportion of trace.” How much of the between-group variance is explained by the first LD?
Examine the coefficients of the linear discriminants. Which physical measurement is the strongest “discriminator” for the first LD?
Use the model to predict the species of the 20% “unseen” test penguins.
Create a Confusion Matrix for the test set.
Which two species are most likely to be confused by the model? Why does this make “common sense” (check their physical similarities)?
Compare the accuracy of the Training set vs. the Test set. If the Training accuracy is 99% but the Test accuracy is 70%, what does that tell you about your model?
Look at the “Prior Probabilities” in the LDA output. How do these change if your training set is small?
library(FactoMineR)
library(tidyverse)
library(palmerpenguins)
library(caret)
library(GGally)
data(penguins)
df <- na.omit(penguins)
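# Select relevant variables (dplyr:: prefix because MASS, loaded later, masks select())
df <- df %>%
  dplyr::select(species, bill_length_mm, bill_depth_mm,
                flipper_length_mm, body_mass_g)
Linear Discriminant Analysis assumes multivariate normality of the predictors within each species and equal variance-covariance matrices across species (homoscedasticity).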
A pairwise scatterplot of the variables (colored by species) provides a quick visual check. The clusters representing the species appear reasonably elliptical and similar in shape, although some differences exist, particularly between Gentoo and the other species. However, the assumption of equal covariance structures is not severely violated.
Thus, while the assumptions are not perfectly met, they are sufficiently reasonable for LDA to be applied.
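As a complement to the visual check, the per-species covariance matrices can also be compared numerically. The following is a minimal sketch and not part of the original analysis; the names `vars`, `cov_by_species`, `sd_by_species` and `log_det` are introduced here for illustration. It prints the standard deviation of each variable per species and the log-determinant of each covariance matrix as a rough one-number summary of the size of each cloud.

```r
library(palmerpenguins)

vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
df <- as.data.frame(na.omit(penguins))

# One covariance matrix per species
cov_by_species <- lapply(split(df[, vars], df$species), cov)

# Standard deviation of each variable within each species (rows = variables)
sd_by_species <- sapply(cov_by_species, function(S) sqrt(diag(S)))
print(round(sd_by_species, 2))

# Log-determinant: a rough summary of the "volume" of each species' cloud
log_det <- sapply(cov_by_species, function(S) {
  as.numeric(determinant(S, logarithm = TRUE)$modulus)
})
print(round(log_det, 2))
```

Similar values across the three species would be in line with the equal-covariance assumption; clearly different values would point to clouds of different size or shape.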
# Pairwise plot
ggpairs(df, aes(color = species))
#> `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
We used an 80/20 split with createDataPartition.
This function is preferred over a simple random split because it performs stratified sampling. This ensures that the proportion of each species is approximately the same in both training and test sets.
Without this, one class could be underrepresented in the training data, leading to biased models.
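The effect of stratification can be checked by comparing class proportions. The sketch below is illustrative only: `strat_idx`, `simple_idx` and `props` are names introduced here, and the simple random sample is drawn with base R's `sample()` purely for comparison.

```r
library(palmerpenguins)
library(caret)

df <- as.data.frame(na.omit(penguins))

set.seed(33)
# Stratified 80% sample: drawn within each species
strat_idx <- createDataPartition(df$species, p = 0.8, list = FALSE)[, 1]
# Simple 80% random sample: ignores species
simple_idx <- sample(nrow(df), size = round(0.8 * nrow(df)))

# Species proportions in the full data and in each training sample
props <- rbind(
  full       = c(prop.table(table(df$species))),
  stratified = c(prop.table(table(df$species[strat_idx]))),
  simple     = c(prop.table(table(df$species[simple_idx])))
)
print(round(props, 3))
```

The stratified row should match the full-data row almost exactly, while the simple random sample is free to drift, especially for the smallest class.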
library(MASS)
#>
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#>
#> select
set.seed(33)
trainIndex <- createDataPartition(df$species, p = 0.8, list = FALSE)
train <- df[trainIndex, ]
test <- df[-trainIndex, ]
LDA produces at most k − 1 discriminant functions, where k is the number of groups (more precisely, min(k − 1, p) for p predictors).
Since there are 3 species and 4 predictors, the model produces 2 Linear Discriminants (LD1 and LD2).
These are the axes that maximize the separation between species relative to the variation within species.
lda_model <- lda(species ~ ., data = train)
lda_model
#> Call:
#> lda(species ~ ., data = train)
#>
#> Prior probabilities of groups:
#>    Adelie Chinstrap    Gentoo
#> 0.4365672 0.2052239 0.3582090
#>
#> Group means:
#>           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> Adelie          38.70342      18.29402          190.2564    3686.538
#> Chinstrap       49.18364      18.53818          196.3636    3742.727
#> Gentoo          47.45833      14.95521          216.7604    5096.615
#>
#> Coefficients of linear discriminants:
#>                            LD1          LD2
#> bill_length_mm    -0.084689578 -0.419011936
#> bill_depth_mm      1.045477396 -0.016903127
#> flipper_length_mm -0.075609642  0.013875883
#> body_mass_g       -0.001392831  0.001616021
#>
#> Proportion of trace:
#>    LD1    LD2
#> 0.8499 0.1501
The proportion of trace indicates how much of the between-group variance is explained by each discriminant.
Here, LD1 explains about 85% of the between-group variance and LD2 the remaining 15%.
This means most of the group separation occurs along a single dimension.
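The number of discriminants and the proportion of trace can be recomputed from the fitted object. This is a sketch under one assumption that differs from the analysis above: for brevity the model is fitted on the full cleaned data rather than the training split, so the numbers will differ slightly. The names `fit`, `n_ld` and `prop_trace` are introduced here. It relies on the `svd` component of a MASS `lda` object, whose squared values give the proportions that `print()` reports.

```r
library(palmerpenguins)
library(MASS)

vars <- c("species", "bill_length_mm", "bill_depth_mm",
          "flipper_length_mm", "body_mass_g")
df <- as.data.frame(na.omit(penguins))[, vars]

# Assumption: fitted on all rows, not on the 80% training split
fit <- lda(species ~ ., data = df)

# Number of discriminants: min(k - 1, p)
n_ld <- min(nlevels(df$species) - 1, ncol(df) - 1)
print(n_ld)

# Proportion of trace from the singular values
prop_trace <- fit$svd^2 / sum(fit$svd^2)
print(round(prop_trace, 4))
```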
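The coefficients (scaling) indicate how each variable contributes to the discriminant functions. For LD1, bill_depth_mm has by far the largest absolute coefficient (1.05), so it is the strongest discriminator on that axis; for LD2 it is bill_length_mm (−0.42). These are raw coefficients, so their size also depends on each variable's measurement scale: body_mass_g is recorded in grams, which makes its coefficient look small.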
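Raw coefficients are tied to the units of each variable, so one common way to compare variables is to multiply each coefficient by the pooled within-group standard deviation of its variable. The sketch below illustrates this; it is not part of the original analysis, it fits the model on the full cleaned data for brevity (an assumption, so values differ slightly from the training-set model), and the names `groups`, `W` and `std_coef` are introduced here.

```r
library(palmerpenguins)
library(MASS)

vars <- c("bill_length_mm", "bill_depth_mm", "flipper_length_mm", "body_mass_g")
df <- as.data.frame(na.omit(penguins))[, c("species", vars)]

# Assumption: fitted on all rows, not on the 80% training split
fit <- lda(species ~ ., data = df)

# Pooled within-group covariance: weighted average of per-species covariances
groups <- split(df[, vars], df$species)
W <- Reduce(`+`, lapply(groups, function(g) (nrow(g) - 1) * cov(g))) /
  (nrow(df) - length(groups))

# Standardized coefficients: raw coefficient times pooled within-group SD
std_coef <- fit$scaling * sqrt(diag(W))
print(round(std_coef, 3))
```

On this standardized scale the coefficients describe the effect of a one-standard-deviation change, which makes magnitudes comparable across variables.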
lda_model$scaling
#>                            LD1          LD2
#> bill_length_mm    -0.084689578 -0.419011936
#> bill_depth_mm      1.045477396 -0.016903127
#> flipper_length_mm -0.075609642  0.013875883
#> body_mass_g       -0.001392831  0.001616021
The confusion matrix on the test set shows classification performance. The overall accuracy is high at 97% (63 of 65 penguins). The only 2 misclassifications are Chinstrap penguins predicted as Adelie.
This makes intuitive sense: according to the group means, Adelie and Chinstrap have similar bill depth, flipper length, and body mass and differ mainly in bill length, whereas Gentoo are clearly heavier, with longer flippers and shallower bills.
pred_test <- predict(lda_model, test)
# Confusion matrix
confusionMatrix(pred_test$class, test$species)
#> Confusion Matrix and Statistics
#>
#>            Reference
#> Prediction  Adelie Chinstrap Gentoo
#>   Adelie        29         2      0
#>   Chinstrap      0        11      0
#>   Gentoo         0         0     23
#>
#> Overall Statistics
#>
#> Accuracy : 0.9692
#> 95% CI : (0.8932, 0.9963)
#> No Information Rate : 0.4462
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.951
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#>                      Class: Adelie Class: Chinstrap Class: Gentoo
#> Sensitivity                 1.0000           0.8462        1.0000
#> Specificity                 0.9444           1.0000        1.0000
#> Pos Pred Value              0.9355           1.0000        1.0000
#> Neg Pred Value              1.0000           0.9630        1.0000
#> Prevalence                  0.4462           0.2000        0.3538
#> Detection Rate              0.4462           0.1692        0.3538
#> Detection Prevalence        0.4769           0.1692        0.3538
#> Balanced Accuracy           0.9722           0.9231        1.0000
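To see how close the misclassified test penguins were to the decision boundary, their posterior probabilities can be inspected. This is a sketch that rebuilds the same kind of split with seed 33; `idx`, `fit`, `pred` and `wrong` are names introduced here, and the exact rows selected may depend on the installed R and caret versions.

```r
library(palmerpenguins)
library(caret)
library(MASS)

vars <- c("species", "bill_length_mm", "bill_depth_mm",
          "flipper_length_mm", "body_mass_g")
df <- as.data.frame(na.omit(penguins))[, vars]

set.seed(33)
idx <- createDataPartition(df$species, p = 0.8, list = FALSE)[, 1]
train <- df[idx, ]
test <- df[-idx, ]

fit <- lda(species ~ ., data = train)
pred <- predict(fit, test)

# Rows where the predicted species differs from the true species
wrong <- pred$class != test$species

# Measurements, predicted class and posterior probabilities of those rows
print(cbind(test[wrong, ],
            predicted = pred$class[wrong],
            round(pred$posterior[wrong, , drop = FALSE], 3)))
```

A posterior that is split between Adelie and Chinstrap, with Gentoo near zero, would confirm that the errors come from the overlap between those two species.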
pred_train <- predict(lda_model, train)
confusionMatrix(pred_train$class, train$species)
#> Confusion Matrix and Statistics
#>
#>            Reference
#> Prediction  Adelie Chinstrap Gentoo
#>   Adelie       116         1      0
#>   Chinstrap      1        54      0
#>   Gentoo         0         0     96
#>
#> Overall Statistics
#>
#> Accuracy : 0.9925
#> 95% CI : (0.9733, 0.9991)
#> No Information Rate : 0.4366
#> P-Value [Acc > NIR] : < 2.2e-16
#>
#> Kappa : 0.9883
#>
#> Mcnemar's Test P-Value : NA
#>
#> Statistics by Class:
#>
#>                      Class: Adelie Class: Chinstrap Class: Gentoo
#> Sensitivity                 0.9915           0.9818        1.0000
#> Specificity                 0.9934           0.9953        1.0000
#> Pos Pred Value              0.9915           0.9818        1.0000
#> Neg Pred Value              0.9934           0.9953        1.0000
#> Prevalence                  0.4366           0.2052        0.3582
#> Detection Rate              0.4328           0.2015        0.3582
#> Detection Prevalence        0.4366           0.2052        0.3582
#> Balanced Accuracy           0.9924           0.9886        1.0000
The training accuracy is 99%.
If the training accuracy were 99% and the test accuracy only 70%, this would indicate overfitting: the model would have learned patterns specific to the training data that do not generalize to new data.
In this case, the training accuracy (99.3%) and the test accuracy (96.9%) are close. The gap is only about 2 percentage points, and the training accuracy lies within the test set's 95% confidence interval (0.8932, 0.9963), so there is no sign of overfitting.
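A single 65-penguin test set gives a fairly noisy accuracy estimate, so a second check on generalization can be useful. The sketch below uses the leave-one-out cross-validation built into `MASS::lda()` via `CV = TRUE`. It is an addition for illustration, runs on the full cleaned data rather than the train/test split, and introduces the names `fit_cv`, `loo_acc` and `resub_acc`.

```r
library(palmerpenguins)
library(MASS)

vars <- c("species", "bill_length_mm", "bill_depth_mm",
          "flipper_length_mm", "body_mass_g")
df <- as.data.frame(na.omit(penguins))[, vars]

# Leave-one-out CV: each penguin is predicted by a model fitted without it
fit_cv <- lda(species ~ ., data = df, CV = TRUE)
loo_acc <- mean(fit_cv$class == df$species)

# Resubstitution accuracy: the model predicts the data it was fitted on
fit <- lda(species ~ ., data = df)
resub_acc <- mean(predict(fit, df)$class == df$species)

print(round(c(resubstitution = resub_acc, leave_one_out = loo_acc), 4))
```

If the two accuracies are close, that supports the conclusion drawn from the train/test comparison; a large drop under cross-validation would be a warning sign.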
The prior probabilities reflect the proportion of each class in the training data.
With a large dataset, the priors are stable and close to the true population proportions.
With a small dataset, the priors can fluctuate from sample to sample, which may bias predictions toward classes that happen to be overrepresented.
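Both points can be illustrated in a few lines. This is a sketch, not part of the original solution: the sample size of 30, the seed, and the names `small_priors`, `fit_default` and `fit_equal` are arbitrary choices made here. The first part shows how class proportions vary across small random samples; the second uses the `prior` argument of `MASS::lda()` to replace the estimated priors with equal ones.

```r
library(palmerpenguins)
library(MASS)

vars <- c("species", "bill_length_mm", "bill_depth_mm",
          "flipper_length_mm", "body_mass_g")
df <- as.data.frame(na.omit(penguins))[, vars]

# Class proportions in five random samples of 30 penguins each
set.seed(1)
small_priors <- t(replicate(5, {
  prop.table(table(df$species[sample(nrow(df), 30)]))
}))
print(round(small_priors, 3))

# Default: priors estimated from the class proportions in the data
fit_default <- lda(species ~ ., data = df)
print(round(fit_default$prior, 3))

# Alternative: equal priors supplied explicitly
fit_equal <- lda(species ~ ., data = df, prior = rep(1 / 3, 3))
print(round(fit_equal$prior, 3))
```

Supplying priors explicitly is one option when the sample proportions are not trusted to reflect the population.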