Multivariate Statistics - StatUa

Linear Discriminant Analysis - Exercise solutions

Questions:

LDA assumes equal variance-covariance matrices across groups (homoscedasticity) and multivariate normality. Based on a quick plot, do the species’ “clouds” look like they have roughly the same shape and size?

Use the caret package to create an 80% training set and a 20% test set (use seed 33).

Why do we use createDataPartition instead of a simple random sample?

Fit the LDA model using only the training data.

Since we have 3 groups (species), how many Linear Discriminants (LDs) will the model produce?

Look at the “Proportion of trace.” How much of the between-group variance is explained by the first LD?

Examine the coefficients of the linear discriminants. Which physical measurement is the strongest “discriminator” for the first LD?

Use the model to predict the species of the 20% “unseen” test penguins.

Create a Confusion Matrix for the test set.

Which two species are most likely to be confused by the model? Why does this make “common sense” (check their physical similarities)?

Compare the accuracy of the Training set vs. the Test set. If the Training accuracy is 99% but the Test accuracy is 70%, what does that tell you about your model?

Look at the “Prior Probabilities” in the LDA output. How do these change if your training set is small?


library(FactoMineR)
library(tidyverse)
library(palmerpenguins)
library(caret)
library(GGally)

data(penguins)

df <- na.omit(penguins)

# Select relevant variables
# (dplyr::select is called explicitly because MASS, loaded later, masks select)
df <- df %>%
  dplyr::select(species, bill_length_mm, bill_depth_mm,
                flipper_length_mm, body_mass_g)

Linear Discriminant Analysis assumes:

multivariate normality within each group
equal variance-covariance matrices across groups

A pairwise scatterplot of the variables (colored by species) provides a quick visual check. The clusters representing the species appear reasonably elliptical and similar in shape, although some differences exist, particularly between Gentoo and the other species. However, the assumption of equal covariance structures is not severely violated.

Thus, while the assumptions are not perfectly met, they are sufficiently reasonable for LDA to be applied.


# Pairwise plot

ggpairs(df, aes(color = species))
#> `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
#> `stat_bin()` using `bins = 30`. Pick better value `binwidth`.
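
The visual check can be complemented with a rough numeric comparison of the per-species covariance matrices. The sketch below is an illustration, not part of the original solution: it rebuilds the same data frame, computes one covariance matrix per species, and summarizes each by its standard deviations and log-determinant. The names `cov_by_species`, `sd_by_species` and `log_det` are introduced here for illustration only. Similar values across species are in line with the equal-covariance assumption; this is a descriptive check, not a formal test.

```r
library(palmerpenguins)

# Same preprocessing as above: complete cases, species plus four measurements
df <- as.data.frame(na.omit(penguins))
df <- df[, c("species", "bill_length_mm", "bill_depth_mm",
             "flipper_length_mm", "body_mass_g")]

# One 4 x 4 covariance matrix per species
cov_by_species <- lapply(split(df[, -1], df$species), cov)

# Standard deviation of each variable within each species
sd_by_species <- sapply(cov_by_species, function(S) sqrt(diag(S)))
round(sd_by_species, 2)

# Log-determinant of each covariance matrix: a one-number summary of cloud "size"
log_det <- sapply(cov_by_species, function(S) {
  as.numeric(determinant(S, logarithm = TRUE)$modulus)
})
round(log_det, 2)
```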

We used an 80/20 split with createDataPartition.

This function is preferred over a simple random split because it performs stratified sampling. This ensures that the proportion of each species is approximately the same in both training and test sets.

Without this, one class could be underrepresented in the training data, leading to biased models.

library(MASS)
#> 
#> Attaching package: 'MASS'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
set.seed(33)

trainIndex <- createDataPartition(df$species, p = 0.8, list = FALSE)

train <- df[trainIndex, ]
test  <- df[-trainIndex, ]
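
To make the effect of stratification concrete, the following sketch compares the species proportions in the full data with those in a stratified training set and in a simple random training set of the same size. It is an illustration under the same seed as above; `idx_strat`, `idx_simple` and `props` are names chosen here, not part of the original solution.

```r
library(caret)
library(palmerpenguins)

df <- as.data.frame(na.omit(penguins))

set.seed(33)
# Stratified: sampling is done within each species
idx_strat <- createDataPartition(df$species, p = 0.8, list = FALSE)[, 1]
# Simple random sample of the same size, ignoring species
idx_simple <- sample(nrow(df), size = length(idx_strat))

# Species proportions in the full data and in each training set
props <- rbind(
  full       = prop.table(table(df$species)),
  stratified = prop.table(table(df$species[idx_strat])),
  simple     = prop.table(table(df$species[idx_simple]))
)
round(props, 3)
```

The stratified row should track the full-data row very closely, while the simple random row can drift further, especially for the smallest class.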

LDA produces at most min(k − 1, p) discriminant functions, where k is the number of groups and p is the number of predictors.

Since there are 3 species and 4 predictors, the model produces 2 Linear Discriminants (LD1 and LD2).

These represent axes that maximize separation between species.

lda_model <- lda(species ~ ., data = train)

lda_model
#> Call:
#> lda(species ~ ., data = train)
#> 
#> Prior probabilities of groups:
#>    Adelie Chinstrap    Gentoo 
#> 0.4365672 0.2052239 0.3582090 
#> 
#> Group means:
#>           bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
#> Adelie          38.70342      18.29402          190.2564    3686.538
#> Chinstrap       49.18364      18.53818          196.3636    3742.727
#> Gentoo          47.45833      14.95521          216.7604    5096.615
#> 
#> Coefficients of linear discriminants:
#>                            LD1          LD2
#> bill_length_mm    -0.084689578 -0.419011936
#> bill_depth_mm      1.045477396 -0.016903127
#> flipper_length_mm -0.075609642  0.013875883
#> body_mass_g       -0.001392831  0.001616021
#> 
#> Proportion of trace:
#>    LD1    LD2 
#> 0.8499 0.1501

The proportion of trace indicates how much of the between-group variance is explained by each discriminant.

In this model:

LD1 explains about 85% of the between-group variance
LD2 explains the remaining 15%

This means most of the group separation occurs along a single dimension.

lda_model$svd^2 / sum(lda_model$svd^2)
#> [1] 0.8499114 0.1500886

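The coefficients (scaling) indicate how each variable contributes to the discriminant functions. On LD1, bill_depth_mm has by far the largest coefficient in absolute value (1.05), so it is the strongest discriminator for the first LD. LD2 is dominated by bill_length_mm (−0.42). Note that these are raw coefficients, so their size depends on the measurement units: body_mass_g is recorded in grams, which is why its coefficients look tiny.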

lda_model$scaling
#>                            LD1          LD2
#> bill_length_mm    -0.084689578 -0.419011936
#> bill_depth_mm      1.045477396 -0.016903127
#> flipper_length_mm -0.075609642  0.013875883
#> body_mass_g       -0.001392831  0.001616021
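
Because raw coefficients depend on the units of each variable, a common complement is to rescale them by the pooled within-group standard deviation of each predictor. The sketch below is one way to do this and is not part of the original solution; `pooled_sd` and `std_scaling` are names introduced here. It refits the model with the same seed, so the split should match the one above if the data and package versions are the same.

```r
library(MASS)
library(caret)
library(palmerpenguins)

df <- as.data.frame(na.omit(penguins))
df <- df[, c("species", "bill_length_mm", "bill_depth_mm",
             "flipper_length_mm", "body_mass_g")]

set.seed(33)
train <- df[createDataPartition(df$species, p = 0.8, list = FALSE)[, 1], ]

lda_model <- lda(species ~ ., data = train)

# Pooled within-group SD of each predictor:
# deviations from the species means, with n - k degrees of freedom
X <- as.matrix(train[, -1])
centered <- X - lda_model$means[as.character(train$species), ]
pooled_sd <- sqrt(colSums(centered^2) / (nrow(X) - nlevels(train$species)))

# Raw coefficient times pooled SD:
# change in the LD score per one within-group SD of the predictor
std_scaling <- lda_model$scaling * pooled_sd
round(std_scaling, 3)
```

On this scale the coefficients of the four measurements can be compared directly within each column.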

The confusion matrix on the test set shows classification performance. We achieve a high overall accuracy of 97%. There are only 2 misclassifications: two Chinstrap penguins are predicted as Adelie.

This makes intuitive sense. According to the group means, Adelie and Chinstrap are very similar in bill depth, flipper length and body mass, and differ mainly in bill length, whereas Gentoo penguins are clearly heavier, with longer flippers and shallower bills.

pred_test <- predict(lda_model, test)

# Confusion matrix
confusionMatrix(pred_test$class, test$species)
#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Adelie Chinstrap Gentoo
#>   Adelie        29         2      0
#>   Chinstrap      0        11      0
#>   Gentoo         0         0     23
#> 
#> Overall Statistics
#>                                           
#>                Accuracy : 0.9692          
#>                  95% CI : (0.8932, 0.9963)
#>     No Information Rate : 0.4462          
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.951           
#>                                           
#>  Mcnemar's Test P-Value : NA              
#> 
#> Statistics by Class:
#> 
#>                      Class: Adelie Class: Chinstrap Class: Gentoo
#> Sensitivity                 1.0000           0.8462        1.0000
#> Specificity                 0.9444           1.0000        1.0000
#> Pos Pred Value              0.9355           1.0000        1.0000
#> Neg Pred Value              1.0000           0.9630        1.0000
#> Prevalence                  0.4462           0.2000        0.3538
#> Detection Rate              0.4462           0.1692        0.3538
#> Detection Prevalence        0.4769           0.1692        0.3538
#> Balanced Accuracy           0.9722           0.9231        1.0000
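
The headline numbers in this output can be reproduced by hand from the 3 x 3 table. The sketch below copies the test-set counts from the output above into a matrix (the object names are chosen here for illustration) and recomputes accuracy, sensitivity and positive predictive value.

```r
# Test-set confusion matrix copied from the output above
# (rows = predicted species, columns = reference species)
species <- c("Adelie", "Chinstrap", "Gentoo")
cm <- matrix(c(29,  2,  0,
                0, 11,  0,
                0,  0, 23),
             nrow = 3, byrow = TRUE,
             dimnames = list(Prediction = species, Reference = species))

accuracy    <- sum(diag(cm)) / sum(cm)  # correct predictions / all predictions
sensitivity <- diag(cm) / colSums(cm)   # per species: correct / actual members
ppv         <- diag(cm) / rowSums(cm)   # per species: correct / predicted members

round(accuracy, 4)     # 0.9692 (63 of 65)
round(sensitivity, 4)  # Adelie 1.0000, Chinstrap 0.8462, Gentoo 1.0000
round(ppv, 4)          # Adelie 0.9355, Chinstrap 1.0000, Gentoo 1.0000
```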

pred_train <- predict(lda_model, train)

confusionMatrix(pred_train$class, train$species)
#> Confusion Matrix and Statistics
#> 
#>            Reference
#> Prediction  Adelie Chinstrap Gentoo
#>   Adelie       116         1      0
#>   Chinstrap      1        54      0
#>   Gentoo         0         0     96
#> 
#> Overall Statistics
#>                                           
#>                Accuracy : 0.9925          
#>                  95% CI : (0.9733, 0.9991)
#>     No Information Rate : 0.4366          
#>     P-Value [Acc > NIR] : < 2.2e-16       
#>                                           
#>                   Kappa : 0.9883          
#>                                           
#>  Mcnemar's Test P-Value : NA              
#> 
#> Statistics by Class:
#> 
#>                      Class: Adelie Class: Chinstrap Class: Gentoo
#> Sensitivity                 0.9915           0.9818        1.0000
#> Specificity                 0.9934           0.9953        1.0000
#> Pos Pred Value              0.9915           0.9818        1.0000
#> Neg Pred Value              0.9934           0.9953        1.0000
#> Prevalence                  0.4366           0.2052        0.3582
#> Detection Rate              0.4328           0.2015        0.3582
#> Detection Prevalence        0.4366           0.2052        0.3582
#> Balanced Accuracy           0.9924           0.9886        1.0000

The training accuracy is 99.3%.

If the training accuracy were 99% and the test accuracy only 70%, this would indicate overfitting: the model has learned patterns specific to the training data that do not generalize to new data. A large gap between the two indicates poor generalization.

In this case, the training accuracy (99.3%) and the test accuracy (96.9%) are close. The gap is only about 2 percentage points, so there is no sign of overfitting.
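
A single 80/20 split gives only one estimate of out-of-sample accuracy, here based on 65 test penguins. As a rough cross-check that is not part of the original exercise, leave-one-out cross-validation can be run on the full data with the `CV = TRUE` argument of `MASS::lda`. The names `loo` and `loo_accuracy` are introduced here for illustration.

```r
library(MASS)
library(palmerpenguins)

df <- as.data.frame(na.omit(penguins))
df <- df[, c("species", "bill_length_mm", "bill_depth_mm",
             "flipper_length_mm", "body_mass_g")]

# With CV = TRUE, lda() returns leave-one-out predictions
# (components `class` and `posterior`) instead of a fitted model
loo <- lda(species ~ ., data = df, CV = TRUE)

# Share of penguins classified correctly when each one is held out in turn
loo_accuracy <- mean(loo$class == df$species)
loo_accuracy
```

If this value is close to both the training and the test accuracy, that supports the conclusion that the model generalizes well.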

lda_model$prior
#>    Adelie Chinstrap    Gentoo 
#> 0.4365672 0.2052239 0.3582090

The prior probabilities reflect the proportion of each class in the training data.

With a large dataset:

Priors are stable and close to true population proportions

With a small dataset:

Priors can fluctuate from sample to sample. This may bias predictions toward classes that happen to be overrepresented in the training set.
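
The sketch below illustrates both points and is not part of the original solution. It shows the default priors, how to fix priors explicitly through the `prior` argument of `MASS::lda`, and how much the estimated class proportions vary across five small random samples of 30 penguins. The sample size, the seed and the object names are arbitrary choices made for illustration.

```r
library(MASS)
library(palmerpenguins)

df <- as.data.frame(na.omit(penguins))
df <- df[, c("species", "bill_length_mm", "bill_depth_mm",
             "flipper_length_mm", "body_mass_g")]

# Default: priors are the class proportions in the data passed to lda()
fit_default <- lda(species ~ ., data = df)
fit_default$prior

# Alternative: fix equal priors, given in the order of levels(df$species)
fit_equal <- lda(species ~ ., data = df, prior = rep(1 / 3, 3))
fit_equal$prior

# Small samples: the class proportions (and hence the default priors)
# change noticeably from one draw to the next
set.seed(1)
small_priors <- t(replicate(5, {
  small <- df[sample(nrow(df), 30), ]
  prop.table(table(small$species))
}))
round(small_priors, 2)
```

Fixing the priors is one option when the sample proportions are not believed to reflect the population.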

dr. Annelies Agten

2026-04-27
