Linear Discriminant Analysis

Theoretical Background

Linear Discriminant Analysis (LDA) is a supervised classification and dimensionality reduction technique introduced by Fisher (1936). Given \(K\) classes and \(p\) predictors, LDA finds a set of linear combinations of the predictors — called discriminant functions — that maximise the ratio of between-class variance to within-class variance.

Formally, the \(k\)-th discriminant function is:

\[z_k = \mathbf{a}_k^\top \mathbf{x}\]

where \(\mathbf{a}_k\) is the vector of discriminant coefficients (the scaling matrix returned by MASS::lda) and \(\mathbf{x}\) is the vector of predictor values for an observation. For \(K\) classes there are at most \(\min(K-1, p)\) discriminant functions.

Classification is performed by assigning each observation to the class whose centroid (in the discriminant space) is closest, weighted by the prior class probabilities:

\[\hat{y} = \arg\max_k \left[ \log \pi_k - \tfrac{1}{2} \delta_k^2(\mathbf{x}) \right]\]

where \(\pi_k\) is the prior probability of class \(k\) and \(\delta_k^2(\mathbf{x})\) is the Mahalanobis distance from \(\mathbf{x}\) to the \(k\)-th class centroid.

Key assumptions:

Assumption	Description
Multivariate normality	Predictors follow a multivariate normal distribution within each class
Homogeneity of covariance	All classes share the same covariance matrix \(\boldsymbol{\Sigma}\) (relaxed in QDA)
No perfect multicollinearity	Predictors must not be linearly dependent
Adequate sample size	At least \(p + 1\) observations per class recommended

The singular values (SVD) returned by MASS::lda quantify the importance of each discriminant function: larger values indicate greater between-class separation relative to within-class scatter.

Dataset — `infert`

The infert dataset (available in base R) is drawn from a matched case-control study on secondary infertility. It contains data from 248 women and examines whether a history of induced or spontaneous abortions is associated with infertility.

Variable	Description
`case`	Outcome — 1 = infertile case, 0 = fertile control
`education`	Educational attainment (0–5 yrs, 6–11 yrs, 12+ yrs)
`age`	Age at time of interview
`parity`	Number of prior pregnancies carried to term
`induced`	Number of prior induced abortions (0, 1, 2+)
`spontaneous`	Number of prior spontaneous abortions (0, 1, 2+)
`stratum`	Matching stratum identifier
`pooled.stratum`	Pooled stratum for stratified analysis

The classification goal is to correctly assign each woman to the infertile or fertile group based on her reproductive history and demographics.

data(infert)

# Class balance
cat("Class distribution (case):\n")

## Class distribution (case):

print(table(infert$case))

## 
##   0   1 
## 165  83

cat(sprintf("\nPrevalence: %.1f%% cases\n", 100 * mean(infert$case)))

## 
## Prevalence: 33.5% cases

# Visual overview: induced and spontaneous abortions by case status
p1 <- ggplot(infert, aes(x=factor(induced), fill=factor(case))) +
  geom_bar(position="dodge") +
  scale_fill_manual(values=c("#2166ac","#d6604d"), labels=c("Control","Case")) +
  theme_bw() +
  labs(title="Induced abortions by case status",
       x="Number of induced abortions", y="Count", fill=NULL)

p2 <- ggplot(infert, aes(x=factor(spontaneous), fill=factor(case))) +
  geom_bar(position="dodge") +
  scale_fill_manual(values=c("#2166ac","#d6604d"), labels=c("Control","Case")) +
  theme_bw() +
  labs(title="Spontaneous abortions by case status",
       x="Number of spontaneous abortions", y="Count", fill=NULL)

plot_multiplot(p1, p2, cols=2)

## [[1]]

Model Specification

We fit LDA using all available predictors to discriminate infertile cases from controls. Because there are only two classes (\(K = 2\)), LDA produces a single discriminant function (\(K - 1 = 1\)).

model <- MASS::lda(case ~ ., data=infert)

Results

result <- report_lda(model=model)

Prior Probabilities and Class Counts

Prior probabilities reflect the proportion of each class in the training data. LDA uses these priors to penalise or favour predictions toward a class based on its base rate. Because the two classes are balanced by design (matched case-control study), priors are approximately equal.

result$prior_counts

##       prior counts mean.education6.11yrs mean.education12..yrs mean.age mean.parity mean.induced mean.spontaneous mean.stratum mean.pooled.stratum
## 0 0.6653226    165             0.4848485             0.4666667 31.49091    2.084848    0.5636364        0.3878788     41.80606            33.52121
## 1 0.3346774     83             0.4819277             0.4698795 31.53012    2.108434    0.5903614        0.9518072     42.00000            33.69880

Class Means (Group Centroids)

The group means show the average value of each predictor in each class. Large differences between class means on a predictor indicate that it is likely to be a strong discriminator. Here, spontaneous and induced both show notably higher means in the case group.

round(result$means, 3)

##   education6-11yrs education12+ yrs    age parity induced spontaneous stratum pooled.stratum
## 0            0.485            0.467 31.491  2.085   0.564       0.388  41.806         33.521
## 1            0.482            0.470 31.530  2.108   0.590       0.952  42.000         33.699

Discriminant Function Coefficients

The coefficients (the scaling matrix) define the linear combination of predictors that maximises class separation. Larger absolute values indicate a stronger contribution to discrimination. Coefficients are analogous to standardised regression coefficients but optimised for class separation rather than prediction error.

round(result$coeficients, 4)

##                      LD1
## education6-11yrs  0.7502
## education12+ yrs  2.0488
## age               0.0630
## parity           -0.4574
## induced           1.1616
## spontaneous       1.9802
## stratum          -0.0019
## pooled.stratum   -0.0521

Positive coefficients push scores toward the infertile case class; negative coefficients push toward the control class. spontaneous and induced carry the largest positive weights, consistent with the clinical hypothesis that prior abortion history is associated with secondary infertility.

Model Description — Singular Values

The singular value (SVD) of the single discriminant function quantifies the ratio of between-class to within-class standard deviation on that function. A value greater than 1 indicates that the classes are separated by more than one within-class standard deviation — a practically meaningful separation.

result$model_description

##   Observations      SDV
## 1          248 8.428289

cat(sprintf("\nVariance explained by LD1: 100%% (only one function for K=2)\n"))

## 
## Variance explained by LD1: 100% (only one function for K=2)

cat(sprintf("Between/within-class SD ratio: %.4f\n", result$model_description$SDV))

## Between/within-class SD ratio: 8.4283

Discriminant Scores

We can project all observations onto the single discriminant axis and visualise the overlap between classes. Good discrimination appears as clearly separated distributions.

scores <- predict(model)$x
df_scores <- data.frame(LD1=scores[,1], case=factor(infert$case, labels=c("Control","Case")))

ggplot(df_scores, aes(x=LD1, fill=case)) +
  geom_density(alpha=0.55) +
  scale_fill_manual(values=c("#2166ac","#d6604d")) +
  geom_vline(xintercept=0, linetype="dashed", colour="grey40") +
  theme_bw() +
  labs(title="Distribution of discriminant scores (LD1)",
       subtitle="Vertical dashed line = decision boundary at 0",
       x="Linear Discriminant 1", y="Density", fill=NULL)

The degree of overlap between the two distributions reflects classification uncertainty. A clear bimodal separation would indicate near-perfect discrimination; substantial overlap signals limited discriminating power.

Classification Performance

Confusion Matrix (Resubstitution)

The confusion matrix below shows how well the model classifies the training observations. Because this is a resubstitution estimate (fitted and evaluated on the same data), it is optimistically biased — it overestimates true generalisability.

result$cmatrix

##          0     1    2    sum    p
## 0   146.00 42.00 0.00 188.00 0.78
## 1    19.00 41.00 0.00  60.00 0.68
## 2     0.00  0.00 0.00   0.00 0.00
## sum 165.00 83.00 0.00 248.00 1.00
## p     0.88  0.49 0.00   1.00 0.75

Each cell shows the percentage of observations in the observed class (rows) that were assigned to the predicted class (columns). The diagonal gives the per-class accuracy; off-diagonal cells are misclassification rates.

# Overall accuracy from the confusion matrix
pred_class <- predict(model)$class
obs_class  <- factor(infert$case)
acc <- mean(pred_class == obs_class)
cat(sprintf("Overall resubstitution accuracy: %.1f%%\n", 100 * acc))

## Overall resubstitution accuracy: 75.4%

cat(sprintf("Baseline (majority class):        %.1f%%\n", 100 * max(prop.table(table(infert$case)))))

## Baseline (majority class):        66.5%

Posterior Probabilities

LDA also returns posterior class probabilities for each observation — the probability of belonging to each class given the discriminant score. Values close to 0 or 1 indicate high-confidence assignments; values near 0.5 indicate ambiguity.

post <- predict(model)$posterior
df_post <- data.frame(
  prob_case = post[, "1"],
  case      = factor(infert$case, labels=c("Control","Case"))
)

ggplot(df_post, aes(x=prob_case, fill=case)) +
  geom_histogram(bins=30, position="identity", alpha=0.6) +
  scale_fill_manual(values=c("#2166ac","#d6604d")) +
  geom_vline(xintercept=0.5, linetype="dashed", colour="grey40") +
  theme_bw() +
  labs(title="Posterior probability of being a case",
       subtitle="Dashed line = 0.5 decision threshold",
       x="P(case = 1 | x)", y="Count", fill=NULL)

Second Example — Iris Dataset

The iris dataset provides a three-class problem (\(K = 3\), \(\min(K-1, p) = 2\) discriminant functions) that illustrates multi-class LDA and the interpretation of a two-dimensional discriminant space.

model_iris <- MASS::lda(Species ~ ., data=iris)
result_iris <- report_lda(model=model_iris)

Proportion of Trace

With two discriminant functions, the proportion of trace tells us how much of the total between-class variance each function captures.

prop_trace <- model_iris$svd^2 / sum(model_iris$svd^2)
cat("LD1 proportion of trace:", round(prop_trace[1], 4), "\n")

## LD1 proportion of trace: 0.9912

cat("LD2 proportion of trace:", round(prop_trace[2], 4), "\n")

## LD2 proportion of trace: 0.0088

LD1 typically captures the dominant separation (setosa vs. the other two species); LD2 captures the residual separation between versicolor and virginica.

Discriminant Space Plot

scores_iris <- as.data.frame(predict(model_iris)$x)
scores_iris$Species <- iris$Species

ggplot(scores_iris, aes(x=LD1, y=LD2, colour=Species)) +
  geom_point(alpha=0.7, size=2) +
  stat_ellipse(level=0.95, linewidth=0.8) +
  scale_colour_manual(values=c("#4dac26","#2166ac","#d6604d")) +
  theme_bw() +
  labs(title="Iris species in the LDA discriminant space",
       subtitle="95% concentration ellipses per class",
       x="Linear Discriminant 1", y="Linear Discriminant 2")

Classification Performance

result_iris$cmatrix

##            setosa versicolor virginica    sum    p
## setosa      50.00       0.00      0.00  50.00 1.00
## versicolor   0.00      48.00      1.00  49.00 0.98
## virginica    0.00       2.00     49.00  51.00 0.96
## sum         50.00      50.00     50.00 150.00 1.00
## p            1.00       0.96      0.98   1.00 0.98

pred_iris <- predict(model_iris)$class
cat(sprintf("Overall resubstitution accuracy: %.1f%%\n",
            100 * mean(pred_iris == iris$Species)))

## Overall resubstitution accuracy: 98.0%

References

Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188.

McLachlan, G. J. (2004). Discriminant Analysis and Statistical Pattern Recognition. Wiley.

Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer.

Linear Discriminant Analysis

Dimitrios Zacharatos

Theoretical Background

Dataset — infert

Model Specification

Results

Prior Probabilities and Class Counts

Class Means (Group Centroids)

Discriminant Function Coefficients

Model Description — Singular Values

Discriminant Scores

Classification Performance

Confusion Matrix (Resubstitution)

Posterior Probabilities

Second Example — Iris Dataset

Proportion of Trace

Discriminant Space Plot

Classification Performance

References

Dataset — `infert`