Linear Discriminant Analysis (LDA) is a supervised classification and dimensionality reduction technique introduced by Fisher (1936). Given \(K\) classes and \(p\) predictors, LDA finds a set of linear combinations of the predictors — called discriminant functions — that maximise the ratio of between-class variance to within-class variance.
Formally, the \(k\)-th discriminant function is:
\[z_k = \mathbf{a}_k^\top \mathbf{x}\]
where \(\mathbf{a}_k\) is the vector
of discriminant coefficients (the scaling matrix returned by
MASS::lda) and \(\mathbf{x}\) is the vector of predictor
values for an observation. For \(K\)
classes there are at most \(\min(K-1,
p)\) discriminant functions.
Classification is performed by assigning each observation to the class whose centroid (in the discriminant space) is closest, weighted by the prior class probabilities:
\[\hat{y} = \arg\max_k \left[ \log \pi_k - \tfrac{1}{2} \delta_k^2(\mathbf{x}) \right]\]
where \(\pi_k\) is the prior probability of class \(k\) and \(\delta_k^2(\mathbf{x})\) is the Mahalanobis distance from \(\mathbf{x}\) to the \(k\)-th class centroid.
Key assumptions:
| Assumption | Description |
|---|---|
| Multivariate normality | Predictors follow a multivariate normal distribution within each class |
| Homogeneity of covariance | All classes share the same covariance matrix \(\boldsymbol{\Sigma}\) (relaxed in QDA) |
| No perfect multicollinearity | Predictors must not be linearly dependent |
| Adequate sample size | At least \(p + 1\) observations per class recommended |
The singular values (SVD) returned by
MASS::lda quantify the importance of each discriminant
function: larger values indicate greater between-class separation
relative to within-class scatter.
infertThe infert dataset (available in base R) is drawn from a
matched case-control study on secondary infertility. It contains data
from 248 women and examines whether a history of induced or spontaneous
abortions is associated with infertility.
| Variable | Description |
|---|---|
case |
Outcome — 1 = infertile case, 0 = fertile control |
education |
Educational attainment (0–5 yrs, 6–11 yrs, 12+ yrs) |
age |
Age at time of interview |
parity |
Number of prior pregnancies carried to term |
induced |
Number of prior induced abortions (0, 1, 2+) |
spontaneous |
Number of prior spontaneous abortions (0, 1, 2+) |
stratum |
Matching stratum identifier |
pooled.stratum |
Pooled stratum for stratified analysis |
The classification goal is to correctly assign each woman to the infertile or fertile group based on her reproductive history and demographics.
data(infert)
# Class balance
cat("Class distribution (case):\n")
## Class distribution (case):
print(table(infert$case))
## ## 0 1 ## 165 83
cat(sprintf("\nPrevalence: %.1f%% cases\n", 100 * mean(infert$case)))
## ## Prevalence: 33.5% cases
# Visual overview: induced and spontaneous abortions by case status
p1 <- ggplot(infert, aes(x=factor(induced), fill=factor(case))) +
geom_bar(position="dodge") +
scale_fill_manual(values=c("#2166ac","#d6604d"), labels=c("Control","Case")) +
theme_bw() +
labs(title="Induced abortions by case status",
x="Number of induced abortions", y="Count", fill=NULL)
p2 <- ggplot(infert, aes(x=factor(spontaneous), fill=factor(case))) +
geom_bar(position="dodge") +
scale_fill_manual(values=c("#2166ac","#d6604d"), labels=c("Control","Case")) +
theme_bw() +
labs(title="Spontaneous abortions by case status",
x="Number of spontaneous abortions", y="Count", fill=NULL)
plot_multiplot(p1, p2, cols=2)
## [[1]]
We fit LDA using all available predictors to discriminate infertile cases from controls. Because there are only two classes (\(K = 2\)), LDA produces a single discriminant function (\(K - 1 = 1\)).
model <- MASS::lda(case ~ ., data=infert)
result <- report_lda(model=model)
Prior probabilities reflect the proportion of each class in the training data. LDA uses these priors to penalise or favour predictions toward a class based on its base rate. Because the two classes are balanced by design (matched case-control study), priors are approximately equal.
result$prior_counts
## prior counts mean.education6.11yrs mean.education12..yrs mean.age mean.parity mean.induced mean.spontaneous mean.stratum mean.pooled.stratum ## 0 0.6653226 165 0.4848485 0.4666667 31.49091 2.084848 0.5636364 0.3878788 41.80606 33.52121 ## 1 0.3346774 83 0.4819277 0.4698795 31.53012 2.108434 0.5903614 0.9518072 42.00000 33.69880
The group means show the average value of each predictor in each
class. Large differences between class means on a predictor indicate
that it is likely to be a strong discriminator. Here,
spontaneous and induced both show notably
higher means in the case group.
round(result$means, 3)
## education6-11yrs education12+ yrs age parity induced spontaneous stratum pooled.stratum ## 0 0.485 0.467 31.491 2.085 0.564 0.388 41.806 33.521 ## 1 0.482 0.470 31.530 2.108 0.590 0.952 42.000 33.699
The coefficients (the scaling matrix) define the linear combination of predictors that maximises class separation. Larger absolute values indicate a stronger contribution to discrimination. Coefficients are analogous to standardised regression coefficients but optimised for class separation rather than prediction error.
round(result$coeficients, 4)
## LD1 ## education6-11yrs 0.7502 ## education12+ yrs 2.0488 ## age 0.0630 ## parity -0.4574 ## induced 1.1616 ## spontaneous 1.9802 ## stratum -0.0019 ## pooled.stratum -0.0521
Positive coefficients push scores toward the infertile case class;
negative coefficients push toward the control class.
spontaneous and induced carry the largest
positive weights, consistent with the clinical hypothesis that prior
abortion history is associated with secondary infertility.
The singular value (SVD) of the single discriminant function quantifies the ratio of between-class to within-class standard deviation on that function. A value greater than 1 indicates that the classes are separated by more than one within-class standard deviation — a practically meaningful separation.
result$model_description
## Observations SDV ## 1 248 8.428289
cat(sprintf("\nVariance explained by LD1: 100%% (only one function for K=2)\n"))
## ## Variance explained by LD1: 100% (only one function for K=2)
cat(sprintf("Between/within-class SD ratio: %.4f\n", result$model_description$SDV))
## Between/within-class SD ratio: 8.4283
We can project all observations onto the single discriminant axis and visualise the overlap between classes. Good discrimination appears as clearly separated distributions.
scores <- predict(model)$x
df_scores <- data.frame(LD1=scores[,1], case=factor(infert$case, labels=c("Control","Case")))
ggplot(df_scores, aes(x=LD1, fill=case)) +
geom_density(alpha=0.55) +
scale_fill_manual(values=c("#2166ac","#d6604d")) +
geom_vline(xintercept=0, linetype="dashed", colour="grey40") +
theme_bw() +
labs(title="Distribution of discriminant scores (LD1)",
subtitle="Vertical dashed line = decision boundary at 0",
x="Linear Discriminant 1", y="Density", fill=NULL)
The degree of overlap between the two distributions reflects classification uncertainty. A clear bimodal separation would indicate near-perfect discrimination; substantial overlap signals limited discriminating power.
The confusion matrix below shows how well the model classifies the training observations. Because this is a resubstitution estimate (fitted and evaluated on the same data), it is optimistically biased — it overestimates true generalisability.
result$cmatrix
## 0 1 2 sum p ## 0 146.00 42.00 0.00 188.00 0.78 ## 1 19.00 41.00 0.00 60.00 0.68 ## 2 0.00 0.00 0.00 0.00 0.00 ## sum 165.00 83.00 0.00 248.00 1.00 ## p 0.88 0.49 0.00 1.00 0.75
Each cell shows the percentage of observations in the observed class (rows) that were assigned to the predicted class (columns). The diagonal gives the per-class accuracy; off-diagonal cells are misclassification rates.
# Overall accuracy from the confusion matrix
pred_class <- predict(model)$class
obs_class <- factor(infert$case)
acc <- mean(pred_class == obs_class)
cat(sprintf("Overall resubstitution accuracy: %.1f%%\n", 100 * acc))
## Overall resubstitution accuracy: 75.4%
cat(sprintf("Baseline (majority class): %.1f%%\n", 100 * max(prop.table(table(infert$case)))))
## Baseline (majority class): 66.5%
LDA also returns posterior class probabilities for each observation — the probability of belonging to each class given the discriminant score. Values close to 0 or 1 indicate high-confidence assignments; values near 0.5 indicate ambiguity.
post <- predict(model)$posterior
df_post <- data.frame(
prob_case = post[, "1"],
case = factor(infert$case, labels=c("Control","Case"))
)
ggplot(df_post, aes(x=prob_case, fill=case)) +
geom_histogram(bins=30, position="identity", alpha=0.6) +
scale_fill_manual(values=c("#2166ac","#d6604d")) +
geom_vline(xintercept=0.5, linetype="dashed", colour="grey40") +
theme_bw() +
labs(title="Posterior probability of being a case",
subtitle="Dashed line = 0.5 decision threshold",
x="P(case = 1 | x)", y="Count", fill=NULL)
The iris dataset provides a three-class problem (\(K = 3\), \(\min(K-1, p) = 2\) discriminant functions)
that illustrates multi-class LDA and the interpretation of a
two-dimensional discriminant space.
model_iris <- MASS::lda(Species ~ ., data=iris)
result_iris <- report_lda(model=model_iris)
With two discriminant functions, the proportion of trace tells us how much of the total between-class variance each function captures.
prop_trace <- model_iris$svd^2 / sum(model_iris$svd^2)
cat("LD1 proportion of trace:", round(prop_trace[1], 4), "\n")
## LD1 proportion of trace: 0.9912
cat("LD2 proportion of trace:", round(prop_trace[2], 4), "\n")
## LD2 proportion of trace: 0.0088
LD1 typically captures the dominant separation (setosa vs. the other two species); LD2 captures the residual separation between versicolor and virginica.
scores_iris <- as.data.frame(predict(model_iris)$x)
scores_iris$Species <- iris$Species
ggplot(scores_iris, aes(x=LD1, y=LD2, colour=Species)) +
geom_point(alpha=0.7, size=2) +
stat_ellipse(level=0.95, linewidth=0.8) +
scale_colour_manual(values=c("#4dac26","#2166ac","#d6604d")) +
theme_bw() +
labs(title="Iris species in the LDA discriminant space",
subtitle="95% concentration ellipses per class",
x="Linear Discriminant 1", y="Linear Discriminant 2")
result_iris$cmatrix
## setosa versicolor virginica sum p ## setosa 50.00 0.00 0.00 50.00 1.00 ## versicolor 0.00 48.00 1.00 49.00 0.98 ## virginica 0.00 2.00 49.00 51.00 0.96 ## sum 50.00 50.00 50.00 150.00 1.00 ## p 1.00 0.96 0.98 1.00 0.98
pred_iris <- predict(model_iris)$class
cat(sprintf("Overall resubstitution accuracy: %.1f%%\n",
100 * mean(pred_iris == iris$Species)))
## Overall resubstitution accuracy: 98.0%
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179–188.
McLachlan, G. J. (2004). Discriminant Analysis and Statistical Pattern Recognition. Wiley.
Venables, W. N., & Ripley, B. D. (2002). Modern Applied Statistics with S (4th ed.). Springer.