
Multi-Class LDA, Step-by-Step, and An Example

Published: May 29, 2023

Summary of LDA

Linear Discriminant Analysis

knitr::include_graphics("img/FLDA.png")

LDA for Multi-Class Problems

  • Fisher (1936) proposed a discriminant analysis for two classes (the dichotomous case)
  • Fisher’s linear discriminant is a linear function of the features
  • Among linear functions, it has the maximum ability to discriminate between the samples
  • This idea can be extended to multiple groups
  • We are going to show the details and an example

Multi-Class LDA

Extend FLDA to \(g\) Classes

  • Consider \(g\) classes. We have \(g\) independent random samples: \[\begin{aligned} X_{1j} & \overset{iid}\sim (\mu_1, \Sigma), j=1, \cdots, n_1\\ X_{2j} & \overset{iid}\sim (\mu_2, \Sigma), j=1, \cdots, n_2\\ \cdots & \cdots \\ X_{gj} & \overset{iid}\sim (\mu_g, \Sigma), j=1, \cdots, n_g \end{aligned}\]

  • Let \(n=\sum_{i=1}^{g} n_i\), i.e., \(n\) is the total sample size / total number of observations.

  • Let \(X_{ij} \in \mathbb R^p\) denote the \(j\)th observation from the \(i\)th class. The first linear discriminant aims to find a linear function \(Y_{ij}^{(1)}=a^T X_{ij}\) that leads to the maximum separation between the classes

Quantify Separation in a \(g\)-Class Problem

  • Measure separation using F statistic

\[\begin{aligned} F(a) &= \frac{MSB}{MSW}=\frac{SSB/(g-1)}{SSW/(n-g)}\\ &= \frac{\sum_{i=1}^g n_i (\bar Y_{i.}^{(1)} -\bar Y_{..}^{(1)})^2/(g-1)}{\sum_{i=1}^g (n_i-1)S_{Y_i^{(1)}}^2/(n-g)}\\ &=\frac{a^T \sum_{i=1}^g n_i (\bar X_{i.} -\bar X_{..})(\bar X_{i.}-\bar X_{..})^T a}{a^T \sum_{i=1}^g\sum_{j=1}^{n_i} (X_{ij} -\bar X_{i.})(X_{ij}-\bar X_{i.})^T a}\frac{n-g}{g-1}\\ &= \frac{a^T \mathbf B a}{a^T \mathbf W a}\frac{n-g}{g-1} \end{aligned}\] where \(n=\sum_{i=1}^g n_i\), \(\mathbf B\) is the between-group sum-of-squares-and-cross-products matrix, and \(\mathbf W\) is the within-group sum-of-squares-and-cross-products matrix.
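  • As a sanity check, \(F(a)\) for any candidate direction \(a\) is simply the one-way ANOVA F statistic computed from the projected scores \(a^T X_{ij}\). Below is a minimal R sketch (illustrative only; the function name F_stat is ours, and it assumes a data matrix X, class labels grp, and a coefficient vector a):

# F(a) = MSB/MSW of the projected scores, i.e., the one-way ANOVA F statistic
F_stat = function(a, X, grp) {
  y = drop(X %*% a)                            # projected scores a^T x_ij
  anova(lm(y ~ factor(grp)))[["F value"]][1]   # MSB / MSW
}
# e.g., F_stat(c(1, -1), as.matrix(iris[, 2:3]), iris$Species)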

Linear Discriminants - The Maximization Problem

  • The first linear discriminant (LD) is the linear function that maximizes \(F(a)\). Let \(a_1\) denote the coefficients corresponding to the first LD. Then \[a_1 = \underset{a^T S_{pooled} a=1 } {\operatorname*{argmax}} F(a)\]

  • The second LD, the coefficients of which are denoted by \(a_2\), satisfies \[a_2 = \underset{a^T S_{pooled} a=1, a^T S_{pooled} a_1=0 } {\operatorname*{argmax}} F(a)\]

  • The remaining LDs are defined similarly

  • Note that the scale of \(a_k\) does not matter, as it cancels in the ratio of the numerator to the denominator of \(F(a)\). We impose this particular constraint so that later we can simply use Euclidean distances in the projected space.
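  • For completeness, here is a brief sketch (a standard Lagrange-multiplier argument) of why the maximizer must be an eigenvector of \(\mathbf W^{-1}\mathbf B\): maximizing \(a^T \mathbf B a\) subject to \(a^T \mathbf W a\) being fixed gives \[\frac{\partial}{\partial a}\left\{a^T \mathbf B a - \lambda\,(a^T \mathbf W a - c)\right\} = 2\mathbf B a - 2\lambda \mathbf W a = 0 \quad\Longrightarrow\quad \mathbf W^{-1}\mathbf B\, a = \lambda\, a,\] so the stationary directions are eigenvectors of \(\mathbf W^{-1}\mathbf B\), and \(F(a)\) is maximized at the eigenvector associated with the largest eigenvalue.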

Linear Discriminants: The Coefficients

  • Let \(\gamma_k\) denote the \(k\)th eigenvector of \(\mathbf W^{-1}\mathbf B\). Note that the eigenvectors are not necessarily orthogonal, because \(\mathbf W^{-1}\mathbf B\) is not necessarily symmetric.

  • It can be shown that \(a_k\) is proportional to \(\gamma_k\). We rescale it such that \(a_k^T S_{pooled}a_k=1\); as a result, \[a_k = \frac{\gamma_k}{\sqrt{\gamma_k^T S_{pooled}\gamma_k}}\]

  • We have shown earlier that \(rank(\mathbf W^{-1}\mathbf B)=min(g-1, p)\). Let \(r\) denote \(rank(\mathbf W^{-1}\mathbf B)\). Then there are \(r\) LDs.
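  • The eigen-decomposition and the rescaling can be wrapped into a small helper. The sketch below is illustrative (the function name compute_LD_coefs is ours; it assumes \(\mathbf W\), \(\mathbf B\), and \(S_{pooled}\) have already been computed):

# returns the rescaled coefficient vectors as columns;
# only the leading r = rank(W^{-1}B) columns are linear discriminants
compute_LD_coefs = function(W, B, S_pooled) {
  Gamma = eigen(solve(W) %*% B)$vectors        # eigenvectors of W^{-1}B
  apply(Gamma, 2, function(v) v / sqrt(drop(t(v) %*% S_pooled %*% v)))
}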

Linear Discriminants

  • Let’s consider a more compact/implementable presentation of the result. Let \(\mathbf X_{n\times p}\) be the data matrix for the \(n\) observations, each with \(p\) features.

  • The \(k\)th LD, denoted by \(Y^{(k)}\), is an \(n\times 1\) vector. It can be obtained by \[Y^{(k)}=\mathbf X a_k\]

  • Note that the \(Y^{(k)}\)’s have the same pooled variance because \[s^2_{k, pooled}= a_k^T S_{pooled} a_k =1\]

  • They are also uncorrelated (and independent if multivariate normality holds) because, for \(k\not=k'\), \[cov(Y^{(k)}, Y^{(k')})=a_k^T \Sigma a_{k'}\overset{estimate}{=}a_k^T S_{pooled} a_{k'}=0\]

  • Thus, it makes sense to use the Euclidean distance in the projected space

Use top LDs for discrimination

  • The first LD has the best ability to discriminate between the classes; the \(r\)th LD has the poorest.

  • Suppose that we decide to use the top \(T\) LDs. We first combine them into an \(n\times T\) matrix \[\tilde Y =(Y^{(1)}, \cdots, Y^{(T)})= \mathbf X (a_1, \cdots, a_T)\]

  • Next, we compute the centers of the classes in the \(\mathbb R^T\) space, denoted by \(\tilde Y_1, \cdots, \tilde Y_g\), with \(\tilde Y_i= \bar X_{i.}^T (a_1, \cdots, a_T)\)

  • Let \(x_0\in \mathbb R^p\) denote a data point to be allocated.

  • We first project \(x_0\in \mathbb R^p\) to a vector of the LDs, denoted by \(y_0\in \mathbb R^T\). This can be done by \(y_0=x_0^T (a_1, \cdots, a_T)\)

  • We then compute its squared Euclidean distance to each center \[||y_0-\tilde Y_i||^2=(y_0-\tilde Y_i)^T(y_0-\tilde Y_i) \]

  • We allocate \(x_0\) to the class with the nearest center (the minimum-distance rule); see the sketch below.
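  • Below is a minimal R sketch of this nearest-center rule (illustrative only; classify_topT is our own name, and it assumes the coefficient vectors \(a_1, \cdots, a_T\) are stored as the columns of A_T, X is the \(n\times p\) data matrix, and grp holds the class labels):

# allocate a new point x0 using the top T linear discriminants
classify_topT = function(x0, X, grp, A_T) {
  Ytil    = X %*% A_T                                               # n x T matrix of LD scores
  centers = apply(Ytil, 2, function(col) tapply(col, grp, mean))    # g x T matrix of class centers
  y0      = drop(t(x0) %*% A_T)                                     # project the new point
  d2      = rowSums(sweep(centers, 2, y0)^2)                        # squared Euclidean distances
  names(which.min(d2))                                              # nearest-center allocation
}
# e.g., classify_topT(c(5.5, 3), as.matrix(iris[, 2:3]), iris$Species, cbind(LD1, LD2)),
# assuming LD1 and LD2 are computed as in the iris example below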

Linear Discriminants: When Using All the LDs

  • Let \(A=(a_1, \cdots, a_r)\) where \(r=rank(W^{-1}B)\). Note \(A\) is a \(p\times r\) matrix that transforms the original features to \(r\) LD features.

  • When using all the LDs, we have \(\tilde Y = \mathbf X A\), where \(\tilde Y\) is \(n\times r\).

  • Consider \(x_0\), the data point to be classified, and \(x\), an arbitrary point. In the original feature space, the squared statistical (Mahalanobis) distance is \[D_{S_{pooled}}^2(x_0, x)=(x_0-x)^T S_{pooled}^{-1} (x_0 -x)\] (a one-line check of this quantity with base R’s mahalanobis() is given after this list)

  • We transform/project \(x_0\) and \(x\) to the space of the LDs by computing \(y_0=A^Tx_0\) and \(y=A^Tx\); both are \(r\times 1\) vectors. Their squared Euclidean distance is

\[D^2(y_0, y)=(y_0-y)^T(y_0-y)=(x_0-x)^T A A^T (x_0-x)\]
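  • As an aside, the squared statistical distance above can be computed directly with base R’s mahalanobis() (a one-line sketch; x0, x, and S_pooled are assumed to be available as R objects):

mahalanobis(x0, center = x, cov = S_pooled)   # equals (x0 - x)^T S_pooled^{-1} (x0 - x)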

Linear Discriminants: When Using All the LDs

  • Next, let’s examine what \(AA^T\) is.
  • Recall that \(a_k\) is proportional to the \(k\)th eigenvector of \(W^{-1}B\). Thus, it satisfies \(W^{-1}Ba_k=\lambda_k a_k\). Since \(\mathbf W=(n-g)S_{pooled}\), this can be rewritten as \[(n-g)^{-1}S_{pooled}^{-1/2}BS_{pooled}^{-1/2}S_{pooled}^{1/2}a_k=\lambda_k S_{pooled}^{1/2}a_k\]
  • Let \(b_k=S_{pooled}^{1/2}a_k\); then \[S_{pooled}^{-1/2}BS_{pooled}^{-1/2}\, b_k = [\lambda_k(n-g)]\,b_k\] Thus, the \(b_k\)’s are eigenvectors of a symmetric matrix and are therefore orthogonal to each other. Moreover, they have unit length, because \(b_k^T b_k = a_k^T S_{pooled} a_k = 1\).

Linear Discriminants: When Using All the LDs

  • Note that \(a_k=S_{pooled}^{-1/2} b_k\). Using the results about the \(b_k\)’s (they are orthonormal, and when \(r=p\) they form an orthonormal basis of \(\mathbb R^p\), so \(\sum_{k=1}^r b_k b_k^T=I\)), we have \[AA^T=\sum_{k=1}^r a_k a_k^T=S_{pooled}^{-1/2} \left(\sum_{k=1}^r b_k b_k^T\right)S_{pooled}^{-1/2}=S_{pooled}^{-1}\]

  • This result indicates that the following two classification rules are equivalent

    1. The minimum (Euclidean) distance rule in the space of all the LDs
    2. The minimum (statistical/Mahalanobis) distance rule using the original features

Example Iris Data

Compute \(W\) and \(B\)

Iris Data: three species, two features (Sepal W and Petal L)

# i=1; j=2  # (Sepal L and Sepal W): an alternative feature pair
i=2; j=3    # (Sepal W and Petal L): the feature pair used in the results below
x1.bar=colMeans(iris[1:50,c(i,j)])
x2.bar=colMeans(iris[51:100,c(i,j)])
x3.bar=colMeans(iris[101:150,c(i,j)])
Tot=(150-1)*cov(iris[,c(i,j)])                # total sum-of-squares matrix (avoid the name T, which masks TRUE)
W=(50-1)*cov(iris[1:50, c(i,j)]) +(50-1)*cov(iris[51:100, c(i,j)])+
  (50-1)*cov(iris[101:150, c(i,j)])           # within-group matrix
B=Tot-W                                       # between-group matrix
s.pooled=W/(150-3)                            # pooled covariance S_pooled = W/(n-g)

Iris Data: Compute the LDs

LDs=eigen(solve(W)%*%B)$vectors   # eigenvectors of W^{-1}B (as columns)

#rescale them such that t(a_k) S_pooled a_k=1
LD1=LDs[,1]
LD1=LD1/sqrt(drop(t(LD1)%*%s.pooled%*%LD1))   # drop() converts the 1x1 matrix to a scalar
LD2=LDs[,2]
LD2=LD2/sqrt(drop(t(LD2)%*%s.pooled%*%LD2))
# scores on LD1 and LD2
Y1=as.matrix(iris[,c(i,j)])%*%matrix(LD1, 2, 1)
Y2=as.matrix(iris[,c(i,j)])%*%matrix(LD2, 2, 1)

# with directions. Note: they are not exactly projections
# because the length of $a_k$ is not 1
Y1.direct=as.matrix(Y1)%*%matrix(LD1, 1, 2)
Y2.direct=as.matrix(Y2)%*%matrix(LD2, 1, 2)
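
As a quick numerical check of the earlier claim that \(AA^T=S_{pooled}^{-1}\) when all \(r=p\) LDs are used (here \(r=p=2\)), the difference between the two matrices should be zero up to rounding:

# verify A A^T = S_pooled^{-1} (valid here because r = p = 2)
A=cbind(LD1, LD2)
round(A %*% t(A) - solve(s.pooled), 8)   # should be a matrix of zeros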

Iris Data: the LDs

# verify that each LD has pooled variance 1
# (equal group sizes, so the pooled variance equals the simple mean of the class variances)
lapply(split(Y1, iris$Species), var)
$setosa
[1] 0.4709355

$versicolor
[1] 0.953692

$virginica
[1] 1.575373
mean(unlist(lapply(split(Y1, iris$Species), var)))
[1] 1
lapply(split(Y2, iris$Species), var)
$setosa
[1] 1.101256

$versicolor
[1] 0.9343671

$virginica
[1] 0.9643773
mean(unlist(lapply(split(Y2, iris$Species), var)))
[1] 1
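
The two LDs should also be uncorrelated within the classes, since \(a_1^T S_{pooled}\, a_2=0\). A small check using the objects computed above (the pooled within-class covariance of the two score vectors should be numerically zero):

# pooled within-class covariance between the two LD scores (~0)
mean(sapply(split(data.frame(Y1 = c(Y1), Y2 = c(Y2)), iris$Species),
            function(d) cov(d$Y1, d$Y2)))
# the same quantity computed from the coefficients
drop(t(LD1) %*% s.pooled %*% LD2)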

Iris Data: Visualize the LDs

plot(Y1, Y2)

Iris Data: Visualize the LD Directions Using the Rescaled LDs

par(pty="s")   # square plotting region (pty must be set via par(), not plot())
plot(iris3[,i,1], iris3[,j,1], xlab="SepalW", ylab="PetalL", 
     xlim=c(-11, 30), ylim=c(-1,30))
points(0, 0, cex=1.2)
points(iris3[,i,2], iris3[,j,2], col=2)
points(iris3[,i,3], iris3[,j,3], col=3)
abline(a=0, b=LD1[2]/LD1[1], col="blue")
abline(a=0, b=LD2[2]/LD2[1], col="blue")
arrows(0, 0, LD1[1], LD1[2], length = 0.1, angle=15, col="blue")
arrows(0, 0, LD2[1], LD2[2], length = 0.1, angle=15, col="blue")
for(m in 1:50){  
  text(x=Y1.direct[m,1], y=Y1.direct[m,2], labels="|", col=1, srt=atan(LD1[2]/LD1[1])*180/pi, cex=0.5)
  text(x=Y2.direct[m,1], y=Y2.direct[m,2], labels="|", col=1, srt=atan(LD2[2]/LD2[1])*180/pi, cex=0.5)}

for(m in 51:100){  
  text(x=Y1.direct[m,1], y=Y1.direct[m,2], labels="|", col=2, srt=atan(LD1[2]/LD1[1])*180/pi, cex=0.5)
  text(x=Y2.direct[m,1], y=Y2.direct[m,2], labels="|", col=2, srt=atan(LD2[2]/LD2[1])*180/pi, cex=0.5)}

for(m in 101:150){  
  text(x=Y1.direct[m,1], y=Y1.direct[m,2], labels="|", col=3, srt=atan(LD1[2]/LD1[1])*180/pi, cex=0.5)
  text(x=Y2.direct[m,1], y=Y2.direct[m,2], labels="|", col=3, srt=atan(LD2[2]/LD2[1])*180/pi, cex=0.5)}
text(x=0, y=1.2, "LD1", srt=-68, col="blue")
text(x=4, y=0.2, "LD2", srt=-3, col="blue")

  • The results do not look like projections. They are not; they are proportional to the projections. The eigenvectors / coefficients of the LDs are rescaled such that the variance of each LD is 1. If we want to show the projected data, we need to require each \(a_k\) to have norm 1.

  • In Assignment 3, you can visualize either the projections or the transformed data based on the rescaled LDs.

Minimum Distance Classification Rule

  • Consider \(x_0=[5.5, 3]\)
  • We calculate the distance between the new observation and each of the three class centers using two methods
  • As shown below, the two approaches give the same distances
  • Thus, using all the LDs is equivalent to using the statistical distance computed from the original features
# consider an example point to classify
x0=c(5.5, 3)   # (Sepal W, Petal L)

# statistical distance using original data
(x0-x1.bar)%*% solve(s.pooled) %*% (x0-x1.bar) 
         [,1]
[1,] 39.08461
(x0-x2.bar)%*% solve(s.pooled) %*% (x0-x2.bar) 
         [,1]
[1,] 106.1023
(x0-x3.bar)%*% solve(s.pooled) %*% (x0-x3.bar) 
         [,1]
[1,] 144.4239
# Euclidean distance using LDs
(x0%*%LD1-x1.bar%*%LD1)^2 + (x0%*%LD2-x1.bar%*%LD2)^2
         [,1]
[1,] 39.08461
(x0%*%LD1-x2.bar%*%LD1)^2 + (x0%*%LD2-x2.bar%*%LD2)^2
         [,1]
[1,] 106.1023
(x0%*%LD1-x3.bar%*%LD1)^2 + (x0%*%LD2-x3.bar%*%LD2)^2
         [,1]
[1,] 144.4239