Linear Discriminant Analysis (LDA) is a supervised learning model that, like logistic regression, has a categorical outcome variable and can therefore be used for classification. If you are unfamiliar with logistic regression, you can find a brief primer here. The difference between LDA and logistic regression is that LDA can be very useful when you are dealing with more than two response classes (although a multinomial logistic regression model can also handle more than two response classes).
In logistic regression, we model \(\mathbf{Pr}(Y = k \lvert X = x)\) directly using the logistic function. In LDA, we instead model the distribution of the predictors \(X\) separately in each of the response classes (i.e. given \(Y\)), and then use Bayes’ Theorem to convert these into estimates for \(\mathbf{Pr}(Y = k \lvert X = x)\). When these distributions are assumed to be normal, the resulting model turns out to be very similar in form to logistic regression. If this is the case, you may be wondering why not just use logistic regression? LDA provides three advantages over logistic regression.
When the classes are well-separated, the parameter estimates for the logistic regression model are fairly unstable. LDA does not suffer from this problem.
If \(n\) is small, and the distribution of the predictors \(X\) is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
LDA is popular when we have more than two response classes.
Let us suppose that we would like to classify an observation into one of \(K\) classes, where \(K \geq 2\). Let \(\pi_{k}\) denote the prior probability that a randomly chosen observation comes from the \(k\)th class; this is the probability that a randomly chosen observation is associated with the \(k\)th category of the response variable \(Y\). Then, let \(f_{k}(X) \equiv \mathbf{Pr}(X = x \lvert Y = k)\) denote the density function of \(X\) for an observation that comes from the \(k\)th class. That is, \(f_{k}(x)\) is relatively large if there is a high probability that an observation in the \(k\)th class has \(X \approx x\), and \(f_{k}(x)\) is small if it is very unlikely that an observation in the \(k\)th class has \(X \approx x\). Then, Bayes’ Theorem states that
\[ \begin{aligned} \mathbf{Pr}(Y = k \lvert X = x) = \frac{\pi_{k}f_{k}(x)}{\sum_{l=1}^{K}{\pi_{l}f_{l}(x)}} \end{aligned} \]
Our goal is to estimate \(\pi_{k}\) and \(f_{k}(X)\). The former is relatively easy if we have a random sample of \(Y\)s from the population: we simply compute the fraction of the training observations that belong to the \(k\)th class. However, estimating \(f_{k}(X)\) tends to be more difficult, unless we assume some simple forms for these densities. We refer to \(p_{k}(x) = \mathbf{Pr}(Y = k \lvert X = x)\) as the posterior probability that an observation \(X = x\) belongs to the \(k\)th class.
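As a quick numerical illustration of Bayes’ Theorem (the priors and density values below are made up purely for illustration, not estimated from any data), we can compute the posterior probabilities directly:

pi_k <- c(0.3, 0.5, 0.2)      # hypothetical prior probabilities pi_k (must sum to 1)
f_k  <- c(0.10, 0.05, 0.40)   # hypothetical density values f_k(x) at some point x

p_k <- pi_k * f_k / sum(pi_k * f_k)   # Bayes' Theorem: posterior probabilities
p_k
which.max(p_k)                        # class with the largest posterior probability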
Let us assume that \(p = 1\), which means that we have only one predictor. Our goal is to obtain an estimate for \(f_{k}(x)\) that we can plug into the above equation in order to estimate \(p_{k}(x)\). We will then classify an observation to the class for which \(p_{k}(x)\) is the greatest. To estimate \(f_{k}(x)\), we will need to make a few assumptions about its form.
If we assume that \(f_{k}(x)\) is normally distributed, in a situation where \(p = 1\), the normal density takes the form
\[ \begin{aligned} f_{k}(x) = \frac{1}{\sqrt{2\pi}\sigma_{k}}\exp{\Bigg(-\frac{1}{2\sigma^{2}_{k}}(x - \mu_{k})^{2}\Bigg)} \end{aligned} \]
where \(\mu_{k}\) and \(\sigma_{k}^{2}\) are the mean and variance parameters for the \(k\)th class, respectively. For now, let us assume that the variance is shared across all \(K\) classes, that is, \(\sigma^{2}_{1} = \ldots = \sigma^{2}_{K}\). For simplicity, we can denote this common variance by \(\sigma^{2}\). We then get
\[ \begin{aligned} p_{k}(x) = \frac{\pi_{k}\frac{1}{\sqrt{2\pi}\sigma}\exp{\Big(-\frac{1}{2\sigma^{2}}(x - \mu_{k})^{2}\Big)}}{\sum_{l=1}^{K}{\pi_{l}\frac{1}{\sqrt{2\pi}\sigma}\exp{\Big(-\frac{1}{2\sigma^{2}}(x - \mu_{l})^{2}\Big)}}} \end{aligned} \]
If we take the logarithm of the above equation and rearrange the terms, we can show that this is equivalent to assigning the observation to the class for which
\[ \begin{aligned} \delta_{k}(x) = x \cdot \frac{\mu_{k}}{\sigma^{2}} - \frac{\mu_{k}^{2}}{2\sigma^{2}} + \log{\pi_{k}} \end{aligned} \]
is largest. For example, if \(K = 2\) and \(\pi_{1} = \pi_{2}\), then the Bayes classifier assigns an observation to class \(1\) if \(2x(\mu_{1} - \mu_{2}) > \mu_{1}^{2} - \mu_{2}^{2}\) and to class \(2\) otherwise. In this scenario, the Bayes decision boundary corresponds to the point where
\[ \begin{equation} x = \frac{\mu_{1}^{2} - \mu_{2}^{2}}{2(\mu_{1} - \mu_{2})} = \frac{\mu_{1} + \mu_{2}}{2} \end{equation} \]
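To make the derivation concrete, here is a small sketch in R; the class means, shared variance, and priors are hypothetical values chosen only for illustration. It computes the posteriors \(p_{k}(x)\) using dnorm(), the discriminant scores \(\delta_{k}(x)\), and the decision boundary for \(K = 2\) with equal priors; both rules assign the observation to the same class.

mu    <- c(-1.25, 1.25)   # hypothetical class means mu_1, mu_2
sigma <- 1                # hypothetical shared standard deviation
pi_k  <- c(0.5, 0.5)      # equal prior probabilities
x     <- 0.4              # observation to classify

f_k <- dnorm(x, mean = mu, sd = sigma)   # normal densities f_k(x)
p_k <- pi_k * f_k / sum(pi_k * f_k)      # posterior probabilities p_k(x)

delta_k <- x * mu / sigma^2 - mu^2 / (2 * sigma^2) + log(pi_k)   # discriminant scores

which.max(p_k) == which.max(delta_k)     # the two rules agree

(mu[1] + mu[2]) / 2                      # Bayes decision boundary for equal priors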
The estimates for \(\mu_{k}\) and \(\sigma^{2}\) are
\[ \begin{aligned} & \hat{\mu}_{k} = \frac{1}{n_{k}} \sum_{i:y_{i}=k}{x_{i}} \\\\ & \hat{\sigma}^{2} = \frac{1}{n - K} \sum_{k=1}^{K}\sum_{i:y_{i}=k}{(x_{i} - \hat{\mu}_{k})^{2}} \end{aligned} \]
where \(n\) is the total number of training observations, and \(n_{k}\) is the number of training observations in the \(k\)th class. The estimate for \(\mu_{k}\) is simply the average of all the training observations from the \(k\)th class, while \(\hat{\sigma}^{2}\) can be seen as a weighted average of the sample variances for each of the \(K\) classes. There will be instances in which we have knowledge of the class membership probabilities \(\pi_{1}, \ldots , \pi_{K}\), which can be used directly. However, in the absence of any additional information, LDA estimates \(\pi_{k}\) using the proportion of the training observations that belong to the \(k\)th class. That is,
\[ \begin{aligned} \hat{\pi}_{k} = \frac{n_{k}}{n} \end{aligned} \]
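These estimates are easy to compute by hand. Below is a sketch using simulated data; the class means, sample sizes, and variance used to generate the data are arbitrary choices for illustration.

set.seed(1)
n_k <- c(60, 40)                                     # hypothetical class sample sizes
y   <- rep(1:2, times = n_k)                         # class labels
x   <- rnorm(sum(n_k), mean = c(-1, 1)[y], sd = 1)   # one predictor per observation

n <- length(x)
K <- 2

pi_hat     <- as.numeric(table(y)) / n           # pi_hat_k: class proportions
mu_hat     <- tapply(x, y, mean)                 # mu_hat_k: class-specific sample means
sigma2_hat <- sum((x - mu_hat[y])^2) / (n - K)   # pooled estimate of the shared variance

pi_hat; mu_hat; sigma2_hat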
The data set we will use is the Smarket stock market data from the ISLR package. We will attempt to use Linear Discriminant Analysis to classify whether the market will go up or down using a single predictor, the column “Lag1.”
library(MASS)   # provides the lda() function
library(ISLR)   # provides the Smarket data set
data("Smarket")
summary(Smarket)
## Year Lag1 Lag2
## Min. :2001 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500
## Median :2003 Median : 0.039000 Median : 0.039000
## Mean :2003 Mean : 0.003834 Mean : 0.003919
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000
## Lag3 Lag4 Lag5
## Min. :-4.922000 Min. :-4.922000 Min. :-4.92200
## 1st Qu.:-0.640000 1st Qu.:-0.640000 1st Qu.:-0.64000
## Median : 0.038500 Median : 0.038500 Median : 0.03850
## Mean : 0.001716 Mean : 0.001636 Mean : 0.00561
## 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.59700
## Max. : 5.733000 Max. : 5.733000 Max. : 5.73300
## Volume Today Direction
## Min. :0.3561 Min. :-4.922000 Down:602
## 1st Qu.:1.2574 1st Qu.:-0.639500 Up :648
## Median :1.4229 Median : 0.038500
## Mean :1.4783 Mean : 0.003138
## 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. :3.1525 Max. : 5.733000
The code for LDA with \(p = 1\) is fairly simple.
set.seed(1987)                                      # for reproducibility
train <- sample(1:nrow(Smarket), nrow(Smarket)/2)   # indices of a 50% training split
lda.fit <- lda(Direction ~ Lag1, data = Smarket, subset = train)   # fit LDA on the training data
lda.fit
## Call:
## lda(Direction ~ Lag1, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4832 0.5168
##
## Group means:
## Lag1
## Down 0.12472185
## Up 0.02521362
##
## Coefficients of linear discriminants:
## LD1
## Lag1 0.8634778
The resulting output indicates that \(\hat{\pi}_{1} = 0.4832\) and \(\hat{\pi}_{2} = 0.5168\). That is, \(48.32\%\) of the training observations correspond to days during which the market went down, and \(51.68\%\) of the training observations correspond to days during which the market went up. The output also provides the group means of Lag1 for each class. They suggest that the previous day’s return tends to be positive whether the market subsequently goes up or down, and that it tends to be somewhat larger on days when the market goes down.
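We can check where the prior probabilities and group means come from by recomputing them directly from the training subset; these values should match the lda() output above, since lda() estimates the priors from the class proportions when none are supplied.

train.data <- Smarket[train, ]                        # the training observations only
prop.table(table(train.data$Direction))               # pi_hat_k: class proportions
tapply(train.data$Lag1, train.data$Direction, mean)   # mu_hat_k: group means of Lag1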
The coefficients of linear discriminants that are used to form the LDA decision rule are also in the output. Since the Down class has the larger mean of Lag1, a large value of \(0.8635 \times \text{Lag1}\) leads the LDA classifier to predict a market decrease, while a small value leads it to predict an increase.
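The discriminant values themselves, which equal \(0.8635 \times \text{Lag1}\) up to a centering constant, can be inspected (and plotted by class) from the fitted model:

head(predict(lda.fit)$x)   # linear discriminant values for the training observations
plot(lda.fit)              # histograms of the discriminant values for each class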
Applying a \(50\%\) threshold to the posterior probabilities allows us to predict whether the market will go up or down given the previous day’s return.
lda.predict <- predict(lda.fit, Smarket)   # predictions for the full Smarket data set
lda.class <- lda.predict$class             # predicted class ("Up" or "Down") for each day
prop.table(table(lda.class))               # proportion of days predicted in each class
## lda.class
## Down Up
## 0.1512 0.8488
The LDA classifier predicts that, given the previous day’s return, the market will increase \(84.9\%\) of the time and will decrease \(15.1\%\) of the time.
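The predicted classes above come from applying the \(50\%\) threshold to the posterior probabilities returned by predict(), so we can reproduce them, or experiment with a different cutoff, directly; the \(55\%\) cutoff below is just an arbitrary example.

head(lda.predict$posterior)                 # posterior probabilities for each class

up.50 <- ifelse(lda.predict$posterior[, "Up"] > 0.5, "Up", "Down")   # explicit 50% threshold
table(up.50)                                # should match table(lda.class)

sum(lda.predict$posterior[, "Up"] > 0.55)   # days predicted "Up" at a 55% cutoff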
We can compare the results with our original data set.
prop.table(table(Smarket$Direction))   # actual proportion of Up and Down days
##
## Down Up
## 0.4816 0.5184
Obviously, our predictions are quite poor, but that could be due to a variety of factors. Perhaps we could improve the model by adding more predictors.
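One way to quantify how poor the predictions are is to build a confusion matrix on the held-out observations (those not in train); the sketch below compares the predicted and actual market directions and computes the test-set accuracy.

Smarket.test  <- Smarket[-train, ]                # observations not used for training
lda.pred.test <- predict(lda.fit, Smarket.test)   # predictions for the held-out data

table(Predicted = lda.pred.test$class, Actual = Smarket.test$Direction)   # confusion matrix
mean(lda.pred.test$class == Smarket.test$Direction)                       # test-set accuracy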