Linear Discriminant Analysis (LDA) is a supervised learning model that, like logistic regression, has a categorical outcome variable and can therefore be used for classification. If you are unfamiliar with logistic regression, you can find a brief primer here. The difference between LDA and logistic regression is that LDA can be very useful when you are dealing with more than two response classes (although a multinomial logistic regression model can also handle more than two response classes).
In logistic regression, we model \(\mathbf{Pr}(Y = k \lvert X = x)\) directly using the logistic function. In LDA, we instead model the distribution of the predictors \(X\) separately in each of the response classes (i.e. given \(Y\)), and then use Bayes’ Theorem to convert these into estimates for \(\mathbf{Pr}(Y = k \lvert X = x)\). When these distributions are assumed to be normal, the resulting model turns out to be very similar in form to logistic regression. If this is the case, you may be wondering why not just use logistic regression? LDA provides three advantages over logistic regression.
When the classes are well-separated, the parameter estimates for the logistic regression model are fairly unstable. LDA does not suffer from this problem.
If \(n\) is small, and the distribution of the predictors \(X\) is approximately normal in each of the classes, the linear discriminant model is again more stable than the logistic regression model.
LDA is popular when we have more than two response classes.
Let us suppose that we would like to classify an observation into one of \(K\) classes, where \(K \geq 2\). Let \(\pi_{k}\) denote the prior probability that a randomly chosen observation comes from the \(k\)th class; this is the probability that a randomly chosen observation is associated with the \(k\)th category of the response variable \(Y\). Then, let \(f_{k}(X) \equiv \mathbf{Pr}(X = x \lvert Y = k)\) denote the density function of \(X\) for an observation that comes from the \(k\)th class. That is, \(f_{k}(x)\) is relatively large if there is a high probability that an observation in the \(k\)th class has \(X \approx x\), and \(f_{k}(x)\) is small if it is very unlikely that an observation in the \(k\)th class has \(X \approx x\). Then, Bayes’ Theorem states that
\[ \begin{aligned} \mathbf{Pr}(Y = k \lvert X = x) = \frac{\pi_{k}f_{k}(x)}{\sum_{l=1}^{K}{\pi_{l}f_{l}(x)}} \end{aligned} \]
Our goal is to estimate \(\pi_{k}\) and \(f_{k}(X)\). The former is relatively easy if we have a random sample of \(Y\)s from the population: we simply compute the fraction of the training observations that belong to the \(k\)th class. However, estimating \(f_{k}(X)\) tends to be more difficult, unless we assume some simple forms for these densities. We refer to \(p_{k}(x) = \mathbf{Pr}(Y = k \lvert X = x)\) as the posterior probability that an observation \(X = x\) belongs to the \(k\)th class.
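As a quick numerical illustration of Bayes’ Theorem (the priors and density values below are made up purely for illustration, not estimated from any data), we can compute the posterior probabilities directly:

pi_k <- c(0.3, 0.5, 0.2)      # hypothetical prior probabilities pi_k (must sum to 1)
f_k  <- c(0.10, 0.05, 0.40)   # hypothetical density values f_k(x) at some point x

p_k <- pi_k * f_k / sum(pi_k * f_k)   # Bayes' Theorem: posterior probabilities
p_k
which.max(p_k)                        # class with the largest posterior probability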
Let us assume that \(p = 1\), which means that we have only one predictor. Our goal is to obtain an estimate for \(f_{k}(x)\) that we can plug into the above equation in order to estimate \(p_{k}(x)\). We will then classify an observation to the class for which \(p_{k}(x)\) is the greatest. To estimate \(f_{k}(x)\), we will need to make a few assumptions about its form.
If we assume that \(f_{k}(x)\) is normally distributed, in a situation where \(p = 1\), the normal density takes the form
\[ \begin{aligned} f_{k}(x) = \frac{1}{\sqrt{2\pi}\sigma_{k}}\exp{\Bigg(-\frac{1}{2\sigma^{2}_{k}}(x - \mu_{k})^{2}\Bigg)} \end{aligned} \]
where \(\mu_{k}\) and \(\sigma_{k}^{2}\) are the mean and variance parameters for the \(k\)th class, respectively. For now, let us assume that the variance is shared across all \(K\) classes, that is, \(\sigma^{2}_{1} = \ldots = \sigma^{2}_{K}\). For simplicity, we can denote this common variance by \(\sigma^{2}\). We then get
\[ \begin{aligned} p_{k}(x) = \frac{\pi_{k}\frac{1}{\sqrt{2\pi}\sigma}\exp{\Big(-\frac{1}{2\sigma^{2}}(x - \mu_{k})^{2}\Big)}}{\sum_{l=1}^{K}{\pi_{l}\frac{1}{\sqrt{2\pi}\sigma}\exp{\Big(-\frac{1}{2\sigma^{2}}(x - \mu_{l})^{2}\Big)}}} \end{aligned} \]
If we take the logarithm of the above equation and rearrange the terms, we can show that this is equivalent to assigning the observation to the class for which
\[ \begin{aligned} \delta_{k}(x) = x \cdot \frac{\mu_{k}}{\sigma^{2}} - \frac{\mu_{k}^{2}}{2\sigma^{2}} + \log{\pi_{k}} \end{aligned} \]
is largest. For example, if \(K = 2\) and \(\pi_{1} = \pi_{2}\), then the Bayes classifier assigns an observation to class \(1\) if \(2x(\mu_{1} - \mu_{2}) > \mu_{1}^{2} - \mu_{2}^{2}\) and to class \(2\) otherwise. In this scenario, the Bayes decision boundary corresponds to the point where
\[ \begin{equation} x = \frac{\mu_{1}^{2} - \mu_{2}^{2}}{2(\mu_{1} - \mu_{2})} = \frac{\mu_{1} + \mu_{2}}{2} \end{equation} \]
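To make the derivation concrete, here is a small sketch in R; the class means, shared variance, and priors are hypothetical values chosen only for illustration. It computes the posteriors \(p_{k}(x)\) using dnorm(), the discriminant scores \(\delta_{k}(x)\), and the decision boundary for \(K = 2\) with equal priors; both rules assign the observation to the same class.

mu    <- c(-1.25, 1.25)   # hypothetical class means mu_1, mu_2
sigma <- 1                # hypothetical shared standard deviation
pi_k  <- c(0.5, 0.5)      # equal prior probabilities
x     <- 0.4              # observation to classify

f_k <- dnorm(x, mean = mu, sd = sigma)   # normal densities f_k(x)
p_k <- pi_k * f_k / sum(pi_k * f_k)      # posterior probabilities p_k(x)

delta_k <- x * mu / sigma^2 - mu^2 / (2 * sigma^2) + log(pi_k)   # discriminant scores

which.max(p_k) == which.max(delta_k)     # the two rules agree

(mu[1] + mu[2]) / 2                      # Bayes decision boundary for equal priors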
The estimates for \(\mu_{k}\) and \(\sigma^{2}\) are
\[ \begin{aligned} & \hat{\mu}_{k} = \frac{1}{n_{k}} \sum_{i:y_{i}=k}{x_{i}} \\\\ & \hat{\sigma}^{2} = \frac{1}{n - K} \sum_{k=1}^{K}\sum_{i:y_{i}=k}{(x_{i} - \hat{\mu}_{k})^{2}} \end{aligned} \]
where \(n\) is the total number of training observations, and \(n_{k}\) is the number of training observations in the \(k\)th class. The estimate for \(\mu_{k}\) is simply the average of all the training observations from the \(k\)th class, while \(\hat{\sigma}^{2}\) can be seen as a weighted average of the sample variances for each of the \(K\) classes. There will be instances in which we have knowledge of the class membership probabilities \(\pi_{1}, \ldots , \pi_{K}\), which can be used directly. However, in the absence of any additional information, LDA estimates \(\pi_{k}\) using the proportion of the training observations that belong to the \(k\)th class. That is,
\[ \begin{aligned} \hat{\pi}_{k} = \frac{n_{k}}{n} \end{aligned} \]
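These estimates are easy to compute by hand. Below is a sketch using simulated data; the class means, sample sizes, and variance used to generate the data are arbitrary choices for illustration.

set.seed(1)
n_k <- c(60, 40)                                     # hypothetical class sample sizes
y   <- rep(1:2, times = n_k)                         # class labels
x   <- rnorm(sum(n_k), mean = c(-1, 1)[y], sd = 1)   # one predictor per observation

n <- length(x)
K <- 2

pi_hat     <- as.numeric(table(y)) / n           # pi_hat_k: class proportions
mu_hat     <- tapply(x, y, mean)                 # mu_hat_k: class-specific sample means
sigma2_hat <- sum((x - mu_hat[y])^2) / (n - K)   # pooled estimate of the shared variance

pi_hat; mu_hat; sigma2_hat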
The data set we will use is the Smarket stock market data from the ISLR package. We will attempt to use Linear Discriminant Analysis to classify whether the market will go up or down using a single predictor, the column “Lag1.”
library(MASS)   # provides the lda() function
library(ISLR)   # provides the Smarket data set
data("Smarket")
summary(Smarket)
## Year Lag1 Lag2
## Min. :2001 Min. :-4.922000 Min. :-4.922000
## 1st Qu.:2002 1st Qu.:-0.639500 1st Qu.:-0.639500
## Median :2003 Median : 0.039000 Median : 0.039000
## Mean :2003 Mean : 0.003834 Mean : 0.003919
## 3rd Qu.:2004 3rd Qu.: 0.596750 3rd Qu.: 0.596750
## Max. :2005 Max. : 5.733000 Max. : 5.733000
## Lag3 Lag4 Lag5
## Min. :-4.922000 Min. :-4.922000 Min. :-4.92200
## 1st Qu.:-0.640000 1st Qu.:-0.640000 1st Qu.:-0.64000
## Median : 0.038500 Median : 0.038500 Median : 0.03850
## Mean : 0.001716 Mean : 0.001636 Mean : 0.00561
## 3rd Qu.: 0.596750 3rd Qu.: 0.596750 3rd Qu.: 0.59700
## Max. : 5.733000 Max. : 5.733000 Max. : 5.73300
## Volume Today Direction
## Min. :0.3561 Min. :-4.922000 Down:602
## 1st Qu.:1.2574 1st Qu.:-0.639500 Up :648
## Median :1.4229 Median : 0.038500
## Mean :1.4783 Mean : 0.003138
## 3rd Qu.:1.6417 3rd Qu.: 0.596750
## Max. :3.1525 Max. : 5.733000
The code for LDA with \(p = 1\) is fairly simple.
set.seed(1987)                                      # for reproducibility
train <- sample(1:nrow(Smarket), nrow(Smarket)/2)   # indices of a 50% training split
lda.fit <- lda(Direction ~ Lag1, data = Smarket, subset = train)   # fit LDA on the training data
lda.fit
## Call:
## lda(Direction ~ Lag1, data = Smarket, subset = train)
##
## Prior probabilities of groups:
## Down Up
## 0.4832 0.5168
##
## Group means:
## Lag1
## Down 0.12472185
## Up 0.02521362
##
## Coefficients of linear discriminants:
## LD1
## Lag1 0.8634778
The resulting output indicates that \(\hat{\pi}_{1} = 0.4832\) and \(\hat{\pi}_{2} = 0.5168\). That is, \(48.32\%\) of the training observations correspond to days during which the market went down, and \(51.68\%\) of the training observations correspond to days during which the market went up. The output also provides the group means of Lag1 for each class. They suggest that the previous day’s return tends to be positive whether the market subsequently goes up or down, and that it tends to be somewhat larger on days when the market goes down.
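We can check where the prior probabilities and group means come from by recomputing them directly from the training subset; these values should match the lda() output above, since lda() estimates the priors from the class proportions when none are supplied.

train.data <- Smarket[train, ]                        # the training observations only
prop.table(table(train.data$Direction))               # pi_hat_k: class proportions
tapply(train.data$Lag1, train.data$Direction, mean)   # mu_hat_k: group means of Lag1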
The coefficients of linear discriminants that are used to form the LDA decision rule are also in the output. Since the Down class has the larger mean of Lag1, a large value of \(0.8635 \times \text{Lag1}\) leads the LDA classifier to predict a market decrease, while a small value leads it to predict an increase.
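The discriminant values themselves, which equal \(0.8635 \times \text{Lag1}\) up to a centering constant, can be inspected (and plotted by class) from the fitted model:

head(predict(lda.fit)$x)   # linear discriminant values for the training observations
plot(lda.fit)              # histograms of the discriminant values for each class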
Applying a \(50\%\) threshold to the posterior probabilities allows us to predict whether the market will go up or down given the previous day’s return.
lda.predict <- predict(lda.fit, Smarket)   # predictions for the full Smarket data set
lda.class <- lda.predict$class             # predicted class ("Up" or "Down") for each day
prop.table(table(lda.class))               # proportion of days predicted in each class
## lda.class
## Down Up
## 0.1512 0.8488
The LDA classifier predicts that, given the previous day’s return, the market will increase \(84.9\%\) of the time and will decrease \(15.1\%\) of the time.
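The predicted classes above come from applying the \(50\%\) threshold to the posterior probabilities returned by predict(), so we can reproduce them, or experiment with a different cutoff, directly; the \(55\%\) cutoff below is just an arbitrary example.

head(lda.predict$posterior)                 # posterior probabilities for each class

up.50 <- ifelse(lda.predict$posterior[, "Up"] > 0.5, "Up", "Down")   # explicit 50% threshold
table(up.50)                                # should match table(lda.class)

sum(lda.predict$posterior[, "Up"] > 0.55)   # days predicted "Up" at a 55% cutoff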
We can compare the results with our original data set.
prop.table(table(Smarket$Direction))   # actual proportion of Up and Down days
##
## Down Up
## 0.4816 0.5184
Obviously, our predictions are quite poor, but that could be due to a variety of factors. Perhaps we could improve the model by adding more predictors.
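One way to quantify how poor the predictions are is to build a confusion matrix on the held-out observations (those not in train); the sketch below compares the predicted and actual market directions and computes the test-set accuracy.

Smarket.test  <- Smarket[-train, ]                # observations not used for training
lda.pred.test <- predict(lda.fit, Smarket.test)   # predictions for the held-out data

table(Predicted = lda.pred.test$class, Actual = Smarket.test$Direction)   # confusion matrix
mean(lda.pred.test$class == Smarket.test$Direction)                       # test-set accuracy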