Visit my website for more like this!

Heavily borrowed from:

Textbook: Introduction to statistical learning

Textbook: Elements of statistical learning

`require(knitr)`

`## Loading required package: knitr`

LDA provides a less direct approach to modeling the predicted probabilities given some set of predictor(s) *X*. This algorithm models the distribution of the predictors *X* separately in each of the response classes (given *Y’s*), and the uses Bayes’ theorem to flip them around into estimates. When these distributions are assumed to be normal, it turns out that the model is very similar to logistic regression.

Why do we need this method?

- When the classes are well-separated, the parameter estimates for the logistic model are surprisingly unstable. LDA does not suffer from this.
- If
*n*is small and the distribution of the predictors*X*is approximately normal in each of the classes, the LDA model is more stable than logistic. - For these reasons, and some others, LDA is the preferred method when dealing with > 2 response classes.

The LDA classifier assumes that each class comes from a normal distribution with a class-specific mean vector and a common variance. We utilize LDA to estimate the parameters so that we can leverage the Bayes classifier. The Bayes classifier is a simple and highly effective classifier that assigns each observation to the most likely class given its predictor values. The Bayes classifier has the lowest possible error rate out of all classifiers if the terms are correctly specified. Thus LDA is a classifier that attempts to approximate the Bayes classifier.

While some parameters could be specified with some prior class membership probability insight, LDA estimates some of these model parameters by simply using the proportion of the training observations that belong to the *k*th class. For p > 1 predictors, the LDA classifier assumes that the observations in the *k*th class are drawn from a *multivariate gaussian distribution* which has a class specific mean and common variance.

Once the parameters have been specified, the Bayes classifier draws *p* (number of predictors) decision boundaries. For example: for p = 3, there are three *pairs of classes* among the three classes. That is, one Bayes decision boundary separates class 1 from class 2, one separates class 1 from class 3, and one separates class 2 from class 3. The classifier then simply classifies an observation according the region in which it is located.

**Quadratic Discriminant Analysis** (QDA) holds the same assumptions as LDA except that the co variance matrix that is not common to all *K* classes.

The difference is really a bias-variance trade-off. With *p* predictors, estimating a co variance matrix requires estimating p(p+1)/2 parameters. The QDA estimates a separate co variance matrix for each class, so as the number of predictors becomes high, we experience a computational expense. Conversely, if we assume a common co variance matrix, we only have to do the computation once. LDA is a much less flexible classifier, than QDA, thus has substantially lower variance. However, if the assumption of uniform variance is highly off, then LDA can suffer high bias. In general, LDA tends to be better than QDA if there are relatively few training observations, so therefore reducing variance is crucial. QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern.

Between Logistic regression LDA and QDA, the biggest things to take into consideration are the type of decision boundary that is required. If highly linear, than LDA and Logistic may prove superior, if non-linear, the edge may be given to QDA. Though keep in mind we can do simple transformations to take non-linearity into consideration with Logistic models, similar to how we did in linear regression.

Again we will use the `Smarket`

data.

```
library(ISLR)
attach(Smarket)
# Split data into testing and training
train<-Smarket[Year<2005,]
test<-Smarket[Year==2005,]
```

The `lda()`

function is part of the `MASS`

library.

```
library(MASS)
lda.fit <- lda(Direction~Lag1 + Lag2, data=train)
lda.fit
```

```
## Call:
## lda(Direction ~ Lag1 + Lag2, data = train)
##
## Prior probabilities of groups:
## Down Up
## 0.492 0.508
##
## Group means:
## Lag1 Lag2
## Down 0.04279 0.03389
## Up -0.03955 -0.03133
##
## Coefficients of linear discriminants:
## LD1
## Lag1 -0.6420
## Lag2 -0.5135
```

From here we can see the model predicts 49.2% of days in the training data correspond to days which the market went down. We also see the *group means*; these are the average of each predictor within each class. These suggest that there is a tendency for the previous 2 days’ returns to be negative on days when the market increases, and a tendency for the previous 2 days’ returns to be positive on days when the market declines. The *cooeficients of the linear discrinimants* output provides the linear combination of `Lag1`

and `Lag2 that are used to form the LDA decision rule. We could use these, along with the predictor values to draw the linear discriminant.

The `predict()`

function in this case returns a list with three elements. `class`

, contains the LDA’s predictions about the movement of the market. `posterior`

, is a matrix whose *k*th column contains the posterior probability that the corresponding observation belongs to the *k*th class, and `x`

contains the linear discriminants.

```
lda.pred <- predict(lda.fit, test)
names(lda.pred)
```

`## [1] "class" "posterior" "x"`

```
# The results are the same as logistic regression
table(lda.pred$class, test$Direction)
```

```
##
## Down Up
## Down 35 35
## Up 76 106
```

`mean(lda.pred$class==test$Direction)`

`## [1] 0.5595`

Remember, the output of the class variable is simply created by applying a 50% posterior probability. So, if we wanted to use a different threshold, we could easily do so. For example, suppose that we wish to predict a market decrease only if we are very certain that the market will indeed decrease on that day - say, if the posterior probability is at least 90%.

`sum(lda.pred$posterior[,1]>0.9)`

`## [1] 0`

No days in 2005 meet this threshold, actually, the greatest posterior probability of decrease in all of 2005 was 52.02%, which is not too conclusive.

Again, here is how to do this using the `caret`

package

`library(caret)`

```
## Loading required package: lattice
## Loading required package: ggplot2
```

```
modelFit<- train(Direction~Lag1+Lag2, method='lda',preProcess=c('scale', 'center'), data=train)
confusionMatrix(test$Direction, predict(modelFit, test))
```

```
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 35 76
## Up 35 106
##
## Accuracy : 0.56
## 95% CI : (0.496, 0.622)
## No Information Rate : 0.722
## P-Value [Acc > NIR] : 1.000000
##
## Kappa : 0.07
## Mcnemar's Test P-Value : 0.000147
##
## Sensitivity : 0.500
## Specificity : 0.582
## Pos Pred Value : 0.315
## Neg Pred Value : 0.752
## Prevalence : 0.278
## Detection Rate : 0.139
## Detection Prevalence : 0.440
## Balanced Accuracy : 0.541
##
## 'Positive' Class : Down
##
```

The format of this code is identical to the `lda()`

function.

```
qda.fit <- qda(Direction ~ Lag1 + Lag2, data = train)
qda.fit
```

```
## Call:
## qda(Direction ~ Lag1 + Lag2, data = train)
##
## Prior probabilities of groups:
## Down Up
## 0.492 0.508
##
## Group means:
## Lag1 Lag2
## Down 0.04279 0.03389
## Up -0.03955 -0.03133
```

There are no linear discriminants since the QDA uses quadratic functions. However, the `predict()`

function works just the same.

```
qda.class <- predict(qda.fit, test)$class
table(qda.class, test$Direction)
```

```
##
## qda.class Down Up
## Down 30 20
## Up 81 121
```

`mean(qda.class == test$Direction)`

`## [1] 0.5992`

This suggests that QDA predictions are accurate ~ 60% of the time, which is pretty good for stock market data. Thus the quadratic relationship may be better suited for this problem.

Again, QDA using the `caret`

package is very simple.

```
modelFit <- train(Direction~Lag1+Lag2, method='qda', preProcess = c('scale', 'center'),data=train)
modelFit
```

```
## Quadratic Discriminant Analysis
##
## 998 samples
## 8 predictors
## 2 classes: 'Down', 'Up'
##
## Pre-processing: scaled, centered
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 998, 998, 998, 998, 998, 998, ...
##
## Resampling results
##
## Accuracy Kappa Accuracy SD Kappa SD
## 0.5 -0.004 0.03 0.04
##
##
```

`confusionMatrix(test$Direction, predict(modelFit, test))`

```
## Confusion Matrix and Statistics
##
## Reference
## Prediction Down Up
## Down 30 81
## Up 20 121
##
## Accuracy : 0.599
## 95% CI : (0.536, 0.66)
## No Information Rate : 0.802
## P-Value [Acc > NIR] : 1
##
## Kappa : 0.136
## Mcnemar's Test P-Value : 2.37e-09
##
## Sensitivity : 0.600
## Specificity : 0.599
## Pos Pred Value : 0.270
## Neg Pred Value : 0.858
## Prevalence : 0.198
## Detection Rate : 0.119
## Detection Prevalence : 0.440
## Balanced Accuracy : 0.600
##
## 'Positive' Class : Down
##
```

In this model, we can see that the Negative prediction value is 86%, which means when the market is up, we can predict with up to 86% accuracy that it is going to be up.