Data Sources:

Heavily borrowed from:

Textbook: Introduction to statistical learning
Textbook: Elements of statistical learning
Wikipedia

require(knitr)

## Loading required package: knitr

1.0 Overview

LDA provides a less direct approach to modeling the predicted probabilities given some set of predictor(s) X. This algorithm models the distribution of the predictors X separately in each of the response classes (given Y’s), and the uses Bayes’ theorem to flip them around into estimates. When these distributions are assumed to be normal, it turns out that the model is very similar to logistic regression.

Why do we need this method?

When the classes are well-separated, the parameter estimates for the logistic model are surprisingly unstable. LDA does not suffer from this.
If n is small and the distribution of the predictors X is approximately normal in each of the classes, the LDA model is more stable than logistic.
For these reasons, and some others, LDA is the preferred method when dealing with > 2 response classes.

The LDA classifier assumes that each class comes from a normal distribution with a class-specific mean vector and a common variance. We utilize LDA to estimate the parameters so that we can leverage the Bayes classifier. The Bayes classifier is a simple and highly effective classifier that assigns each observation to the most likely class given its predictor values. The Bayes classifier has the lowest possible error rate out of all classifiers if the terms are correctly specified. Thus LDA is a classifier that attempts to approximate the Bayes classifier.

While some parameters could be specified with some prior class membership probability insight, LDA estimates some of these model parameters by simply using the proportion of the training observations that belong to the kth class. For p > 1 predictors, the LDA classifier assumes that the observations in the kth class are drawn from a multivariate gaussian distribution which has a class specific mean and common variance.

Once the parameters have been specified, the Bayes classifier draws p (number of predictors) decision boundaries. For example: for p = 3, there are three pairs of classes among the three classes. That is, one Bayes decision boundary separates class 1 from class 2, one separates class 1 from class 3, and one separates class 2 from class 3. The classifier then simply classifies an observation according the region in which it is located.

Alternative Methods

Quadratic Discriminant Analysis (QDA) holds the same assumptions as LDA except that the co variance matrix that is not common to all K classes.

How to choose between them?

The difference is really a bias-variance trade-off. With p predictors, estimating a co variance matrix requires estimating p(p+1)/2 parameters. The QDA estimates a separate co variance matrix for each class, so as the number of predictors becomes high, we experience a computational expense. Conversely, if we assume a common co variance matrix, we only have to do the computation once. LDA is a much less flexible classifier, than QDA, thus has substantially lower variance. However, if the assumption of uniform variance is highly off, then LDA can suffer high bias. In general, LDA tends to be better than QDA if there are relatively few training observations, so therefore reducing variance is crucial. QDA is recommended if the training set is very large, so that the variance of the classifier is not a major concern.

Between Logistic regression LDA and QDA, the biggest things to take into consideration are the type of decision boundary that is required. If highly linear, than LDA and Logistic may prove superior, if non-linear, the edge may be given to QDA. Though keep in mind we can do simple transformations to take non-linearity into consideration with Logistic models, similar to how we did in linear regression.

Code Examples

LDA

Again we will use the Smarket data.

library(ISLR)
attach(Smarket)
# Split data into testing and training
train<-Smarket[Year<2005,]
test<-Smarket[Year==2005,]

The lda() function is part of the MASS library.

library(MASS)

lda.fit <- lda(Direction~Lag1 + Lag2, data=train)
lda.fit

## Call:
## lda(Direction ~ Lag1 + Lag2, data = train)
## 
## Prior probabilities of groups:
##  Down    Up 
## 0.492 0.508 
## 
## Group means:
##          Lag1     Lag2
## Down  0.04279  0.03389
## Up   -0.03955 -0.03133
## 
## Coefficients of linear discriminants:
##          LD1
## Lag1 -0.6420
## Lag2 -0.5135

From here we can see the model predicts 49.2% of days in the training data correspond to days which the market went down. We also see the group means; these are the average of each predictor within each class. These suggest that there is a tendency for the previous 2 days’ returns to be negative on days when the market increases, and a tendency for the previous 2 days’ returns to be positive on days when the market declines. The cooeficients of the linear discrinimants output provides the linear combination of Lag1 and `Lag2 that are used to form the LDA decision rule. We could use these, along with the predictor values to draw the linear discriminant.

The predict() function in this case returns a list with three elements. class, contains the LDA’s predictions about the movement of the market. posterior, is a matrix whose kth column contains the posterior probability that the corresponding observation belongs to the kth class, and x contains the linear discriminants.

lda.pred <- predict(lda.fit, test)
names(lda.pred)

## [1] "class"     "posterior" "x"

# The results are the same as logistic regression
table(lda.pred$class, test$Direction)

##       
##        Down  Up
##   Down   35  35
##   Up     76 106

mean(lda.pred$class==test$Direction)

## [1] 0.5595

Remember, the output of the class variable is simply created by applying a 50% posterior probability. So, if we wanted to use a different threshold, we could easily do so. For example, suppose that we wish to predict a market decrease only if we are very certain that the market will indeed decrease on that day - say, if the posterior probability is at least 90%.

sum(lda.pred$posterior[,1]>0.9)

## [1] 0

No days in 2005 meet this threshold, actually, the greatest posterior probability of decrease in all of 2005 was 52.02%, which is not too conclusive.

Again, here is how to do this using the caret package

library(caret)

## Loading required package: lattice
## Loading required package: ggplot2

modelFit<- train(Direction~Lag1+Lag2, method='lda',preProcess=c('scale', 'center'), data=train)

confusionMatrix(test$Direction, predict(modelFit, test))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down  Up
##       Down   35  76
##       Up     35 106
##                                         
##                Accuracy : 0.56          
##                  95% CI : (0.496, 0.622)
##     No Information Rate : 0.722         
##     P-Value [Acc > NIR] : 1.000000      
##                                         
##                   Kappa : 0.07          
##  Mcnemar's Test P-Value : 0.000147      
##                                         
##             Sensitivity : 0.500         
##             Specificity : 0.582         
##          Pos Pred Value : 0.315         
##          Neg Pred Value : 0.752         
##              Prevalence : 0.278         
##          Detection Rate : 0.139         
##    Detection Prevalence : 0.440         
##       Balanced Accuracy : 0.541         
##                                         
##        'Positive' Class : Down          
##

QDA

The format of this code is identical to the lda() function.

qda.fit <- qda(Direction ~ Lag1 + Lag2, data = train)
qda.fit

## Call:
## qda(Direction ~ Lag1 + Lag2, data = train)
## 
## Prior probabilities of groups:
##  Down    Up 
## 0.492 0.508 
## 
## Group means:
##          Lag1     Lag2
## Down  0.04279  0.03389
## Up   -0.03955 -0.03133

There are no linear discriminants since the QDA uses quadratic functions. However, the predict() function works just the same.

qda.class <- predict(qda.fit, test)$class
table(qda.class, test$Direction)

##          
## qda.class Down  Up
##      Down   30  20
##      Up     81 121

mean(qda.class == test$Direction)

## [1] 0.5992

This suggests that QDA predictions are accurate ~ 60% of the time, which is pretty good for stock market data. Thus the quadratic relationship may be better suited for this problem.

Again, QDA using the caret package is very simple.

modelFit <- train(Direction~Lag1+Lag2, method='qda', preProcess = c('scale', 'center'),data=train)

modelFit

## Quadratic Discriminant Analysis 
## 
## 998 samples
##   8 predictors
##   2 classes: 'Down', 'Up' 
## 
## Pre-processing: scaled, centered 
## Resampling: Bootstrapped (25 reps) 
## 
## Summary of sample sizes: 998, 998, 998, 998, 998, 998, ... 
## 
## Resampling results
## 
##   Accuracy  Kappa   Accuracy SD  Kappa SD
##   0.5       -0.004  0.03         0.04    
## 
##

confusionMatrix(test$Direction, predict(modelFit, test))

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Down  Up
##       Down   30  81
##       Up     20 121
##                                        
##                Accuracy : 0.599        
##                  95% CI : (0.536, 0.66)
##     No Information Rate : 0.802        
##     P-Value [Acc > NIR] : 1            
##                                        
##                   Kappa : 0.136        
##  Mcnemar's Test P-Value : 2.37e-09     
##                                        
##             Sensitivity : 0.600        
##             Specificity : 0.599        
##          Pos Pred Value : 0.270        
##          Neg Pred Value : 0.858        
##              Prevalence : 0.198        
##          Detection Rate : 0.119        
##    Detection Prevalence : 0.440        
##       Balanced Accuracy : 0.600        
##                                        
##        'Positive' Class : Down         
##

In this model, we can see that the Negative prediction value is 86%, which means when the market is up, we can predict with up to 86% accuracy that it is going to be up.

My website link

Classification: Linear Discriminant Analysis

Ryan Kelly

June 23, 2014

Data Sources:

1.0 Overview

Alternative Methods

How to choose between them?

Code Examples

LDA

QDA