Some folks confuse LDA (linear discriminant analysis) with PCA (principal component analysis) and think they are the same thing. Here is a small demonstration of how to perform LDA properly. Remember: LDA is a supervised method that builds a model to predict a dependent variable, while PCA is unsupervised. So when you use LDA you need to set aside a test data set, so that later you can evaluate the performance of the LDA model, or perform proper cross-validation.

In this example I will use randomly generated data to show you the difference between the supervised LDA and the unsupervised PCA.

require(MASS)
## Loading required package: MASS
set.seed(1999)
# 100 rows (samples) by 100 columns (variables) of uniform noise
data <- matrix(runif(10000, 0, 10), nrow = 100)

First, let’s do PCA and plot out the PC scores. Since the data are randomly generated, the plot should show no groups at all.

data.pca <- prcomp(data)
pairs(data.pca$x[,1:5])

[Figure: pairs plot of the first five PC scores of the random data; no visible grouping]

You can see that this is indeed the case. Now let’s assign the rows to arbitrary groups and perform LDA the wrong way: on the whole data set, with the group id as the dependent variable.

# assign arbitrary group labels (1-4) to the rows of the random data
data.grp <- transform(data.frame(data), grp = as.factor(rep(1:4, 25)))
data.lda <- lda(grp ~ ., data.grp)
plot(data.lda)

[Figure: LDA score plot for the model fitted on the whole random data set]

Well, it looks like you can classify them into different groups nicely! However, this is misleading, because you should follow these steps instead. First, split the data set into a training set and a test set:

# randomly pick 50 of the 100 rows for training; the rest form the test set
train <- sample(1:100, 50)
data.train <- data.grp[train, ]
data.test <- data.grp[-train, ]

You train the LDA model using the training set:

odd.lda <- lda(grp ~ ., data.train)

Then you validate the model using the test data set. We can see how accurate the model actually is by computing the percentage of correct predictions:

pred.train <- predict(odd.lda, data.train)$class
pred.test <- predict(odd.lda, data.test)$class
#accuracy on training data
mean(pred.train == data.train$grp)
## [1] 0.8
#accuracy on test data
mean(pred.test == data.test$grp)
## [1] 0.36

Now you see the point: the model scored 80% on the training data but only 36% on the test data, not far above the 25% you would expect from guessing among four groups. So you can correctly reject the LDA model fitted to this random data set.
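Instead of a single train/test split, you can also use cross-validation, as mentioned at the beginning. A minimal sketch: MASS::lda has built-in leave-one-out cross-validation via its CV argument, and on this random data the cross-validated accuracy should hover near the 25% chance level (this check is my addition, not part of the original run):

# leave-one-out cross-validation: CV = TRUE returns the held-out predictions
cv.lda <- lda(grp ~ ., data.grp, CV = TRUE)
# cross-validated accuracy; expect roughly chance level (0.25) on random data
mean(cv.lda$class == data.grp$grp)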

Now what if you have a data set that really can be classified? Let’s try randomly generated data where the four groups have distinct means (2, 4, 6, 8) and standard deviations (1, 2, 3, 4).

set.seed(1999)
# group ids 1-4; group x gets mean 2*x and standard deviation x
grp <- rep(c(1, 2, 3, 4), 25)
mat <- numeric(0)
for (x in grp) {
  mat <- rbind(mat, rnorm(100, x * 2, x))
}

Not surprisingly, PCA picks these groups out immediately in the first two components (the points are coloured by group to make the separation clear).

mat.pca <- prcomp(mat)
# colour the points by group to make the separation visible
pairs(mat.pca$x[, 1:5], col = grp)

[Figure: pairs plot of the first five PC scores of mat, coloured by group]

Let’s try the LDA again, this time with the correct training and testing procedure.

# set up the data set by adding the grp variable
mat.grp <- data.frame(cbind(grp, mat))
names(mat.grp) <- c("grp", paste("X", 1:100, sep = ""))
# set up the training and test data sets
train <- sample(1:100, 50)
mat.train <- mat.grp[train, ]
mat.test <- mat.grp[-train, ]
# run LDA on the training set and plot
mat.lda <- lda(grp ~ ., mat.train)
## Warning: variables are collinear
plot(mat.lda)

[Figure: LDA score plot for the model fitted on the training set]

Again, a nice separation. Let’s evaluate the accuracy of the model.

pred.mat.train <- predict(mat.lda, mat.train)$class
pred.mat.test <- predict(mat.lda, mat.test)$class
#accuracy on training data
mean(pred.mat.train == mat.train$grp)
## [1] 1
#accuracy on test data
mean(pred.mat.test == mat.test$grp)
## [1] 0.84

The accuracy on the test data set is high. Pretty good!
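To see where the remaining test errors fall, a confusion matrix of predicted versus actual groups is handy (this table is my addition, so the exact counts will depend on your train/test split):

# confusion matrix: rows are predicted groups, columns are actual groups
table(predicted = pred.mat.test, actual = mat.test$grp)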

We can take this further and demonstrate a so-called PC-LDA method: instead of the raw variables, use the first two principal components as the input to LDA.

mat.pc <- data.frame(cbind(grp, mat.pca$x[, 1:2]))
mat.pc.train <- mat.pc[train, ]
mat.pc.test <- mat.pc[-train, ]

mat.pc.lda <- lda(grp ~ ., mat.pc.train)
plot(mat.pc.lda)

[Figure: LDA score plot for the PC-LDA model]

Wow, doesn’t it look good?

pred.mat.pc.train <- predict(mat.pc.lda, mat.pc.train)$class
pred.mat.pc.test <- predict(mat.pc.lda, mat.pc.test)$class
#accuracy on training data
mean(pred.mat.pc.train == mat.pc.train$grp)
## [1] 1
#accuracy on test data
mean(pred.mat.pc.test == mat.pc.test$grp)
## [1] 1

You get even higher accuracy because the PCA step filtered out some of the noise before the LDA. This is sometimes useful. LDA assumes the samples within each group/class follow a Gaussian distribution, which holds in this example because the data were drawn from normal distributions.
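As a quick sanity check (my addition, not part of the original analysis), you can look at how much of the total variance the leading components actually capture via the importance table from summary():

# proportion of variance explained by the first five principal components
summary(mat.pca)$importance[, 1:5]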

These classification methods stop working when the data are too noisy. For example, I can make the standard deviation of each group larger than its mean (1.5 times the mean) and redo everything, changing the group means to 9, 8, 3, and 4.

set.seed(1999)
# group means 9, 8, 3, 4 with standard deviations 1.5 times the mean
grp <- rep(c(9, 8, 3, 4), 25)
noisy <- numeric(0)
for (x in grp) {
  noisy <- rbind(noisy, rnorm(100, x, 1.5 * x))
}

Of course PCA performs “poorly” here, because the groups smear together.

noisy.pca <- prcomp(noisy)
pairs(noisy.pca$x[, 1:5], col = as.integer(as.factor(grp)))

[Figure: pairs plot of the first five PC scores of the noisy data, coloured by group]

However, since we have “some” (actually a lot of!) a priori knowledge about the data, we can use LDA to play a trick, a bit of cheating really, and try to classify the data “correctly”.

noisy.pc <- data.frame(cbind(grp, noisy.pca$x[, 1:2]))
noisy.pc.train <- noisy.pc[train, ]
noisy.pc.test <- noisy.pc[-train, ]
noisy.pc.lda <- lda(grp ~ ., noisy.pc.train)
plot(noisy.pc.lda)

[Figure: LDA score plot for the PC-LDA model on the noisy data]

A simple validation tells us that this came at a cost; we are not so lucky this time.

pred.noisy.pc.train <- predict(noisy.pc.lda, noisy.pc.train)$class
pred.noisy.pc.test <- predict(noisy.pc.lda, noisy.pc.test)$class
#accuracy on training data
mean(pred.noisy.pc.train == noisy.pc.train$grp)
## [1] 0.76
#accuracy on test data
mean(pred.noisy.pc.test == noisy.pc.test$grp)
## [1] 0.54

The prediction accuracy (54%) is only slightly better than flipping a coin (although, to be fair, random guessing among four groups would give only 25%). Looking at the PCA plot, though, I could actually separate 9, 8 and 3, 4 into two coarser groups. With that two-group labeling, LDA would produce a much better prediction model, as the sketch below shows.
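Here is a rough sketch of that idea, merging 9/8 into one group and 3/4 into another (the grp2 name and the cutoff at 8 are my own choices, not from the original post); the two-group test accuracy should come out substantially higher:

# merge the four noisy groups into two coarser groups by mean size
grp2 <- as.factor(ifelse(grp >= 8, "big", "small"))
noisy.pc2 <- data.frame(grp2, noisy.pca$x[, 1:2])
noisy.pc2.train <- noisy.pc2[train, ]
noisy.pc2.test <- noisy.pc2[-train, ]
noisy.pc2.lda <- lda(grp2 ~ ., noisy.pc2.train)
# test accuracy for the two-group model
mean(predict(noisy.pc2.lda, noisy.pc2.test)$class == noisy.pc2.test$grp2)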