LDA Practice

Analyzing the dataset

library(MASS)
table(iris$Species)

## 
##     setosa versicolor  virginica 
##         50         50         50

# Create a lookup table that maps each species name to a color
lookup <- c(setosa='blue', versicolor='green', virginica='orange')

# Look up the color of each observation by species name; as.character() makes
# the indexing use the names rather than the factor's integer codes
col.ind <- lookup[as.character(iris$Species)]

# Draw a scatterplot matrix of the four measurements, filling each point with its species color from col.ind
pairs(iris[-5], pch=21, col="gray", bg=col.ind)
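
One detail of the lookup step is worth illustrating: when a named vector is indexed with a factor, R uses the factor's integer codes, not its labels. The sketch below uses a small hypothetical factor f, whose level order is deliberately different from the order of the lookup table, to show how the two ways of indexing can disagree.

```r
# Lookup table from species name to color
lookup <- c(setosa = "blue", versicolor = "green", virginica = "orange")

# Hypothetical factor whose levels are NOT in the same order as the lookup table
f <- factor(c("virginica", "setosa"),
            levels = c("virginica", "setosa", "versicolor"))

# Indexing by factor uses the integer codes (1, 2), i.e. positions in lookup
by_code <- unname(lookup[f])

# Indexing by character uses the names, which is what we actually want
by_name <- unname(lookup[as.character(f)])

by_code  # "blue" "green"
by_name  # "orange" "blue"
```

For iris the level order (setosa, versicolor, virginica) happens to match the order of the lookup table, so both forms give the same colors there.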

Linear Discriminant Analysis (LDA)

lda.fit <- lda(Species ~ ., data = iris)
lda.fit

## Call:
## lda(Species ~ ., data = iris)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3333333  0.3333333  0.3333333 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa            5.006       3.428        1.462       0.246
## versicolor        5.936       2.770        4.260       1.326
## virginica         6.588       2.974        5.552       2.026
## 
## Coefficients of linear discriminants:
##                     LD1         LD2
## Sepal.Length  0.8293776 -0.02410215
## Sepal.Width   1.5344731 -2.16452123
## Petal.Length -2.2012117  0.93192121
## Petal.Width  -2.8104603 -2.83918785
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9912 0.0088

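The first line fits an LDA model to the iris dataset with the lda() function from the MASS package. The formula Species ~ . asks the model to predict the species (the Species column) from all other columns in the dataset. The output shows that the prior probabilities of the three groups (setosa, versicolor, and virginica) are all equal to 1/3; by default, lda() estimates them from the class proportions in the data, and each species has 50 observations. The proportion of trace shows that the first discriminant (LD1) accounts for 99.12% of the between-group variance, so almost all of the separation between species happens along LD1.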
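
As a quick check of the printed output, the priors and the proportion of trace can be recomputed from the fitted object. This is a minimal sketch: the names class.prop and prop.trace are ours, and it assumes that the svd component of the fit holds the singular values whose squares give the proportion of trace, as described in the MASS documentation.

```r
library(MASS)

lda.fit <- lda(Species ~ ., data = iris)

# Default priors: the observed class proportions (50/150 for each species)
class.prop <- table(iris$Species) / nrow(iris)

# Proportion of trace: share of each squared singular value in the total
prop.trace <- lda.fit$svd^2 / sum(lda.fit$svd^2)

class.prop
round(prop.trace, 4)
```

The rounded values should agree with the "Proportion of trace" block printed above.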

Visualizing the results

We can plot the sepal width against the sepal length and overlay the group centroids, which the LDA model stores as the group means in lda.fit$means.
Plotting the centroids on a scatterplot of the input variables shows how far apart the groups lie in those variables. Well-separated centroids suggest that the groups are easy to tell apart, but this plot uses only two of the four variables and ignores the spread within each group, so it gives only a partial picture of how well LDA separates them.

#Draw the scatterplot
plot(Sepal.Width ~ Sepal.Length, data = iris, pch=21, col="gray", bg= col.ind)

#Draw the centroids
points(lda.fit$means[,1], lda.fit$means[,2], pch=21, cex=2,
       col="black", bg=lookup)
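
The centroids drawn above are simply the per-species means of each variable. The following sketch recomputes them with aggregate() and compares them with lda.fit$means; the names centroids, manual.means, and max.diff are introduced here only for illustration.

```r
library(MASS)

lda.fit <- lda(Species ~ ., data = iris)

# Mean of every measurement within each species
centroids <- aggregate(. ~ Species, data = iris, FUN = mean)

# Convert to a matrix laid out like lda.fit$means (species in rows)
manual.means <- as.matrix(centroids[, -1])
rownames(manual.means) <- as.character(centroids$Species)

# Largest absolute difference between the two sets of means
max.diff <- max(abs(manual.means - lda.fit$means))
max.diff
```

A difference at the level of floating-point rounding confirms that the two agree.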

Making predictions

lda.pred <- predict(lda.fit)
head(lda.pred$x)

##        LD1        LD2
## 1 8.061800 -0.3004206
## 2 7.128688  0.7866604
## 3 7.489828  0.2653845
## 4 6.813201  0.6706311
## 5 8.132309 -0.5144625
## 6 7.701947 -1.4617210

Called without new data, predict() applies the model to the training data and returns the predicted class labels (class), the posterior probabilities (posterior), and the discriminant function scores (x). With head(lda.pred$x), we display the first few rows of the x component, which holds the score of each observation on LD1 and LD2.
We can now plot these scores to see how well the LDA model separates the groups.

plot(LD2 ~ LD1, data = lda.pred$x, pch=21, col="gray", bg=col.ind)
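
To make the meaning of the scores concrete, they can be reproduced by hand: center the measurements at the prior-weighted average of the group means, then multiply by the coefficients of linear discriminants. This is a sketch of what predict() appears to do for a default lda fit, not a replacement for it; X, center, and manual.scores are names introduced here.

```r
library(MASS)

lda.fit  <- lda(Species ~ ., data = iris)
lda.pred <- predict(lda.fit)

# The four measurements as a numeric matrix
X <- as.matrix(iris[, 1:4])

# Prior-weighted average of the group means (one value per variable)
center <- colSums(lda.fit$prior * lda.fit$means)

# Center the data, then project onto the discriminant coefficients
manual.scores <- scale(X, center = center, scale = FALSE) %*% lda.fit$scaling

# Largest absolute difference from the scores returned by predict()
max(abs(manual.scores - lda.pred$x))
```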

Assessing the quality of the prediction

To assess the quality of the prediction, we can use a confusion matrix and compute the error rate.

table(pred=lda.pred$class, true=iris$Species)

##             true
## pred         setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48         1
##   virginica       0          2        49

The code table(pred=lda.pred$class, true=iris$Species) generates a confusion matrix with the predicted class labels (pred) on the rows and the true class labels (true) on the columns. Each cell counts the observations with a given predicted class and a given true class, so correct predictions lie on the diagonal. Here all 50 setosa flowers are classified correctly, 2 versicolor flowers are predicted as virginica, and 1 virginica flower is predicted as versicolor.
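
The confusion matrix can also be summarized per species. As a small sketch (the names cm and recall are ours), dividing the diagonal by the column totals gives the fraction of each true class that was classified correctly.

```r
library(MASS)

lda.fit  <- lda(Species ~ ., data = iris)
lda.pred <- predict(lda.fit)

# Confusion matrix: predicted classes in rows, true classes in columns
cm <- table(pred = lda.pred$class, true = iris$Species)

# Correct predictions per class divided by the number of true members
recall <- diag(cm) / colSums(cm)
recall  # setosa 1.00, versicolor 0.96, virginica 0.98
```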

1 - mean(lda.pred$class == iris$Species)

## [1] 0.02

The code 1 - mean(lda.pred$class == iris$Species) calculates the misclassification rate of the LDA model, which is the proportion of observations that the model misclassified. In this case, the misclassification rate is 0.02, or 2% (3 of the 150 observations), so the model predicts the correct species for almost all observations in the input data. Because the rate is computed on the same data that was used to fit the model, it is likely to underestimate the error on new data.
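
One way to get a less optimistic estimate is leave-one-out cross-validation, which lda() supports through its CV argument. This is an optional sketch rather than part of the original analysis; lda.cv and cv.error are names introduced here. With CV = TRUE, lda() returns the cross-validated class predictions and posterior probabilities instead of a fitted model.

```r
library(MASS)

# Leave-one-out cross-validation: each flower is predicted by a model
# fitted to the other 149 flowers
lda.cv <- lda(Species ~ ., data = iris, CV = TRUE)

# Cross-validated misclassification rate
cv.error <- 1 - mean(lda.cv$class == iris$Species)
cv.error
```

A cross-validated rate close to the training error rate would suggest that the model is not overfitting this dataset much.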