library(MASS)
table(iris$Species)
##
##     setosa versicolor  virginica
##         50         50         50
#Create a lookup table that maps each species name to a color
lookup <- c(setosa='blue', versicolor='green', virginica='orange')
#Use the lookup table to build a vector holding the color for each observation;
#as.character() makes the lookup match on species names rather than factor codes
col.ind <- lookup[as.character(iris$Species)]
#Draw a scatterplot matrix of the four measurements, filling each point with its species color
pairs(iris[-5], pch=21, col="gray", bg=col.ind)
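A note on the lookup step: indexing a named vector with a factor uses the factor's integer codes, not its labels, so the colors line up only when the vector is written in the same order as the factor levels. The following minimal sketch illustrates the difference; the names `pal` and `sp` are hypothetical and chosen only for this example.

```r
# Hypothetical palette whose order differs from the factor's level order
pal <- c(virginica = "orange", setosa = "blue", versicolor = "green")
sp  <- factor(c("setosa", "virginica", "versicolor"))

# Factor index: uses the codes 1, 3, 2, i.e. positions in pal
by_code <- pal[sp]
# Character index: matches on the names of pal
by_name <- pal[as.character(sp)]

print(unname(by_code))  # "orange" "green" "blue"
print(unname(by_name))  # "blue" "orange" "green"
```

Converting the factor with as.character() keeps the mapping correct regardless of the order in which the colors are listed.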
lda.fit <- lda(Species ~ ., data = iris)
lda.fit
## Call:
## lda(Species ~ ., data = iris)
##
## Prior probabilities of groups:
##     setosa versicolor  virginica
##  0.3333333  0.3333333  0.3333333
##
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa            5.006       3.428        1.462       0.246
## versicolor        5.936       2.770        4.260       1.326
## virginica         6.588       2.974        5.552       2.026
##
## Coefficients of linear discriminants:
##                     LD1         LD2
## Sepal.Length  0.8293776 -0.02410215
## Sepal.Width   1.5344731 -2.16452123
## Petal.Length -2.2012117  0.93192121
## Petal.Width  -2.8104603 -2.83918785
##
## Proportion of trace:
##    LD1    LD2
## 0.9912 0.0088
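The call lda(Species ~ ., data = iris) fits an LDA model to the iris dataset.
The model is trained to predict the species (the Species column) from the
other columns in the dataset (the . in the formula stands for all remaining
columns). The output shows that the prior probabilities of the three groups
(setosa, versicolor, and virginica) are all equal to 1/3, because each
species contributes 50 of the 150 observations. The proportion of trace
shows that LD1 accounts for 99.12% of the between-group variance, so nearly
all of the separation lies along the first discriminant.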
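The priors and the proportion of trace can be recomputed from the fitted object. The sketch below assumes the default behavior of lda(), where the priors are the observed class proportions, and uses the svd component of the fit; the variable names class.prop and prop.trace are introduced here for illustration.

```r
library(MASS)

lda.fit <- lda(Species ~ ., data = iris)

# With no prior argument, the priors are the observed class proportions
class.prop <- table(iris$Species) / nrow(iris)
print(class.prop)

# Proportion of trace: squared singular values, normalized to sum to 1
prop.trace <- lda.fit$svd^2 / sum(lda.fit$svd^2)
print(round(prop.trace, 4))
```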
We can plot the sepal width against the sepal length and draw the group
centroids, which the fitted model stores in lda.fit$means, on the plot.
Plotting the centroids on a scatterplot of the input variables helps us
see how far apart the groups lie on those variables. Well-separated
centroids with little overlap between the surrounding points suggest the
groups will be easy to discriminate. Only two of the four predictors are
shown here, so the picture is partial.
#Draw the scatterplot
plot(Sepal.Width ~ Sepal.Length, data = iris, pch=21, col="gray", bg= col.ind)
#Draw the centroids
points(lda.fit$means[,1], lda.fit$means[,2], pch=21, cex=2,
col="black", bg=lookup)
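The centroids drawn here are simply the per-species means of each predictor. As a quick check, the following sketch recomputes them directly from the data and compares them with lda.fit$means; manual.means is a name introduced only for this example.

```r
library(MASS)

lda.fit <- lda(Species ~ ., data = iris)

# Column means of the four measurements within each species
manual.means <- t(sapply(split(iris[-5], iris$Species), colMeans))
print(manual.means)

# Compare with the centroids stored in the fitted model
print(all.equal(manual.means, lda.fit$means, check.attributes = FALSE))
```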
lda.pred <- predict(lda.fit)
head(lda.pred$x)
##        LD1        LD2
## 1 8.061800 -0.3004206
## 2 7.128688  0.7866604
## 3 7.489828  0.2653845
## 4 6.813201  0.6706311
## 5 8.132309 -0.5144625
## 6 7.701947 -1.4617210
The function predict() applies the model to the input data (with no newdata
argument, the data used for fitting) and returns the predicted class labels,
the posterior probabilities, and the discriminant function scores. With
head(lda.pred$x), we display the first few rows of the x component of the
lda.pred object, which contains the discriminant function scores.
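To make the scores less of a black box, they can be reproduced by hand: center the predictors at the prior-weighted mean of the group centroids and multiply by the coefficient matrix in lda.fit$scaling. This is a sketch of that calculation under the default priors, not a replacement for predict(); X, center, and scores are names introduced for the example.

```r
library(MASS)

lda.fit  <- lda(Species ~ ., data = iris)
lda.pred <- predict(lda.fit)

# Components returned by predict()
print(names(lda.pred))

# Center the predictors at the prior-weighted mean of the centroids,
# then project onto the linear discriminants
X      <- as.matrix(iris[, 1:4])
center <- colSums(lda.fit$prior * lda.fit$means)
scores <- scale(X, center = center, scale = FALSE) %*% lda.fit$scaling

# Largest absolute difference from the scores returned by predict()
print(max(abs(scores - lda.pred$x)))
```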
Now, we use a scatterplot to help us visualize how well the LDA model is
able to separate the different groups in the input data.
plot(LD2 ~ LD1, data = lda.pred$x, pch=21, col="gray", bg=col.ind)
To assess the quality of the predictions, we can use a confusion matrix and compute the error rate.
table(pred=lda.pred$class, true=iris$Species)
##             true
## pred         setosa versicolor virginica
##   setosa         50          0         0
##   versicolor      0         48         1
##   virginica       0          2        49
The code table(pred=lda.pred$class, true=iris$Species) generates a confusion
matrix with the predicted class labels (pred) on the rows and the true class
labels (true) on the columns. Each cell counts the observations with that
combination of predicted and true class, so the diagonal holds the correct
predictions. Here all 50 setosa flowers are classified correctly, 2
versicolor flowers are predicted as virginica, and 1 virginica flower is
predicted as versicolor.
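Per-class summaries can be read off the same table. The sketch below computes, for each species, the share of true members that were predicted correctly (recall) and the share of predictions that were correct (precision); cm, recall, and precision are names introduced for this example.

```r
library(MASS)

lda.fit  <- lda(Species ~ ., data = iris)
lda.pred <- predict(lda.fit)

cm <- table(pred = lda.pred$class, true = iris$Species)

# Columns hold the true classes, so column sums are the class sizes
recall <- diag(cm) / colSums(cm)
# Rows hold the predicted classes
precision <- diag(cm) / rowSums(cm)

print(round(recall, 3))
print(round(precision, 3))
```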
1 - mean(lda.pred$class == iris$Species)
## [1] 0.02
The code 1 - mean(lda.pred$class == iris$Species) calculates the
misclassification rate of the LDA model, which is the proportion of
observations that were misclassified. In this case the rate is 0.02, or 2%
(3 of 150 observations). Because the predictions were made on the same data
used to fit the model, this is a training error rate and is likely an
optimistic estimate of performance on new data.
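One way to get a less optimistic estimate, shown here as a sketch rather than as part of the original analysis, is leave-one-out cross-validation. The lda() function in MASS supports this through its CV argument, in which case it returns the held-out class predictions and posterior probabilities instead of a fitted model object; lda.cv and cv.error are names introduced for the example.

```r
library(MASS)

# With CV = TRUE, each observation is classified by a model fit without it
lda.cv <- lda(Species ~ ., data = iris, CV = TRUE)

# Confusion matrix and error rate based on the held-out predictions
print(table(pred = lda.cv$class, true = iris$Species))
cv.error <- mean(lda.cv$class != iris$Species)
print(cv.error)
```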