library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(MASS)
## 
## Attaching package: 'MASS'
## 
## The following object is masked from 'package:dplyr':
## 
##     select
library(klaR)
## Warning: package 'klaR' was built under R version 4.4.3
set.seed(101)
sample_n(iris, 10)
##    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
## 1           6.3         2.5          4.9         1.5 versicolor
## 2           6.3         3.3          4.7         1.6 versicolor
## 3           5.6         2.7          4.2         1.3 versicolor
## 4           6.5         3.0          5.2         2.0  virginica
## 5           5.0         2.0          3.5         1.0 versicolor
## 6           6.6         2.9          4.6         1.3 versicolor
## 7           5.1         2.5          3.0         1.1 versicolor
## 8           6.1         3.0          4.9         1.8  virginica
## 9           7.4         2.8          6.1         1.9  virginica
## 10          5.4         3.4          1.5         0.4     setosa

Preparing the Data

After confirming the data set is available and the packages loaded, there is one last step to prepare our data for analysis. That is to separate the data into two subsets: a training set and a testing set. We will use the training set to build our predictive model and then we will use our testing set to evaluate the accuracy of that model. For convenience sake we will use a 60/40 split, using 60% of the data as the training set and the remaining 40% for the testing set.

training_sample <- sample(c(TRUE, FALSE), nrow(iris), replace = T, prob = c(0.6,0.4))
train <- iris[training_sample, ]
test <- iris[!training_sample, ]

LDA

LDA looks for linear combinations of the independent variables to best explain the data and predict the different classes. Discriminant scores are calculated for each observation for each class based on these linear combinations.

Applying LDA

Now let’s try using the iris training data set to create a model to predict the species. The lda function from the MASS package does most of the leg work for us. Try using the below code and lets analyze the outputs.

lda.iris <- lda(Species ~ ., train)
lda.iris
## Call:
## lda(Species ~ ., data = train)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3095238  0.3214286  0.3690476 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa         5.019231    3.396154     1.453846   0.2423077
## versicolor     5.948148    2.814815     4.381481   1.3925926
## virginica      6.641935    2.912903     5.609677   1.9645161
## 
## Coefficients of linear discriminants:
##                     LD1       LD2
## Sepal.Length  1.2366962 -1.082723
## Sepal.Width   0.8121037 -1.397630
## Petal.Length -2.3838963  1.172088
## Petal.Width  -2.6747723 -2.295058
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9963 0.0037
plot(lda.iris, col = as.integer(train$Species))

As you can see, there are three distinct groups with some overlap between virginica and versicolor. Plotting again, but adding the code dimen = 1 will only plot in one dimension (LD1). Think of it as a projection of the data onto LD1 with a histogram of that data.

plot(lda.iris, dimen = 1, type = "b")

These plots illustrate the separation between groups as well as overlapping areas that are potential for mix-ups when predicting classes.

LDA Partition Plots

Using the partimat function from the klaR package provides an alternate way to plot the linear discriminant functions. partimat outouts an array of plots for every combination of two variables. Think of each plot as a different view of the same data. Colored regions delineate each classification area. Any observation that falls within a region is predicted to be from a specific class. Each plot also includes the apparent error rate for that view of the data.

partimat(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=train, method="lda")

LDA Predictions

Next let’s evaluate the prediction accuracy of our model. First we’ll run the model against the training set used to verify the model fits the data properly by using the command predict. The table output is a confusion matrix with the actual species as the row labels and the predicted species at the column labels.

lda.train <- predict(lda.iris)
train$lda <- lda.train$class
table(train$lda,train$Species)
##             
##              setosa versicolor virginica
##   setosa         26          0         0
##   versicolor      0         25         1
##   virginica       0          2        30

The total number of correctly predicted observations is the sum of the diagonal. So this model fit the training data correctly for almost every observation. Verifying the training set doesn’t prove accuracy, but a poor fit to the training data could be a sign that the model isn’t a good one.

Now let’s run our test set against this model to determine its accuracy.

lda.test <- predict(lda.iris,test)
test$lda <- lda.test$class
table(test$lda,test$Species)
##             
##              setosa versicolor virginica
##   setosa         24          0         0
##   versicolor      0         23         0
##   virginica       0          0        19

Based on the earlier plots, it makes sense that a few iris versicolor and iris virginica observations may be miscategorized. Overall the model performs very well with the testing set with an accuracy of 98.5%.

Quadratic Discriminant Analysis

QDA, like LDA, assumes the independent variables are normally distributed (multivariate normality). Unlike LDA, QDA does not assume equal covariance between the classes. QDA does not find linear combinations of the independent variables, but finds a quadratic function of the independent variables.

Applying QDA

The function qda from the MASS package is used. The output looks very similar to the lda output with the same prior probabilities and class means, but there are no linear discriminants calculated with QDA.

qda.iris <- qda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, train)
qda.iris 
## Call:
## qda(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, 
##     data = train)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3095238  0.3214286  0.3690476 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa         5.019231    3.396154     1.453846   0.2423077
## versicolor     5.948148    2.814815     4.381481   1.3925926
## virginica      6.641935    2.912903     5.609677   1.9645161

QDA Partition Plots

Using the partimat function again provides a way to plot the quadratic discriminant functions. The only difference in code from the LDA section above is replacing method=“lda” with method=“qda”. These plots provide a good visual of the difference between the linear functions used in LDA and the quadratic functions used in QDA. Again, colored regions delineate each classification area. Any observation that falls within a region is predicted to be from a specific class. Each plot also includes the apparent error rate for that view of the data.

partimat(Species ~ Sepal.Length + Sepal.Width + Petal.Length + Petal.Width, data=train, method="qda")

QDA Predictions

The predict function works exactly the same way as before and can be used to create a confusion matrix for the training data.

qda.train <- predict(qda.iris)
train$qda <- qda.train$class
table(train$qda,train$Species)
##             
##              setosa versicolor virginica
##   setosa         26          0         0
##   versicolor      0         25         0
##   virginica       0          2        31

This model fit the training data very well. Again verifying the training set does not prove accuracy, but a poor fit could be a sign that the model isn’t a good one. Next we’ll generate the same confusion matrix with the test data.

qda.test <- predict(qda.iris,test)
test$qda <- qda.test$class
table(test$qda,test$Species)
##             
##              setosa versicolor virginica
##   setosa         24          0         0
##   versicolor      0         23         0
##   virginica       0          0        19

The model correctly predicted the class of observations 97% of the time. This is slightly less accurate than the LDA model which is not always the case, but overall still quite impressive.