Linear Discriminant Analysis

Installing ‘ggord’ package

# Enable the r-universe repo
options(repos = c(
    fawda123 = 'https://fawda123.r-universe.dev',
    CRAN = 'https://cloud.r-project.org'))

# Install ggord
install.packages('ggord')

## Installing package into 'C:/Users/USER1/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)

## package 'ggord' successfully unpacked and MD5 sums checked
## 
## The downloaded binary packages are in
##  C:\Users\USER1\AppData\Local\Temp\Rtmp2ptffg\downloaded_packages

Load required packages

library(klaR)

## Loading required package: MASS

library(psych)

## Warning: package 'psych' was built under R version 4.1.3

library(MASS)
library(ggord)

## Warning: package 'ggord' was built under R version 4.1.3

library(devtools)

## Warning: package 'devtools' was built under R version 4.1.3

## Loading required package: usethis

## Warning: package 'usethis' was built under R version 4.1.3

Getting Data Needed

data("iris")
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

First, we draw a scatterplot matrix of the four numeric variables. Setting gap = 0 removes the space between the panels, and bg colors the points by species.

pairs.panels(iris[1:4],
             gap = 0,
             bg = c("red", "green", "blue")[iris$Species],
             pch = 21)

The plot shows the pairwise scatterplots, a histogram of each variable, and the pairwise correlation values. Next, we want to find the combinations of these variables that best separate the three species.

Data partition

We create a training dataset for fitting the model and a test dataset for evaluating it. Each row is assigned to the training set with probability 0.6 and to the test set with probability 0.4, so the split is approximately 60% / 40% rather than exact.

set.seed(123)
ind <- sample(2, nrow(iris),
              replace = TRUE,
              prob = c(0.6, 0.4))
training <- iris[ind==1,]
testing <- iris[ind==2,]
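Because sample() assigns each row independently, the realised split sizes differ a little from the nominal 60/40. A minimal sketch for checking them, assuming the same seed and probabilities as above (the exact counts can depend on the R version's sampling algorithm):

```r
# Recreate the random assignment and inspect the realised split.
data("iris")
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]

# Number and share of rows in each partition (1 = training, 2 = testing)
split_counts <- table(ind)
split_share  <- prop.table(split_counts)
print(split_counts)
print(round(split_share, 3))

# Species counts inside the training set; these determine the LDA priors
print(table(training$Species))
```

The species counts are worth a look because lda() uses them as prior probabilities when no prior is supplied.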

Linear Discriminant Analysis (LDA)

linear <- lda(Species~., training)
linear

## Call:
## lda(Species ~ ., data = training)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3370787  0.3370787  0.3258427 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa         4.946667    3.380000     1.443333    0.250000
## versicolor     5.943333    2.803333     4.240000    1.316667
## virginica      6.527586    2.920690     5.489655    2.048276
## 
## Coefficients of linear discriminants:
##                     LD1         LD2
## Sepal.Length  0.3629008  0.05215114
## Sepal.Width   2.2276982  1.47580354
## Petal.Length -1.7854533 -1.60918547
## Petal.Width  -3.9745504  4.10534268
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9932 0.0068

From the LDA results above, the prior probabilities show the share of each species in the training data: about 33.7% setosa, 33.7% versicolor, and 32.6% virginica.
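The priors are simply the observed class proportions when no prior argument is given. A small sketch of that relationship, fitted on the full iris data (an assumption made only to keep the example self-contained, so the numbers differ from the training-set output above):

```r
library(MASS)

data("iris")
fit <- lda(Species ~ ., data = iris)  # full data, for illustration only

# With no 'prior' argument, lda() uses the class proportions counts / N
manual_prior <- fit$counts / fit$N
print(manual_prior)
print(fit$prior)

# Priors can also be set explicitly, e.g. equal weight for each species
fit_equal <- lda(Species ~ ., data = iris, prior = c(1, 1, 1) / 3)
print(fit_equal$prior)
```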

Each discriminant function is a linear combination of the four variables, with the coefficients listed above. The proportion of trace shows that the first discriminant function accounts for 99.32% of the separation between groups and the second for only 0.68%.
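The proportion of trace can be recomputed from the svd component of the fitted object, which holds the singular values for each discriminant. A hedged sketch, again fitted on the full iris data for self-containment (so the values will not match the training-set output exactly):

```r
library(MASS)

data("iris")
fit <- lda(Species ~ ., data = iris)  # full data, for illustration only

# Squared singular values, normalised to sum to 1, give the proportion of trace
trace_share <- fit$svd^2 / sum(fit$svd^2)
names(trace_share) <- colnames(fit$scaling)
print(round(trace_share, 4))
```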

attributes(linear)

## $names
##  [1] "prior"   "counts"  "means"   "scaling" "lev"     "svd"     "N"      
##  [8] "call"    "terms"   "xlevels"
## 
## $class
## [1] "lda"

p <- predict(linear, training)
ldahist(data = p$x[,1], g = training$Species)

These histograms show the LD1 scores for each species. Setosa does not overlap with versicolor or virginica, but there is some overlap between versicolor and virginica.
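The scores in p$x are the centred predictors multiplied by the coefficient matrix. The sketch below reproduces them by hand; it assumes the centring used by predict() is the prior-weighted mean of the group means, which is how I understand MASS to do it, and it is fitted on the full iris data to stay self-contained:

```r
library(MASS)

data("iris")
fit <- lda(Species ~ ., data = iris)  # full data, for illustration only
p <- predict(fit, iris)

X <- as.matrix(iris[, 1:4])

# Centre at the prior-weighted mean of the group means (assumed to match predict())
centre <- colSums(fit$prior * fit$means)

# Discriminant scores = centred data times the LD coefficients
manual_scores <- scale(X, center = centre, scale = FALSE) %*% fit$scaling

# Largest absolute difference from predict(); should be numerically zero
max_diff <- max(abs(manual_scores - p$x))
print(max_diff)
```

The first column of manual_scores is what ldahist() plots for LD1, and the second column is LD2.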

ldahist(data = p$x[,2], g = training$Species)

The histograms based on LD2 show complete overlap between the species, so LD2 contributes little to the separation.

Bi-Plot

ggord(linear, training$Species, ylim = c(-10, 10))

The biplot shows the observations on LD1 and LD2. Setosa is separated very clearly, while there is some overlap between versicolor and virginica. Based on the arrows, sepal width and sepal length contribute more toward setosa, while petal width and petal length contribute more toward versicolor and virginica.

Partition plot

The partition plot shows, for every pair of variables, how the observations in the training dataset are classified. It is drawn below for both LDA and QDA.

partimat(Species~., data = training, method = "lda")

partimat(Species~., data = training, method = "qda")

Confusion matrix and accuracy – training data

p1 <- predict(linear, training)$class
tab <- table(Predicted = p1, Actual = training$Species)
tab

##             Actual
## Predicted    setosa versicolor virginica
##   setosa         30          0         0
##   versicolor      0         30         0
##   virginica       0          0        29

Let’s look at the correct classifications and the misclassifications.

In the training dataset, the number of correct classifications is 30 + 30 + 29 = 89, which is every observation.

sum(diag(tab))/sum(tab)

## [1] 1

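The model’s accuracy on the training data is 1, meaning every training observation is classified correctly. Accuracy measured on the data used to fit the model tends to be optimistic, so the test data gives a fairer estimate.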
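One way to get a less optimistic estimate without a separate test set is leave-one-out cross-validation, which lda() supports through its CV argument. A minimal sketch, run on the full iris data rather than the training split (an assumption made to keep the example self-contained):

```r
library(MASS)

data("iris")

# CV = TRUE returns leave-one-out predictions instead of a fitted model object
cv_fit <- lda(Species ~ ., data = iris, CV = TRUE)

# Confusion matrix and accuracy from the held-out predictions
cv_tab <- table(Predicted = cv_fit$class, Actual = iris$Species)
print(cv_tab)

cv_accuracy <- sum(diag(cv_tab)) / sum(cv_tab)
print(cv_accuracy)
```

Note that with CV = TRUE the result has no scaling component, so it cannot be passed to predict() or ggord().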

Confusion matrix and accuracy – testing data

p2 <- predict(linear, testing)$class
tab1 <- table(Predicted = p2, Actual = testing$Species)
tab1

##             Actual
## Predicted    setosa versicolor virginica
##   setosa         20          0         0
##   versicolor      0         19         1
##   virginica       0          1        20

sum(diag(tab1))/sum(tab1)

## [1] 0.9672131

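The accuracy of the model on the test data is about 0.967 (59 of 61 observations classified correctly). The only errors are one versicolor classified as virginica and one virginica classified as versicolor.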
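Overall accuracy hides which species are being confused. As a sketch, the per-class recall and precision can be read off the same table; here the test confusion matrix is typed in by hand from the output above so the example runs on its own:

```r
# Test-set confusion matrix copied from the output above
# (rows = predicted species, columns = actual species)
species <- c("setosa", "versicolor", "virginica")
tab1 <- matrix(c(20,  0,  0,
                  0, 19,  1,
                  0,  1, 20),
               nrow = 3, byrow = TRUE,
               dimnames = list(Predicted = species, Actual = species))

# Overall accuracy: correct predictions over all predictions
accuracy <- sum(diag(tab1)) / sum(tab1)

# Recall: share of each actual species that was predicted correctly
recall <- diag(tab1) / colSums(tab1)

# Precision: share of each predicted species that was actually correct
precision <- diag(tab1) / rowSums(tab1)

print(round(accuracy, 4))
print(round(rbind(recall, precision), 3))
```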

Conclusion

The histograms and the biplot provide useful insights and help with interpretation. If the group covariance matrices do not differ greatly, linear discriminant analysis will perform about as well as quadratic discriminant analysis. Because LDA draws linear boundaries between the classes, it is not suited to problems where the classes are separated by non-linear boundaries.
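The comparison between linear and quadratic discriminant analysis can be checked directly on the test data. The following is a sketch only: it recreates the split used earlier, and test_accuracy is a hypothetical helper name introduced for this example.

```r
library(MASS)

# Recreate the training / test split used earlier
data("iris")
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]

# Fit both models on the training data
lda_fit <- lda(Species ~ ., data = training)
qda_fit <- qda(Species ~ ., data = training)

# Hypothetical helper: share of test rows classified correctly
test_accuracy <- function(fit) {
  mean(predict(fit, testing)$class == testing$Species)
}

acc <- c(lda = test_accuracy(lda_fit), qda = test_accuracy(qda_fit))
print(round(acc, 4))
```

Similar accuracies for the two models would be consistent with the group covariance matrices not differing much.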

Darwin Mangubat

3/29/2022