Linear Discriminant Analysis in R

Load Library

library(klaR)

## Loading required package: MASS

library(psych)
library(MASS)
library(ggord)
library(devtools)

## Loading required package: usethis

Getting data. Total of 150 observations and 5 variables contains in the iris dataset.

data("iris")
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

Create a scatterplot for the first four numerical variables. The gap between the points given is zero.

pairs.panels(iris[1:4],
             gap = 0,
             bg = c("red", "green", "blue")[iris$Species],
             pch = 21)

Data partition. Let’s create a training dataset and test dataset for prediction and testing purposes. 60% dataset used for training purposes and 40$ used for testing purposes.

set.seed(123)
ind <- sample(2, nrow(iris),
              replace = TRUE,
              prob = c(0.6, 0.4))
training <- iris[ind==1,]
testing <- iris[ind==2,]

Linear discriminant analysis

linear <- lda(Species~., training)
linear

## Call:
## lda(Species ~ ., data = training)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3370787  0.3370787  0.3258427 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa         4.946667    3.380000     1.443333    0.250000
## versicolor     5.943333    2.803333     4.240000    1.316667
## virginica      6.527586    2.920690     5.489655    2.048276
## 
## Coefficients of linear discriminants:
##                     LD1         LD2
## Sepal.Length  0.3629008  0.05215114
## Sepal.Width   2.2276982  1.47580354
## Petal.Length -1.7854533 -1.60918547
## Petal.Width  -3.9745504  4.10534268
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9932 0.0068

Based on the training dataset, 33% belongs to setosa group, 33% belongs to versicolor groups and 32% belongs to virginica groups

Decision Trees in R

The first discriminant function is a linear combination of the four variables.

Percentage separations achieved by the first discriminant function is 99.32% and second is 0.63%

attributes(linear)

## $names
##  [1] "prior"   "counts"  "means"   "scaling" "lev"     "svd"     "N"      
##  [8] "call"    "terms"   "xlevels"
## 
## $class
## [1] "lda"

Histogram. Stacked histogram for discriminant function values.

p <- predict(linear, training)
ldahist(data = p$x[,1], g = training$Species)

These histograms are based on ld1. It’s clearly evident that there are overlaps between first and second and first and third species. Also some overlap observed between the second and third species.

Market Basket Analysis in R

ldahist(data = p$x[,2], g = training$Species)

histogram based on lda2 showing complete overlap and its not good.

Bi-Plot

ggord(linear, training$Species, ylim = c(-10, 10))

Biplot based on LD1 and LD2. Setosa separated very clearly and some overlap observed between Versicolor and virginica.

Based on arrows, Sepal width and sepal length explained more for setosa, petal width and petal length explained more for versicolor and virginica.

Deep Neural Network in R

Partition plot. It provides the classification of each and every combination in the training dataset.

partimat(Species~., data = training, method = "lda")

partimat(Species~., data = training, method = "qda")

Confusion matrix and accuracy – training data

p1 <- predict(linear, training)$class
tab <- table(Predicted = p1, Actual = training$Species)
tab

##             Actual
## Predicted    setosa versicolor virginica
##   setosa         30          0         0
##   versicolor      0         30         0
##   virginica       0          0        29

sum(diag(tab))/sum(tab)

## [1] 1

In the training dataset total correct classification is 30+30+29=89

The accuracy of the model is 1.

Confusion matrix and accuracy – testing data

p2 <- predict(linear, testing)$class
tab1 <- table(Predicted = p2, Actual = testing$Species)
tab1

##             Actual
## Predicted    setosa versicolor virginica
##   setosa         20          0         0
##   versicolor      0         19         1
##   virginica       0          1        20

sum(diag(tab1))/sum(tab1)

## [1] 0.9672131

The accuracy of the model is around .9672131

Conclusion. Histogram and Biplot provide useful insights and helpful for interpretations and if there is not a great difference in the group covariance matrices, then the linear discriminant analysis will perform as well as quadratic. LDA is not useful for solving non-linear problems.

Linear Discriminant Analysis in R

Kristel Jean Guzman

2022-03-28