Linear Discriminant Analysis in R


Linear discriminant analysis was originally developed by R. A. Fisher in 1936 to classify subjects into one of two clearly defined groups.

It was later expanded to classify subjects into more than two groups.

Linear Discriminant Analysis (LDA) is also a dimensionality reduction technique: it reduces the number of dimensions (i.e. variables) in a dataset while retaining as much of the information that separates the groups as possible.

Basically, it finds the linear combinations of the original variables that provide the best possible separation between the groups.
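One common way to write this objective down is Fisher's criterion. The notation below is an assumption introduced for illustration and does not appear in the original post: $w$ is the vector of weights of the linear combination, $S_B$ is the between-group scatter matrix and $S_W$ is the within-group scatter matrix.

```latex
J(w) = \frac{w^{\top} S_B \, w}{w^{\top} S_W \, w}
```

LDA looks for the $w$ that makes $J(w)$ as large as possible, i.e. the direction along which the group means are far apart relative to the spread inside each group.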


The basic purpose is to estimate the relationship between a single categorical dependent variable and a set of quantitative independent variables. Typical applications include:

$\star$ Predicting success or failure of new products.

$\star$ Accepting or rejecting an applicant for admission.

$\star$ Predicting credit risk category for a person.

$\star$ Classifying patients into different categories.

Let’s look into the iris data set for further analysis.


LOAD LIBRARY

library("klaR")

## Loading required package: MASS

library("psych")
library("MASS")
library("ggord")
library("devtools")

## Loading required package: usethis

library("ggplot2")

## 
## Attaching package: 'ggplot2'

## The following objects are masked from 'package:psych':
## 
##     %+%, alpha

# ggord is installed from the author's r-universe repo.
# Run this once, before library("ggord") above, if the package is missing.
if (!requireNamespace("ggord", quietly = TRUE)) {
  options(repos = c(
      fawda123 = 'https://fawda123.r-universe.dev',
      CRAN = 'https://cloud.r-project.org'))
  install.packages('ggord')
}

GETTING THE DATA

head(iris)

##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

data("iris")
str(iris)

## 'data.frame':    150 obs. of  5 variables:
##  $ Sepal.Length: num  5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
##  $ Sepal.Width : num  3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
##  $ Petal.Length: num  1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
##  $ Petal.Width : num  0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
##  $ Species     : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...

First, we will create a scatterplot matrix of the four numeric variables. The gap between the panels is set to zero (gap = 0), and the points are coloured by species.


pairs.panels(iris[1:4],
             gap = 0,
             bg = c("red", "green", "blue")[iris$Species],
             pch = 21)

The plot shows the pairwise scatter diagrams, a histogram for each variable, and the correlation values.
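If only the numbers are needed, the correlation values can also be computed directly. This is a minimal base R sketch; the name cor_mat is chosen here for illustration.

```r
# Pairwise Pearson correlations of the four numeric columns of iris
data("iris")
cor_mat <- cor(iris[, 1:4])

# Round for readability
round(cor_mat, 2)
```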

Now we want to find the combinations of these variables that best separate the three species.

DATA PARTITION

Let’s create a training dataset and a test dataset. Each row is assigned to the training set with probability 0.6 and to the test set with probability 0.4, so the split is approximately (not exactly) 60% training and 40% testing.

set.seed(123)
ind <- sample(2, nrow(iris),
              replace = TRUE,
              prob = c(0.6, 0.4))
training <- iris[ind==1,]
testing <- iris[ind==2,]
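Because the rows are assigned at random, it can be worth checking how large the two partitions actually are and whether each species is reasonably represented in both. The sketch below repeats the split so that it runs on its own; split_sizes is a name introduced here for illustration.

```r
# Repeat the random split from above
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]

# Number of rows in each partition and their share of the full data
split_sizes <- c(training = nrow(training), testing = nrow(testing))
split_sizes
round(split_sizes / nrow(iris), 3)

# Species counts within each partition
table(training$Species)
table(testing$Species)
```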

LINEAR DISCRIMINANT ANALYSIS

linear <- lda(Species~., training)
linear

## Call:
## lda(Species ~ ., data = training)
## 
## Prior probabilities of groups:
##     setosa versicolor  virginica 
##  0.3370787  0.3370787  0.3258427 
## 
## Group means:
##            Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa         4.946667    3.380000     1.443333    0.250000
## versicolor     5.943333    2.803333     4.240000    1.316667
## virginica      6.527586    2.920690     5.489655    2.048276
## 
## Coefficients of linear discriminants:
##                     LD1         LD2
## Sepal.Length  0.3629008  0.05215114
## Sepal.Width   2.2276982  1.47580354
## Petal.Length -1.7854533 -1.60918547
## Petal.Width  -3.9745504  4.10534268
## 
## Proportion of trace:
##    LD1    LD2 
## 0.9932 0.0068

Based on the training dataset, 33.7% of the observations belong to the setosa group, 33.7% to the versicolor group, and 32.6% to the virginica group.


The first discriminant function is a linear combination of the four variables.
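To make "linear combination" concrete, the discriminant scores can be rebuilt by hand from the coefficients. This is a sketch that assumes the default behaviour of predict() for lda objects in MASS (centre the predictors at the prior-weighted mean of the group means, then multiply by the coefficient matrix); the names X, overall_mean and manual_scores are introduced here for illustration.

```r
library(MASS)

# Refit the model so that the example runs on its own
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
linear <- lda(Species ~ ., data = training)

# Centre the four predictors at the prior-weighted mean of the group means
X <- as.matrix(training[, 1:4])
overall_mean <- colSums(linear$prior * linear$means)
X_centred <- scale(X, center = overall_mean, scale = FALSE)

# Each score is a weighted sum of the centred variables (columns LD1, LD2)
manual_scores <- X_centred %*% linear$scaling

# Compare with the scores returned by predict()
all.equal(unname(manual_scores), unname(predict(linear, training)$x))
```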

The percentage of separation achieved by the first discriminant function is 99.32%, and by the second 0.68% (the "Proportion of trace" in the output above).
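The proportion of trace can also be recovered from the fitted object rather than read off the printout. A minimal sketch, using the svd component of the lda object (the singular values whose squares are normalised to give these proportions); prop_trace is a name introduced here.

```r
library(MASS)

# Refit the model so that the example runs on its own
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
linear <- lda(Species ~ ., data = training)

# Share of between-group separation captured by each discriminant function
prop_trace <- linear$svd^2 / sum(linear$svd^2)
round(prop_trace, 4)
```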

attributes(linear)

## $names
##  [1] "prior"   "counts"  "means"   "scaling" "lev"     "svd"     "N"      
##  [8] "call"    "terms"   "xlevels"
## 
## $class
## [1] "lda"
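The components listed above can be accessed with $. As a small sketch: with the default settings used here, the prior probabilities are simply the group counts divided by the number of training observations.

```r
library(MASS)

# Refit the model so that the example runs on its own
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
linear <- lda(Species ~ ., data = training)

linear$counts   # observations per species in the training data
linear$N        # total number of training observations
linear$prior    # prior probabilities used by the model
linear$scaling  # coefficients of the linear discriminants

# With default priors, prior = counts / N
all.equal(as.numeric(linear$prior), as.numeric(linear$counts) / linear$N)
```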

HISTOGRAM

Stacked histogram for discriminant function values.

p <- predict(linear, training)
ldahist(data = p$x[,1], g = training$Species)

These histograms are based on LD1. There is clearly no overlap between the first and second species or between the first and third species, but some overlap is observed between the second and third species.


ldahist(data = p$x[,2], g = training$Species)

The histograms based on LD2 overlap almost completely, so LD2 on its own does not separate the species well.

BI-PLOT

ggord(linear, training$Species, ylim = c(-10, 10))

The biplot is based on LD1 and LD2. Setosa is separated very clearly, while some overlap is observed between versicolor and virginica.

Based on the arrows, sepal width and sepal length contribute more towards setosa, while petal width and petal length contribute more towards versicolor and virginica.


PARTITION PLOT

The partition plot shows the classification regions for every pair of variables in the training dataset, first for linear and then for quadratic discriminant analysis.

partimat(Species~., data = training, method = "lda")

partimat(Species~., data = training, method = "qda")

CONFUSION MATRIX AND ACCURACY-TRAINING DATA

p1 <- predict(linear, training)$class
tab <- table(Predicted = p1, Actual = training$Species)
tab

##             Actual
## Predicted    setosa versicolor virginica
##   setosa         30          0         0
##   versicolor      0         30         0
##   virginica       0          0        29


In the training dataset, the total number of correct classifications is 30 + 30 + 29 = 89 out of 89 observations.

sum(diag(tab))/sum(tab)

## [1] 1

The accuracy on the training data is 1, i.e. every training observation is classified correctly.

CONFUSION MATRIX AND ACCURACY-TESTING DATA

p2 <- predict(linear, testing)$class
tab1 <- table(Predicted = p2, Actual = testing$Species)
tab1

##             Actual
## Predicted    setosa versicolor virginica
##   setosa         20          0         0
##   versicolor      0         19         1
##   virginica       0          1        20

sum(diag(tab1))/sum(tab1)

## [1] 0.9672131

The accuracy on the testing data is about 0.967 (59 of 61 observations classified correctly).
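Overall accuracy hides which species are being confused. As a sketch, the confusion matrix printed above can be typed in by hand and used to compute per-species recall and precision; the names recall and precision are introduced here for illustration.

```r
# Testing confusion matrix copied from the output above
# (rows = predicted, columns = actual; matrix() fills column by column)
species <- c("setosa", "versicolor", "virginica")
tab1 <- matrix(c(20, 0, 0,
                 0, 19, 1,
                 0, 1, 20),
               nrow = 3,
               dimnames = list(Predicted = species, Actual = species))

# Overall accuracy: correct predictions divided by all predictions
accuracy <- sum(diag(tab1)) / sum(tab1)

# Recall: share of each actual species that was predicted correctly
recall <- diag(tab1) / colSums(tab1)

# Precision: share of each predicted species that was actually correct
precision <- diag(tab1) / rowSums(tab1)

accuracy
round(recall, 3)
round(precision, 3)
```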

CONCLUSION

The histograms and the biplot provide useful insights and help with interpretation. If the group covariance matrices do not differ greatly, linear discriminant analysis will perform about as well as quadratic discriminant analysis. LDA is not suited to problems where the boundaries between the groups are non-linear.
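The comparison with quadratic discriminant analysis can be checked on the same split. This is a minimal sketch using qda() from MASS; the helper accuracy() and the names lda_fit, qda_fit and acc are introduced here for illustration, and the exact numbers depend on the random split.

```r
library(MASS)

# Same random split as above
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]

# Fit both models on the training data
lda_fit <- lda(Species ~ ., data = training)
qda_fit <- qda(Species ~ ., data = training)

# Share of observations in newdata whose species is predicted correctly
accuracy <- function(fit, newdata) {
  mean(predict(fit, newdata)$class == newdata$Species)
}

# Test accuracy of the linear and the quadratic model side by side
acc <- c(lda = accuracy(lda_fit, testing),
         qda = accuracy(qda_fit, testing))
acc
```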


Develin O. Omayan

3/29/2022