Installing ‘ggord’ package
# Enable the r-universe repo
options(repos = c(
  fawda123 = 'https://fawda123.r-universe.dev',
  CRAN = 'https://cloud.r-project.org'))
# Install ggord
install.packages('ggord')
## Installing package into 'C:/Users/USER1/Documents/R/win-library/4.1'
## (as 'lib' is unspecified)
## package 'ggord' successfully unpacked and MD5 sums checked
##
## The downloaded binary packages are in
## C:\Users\USER1\AppData\Local\Temp\Rtmp2ptffg\downloaded_packages
Load required packages
library(klaR)
## Loading required package: MASS
library(psych)
## Warning: package 'psych' was built under R version 4.1.3
library(MASS)
library(ggord)
## Warning: package 'ggord' was built under R version 4.1.3
library(devtools)
## Warning: package 'devtools' was built under R version 4.1.3
## Loading required package: usethis
## Warning: package 'usethis' was built under R version 4.1.3
Getting Data Needed
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
First, we create a scatterplot matrix of the four numerical variables. Setting gap = 0 removes the space between the panels.
pairs.panels(iris[1:4],
             gap = 0,
             bg = c("red", "green", "blue")[iris$Species],
             pch = 21)
The plot above shows the pairwise scatterplots, the histogram of each variable, and the correlation values. Next, we want to find the combination of variables that best separates the three species.
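The correlation values shown in the plot can also be obtained numerically. A minimal sketch using base R's cor(), assuming the Pearson correlation (the default for both cor() and pairs.panels()):

```r
# Pearson correlation matrix of the four numeric iris variables
data("iris")
cors <- cor(iris[1:4])
round(cors, 2)
```

Petal.Length and Petal.Width are the most strongly correlated pair, which is also visible in the scatterplot matrix.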
Data partition
We create a training dataset and a testing dataset for model fitting and evaluation. About 60% of the data is used for training and about 40% for testing.
set.seed(123)
ind <- sample(2, nrow(iris),
              replace = TRUE,
              prob = c(0.6, 0.4))
training <- iris[ind==1,]
testing <- iris[ind==2,]
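Because sample() assigns each row to a group at random, the split is only approximately 60/40 and is not stratified by species. A quick check, repeating the partition so the snippet runs on its own:

```r
data("iris")
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]

# Realised share of rows in each partition
split_share <- prop.table(table(ind))
split_share

# Species counts per partition (the random split is not stratified)
table(training$Species)
table(testing$Species)
```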
Linear Discriminant Analysis (LDA)
linear <- lda(Species~., training)
linear
## Call:
## lda(Species ~ ., data = training)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.3370787 0.3370787 0.3258427
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 4.946667 3.380000 1.443333 0.250000
## versicolor 5.943333 2.803333 4.240000 1.316667
## virginica 6.527586 2.920690 5.489655 2.048276
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.3629008 0.05215114
## Sepal.Width 2.2276982 1.47580354
## Petal.Length -1.7854533 -1.60918547
## Petal.Width -3.9745504 4.10534268
##
## Proportion of trace:
## LD1 LD2
## 0.9932 0.0068
From the LDA results above, the prior probabilities show that 33.7% of the training observations belong to the setosa group, 33.7% to the versicolor group, and 32.6% to the virginica group.
The first discriminant function is a linear combination of the four variables. According to the proportion of trace, the first discriminant function accounts for 99.32% of the between-group separation and the second for 0.68%.
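Both sets of numbers can be recomputed from the fitted object. The sketch below repeats the partition and the fit so that it is self-contained. It assumes the default priors (the class proportions in the training data) and uses the svd component of the fit, whose squared values, normalised to sum to 1, give the proportion of trace.

```r
library(MASS)
data("iris")
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
linear <- lda(Species ~ ., training)

# Default priors are the class proportions in the training data
class_share <- table(training$Species) / nrow(training)
class_share

# Proportion of trace: squared singular values, normalised to sum to 1
prop_trace <- linear$svd^2 / sum(linear$svd^2)
names(prop_trace) <- colnames(linear$scaling)
round(prop_trace, 4)
```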
attributes(linear)
## $names
## [1] "prior" "counts" "means" "scaling" "lev" "svd" "N"
## [8] "call" "terms" "xlevels"
##
## $class
## [1] "lda"
p <- predict(linear, training)
ldahist(data = p$x[,1], g = training$Species)
These histograms are based on LD1. Setosa clearly does not overlap with either versicolor or virginica, but there is some overlap between versicolor and virginica.
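The values in p$x are the discriminant scores. As a sketch of where they come from, the scores can be reproduced by centring the predictors at the prior-weighted average of the group means and multiplying by the coefficients of linear discriminants (linear$scaling). This mirrors what predict() does under its default settings; the partition and fit are repeated so the snippet runs on its own.

```r
library(MASS)
data("iris")
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
linear <- lda(Species ~ ., training)
p <- predict(linear, training)

# Centre: prior-weighted average of the group means
centre <- colSums(linear$prior * linear$means)

# Scores: centred predictors times the discriminant coefficients
X <- as.matrix(training[, 1:4])
scores <- scale(X, center = centre, scale = FALSE) %*% linear$scaling

# Should agree with the scores returned by predict()
all.equal(scores, p$x, check.attributes = FALSE)
```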
ldahist(data = p$x[,2], g = training$Species)
The histograms based on LD2 overlap almost completely across the three species, so LD2 on its own separates the groups poorly.
Bi-Plot
ggord(linear, training$Species, ylim = c(-10, 10))
This biplot is based on LD1 and LD2. Setosa is separated very clearly, while some overlap is observed between versicolor and virginica. Based on the arrows, sepal width and sepal length point toward setosa, whereas petal width and petal length point toward versicolor and virginica.
Partition plot
This plots the classification regions for every pair of variables in the training dataset, first using LDA and then using QDA.
partimat(Species~., data = training, method = "lda")
partimat(Species~., data = training, method = "qda")
Confusion matrix and accuracy – training data
p1 <- predict(linear, training)$class
tab <- table(Predicted = p1, Actual = training$Species)
tab
## Actual
## Predicted setosa versicolor virginica
## setosa 30 0 0
## versicolor 0 30 0
## virginica 0 0 29
Let's look at the correct classifications and the misclassifications.
In the training dataset, the total number of correct classifications is 30 + 30 + 29 = 89, out of 89 observations.
sum(diag(tab))/sum(tab)
## [1] 1
The model's accuracy on the training data is 1, meaning every training observation is classified correctly.
Confusion matrix and accuracy – testing data
p2 <- predict(linear, testing)$class
tab1 <- table(Predicted = p2, Actual = testing$Species)
tab1
## Actual
## Predicted setosa versicolor virginica
## setosa 20 0 0
## versicolor 0 19 1
## virginica 0 1 20
sum(diag(tab1))/sum(tab1)
## [1] 0.9672131
The accuracy of the model on the testing data is about 0.967 (59 of 61 observations classified correctly).
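Overall accuracy does not show which species are confused with each other. A small sketch that rebuilds the testing confusion matrix from the counts printed above and computes, for each species, the share of actual observations that were predicted correctly:

```r
# Testing confusion matrix, entered from the output above
species <- c("setosa", "versicolor", "virginica")
tab1 <- matrix(c(20,  0,  0,
                  0, 19,  1,
                  0,  1, 20),
               nrow = 3, byrow = TRUE,
               dimnames = list(Predicted = species, Actual = species))

# Overall accuracy: correct predictions over all predictions
accuracy <- sum(diag(tab1)) / sum(tab1)

# Per-species share of actual observations predicted correctly
per_class <- diag(tab1) / colSums(tab1)

accuracy
per_class
```

All errors occur between versicolor and virginica, which matches the overlap seen in the LD1 histograms and the biplot.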
Conclusion
The histograms and the biplot provide useful insights and help with interpretation. If the group covariance matrices do not differ greatly, linear discriminant analysis will perform as well as quadratic discriminant analysis. LDA is not well suited to problems in which the groups are separated by non-linear boundaries.
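One way to check the LDA versus QDA comparison on this dataset is to fit both models with MASS and compare their accuracy on the testing data. This is only a sketch that reuses the same random partition; the helper acc() is a name introduced here for illustration, and a single split is not a rigorous comparison.

```r
library(MASS)
data("iris")
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]

lda_fit <- lda(Species ~ ., training)
qda_fit <- qda(Species ~ ., training)

# Hypothetical helper: share of testing rows classified correctly
acc <- function(fit) mean(predict(fit, testing)$class == testing$Species)

res <- c(LDA = acc(lda_fit), QDA = acc(qda_fit))
res
```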