Load Libraries
library(klaR)
## Loading required package: MASS
library(psych)
library(MASS)
library(ggord)
library(devtools)
## Loading required package: usethis
Getting data. The iris dataset contains 150 observations of 5 variables.
data("iris")
str(iris)
## 'data.frame': 150 obs. of 5 variables:
## $ Sepal.Length: num 5.1 4.9 4.7 4.6 5 5.4 4.6 5 4.4 4.9 ...
## $ Sepal.Width : num 3.5 3 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1 ...
## $ Petal.Length: num 1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 ...
## $ Petal.Width : num 0.2 0.2 0.2 0.2 0.2 0.4 0.3 0.2 0.2 0.1 ...
## $ Species : Factor w/ 3 levels "setosa","versicolor",..: 1 1 1 1 1 1 1 1 1 1 ...
Create a scatterplot matrix for the four numerical variables. Setting gap = 0 removes the space between the panels.
pairs.panels(iris[1:4],
gap = 0,
bg = c("red", "green", "blue")[iris$Species],
pch = 21)
Data partition. Let’s create a training dataset and a test dataset for
model fitting and evaluation. Roughly 60% of the observations are used for training
and 40% for testing; each row is assigned at random, so the split is approximate.
set.seed(123)
ind <- sample(2, nrow(iris),
replace = TRUE,
prob = c(0.6, 0.4))
training <- iris[ind==1,]
testing <- iris[ind==2,]
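Because sample() assigns each row independently, the realised split will not be exactly 60/40. A minimal sketch for checking the sizes and the species balance of each partition is shown here; the name split_sizes is introduced only for illustration, and the exact counts depend on the R version's random number generator.

```r
data("iris")
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]

# Realised partition sizes and their share of the full dataset
split_sizes <- c(training = nrow(training), testing = nrow(testing))
split_sizes
round(split_sizes / nrow(iris), 3)

# Species counts within each partition
table(training$Species)
table(testing$Species)
```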
Linear discriminant analysis
linear <- lda(Species~., training)
linear
## Call:
## lda(Species ~ ., data = training)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.3370787 0.3370787 0.3258427
##
## Group means:
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## setosa 4.946667 3.380000 1.443333 0.250000
## versicolor 5.943333 2.803333 4.240000 1.316667
## virginica 6.527586 2.920690 5.489655 2.048276
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.3629008 0.05215114
## Sepal.Width 2.2276982 1.47580354
## Petal.Length -1.7854533 -1.60918547
## Petal.Width -3.9745504 4.10534268
##
## Proportion of trace:
## LD1 LD2
## 0.9932 0.0068
Based on the training dataset, 33.7% of the observations belong to the setosa group, 33.7% to the versicolor group and 32.6% to the virginica group.
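By default, lda() takes the prior probabilities from the class proportions in the training data. The sketch below recomputes them by hand and compares them with the priors stored in the model; class_share is a name introduced here for illustration.

```r
library(MASS)
data("iris")
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
linear <- lda(Species ~ ., training)

# Share of each species among the training rows
class_share <- prop.table(table(training$Species))
class_share

# Compare with the priors stored in the fitted model
all.equal(as.numeric(class_share), as.numeric(linear$prior))
```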
The first discriminant function is a linear combination of the four variables.
The proportion of between-group separation achieved by the first discriminant function is 99.32%, and by the second 0.68%.
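The proportion of trace printed with the model can be recomputed from the singular values stored in the svd component of the fitted object. This is a small sketch, with prop_trace as an illustrative name.

```r
library(MASS)
data("iris")
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
linear <- lda(Species ~ ., training)

# Squared singular values, normalised to sum to 1,
# give the share of between-group variance per discriminant
prop_trace <- linear$svd^2 / sum(linear$svd^2)
names(prop_trace) <- colnames(linear$scaling)
round(prop_trace, 4)
```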
attributes(linear)
## $names
## [1] "prior" "counts" "means" "scaling" "lev" "svd" "N"
## [8] "call" "terms" "xlevels"
##
## $class
## [1] "lda"
Histogram. Stacked histogram for discriminant function values.
p <- predict(linear, training)
ldahist(data = p$x[,1], g = training$Species)
These histograms are based on LD1. Setosa shows no overlap with either versicolor or virginica, while some overlap is observed between versicolor and virginica.
ldahist(data = p$x[,2], g = training$Species)
The histogram based on LD2 shows almost complete overlap between the three species, so LD2 on its own separates the groups poorly.
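The values plotted in both histograms are the discriminant scores returned in p$x. As a sketch of what predict() does to obtain them, the predictors can be centred on the prior-weighted mean of the group means and multiplied by the coefficient matrix; X, centre and scores are names introduced here for illustration.

```r
library(MASS)
data("iris")
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
linear <- lda(Species ~ ., training)
p <- predict(linear, training)

# Predictor matrix, columns in the same order as the model terms
X <- as.matrix(training[, 1:4])

# Prior-weighted average of the group means (row i of means is scaled by prior i)
centre <- colSums(linear$prior * linear$means)

# Centre the data, then project onto the discriminant coefficients
scores <- scale(X, center = centre, scale = FALSE) %*% linear$scaling

# Should agree with the scores returned by predict()
all.equal(unname(scores), unname(p$x))
```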
Bi-Plot
ggord(linear, training$Species, ylim = c(-10, 10))
The biplot is based on LD1 and LD2. Setosa is separated very clearly, and some overlap is observed between versicolor and virginica.
Based on the arrows, sepal width and sepal length contribute more towards setosa, while petal width and petal length contribute more towards versicolor and virginica.
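If ggord is not available, the same LD1 versus LD2 view of the observations can be approximated with base graphics. This sketch plots only the scores, without the variable arrows that ggord draws; the colour vector species_cols is an illustrative choice.

```r
library(MASS)
data("iris")
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
linear <- lda(Species ~ ., training)
p <- predict(linear, training)

# One colour per species, indexed by the factor codes
species_cols <- c("red", "green3", "blue")

# Scatter of the training scores on the two discriminants
plot(p$x[, "LD1"], p$x[, "LD2"],
     col = species_cols[training$Species], pch = 19,
     xlab = "LD1", ylab = "LD2")
legend("topright", legend = levels(training$Species),
       col = species_cols, pch = 19)
```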
Partition plot. It shows the classification regions for every pairwise combination of the variables, fitted on the training dataset.
partimat(Species~., data = training, method = "lda")
partimat(Species~., data = training, method = "qda")
Confusion matrix and accuracy – training data
p1 <- predict(linear, training)$class
tab <- table(Predicted = p1, Actual = training$Species)
tab
## Actual
## Predicted setosa versicolor virginica
## setosa 30 0 0
## versicolor 0 30 0
## virginica 0 0 29
sum(diag(tab))/sum(tab)
## [1] 1
In the training dataset, the total number of correct classifications is 30 + 30 + 29 = 89 out of 89 observations.
The accuracy of the model on the training data is therefore 1.
Confusion matrix and accuracy – testing data
p2 <- predict(linear, testing)$class
tab1 <- table(Predicted = p2, Actual = testing$Species)
tab1
## Actual
## Predicted setosa versicolor virginica
## setosa 20 0 0
## versicolor 0 19 1
## virginica 0 1 20
sum(diag(tab1))/sum(tab1)
## [1] 0.9672131
The accuracy of the model on the testing data is about 0.967 (59 of 61 observations classified correctly).
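Overall accuracy hides which species are confused with each other. A minimal sketch of per-class figures from the same kind of confusion matrix follows; recall and precision are names introduced here, and the values obtained depend on the realised partition.

```r
library(MASS)
data("iris")
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]
linear <- lda(Species ~ ., training)

p2 <- predict(linear, testing)$class
tab1 <- table(Predicted = p2, Actual = testing$Species)

# Overall accuracy: correct predictions over all test rows
accuracy <- sum(diag(tab1)) / sum(tab1)

# Recall: share of each actual species predicted correctly (columns are Actual)
recall <- diag(tab1) / colSums(tab1)

# Precision: share of each predicted species that is correct (rows are Predicted)
precision <- diag(tab1) / rowSums(tab1)

round(accuracy, 4)
round(rbind(recall, precision), 3)
```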
Conclusion. The histograms and the biplot provide useful insights and help with interpretation. If the group covariance matrices do not differ greatly, linear discriminant analysis will perform as well as quadratic discriminant analysis. LDA is not useful for solving non-linear problems.
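One way to check the LDA versus QDA claim on this data is to fit both models on the same training rows and compare their test accuracy. This is a rough sketch rather than a formal comparison; the helper acc() and the object quadratic are introduced here for illustration.

```r
library(MASS)
data("iris")
set.seed(123)
ind <- sample(2, nrow(iris), replace = TRUE, prob = c(0.6, 0.4))
training <- iris[ind == 1, ]
testing  <- iris[ind == 2, ]

# Fit linear and quadratic discriminant models on the same training data
linear    <- lda(Species ~ ., training)
quadratic <- qda(Species ~ ., training)

# Hypothetical helper: share of test rows classified correctly
acc <- function(model, newdata) {
  mean(predict(model, newdata)$class == newdata$Species)
}

# Test accuracy of each model, side by side
c(lda = acc(linear, testing), qda = acc(quadratic, testing))
```

Similar accuracies would be consistent with the group covariance matrices not differing much, although a single random split is weak evidence either way.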