The required packages are loaded below:
library(tidyverse)
library(MASS)
library(psych)
library(heplots)
library(plotly)
library(MVTests)
library(mvnormtest)
library(klaR)
During this problem set we will build an LDA model on the Iris dataset to determine group membership for the Species variable and see how the model performs. Before building the LDA model we should check some important assumptions. The assumptions for LDA are similar to those of MANOVA, given that the two are equivalent.
iris.dat <- iris
The correlation matrix is computed below. We can observe that the variables exhibit moderate to strong linear correlations among themselves, indicating that some multicollinearity exists. This is a violation of an LDA assumption, since LDA is equivalent to MANOVA.
cor(iris.dat[,-5])
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
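A quick numeric summary of this (a small sketch, mirroring the per-group determinants computed later in the pipeline): the determinant of the correlation matrix shrinks toward zero as collinearity grows, so a value near 0 confirms strong linear dependence among the predictors.
# The determinant of the correlation matrix is a quick collinearity
# gauge: values near 0 indicate strong linear dependence, values near
# 1 indicate nearly uncorrelated predictors
det(cor(iris.dat[, -5]))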
pipeline <- function(data, group){
  fact <- levels(group)
  cat('====================NORMALITY TEST==================\n')
  # Multivariate Shapiro-Wilk test per group (mshapiro.test expects
  # variables in rows, hence the transpose)
  for(x in fact){
    cat(x)
    print(data[group == x,][,-5] %>% t() %>% mshapiro.test())
  }
  cat('====================BOXPLOTS==================\n')
  par(mfrow = c(1,3))
  for (x in fact){
    data[group == x,][,-5] %>% boxplot(main = x)
  }
  cat('====================VARIANCE HOMOGENEITY==================\n')
  # Box's M test for homogeneity of the covariance matrices
  # (uses the function arguments rather than the global iris object)
  print(BoxM(data = data[,-5], group = group))
  cat('====================COR MATRIX DETERMINANTS==================\n')
  cat('Determinants:\n ')
  for (x in fact){
    t <- data[group == x,][,-5] %>% cor() %>% det()
    cat(x, ":", t, "\n ")
  }
}
pipeline(iris,iris$Species)
====================NORMALITY TEST==================
Setosa
Shapiro-Wilk normality test
data: Z
W = 0.95878, p-value = 0.07906
Versicolor
Shapiro-Wilk normality test
data: Z
W = 0.93043, p-value = 0.005739
Virginica
Shapiro-Wilk normality test
data: Z
W = 0.93414, p-value = 0.007955
====================BOXPLOTS==================
====================VARIANCE HOMOGENEITY==================
$Chisq
[1] 140.943
$df
[1] 20
$p.value
[1] 3.352034e-20
$Test
[1] "BoxM"
attr(,"class")
[1] "MVTests" "list"
====================COR MATRIX DETERMINANTS==================
Determinants:
Setosa : 0.3533595
Versicolor : 0.08359418
Virginica : 0.1373902
The small p-value indicates that the variance-covariance matrices are not homogeneous, which violates one of the main assumptions, given that LDA is equivalent to MANOVA.
The Setosa group is the only one consistent with multivariate normality at both the 5% and 1% levels. The Versicolor and Virginica groups fail the normality test even at the 1% level, which is a violation of the required assumptions.
The boxplots show the presence of outliers in the dataset, another violation of an important LDA assumption. Two models could be fit, one on the data as is and another with corrections for some of the violations, to compare how they perform.
The correlation matrix determinants are all positive, so none of the within-group matrices is singular and we can proceed.
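As a sketch of one possible correction (hypothetical; the 97.5% chi-squared cutoff is an assumption, not part of the original analysis), multivariate outliers can be flagged within each species using Mahalanobis distances:
# Flag multivariate outliers per species: squared Mahalanobis distances
# are compared against a chi-squared cutoff with df = 4 predictors
cutoff <- qchisq(0.975, df = 4)
outliers <- lapply(split(iris.dat[, -5], iris.dat$Species), function(d) {
  d2 <- mahalanobis(d, colMeans(d), cov(d))
  which(d2 > cutoff)
})
outliers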
iris.dat %>%
  group_by(Species) %>%
  summarise(Avg.SepLen = mean(Sepal.Length),
            Avg.SepWid = mean(Sepal.Width),
            Avg.PetLen = mean(Petal.Length),
            Avg.PetWid = mean(Petal.Width))
# A tibble: 3 x 5
Species Avg.SepLen Avg.SepWid Avg.PetLen Avg.PetWid
<fct> <dbl> <dbl> <dbl> <dbl>
1 Setosa 5.01 3.43 1.46 0.246
2 Versicolor 5.94 2.77 4.26 1.33
3 Virginica 6.59 2.97 5.55 2.03
We can observe the means for each species above; Setosa appears to be the species that structurally differs the most from the other two.
fig <- plot_ly(iris, x = ~Sepal.Length, y = ~Petal.Width, z = ~Petal.Length,
               color = ~Species, colors = c('red', 'blue', 'green'))
fig <- fig %>% add_markers()
# Axis titles match the mapped variables (y is Petal Width, not Sepal Width)
fig <- fig %>% layout(scene = list(xaxis = list(title = 'Sepal Length'),
                                   yaxis = list(title = 'Petal Width'),
                                   zaxis = list(title = 'Petal Length')))
fig
ir.out <- lda(Species ~ ., data = iris.dat)
ir.out
Call:
lda(Species ~ ., data = iris.dat)
Prior probabilities of groups:
Setosa Versicolor Virginica
0.3333333 0.3333333 0.3333333
Group means:
Sepal.Length Sepal.Width Petal.Length Petal.Width
Setosa 5.006 3.428 1.462 0.246
Versicolor 5.936 2.770 4.260 1.326
Virginica 6.588 2.974 5.552 2.026
Coefficients of linear discriminants:
LD1 LD2
Sepal.Length 0.8293776 0.02410215
Sepal.Width 1.5344731 2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width -2.8104603 2.83918785
Proportion of trace:
LD1 LD2
0.9912 0.0088
We can see that the model yields two discriminant functions, with coefficients rounded to two decimals: \[
D_1 = 0.83\,SL + 1.53\,SW - 2.20\,PL - 2.81\,PW
\] \[
D_2 = 0.02\,SL + 2.16\,SW - 0.93\,PL + 2.84\,PW
\] where \(SL\), \(SW\), \(PL\), \(PW\) denote sepal length/width and petal length/width.
We can observe in the proportion of trace that LD1 explains about 99% of the between-group variance, so the first discriminant function does nearly all of the separating.
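To connect these equations back to the fitted object, the LD scores can be reproduced by hand: predict.lda centers the predictors at the prior-weighted mean of the group means and projects onto the coefficient matrix. A minimal sketch:
# Rebuild the discriminant scores from the coefficients: center the
# predictors at the prior-weighted grand centroid, then project onto
# the LD directions stored in ir.out$scaling
ctr <- colSums(ir.out$prior * ir.out$means)
scores <- scale(as.matrix(iris.dat[, -5]), center = ctr, scale = FALSE) %*%
  ir.out$scaling
all.equal(unname(scores), unname(predict(ir.out)$x))  # should be TRUE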
# Partitions
partimat(Species ~ ., data = iris)
We can see above in the partition plot how the model partitions the data across the different pairs of variables it takes into account. This is a nice visual way to see how the model makes its classifications.
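partimat draws LDA boundaries by default; since Box's M rejected covariance homogeneity earlier, comparing against quadratic boundaries may be informative (a sketch using partimat's method argument):
# QDA fits a separate covariance matrix per group, relaxing the
# homogeneity assumption that Box's M rejected above
partimat(Species ~ ., data = iris, method = "qda")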
iris.pred <- predict(ir.out, iris.dat)
ldahist(data = iris.pred$x[,1], g = iris.dat$Species)
ldahist(data = iris.pred$x[,2], g = iris.dat$Species)
plot(ir.out, col = as.integer(iris$Species))
We can see in the histograms that LD1 does a good job of separating the groups, whereas LD2 struggles to define group membership. However, given that 99% of the between-group variance is attributed to LD1, we can disregard the poor performance of LD2.
# cbind() coerces the factors to their integer codes, so the confusion
# table below is labeled 1 = Setosa, 2 = Versicolor, 3 = Virginica
iris.results <- iris.pred$class %>% cbind()
iris.prior <- cbind(iris.dat$Species)
ct <- table(iris.prior, iris.results)
ct
iris.results
iris.prior 1 2 3
1 50 0 0
2 0 48 2
3 0 1 49
prop.table(ct)
iris.results
iris.prior 1 2 3
1 0.333333333 0.000000000 0.000000000
2 0.000000000 0.320000000 0.013333333
3 0.000000000 0.006666667 0.326666667
prop.table(ct) %>% diag() %>% sum()
[1] 0.98
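The overall accuracy hides per-class behavior; a quick breakdown (a small sketch; rows of ct are the true classes) gives the recall for each species:
# Per-class recall: correct classifications divided by the true
# class totals (the row sums of the confusion table)
diag(ct) / rowSums(ct)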
Overall the model has 98% accuracy, i.e., it correctly classified 98% of the observations. This is a very good performance despite the clear violations we saw earlier. A possible next step would be to test this classification model on out-of-sample data and evaluate its performance, since we have only assessed it on the data used to fit it.
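Until new data are available, leave-one-out cross-validation gives a less optimistic estimate; MASS::lda supports it directly via CV = TRUE (a short sketch, output not shown):
# Leave-one-out cross-validation: each flower is classified by a
# model fit without it, approximating out-of-sample accuracy
ir.cv <- lda(Species ~ ., data = iris.dat, CV = TRUE)
mean(ir.cv$class == iris.dat$Species)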
chisq.test(iris.dat$Species,iris.results)
Pearson's Chi-squared test
data: iris.dat$Species and iris.results
X-squared = 282.59, df = 4, p-value < 2.2e-16
The chi-squared test indicates a highly significant association between the true species and the predicted classes (p < 0.001), i.e., the classification results are far better than chance.