The required packages are loaded below.

library(tidyverse)  # data manipulation and piping
library(MASS)       # lda(), qda()
library(psych)
library(heplots)
library(plotly)     # interactive 3D scatter plot
library(MVTests)    # BoxM()
library(mvnormtest) # mshapiro.test()
library(klaR)       # partimat()

In this problem set we will build an LDA model on the Iris dataset to determine group membership for the Species variable and see how the model performs. Before building the LDA model we should check some important assumptions. The assumptions for LDA are similar to those of MANOVA, given that the two are equivalent.

iris.dat <- iris

Correlation

The correlation matrix is computed below. The variables exhibit moderate to strong linear correlations among themselves, indicating that some multicollinearity exists. This is a concern for LDA since, being equivalent to MANOVA, it assumes the predictors are not highly multicollinear.

cor(iris.dat[,-5])
             Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length    1.0000000  -0.1175698    0.8717538   0.8179411
Sepal.Width    -0.1175698   1.0000000   -0.4284401  -0.3661259
Petal.Length    0.8717538  -0.4284401    1.0000000   0.9628654
Petal.Width     0.8179411  -0.3661259    0.9628654   1.0000000
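
To make this concrete, a quick sketch that lists the predictor pairs whose absolute correlation exceeds 0.8 (the cutoff is an arbitrary rule of thumb, and the code assumes the species column is the fifth, as in the standard iris layout):

r <- cor(iris.dat[,-5])
# keep only the upper triangle so each pair is reported once
high <- which(abs(r) > 0.8 & upper.tri(r), arr.ind = TRUE)
data.frame(var1 = rownames(r)[high[,1]],
           var2 = colnames(r)[high[,2]],
           r = round(r[high], 3))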

Normality, Variance-Covariance Homogeneity, Matrix Determinants

pipeline <- function(data, group){
  fact <- levels(group)

  # Multivariate normality per group (mshapiro.test expects variables in rows)
  cat('====================NORMALITY TEST==================\n')
  for(x in fact){
    cat(x)
    print(data[group == x,][,-5] %>% t() %>% mshapiro.test())
  }

  # Boxplots per group to inspect spread and outliers
  cat('====================BOXPLOTS==================\n')
  par(mfrow = c(1,3))
  for(x in fact){
    data[group == x,][,-5] %>% boxplot(main = x)
  }

  # Box's M test for homogeneity of the variance-covariance matrices
  cat('====================VARIANCE HOMOGENEITY==================\n')
  print(BoxM(data = data[,-5], group = group))

  # Determinants of the per-group correlation matrices (singularity check)
  cat('====================COR MATRIX DETERMINANTS==================\n')
  cat('Determinants:\n ')
  for(x in fact){
    t <- data[group == x,][,-5] %>% cor() %>% det()
    cat(x, ":", t, "\n ")
  }
}
pipeline(iris, iris$Species)
====================NORMALITY TEST==================
Setosa
    Shapiro-Wilk normality test

data:  Z
W = 0.95878, p-value = 0.07906

Versicolor
    Shapiro-Wilk normality test

data:  Z
W = 0.93043, p-value = 0.005739

Virginica
    Shapiro-Wilk normality test

data:  Z
W = 0.93414, p-value = 0.007955

====================BOXPLOTS==================

====================VARIANCE HOMOGENEITY==================
$Chisq
[1] 140.943

$df
[1] 20

$p.value
[1] 3.352034e-20

$Test
[1] "BoxM"

attr(,"class")
[1] "MVTests" "list"   
====================COR MATRIX DETERMINANTS==================
Determinants:
 Setosa : 0.3533595 
 Versicolor : 0.08359418 
 Virginica : 0.1373902 
 

The small p-value from Box's M test indicates that the variance-covariance matrices are not homogeneous, which violates one of the main assumptions, given that LDA is equivalent to MANOVA. Box's M is also known to be sensitive to departures from normality, which we observe below, so this result should be read with some caution.

The setosa group is the only one for which multivariate normality is not rejected at the 5% level (p = 0.079). For the versicolor and virginica groups, normality is rejected even at the 1% level, which violates the normality assumption.

The boxplots show the presence of outliers in the dataset, another violation of an important LDA assumption. We will fit two models, one with the data as is and another correcting for some of the violations, and see how they perform (one candidate correction is sketched below).
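
One natural candidate for the corrected model is quadratic discriminant analysis, which relaxes the homogeneity assumption by estimating a separate covariance matrix per group. A minimal sketch using MASS::qda (in-sample accuracy only, so it will be optimistic):

# QDA drops LDA's pooled-covariance assumption
ir.qda <- qda(Species ~ ., data = iris.dat)
# proportion of training observations classified correctly
mean(predict(ir.qda, iris.dat)$class == iris.dat$Species)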

The per-group correlation matrix determinants are all positive, i.e., none of the matrices is singular and no group exhibits perfect multicollinearity, so we can proceed.

Group Means

iris.dat %>% 
  group_by(Species) %>% 
  summarise(Avg.SepLen = mean(Sepal.Length),
            Avg.SepWid = mean(Sepal.Width),
            Avg.PetLen = mean(Petal.Length),
            Avg.PetWid = mean(Petal.Width)) 
# A tibble: 3 x 5
  Species    Avg.SepLen Avg.SepWid Avg.PetLen Avg.PetWid
  <fct>           <dbl>      <dbl>      <dbl>      <dbl>
1 Setosa           5.01       3.43       1.46      0.246
2 Versicolor       5.94       2.77       4.26      1.33 
3 Virginica        6.59       2.97       5.55      2.03 



We can observe the means for each species above; setosa appears to be the species that structurally differs the most from the other two. The sketch below quantifies this.
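
One way to quantify the separation is the pairwise Mahalanobis distance between the group means. A minimal sketch, using the fact that with equal group sizes (50 each) the pooled within-group covariance is just the average of the per-species covariance matrices:

m <- aggregate(. ~ Species, data = iris.dat, FUN = mean)[,-1]
# pooled within-group covariance (groups are equally sized)
S <- Reduce(`+`, lapply(split(iris.dat[,-5], iris.dat$Species), cov)) / 3
d <- function(i, j) sqrt(mahalanobis(as.numeric(m[i,]), as.numeric(m[j,]), S))
c(set.vs.ver = d(1,2), set.vs.vir = d(1,3), ver.vs.vir = d(2,3))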

3D Plot

fig <- plot_ly(iris, x = ~Sepal.Length, y = ~Petal.Width, z = ~Petal.Length,
               color = ~Species, colors = c('red', 'blue', 'green'))
fig <- fig %>% add_markers()
fig <- fig %>% layout(scene = list(xaxis = list(title = 'Sepal Length'),
                                   yaxis = list(title = 'Petal Width'),
                                   zaxis = list(title = 'Petal Length')))
fig
fig



Although several of the assumptions were violated, the 3D plot above shows that three distinct clusters of data exist. We will proceed with our analysis and evaluate how the model classifies the data.

LDA Model

ir.out <- lda(Species ~ ., data = iris.dat)
ir.out
Call:
lda(Species ~ ., data = iris.dat)

Prior probabilities of groups:
    Setosa Versicolor  Virginica 
 0.3333333  0.3333333  0.3333333 

Group means:
           Sepal.Length Sepal.Width Petal.Length Petal.Width
Setosa            5.006       3.428        1.462       0.246
Versicolor        5.936       2.770        4.260       1.326
Virginica         6.588       2.974        5.552       2.026

Coefficients of linear discriminants:
                    LD1         LD2
Sepal.Length  0.8293776  0.02410215
Sepal.Width   1.5344731  2.16452123
Petal.Length -2.2012117 -0.93192121
Petal.Width  -2.8104603  2.83918785

Proportion of trace:
   LD1    LD2 
0.9912 0.0088 



We can see that the model yields two discriminant functions with the parameters: \[ D_1 = 0.83SL + 1.53SW - 2.20PL - 2.81PW \] \[ D_2 = 0.02SL + 2.16SW - 0.93PL + 2.84PW \]

The proportion of trace shows that LD1 accounts for 99.1% of the between-group variance, so nearly all of the separation between species happens along the first discriminant function.
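
As a sanity check on these coefficients, the scores can be recomputed by hand. A minimal sketch, assuming equal priors (as here), so that the centering point reduces to a plain average of the group means:

# center the predictors at the prior-weighted grand mean of the group
# means, then apply the discriminant coefficients
X <- as.matrix(iris.dat[,-5])
mu <- colMeans(ir.out$means)  # equal priors: simple average of group means
scores <- scale(X, center = mu, scale = FALSE) %*% ir.out$scaling
all.equal(unname(scores), unname(predict(ir.out)$x))  # should be TRUE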

Partitions

partimat(Species ~ ., data = iris)



The partition plot above shows, for each pair of variables, how the model partitions the feature space and which regions are assigned to each species. This is a nice visual way to see how the model makes its classifications.
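
partimat also accepts other classifiers through its method argument (the call above uses the default, "lda"); for instance, the same plot with the quadratic boundaries of the QDA model sketched earlier:

# quadratic instead of linear decision boundaries
partimat(Species ~ ., data = iris, method = "qda")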

Predictions

iris.pred <- predict(ir.out, iris.dat)  # in-sample predictions and LD scores
ldahist(data = iris.pred$x[,1], g = iris.dat$Species)  # LD1 scores by species

ldahist(data = iris.pred$x[,2], g = iris.dat$Species)  # LD2 scores by species

plot(ir.out, col = as.integer(iris$Species))  # observations in LD1-LD2 space



We can see in the histograms that LD1 separates the groups well, whereas LD2 struggles to define group membership. However, given that 99% of the between-group variance is attributed to LD1, the poor performance of LD2 matters little.
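
To back this up with numbers, a quick sketch of the mean discriminant scores per species; the means should sit far apart on LD1 and nearly overlap on LD2:

# group means of the LD scores
aggregate(as.data.frame(iris.pred$x),
          by = list(Species = iris.dat$Species), FUN = mean)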

Confusion Matrix

# cbind() coerces the factor to its integer codes,
# so 1 = Setosa, 2 = Versicolor, 3 = Virginica below
iris.results <- iris.pred$class %>% cbind()
iris.prior <- cbind(iris.dat$Species)
ct <- table(iris.prior, iris.results)
ct
          iris.results
iris.prior  1  2  3
         1 50  0  0
         2  0 48  2
         3  0  1 49
prop.table(ct)
          iris.results
iris.prior           1           2           3
         1 0.333333333 0.000000000 0.000000000
         2 0.000000000 0.320000000 0.013333333
         3 0.000000000 0.006666667 0.326666667
prop.table(ct) %>% diag() %>% sum()
[1] 0.98



The model has 98% accuracy: it correctly classified 98% of the observations. This is very good performance despite the clear violations we saw earlier. A possible next step would be to test this classification model on out-of-sample data and evaluate its performance, since so far we have only evaluated it on the same data it was trained on.
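
In the absence of a holdout set, MASS::lda's built-in leave-one-out cross-validation (CV = TRUE) gives a rough proxy for out-of-sample performance; a minimal sketch:

# refit with leave-one-out CV; $class holds the held-out predictions
ir.cv <- lda(Species ~ ., data = iris.dat, CV = TRUE)
mean(ir.cv$class == iris.dat$Species)  # LOO-CV accuracy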

chisq.test(iris.dat$Species,iris.results)

    Pearson's Chi-squared test

data:  iris.dat$Species and iris.results
X-squared = 282.59, df = 4, p-value < 2.2e-16

The chi-squared test indicates that the association between the actual species and the predicted classes is statistically significant (p < 0.001): the model's classifications agree with the true labels far more often than chance would allow.