This is a supervised multivariate classification method and one of the most powerful tools a data analyst has. With it we can assess the significance of the variables in a model and make predictions. To build the model we need an explanatory matrix X and a response matrix Y. We will use the iris dataset as an example.
Among the drawbacks of this model is the large number of assumptions we have to take care of:

1. Similar sample sizes across groups.
2. No NA or outlier values.
3. No multicollinearity.
4. Multivariate normality.
5. Multivariate homoscedasticity.
6. Linear relationships between variables.
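Several add-on packages are used along the way. A minimal setup sketch (it assumes the packages are already installed; outliers and faraway are called later with the :: prefix):

library(vegan)       # decostand(), vegdist(), betadisper(), permutest()
library(MASS)        # lda(), qda()
library(klaR)        # partimat()
library(mvnormtest)  # mshapiro.test()
library(ellipse)     # ellipse()
library(moments)     # skewness() -- other packages provide a similar function
library(psych)       # cohen.kappa(), pairs.panels()
library(ggord)       # ggord()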
We will work with Fisher’s iris dataset.
iris
Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
---|---|---|---|---|
5.1 | 3.5 | 1.4 | 0.2 | setosa |
4.9 | 3.0 | 1.4 | 0.2 | setosa |
4.7 | 3.2 | 1.3 | 0.2 | setosa |
4.6 | 3.1 | 1.5 | 0.2 | setosa |
5.0 | 3.6 | 1.4 | 0.2 | setosa |
5.4 | 3.9 | 1.7 | 0.4 | setosa |
4.6 | 3.4 | 1.4 | 0.3 | setosa |
5.0 | 3.4 | 1.5 | 0.2 | setosa |
4.4 | 2.9 | 1.4 | 0.2 | setosa |
4.9 | 3.1 | 1.5 | 0.1 | setosa |
# Hellinger-transform the four measurements and cut a Ward clustering into k = 3 groups
iris.hle <- decostand(as.matrix(iris[1:4]), "hellinger")
gr <- cutree(hclust(vegdist(iris.hle, "euc"), "ward.D"), 3)
table(gr)
## gr
## 1 2 3
## 50 48 52
We asked for k = 3 groups, as suggested by the three species in the data. The clustering splits the observations into three groups of similar size.
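As a quick sanity check (not part of the original output), we can cross-tabulate the clusters against the known species labels:

table(gr, iris$Species)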
LDA is a parametric model, so we need to check and control the assumptions listed above.
any(is.na(iris))
## [1] FALSE
There are no NA values.
# betadisper() tests the homogeneity of multivariate dispersions across the groups
iris.pars <- as.matrix(iris[, 1:4])
iris.pars.d <- dist(iris.pars)
(iris.MHV <- betadisper(iris.pars.d, gr))
##
## Homogeneity of multivariate dispersions
##
## Call: betadisper(d = iris.pars.d, group = gr)
##
## No. of Positive Eigenvalues: 4
## No. of Negative Eigenvalues: 0
##
## Average distance to median:
## 1 2 3
## 0.4814 0.6990 0.8190
##
## Eigenvalues for PCoA axes:
## PCoA1 PCoA2 PCoA3 PCoA4 <NA> <NA> <NA> <NA>
## 630.0080 36.1579 11.6532 3.5514 NA NA NA NA
anova(iris.MHV)
## Analysis of Variance Table
##
## Response: Distances
## Df Sum Sq Mean Sq F value Pr(>F)
## Groups 2 2.9735 1.4867 10.884 3.909e-05 ***
## Residuals 147 20.0803 0.1366
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
permutest(iris.MHV)
##
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
##
## Response: Distances
## Df Sum Sq Mean Sq F N.Perm Pr(>F)
## Groups 2 2.9735 1.4867 10.884 999 0.001 ***
## Residuals 147 20.0803 0.1366
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We cannot accept the homogeneity assumption, so we will try transforming the data and removing outlier values.
skewness(iris[1:4])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.3086407 0.3126147 -0.2694109 -0.1009166
outliers::outlier(iris[2])
## Sepal.Width
## 4.4
# Flag the extreme Sepal.Width values...
which(iris[2] >= 4.1 | iris[2] < 2.2)
## [1] 16 33 34 61
# ...and replace them with the column mean
iris[c(16, 33, 34, 61), 2] <- mean(iris[,2])
# Log-transform Sepal.Length to reduce its right skew
iris2 <- cbind(log(iris[1]), iris[2], iris[3], iris[4])
skewness(iris2[1:4])
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 0.04272597 0.12722662 -0.26941093 -0.10091657
After the transformation, the first two variables are only slightly right-skewed and the last two slightly left-skewed.
iris.pars2 <- as.matrix(iris2)
iris.pars.d2 <- dist(iris.pars2)
(iris.MHV2 <- betadisper(iris.pars.d2, gr))
##
## Homogeneity of multivariate dispersions
##
## Call: betadisper(d = iris.pars.d2, group = gr)
##
## No. of Positive Eigenvalues: 4
## No. of Negative Eigenvalues: 0
##
## Average distance to median:
## 1 2 3
## 0.3401 0.5081 0.6266
##
## Eigenvalues for PCoA axes:
## PCoA1 PCoA2 PCoA3 PCoA4 <NA> <NA> <NA> <NA>
## 551.4433 19.7877 5.1255 0.4618 NA NA NA NA
anova(iris.MHV2)
## Analysis of Variance Table
##
## Response: Distances
## Df Sum Sq Mean Sq F value Pr(>F)
## Groups 2 2.1079 1.05394 14.288 2.137e-06 ***
## Residuals 147 10.8436 0.07377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
permutest(iris.MHV2)
##
## Permutation test for homogeneity of multivariate dispersions
## Permutation: free
## Number of permutations: 999
##
## Response: Distances
## Df Sum Sq Mean Sq F N.Perm Pr(>F)
## Groups 2 2.1079 1.05394 14.288 999 0.001 ***
## Residuals 147 10.8436 0.07377
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We cannot accept the homogeneity assumption even after transforming the data. In this case we should consider a model that relaxes it, such as Quadratic Discriminant Analysis (QDA), which does not require equal covariance matrices across groups.
par(mfrow = c(1, ncol(iris.pars2)))
for (j in 1:ncol(iris.pars2)) {
  hist(iris.pars2[, j])
}
mshapiro.test(t(iris.pars2))  # multivariate Shapiro-Wilk normality test
##
## Shapiro-Wilk normality test
##
## data: Z
## W = 0.97503, p-value = 0.007804
We cannot accept the multivariate normality assumption either, mainly because of the non-normality of the third and fourth variables.
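To see which variables drive the rejection, we can run univariate Shapiro-Wilk tests on each column (a quick sketch; output not shown):

apply(iris.pars2, 2, function(x) shapiro.test(x)$p.value)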
as.dist(cor(iris.pars2))
## Sepal.Length Sepal.Width Petal.Length
## Sepal.Width -0.1377389
## Petal.Length 0.8786347 -0.4001647
## Petal.Width 0.8281772 -0.3359133 0.9628654
faraway::vif(iris.pars2)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## 6.236970 1.729705 27.477004 15.465980
There is a multicollinearity problem between Petal.Length and Petal.Width (their VIFs are well above 10). We will continue the analysis without Petal.Length to improve the results.
iris.pars3 <- iris.pars[, -3]
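To confirm that dropping Petal.Length removes the collinearity problem, we can recompute the variance inflation factors on the reduced matrix (a quick check, not part of the original output):

faraway::vif(iris.pars3)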
psych::pairs.panels(iris[1:4], gap = 0, bg = c("red", "blue", "green")[iris$Species], pch = 21)
There are some non-linear relationships, so a KDA or K-mDA model might perform better, but we will continue with the LDA model for the sake of the example.
iris.pars3.df <- as.data.frame(iris.pars3)
(iris.lda <- lda(gr ~ Sepal.Length + Sepal.Width + Petal.Width, data = iris.pars3.df))
## Call:
## lda(gr ~ Sepal.Length + Sepal.Width + Petal.Width, data = iris.pars3.df)
##
## Prior probabilities of groups:
## 1 2 3
## 0.3333333 0.3200000 0.3466667
##
## Group means:
## Sepal.Length Sepal.Width Petal.Width
## 1 5.006000 3.428000 0.246000
## 2 5.927083 2.777083 1.316667
## 3 6.571154 2.959615 2.007692
##
## Coefficients of linear discriminants:
## LD1 LD2
## Sepal.Length 0.6532649 -0.659779
## Sepal.Width -2.4718234 2.819200
## Petal.Width 4.9757063 1.464479
##
## Proportion of trace:
## LD1 LD2
## 0.9896 0.0104
iris.lda$count
## 1 2 3
## 50 48 52
The discriminant functions of the model are: \(LD1 = 0.653 \cdot Sepal.Length - 2.472 \cdot Sepal.Width + 4.976 \cdot Petal.Width\) and \(LD2 = -0.660 \cdot Sepal.Length + 2.819 \cdot Sepal.Width + 1.464 \cdot Petal.Width\).
The proportion of trace indicates that the first linear discriminant alone achieves about 99% of the discrimination.
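The scores in predict(iris.lda)$x are obtained by applying these coefficients to the data after centring it on the prior-weighted average of the group means. A short sketch reproducing them by hand (it should match the scores returned by predict()):

ctr <- colSums(iris.lda$prior * iris.lda$means)  # prior-weighted grand mean
scores <- scale(as.matrix(iris.pars3.df), center = ctr, scale = FALSE) %*% iris.lda$scaling
head(scores)  # compare with head(predict(iris.lda)$x)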
ggord(iris.lda, iris$Species)
There is some overlap between the versicolor and virginica species.
partimat(factor(gr) ~ Sepal.Length + Sepal.Width + Petal.Width, data = iris.pars3.df,
method = "lda", nplots.vert = 1)
Fp <- predict(iris.lda)$x
(iris.class <- predict(iris.lda)$class)
## [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## [36] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 2
## [71] 3 2 2 2 2 2 2 3 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3
## [106] 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 3 2 3 3 3 2 2 3 3 3 3 3
## [141] 3 3 3 3 3 3 3 3 3 3
## Levels: 1 2 3
Here we have the partimat plot. Misclassified observations are shown in red, and the classification regions are shaded in blue, white and pink.
par(mfrow = c(1, 1))
plot(Fp[, 1], Fp[, 2], type = "n")
text(Fp[, 1], Fp[, 2], row.names(iris), col = c(as.numeric(iris.class) + 1))
abline(v = 0, lty = "dotted")
abline(h = 0, lty = "dotted")
# Draw a 95% confidence ellipse around each group in the discriminant space
for (i in 1:length(levels(as.factor(gr)))) {
  cov <- cov(Fp[gr == i, ])
  centre <- apply(Fp[gr == i, ], 2, mean)
  lines(ellipse(cov, centre = centre, level = 0.95))
}
Here again we can see some confusion between groups 2 and 3 (the versicolor and virginica species).
iris.lda$svd ^ 2
## [1] 1589.57124 16.66553
100 * iris.lda$svd ^ 2/ sum(iris.lda$svd ^ 2)
## [1] 98.962449 1.037551
The first discriminant function explains about 99% of the between-group variance, so the separation can essentially be described in one dimension, using LD1 alone.
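To visualise this, we could plot the LD1 scores by group with MASS's ldahist() (a quick sketch, not among the original figures):

ldahist(Fp[, 1], g = gr)  # LD1 alone already separates the three groups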
punt <- predict(iris.lda)$x
summary(lm(punt ~ gr))
## Response LD1 :
##
## Call:
## lm(formula = LD1 ~ gr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0906 -0.9410 -0.0288 1.0436 3.9210
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.9928 0.2996 -36.69 <2e-16 ***
## gr 5.4600 0.1377 39.65 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.391 on 148 degrees of freedom
## Multiple R-squared: 0.914, Adjusted R-squared: 0.9134
## F-statistic: 1572 on 1 and 148 DF, p-value: < 2.2e-16
##
##
## Response LD2 :
##
## Call:
## lm(formula = LD2 ~ gr)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.8107 -0.6906 -0.0500 0.6202 2.8304
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.2408 0.2369 -1.017 0.311
## gr 0.1196 0.1089 1.099 0.274
##
## Residual standard error: 1.099 on 148 degrees of freedom
## Multiple R-squared: 0.008091, Adjusted R-squared: 0.001389
## F-statistic: 1.207 on 1 and 148 DF, p-value: 0.2737
The groups explain about 91% of the variance of LD1 (R² = 0.914), while LD2 is essentially unrelated to the groups, so the first discriminant function captures the group structure quite well.
iris.class <- predict(iris.lda)$class
(iris.table <- table(gr, iris.class))
## iris.class
## gr 1 2 3
## 1 50 0 0
## 2 0 45 3
## 3 0 5 47
sum(diag(iris.table))/sum(iris.table)
## [1] 0.9466667
cohen.kappa(iris.table)
## Call: cohen.kappa1(x = x, w = w, n.obs = n.obs, alpha = alpha, levels = levels)
##
## Cohen Kappa and Weighted Kappa correlation coefficients and confidence boundaries
## lower estimate upper
## unweighted kappa 0.87 0.92 0.97
## weighted kappa 0.93 0.96 0.99
##
## Number of subjects = 150
cor.test(gr, as.numeric(iris.class), method = "kendall")
##
## Kendall's rank correlation tau
##
## data: gr and as.numeric(iris.class)
## z = 13.001, p-value < 2.2e-16
## alternative hypothesis: true tau is not equal to 0
## sample estimates:
## tau
## 0.9469192
Our model correctly classifies about 95% of the data (CCR = 0.947), and Cohen's kappa (0.92) indicates that only around 2% of the successful predictions can be attributed to chance.
iris.lda.jac <- lda(gr ~ Sepal.Length + Sepal.Width + Petal.Width, data = iris.pars3.df, CV = TRUE)
summary(iris.lda.jac)
## Length Class Mode
## class 150 factor numeric
## posterior 450 -none- numeric
## terms 3 terms call
## call 4 -none- call
## xlevels 0 -none- list
iris.jac.class <- iris.lda.jac$class
iris.jac.table <- table(gr, iris.jac.class)
diag(prop.table(iris.jac.table, 1))
## 1 2 3
## 1.0000000 0.9375000 0.8846154
diag(prop.table(iris.table, 1))
## 1 2 3
## 1.0000000 0.9375000 0.9038462
The cross-validated (jackknife) model achieves practically the same success rates as the previous one (only group 3 drops slightly), so there is no improvement, but also no sign of overfitting.
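For a direct overall comparison (not shown in the original output), we can also compute the cross-validated correct classification rate:

sum(diag(iris.jac.table)) / sum(iris.jac.table)  # overall CCR under leave-one-out CV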
Let’s test the predictive capability of our model with a new observation, for instance: Sepal.Length = 5, Sepal.Width = 3.2, Petal.Width = 0.1 (Petal.Length was dropped from the model, so it is not needed).
pred <- c(5, 3.2, 0.1)
pred <- as.data.frame(t(pred))
colnames(pred) <- colnames(iris.pars3)
(pred.result <- predict(iris.lda, newdata = pred))
## $class
## [1] 1
## Levels: 1 2 3
##
## $posterior
## 1 2 3
## 1 1 1.800572e-13 1.294227e-27
##
## $x
## LD1 LD2
## 1 -6.373527 -0.6513308
The model classifies this flower as a setosa iris (group 1), with a posterior probability of essentially 1.
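predict() also accepts several new observations at once. For example, with two hypothetical flowers (made-up measurements, just to illustrate the call):

new.flowers <- data.frame(Sepal.Length = c(6.1, 6.7),  # hypothetical values
                          Sepal.Width  = c(2.8, 3.0),
                          Petal.Width  = c(1.3, 2.1))
predict(iris.lda, newdata = new.flowers)$class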
QDA can be considered a more flexible version of LDA: it does not assume equal covariance matrices across groups, so it is especially indicated when the homoscedasticity assumption does not hold, as in our case.
iris.qda <- qda(iris[,-5], iris[,5], CV = TRUE)
(tqda <- table(Original = iris$Species, Prediction = iris.qda$class))
## Prediction
## Original setosa versicolor virginica
## setosa 50 0 0
## versicolor 0 48 2
## virginica 0 1 49
diag(prop.table(tqda, 1))
## setosa versicolor virginica
## 1.00 0.96 0.98
sum(diag(tqda))/sum(tqda)
## [1] 0.98
Using QDA improves the discrimination to a CCR of 0.98, a few points better than the LDA. QDA is the better choice here, since we could not accept the homoscedasticity assumption required by LDA.
partimat(factor(gr) ~ Sepal.Length + Sepal.Width + Petal.Width, data = iris.pars3.df,
method = "qda", nplots.vert = 1)