The Iris dataset is well-known among the data science crowd, and it is built into R:
View(iris)
head(iris)
## Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
The Iris dataset in R.
One critical condition must met – the dataset must have more observations (rows) per group in the independent variable than a number of the dependent variables. For example, the Iris dataset has 3 groups and 4 dependent variables. This means we need more than 4 observations for each of the flower species.
Let’s visualize the boxplot for every dependent variable. There will be 4 plots in total arranged in a 2×2 grid, each having a dedicated boxplot for a specific flower species:
library(ggplot2)
library(gridExtra)
box_sl <- ggplot(iris, aes(x = Species, y = Sepal.Length, fill = Species)) +
geom_boxplot() +
theme(legend.position = "top")
box_sw <- ggplot(iris, aes(x = Species, y = Sepal.Width, fill = Species)) +
geom_boxplot() +
theme(legend.position = "top")
box_pl <- ggplot(iris, aes(x = Species, y = Petal.Length, fill = Species)) +
geom_boxplot() +
theme(legend.position = "top")
box_pw <- ggplot(iris, aes(x = Species, y = Petal.Width, fill = Species)) +
geom_boxplot() +
theme(legend.position = "top")
grid.arrange(box_sl, box_sw, box_pl, box_pw, ncol = 2, nrow = 2)
Boxplots for all dependent variables and all factors of the independent
variable.
It seems like the setosa species is more separated than the other two, but let’s not jump to conclusions.
We can now perform a one-way MANOVA in R.
dependent_vars <- cbind(iris$Sepal.Length, iris$Sepal.Width, iris$Petal.Length, iris$Petal.Width)
independent_var <- iris$Species
manova_model <- manova(dependent_vars ~ independent_var, data = iris)
summary(manova_model)
## Df Pillai approx F num Df den Df Pr(>F)
## independent_var 2 1.1919 53.466 8 290 < 2.2e-16 ***
## Residuals 147
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
MANOVA in R test summary.
The P-value is practically zero, which means we can safely reject the null hypothesis in the favor of the alternative one – at least one group mean vector differs from the rest.
We could also measure the effect size.
library(effectsize)
## Warning: package 'effectsize' was built under R version 4.1.3
## Registered S3 method overwritten by 'parameters':
## method from
## format.parameters_distribution datawizard
eta_squared(manova_model)
## # Effect Size for ANOVA (Type I)
##
## Parameter | Eta2 (partial) | 95% CI
## -----------------------------------------------
## independent_var | 0.60 | [0.54, 1.00]
##
## - One-sided CIs: upper bound fixed at (1).
Partial Eta Squared value for the MANOVA test.
The value is 0.6, which means the effect size is large.
A post-hoc test will come into play to know which group or groups are different from the rest.
Implement Linear Discriminant Analysis in R using the lda() function from the MASS package:
library(MASS)
iris_lda <- lda(independent_var ~ dependent_vars, CV = F)
iris_lda
## Call:
## lda(independent_var ~ dependent_vars, CV = F)
##
## Prior probabilities of groups:
## setosa versicolor virginica
## 0.3333333 0.3333333 0.3333333
##
## Group means:
## dependent_vars1 dependent_vars2 dependent_vars3 dependent_vars4
## setosa 5.006 3.428 1.462 0.246
## versicolor 5.936 2.770 4.260 1.326
## virginica 6.588 2.974 5.552 2.026
##
## Coefficients of linear discriminants:
## LD1 LD2
## dependent_vars1 0.8293776 0.02410215
## dependent_vars2 1.5344731 2.16452123
## dependent_vars3 -2.2012117 -0.93192121
## dependent_vars4 -2.8104603 2.83918785
##
## Proportion of trace:
## LD1 LD2
## 0.9912 0.0088
Linear Discriminant Analysis results.
Take a look at the coefficients to see how the dependent variables are used to form the LDA decision rule. LD1 is calculated as
LD1 = 0.83 * iris$Sepal.Length + 1.53 * iris$Sepal.Width - 2.20 * iris$Petal.Length - 2.81 * iris$Petal.Width
LD1
## [1] 5.946 5.015 5.375 4.699 6.016 5.585 5.097 5.490 4.447
## [10] 5.229 6.281 5.104 5.213 5.458 7.732 7.039 6.465 5.665
## [19] 5.962 5.904 5.382 5.470 6.564 4.137 4.444 4.658 4.708
## [28] 5.809 5.876 4.715 4.645 5.260 7.008 7.349 4.948 5.844
## [37] 6.498 6.214 4.820 5.573 5.802 3.551 5.126 4.299 4.743
## [46] 4.651 5.965 5.072 6.198 5.557 -3.568 -3.907 -4.525 -4.369
## [55] -4.656 -4.538 -4.558 -2.331 -3.858 -4.067 -3.300 -3.968 -3.264
## [64] -4.774 -2.488 -3.310 -4.877 -2.885 -5.603 -3.198 -5.825 -3.106
## [73] -5.941 -4.365 -3.364 -3.546 -4.566 -5.626 -4.698 -1.801 -3.214
## [82] -2.713 -3.007 -6.605 -5.043 -4.214 -4.251 -4.585 -3.435 -4.063
## [91] -4.509 -4.401 -3.380 -2.401 -4.114 -3.291 -3.725 -3.530 -1.633
## [100] -3.658 -9.947 -7.614 -8.398 -7.712 -8.957 -9.523 -6.785 -8.422
## [109] -8.432 -8.961 -6.549 -7.556 -7.767 -8.064 -8.866 -7.915 -7.173
## [118] -8.717 -11.274 -6.869 -8.380 -7.468 -9.685 -6.478 -7.831 -7.386
## [127] -6.188 -6.185 -8.625 -6.690 -8.333 -7.329 -8.906 -5.922 -7.213
## [136] -8.902 -8.633 -7.103 -6.048 -7.311 -8.760 -7.213 -7.614 -8.903
## [145] -8.955 -7.752 -7.285 -7.075 -7.995 -6.791
The snippet below uses the predict() function to get the linear discriminants and combines them with our independent variable:
lda_df <- data.frame(
species = iris[, "Species"],
lda = predict(iris_lda)$x
)
lda_df
## species lda.LD1 lda.LD2
## 1 setosa 8.0617998 0.300420621
## 2 setosa 7.1286877 -0.786660426
## 3 setosa 7.4898280 -0.265384488
## 4 setosa 6.8132006 -0.670631068
## 5 setosa 8.1323093 0.514462530
## 6 setosa 7.7019467 1.461720967
## 7 setosa 7.2126176 0.355836209
## 8 setosa 7.6052935 -0.011633838
## 9 setosa 6.5605516 -1.015163624
## 10 setosa 7.3430599 -0.947319209
## 11 setosa 8.3973865 0.647363392
## 12 setosa 7.2192969 -0.109646389
## 13 setosa 7.3267960 -1.072989426
## 14 setosa 7.5724707 -0.805464137
## 15 setosa 9.8498430 1.585936985
## 16 setosa 9.1582389 2.737596471
## 17 setosa 8.5824314 1.834489452
## 18 setosa 7.7807538 0.584339407
## 19 setosa 8.0783588 0.968580703
## 20 setosa 8.0209745 1.140503656
## 21 setosa 7.4968023 -0.188377220
## 22 setosa 7.5864812 1.207970318
## 23 setosa 8.6810429 0.877590154
## 24 setosa 6.2514036 0.439696367
## 25 setosa 6.5589334 -0.389222752
## 26 setosa 6.7713832 -0.970634453
## 27 setosa 6.8230803 0.463011612
## 28 setosa 7.9246164 0.209638715
## 29 setosa 7.9912902 0.086378713
## 30 setosa 6.8294645 -0.544960851
## 31 setosa 6.7589549 -0.759002759
## 32 setosa 7.3749525 0.565844592
## 33 setosa 9.1263463 1.224432671
## 34 setosa 9.4676820 1.825226345
## 35 setosa 7.0620139 -0.663400423
## 36 setosa 7.9587624 -0.164961722
## 37 setosa 8.6136720 0.403253602
## 38 setosa 8.3304176 0.228133530
## 39 setosa 6.9341201 -0.705519379
## 40 setosa 7.6882313 -0.009223623
## 41 setosa 7.9179372 0.675121313
## 42 setosa 5.6618807 -1.934355243
## 43 setosa 7.2410147 -0.272615132
## 44 setosa 6.4144356 1.247301306
## 45 setosa 6.8594438 1.051653957
## 46 setosa 6.7647039 -0.505151855
## 47 setosa 8.0818994 0.763392750
## 48 setosa 7.1867690 -0.360986823
## 49 setosa 8.3144488 0.644953177
## 50 setosa 7.6719674 -0.134893840
## 51 versicolor -1.4592755 0.028543764
## 52 versicolor -1.7977057 0.484385502
## 53 versicolor -2.4169489 -0.092784031
## 54 versicolor -2.2624735 -1.587252508
## 55 versicolor -2.5486784 -0.472204898
## 56 versicolor -2.4299673 -0.966132066
## 57 versicolor -2.4484846 0.795961954
## 58 versicolor -0.2226665 -1.584673183
## 59 versicolor -1.7502012 -0.821180130
## 60 versicolor -1.9584224 -0.351563753
## 61 versicolor -1.1937603 -2.634455704
## 62 versicolor -1.8589257 0.319006544
## 63 versicolor -1.1580939 -2.643409913
## 64 versicolor -2.6660572 -0.642504540
## 65 versicolor -0.3783672 0.086638931
## 66 versicolor -1.2011726 0.084437359
## 67 versicolor -2.7681025 0.032199536
## 68 versicolor -0.7768540 -1.659161847
## 69 versicolor -3.4980543 -1.684956162
## 70 versicolor -1.0904279 -1.626583496
## 71 versicolor -3.7158961 1.044514421
## 72 versicolor -0.9976104 -0.490530602
## 73 versicolor -3.8352593 -1.405958061
## 74 versicolor -2.2574125 -1.426794234
## 75 versicolor -1.2557133 -0.546424197
## 76 versicolor -1.4375576 -0.134424979
## 77 versicolor -2.4590614 -0.935277280
## 78 versicolor -3.5184849 0.160588866
## 79 versicolor -2.5897987 -0.174611728
## 80 versicolor 0.3074879 -1.318871459
## 81 versicolor -1.1066918 -1.752253714
## 82 versicolor -0.6055246 -1.942980378
## 83 versicolor -0.8987038 -0.904940034
## 84 versicolor -4.4984664 -0.882749915
## 85 versicolor -2.9339780 0.027379106
## 86 versicolor -2.1036082 1.191567675
## 87 versicolor -2.1425821 0.088779781
## 88 versicolor -2.4794560 -1.940739273
## 89 versicolor -1.3255257 -0.162869550
## 90 versicolor -1.9555789 -1.154348262
## 91 versicolor -2.4015702 -1.594583407
## 92 versicolor -2.2924888 -0.332860296
## 93 versicolor -1.2722722 -1.214584279
## 94 versicolor -0.2931761 -1.798715092
## 95 versicolor -2.0059888 -0.905418042
## 96 versicolor -1.1816631 -0.537570242
## 97 versicolor -1.6161564 -0.470103580
## 98 versicolor -1.4215888 -0.551244626
## 99 versicolor 0.4759738 -0.799905482
## 100 versicolor -1.5494826 -0.593363582
## 101 virginica -7.8394740 2.139733449
## 102 virginica -5.5074800 -0.035813989
## 103 virginica -6.2920085 0.467175777
## 104 virginica -5.6054563 -0.340738058
## 105 virginica -6.8505600 0.829825394
## 106 virginica -7.4181678 -0.173117995
## 107 virginica -4.6779954 -0.499095015
## 108 virginica -6.3169269 -0.968980756
## 109 virginica -6.3277368 -1.383289934
## 110 virginica -6.8528134 2.717589632
## 111 virginica -4.4407251 1.347236918
## 112 virginica -5.4500957 -0.207736942
## 113 virginica -5.6603371 0.832713617
## 114 virginica -5.9582372 -0.094017545
## 115 virginica -6.7592628 1.600232061
## 116 virginica -5.8070433 2.010198817
## 117 virginica -5.0660123 -0.026273384
## 118 virginica -6.6088188 1.751635872
## 119 virginica -9.1714749 -0.748255067
## 120 virginica -4.7645357 -2.155737197
## 121 virginica -6.2728391 1.649481407
## 122 virginica -5.3607119 0.646120732
## 123 virginica -7.5811998 -0.980722934
## 124 virginica -4.3715028 -0.121297458
## 125 virginica -5.7231753 1.293275530
## 126 virginica -5.2791592 -0.042458238
## 127 virginica -4.0808721 0.185936572
## 128 virginica -4.0770364 0.523238483
## 129 virginica -6.5191040 0.296976389
## 130 virginica -4.5837194 -0.856815813
## 131 virginica -6.2282401 -0.712719638
## 132 virginica -5.2204877 1.468195094
## 133 virginica -6.8001500 0.580895175
## 134 virginica -3.8151597 -0.942985932
## 135 virginica -5.1074897 -2.130589999
## 136 virginica -6.7967163 0.863090395
## 137 virginica -6.5244960 2.445035271
## 138 virginica -4.9955028 0.187768525
## 139 virginica -3.9398530 0.614020389
## 140 virginica -5.2038309 1.144768076
## 141 virginica -6.6530868 1.805319760
## 142 virginica -5.1055595 1.992182010
## 143 virginica -5.5074800 -0.035813989
## 144 virginica -6.7960192 1.460686950
## 145 virginica -6.8473594 2.428950671
## 146 virginica -5.6450035 1.677717335
## 147 virginica -5.1795646 -0.363475041
## 148 virginica -4.9677409 0.821140550
## 149 virginica -5.8861454 2.345090513
## 150 virginica -4.6831543 0.332033811
LDA dataset.
The final step in this post-hoc test is to visualize the above lda_df as a scatter plot. Ideally, we should see one or multiple groups stand out:
ggplot(lda_df) +
geom_point(aes(x = lda.LD1, y = lda.LD2, color = species), size = 4) +
theme_classic()
LDA dataset as a scatter plot.
The setosa species is significantly different when compared to virginica and versicolor. These two are more similar, suggesting that it was the setosa group that had the most impact for us to reject the null hypothesis.
Conclusion: The group mean vector of the setosa class is significantly different from the other group means, so it’s safe to assume it was a crucial factor for rejecting the null hypothesis.