The Iris flower dataset, or Fisher’s Iris data set, is a multivariate
data set used and made famous by the British statistician and biologist
Ronald Fisher in his 1936 paper "The use of multiple measurements in
taxonomic problems" as an example of linear discriminant analysis (Fisher 1936).
The data set consists of 50 samples from each of three species of
Iris (Iris setosa, Iris virginica and Iris versicolor). Four
features were measured from each sample: the length and the width of the
sepals and petals, in centimeters. Based on the combination of these
four features, Fisher developed a linear discriminant model to
distinguish the species from each other.
library(knitr)      # kable()
library(kableExtra) # kable_styling(), scroll_box() and the %>% pipe
kable(iris) %>% kable_styling(fixed_thead = TRUE, full_width = FALSE) %>%
scroll_box(height = "600px")
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.6 | 1.4 | 0.2 | setosa |
| 5.4 | 3.9 | 1.7 | 0.4 | setosa |
| 4.6 | 3.4 | 1.4 | 0.3 | setosa |
| 5.0 | 3.4 | 1.5 | 0.2 | setosa |
| 4.4 | 2.9 | 1.4 | 0.2 | setosa |
| 4.9 | 3.1 | 1.5 | 0.1 | setosa |
| 5.4 | 3.7 | 1.5 | 0.2 | setosa |
| 4.8 | 3.4 | 1.6 | 0.2 | setosa |
| 4.8 | 3.0 | 1.4 | 0.1 | setosa |
| 4.3 | 3.0 | 1.1 | 0.1 | setosa |
| 5.8 | 4.0 | 1.2 | 0.2 | setosa |
| 5.7 | 4.4 | 1.5 | 0.4 | setosa |
| 5.4 | 3.9 | 1.3 | 0.4 | setosa |
| 5.1 | 3.5 | 1.4 | 0.3 | setosa |
| 5.7 | 3.8 | 1.7 | 0.3 | setosa |
| 5.1 | 3.8 | 1.5 | 0.3 | setosa |
| 5.4 | 3.4 | 1.7 | 0.2 | setosa |
| 5.1 | 3.7 | 1.5 | 0.4 | setosa |
| 4.6 | 3.6 | 1.0 | 0.2 | setosa |
| 5.1 | 3.3 | 1.7 | 0.5 | setosa |
| 4.8 | 3.4 | 1.9 | 0.2 | setosa |
| 5.0 | 3.0 | 1.6 | 0.2 | setosa |
| 5.0 | 3.4 | 1.6 | 0.4 | setosa |
| 5.2 | 3.5 | 1.5 | 0.2 | setosa |
| 5.2 | 3.4 | 1.4 | 0.2 | setosa |
| 4.7 | 3.2 | 1.6 | 0.2 | setosa |
| 4.8 | 3.1 | 1.6 | 0.2 | setosa |
| 5.4 | 3.4 | 1.5 | 0.4 | setosa |
| 5.2 | 4.1 | 1.5 | 0.1 | setosa |
| 5.5 | 4.2 | 1.4 | 0.2 | setosa |
| 4.9 | 3.1 | 1.5 | 0.2 | setosa |
| 5.0 | 3.2 | 1.2 | 0.2 | setosa |
| 5.5 | 3.5 | 1.3 | 0.2 | setosa |
| 4.9 | 3.6 | 1.4 | 0.1 | setosa |
| 4.4 | 3.0 | 1.3 | 0.2 | setosa |
| 5.1 | 3.4 | 1.5 | 0.2 | setosa |
| 5.0 | 3.5 | 1.3 | 0.3 | setosa |
| 4.5 | 2.3 | 1.3 | 0.3 | setosa |
| 4.4 | 3.2 | 1.3 | 0.2 | setosa |
| 5.0 | 3.5 | 1.6 | 0.6 | setosa |
| 5.1 | 3.8 | 1.9 | 0.4 | setosa |
| 4.8 | 3.0 | 1.4 | 0.3 | setosa |
| 5.1 | 3.8 | 1.6 | 0.2 | setosa |
| 4.6 | 3.2 | 1.4 | 0.2 | setosa |
| 5.3 | 3.7 | 1.5 | 0.2 | setosa |
| 5.0 | 3.3 | 1.4 | 0.2 | setosa |
| 7.0 | 3.2 | 4.7 | 1.4 | versicolor |
| 6.4 | 3.2 | 4.5 | 1.5 | versicolor |
| 6.9 | 3.1 | 4.9 | 1.5 | versicolor |
| 5.5 | 2.3 | 4.0 | 1.3 | versicolor |
| 6.5 | 2.8 | 4.6 | 1.5 | versicolor |
| 5.7 | 2.8 | 4.5 | 1.3 | versicolor |
| 6.3 | 3.3 | 4.7 | 1.6 | versicolor |
| 4.9 | 2.4 | 3.3 | 1.0 | versicolor |
| 6.6 | 2.9 | 4.6 | 1.3 | versicolor |
| 5.2 | 2.7 | 3.9 | 1.4 | versicolor |
| 5.0 | 2.0 | 3.5 | 1.0 | versicolor |
| 5.9 | 3.0 | 4.2 | 1.5 | versicolor |
| 6.0 | 2.2 | 4.0 | 1.0 | versicolor |
| 6.1 | 2.9 | 4.7 | 1.4 | versicolor |
| 5.6 | 2.9 | 3.6 | 1.3 | versicolor |
| 6.7 | 3.1 | 4.4 | 1.4 | versicolor |
| 5.6 | 3.0 | 4.5 | 1.5 | versicolor |
| 5.8 | 2.7 | 4.1 | 1.0 | versicolor |
| 6.2 | 2.2 | 4.5 | 1.5 | versicolor |
| 5.6 | 2.5 | 3.9 | 1.1 | versicolor |
| 5.9 | 3.2 | 4.8 | 1.8 | versicolor |
| 6.1 | 2.8 | 4.0 | 1.3 | versicolor |
| 6.3 | 2.5 | 4.9 | 1.5 | versicolor |
| 6.1 | 2.8 | 4.7 | 1.2 | versicolor |
| 6.4 | 2.9 | 4.3 | 1.3 | versicolor |
| 6.6 | 3.0 | 4.4 | 1.4 | versicolor |
| 6.8 | 2.8 | 4.8 | 1.4 | versicolor |
| 6.7 | 3.0 | 5.0 | 1.7 | versicolor |
| 6.0 | 2.9 | 4.5 | 1.5 | versicolor |
| 5.7 | 2.6 | 3.5 | 1.0 | versicolor |
| 5.5 | 2.4 | 3.8 | 1.1 | versicolor |
| 5.5 | 2.4 | 3.7 | 1.0 | versicolor |
| 5.8 | 2.7 | 3.9 | 1.2 | versicolor |
| 6.0 | 2.7 | 5.1 | 1.6 | versicolor |
| 5.4 | 3.0 | 4.5 | 1.5 | versicolor |
| 6.0 | 3.4 | 4.5 | 1.6 | versicolor |
| 6.7 | 3.1 | 4.7 | 1.5 | versicolor |
| 6.3 | 2.3 | 4.4 | 1.3 | versicolor |
| 5.6 | 3.0 | 4.1 | 1.3 | versicolor |
| 5.5 | 2.5 | 4.0 | 1.3 | versicolor |
| 5.5 | 2.6 | 4.4 | 1.2 | versicolor |
| 6.1 | 3.0 | 4.6 | 1.4 | versicolor |
| 5.8 | 2.6 | 4.0 | 1.2 | versicolor |
| 5.0 | 2.3 | 3.3 | 1.0 | versicolor |
| 5.6 | 2.7 | 4.2 | 1.3 | versicolor |
| 5.7 | 3.0 | 4.2 | 1.2 | versicolor |
| 5.7 | 2.9 | 4.2 | 1.3 | versicolor |
| 6.2 | 2.9 | 4.3 | 1.3 | versicolor |
| 5.1 | 2.5 | 3.0 | 1.1 | versicolor |
| 5.7 | 2.8 | 4.1 | 1.3 | versicolor |
| 6.3 | 3.3 | 6.0 | 2.5 | virginica |
| 5.8 | 2.7 | 5.1 | 1.9 | virginica |
| 7.1 | 3.0 | 5.9 | 2.1 | virginica |
| 6.3 | 2.9 | 5.6 | 1.8 | virginica |
| 6.5 | 3.0 | 5.8 | 2.2 | virginica |
| 7.6 | 3.0 | 6.6 | 2.1 | virginica |
| 4.9 | 2.5 | 4.5 | 1.7 | virginica |
| 7.3 | 2.9 | 6.3 | 1.8 | virginica |
| 6.7 | 2.5 | 5.8 | 1.8 | virginica |
| 7.2 | 3.6 | 6.1 | 2.5 | virginica |
| 6.5 | 3.2 | 5.1 | 2.0 | virginica |
| 6.4 | 2.7 | 5.3 | 1.9 | virginica |
| 6.8 | 3.0 | 5.5 | 2.1 | virginica |
| 5.7 | 2.5 | 5.0 | 2.0 | virginica |
| 5.8 | 2.8 | 5.1 | 2.4 | virginica |
| 6.4 | 3.2 | 5.3 | 2.3 | virginica |
| 6.5 | 3.0 | 5.5 | 1.8 | virginica |
| 7.7 | 3.8 | 6.7 | 2.2 | virginica |
| 7.7 | 2.6 | 6.9 | 2.3 | virginica |
| 6.0 | 2.2 | 5.0 | 1.5 | virginica |
| 6.9 | 3.2 | 5.7 | 2.3 | virginica |
| 5.6 | 2.8 | 4.9 | 2.0 | virginica |
| 7.7 | 2.8 | 6.7 | 2.0 | virginica |
| 6.3 | 2.7 | 4.9 | 1.8 | virginica |
| 6.7 | 3.3 | 5.7 | 2.1 | virginica |
| 7.2 | 3.2 | 6.0 | 1.8 | virginica |
| 6.2 | 2.8 | 4.8 | 1.8 | virginica |
| 6.1 | 3.0 | 4.9 | 1.8 | virginica |
| 6.4 | 2.8 | 5.6 | 2.1 | virginica |
| 7.2 | 3.0 | 5.8 | 1.6 | virginica |
| 7.4 | 2.8 | 6.1 | 1.9 | virginica |
| 7.9 | 3.8 | 6.4 | 2.0 | virginica |
| 6.4 | 2.8 | 5.6 | 2.2 | virginica |
| 6.3 | 2.8 | 5.1 | 1.5 | virginica |
| 6.1 | 2.6 | 5.6 | 1.4 | virginica |
| 7.7 | 3.0 | 6.1 | 2.3 | virginica |
| 6.3 | 3.4 | 5.6 | 2.4 | virginica |
| 6.4 | 3.1 | 5.5 | 1.8 | virginica |
| 6.0 | 3.0 | 4.8 | 1.8 | virginica |
| 6.9 | 3.1 | 5.4 | 2.1 | virginica |
| 6.7 | 3.1 | 5.6 | 2.4 | virginica |
| 6.9 | 3.1 | 5.1 | 2.3 | virginica |
| 5.8 | 2.7 | 5.1 | 1.9 | virginica |
| 6.8 | 3.2 | 5.9 | 2.3 | virginica |
| 6.7 | 3.3 | 5.7 | 2.5 | virginica |
| 6.7 | 3.0 | 5.2 | 2.3 | virginica |
| 6.3 | 2.5 | 5.0 | 1.9 | virginica |
| 6.5 | 3.0 | 5.2 | 2.0 | virginica |
| 6.2 | 3.4 | 5.4 | 2.3 | virginica |
| 5.9 | 3.0 | 5.1 | 1.8 | virginica |
We want to perform a principal component analysis
(PCA) on the dataset in order to better understand the relationship
between the observations and the variables.
The first step is scaling the data: for every value in a column, we
subtract the column mean and then divide by the column standard
deviation. For the value in row \(i\) and column \(j\),
\[
X^*_{i,j} = \frac{X_{i,j} - \overline{X}_j}{S_{X_j}}
\]
This operation puts all the variables on the same scale, so that none of
them dominates the analysis simply because of its units or magnitude.
iris_scaled = data.frame(scale(iris[-5]))
iris_scaled$Species = iris$Species
kable(iris_scaled) %>% kable_styling(fixed_thead = T, full_width = FALSE) %>%
scroll_box( height = "600px")
| Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
|---|---|---|---|---|
| -0.8976739 | 1.0156020 | -1.3357516 | -1.3110521 | setosa |
| -1.1392005 | -0.1315388 | -1.3357516 | -1.3110521 | setosa |
| -1.3807271 | 0.3273175 | -1.3923993 | -1.3110521 | setosa |
| -1.5014904 | 0.0978893 | -1.2791040 | -1.3110521 | setosa |
| -1.0184372 | 1.2450302 | -1.3357516 | -1.3110521 | setosa |
| -0.5353840 | 1.9333146 | -1.1658087 | -1.0486668 | setosa |
| -1.5014904 | 0.7861738 | -1.3357516 | -1.1798595 | setosa |
| -1.0184372 | 0.7861738 | -1.2791040 | -1.3110521 | setosa |
| -1.7430170 | -0.3609670 | -1.3357516 | -1.3110521 | setosa |
| -1.1392005 | 0.0978893 | -1.2791040 | -1.4422448 | setosa |
| -0.5353840 | 1.4744583 | -1.2791040 | -1.3110521 | setosa |
| -1.2599638 | 0.7861738 | -1.2224563 | -1.3110521 | setosa |
| -1.2599638 | -0.1315388 | -1.3357516 | -1.4422448 | setosa |
| -1.8637803 | -0.1315388 | -1.5056946 | -1.4422448 | setosa |
| -0.0523308 | 2.1627428 | -1.4490469 | -1.3110521 | setosa |
| -0.1730941 | 3.0804554 | -1.2791040 | -1.0486668 | setosa |
| -0.5353840 | 1.9333146 | -1.3923993 | -1.0486668 | setosa |
| -0.8976739 | 1.0156020 | -1.3357516 | -1.1798595 | setosa |
| -0.1730941 | 1.7038865 | -1.1658087 | -1.1798595 | setosa |
| -0.8976739 | 1.7038865 | -1.2791040 | -1.1798595 | setosa |
| -0.5353840 | 0.7861738 | -1.1658087 | -1.3110521 | setosa |
| -0.8976739 | 1.4744583 | -1.2791040 | -1.0486668 | setosa |
| -1.5014904 | 1.2450302 | -1.5623422 | -1.3110521 | setosa |
| -0.8976739 | 0.5567457 | -1.1658087 | -0.9174741 | setosa |
| -1.2599638 | 0.7861738 | -1.0525134 | -1.3110521 | setosa |
| -1.0184372 | -0.1315388 | -1.2224563 | -1.3110521 | setosa |
| -1.0184372 | 0.7861738 | -1.2224563 | -1.0486668 | setosa |
| -0.7769106 | 1.0156020 | -1.2791040 | -1.3110521 | setosa |
| -0.7769106 | 0.7861738 | -1.3357516 | -1.3110521 | setosa |
| -1.3807271 | 0.3273175 | -1.2224563 | -1.3110521 | setosa |
| -1.2599638 | 0.0978893 | -1.2224563 | -1.3110521 | setosa |
| -0.5353840 | 0.7861738 | -1.2791040 | -1.0486668 | setosa |
| -0.7769106 | 2.3921710 | -1.2791040 | -1.4422448 | setosa |
| -0.4146207 | 2.6215991 | -1.3357516 | -1.3110521 | setosa |
| -1.1392005 | 0.0978893 | -1.2791040 | -1.3110521 | setosa |
| -1.0184372 | 0.3273175 | -1.4490469 | -1.3110521 | setosa |
| -0.4146207 | 1.0156020 | -1.3923993 | -1.3110521 | setosa |
| -1.1392005 | 1.2450302 | -1.3357516 | -1.4422448 | setosa |
| -1.7430170 | -0.1315388 | -1.3923993 | -1.3110521 | setosa |
| -0.8976739 | 0.7861738 | -1.2791040 | -1.3110521 | setosa |
| -1.0184372 | 1.0156020 | -1.3923993 | -1.1798595 | setosa |
| -1.6222537 | -1.7375359 | -1.3923993 | -1.1798595 | setosa |
| -1.7430170 | 0.3273175 | -1.3923993 | -1.3110521 | setosa |
| -1.0184372 | 1.0156020 | -1.2224563 | -0.7862814 | setosa |
| -0.8976739 | 1.7038865 | -1.0525134 | -1.0486668 | setosa |
| -1.2599638 | -0.1315388 | -1.3357516 | -1.1798595 | setosa |
| -0.8976739 | 1.7038865 | -1.2224563 | -1.3110521 | setosa |
| -1.5014904 | 0.3273175 | -1.3357516 | -1.3110521 | setosa |
| -0.6561473 | 1.4744583 | -1.2791040 | -1.3110521 | setosa |
| -1.0184372 | 0.5567457 | -1.3357516 | -1.3110521 | setosa |
| 1.3968289 | 0.3273175 | 0.5336209 | 0.2632600 | versicolor |
| 0.6722490 | 0.3273175 | 0.4203256 | 0.3944526 | versicolor |
| 1.2760656 | 0.0978893 | 0.6469162 | 0.3944526 | versicolor |
| -0.4146207 | -1.7375359 | 0.1370873 | 0.1320673 | versicolor |
| 0.7930124 | -0.5903951 | 0.4769732 | 0.3944526 | versicolor |
| -0.1730941 | -0.5903951 | 0.4203256 | 0.1320673 | versicolor |
| 0.5514857 | 0.5567457 | 0.5336209 | 0.5256453 | versicolor |
| -1.1392005 | -1.5081078 | -0.2594462 | -0.2615107 | versicolor |
| 0.9137757 | -0.3609670 | 0.4769732 | 0.1320673 | versicolor |
| -0.7769106 | -0.8198233 | 0.0804397 | 0.2632600 | versicolor |
| -1.0184372 | -2.4258204 | -0.1461509 | -0.2615107 | versicolor |
| 0.0684325 | -0.1315388 | 0.2503826 | 0.3944526 | versicolor |
| 0.1891958 | -1.9669641 | 0.1370873 | -0.2615107 | versicolor |
| 0.3099591 | -0.3609670 | 0.5336209 | 0.2632600 | versicolor |
| -0.2938574 | -0.3609670 | -0.0895033 | 0.1320673 | versicolor |
| 1.0345390 | 0.0978893 | 0.3636779 | 0.2632600 | versicolor |
| -0.2938574 | -0.1315388 | 0.4203256 | 0.3944526 | versicolor |
| -0.0523308 | -0.8198233 | 0.1937350 | -0.2615107 | versicolor |
| 0.4307224 | -1.9669641 | 0.4203256 | 0.3944526 | versicolor |
| -0.2938574 | -1.2786796 | 0.0804397 | -0.1303181 | versicolor |
| 0.0684325 | 0.3273175 | 0.5902685 | 0.7880307 | versicolor |
| 0.3099591 | -0.5903951 | 0.1370873 | 0.1320673 | versicolor |
| 0.5514857 | -1.2786796 | 0.6469162 | 0.3944526 | versicolor |
| 0.3099591 | -0.5903951 | 0.5336209 | 0.0008746 | versicolor |
| 0.6722490 | -0.3609670 | 0.3070303 | 0.1320673 | versicolor |
| 0.9137757 | -0.1315388 | 0.3636779 | 0.2632600 | versicolor |
| 1.1553023 | -0.5903951 | 0.5902685 | 0.2632600 | versicolor |
| 1.0345390 | -0.1315388 | 0.7035638 | 0.6568380 | versicolor |
| 0.1891958 | -0.3609670 | 0.4203256 | 0.3944526 | versicolor |
| -0.1730941 | -1.0492515 | -0.1461509 | -0.2615107 | versicolor |
| -0.4146207 | -1.5081078 | 0.0237920 | -0.1303181 | versicolor |
| -0.4146207 | -1.5081078 | -0.0328556 | -0.2615107 | versicolor |
| -0.0523308 | -0.8198233 | 0.0804397 | 0.0008746 | versicolor |
| 0.1891958 | -0.8198233 | 0.7602115 | 0.5256453 | versicolor |
| -0.5353840 | -0.1315388 | 0.4203256 | 0.3944526 | versicolor |
| 0.1891958 | 0.7861738 | 0.4203256 | 0.5256453 | versicolor |
| 1.0345390 | 0.0978893 | 0.5336209 | 0.3944526 | versicolor |
| 0.5514857 | -1.7375359 | 0.3636779 | 0.1320673 | versicolor |
| -0.2938574 | -0.1315388 | 0.1937350 | 0.1320673 | versicolor |
| -0.4146207 | -1.2786796 | 0.1370873 | 0.1320673 | versicolor |
| -0.4146207 | -1.0492515 | 0.3636779 | 0.0008746 | versicolor |
| 0.3099591 | -0.1315388 | 0.4769732 | 0.2632600 | versicolor |
| -0.0523308 | -1.0492515 | 0.1370873 | 0.0008746 | versicolor |
| -1.0184372 | -1.7375359 | -0.2594462 | -0.2615107 | versicolor |
| -0.2938574 | -0.8198233 | 0.2503826 | 0.1320673 | versicolor |
| -0.1730941 | -0.1315388 | 0.2503826 | 0.0008746 | versicolor |
| -0.1730941 | -0.3609670 | 0.2503826 | 0.1320673 | versicolor |
| 0.4307224 | -0.3609670 | 0.3070303 | 0.1320673 | versicolor |
| -0.8976739 | -1.2786796 | -0.4293892 | -0.1303181 | versicolor |
| -0.1730941 | -0.5903951 | 0.1937350 | 0.1320673 | versicolor |
| 0.5514857 | 0.5567457 | 1.2700404 | 1.7063794 | virginica |
| -0.0523308 | -0.8198233 | 0.7602115 | 0.9192234 | virginica |
| 1.5175922 | -0.1315388 | 1.2133927 | 1.1816087 | virginica |
| 0.5514857 | -0.3609670 | 1.0434497 | 0.7880307 | virginica |
| 0.7930124 | -0.1315388 | 1.1567451 | 1.3128014 | virginica |
| 2.1214087 | -0.1315388 | 1.6099263 | 1.1816087 | virginica |
| -1.1392005 | -1.2786796 | 0.4203256 | 0.6568380 | virginica |
| 1.7591188 | -0.3609670 | 1.4399833 | 0.7880307 | virginica |
| 1.0345390 | -1.2786796 | 1.1567451 | 0.7880307 | virginica |
| 1.6383555 | 1.2450302 | 1.3266880 | 1.7063794 | virginica |
| 0.7930124 | 0.3273175 | 0.7602115 | 1.0504160 | virginica |
| 0.6722490 | -0.8198233 | 0.8735068 | 0.9192234 | virginica |
| 1.1553023 | -0.1315388 | 0.9868021 | 1.1816087 | virginica |
| -0.1730941 | -1.2786796 | 0.7035638 | 1.0504160 | virginica |
| -0.0523308 | -0.5903951 | 0.7602115 | 1.5751867 | virginica |
| 0.6722490 | 0.3273175 | 0.8735068 | 1.4439941 | virginica |
| 0.7930124 | -0.1315388 | 0.9868021 | 0.7880307 | virginica |
| 2.2421720 | 1.7038865 | 1.6665739 | 1.3128014 | virginica |
| 2.2421720 | -1.0492515 | 1.7798692 | 1.4439941 | virginica |
| 0.1891958 | -1.9669641 | 0.7035638 | 0.3944526 | virginica |
| 1.2760656 | 0.3273175 | 1.1000974 | 1.4439941 | virginica |
| -0.2938574 | -0.5903951 | 0.6469162 | 1.0504160 | virginica |
| 2.2421720 | -0.5903951 | 1.6665739 | 1.0504160 | virginica |
| 0.5514857 | -0.8198233 | 0.6469162 | 0.7880307 | virginica |
| 1.0345390 | 0.5567457 | 1.1000974 | 1.1816087 | virginica |
| 1.6383555 | 0.3273175 | 1.2700404 | 0.7880307 | virginica |
| 0.4307224 | -0.5903951 | 0.5902685 | 0.7880307 | virginica |
| 0.3099591 | -0.1315388 | 0.6469162 | 0.7880307 | virginica |
| 0.6722490 | -0.5903951 | 1.0434497 | 1.1816087 | virginica |
| 1.6383555 | -0.1315388 | 1.1567451 | 0.5256453 | virginica |
| 1.8798821 | -0.5903951 | 1.3266880 | 0.9192234 | virginica |
| 2.4836986 | 1.7038865 | 1.4966310 | 1.0504160 | virginica |
| 0.6722490 | -0.5903951 | 1.0434497 | 1.3128014 | virginica |
| 0.5514857 | -0.5903951 | 0.7602115 | 0.3944526 | virginica |
| 0.3099591 | -1.0492515 | 1.0434497 | 0.2632600 | virginica |
| 2.2421720 | -0.1315388 | 1.3266880 | 1.4439941 | virginica |
| 0.5514857 | 0.7861738 | 1.0434497 | 1.5751867 | virginica |
| 0.6722490 | 0.0978893 | 0.9868021 | 0.7880307 | virginica |
| 0.1891958 | -0.1315388 | 0.5902685 | 0.7880307 | virginica |
| 1.2760656 | 0.0978893 | 0.9301544 | 1.1816087 | virginica |
| 1.0345390 | 0.0978893 | 1.0434497 | 1.5751867 | virginica |
| 1.2760656 | 0.0978893 | 0.7602115 | 1.4439941 | virginica |
| -0.0523308 | -0.8198233 | 0.7602115 | 0.9192234 | virginica |
| 1.1553023 | 0.3273175 | 1.2133927 | 1.4439941 | virginica |
| 1.0345390 | 0.5567457 | 1.1000974 | 1.7063794 | virginica |
| 1.0345390 | -0.1315388 | 0.8168591 | 1.4439941 | virginica |
| 0.5514857 | -1.2786796 | 0.7035638 | 0.9192234 | virginica |
| 0.7930124 | -0.1315388 | 0.8168591 | 1.0504160 | virginica |
| 0.4307224 | 0.7861738 | 0.9301544 | 1.4439941 | virginica |
| 0.0684325 | -0.1315388 | 0.7602115 | 0.7880307 | virginica |
Each column of the resulting dataset has mean \(0\) and standard deviation \(1\).
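As a quick check of the scaling step, the following sketch standardizes the columns by hand and compares the result with `scale()`. The names `X`, `X_manual` and `X_scaled` are introduced here only for illustration and are not part of the original analysis.

```r
# Standardize the four numeric columns by hand (column-wise z-scores).
X <- as.matrix(iris[-5])
col_means <- colMeans(X)
col_sds <- apply(X, 2, sd)  # sample standard deviation (n - 1), as used by scale()
X_manual <- sweep(sweep(X, 2, col_means, "-"), 2, col_sds, "/")

# Compare with scale(); the attributes differ, so compare the values only.
X_scaled <- scale(X)
all.equal(X_manual, X_scaled, check.attributes = FALSE)

# Column means should be 0 and standard deviations 1, up to floating-point error.
round(colMeans(X_manual), 10)
apply(X_manual, 2, sd)
```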
Now we can generate the correlation matrix, which tells us the degree
of correlation between each pair of variables.
iris_cor = cor(iris_scaled[-5])
kable(iris_cor) %>% kable_styling(full_width = FALSE)
| | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width |
|---|---|---|---|---|
| Sepal.Length | 1.0000000 | -0.1175698 | 0.8717538 | 0.8179411 |
| Sepal.Width | -0.1175698 | 1.0000000 | -0.4284401 | -0.3661259 |
| Petal.Length | 0.8717538 | -0.4284401 | 1.0000000 | 0.9628654 |
| Petal.Width | 0.8179411 | -0.3661259 | 0.9628654 | 1.0000000 |
The general formula for the correlation between two variables \(X\) and \(Y\) is
\[
\operatorname{corr}(X,Y) = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y}
\]
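The formula can be checked on a single pair of variables. This is a minimal sketch; the variable names `x`, `y`, `r_manual` and `r_builtin` are illustrative assumptions, and the pair Petal.Length / Petal.Width is chosen only as an example.

```r
# Correlation of Petal.Length and Petal.Width from the definition.
x <- iris$Petal.Length
y <- iris$Petal.Width
r_manual <- cov(x, y) / (sd(x) * sd(y))
r_builtin <- cor(x, y)
all.equal(r_manual, r_builtin)
round(r_manual, 7)  # compare with the Petal.Length / Petal.Width entry of the table

# On standardized data every standard deviation is 1, so the covariance
# matrix and the correlation matrix coincide.
X <- as.matrix(iris[-5])
all.equal(cov(scale(X)), cor(X))
```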
We can now calculate the eigenvalues
and eigenvectors of this matrix.
A square matrix \({\bf C}\), a scalar \(\lambda\) (eigenvalue) and a non-zero vector \({\bf v}\) (eigenvector) are related by
\[ {\bf C}{\bf v} = \lambda {\bf v} \]
For our correlation matrix, the eigenvalues are
eig = eigen(iris_cor)
eig$values
## [1] 2.91849782 0.91403047 0.14675688 0.02071484
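To see the eigenvalue relationship at work on our matrix, the sketch below verifies it numerically for the first eigenpair. The names `C`, `e`, `v1` and `lambda1` are introduced here for illustration only.

```r
# Recompute the correlation matrix and its eigendecomposition.
C <- cor(iris[-5])
e <- eigen(C)

# Check C v = lambda v for the first eigenvalue/eigenvector pair.
v1 <- e$vectors[, 1]
lambda1 <- e$values[1]
lhs <- as.vector(C %*% v1)
rhs <- lambda1 * v1
all.equal(lhs, rhs)

# eigen() returns unit-length, mutually orthogonal eigenvectors for a
# symmetric matrix, so V'V should be the identity matrix.
round(crossprod(e$vectors), 10)
```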
You may notice that the sum of these eigenvalues is equal to
the number of variables in our original data matrix (four), because each
standardized variable contributes one unit of variance.
Each eigenvalue represents the amount of variability explained by the
corresponding principal component. We can also express them as percentages
of the total variance and display them in a plot known as a scree
plot.
library(ggplot2)
# Percentage of the total variance explained by each component
values_prc = round((eig$values/sum(eig$values))*100, digits = 2)
values_prc = data.frame('PC' = c('PC1', 'PC2', 'PC3','PC4'),values_prc)
ggplot(values_prc, aes(PC, values_prc))+geom_col(fill = 'cornflowerblue')+ylab('Variance (%)')+
xlab('')+
theme_minimal()+
theme(
axis.line = element_line(color='black'),
panel.grid.major = element_blank(),
panel.grid.minor = element_blank()
)+geom_text(label=(paste0(values_prc$values_prc,'%')), nudge_y = 2.5)
Looking at this chart, it’s clear that most of the total variance is
explained by the first principal component (72.96%). Together with the
second component (22.85%), it accounts for about 95.8% of the total variance.
We can therefore represent our original data in a 2-dimensional graph with PC1 and PC2 as the x and y axes.
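The percentages discussed here can be reproduced in a few lines. This is a sketch with illustrative names (`lambda`, `prop`, `cum_prop`); it also checks that the eigenvalues of a correlation matrix sum to its trace, that is, to the number of variables.

```r
# Eigenvalues of the correlation matrix of the four measurements.
C <- cor(iris[-5])
lambda <- eigen(C)$values

# The eigenvalues sum to the trace of C, which equals the number of variables.
c(sum_eigenvalues = sum(lambda), trace = sum(diag(C)), n_variables = ncol(C))

# Proportion and cumulative proportion of variance, in percent.
prop <- lambda / sum(lambda)
cum_prop <- cumsum(prop)
round(100 * prop, 2)
round(100 * cum_prop, 2)  # the second entry is the share covered by PC1 and PC2
```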
Loadings are the correlations between the original variables and the unit-scaled components. We can define them as
\[ \text{Loadings} = \text{Eigenvectors} \cdot \sqrt{\text{Eigenvalues}} \]
and generate the so-called loading
vectors for each component.
load1 = eig$vectors[,1]*(sqrt(eig$values[1]))
load2 = eig$vectors[,2]*(sqrt(eig$values[2]))
load_df = data.frame('Variable' = colnames(iris[-5]) ,load1,load2)
kable(load_df) %>% kable_styling(full_width = FALSE)
| Variable | load1 | load2 |
|---|---|---|
| Sepal.Length | 0.8901688 | -0.3608299 |
| Sepal.Width | -0.4601427 | -0.8827163 |
| Petal.Length | 0.9915552 | -0.0234152 |
| Petal.Width | 0.9649790 | -0.0639998 |
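The interpretation of loadings as correlations can be checked directly. The sketch below recomputes everything from `iris`, so the names `X`, `e`, `loadings`, `scores` and `cors` are assumptions made for this example. Using the same eigenvectors on both sides avoids any sign ambiguity.

```r
# Standardized data and eigendecomposition of its correlation matrix.
X <- scale(as.matrix(iris[-5]))
e <- eigen(cor(X))

# Loadings: each eigenvector (column) scaled by the square root of its eigenvalue.
loadings <- e$vectors %*% diag(sqrt(e$values))

# Component scores, then their correlations with the original variables.
scores <- X %*% e$vectors
cors <- cor(X, scores)

# The two matrices should agree up to floating-point error.
all.equal(loadings, cors, check.attributes = FALSE)
round(cors[, 1:2], 7)
```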
We can project the original observations (objects) onto the
principal components; the projected values are called scores. To do this,
we multiply the scaled data matrix by the matrix of eigenvectors, so that
each score is a weighted sum of the standardized variables for that observation.
The elements of each eigenvector (the loadings before rescaling by the
square root of the eigenvalue) can be understood as the weights of the
original variables in the corresponding principal component.
In our dataset, the first two score vectors are
pca = prcomp(iris_scaled[-5])
score = data.frame(pca$x[,1:2])
kable(score) %>% kable_styling(full_width = FALSE, fixed_thead = TRUE) %>% scroll_box(height = '600px')
| PC1 | PC2 |
|---|---|
| -2.2571412 | -0.4784238 |
| -2.0740130 | 0.6718827 |
| -2.3563351 | 0.3407664 |
| -2.2917068 | 0.5953999 |
| -2.3818627 | -0.6446757 |
| -2.0687006 | -1.4842053 |
| -2.4358684 | -0.0474851 |
| -2.2253919 | -0.2224030 |
| -2.3268453 | 1.1116037 |
| -2.1770349 | 0.4674476 |
| -2.1590770 | -1.0402059 |
| -2.3183641 | -0.1326340 |
| -2.2110437 | 0.7262432 |
| -2.6243090 | 0.9582963 |
| -2.1913992 | -1.8538466 |
| -2.2546612 | -2.6773152 |
| -2.2002168 | -1.4786557 |
| -2.1830361 | -0.4872061 |
| -1.8922328 | -1.4003276 |
| -2.3355448 | -1.1240836 |
| -1.9079312 | -0.4074906 |
| -2.1996438 | -0.9210359 |
| -2.7650814 | -0.4568133 |
| -1.8125972 | -0.0852729 |
| -2.2197270 | -0.1367962 |
| -1.9453293 | 0.6235297 |
| -2.0443028 | -0.2413550 |
| -2.1613365 | -0.5253894 |
| -2.1324196 | -0.3121720 |
| -2.2576980 | 0.3366042 |
| -2.1329765 | 0.5028561 |
| -1.8254792 | -0.4222804 |
| -2.6062169 | -1.7875873 |
| -2.4380098 | -2.1435468 |
| -2.1029299 | 0.4586653 |
| -2.2004372 | 0.2054192 |
| -2.0383177 | -0.6593492 |
| -2.5188934 | -0.5903152 |
| -2.4215203 | 0.9011611 |
| -2.1624662 | -0.2679812 |
| -2.2788408 | -0.4402405 |
| -1.8519184 | 2.3296107 |
| -2.5451120 | 0.4775010 |
| -1.9578886 | -0.4707496 |
| -2.1299236 | -1.1384155 |
| -2.0628336 | 0.7086786 |
| -2.3767708 | -1.1166887 |
| -2.3863817 | 0.3849572 |
| -2.2220026 | -0.9946277 |
| -2.1964750 | -0.0091856 |
| 1.0981024 | -0.8600910 |
| 0.7288956 | -0.5926294 |
| 1.2368358 | -0.6142399 |
| 0.4061225 | 1.7485462 |
| 1.0718838 | 0.2077251 |
| 0.3873895 | 0.5913027 |
| 0.7440371 | -0.7704383 |
| -0.4856956 | 1.8462440 |
| 0.9248035 | -0.0321185 |
| 0.0113880 | 1.0305658 |
| -0.1098283 | 2.6452111 |
| 0.4392220 | 0.0630839 |
| 0.5602315 | 1.7588321 |
| 0.7171593 | 0.1856028 |
| -0.0332433 | 0.4375374 |
| 0.8724843 | -0.5073642 |
| 0.3490822 | 0.1956563 |
| 0.1582798 | 0.7894510 |
| 1.2210032 | 1.6168273 |
| 0.1643673 | 1.2982599 |
| 0.7352196 | -0.3952474 |
| 0.4746969 | 0.4159269 |
| 1.2300573 | 0.9302094 |
| 0.6307451 | 0.4149974 |
| 0.7003151 | 0.0632001 |
| 0.8713545 | -0.2499560 |
| 1.2523137 | 0.0769981 |
| 1.3538695 | -0.3302055 |
| 0.6625807 | 0.2251735 |
| -0.0401242 | 1.0551836 |
| 0.1303585 | 1.5570556 |
| 0.0233744 | 1.5672252 |
| 0.2407318 | 0.7746612 |
| 1.0575517 | 0.6317269 |
| 0.2232309 | 0.2868127 |
| 0.4277063 | -0.8427589 |
| 1.0452264 | -0.5203087 |
| 1.0410438 | 1.3783710 |
| 0.0693560 | 0.2187704 |
| 0.2825307 | 1.3248861 |
| 0.2781460 | 1.1162889 |
| 0.6224844 | -0.0248398 |
| 0.3354067 | 0.9851038 |
| -0.3609741 | 2.0124958 |
| 0.2876227 | 0.8528731 |
| 0.0910556 | 0.1805871 |
| 0.2269565 | 0.3836349 |
| 0.5744638 | 0.1543565 |
| -0.4461723 | 1.5386375 |
| 0.2558734 | 0.5968523 |
| 1.8384100 | -0.8675151 |
| 1.1540156 | 0.6965364 |
| 2.1979036 | -0.5601340 |
| 1.4353421 | 0.0468307 |
| 1.8615758 | -0.2940597 |
| 2.7426851 | -0.7977367 |
| 0.3657922 | 1.5562892 |
| 2.2947518 | -0.4186630 |
| 1.9999863 | 0.7090632 |
| 2.2522322 | -1.9145963 |
| 1.3596206 | -0.6904434 |
| 1.5973275 | 0.4202924 |
| 1.8776105 | -0.4178498 |
| 1.2559077 | 1.1583797 |
| 1.4627449 | 0.4407949 |
| 1.5847682 | -0.6739869 |
| 1.4665185 | -0.2547683 |
| 2.4182277 | -2.5481248 |
| 3.2996415 | -0.0177216 |
| 1.2595471 | 1.7010467 |
| 2.0309126 | -0.9074274 |
| 0.9747153 | 0.5698553 |
| 2.8879765 | -0.4122600 |
| 1.3287806 | 0.4802025 |
| 1.6950553 | -1.0105365 |
| 1.9478014 | -1.0044127 |
| 1.1711801 | 0.3153381 |
| 1.0175417 | -0.0641312 |
| 1.7823788 | 0.1867356 |
| 1.8574250 | -0.5604133 |
| 2.4278203 | -0.2584187 |
| 2.2972318 | -2.6175544 |
| 1.8564838 | 0.1779533 |
| 1.1104277 | 0.2919446 |
| 1.1984584 | 0.8086064 |
| 2.7894256 | -0.8539425 |
| 1.5709929 | -1.0650132 |
| 1.3417970 | -0.4210202 |
| 0.9217370 | -0.0171656 |
| 1.8458612 | -0.6738706 |
| 2.0080832 | -0.6118359 |
| 1.8954342 | -0.6872731 |
| 1.1540156 | 0.6965364 |
| 2.0337450 | -0.8646240 |
| 1.9914755 | -1.0456657 |
| 1.8642579 | -0.3856740 |
| 1.5593565 | 0.8936929 |
| 1.5160915 | -0.2681707 |
| 1.3682042 | -1.0078779 |
| 0.9574485 | 0.0242504 |
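The scores returned by `prcomp()` can also be obtained by hand as a matrix product. This is a minimal sketch with illustrative names (`X`, `V`, `scores_manual`, `pca_check`). Because the sign of each component is arbitrary, the comparison is made on absolute values.

```r
# Standardized data and the eigenvectors of its correlation matrix.
X <- scale(as.matrix(iris[-5]))
V <- eigen(cor(X))$vectors

# Scores: projection of each observation onto the eigenvectors.
scores_manual <- X %*% V

# prcomp() computes the same projection via a singular value decomposition.
pca_check <- prcomp(X)
all.equal(abs(scores_manual), abs(pca_check$x), check.attributes = FALSE)

# The variance of each score vector equals the corresponding eigenvalue.
round(apply(scores_manual, 2, var), 7)
```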
A score plot will look like this
species = iris$Species
ggplot(score, aes(PC1, PC2, color=species))+geom_point()+
stat_ellipse()+
xlab(paste('PC1 - ', values_prc$values_prc[1], '%'))+
ylab(paste('PC2 - ', values_prc$values_prc[2], '%'))+
theme_classic()+
theme(plot.title = element_text(hjust = 0.5))
It’s immediately clear that the data fall into two major clusters and that flowers of the species Iris setosa have different characteristics from the others.
A similar grouping could have been obtained through a classical
cluster analysis. With PCA, however, we’re also able to see which
variables are most responsible for the variation.
Let’s take a look at the first two loading vectors we calculated
earlier and see what happens when we plot them against each
other.
names = colnames(iris[-5])
ggplot(load_df, aes(load1, load2))+geom_point(color = 'white')+
xlab(paste('PC1 - ', values_prc$values_prc[1], '%'))+
ylab(paste('PC2 - ', values_prc$values_prc[2], '%'))+
theme_classic()+
xlim(-1,1.5)+
ylim(-1,1)+
geom_label(label= names, size=3, nudge_x = 0.15, fontface='bold')+
geom_segment(aes(xend=0, yend=0), color="brown1")+
theme(plot.title = element_text(hjust = 0.5))
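This graph shows us the direction of each variable along the principal
components. Petal length, petal width and sepal length load strongly on PC1, while sepal width dominates PC2.
To better understand the relationship between loadings and scores, i.e. between variables and objects, a biplot may be useful.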
ggplot(score, aes(PC1,PC2, color = species))+geom_point()+xlim(-4,4)+ylim(-4,4)+
theme_minimal()+
xlab(paste('PC1 - ', values_prc$values_prc[1], '%'))+
ylab(paste('PC2 - ', values_prc$values_prc[2], '%'))+
geom_label(data=load_df, aes(load1*2,load2*2), label=names, size=3, color='black' )+
stat_ellipse()