Iris dataset

The Iris flower dataset, or Fisher’s Iris data set, is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper “The use of multiple measurements in taxonomic problems” as an example of linear discriminant analysis (Fisher 1936).

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
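As a quick sanity check of the description above, the following sketch inspects the copy of the data that ships with base R (the built-in `iris` data frame). It only uses base functions and makes no assumptions beyond that.

```r
# iris ships with base R (package 'datasets'), so no download is needed.
data(iris)

dim(iris)             # rows (samples) and columns (4 measurements + Species)
table(iris$Species)   # number of samples per species
str(iris)             # column types: four numeric features and one factor
```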


library(knitr)       # kable()
library(kableExtra)  # kable_styling(), scroll_box(), %>%
kable(iris) %>% kable_styling(fixed_thead = TRUE, full_width = FALSE) %>%
  scroll_box(height = "600px")
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1 3.5 1.4 0.2 setosa
4.9 3.0 1.4 0.2 setosa
4.7 3.2 1.3 0.2 setosa
4.6 3.1 1.5 0.2 setosa
5.0 3.6 1.4 0.2 setosa
5.4 3.9 1.7 0.4 setosa
4.6 3.4 1.4 0.3 setosa
5.0 3.4 1.5 0.2 setosa
4.4 2.9 1.4 0.2 setosa
4.9 3.1 1.5 0.1 setosa
5.4 3.7 1.5 0.2 setosa
4.8 3.4 1.6 0.2 setosa
4.8 3.0 1.4 0.1 setosa
4.3 3.0 1.1 0.1 setosa
5.8 4.0 1.2 0.2 setosa
5.7 4.4 1.5 0.4 setosa
5.4 3.9 1.3 0.4 setosa
5.1 3.5 1.4 0.3 setosa
5.7 3.8 1.7 0.3 setosa
5.1 3.8 1.5 0.3 setosa
5.4 3.4 1.7 0.2 setosa
5.1 3.7 1.5 0.4 setosa
4.6 3.6 1.0 0.2 setosa
5.1 3.3 1.7 0.5 setosa
4.8 3.4 1.9 0.2 setosa
5.0 3.0 1.6 0.2 setosa
5.0 3.4 1.6 0.4 setosa
5.2 3.5 1.5 0.2 setosa
5.2 3.4 1.4 0.2 setosa
4.7 3.2 1.6 0.2 setosa
4.8 3.1 1.6 0.2 setosa
5.4 3.4 1.5 0.4 setosa
5.2 4.1 1.5 0.1 setosa
5.5 4.2 1.4 0.2 setosa
4.9 3.1 1.5 0.2 setosa
5.0 3.2 1.2 0.2 setosa
5.5 3.5 1.3 0.2 setosa
4.9 3.6 1.4 0.1 setosa
4.4 3.0 1.3 0.2 setosa
5.1 3.4 1.5 0.2 setosa
5.0 3.5 1.3 0.3 setosa
4.5 2.3 1.3 0.3 setosa
4.4 3.2 1.3 0.2 setosa
5.0 3.5 1.6 0.6 setosa
5.1 3.8 1.9 0.4 setosa
4.8 3.0 1.4 0.3 setosa
5.1 3.8 1.6 0.2 setosa
4.6 3.2 1.4 0.2 setosa
5.3 3.7 1.5 0.2 setosa
5.0 3.3 1.4 0.2 setosa
7.0 3.2 4.7 1.4 versicolor
6.4 3.2 4.5 1.5 versicolor
6.9 3.1 4.9 1.5 versicolor
5.5 2.3 4.0 1.3 versicolor
6.5 2.8 4.6 1.5 versicolor
5.7 2.8 4.5 1.3 versicolor
6.3 3.3 4.7 1.6 versicolor
4.9 2.4 3.3 1.0 versicolor
6.6 2.9 4.6 1.3 versicolor
5.2 2.7 3.9 1.4 versicolor
5.0 2.0 3.5 1.0 versicolor
5.9 3.0 4.2 1.5 versicolor
6.0 2.2 4.0 1.0 versicolor
6.1 2.9 4.7 1.4 versicolor
5.6 2.9 3.6 1.3 versicolor
6.7 3.1 4.4 1.4 versicolor
5.6 3.0 4.5 1.5 versicolor
5.8 2.7 4.1 1.0 versicolor
6.2 2.2 4.5 1.5 versicolor
5.6 2.5 3.9 1.1 versicolor
5.9 3.2 4.8 1.8 versicolor
6.1 2.8 4.0 1.3 versicolor
6.3 2.5 4.9 1.5 versicolor
6.1 2.8 4.7 1.2 versicolor
6.4 2.9 4.3 1.3 versicolor
6.6 3.0 4.4 1.4 versicolor
6.8 2.8 4.8 1.4 versicolor
6.7 3.0 5.0 1.7 versicolor
6.0 2.9 4.5 1.5 versicolor
5.7 2.6 3.5 1.0 versicolor
5.5 2.4 3.8 1.1 versicolor
5.5 2.4 3.7 1.0 versicolor
5.8 2.7 3.9 1.2 versicolor
6.0 2.7 5.1 1.6 versicolor
5.4 3.0 4.5 1.5 versicolor
6.0 3.4 4.5 1.6 versicolor
6.7 3.1 4.7 1.5 versicolor
6.3 2.3 4.4 1.3 versicolor
5.6 3.0 4.1 1.3 versicolor
5.5 2.5 4.0 1.3 versicolor
5.5 2.6 4.4 1.2 versicolor
6.1 3.0 4.6 1.4 versicolor
5.8 2.6 4.0 1.2 versicolor
5.0 2.3 3.3 1.0 versicolor
5.6 2.7 4.2 1.3 versicolor
5.7 3.0 4.2 1.2 versicolor
5.7 2.9 4.2 1.3 versicolor
6.2 2.9 4.3 1.3 versicolor
5.1 2.5 3.0 1.1 versicolor
5.7 2.8 4.1 1.3 versicolor
6.3 3.3 6.0 2.5 virginica
5.8 2.7 5.1 1.9 virginica
7.1 3.0 5.9 2.1 virginica
6.3 2.9 5.6 1.8 virginica
6.5 3.0 5.8 2.2 virginica
7.6 3.0 6.6 2.1 virginica
4.9 2.5 4.5 1.7 virginica
7.3 2.9 6.3 1.8 virginica
6.7 2.5 5.8 1.8 virginica
7.2 3.6 6.1 2.5 virginica
6.5 3.2 5.1 2.0 virginica
6.4 2.7 5.3 1.9 virginica
6.8 3.0 5.5 2.1 virginica
5.7 2.5 5.0 2.0 virginica
5.8 2.8 5.1 2.4 virginica
6.4 3.2 5.3 2.3 virginica
6.5 3.0 5.5 1.8 virginica
7.7 3.8 6.7 2.2 virginica
7.7 2.6 6.9 2.3 virginica
6.0 2.2 5.0 1.5 virginica
6.9 3.2 5.7 2.3 virginica
5.6 2.8 4.9 2.0 virginica
7.7 2.8 6.7 2.0 virginica
6.3 2.7 4.9 1.8 virginica
6.7 3.3 5.7 2.1 virginica
7.2 3.2 6.0 1.8 virginica
6.2 2.8 4.8 1.8 virginica
6.1 3.0 4.9 1.8 virginica
6.4 2.8 5.6 2.1 virginica
7.2 3.0 5.8 1.6 virginica
7.4 2.8 6.1 1.9 virginica
7.9 3.8 6.4 2.0 virginica
6.4 2.8 5.6 2.2 virginica
6.3 2.8 5.1 1.5 virginica
6.1 2.6 5.6 1.4 virginica
7.7 3.0 6.1 2.3 virginica
6.3 3.4 5.6 2.4 virginica
6.4 3.1 5.5 1.8 virginica
6.0 3.0 4.8 1.8 virginica
6.9 3.1 5.4 2.1 virginica
6.7 3.1 5.6 2.4 virginica
6.9 3.1 5.1 2.3 virginica
5.8 2.7 5.1 1.9 virginica
6.8 3.2 5.9 2.3 virginica
6.7 3.3 5.7 2.5 virginica
6.7 3.0 5.2 2.3 virginica
6.3 2.5 5.0 1.9 virginica
6.5 3.0 5.2 2.0 virginica
6.2 3.4 5.4 2.3 virginica
5.9 3.0 5.1 1.8 virginica


Scaling data

We want to perform a principal component analysis (PCA) on the dataset in order to better understand the relationship between the observed values and the variables.

The first step is scaling the data: for each column (variable \(i\)), we subtract the column mean from every value and then divide by the column standard deviation. Writing \(X_{i,j}\) for observation \(j\) of variable \(i\):


\[ X^*_{i,j} = \frac{X_{i,j} - \overline{X_i}}{S_{X_i}} \]

This operation puts all the variables on the same scale, so that no variable dominates the analysis simply because its values are larger.
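The formula above can be written out by hand in a few lines. This is a minimal sketch for illustration; the names `X`, `col_means`, `col_sds` and `X_manual` are introduced here and are not part of the original analysis, which relies on `scale()` directly.

```r
X <- as.matrix(iris[, 1:4])          # numeric columns only

col_means <- colMeans(X)             # mean of each variable
col_sds   <- apply(X, 2, sd)         # sample standard deviation (n - 1 denominator)

# Subtract the column mean, then divide by the column standard deviation.
X_manual <- sweep(sweep(X, 2, col_means, "-"), 2, col_sds, "/")

# scale() returns the same values (plus attributes holding the centers and scales).
all.equal(X_manual, scale(X), check.attributes = FALSE)
```

Note that `sd()` uses the sample standard deviation, which is also what `scale()` uses by default.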

iris_scaled = data.frame(scale(iris[-5]))
iris_scaled$Species = iris$Species

kable(iris_scaled) %>% kable_styling(fixed_thead = T, full_width = FALSE) %>%
  scroll_box( height = "600px")
Sepal.Length Sepal.Width Petal.Length Petal.Width Species
-0.8976739 1.0156020 -1.3357516 -1.3110521 setosa
-1.1392005 -0.1315388 -1.3357516 -1.3110521 setosa
-1.3807271 0.3273175 -1.3923993 -1.3110521 setosa
-1.5014904 0.0978893 -1.2791040 -1.3110521 setosa
-1.0184372 1.2450302 -1.3357516 -1.3110521 setosa
-0.5353840 1.9333146 -1.1658087 -1.0486668 setosa
-1.5014904 0.7861738 -1.3357516 -1.1798595 setosa
-1.0184372 0.7861738 -1.2791040 -1.3110521 setosa
-1.7430170 -0.3609670 -1.3357516 -1.3110521 setosa
-1.1392005 0.0978893 -1.2791040 -1.4422448 setosa
-0.5353840 1.4744583 -1.2791040 -1.3110521 setosa
-1.2599638 0.7861738 -1.2224563 -1.3110521 setosa
-1.2599638 -0.1315388 -1.3357516 -1.4422448 setosa
-1.8637803 -0.1315388 -1.5056946 -1.4422448 setosa
-0.0523308 2.1627428 -1.4490469 -1.3110521 setosa
-0.1730941 3.0804554 -1.2791040 -1.0486668 setosa
-0.5353840 1.9333146 -1.3923993 -1.0486668 setosa
-0.8976739 1.0156020 -1.3357516 -1.1798595 setosa
-0.1730941 1.7038865 -1.1658087 -1.1798595 setosa
-0.8976739 1.7038865 -1.2791040 -1.1798595 setosa
-0.5353840 0.7861738 -1.1658087 -1.3110521 setosa
-0.8976739 1.4744583 -1.2791040 -1.0486668 setosa
-1.5014904 1.2450302 -1.5623422 -1.3110521 setosa
-0.8976739 0.5567457 -1.1658087 -0.9174741 setosa
-1.2599638 0.7861738 -1.0525134 -1.3110521 setosa
-1.0184372 -0.1315388 -1.2224563 -1.3110521 setosa
-1.0184372 0.7861738 -1.2224563 -1.0486668 setosa
-0.7769106 1.0156020 -1.2791040 -1.3110521 setosa
-0.7769106 0.7861738 -1.3357516 -1.3110521 setosa
-1.3807271 0.3273175 -1.2224563 -1.3110521 setosa
-1.2599638 0.0978893 -1.2224563 -1.3110521 setosa
-0.5353840 0.7861738 -1.2791040 -1.0486668 setosa
-0.7769106 2.3921710 -1.2791040 -1.4422448 setosa
-0.4146207 2.6215991 -1.3357516 -1.3110521 setosa
-1.1392005 0.0978893 -1.2791040 -1.3110521 setosa
-1.0184372 0.3273175 -1.4490469 -1.3110521 setosa
-0.4146207 1.0156020 -1.3923993 -1.3110521 setosa
-1.1392005 1.2450302 -1.3357516 -1.4422448 setosa
-1.7430170 -0.1315388 -1.3923993 -1.3110521 setosa
-0.8976739 0.7861738 -1.2791040 -1.3110521 setosa
-1.0184372 1.0156020 -1.3923993 -1.1798595 setosa
-1.6222537 -1.7375359 -1.3923993 -1.1798595 setosa
-1.7430170 0.3273175 -1.3923993 -1.3110521 setosa
-1.0184372 1.0156020 -1.2224563 -0.7862814 setosa
-0.8976739 1.7038865 -1.0525134 -1.0486668 setosa
-1.2599638 -0.1315388 -1.3357516 -1.1798595 setosa
-0.8976739 1.7038865 -1.2224563 -1.3110521 setosa
-1.5014904 0.3273175 -1.3357516 -1.3110521 setosa
-0.6561473 1.4744583 -1.2791040 -1.3110521 setosa
-1.0184372 0.5567457 -1.3357516 -1.3110521 setosa
1.3968289 0.3273175 0.5336209 0.2632600 versicolor
0.6722490 0.3273175 0.4203256 0.3944526 versicolor
1.2760656 0.0978893 0.6469162 0.3944526 versicolor
-0.4146207 -1.7375359 0.1370873 0.1320673 versicolor
0.7930124 -0.5903951 0.4769732 0.3944526 versicolor
-0.1730941 -0.5903951 0.4203256 0.1320673 versicolor
0.5514857 0.5567457 0.5336209 0.5256453 versicolor
-1.1392005 -1.5081078 -0.2594462 -0.2615107 versicolor
0.9137757 -0.3609670 0.4769732 0.1320673 versicolor
-0.7769106 -0.8198233 0.0804397 0.2632600 versicolor
-1.0184372 -2.4258204 -0.1461509 -0.2615107 versicolor
0.0684325 -0.1315388 0.2503826 0.3944526 versicolor
0.1891958 -1.9669641 0.1370873 -0.2615107 versicolor
0.3099591 -0.3609670 0.5336209 0.2632600 versicolor
-0.2938574 -0.3609670 -0.0895033 0.1320673 versicolor
1.0345390 0.0978893 0.3636779 0.2632600 versicolor
-0.2938574 -0.1315388 0.4203256 0.3944526 versicolor
-0.0523308 -0.8198233 0.1937350 -0.2615107 versicolor
0.4307224 -1.9669641 0.4203256 0.3944526 versicolor
-0.2938574 -1.2786796 0.0804397 -0.1303181 versicolor
0.0684325 0.3273175 0.5902685 0.7880307 versicolor
0.3099591 -0.5903951 0.1370873 0.1320673 versicolor
0.5514857 -1.2786796 0.6469162 0.3944526 versicolor
0.3099591 -0.5903951 0.5336209 0.0008746 versicolor
0.6722490 -0.3609670 0.3070303 0.1320673 versicolor
0.9137757 -0.1315388 0.3636779 0.2632600 versicolor
1.1553023 -0.5903951 0.5902685 0.2632600 versicolor
1.0345390 -0.1315388 0.7035638 0.6568380 versicolor
0.1891958 -0.3609670 0.4203256 0.3944526 versicolor
-0.1730941 -1.0492515 -0.1461509 -0.2615107 versicolor
-0.4146207 -1.5081078 0.0237920 -0.1303181 versicolor
-0.4146207 -1.5081078 -0.0328556 -0.2615107 versicolor
-0.0523308 -0.8198233 0.0804397 0.0008746 versicolor
0.1891958 -0.8198233 0.7602115 0.5256453 versicolor
-0.5353840 -0.1315388 0.4203256 0.3944526 versicolor
0.1891958 0.7861738 0.4203256 0.5256453 versicolor
1.0345390 0.0978893 0.5336209 0.3944526 versicolor
0.5514857 -1.7375359 0.3636779 0.1320673 versicolor
-0.2938574 -0.1315388 0.1937350 0.1320673 versicolor
-0.4146207 -1.2786796 0.1370873 0.1320673 versicolor
-0.4146207 -1.0492515 0.3636779 0.0008746 versicolor
0.3099591 -0.1315388 0.4769732 0.2632600 versicolor
-0.0523308 -1.0492515 0.1370873 0.0008746 versicolor
-1.0184372 -1.7375359 -0.2594462 -0.2615107 versicolor
-0.2938574 -0.8198233 0.2503826 0.1320673 versicolor
-0.1730941 -0.1315388 0.2503826 0.0008746 versicolor
-0.1730941 -0.3609670 0.2503826 0.1320673 versicolor
0.4307224 -0.3609670 0.3070303 0.1320673 versicolor
-0.8976739 -1.2786796 -0.4293892 -0.1303181 versicolor
-0.1730941 -0.5903951 0.1937350 0.1320673 versicolor
0.5514857 0.5567457 1.2700404 1.7063794 virginica
-0.0523308 -0.8198233 0.7602115 0.9192234 virginica
1.5175922 -0.1315388 1.2133927 1.1816087 virginica
0.5514857 -0.3609670 1.0434497 0.7880307 virginica
0.7930124 -0.1315388 1.1567451 1.3128014 virginica
2.1214087 -0.1315388 1.6099263 1.1816087 virginica
-1.1392005 -1.2786796 0.4203256 0.6568380 virginica
1.7591188 -0.3609670 1.4399833 0.7880307 virginica
1.0345390 -1.2786796 1.1567451 0.7880307 virginica
1.6383555 1.2450302 1.3266880 1.7063794 virginica
0.7930124 0.3273175 0.7602115 1.0504160 virginica
0.6722490 -0.8198233 0.8735068 0.9192234 virginica
1.1553023 -0.1315388 0.9868021 1.1816087 virginica
-0.1730941 -1.2786796 0.7035638 1.0504160 virginica
-0.0523308 -0.5903951 0.7602115 1.5751867 virginica
0.6722490 0.3273175 0.8735068 1.4439941 virginica
0.7930124 -0.1315388 0.9868021 0.7880307 virginica
2.2421720 1.7038865 1.6665739 1.3128014 virginica
2.2421720 -1.0492515 1.7798692 1.4439941 virginica
0.1891958 -1.9669641 0.7035638 0.3944526 virginica
1.2760656 0.3273175 1.1000974 1.4439941 virginica
-0.2938574 -0.5903951 0.6469162 1.0504160 virginica
2.2421720 -0.5903951 1.6665739 1.0504160 virginica
0.5514857 -0.8198233 0.6469162 0.7880307 virginica
1.0345390 0.5567457 1.1000974 1.1816087 virginica
1.6383555 0.3273175 1.2700404 0.7880307 virginica
0.4307224 -0.5903951 0.5902685 0.7880307 virginica
0.3099591 -0.1315388 0.6469162 0.7880307 virginica
0.6722490 -0.5903951 1.0434497 1.1816087 virginica
1.6383555 -0.1315388 1.1567451 0.5256453 virginica
1.8798821 -0.5903951 1.3266880 0.9192234 virginica
2.4836986 1.7038865 1.4966310 1.0504160 virginica
0.6722490 -0.5903951 1.0434497 1.3128014 virginica
0.5514857 -0.5903951 0.7602115 0.3944526 virginica
0.3099591 -1.0492515 1.0434497 0.2632600 virginica
2.2421720 -0.1315388 1.3266880 1.4439941 virginica
0.5514857 0.7861738 1.0434497 1.5751867 virginica
0.6722490 0.0978893 0.9868021 0.7880307 virginica
0.1891958 -0.1315388 0.5902685 0.7880307 virginica
1.2760656 0.0978893 0.9301544 1.1816087 virginica
1.0345390 0.0978893 1.0434497 1.5751867 virginica
1.2760656 0.0978893 0.7602115 1.4439941 virginica
-0.0523308 -0.8198233 0.7602115 0.9192234 virginica
1.1553023 0.3273175 1.2133927 1.4439941 virginica
1.0345390 0.5567457 1.1000974 1.7063794 virginica
1.0345390 -0.1315388 0.8168591 1.4439941 virginica
0.5514857 -1.2786796 0.7035638 0.9192234 virginica
0.7930124 -0.1315388 0.8168591 1.0504160 virginica
0.4307224 0.7861738 0.9301544 1.4439941 virginica
0.0684325 -0.1315388 0.7602115 0.7880307 virginica


Each column of the resulting dataset has mean \(0\) and standard deviation \(1\).
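This property is easy to confirm numerically. The short check below recomputes the scaled matrix under a separate, illustrative name (`iris_check`) so that it can be run on its own.

```r
iris_check <- scale(iris[, 1:4])

round(colMeans(iris_check), 10)   # column means: zero up to floating-point error
apply(iris_check, 2, sd)          # column standard deviations: one
```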


Correlation matrix

Now we can generate the correlation matrix, which tells us the degree of correlation between each pair of variables.

iris_cor = cor(iris_scaled[-5])
kable(iris_cor) %>% kable_styling(full_width = FALSE)
Sepal.Length Sepal.Width Petal.Length Petal.Width
Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000


The general formula for the correlation between two variables \(X\) and \(Y\) is


\[ corr(X,Y) = \frac{cov(X,Y)}{\sigma_X \sigma_Y} \]
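To connect the formula with the matrix above, the sketch below computes one correlation by hand for the pair Petal.Length and Petal.Width and compares it with `cor()`. The variable names (`x`, `y`, `r_manual`, `Z`) are illustrative only.

```r
x <- iris$Petal.Length
y <- iris$Petal.Width

r_manual <- cov(x, y) / (sd(x) * sd(y))   # covariance over the product of standard deviations
r_manual
cor(x, y)                                 # built-in Pearson correlation, same value

# On standardized variables, covariance and correlation coincide.
Z <- scale(iris[, 1:4])
all.equal(cov(Z), cor(iris[, 1:4]))
```

The last line also shows why it makes no difference whether `cor()` is applied to the scaled or the unscaled data: correlation is unaffected by centering and rescaling.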


Eigenvalues and eigenvectors


We can now calculate the eigenvalues and eigenvectors of this matrix.

For a square matrix \({\bf C}\), a scalar \(\lambda\) (eigenvalue) and a nonzero vector \({\bf v}\) (eigenvector) satisfy the relationship


\[ {\bf C}{\bf v} = \lambda {\bf v} \]


For our correlation matrix, the eigenvalues are

eig = eigen(iris_cor)
eig$values
## [1] 2.91849782 0.91403047 0.14675688 0.02071484
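The defining relationship between the matrix, its eigenvalues and its eigenvectors can be verified numerically. The following sketch rebuilds the correlation matrix so that it runs on its own; `C`, `e`, `v1`, `lambda1`, `lhs` and `rhs` are names chosen for this illustration.

```r
C <- cor(iris[, 1:4])
e <- eigen(C)

v1      <- e$vectors[, 1]   # first eigenvector (unit length)
lambda1 <- e$values[1]      # its eigenvalue

lhs <- as.vector(C %*% v1)  # C v
rhs <- lambda1 * v1         # lambda v
all.equal(lhs, rhs)
```

The same check works for any of the four eigenvalue and eigenvector pairs by changing the index.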


You may notice that the sum of these eigenvalues is equal to the number of variables of our original data matrix.
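One way to see why, given here as a short derivation: the eigenvalues of a matrix sum to its trace, and every diagonal entry of a correlation matrix is \(corr(X_k, X_k) = 1\). With \(p = 4\) variables:

\[ \sum_{k=1}^{p} \lambda_k = \operatorname{tr}({\bf C}) = \sum_{k=1}^{p} corr(X_k, X_k) = p = 4 \]

The printed values agree up to rounding: \(2.91849782 + 0.91403047 + 0.14675688 + 0.02071484 \approx 4\).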

Each eigenvalue represents the amount of variability explained by the corresponding principal component. We can also express them as percentages of the total variance and show them in a bar chart, also known as a scree plot.

library(ggplot2)

# Percentage of total variance explained by each component
values_prc = round((eig$values/sum(eig$values))*100, digits = 2)
values_prc = data.frame('PC' = c('PC1', 'PC2', 'PC3', 'PC4'), values_prc)

ggplot(values_prc, aes(PC, values_prc)) +
  geom_col(fill = 'cornflowerblue') +
  geom_text(label = paste0(values_prc$values_prc, '%'), nudge_y = 2.5) +
  xlab('') +
  ylab('Variance (%)') +
  theme_minimal() +
  theme(
    axis.line = element_line(color = 'black'),
    panel.grid.major = element_blank(),
    panel.grid.minor = element_blank()
  )


Looking at this chart, it’s clear that most of the total variance (72.96%) is explained by the first principal component.

Together with the second component (22.85%), the first two components account for 95.81% of the total variance, so we can represent our original data in a 2-dimensional graph with PC1 and PC2 as the x and y axes.
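The choice of two components can be backed up by looking at the cumulative share of variance. This is a small sketch using illustrative names (`ev`, `prop`, `cum`); it recomputes the eigenvalues so that it is self-contained.

```r
ev <- eigen(cor(iris[, 1:4]))$values

prop <- ev / sum(ev)   # share of total variance per component
cum  <- cumsum(prop)   # running total across components

data.frame(PC         = paste0("PC", 1:4),
           percent    = round(100 * prop, 2),
           cumulative = round(100 * cum, 2))
```

A common rule of thumb is to keep as many components as needed to pass some cumulative threshold; the threshold itself is a judgment call.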



Loading vectors

Loadings are the correlations between the original variables and the unit-scaled components. We can define them as


\[ \text{Loadings} = \text{Eigenvectors} \cdot \sqrt{\text{Eigenvalues}} \]


and generate the so-called loading vectors for each component.

load1 = eig$vectors[,1]*(sqrt(eig$values[1]))
load2 = eig$vectors[,2]*(sqrt(eig$values[2]))
load_df = data.frame('Variable' = colnames(iris[-5]) ,load1,load2)
kable(load_df) %>% kable_styling(full_width = FALSE)
Variable load1 load2
Sepal.Length 0.8901688 -0.3608299
Sepal.Width -0.4601427 -0.8827163
Petal.Length 0.9915552 -0.0234152
Petal.Width 0.9649790 -0.0639998
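The statement that loadings are correlations between the original variables and the components can be checked directly. The sketch below is an illustration under its own names (`X`, `e`, `loadings`, `scores`): it computes all four loading vectors, projects the standardized data onto the eigenvectors, and compares the loadings with the variable-to-component correlations.

```r
X <- iris[, 1:4]
e <- eigen(cor(X))

# Loadings: each eigenvector scaled by the square root of its eigenvalue.
loadings <- e$vectors %*% diag(sqrt(e$values))

# Component scores: standardized data projected onto the eigenvectors.
scores <- scale(X) %*% e$vectors

# Correlation between every original variable and every component.
all.equal(unname(cor(X, scores)), loadings)
```

Because the same eigenvectors are used on both sides, the signs agree as well as the magnitudes.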

Score vectors

We can project the original observations (objects) onto the principal components; the projected values are called scores. To do this, we multiply the scaled data matrix by the matrix of eigenvectors, so that each score is a weighted sum of one observation’s scaled values.

The entries of each eigenvector (the loadings before rescaling by the square root of the eigenvalue) can be understood as the weights for each original variable when calculating the principal component.
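As a minimal sketch of that projection, the code below computes the scores by matrix multiplication and compares them with the output of `prcomp()`. The names `Z`, `e`, `scores_manual` and `pca_check` are introduced for this illustration.

```r
Z <- scale(iris[, 1:4])            # standardized data matrix
e <- eigen(cor(iris[, 1:4]))       # eigenvectors of the correlation matrix

scores_manual <- Z %*% e$vectors   # one row per flower, one column per component

pca_check <- prcomp(Z)             # same analysis via prcomp()

# The sign of an eigenvector is arbitrary, so compare absolute values.
all.equal(abs(unname(scores_manual)), abs(unname(pca_check$x)))

# prcomp() reports standard deviations; their squares are the eigenvalues.
all.equal(pca_check$sdev^2, e$values)
```

The comparison uses absolute values because two implementations may return the same component with opposite signs, which flips the corresponding axis without changing the analysis.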


In our dataset, the first two score vectors are

pca = prcomp(iris_scaled[-5])
score = data.frame(pca$x[,1:2])
kable(score) %>% kable_styling(full_width = FALSE, fixed_thead = TRUE) %>% scroll_box(height = '600px')
PC1 PC2
-2.2571412 -0.4784238
-2.0740130 0.6718827
-2.3563351 0.3407664
-2.2917068 0.5953999
-2.3818627 -0.6446757
-2.0687006 -1.4842053
-2.4358684 -0.0474851
-2.2253919 -0.2224030
-2.3268453 1.1116037
-2.1770349 0.4674476
-2.1590770 -1.0402059
-2.3183641 -0.1326340
-2.2110437 0.7262432
-2.6243090 0.9582963
-2.1913992 -1.8538466
-2.2546612 -2.6773152
-2.2002168 -1.4786557
-2.1830361 -0.4872061
-1.8922328 -1.4003276
-2.3355448 -1.1240836
-1.9079312 -0.4074906
-2.1996438 -0.9210359
-2.7650814 -0.4568133
-1.8125972 -0.0852729
-2.2197270 -0.1367962
-1.9453293 0.6235297
-2.0443028 -0.2413550
-2.1613365 -0.5253894
-2.1324196 -0.3121720
-2.2576980 0.3366042
-2.1329765 0.5028561
-1.8254792 -0.4222804
-2.6062169 -1.7875873
-2.4380098 -2.1435468
-2.1029299 0.4586653
-2.2004372 0.2054192
-2.0383177 -0.6593492
-2.5188934 -0.5903152
-2.4215203 0.9011611
-2.1624662 -0.2679812
-2.2788408 -0.4402405
-1.8519184 2.3296107
-2.5451120 0.4775010
-1.9578886 -0.4707496
-2.1299236 -1.1384155
-2.0628336 0.7086786
-2.3767708 -1.1166887
-2.3863817 0.3849572
-2.2220026 -0.9946277
-2.1964750 -0.0091856
1.0981024 -0.8600910
0.7288956 -0.5926294
1.2368358 -0.6142399
0.4061225 1.7485462
1.0718838 0.2077251
0.3873895 0.5913027
0.7440371 -0.7704383
-0.4856956 1.8462440
0.9248035 -0.0321185
0.0113880 1.0305658
-0.1098283 2.6452111
0.4392220 0.0630839
0.5602315 1.7588321
0.7171593 0.1856028
-0.0332433 0.4375374
0.8724843 -0.5073642
0.3490822 0.1956563
0.1582798 0.7894510
1.2210032 1.6168273
0.1643673 1.2982599
0.7352196 -0.3952474
0.4746969 0.4159269
1.2300573 0.9302094
0.6307451 0.4149974
0.7003151 0.0632001
0.8713545 -0.2499560
1.2523137 0.0769981
1.3538695 -0.3302055
0.6625807 0.2251735
-0.0401242 1.0551836
0.1303585 1.5570556
0.0233744 1.5672252
0.2407318 0.7746612
1.0575517 0.6317269
0.2232309 0.2868127
0.4277063 -0.8427589
1.0452264 -0.5203087
1.0410438 1.3783710
0.0693560 0.2187704
0.2825307 1.3248861
0.2781460 1.1162889
0.6224844 -0.0248398
0.3354067 0.9851038
-0.3609741 2.0124958
0.2876227 0.8528731
0.0910556 0.1805871
0.2269565 0.3836349
0.5744638 0.1543565
-0.4461723 1.5386375
0.2558734 0.5968523
1.8384100 -0.8675151
1.1540156 0.6965364
2.1979036 -0.5601340
1.4353421 0.0468307
1.8615758 -0.2940597
2.7426851 -0.7977367
0.3657922 1.5562892
2.2947518 -0.4186630
1.9999863 0.7090632
2.2522322 -1.9145963
1.3596206 -0.6904434
1.5973275 0.4202924
1.8776105 -0.4178498
1.2559077 1.1583797
1.4627449 0.4407949
1.5847682 -0.6739869
1.4665185 -0.2547683
2.4182277 -2.5481248
3.2996415 -0.0177216
1.2595471 1.7010467
2.0309126 -0.9074274
0.9747153 0.5698553
2.8879765 -0.4122600
1.3287806 0.4802025
1.6950553 -1.0105365
1.9478014 -1.0044127
1.1711801 0.3153381
1.0175417 -0.0641312
1.7823788 0.1867356
1.8574250 -0.5604133
2.4278203 -0.2584187
2.2972318 -2.6175544
1.8564838 0.1779533
1.1104277 0.2919446
1.1984584 0.8086064
2.7894256 -0.8539425
1.5709929 -1.0650132
1.3417970 -0.4210202
0.9217370 -0.0171656
1.8458612 -0.6738706
2.0080832 -0.6118359
1.8954342 -0.6872731
1.1540156 0.6965364
2.0337450 -0.8646240
1.9914755 -1.0456657
1.8642579 -0.3856740
1.5593565 0.8936929
1.5160915 -0.2681707
1.3682042 -1.0078779
0.9574485 0.0242504

Graphs


A score plot will look like this

species = iris$Species

ggplot(score, aes(PC1, PC2, color=species))+geom_point()+
  stat_ellipse()+
  xlab(paste('PC1 - ', values_prc$values_prc[1], '%'))+
  ylab(paste('PC2 - ', values_prc$values_prc[2], '%'))+
  theme_classic()+
  theme(plot.title = element_text(hjust = 0.5))


It’s immediately clear that the data are grouped into two major clusters: flowers of the species setosa seem to have different characteristics from the others.

A similar grouping could have been found through a classical cluster analysis. With PCA, however, we are also able to see which variables are most responsible for the variation.
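For comparison, here is one possible cluster analysis on the same scaled data. It is only a sketch: hierarchical clustering with Ward linkage and a cut into two groups are assumptions made for this illustration, and how closely the clusters match the setosa split depends on those choices.

```r
Z <- scale(iris[, 1:4])

# Hierarchical clustering on Euclidean distances (Ward linkage is an arbitrary choice here).
hc <- hclust(dist(Z), method = "ward.D2")
groups <- cutree(hc, k = 2)   # cut the tree into two clusters

# Cross-tabulate cluster membership against the known species.
table(cluster = groups, species = iris$Species)
```

The cross-tabulation shows which species fall into which cluster, but unlike the loadings it says nothing about which measurements drive the separation.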

Let’s take a look at the first two loading vectors we calculated earlier and see what happens when we plot them against each other.

names = colnames(iris[-5])

ggplot(load_df, aes(load1, load2))+geom_point(color = 'white')+
  xlab(paste('PC1 - ', values_prc$values_prc[1], '%'))+
  ylab(paste('PC2 - ', values_prc$values_prc[2], '%'))+
  theme_classic()+
  xlim(-1,1.5)+
  ylim(-1,1)+
  geom_label(label= names, size=3, nudge_x = 0.15, fontface='bold')+
  geom_segment(aes(xend=0, yend=0), color="brown1")+
  theme(plot.title = element_text(hjust = 0.5))


This graph shows us the direction of each variable along the principal components.

To better understand the relationship between loadings and scores, i.e. variables and objects, a biplot may be useful!

ggplot(score, aes(PC1,PC2, color = species))+geom_point()+xlim(-4,4)+ylim(-4,4)+
  theme_minimal()+
  xlab(paste('PC1 - ', values_prc$values_prc[1], '%'))+
  ylab(paste('PC2 - ', values_prc$values_prc[2], '%'))+
  geom_label(data=load_df, aes(load1*2,load2*2), label=names, size=3, color='black' )+
  stat_ellipse()










References

Fisher, R. A. 1936. “The Use of Multiple Measurements in Taxonomic Problems.” Annals of Eugenics 7 (2): 179–88. https://doi.org/10.1111/j.1469-1809.1936.tb02137.x.