Chen et al. (2018) examined morphological variation in the fruits of stone oaks (genus Lithocarpus), based on 595 nuts from 98 different species. The nuts have two primary tissues surrounding the seed, the exocarp and the receptacle.
The authors made a variety of measurements to quantify the morphology of the fruits, including aspects of the length (L), radius (r, R), and area (A) of the exocarp, receptacle, and seed space.
These parameters are available in the StoneOak.csv dataset.
StoneOaks <- read.csv("../Data/StoneOaks.csv")
head(StoneOaks)
## FruitType Species Theta Ae Re Ar Rr As Rs Le
## 1 ER L. amygdalifolius 61.5 0.33 0.78 0.62 0.68 1.28 0.39 2.12
## 2 ER L. amygdalifolius 63.5 0.27 0.64 0.33 0.63 1.30 0.41 1.86
## 3 ER L. amygdalifolius 67.5 0.21 0.64 0.67 0.69 1.16 0.42 1.46
## 4 ER L. amygdalifolius 67.5 0.25 0.70 0.79 0.80 1.59 0.47 1.67
## 5 AC L. bacgiangensis 76.0 0.39 0.81 0.09 0.53 0.96 0.45 2.24
## 6 AC L. bacgiangensis 85.0 0.31 0.75 0.13 0.54 0.65 0.40 1.87
## re Lr rr Ls rs
## 1 0.87 2.26 0.80 3.00 0.65
## 2 0.80 2.05 0.72 2.99 0.56
## 3 0.57 1.96 0.59 2.66 0.52
## 4 0.59 2.97 0.73 3.35 0.84
## 5 0.76 1.28 0.57 2.71 0.57
## 6 0.62 1.39 0.79 2.33 0.60
The authors went on to calculate additional variables related to the surface area and volume of the fruits and they also calculated means for each species, but we will work with their raw data.
With high dimensional multivariate data, it is often useful to explore the data a bit visually. One place to start is a matrix of scatterplots to examine the pairwise relationships between all the quantitative variables.
pairs(StoneOaks[3:15], panel=panel.smooth)
From this plot we can see that the different measures are correlated with one another to different degrees. That is many of them are not independent and the dataset might be described by a reduced set of independent principal components.
Using R, we can easily calculate the covariance matrix.
To examine correlations between the variables we can use the correlation matrix. Doing so reveals a wide variation in the pairwise correlation among the variables. Remember that correlations are normalized covariances and can range from -1 to 1.
StoneOakCov <- cov(StoneOaks[,3:15])
StoneOakCov
## Theta Ae Re Ar Rr As
## Theta 149.7607483 0.60167239 1.08175293 3.314593176 1.71832790 2.13752071
## Ae 0.6016724 0.03828256 0.03132380 0.042970480 0.02339404 0.05189740
## Re 1.0817529 0.03132380 0.06488572 0.050367021 0.03469600 0.07019661
## Ar 3.3145932 0.04297048 0.05036702 0.246078837 0.07914478 0.13372308
## Rr 1.7183279 0.02339404 0.03469600 0.079144780 0.05128747 0.08524873
## As 2.1375207 0.05189740 0.07019661 0.133723083 0.08524873 0.22781883
## Rs 0.7366638 0.01460227 0.02463396 0.035002446 0.02232554 0.04971484
## Le -1.4894656 0.06700876 0.08178344 0.007653201 0.01519463 0.12888984
## re 0.9166048 0.02498523 0.04277429 0.041159183 0.02989438 0.06363018
## Lr 7.1424637 0.08034449 0.11981844 0.378538081 0.20213298 0.34143382
## rr 1.3963691 0.01944005 0.02981709 0.057713284 0.04310170 0.07523087
## Ls 3.0134546 0.08456967 0.11716076 0.212185292 0.12277305 0.30958098
## rs 0.8278724 0.01668589 0.02657458 0.038004801 0.02764905 0.06333366
## Rs Le re Lr rr
## Theta 0.73666376 -1.489465620 0.91660476 7.142463679 1.39636910
## Ae 0.01460227 0.067008757 0.02498523 0.080344491 0.01944005
## Re 0.02463396 0.081783441 0.04277429 0.119818444 0.02981709
## Ar 0.03500245 0.007653201 0.04115918 0.378538081 0.05771328
## Rr 0.02232554 0.015194634 0.02989438 0.202132978 0.04310170
## As 0.04971484 0.128889842 0.06363018 0.341433823 0.07523087
## Rs 0.01765000 0.037212876 0.02247653 0.083190448 0.02058125
## Le 0.03721288 0.358704478 0.06803388 -0.003366839 0.01892589
## re 0.02247653 0.068033877 0.05727818 0.101316074 0.02679569
## Lr 0.08319045 -0.003366839 0.10131607 0.944171148 0.16758050
## rr 0.02058125 0.018925894 0.02679569 0.167580501 0.04634332
## Ls 0.07744979 0.227054061 0.10426093 0.476987679 0.10871888
## rs 0.01854388 0.041567851 0.02415352 0.099176520 0.02581814
## Ls rs
## Theta 3.01345458 0.82787236
## Ae 0.08456967 0.01668589
## Re 0.11716076 0.02657458
## Ar 0.21218529 0.03800480
## Rr 0.12277305 0.02764905
## As 0.30958098 0.06333366
## Rs 0.07744979 0.01854388
## Le 0.22705406 0.04156785
## re 0.10426093 0.02415352
## Lr 0.47698768 0.09917652
## rr 0.10871888 0.02581814
## Ls 0.50090500 0.09327469
## rs 0.09327469 0.02794408
One thing to notice is that the variable Theta has a substantially higher variance (the diagonal elements) than any of the other measures. This tells us that we should center and scale the variables before performing the PCA. To examine correlations between the variables we can use the correlation matrix, which is equivalent to the covariance matrix of the centered and scaled variables. Doing so reveals a wide variation in the pairwise correlation among the variables. Remember that correlations are normalized covariances and can range from -1 to 1.
StoneOakCor<-cor(StoneOaks[,3:15])
StoneOakCor
## Theta Ae Re Ar Rr As
## Theta 1.0000000 0.2512816 0.3470200 0.54600183 0.6200139 0.3659456
## Ae 0.2512816 1.0000000 0.6284916 0.44272352 0.5279578 0.5557127
## Re 0.3470200 0.6284916 1.0000000 0.39859749 0.6014495 0.5773601
## Ar 0.5460018 0.4427235 0.3985975 1.00000000 0.7044976 0.5647737
## Rr 0.6200139 0.5279578 0.6014495 0.70449761 1.0000000 0.7886558
## As 0.3659456 0.5557127 0.5773601 0.56477368 0.7886558 1.0000000
## Rs 0.4531041 0.5617557 0.7279263 0.53111542 0.7420343 0.7840052
## Le -0.2032184 0.5718244 0.5360711 0.02575951 0.1120253 0.4508743
## re 0.3129598 0.5335661 0.7016389 0.34668496 0.5515556 0.5570238
## Lr 0.6006530 0.4226008 0.4840874 0.78532076 0.9185572 0.7361840
## rr 0.5300388 0.4615335 0.5437474 0.54043701 0.8840875 0.7321629
## Ls 0.3479269 0.6107121 0.6498752 0.60436661 0.7659841 0.9164353
## rs 0.4046872 0.5101573 0.6240896 0.45830689 0.7303477 0.7937703
## Rs Le re Lr rr Ls
## Theta 0.4531041 -0.203218426 0.3129598 0.600653037 0.5300388 0.3479269
## Ae 0.5617557 0.571824407 0.5335661 0.422600752 0.4615335 0.6107121
## Re 0.7279263 0.536071117 0.7016389 0.484087358 0.5437474 0.6498752
## Ar 0.5311154 0.025759509 0.3466850 0.785320764 0.5404370 0.6043666
## Rr 0.7420343 0.112025328 0.5515556 0.918557172 0.8840875 0.7659841
## As 0.7840052 0.450874307 0.5570238 0.736183999 0.7321629 0.9164353
## Rs 1.0000000 0.467683972 0.7069071 0.644429998 0.7196242 0.8237022
## Le 0.4676840 1.000000000 0.4746377 -0.005785336 0.1467893 0.5356526
## re 0.7069071 0.474637684 1.0000000 0.435670729 0.5200878 0.6155301
## Lr 0.6444300 -0.005785336 0.4356707 1.000000000 0.8011329 0.6935921
## rr 0.7196242 0.146789314 0.5200878 0.801132874 1.0000000 0.7135653
## Ls 0.8237022 0.535652569 0.6155301 0.693592065 0.7135653 1.0000000
## rs 0.8349938 0.415187467 0.6037275 0.610574362 0.7174414 0.7883902
## rs
## Theta 0.4046872
## Ae 0.5101573
## Re 0.6240896
## Ar 0.4583069
## Rr 0.7303477
## As 0.7937703
## Rs 0.8349938
## Le 0.4151875
## re 0.6037275
## Lr 0.6105744
## rr 0.7174414
## Ls 0.7883902
## rs 1.0000000
max(StoneOakCor[StoneOakCor<1])
## [1] 0.9185572
min(abs(StoneOakCor))
## [1] 0.005785336
min(StoneOakCor)
## [1] -0.2032184
Now we can do the Principle Component Analysis “by hand” by doing an eigen analysis of the correlation matrix.
StoneOak.ManPCA<-eigen(StoneOakCor)
StoneOak.ManPCA$values/(sum(StoneOak.ManPCA$values)) #Proportion of variance in each PC
## [1] 0.611233831 0.146711268 0.055029524 0.050024802 0.029968930
## [6] 0.029110084 0.021655743 0.017403588 0.012567671 0.010035605
## [11] 0.007786416 0.004904356 0.003568180
StoneOak.ManPCA$vectors
## [,1] [,2] [,3] [,4] [,5]
## [1,] -0.1974044 0.40929126 0.513322558 -0.09978369 -0.222765129
## [2,] -0.2436566 -0.24707163 0.245054036 0.57302345 -0.520723504
## [3,] -0.2726068 -0.22636199 0.418971786 -0.05120042 0.042307879
## [4,] -0.2474476 0.29010072 0.006332077 0.56377516 0.512554456
## [5,] -0.3220976 0.22779735 -0.057159876 -0.01997325 -0.132314470
## [6,] -0.3155902 -0.04602524 -0.398673333 -0.01038796 0.005456579
## [7,] -0.3193796 -0.09846388 0.015509802 -0.23678390 0.125335564
## [8,] -0.1500575 -0.60736826 -0.086255766 0.15456582 -0.010977704
## [9,] -0.2569560 -0.22089468 0.415377556 -0.27775420 0.454889344
## [10,] -0.2939702 0.32631433 -0.133510547 0.12978869 0.088112876
## [11,] -0.2999453 0.17426657 -0.137059951 -0.23099556 -0.369729327
## [12,] -0.3223068 -0.10969427 -0.299014542 0.06085816 0.123645221
## [13,] -0.3032488 -0.07589222 -0.180066718 -0.33121626 -0.121153563
## [,6] [,7] [,8] [,9] [,10]
## [1,] 0.5780064 -0.11390392 -0.33089539 -0.03778539 -0.10002578
## [2,] -0.1121452 -0.28911764 0.23153747 0.15060265 0.19736230
## [3,] -0.1626784 0.75629263 -0.05041847 0.22529590 -0.03624990
## [4,] 0.1273928 0.05175965 0.25856260 -0.24875432 -0.19898895
## [5,] -0.2582862 0.03755131 -0.04608092 0.09394954 -0.05401412
## [6,] 0.1319615 -0.10168128 -0.34231354 0.40743507 0.11015806
## [7,] 0.2189837 0.07116547 0.26570963 -0.34778361 0.72360823
## [8,] 0.2180912 0.04760340 -0.29464822 -0.40646246 -0.36177958
## [9,] -0.3268321 -0.54722830 -0.05902572 0.08148467 -0.10083794
## [10,] -0.2626415 0.09130232 -0.10964321 0.11046563 -0.01156756
## [11,] -0.3791696 -0.01256630 -0.05912813 -0.56479342 -0.16015469
## [12,] 0.1564896 -0.03340364 -0.26661182 0.08592225 0.13824889
## [13,] 0.2948513 -0.02511660 0.63300643 0.23883566 -0.43072296
## [,11] [,12] [,13]
## [1,] -0.02475046 0.008802056 0.04184261
## [2,] 0.03725101 -0.020611282 0.08249909
## [3,] 0.17068203 -0.104132624 0.03826058
## [4,] 0.18984921 -0.181398063 -0.13353193
## [5,] -0.20144505 0.392997975 -0.73903403
## [6,] 0.02913990 -0.621315144 -0.17104137
## [7,] -0.20721766 -0.039945478 -0.05973507
## [8,] -0.37011609 0.046614735 -0.07233894
## [9,] 0.02528321 -0.028161113 0.02380228
## [10,] -0.59509048 0.090636510 0.54831013
## [11,] 0.32603898 -0.236667025 0.12822783
## [12,] 0.49918996 0.586737744 0.24300217
## [13,] -0.03473355 0.047115394 0.10591417
Let’s see how this compares with the results from princomp() using the correlation matrix.
StoneOak.PCA<-princomp(StoneOaks[,3:15], cor=TRUE, scores=TRUE)
summary(StoneOak.PCA)
## Importance of components:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5
## Standard deviation 2.8188721 1.3810309 0.84580365 0.8064257 0.62417633
## Proportion of Variance 0.6112338 0.1467113 0.05502952 0.0500248 0.02996893
## Cumulative Proportion 0.6112338 0.7579451 0.81297462 0.8629994 0.89296836
## Comp.6 Comp.7 Comp.8 Comp.9
## Standard deviation 0.61516753 0.53058898 0.47565391 0.40420258
## Proportion of Variance 0.02911008 0.02165574 0.01740359 0.01256767
## Cumulative Proportion 0.92207844 0.94373418 0.96113777 0.97370544
## Comp.10 Comp.11 Comp.12 Comp.13
## Standard deviation 0.3611964 0.318156264 0.252500756 0.21537489
## Proportion of Variance 0.0100356 0.007786416 0.004904356 0.00356818
## Cumulative Proportion 0.9837410 0.991527463 0.996431820 1.00000000
loadings(StoneOak.PCA)
##
## Loadings:
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8 Comp.9
## Theta 0.197 0.409 0.513 0.223 0.578 0.114 0.331
## Ae 0.244 -0.247 0.245 -0.573 0.521 -0.112 0.289 -0.232 -0.151
## Re 0.273 -0.226 0.419 -0.163 -0.756 -0.225
## Ar 0.247 0.290 -0.564 -0.513 0.127 -0.259 0.249
## Rr 0.322 0.228 0.132 -0.258
## As 0.316 -0.399 0.132 0.102 0.342 -0.407
## Rs 0.319 0.237 -0.125 0.219 -0.266 0.348
## Le 0.150 -0.607 -0.155 0.218 0.295 0.406
## re 0.257 -0.221 0.415 0.278 -0.455 -0.327 0.547
## Lr 0.294 0.326 -0.134 -0.130 -0.263 0.110 -0.110
## rr 0.300 0.174 -0.137 0.231 0.370 -0.379 0.565
## Ls 0.322 -0.110 -0.299 -0.124 0.156 0.267
## rs 0.303 -0.180 0.331 0.121 0.295 -0.633 -0.239
## Comp.10 Comp.11 Comp.12 Comp.13
## Theta 0.100
## Ae -0.197
## Re -0.171 -0.104
## Ar 0.199 -0.190 -0.181 -0.134
## Rr 0.201 0.393 -0.739
## As -0.110 -0.621 -0.171
## Rs -0.724 0.207
## Le 0.362 0.370
## re 0.101
## Lr 0.595 0.548
## rr 0.160 -0.326 -0.237 0.128
## Ls -0.138 -0.499 0.587 0.243
## rs 0.431 0.106
##
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6 Comp.7 Comp.8
## SS loadings 1.000 1.000 1.000 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.077 0.077 0.077 0.077 0.077 0.077 0.077 0.077
## Cumulative Var 0.077 0.154 0.231 0.308 0.385 0.462 0.538 0.615
## Comp.9 Comp.10 Comp.11 Comp.12 Comp.13
## SS loadings 1.000 1.000 1.000 1.000 1.000
## Proportion Var 0.077 0.077 0.077 0.077 0.077
## Cumulative Var 0.692 0.769 0.846 0.923 1.000
Notice that the eigenvalues, when normalized by their sum, describe the proportion of variance in the data contained by that PC. Also, the loadings of the different variables on the PC actually correspond to the eigenvectors of the correlation matrix.
We can also visually examine the PCA. A “screeplot” gives a visual representation of eigenvalues. In our case we see that the first two or three PCs contain the bulk of the variance present in the data.
screeplot(StoneOak.PCA)
biplot(StoneOak.PCA, cex=0.5, xlabs=rep("·", nrow(StoneOaks)))
In these “biplots” the points correspond to the component scores of the individual observations, while the vectors correspond to the loading of each of the original variables on each component. We can also extract the scores and add them back to the original data.
StoneOaksPCScores<-as.data.frame(StoneOak.PCA$scores)
head(StoneOaksPCScores)
## Comp.1 Comp.2 Comp.3 Comp.4 Comp.5 Comp.6
## 1 2.842425 0.03667149 -0.3693292 0.07378349 -0.03827781 -1.293314331
## 2 2.011859 0.30272532 -0.4995037 0.34976411 0.05955874 -1.016143417
## 3 1.409203 1.33018459 -0.3633893 -0.15071107 -0.10815532 -0.517530326
## 4 3.615345 1.29749715 -1.4547347 0.27539535 0.10277510 -0.225302171
## 5 1.654915 -0.50988993 0.9514592 0.30350444 0.68581875 0.043195158
## 6 1.323148 0.81171348 1.1647015 0.70552097 1.37782601 -0.008722672
## Comp.7 Comp.8 Comp.9 Comp.10 Comp.11 Comp.12
## 1 0.31444537 0.03862543 0.3617406 0.67826900 -0.198461519 -0.17680220
## 2 0.55404140 0.39781634 0.2395934 0.01434514 -0.165045962 -0.10181179
## 3 -0.09336016 0.01902432 0.1625734 -0.23902905 -0.054723676 -0.08797517
## 4 -0.15543292 -0.53984109 -0.2960682 0.37367076 0.115257259 0.07215872
## 5 0.22067397 0.31393496 0.3368367 -0.06753862 0.003897437 0.06802637
## 6 0.01748807 0.09829213 0.9939816 0.47496665 -0.193538178 -0.02182579
## Comp.13
## 1 0.020505474
## 2 0.003904616
## 3 -0.491211822
## 4 0.036876438
## 5 -0.046002448
## 6 0.144771833
StoneOaksPC<-cbind(StoneOaks, StoneOaksPCScores)
Now we can see if the two fruit types are shaped differently.
require(lattice)
## Loading required package: lattice
xyplot(Comp.2~Comp.1, groups=FruitType, data=StoneOaksPC)
xyplot(Comp.3~Comp.2, groups=FruitType, data=StoneOaksPC)
xyplot(Comp.3~Comp.1, groups=FruitType, data=StoneOaksPC)
There are also some more flexible plotting tools for examining PCA models.
require(ggfortify)
## Loading required package: ggfortify
## Loading required package: ggplot2
autoplot(StoneOak.PCA, data=StoneOaks, colour="FruitType", loadings.label=TRUE, loadings.label.repel=TRUE, loadings.colour="black", loadings.label.colour="black", frame=TRUE, frame.type="t", frame.colour="FruitType")
autoplot(StoneOak.PCA, data=StoneOaks, x=2, y=3, colour="FruitType", loadings.label=TRUE, loadings.label.repel=TRUE, loadings.colour="black", loadings.label.colour="black", frame=TRUE, frame.type="t", frame.colour="FruitType")
Often it is useful to examine the PC loadings in light of the underlying variables. In this case, the positive loadings on all the variable on the first PC suggest that it is a measure of overall size of the fruit. The second PC represents differential size of the exocarp (e measures) relative to the receptacle (r measures) and Theta (the tip angle).