Principal Component Analysis

This is a non-restricted/unsupervised model which only uses a response matrix. This procedure convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. This is a method of dimensionality reduction, so it will find the components which can explain the biggest amount of variance in the data.

The assumptions to be considered here are:

Independence between samples.
No outliers nor NA values, since they can create an unreal horseshoe representation of the data.
Multivariant normality.
Linear relation between samples.

1- Assumptions of the model

1.1. Multivariant normality

mshapiro.test(t(env))

## 
##  Shapiro-Wilk normality test
## 
## data:  Z
## W = 0.55746, p-value = 3.276e-08

The multivariant normality assumption can not be accepted.

1.2. Linear relationship

panel.hist <- function(x, ...)
    {
      usr <- par("usr"); on.exit(par(usr))
      par(usr = c(usr[1:2], 0, 1.5) )
      h <- hist(x, plot = FALSE)
      breaks <- h$breaks; nB <- length(breaks)
      y <- h$counts; y <- y/max(y)
      rect(breaks[-nB], 0, breaks[-1], y, col = "cyan", ...)
    }
panel.cor <- function(x, y, digits = 2, prefix = "", cex.cor, ...)
    {
      usr <- par("usr"); on.exit(par(usr))
      par(usr = c(0, 1, 0, 1))
      r <- abs(cor(x, y))
      txt <- format(c(r, 0.123456789), digits = digits)[1]
      txt <- paste(prefix, txt, sep = "")
      if(missing(cex.cor)) cex.cor <- 0.8/strwidth(txt)
      text(0.5, 0.5, txt, cex = cex.cor * r)
    }
pairs(env, diag.panel = panel.hist, lower.panel = panel.smooth, upper.panel = panel.cor)

Maybe this is not a test easy to check. We should try with another one:

PerformanceAnalytics::chart.Correlation(doubs$env, histogram = TRUE, pch = 19)

Not all the relationships between variables are linear. Despite both of the assumptions are violated and we should be careful with this dataset, we will continue our analysis as an example.

2- PCA model

The PCA model uses the distance matrix of the dataset. So we have to scale it before creating the model.

env.pca <- rda(env, scale = TRUE) 
summary(env.pca) # By default scaling 2

## 
## Call:
## rda(X = env, scale = TRUE) 
## 
## Partitioning of correlations:
##               Inertia Proportion
## Total              11          1
## Unconstrained      11          1
## 
## Eigenvalues, and their contribution to the correlations 
## 
## Importance of components:
##                         PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Eigenvalue            6.446 2.2252 0.99825 0.39831 0.36224 0.25361 0.16106
## Proportion Explained  0.586 0.2023 0.09075 0.03621 0.03293 0.02306 0.01464
## Cumulative Proportion 0.586 0.7883 0.87903 0.91524 0.94817 0.97123 0.98587
##                           PC8      PC9     PC10     PC11
## Eigenvalue            0.11062 0.022949 0.017476 0.004378
## Proportion Explained  0.01006 0.002086 0.001589 0.000398
## Cumulative Proportion 0.99593 0.998013 0.999602 1.000000
## 
## Scaling 2 for species and site scores
## * Species are scaled proportional to eigenvalues
## * Sites are unscaled: weighted dispersion equal on all dimensions
## * General scaling constant of scores:  4.189264 
## 
## 
## Species scores
## 
##          PC1     PC2      PC3       PC4       PC5      PC6
## dfs  1.10786  0.4901 -0.20230  0.045871 -0.187500 -0.20053
## alt -1.06604 -0.5607  0.15135  0.193929  0.096582 -0.05541
## slo -0.95746 -0.5295 -0.25933 -0.352365  0.066943 -0.39476
## flo  0.98732  0.6262 -0.23731  0.009137 -0.105599 -0.31209
## pH  -0.02679  0.4991  1.14319 -0.014864 -0.005398 -0.17594
## har  0.91493  0.5514 -0.09538 -0.128923  0.639664  0.07145
## pho  1.02238 -0.6481  0.20016 -0.178923  0.040772 -0.04615
## nit  1.14057 -0.1628  0.05081 -0.283125 -0.272940  0.20434
## amm  0.96776 -0.7382  0.18740 -0.198493 -0.021140  0.04733
## oxy -0.99282  0.4993  0.04744 -0.543295 -0.012373  0.06953
## bdo  0.96051 -0.7274  0.09630  0.089149  0.178489 -0.14521
## 
## 
## Site scores (weighted sums of species scores)
## 
##         PC1     PC2       PC3      PC4     PC5        PC6
## 1  -1.35309 -1.0614 -0.626184 -1.14841 -1.0494 -1.821e+00
## 2  -1.05499 -0.7841  0.195705  0.90757 -1.7294  2.655e-01
## 3  -0.97457 -0.4890  1.340010  0.61152 -0.8592 -7.288e-01
## 4  -0.90281 -0.3118  0.000372  0.17712  0.2052  5.288e-01
## 5  -0.45666 -0.6973  0.550276  1.16964  1.2500  1.273e-01
## 6  -0.81070 -0.7590 -0.318284  0.77249 -0.2537  1.263e-01
## 7  -0.85277 -0.1858  0.231151 -0.37159  1.3564 -3.616e-01
## 9  -0.27926 -0.4610  0.061754  1.60909  1.2127  8.870e-01
## 10 -0.59145 -0.5554 -1.595293 -0.35281  0.6397 -2.559e-01
## 11 -0.34078  0.3167  0.005014 -1.23777  0.7883 -2.345e-01
## 12 -0.44165  0.3209 -0.697214 -0.46903  0.3928  5.963e-01
## 13 -0.39855  0.6314  0.003511 -0.90415  1.0778  5.002e-05
## 14 -0.22649  0.7350  0.946755 -0.87823  0.8730  3.814e-02
## 15 -0.21927  1.0432  2.269502 -0.19576 -0.1024 -2.360e-01
## 16 -0.16778  0.2507 -0.340084 -0.54136 -0.1054  6.195e-01
## 17  0.14914  0.3628 -0.171537 -0.14337 -0.1372  1.435e+00
## 18  0.08633  0.3674 -0.238531 -0.44014 -0.2901  1.015e+00
## 19  0.10967  0.4821  0.232435 -0.28363 -0.7316  9.414e-01
## 20  0.18575  0.3732 -0.268777 -0.69333 -0.9729  1.131e+00
## 21  0.16766  0.3112 -0.834583  0.21603 -0.7147  5.100e-01
## 22  0.13041  0.4842 -0.108135  0.18812 -0.2895 -5.716e-01
## 23  1.28519 -1.3164  0.715652 -0.57204  0.6499 -1.021e+00
## 24  1.01542 -0.4735  0.019519  1.43289  0.5981  1.440e-01
## 25  2.10059 -2.1406  0.361157 -1.21146 -0.1786  4.424e-01
## 26  0.89379 -0.1213 -0.671652  0.86581 -0.3046 -8.871e-02
## 27  0.61092  0.3178 -0.139402  0.31511 -0.8178 -9.684e-01
## 28  0.82353  0.8569  0.802337 -0.04239 -0.8747  5.103e-02
## 29  0.67793  1.0652 -1.729365  0.28387  0.2944 -1.028e+00
## 30  0.83450  1.4380  0.003890  0.93621  0.0730 -1.543e+00

summary(env.pca, scaling = 1)

## 
## Call:
## rda(X = env, scale = TRUE) 
## 
## Partitioning of correlations:
##               Inertia Proportion
## Total              11          1
## Unconstrained      11          1
## 
## Eigenvalues, and their contribution to the correlations 
## 
## Importance of components:
##                         PC1    PC2     PC3     PC4     PC5     PC6     PC7
## Eigenvalue            6.446 2.2252 0.99825 0.39831 0.36224 0.25361 0.16106
## Proportion Explained  0.586 0.2023 0.09075 0.03621 0.03293 0.02306 0.01464
## Cumulative Proportion 0.586 0.7883 0.87903 0.91524 0.94817 0.97123 0.98587
##                           PC8      PC9     PC10     PC11
## Eigenvalue            0.11062 0.022949 0.017476 0.004378
## Proportion Explained  0.01006 0.002086 0.001589 0.000398
## Cumulative Proportion 0.99593 0.998013 0.999602 1.000000
## 
## Scaling 1 for species and site scores
## * Sites are scaled proportional to eigenvalues
## * Species are unscaled: weighted dispersion equal on all dimensions
## * General scaling constant of scores:  4.189264 
## 
## 
## Species scores
## 
##        PC1     PC2     PC3      PC4      PC5     PC6
## dfs  1.447  1.0896 -0.6715  0.24106 -1.03324 -1.3207
## alt -1.393 -1.2467  0.5024  1.01912  0.53222 -0.3649
## slo -1.251 -1.1772 -0.8608 -1.85173  0.36890 -2.5998
## flo  1.290  1.3924 -0.7878  0.04801 -0.58192 -2.0554
## pH  -0.035  1.1098  3.7948 -0.07811 -0.02975 -1.1587
## har  1.195  1.2260 -0.3166 -0.67751  3.52493  0.4706
## pho  1.336 -1.4410  0.6644 -0.94026  0.22468 -0.3040
## nit  1.490 -0.3619  0.1687 -1.48786 -1.50406  1.3457
## amm  1.264 -1.6413  0.6221 -1.04311 -0.11649  0.3117
## oxy -1.297  1.1100  0.1575 -2.85509 -0.06818  0.4579
## bdo  1.255 -1.6174  0.3197  0.46849  0.98358 -0.9563
## 
## 
## Site scores (weighted sums of species scores)
## 
##         PC1      PC2        PC3       PC4      PC5        PC6
## 1  -1.03580 -0.47737 -0.1886364 -0.218531 -0.19043 -2.765e-01
## 2  -0.80759 -0.35267  0.0589557  0.172702 -0.31383  4.031e-02
## 3  -0.74604 -0.21992  0.4036746  0.116366 -0.15591 -1.107e-01
## 4  -0.69110 -0.14022  0.0001121  0.033704  0.03725  8.029e-02
## 5  -0.34957 -0.31363  0.1657693  0.222572  0.22683  1.933e-02
## 6  -0.62059 -0.34137 -0.0958822  0.146997 -0.04605  1.918e-02
## 7  -0.65279 -0.08358  0.0696338 -0.070711  0.24615 -5.490e-02
## 9  -0.21377 -0.20732  0.0186032  0.306194  0.22006  1.347e-01
## 10 -0.45276 -0.24979 -0.4805779 -0.067136  0.11609 -3.886e-02
## 11 -0.26087  0.14244  0.0015104 -0.235536  0.14305 -3.560e-02
## 12 -0.33808  0.14434 -0.2100341 -0.089252  0.07127  9.055e-02
## 13 -0.30509  0.28397  0.0010576 -0.172050  0.19558  7.596e-06
## 14 -0.17338  0.33057  0.2852075 -0.167119  0.15842  5.792e-03
## 15 -0.16785  0.46919  0.6836817 -0.037251 -0.01858 -3.583e-02
## 16 -0.12843  0.11274 -0.1024494 -0.103016 -0.01913  9.406e-02
## 17  0.11417  0.16318 -0.0516750 -0.027282 -0.02491  2.178e-01
## 18  0.06609  0.16523 -0.0718567 -0.083754 -0.05265  1.542e-01
## 19  0.08395  0.21681  0.0700204 -0.053972 -0.13277  1.429e-01
## 20  0.14219  0.16784 -0.0809685 -0.131933 -0.17656  1.717e-01
## 21  0.12834  0.13995 -0.2514160  0.041108 -0.12969  7.744e-02
## 22  0.09983  0.21776 -0.0325754  0.035797 -0.05254 -8.679e-02
## 23  0.98382 -0.59208  0.2155883 -0.108853  0.11794 -1.550e-01
## 24  0.77731 -0.21296  0.0058801  0.272664  0.10854  2.186e-02
## 25  1.60801 -0.96276  0.1087977 -0.230529 -0.03241  6.717e-02
## 26  0.68419 -0.05457 -0.2023335  0.164755 -0.05528 -1.347e-02
## 27  0.46766  0.14295 -0.0419945  0.059962 -0.14840 -1.470e-01
## 28  0.63041  0.38542  0.2417020 -0.008066 -0.15873  7.749e-03
## 29  0.51895  0.47910 -0.5209667  0.054018  0.05342 -1.561e-01
## 30  0.63881  0.64676  0.0011718  0.178152  0.01325 -2.343e-01

The total amount of variance that we can explain with two components is 78.83%, so this is a good model. We also can see the coordenates along each PC for each sample.

3- Selection of the PC

At this step we need to select the PC that best fits to the data. To do this, we will see the next criteria:

3.1. Kaiser-Guttman test (Eigenvalues)

(ev <- env.pca$CA$eig)

##         PC1         PC2         PC3         PC4         PC5         PC6 
## 6.445919056 2.225185922 0.998250385 0.398313507 0.362238435 0.253611800 
##         PC7         PC8         PC9        PC10        PC11 
## 0.161055460 0.110621889 0.022948981 0.017476247 0.004378319

ev[ev > mean(ev)]

##      PC1      PC2 
## 6.445919 2.225186

Any value >1 is supposed to be good for our model. In this case, we would choose both PC1 and PC2.

3.2. Broken-stick plot

For this purpose we will use the evplot() function created by François Gillet (2007)

n <- length(ev)
bsm <- data.frame(j = seq(1: n), p = 0)
bsm$p[1] <- 1/n
for (i in 2:n){bsm$p[i] = bsm$p[i - 1] + (1/(n + 1 - i))}
bsm$p <- 100 * bsm$p/ n
    
evplot <- function(ev) {
    n = length(ev)
    bsm = data.frame(j=seq(1:n), p=0)
   bsm$p[1] = 1/n
    for (i in 2:n) {
       bsm$p[i] = bsm$p[i-1] + (1/(n + 1 - i))
     }
    bsm$p = 100*bsm$p/n
# Plot eigenvalues and % of variation for each axis
    op <- par(mfrow = c(2,1))
    barplot(ev, main = "Eigenvalues", col = "bisque", las = 2)
    abline(h = mean(ev), col = "red")
    legend("topright", "Average eigenvalue", lwd = 1, col = 2, bty = "n")
    barplot(t(cbind(100 * ev/sum(ev), bsm$p[n: 1])), beside = TRUE, 
             main = "% variation", col = c("bisque", 2), las = 2)
    legend("topright", c("% eigenvalue", "Broken stick model"), 
           pch = 15, col = c("bisque", 2), bty = "n")
    par(op)
}
evplot(ev)

The eigenvalue is higher in the PC1 and PC2 cases, so that is the number of components we need.

4- Biplot

We are going to create a plot where we can see how much variance of the axis is explained by each factor.

"cleanplot.pca" <- function(res.pca, ax1 = 1, ax2 = 2, point = FALSE, 
                                ahead = 0.07, cex = 0.7) 
    {
      require("vegan")
      par(mfrow = c(1, 2))
      p <- length(res.pca$CA$eig)
# Scaling 1: "species" scores scaled to relative eigenvalues
      sit.sc1 <- scores(res.pca, display = "wa", scaling = 1, choices = c(1: p))
      spe.sc1 <- scores(res.pca, display = "sp", scaling = 1, choices = c(1: p))
      plot(res.pca, choices = c(ax1, ax2), display = c("wa", "sp"), type = "n", 
           main = "PCA - scaling 1", scaling = 1)
      if(point)
      {
        points(sit.sc1[,ax1], sit.sc1[,ax2], pch = 20)
        text(res.pca, display = "wa", choices = c(ax1, ax2), cex = cex, pos = 3, scaling = 1)
      }
      else
      {
        text(res.pca, display = "wa", choices = c(ax1, ax2), cex = cex, scaling = 1)
      }
      text(res.pca, display = "sp", choices = c(ax1, ax2), cex = cex, pos = 4, 
           col = "red", scaling = 1)
      arrows(0, 0, spe.sc1[,ax1], spe.sc1[,ax2], length = ahead, angle = 20, col = "red")
      pcacircle(res.pca)
      # Scaling 2: site scores scaled to relative eigenvalues
      sit.sc2 <- scores(res.pca, display = "wa", choices = c(1: p))
      spe.sc2 <- scores(res.pca, display = "sp", choices = c(1: p))
      plot(res.pca, choices = c(ax1, ax2), display = c("wa", "sp"), type = "n", 
           main = "PCA - scaling 2")
      if (point) {
        points(sit.sc2[,ax1], sit.sc2[,ax2], pch = 20)
        text(res.pca, display = "wa", choices = c(ax1, ax2), cex = cex, pos = 3)
      }
      else
      {
        text(res.pca, display = "wa", choices = c(ax1, ax2), cex = cex)
      }
      text(res.pca, display = "sp", choices = c(ax1, ax2), cex = cex, pos = 4, col = "red")
      arrows(0, 0, spe.sc2[,ax1], spe.sc2[,ax2], length=ahead, angle=20, col="red")
    }
"pcacircle" <- function (pca) 
    { 
      eigenv <- pca$CA$eig
      p <- length(eigenv)
      n <- nrow(pca$CA$u)
      tot <- sum(eigenv)
      const <- ((n - 1) * tot) ^ 0.25
      radius <- (2/ p) ^ 0.5
      radius <- radius * const
      symbols(0, 0, circles = radius, inches = FALSE, add = TRUE, fg = 2)
    }
cleanplot.pca(env.pca)

In this kind of plot we can see the values which are similar depending on the distance between them. The scaling 1 plot also represents the contribution of every variable to the axis according to its size. The scaling 2 plot is a representation of the correlation between values where the correlation is the cosine of the angle (90º means no correlation at all).

Redundancy analysis (RDA)

This is the supervised version of the PCA, that is to say, an analysis where we need an explanatory matrix and a response matrix.

1- Transformation of the data

spe.hel <- decostand(spe, "hellinger")

2- Creation of the model

(spe.rda <- rda(spe.hel ~ ., env[, -1]))   # Scaling = 2 (by default)

## Call: rda(formula = spe.hel ~ alt + slo + flo + pH + har + pho +
## nit + amm + oxy + bdo, data = env[, -1])
## 
##               Inertia Proportion Rank
## Total          0.5025     1.0000     
## Constrained    0.3590     0.7144   10
## Unconstrained  0.1435     0.2856   18
## Inertia is variance 
## 
## Eigenvalues for constrained axes:
##    RDA1    RDA2    RDA3    RDA4    RDA5    RDA6    RDA7    RDA8    RDA9 
## 0.22606 0.05338 0.03152 0.02750 0.00680 0.00570 0.00304 0.00279 0.00134 
##   RDA10 
## 0.00087 
## 
## Eigenvalues for unconstrained axes:
##     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8 
## 0.04579 0.02424 0.01828 0.01541 0.01005 0.00852 0.00522 0.00480 
## (Showed only 8 of all 18 unconstrained eigenvalues)

summary(spe.rda)

## 
## Call:
## rda(formula = spe.hel ~ alt + slo + flo + pH + har + pho + nit +      amm + oxy + bdo, data = env[, -1]) 
## 
## Partitioning of variance:
##               Inertia Proportion
## Total          0.5025     1.0000
## Constrained    0.3590     0.7144
## Unconstrained  0.1435     0.2856
## 
## Eigenvalues, and their contribution to the variance 
## 
## Importance of components:
##                         RDA1    RDA2    RDA3    RDA4     RDA5     RDA6
## Eigenvalue            0.2261 0.05338 0.03152 0.02750 0.006801 0.005696
## Proportion Explained  0.4499 0.10622 0.06273 0.05472 0.013534 0.011336
## Cumulative Proportion 0.4499 0.55609 0.61882 0.67354 0.687075 0.698411
##                           RDA7     RDA8     RDA9     RDA10     PC1     PC2
## Eigenvalue            0.003036 0.002794 0.001341 0.0008689 0.04579 0.02424
## Proportion Explained  0.006042 0.005561 0.002669 0.0017290 0.09112 0.04824
## Cumulative Proportion 0.704453 0.710014 0.712682 0.7144114 0.80553 0.85376
##                           PC3     PC4     PC5      PC6      PC7      PC8
## Eigenvalue            0.01828 0.01541 0.01005 0.008518 0.005224 0.004799
## Proportion Explained  0.03638 0.03066 0.02000 0.016950 0.010395 0.009549
## Cumulative Proportion 0.89015 0.92081 0.94080 0.957750 0.968146 0.977695
##                            PC9     PC10     PC11     PC12      PC13
## Eigenvalue            0.003111 0.002779 0.001824 0.001309 0.0007633
## Proportion Explained  0.006191 0.005530 0.003630 0.002604 0.0015189
## Cumulative Proportion 0.983886 0.989415 0.993045 0.995650 0.9971685
##                            PC14      PC15      PC16      PC17      PC18
## Eigenvalue            0.0007211 0.0003466 0.0001628 0.0001183 7.405e-05
## Proportion Explained  0.0014350 0.0006897 0.0003240 0.0002354 1.474e-04
## Cumulative Proportion 0.9986035 0.9992932 0.9996172 0.9998526 1.000e+00
## 
## Accumulated constrained eigenvalues
## Importance of components:
##                         RDA1    RDA2    RDA3   RDA4     RDA5     RDA6
## Eigenvalue            0.2261 0.05338 0.03152 0.0275 0.006801 0.005696
## Proportion Explained  0.6297 0.14868 0.08781 0.0766 0.018944 0.015867
## Cumulative Proportion 0.6297 0.77838 0.86619 0.9428 0.961736 0.977603
##                           RDA7     RDA8     RDA9     RDA10
## Eigenvalue            0.003036 0.002794 0.001341 0.0008689
## Proportion Explained  0.008457 0.007784 0.003736 0.0024202
## Cumulative Proportion 0.986060 0.993844 0.997580 1.0000000
## 
## Scaling 2 for species and site scores
## * Species are scaled proportional to eigenvalues
## * Sites are unscaled: weighted dispersion equal on all dimensions
## * General scaling constant of scores:  1.93676 
## 
## 
## Species scores
## 
##          RDA1      RDA2      RDA3       RDA4      RDA5       RDA6
## Cogo  0.12855  0.110633 -0.209194 -0.1071449 -0.028611  0.0141994
## Satr  0.63509  0.033253  0.208510 -0.1439723  0.019792 -0.0111685
## Phph  0.47357  0.105922 -0.090979  0.1251551  0.034336  0.0009744
## Neba  0.36572  0.112390 -0.056054  0.2106755  0.032913 -0.0126325
## Thth  0.12897  0.105128 -0.199485 -0.1260043 -0.044784  0.0073734
## Teso  0.06532  0.122998 -0.181470 -0.0790183 -0.007990  0.0165030
## Chna -0.17173  0.077858 -0.016572  0.0102750  0.027716 -0.0635971
## Chto -0.12774  0.164448 -0.035461 -0.0055416  0.088885 -0.0253512
## Lele -0.08582  0.037082 -0.018133  0.0519389  0.005327  0.0774115
## Lece -0.09794 -0.146753 -0.146155  0.1089256 -0.079503  0.0057278
## Baba -0.18219  0.213000 -0.044083 -0.0486677  0.016435 -0.0116779
## Spbi -0.15864  0.162595 -0.009591 -0.0036451  0.058718 -0.0070201
## Gogo -0.20564  0.032595 -0.031655  0.0255721  0.040118  0.0853105
## Eslu -0.11131  0.030323  0.059115  0.0556236  0.027627  0.0717972
## Pefl -0.09514  0.114076  0.026404  0.0972731  0.012597 -0.0081279
## Rham -0.20707  0.161102  0.039337 -0.0181777  0.010822 -0.0007049
## Legi -0.22981  0.110587  0.014642  0.0068236 -0.009266 -0.0471023
## Scer -0.16703 -0.001545  0.043678  0.0005884  0.007734  0.0904951
## Cyca -0.17709  0.142182  0.030518 -0.0146280 -0.006343 -0.0014130
## Titi -0.13869  0.119661  0.032858  0.1454023 -0.031858 -0.0078230
## Abbr -0.18814  0.111557  0.070736 -0.0277619 -0.054583 -0.0043464
## Icme -0.15010  0.074868  0.077897 -0.0270093 -0.088491  0.0069117
## Acce -0.30997  0.011383  0.037899 -0.0131401 -0.003935  0.0447406
## Ruru -0.31367 -0.154037 -0.058600  0.1366742  0.005703 -0.0531753
## Blbj -0.24063  0.085917  0.054743 -0.0089691 -0.056208 -0.0477626
## Alal -0.42634 -0.225180 -0.081617 -0.1348421  0.092524 -0.0256802
## Anan -0.19386  0.139705  0.046797 -0.0159675 -0.007872  0.0013463
## 
## 
## Site scores (weighted sums of species scores)
## 
##        RDA1      RDA2     RDA3      RDA4     RDA5     RDA6
## 1   0.40041 -0.295511  1.14270 -1.175086 -0.11686 -0.56914
## 2   0.53738 -0.030567  0.48623 -0.061268  0.45958 -0.61134
## 3   0.49599 -0.006984  0.46543  0.116524  0.58777 -0.18445
## 4   0.33605  0.039678  0.31614  0.545907  0.24808  0.33426
## 5   0.02646 -0.180823  0.20169  0.764075 -0.03656  1.27753
## 6   0.24631 -0.081858  0.13671  0.800683 -0.03204  0.33302
## 7   0.46870 -0.110516  0.25090  0.208714  0.04653 -0.11912
## 9   0.04338 -0.523846 -0.42403  1.026668 -1.06941 -1.07501
## 10  0.31677 -0.145957 -0.14045  0.542830 -0.03552  0.75199
## 11  0.48369 -0.037798 -0.29659 -0.345340 -0.60788 -0.32583
## 12  0.49406  0.017009 -0.29630 -0.325663 -0.52782 -0.37330
## 13  0.49904  0.186836 -0.44001 -0.725909 -0.33387 -0.22269
## 14  0.38323  0.228190 -0.61700 -0.523171 -0.36312  0.45362
## 15  0.28869  0.229638 -0.67296 -0.216753 -0.60059  0.75672
## 16  0.09032  0.415053 -0.38401  0.111877  0.50058  0.41060
## 17 -0.05323  0.450469 -0.41933 -0.033371  0.99567 -0.38515
## 18 -0.14222  0.415224 -0.40307  0.009911  0.82358 -0.15473
## 19 -0.28308  0.309911 -0.14823  0.320089  0.90414 -0.16896
## 20 -0.39851  0.223514  0.04728  0.198958  0.55326 -0.11193
## 21 -0.43175  0.272783  0.20026  0.087309  0.16304  0.02239
## 22 -0.46911  0.233362  0.24707  0.016741 -0.25624 -0.02843
## 23 -0.27656 -1.156080 -0.42377 -0.339974  0.05514 -1.29147
## 24 -0.40768 -0.771586 -0.17939 -0.397065 -0.00161 -1.12261
## 25 -0.35385 -0.808410 -0.08725 -0.263340  0.56199  1.91413
## 26 -0.47197  0.077281  0.25065  0.002092 -0.32899  0.03663
## 27 -0.47349  0.193232  0.28063 -0.004540 -0.37185  0.03277
## 28 -0.47634  0.209299  0.31020 -0.024574 -0.52938  0.01265
## 29 -0.37836  0.369162  0.18643 -0.171011 -0.27590  0.09357
## 30 -0.49433  0.279294  0.41006 -0.145314 -0.41174  0.31428
## 
## 
## Site constraints (linear combinations of constraining variables)
## 
##        RDA1      RDA2      RDA3     RDA4      RDA5     RDA6
## 1   0.56869 -0.158590  0.980138 -0.65131  0.051320 -0.06640
## 2   0.26581  0.036765  0.674201  0.26846  0.718990 -0.14076
## 3   0.37338 -0.139943  0.519657 -0.02578  0.066108 -0.25253
## 4   0.46595  0.062636  0.053487  0.42704  0.114337  0.39565
## 5   0.21562 -0.451325  0.011212  0.65932 -0.335005 -0.11493
## 6   0.37544 -0.148582  0.204815  0.24235  0.129476 -0.31015
## 7   0.55448 -0.146226 -0.035860  0.20126 -0.430822  0.36722
## 9   0.01718 -0.251178 -0.115538  1.06866 -0.296933  0.05943
## 10  0.17546 -0.138170  0.173185  0.06342 -0.165935 -0.17391
## 11  0.28137  0.165418 -0.130714 -0.23404 -0.397820  0.39167
## 12  0.32329  0.167258 -0.401398 -0.18786  0.264101 -0.28265
## 13  0.38209  0.107356 -0.507743 -0.32445 -0.147620 -0.06059
## 14  0.38151  0.152013 -0.579728 -0.35651 -0.281294 -0.03069
## 15  0.24564  0.271175 -0.412127 -0.33449 -0.279939  0.22797
## 16 -0.03340  0.250314 -0.184407  0.08800  0.073618  0.07840
## 17 -0.10817  0.258271 -0.343190  0.14689  0.513962 -0.40907
## 18 -0.06189  0.311727 -0.257640  0.07952  0.356584  0.22352
## 19 -0.03711  0.379177 -0.294081 -0.03274  0.349407  0.18128
## 20 -0.22431  0.375617 -0.070890 -0.03861  0.737369 -0.12087
## 21 -0.34883  0.258940  0.015988  0.10805  0.246878 -0.32125
## 22 -0.27772  0.068382 -0.008991 -0.20070 -0.189680 -0.46539
## 23 -0.18786 -1.048280 -0.339613 -0.71285 -0.008217 -0.42604
## 24 -0.51613 -0.555550 -0.319464  0.28272 -0.143026 -0.64407
## 25 -0.40978 -0.924082 -0.070743 -0.23578  0.678956  1.18999
## 26 -0.49964  0.002689  0.100691  0.26791 -0.374118  0.31294
## 27 -0.60047  0.063920  0.468172 -0.14854 -0.226959 -0.26419
## 28 -0.51391  0.403655  0.176190  0.07713 -0.038538  0.31623
## 29 -0.36818  0.324611  0.361512 -0.25986 -0.323494  0.09805
## 30 -0.43852  0.302002  0.332876 -0.23720 -0.661703  0.24111
## 
## 
## Biplot scores for constraining variables
## 
##        RDA1     RDA2     RDA3    RDA4     RDA5     RDA6
## alt  0.8281 -0.18518  0.37869  0.3532  0.04195 -0.05880
## slo  0.7399 -0.16882  0.44280 -0.0841 -0.03149 -0.10932
## flo -0.7820  0.22102 -0.05683 -0.3560 -0.22000  0.18778
## pH   0.1023  0.17603 -0.24148 -0.1881 -0.27655  0.07525
## har -0.5743  0.05063 -0.57561 -0.1145 -0.39249  0.22844
## pho -0.4936 -0.67392 -0.16113 -0.2175  0.17603  0.42118
## nit -0.7786 -0.22415 -0.19299 -0.2071  0.28305  0.33923
## amm -0.4747 -0.70878 -0.13815 -0.1882  0.32390  0.32113
## oxy  0.7648  0.58218 -0.06834 -0.2290  0.04329 -0.11011
## bdo -0.5169 -0.81046 -0.14577 -0.1320  0.05312  0.05688

Here we can see the values of the constrained axes. Another important value is the inertia, which is the explained variance of the model. Our explanatory matrix can explain a 71% of the total variance, which means that we have a good model.

3- Number of PC axis

sp <- spe.rda$CA$eig 
sp[sp > mean(sp)] # Kaiser-Guttman test

##         PC1         PC2         PC3         PC4         PC5         PC6 
## 0.045786390 0.024239717 0.018282574 0.015406393 0.010047723 0.008517531

evplot(sp)

The criteria among the tests are non consistent. We see we have up to 4 significant axis according to the broken-stick test.

4- Goodness of fit

anova(spe.rda, step = 1000)

## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = spe.hel ~ alt + slo + flo + pH + har + pho + nit + amm + oxy + bdo, data = env[, -1])
##          Df Variance      F Pr(>F)    
## Model    10  0.35900 4.5028  0.001 ***
## Residual 18  0.14351                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The model is statiscally significant (p < 0.001)

anova(spe.rda, by = "axis", step = 1000)

## Permutation test for rda under reduced model
## Forward tests for axes
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = spe.hel ~ alt + slo + flo + pH + har + pho + nit + amm + oxy + bdo, data = env[, -1])
##          Df Variance       F Pr(>F)    
## RDA1      1 0.226063 28.3542  0.001 ***
## RDA2      1 0.053376  6.6947  0.001 ***
## RDA3      1 0.031523  3.9538  0.111    
## RDA4      1 0.027499  3.4491  0.231    
## RDA5      1 0.006801  0.8530  1.000    
## RDA6      1 0.005696  0.7145  1.000    
## RDA7      1 0.003036  0.3808  1.000    
## RDA8      1 0.002794  0.3505  1.000    
## RDA9      1 0.001341  0.1682  1.000    
## RDA10     1 0.000869  0.1090  1.000    
## Residual 18 0.143511                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Just the two first axis are significants.

anova(spe.rda, by = "margin", step = 1000)

## Permutation test for rda under reduced model
## Marginal effects of terms
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = spe.hel ~ alt + slo + flo + pH + har + pho + nit + amm + oxy + bdo, data = env[, -1])
##          Df Variance      F Pr(>F)  
## alt       1 0.008165 1.0240  0.382  
## slo       1 0.011706 1.4682  0.202  
## flo       1 0.015431 1.9354  0.101  
## pH        1 0.004976 0.6241  0.644  
## har       1 0.015158 1.9012  0.095 .
## pho       1 0.005999 0.7525  0.522  
## nit       1 0.003812 0.4781  0.787  
## amm       1 0.003209 0.4025  0.856  
## oxy       1 0.025887 3.2469  0.018 *
## bdo       1 0.004806 0.6028  0.688  
## Residual 18 0.143511                
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

RsquareAdj(spe.rda); vif.cca(spe.rda)

## $r.squared
## [1] 0.7144114
## 
## $adj.r.squared
## [1] 0.555751

##       alt       slo       flo        pH       har       pho       nit 
## 17.475490  4.221651  6.021996  1.471728  2.994588 25.379501 15.630086 
##       amm       oxy       bdo 
## 30.353271  7.623932 18.098637

5- Triplot

Because we have an explanatory matrix, the matrix we use in PCA analysis becomes triplot.

par(mfrow = c(1, 2))    
# Scaling = 1
plot(spe.rda, scaling = 1, 
         main = "scaling 1")
spe.sc <- scores(spe.rda, choices = 1:2, scaling = 1, display = "sp")
arrows(0, 0, spe.sc[, 1], spe.sc[, 2], length = 0, lty = 1, col ="red")
    
# Scaling = 2 
plot(spe.rda, main = "scaling 2")
spe2.sc <- scores(spe.rda, choices = 1:2, display = "sp")
arrows(0, 0, spe2.sc[, 1], spe2.sc[, 2], length = 0, lty = 1, col = "red")

We can see painted in blue the arrows of the explanatory matrix. The longer they are the more variance they explain.

6- Selection of explanatory variables

We will use different criteria to choose this:

6.1. Variance inflation factors (VIF)

First up, we are going to create a model with every explanatory variable and see how it behaves. Then we will use the step function to see how many variables helps us to create the fittest model.

vif.cca(spe.rda)

##       alt       slo       flo        pH       har       pho       nit 
## 17.475490  4.221651  6.021996  1.471728  2.994588 25.379501 15.630086 
##       amm       oxy       bdo 
## 30.353271  7.623932 18.098637

spe.rda.all <- rda(spe.hel ~ ., data=env)
(R2a.all <- RsquareAdj(spe.rda.all)$adj.r.squared)

## [1] 0.5577387

(step.forward <- ordistep(rda(spe.hel ~ 1, data = env), trace = 0,
                              scope = formula(spe.rda.all), direction = "forward",                                  pstep = 1000))

## Call: rda(formula = spe.hel ~ dfs + oxy + bdo + alt, data = env)
## 
##               Inertia Proportion Rank
## Total          0.5025     1.0000     
## Constrained    0.3199     0.6365    4
## Unconstrained  0.1827     0.3635   24
## Inertia is variance 
## 
## Eigenvalues for constrained axes:
##    RDA1    RDA2    RDA3    RDA4 
## 0.22016 0.05089 0.02880 0.02002 
## 
## Eigenvalues for unconstrained axes:
##     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8 
## 0.05290 0.03130 0.02457 0.01695 0.01375 0.01193 0.00736 0.00575 
## (Showed only 8 of all 24 unconstrained eigenvalues)

The suggested model will be: \(dfs + oxy + bdo + alt + har\)

6.2. Ordistep function

(step.forward <- ordiR2step(rda(spe.hel ~ 1, data = env), trace = 0,
                                scope = formula(spe.rda.all), direction = "forward",                                  pstep = 1000))

## Call: rda(formula = spe.hel ~ dfs + oxy + bdo, data = env)
## 
##               Inertia Proportion Rank
## Total          0.5025     1.0000     
## Constrained    0.2947     0.5865    3
## Unconstrained  0.2078     0.4135   25
## Inertia is variance 
## 
## Eigenvalues for constrained axes:
##    RDA1    RDA2    RDA3 
## 0.21874 0.05015 0.02581 
## 
## Eigenvalues for unconstrained axes:
##     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8 
## 0.05422 0.04398 0.02639 0.01803 0.01486 0.01298 0.00971 0.00633 
## (Showed only 8 of all 25 unconstrained eigenvalues)

This time the suggested model is \(dfs + oxy + bdo\)

6.3. Adjusted R^² comparison

RsquareAdj(rda(spe.hel ~ dfs, data = env))$adj.r.squared

## [1] 0.3676474

RsquareAdj(rda(spe.hel ~ dfs + oxy, data = env))$adj.r.squared

## [1] 0.4706162

RsquareAdj(rda(spe.hel ~ dfs + oxy + bdo, data = env))$adj.r.squared

## [1] 0.5368359

The result is self-explanatory. The best model is \(dfs + oxy + bdo\).

6.4.Parsimonious RDA model selection

(spe.rda.pars <- rda(spe.hel ~ dfs + oxy + bdo, data = env))

## Call: rda(formula = spe.hel ~ dfs + oxy + bdo, data = env)
## 
##               Inertia Proportion Rank
## Total          0.5025     1.0000     
## Constrained    0.2947     0.5865    3
## Unconstrained  0.2078     0.4135   25
## Inertia is variance 
## 
## Eigenvalues for constrained axes:
##    RDA1    RDA2    RDA3 
## 0.21874 0.05015 0.02581 
## 
## Eigenvalues for unconstrained axes:
##     PC1     PC2     PC3     PC4     PC5     PC6     PC7     PC8 
## 0.05422 0.04398 0.02639 0.01803 0.01486 0.01298 0.00971 0.00633 
## (Showed only 8 of all 25 unconstrained eigenvalues)

anova(spe.rda.pars, step = 1000)

## Permutation test for rda under reduced model
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = spe.hel ~ dfs + oxy + bdo, data = env)
##          Df Variance      F Pr(>F)    
## Model     3  0.29470 11.818  0.001 ***
## Residual 25  0.20781                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(spe.rda.pars, step = 1000, by = "axis")

## Permutation test for rda under reduced model
## Forward tests for axes
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = spe.hel ~ dfs + oxy + bdo, data = env)
##          Df Variance       F Pr(>F)    
## RDA1      1 0.218736 26.3147  0.001 ***
## RDA2      1 0.050153  6.0336  0.001 ***
## RDA3      1 0.025813  3.1054  0.004 ** 
## Residual 25 0.207808                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

anova(spe.rda.pars, step = 1000, by = "margin")

## Permutation test for rda under reduced model
## Marginal effects of terms
## Permutation: free
## Number of permutations: 999
## 
## Model: rda(formula = spe.hel ~ dfs + oxy + bdo, data = env)
##          Df Variance       F Pr(>F)    
## dfs       1 0.097402 11.7178  0.001 ***
## oxy       1 0.045476  5.4710  0.004 ** 
## bdo       1 0.039212  4.7173  0.006 ** 
## Residual 25 0.207808                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(R2a.pars <- RsquareAdj(spe.rda.pars)$adj.r.squared)

## [1] 0.5368359

All the variables are statiscally significant (p < 0.005).

6.5. VIF comparison

vif.cca(spe.rda.all)

##        dfs        alt        slo        flo         pH        har 
## 135.817134  51.714483   5.324914  45.740857   1.807935   4.400607 
##        pho        nit        amm        oxy        bdo 
##  27.453045  16.614554  30.685220  17.493662  18.210558

vif.cca(spe.rda.pars)

##      dfs      oxy      bdo 
## 1.496925 4.116333 3.426102

The most parsimonious model is a model with just three explanatory variables: \(dfs + oxy + bdo\).

7- Final triplot

And now let’s finish this long analysis with a triplot with the new formula.

par(mfrow = c(1, 2))
# Scaling 1
plot(spe.rda.pars, scaling = 1, display = c("sp", "lc", "cn"), 
         main = "Scaling 1")
spe4.sc <- scores(spe.rda.pars, choices = 1:2, scaling = 1, display = "sp")
arrows(0, 0, spe4.sc[, 1], spe4.sc[, 2], length = 0, lty = 1, col = "red")

# Scaling 2
plot(spe.rda.pars, display = c("sp", "lc", "cn"), 
      main = "Scaling 2")
spe5.sc <- scores(spe.rda.pars, choices = 1:2, display = "sp")
arrows(0, 0, spe5.sc[,1], spe5.sc[,2], length = 0, lty = 1, col = "red")

Compared to the previous triplot, this time we only have three blue arrows (namely, explanatory variables), whith a low intercorrelation between them.

Principal Component Analysis

David Valls Lanàquera

Principal Component Analysis

1- Assumptions of the model

1.1. Multivariant normality

1.2. Linear relationship

2- PCA model

3- Selection of the PC

3.1. Kaiser-Guttman test (Eigenvalues)

3.2. Broken-stick plot

4- Biplot

Redundancy analysis (RDA)

1- Transformation of the data

2- Creation of the model

3- Number of PC axis

4- Goodness of fit

5- Triplot

6- Selection of explanatory variables

6.1. Variance inflation factors (VIF)

6.2. Ordistep function

6.3. Adjusted R2 comparison

6.4.Parsimonious RDA model selection

6.5. VIF comparison

7- Final triplot

6.3. Adjusted R^² comparison