Multivariate statistics

Analysis of continuous data

Perform a PCA of the data in the allometry dataset without changing the scale argument and plot the ordination result.

library(vegan)

## Loading required package: permute

## Loading required package: lattice

## This is vegan 2.4-3

allom<-read.csv('Allometry.csv')

allom.pca<-prcomp(allom[,2:5])

biplot(allom.pca)

In the previous ordination, the length of each vector varies extensively. This is because the variance in ‘branchmass’ and ‘leafarea’ is much larger than that of ‘diameter’ and ‘height’. Use the var function to calculate the variance for each variable in allom to con1rm this. Then use the scale argument to repeat the analysis after standardising the variables.

library(vegan)
allom<-read.csv('Allometry.csv')
var(allom$diameter)

## [1] 373.2473

var(allom$height)

## [1] 130.3961

var(allom$leafarea)

## [1] 11573.64

var(allom$branchmass)

## [1] 43385.5

allom.pca<-prcomp(allom[,2:5],scale=T)

biplot(allom.pca)

The ordination plot shows very strong overlap for most of the sites and a few very divergent sites. In addition, the plot is partly off-centre. These properties indicate that there is lack of normality in the response variables. Log-transform the variables to see the effects on the ordination, trying this two ways: (a) using log() to generate new vectors containing transformed data and (b) using decostand() with an appropriate ‘method’ argument).

library(vegan)
allom<-read.csv('Allometry.csv')
allom.pca.log1<-prcomp(log(allom[,2:5]),scale=T)
biplot(allom.pca.log1)

allom.pca.log2<-prcomp(decostand(allom[,2:5],'log'),scale=T)

## Warning: non-integer data: divided by smallest positive value

biplot(allom.pca.log2)

Plot a dendrogram of multivariate distances (euclidean) among individual trees, labelling the tips of the dendrogram with the species level. Use ANOSIM and PERMANOVA to test the hypothesis that clusters can be explained by interspeci1c variation.

library(vegan)
allom<-read.csv('Allometry.csv')
allom.dist <- vegdist(decostand(allom[,2:5],'log'),method='euclidean')

## Warning: non-integer data: divided by smallest positive value

allom.clust<-hclust(allom.dist)
plot(allom.clust,labels=allom[,1])

allom.ano<-anosim(allom.dist,allom[,1])
summary(allom.ano)

## 
## Call:
## anosim(dat = allom.dist, grouping = allom[, 1]) 
## Dissimilarity: euclidean 
## 
## ANOSIM statistic R: -0.01231 
##       Significance: 0.664 
## 
## Permutation: free
## Number of permutations: 999
## 
## Upper quantiles of permutations (null model):
##    90%    95%  97.5%    99% 
## 0.0323 0.0475 0.0601 0.0801 
## 
## Dissimilarity ranks between and within classes:
##         0%   25%    50%     75% 100%    N
## Between  1 488.0  967.5 1450.25 1952 1320
## PIMO     3 432.0  895.0 1377.50 1925  171
## PIPO     8 577.5 1074.0 1486.00 1953  231
## PSME     2 470.0  980.0 1502.50 1938  231

adonis(allom.dist~allom[,1])

## 
## Call:
## adonis(formula = allom.dist ~ allom[, 1]) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##            Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)
## allom[, 1]  2      9.06  4.5302 0.42656 0.01402  0.695
## Residuals  60    637.22 10.6204         0.98598       
## Total      62    646.29                 1.00000

variation between species is similar to variation within species

Plot the ordination results using coloured circles to represent the different tree species and include a legend.

library(vegan)
allom<-read.csv('Allometry.csv')

allom.pca.log<-prcomp(log(allom[,2:5]),scale=T)

allom.scores<-scores(allom.pca.log)[,1:2]

palette(c("blue","red","forestgreen"))
plot(allom.scores, pch=16, col=allom$species)

legend('topright', legend=levels(allom$species), pch=16, col=palette())

Overlay the plot with the centroid (i.e., average) for each species, using a different symbol than for the individual points. Modify the axes to reFLect the percentage of inertia (i.e., variance) explained by each axis.

library(vegan)
allom<-read.csv('Allometry.csv')

allom.pca.log<-prcomp(log(allom[,2:5]),scale=T)

allom.scores<-scores(allom.pca.log)[,1:2]

allom.agg<-aggregate(allom.scores, by=list(species=allom$species), FUN=mean)

summary(allom.pca.log)

## Importance of components%s:
##                           PC1     PC2     PC3     PC4
## Standard deviation     1.9081 0.53172 0.24722 0.12366
## Proportion of Variance 0.9102 0.07068 0.01528 0.00382
## Cumulative Proportion  0.9102 0.98090 0.99618 1.00000

str(summary(allom.pca.log))

## List of 6
##  $ sdev      : num [1:4] 1.908 0.532 0.247 0.124
##  $ rotation  : num [1:4, 1:4] -0.515 -0.483 -0.496 -0.506 0.243 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:4] "diameter" "height" "leafarea" "branchmass"
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  $ center    : Named num [1:4] 3.37 3.09 4.16 4.01
##   ..- attr(*, "names")= chr [1:4] "diameter" "height" "leafarea" "branchmass"
##  $ scale     : Named num [1:4] 0.714 0.673 1.25 1.576
##   ..- attr(*, "names")= chr [1:4] "diameter" "height" "leafarea" "branchmass"
##  $ x         : num [1:63, 1:4] -1.907 -0.677 2.127 -0.29 -0.387 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : NULL
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  $ importance: num [1:3, 1:4] 1.9081 0.9102 0.9102 0.5317 0.0707 ...
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:3] "Standard deviation" "Proportion of Variance" "Cumulative Proportion"
##   .. ..$ : chr [1:4] "PC1" "PC2" "PC3" "PC4"
##  - attr(*, "class")= chr "summary.prcomp"

allom.pcvar<-summary(allom.pca.log)$importance[2,1:2]

palette(c("blue","red","forestgreen"))
plot(allom.scores, pch=16, col=allom$species,
xlab=paste('PC1 (', 100*round(allom.pcvar[1], 3), ' %)', sep=''),
ylab=paste('PC2 (', 100*round(allom.pcvar[2], 3), ' %)', sep='')
)

legend('topright', legend=levels(allom$species), pch=16, col=palette())

points(allom.agg[,2:3], pch=17, cex=2, col=palette())

Analysis of count data or other types of abundance data

Read in the data from ‘endophytes.csv’ for a description of the data. Use the decorana function to calculate gradient lengths and determine whether PCA is appropriate for these data.

endo<-read.csv('endophytes.csv')

decorana(endo)

## 
## Call:
## decorana(veg = endo) 
## 
## Detrended correspondence analysis with 26 segments.
## Rescaling of axes with 4 iterations.
## 
##                   DCA1   DCA2   DCA3   DCA4
## Eigenvalues     0.5160 0.3277 0.2663 0.2971
## Decorana values 0.5406 0.4279 0.2738 0.2552
## Axis lengths    3.6757 4.0614 2.7659 2.9553

The gradient (axis) lengths are around or greater than 3, suggesting that PCA is not appropriate.

Perform CA for these data and plot the results. Notice the strong skew in the data along both axes. Try again after standardising the community matrix using the decostand function (try the ‘hellinger’ and ‘max’ methods). Notice the undesireable parabolic pattern in the ordination and strong skew; this suggests that CA is not an improvement over PCA (common for data matrices that contain many zeros, collected along long environmental gradients).

endo<-read.csv('endophytes.csv')

plot(vegan::cca(endo))

plot(vegan::cca(decostand(endo, method='hellinger')))

plot(vegan::cca(decostand(endo, method='max')))

summary(as.numeric(endo == 0))

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.0000  1.0000  1.0000  0.8688  1.0000  1.0000

About 87% of cells in the matrix equal zero (species is absent)

Perform PCoA on these data, using the ‘hellinger’ method for the decostand function and ‘bray’ method for the vegdist() function, and plot the results. Repeat as before but use the binary argument in the vegdist function to convert the matrix to a ‘presence/absence’ matrix.

endo<-read.csv('endophytes.csv')

#PCoA with Bray-Curtis dissimilarities
endo.pco.bray<-wcmdscale(vegdist(endo,method='bray'))
plot(endo.pco.bray)

#PCoA with Jaccard index (species presence / absence)
endo.pco.jac<-wcmdscale(vegdist(endo,method='bray',binary=T))
plot(endo.pco.jac)

Use the adonis function to test for main and interactive effects of tree species and tissue type (fresh vs litter) on fungal community composition. The predictor variables can be found in ‘endophytes_env.csv’. What terms were signi1cant? Which term explained the most variation?

endo<-read.csv('endophytes.csv')

endo.env<-read.csv('endophytes_env.csv')

# use PERMANOVA to test statistical significance of tree species and
# tissue type, interaction.
# Use the formula interface, response is a distance matrix
adonis(vegdist(endo,method='bray') ~ type * species, data=endo.env)

## 
## Call:
## adonis(formula = vegdist(endo, method = "bray") ~ type * species,      data = endo.env) 
## 
## Permutation: free
## Number of permutations: 999
## 
## Terms added sequentially (first to last)
## 
##              Df SumsOfSqs MeanSqs F.Model      R2 Pr(>F)    
## type          1     3.498  3.4977 18.9119 0.10707  0.001 ***
## species       8     7.885  0.9856  5.3291 0.24137  0.001 ***
## type:species  8     6.488  0.8110  4.3852 0.19862  0.001 ***
## Residuals    80    14.796  0.1849         0.45293           
## Total        97    32.666                 1.00000           
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

All terms were significant, species had the largest R2

Plot the PCoA results and use different symbols and colours to re2ect the identities of tree species and tissue types. Add a legend to the plot.

endo<-read.csv('endophytes.csv')
endo.env<-read.csv('endophytes_env.csv')
endo.pcoa<-wcmdscale(vegdist(endo,method='bray'))

plot(scores(endo.pcoa,display='sites'),col=rainbow(8)[endo.env$species],
pch=c(1,16)[endo.env$type])

legend('bottomleft',levels(endo.env$type),pch=c(1,16))
legend('bottomright',levels(endo.env$species),col=rainbow(8),pch=16,cex=0.75)

Use capscale to perform distance-based RDA (constrained PCoA) using the continuous variables in ‘endophytes_env.csv’ (percentC, percentN, CNratio) as predictors, then plot the results. First use the envfit function to determine which variables to include in db-RDA

endo<-read.csv('endophytes.csv')
endo.env<-read.csv('endophytes_env.csv')
endo.pcoa<-wcmdscale(vegdist(endo,method='bray'))
envfit(endo.pcoa, scale(endo.env[,3:5]))

## 
## ***VECTORS
## 
##              Dim1     Dim2     r2 Pr(>r)    
## percentC -0.71763  0.69643 0.0637  0.033 *  
## percentN  0.84421 -0.53601 0.5852  0.001 ***
## CNratio  -0.83489  0.55042 0.5352  0.001 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Permutation: free
## Number of permutations: 999

# all three variable are significant, include all three in analysis
# perform db-RDA (aka CAP) using continuous environmental variables as predictors
endo.cap<-capscale(endo~percentC+percentN+CNratio,data=endo.env,distance='bray')
endo.cap

## Call: capscale(formula = endo ~ percentC + percentN + CNratio,
## data = endo.env, distance = "bray")
## 
##               Inertia Proportion Eigenvals Rank
## Total         32.6662     1.0000   33.3894     
## Constrained    4.1124     0.1259    4.1268    3
## Unconstrained 28.5538     0.8741   29.2626   77
## Imaginary                          -0.7232   20
## Inertia is squared Bray distance 
## 
## Eigenvalues for constrained axes:
##   CAP1   CAP2   CAP3 
## 3.0212 0.6235 0.4820 
## 
## Eigenvalues for unconstrained axes:
##  MDS1  MDS2  MDS3  MDS4  MDS5  MDS6  MDS7  MDS8 
## 3.690 2.251 2.077 1.775 1.492 1.297 1.067 0.941 
## (Showed only 8 of all 77 unconstrained eigenvalues)

About 12 % of the variation is explained by C, N, and C:N, Most of that variation is accounted for in one axis (CAP1)

plot(endo.cap)

N and C:N are strongly collinear, C is separated out along the second CAP axis

Multivariate statistics

Kushan De Silva

August 16, 2017