Fri Nov 14 01:46:08 2025
Background
Soils form the foundation to many functions, from nature to agricultural systems and from one of the most important systems for carbon sequestration (in the context of climate change) to the location sustainable food production. Justifiably, soils are receiving increasing attention in the context of sustainability. For instance, Key Performance Indicators (KPIs) are being developed to identify good soil quality for agricultural practices and the World Bank has developed methods to monitor soil carbon and soil quality as an important component of carbon accounting and similar systems. All in all, that indicates the necessity of better understanding soil quality and related soil chemical properties.
We take the situation in which a comprehensive set of soil chemical properties was analysed. This dataset describes soil properties of more than 200 soils in the Amazon. Properties measured include the depth of the soil profile (till bedrock), electric conductivity (i.e. how many salts occur in the soil solution), pH (acidity), and concentrations of PO4 (phosphate), SO4 (sulphate), Ca (calcium), K (potassium), C (carbon/soil organic matter) and N (nitrogen). We want to understand the variation in these properties and the patterns among the various properties involved.
Aims
In this assignment, we will:
- Get acquainted with multivariate analysis and in particular with Principal Component Analysis (PCA).
Note: If the commands are not provided, they have been introduced in previous assignments.
1. Read the datafile into R and check whether the data have been read properly.
# loading data
soil <- read.table("data-unconstrained-ordination.txt", sep="\t", header=TRUE, row.names="plot")
# row.names sets the plot name row to the index row all in one
# overview of data
head(soil)## depth EC pH PO4 SO4 Ca K C N
## pp127 1.79 5.30 3.2 2.65 3.43 1.70 3.16 3.85 0.69
## pp128 2.77 4.44 4.0 2.65 2.92 1.16 2.62 3.61 0.92
## pp137 1.79 3.61 4.1 1.70 2.60 0.18 1.67 1.59 0.10
## pp138 1.79 5.38 2.7 2.14 2.92 0.99 3.15 4.03 0.59
## pp139 0.69 4.37 3.2 2.56 3.26 1.70 2.85 3.93 0.96
## pp143 0.69 3.89 4.2 2.14 2.60 0.47 2.52 2.24 0.26
## depth EC pH PO4 SO4 Ca K C N
## r94 4.11 2.27 4.6 0.47 2.71 0.69 1.22 1.10 0.09
## r95 3.76 2.01 6.0 0.53 3.30 0.47 0.83 0.58 0.04
## r96 2.20 2.00 6.1 0.79 2.89 0.18 0.99 0.39 0.03
## r97 3.14 2.21 6.6 0.88 3.09 0.10 0.96 0.46 0.03
## r98 3.04 2.94 5.2 1.03 3.09 0.41 1.16 0.83 0.08
## r99 3.58 2.43 6.1 0.79 3.53 0.26 0.83 0.30 0.04
## [1] 211 9
## depth EC pH PO4 SO4
## Min. :0.690 Min. :0.100 Min. :2.700 Min. :0.18 Min. :0.590
## 1st Qu.:2.080 1st Qu.:2.480 1st Qu.:4.400 1st Qu.:0.83 1st Qu.:2.675
## Median :2.770 Median :3.140 Median :4.900 Median :1.31 Median :3.220
## Mean :2.794 Mean :3.025 Mean :4.947 Mean :1.40 Mean :3.145
## 3rd Qu.:3.430 3rd Qu.:3.750 3rd Qu.:5.500 3rd Qu.:1.86 3rd Qu.:3.660
## Max. :5.300 Max. :6.210 Max. :6.600 Max. :4.57 Max. :5.500
## Ca K C N
## Min. :0.0500 Min. :0.100 Min. :0.170 Min. :0.010
## 1st Qu.:0.2600 1st Qu.:1.265 1st Qu.:0.810 1st Qu.:0.050
## Median :0.5300 Median :1.760 Median :1.280 Median :0.100
## Mean :0.6298 Mean :1.890 Mean :1.413 Mean :0.153
## 3rd Qu.:0.9000 3rd Qu.:2.480 3rd Qu.:1.785 3rd Qu.:0.180
## Max. :2.8900 Max. :3.970 Max. :4.030 Max. :0.960
# checking distributions of data
par(mfrow=c(3,3))
hist(soil$depth,main="Till bedrock (Depth)")
hist(soil$EC,main="electric conductivity (EC)")
hist(soil$pH,main="Acidity (pH)")
hist(soil$PO4,main="(Phosphate) PO4")
hist(soil$SO4,main="Sulphate (SO4)")
hist(soil$Ca,main="Calcium (Ca)")
hist(soil$K,main="Potassium (K)")
hist(soil$C,main="Carbon/soil organic matter (C)")
hist(soil$N,main="Nitrogen (N)")2. Check for any missing variables and interpret
Any multivariate analysis assumes a complete dataset. If one variable is missing, a complete row is eliminated from further analysis. This tests whether there are NAs in the dataset:
## [1] 0
## integer(0)
Answer: The results are 0 and integer(0), which implies that are no NAs in the dataset. Hence, we can proceed with the analysis.
3. Run the correlation matrix as explained in the assignment on multiple regression and interpret.
As always, we start with exploring the dataset. In a multivariate context, correlations are the most obvious way to preliminary evaluate patterns among the variables.
# defining correlation matrix
cor_matrix <- cor(soil, use="pairwise.complete.obs") # takes NAs into account (even if there are none)
# this only works if all variables are numeric!
# installing corrplot package
library(corrplot)
# visualizing the correlation matrix
corrplot(cor_matrix, method="circle")Answer: Very strong correlations between carbon and acidity, electric conductivity and acidity, potassium and electric conductivity, and nitrogen and carbon. Moderately strong correlations between nitrogen and acidity, potassium and acidity, as well as carbon and electric conductivity and potassium and phosphate, carbon and phosphate and potassium and carbon. Till bedrock appears to lack any sort of correlation with any of the other variables.
4. Explore the correlation and interpret the results more closely, also in relation to the previous question.
Another way to explore correlations is through:
# another way to explore correlations - installing required package
library(PerformanceAnalytics)
# creating correlation chart
chart.Correlation(soil[,c(1:9)], histogram=TRUE, pch=19)Answer: The correlations show the same patterns. In addition though, the plot suggests that nitrogen (especially) is not normally distributed. However, this skewed distribution does not seem to strongly affect the linearity of the relations with other variables. Therefore, no transformation is needed for this analysis (given that we focus on linear relationships among variables). Some other relationships show some curvatures. However, those are mostly related to individual relationships and not to a particular variable. Therefore, no transformations are necessary nor applied.
5. Run the PCA: Why do we refer to columns 2-10 and why do we scale?
We will now run the PCA:
## Loading required package: permute
Answer: The 2:10 range refers to the variables in the columns, essentially you are telling it which dimensions should be considered in the PCA analysis. Originally, the first column provided the site names, which should not be included in the analysis, but this was already taken care of when loading the data. Scaling is important to standardize the data given they each have very different units and ranges of values and most likely also variances.
The output of the PCA is given by:
##
## Call:
## rda(X = soil[, c(1:9)], scale = TRUE)
##
## Partitioning of correlations:
## Inertia Proportion
## Total 9 1
## Unconstrained 9 1
##
## Eigenvalues, and their contribution to the correlations
##
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Eigenvalue 4.7199 1.1868 1.0026 0.67973 0.55483 0.3762 0.22979
## Proportion Explained 0.5244 0.1319 0.1114 0.07553 0.06165 0.0418 0.02553
## Cumulative Proportion 0.5244 0.6563 0.7677 0.84323 0.90488 0.9467 0.97220
## PC8 PC9
## Eigenvalue 0.14423 0.10593
## Proportion Explained 0.01603 0.01177
## Cumulative Proportion 0.98823 1.00000
This gives all samples scores and species scores. We will walk through the various elements.The first element is the overall performance of the axes; i.e. the explained variance by the individual axes.
Note: To fully interpret a PCA, you need three pieces of information, but the summary(pcasoil) output in your screenshot only gives you one of them.
1. What you HAVE: “Explained Variance” (The “How Much”)
The table in your screenshot (under “Importance of components”) tells you how much information is captured by each axis (PC1, PC2, etc.).
- Eigenvalue: The raw amount of variation an axis represents.
- Proportion Explained: PC1 explains 0.5244 (or 52.44%) of the variation in your soil data.
- Cumulative Proportion: PC1 and PC2 together explain 0.6563 (or 65.63%) of the data.
What this tells you: “These axes are doing a good job of summarizing the data.” What this fails to tell you: “What do these axes actually represent biologically or chemically?”
2. What is MISSING: The “What” and the “Where”
To actually understand your soil data, you need the two parts that “unfortunately” aren’t shown in this specific table:
a. Loadings (Species/Variable Scores):
- The Question: “Which soil variables are driving PC1?”
- The Missing Data: You don’t know if PC1 represents a gradient of pH, Nitrogen, or Moisture. Without the loadings, PC1 is just a meaningless math construct.
b. Scores (Site/Sample Scores):
- The Question: “Which soil samples are similar to each other?”
- The Missing Data: You can’t see if Sample A and Sample B are close together on the graph.
6. How much of the soil variation is represented (explained) by the first and second axis?
Answer: The cumulative soil variation explained by PC1 and PC2 is 0.6563, or around 66%, whereof PC1 explains roughly 52% and PC2 explains an additional roughly 13%. This means that more than half of the variation in soil chemical properties is explained by the first axis PC1, and hence this is a good estimate.
The terminology of a multivariate analysis and of the vegan() package in particular may be a bit confusing. Many methods have originally been developed in vegetation science where one wanted to understand the patterns of plant species; which plant species do co-occur and which plant species exclude each other (e.g. due to competition or due to other habitat preferences). This terminology is still used in all multivariate analyses, also in cases where other objects are analysed. That implies that in the context of multivariate analysis, “species” refers to any object of which you want to analyse its relation to other objects. In this case study, “species” refers to the various soil chemical properties analyzed.
7. Retrieve the species scores of the PCA by the “scores” statement. Which elements determine mostly the variation along the first PCA axis? And which along the second axis?
# the following apparently does not give the scores for all axes - only the first two axes
scores(pcasoil, display="sp")## PC1 PC2
## depth 0.1155246 -1.3694327
## EC 1.9511760 -0.2335959
## pH -1.7900292 -0.8316702
## PO4 1.7350191 -0.3115872
## SO4 1.0079759 -1.4758483
## Ca 1.1624320 -0.3194442
## K 1.9162096 -0.2738249
## C 1.9856287 0.4858992
## N 1.6680836 0.6502840
## attr(,"const")
## [1] 6.593492
# scores(pcasoil, display="sites") # removing this one because it's very long, per site
# to have more than only the first two axes
scores(pcasoil, display="sp", choices=c(1:9))## PC1 PC2 PC3 PC4 PC5 PC6
## depth 0.1155246 -1.3694327 1.53868195 -0.7083424475 -0.2485411 -0.09047628
## EC 1.9511760 -0.2335959 0.14294505 -0.0006700999 0.4731292 0.56947167
## pH -1.7900292 -0.8316702 -0.45738859 0.2429270170 -0.3172370 -0.03866796
## PO4 1.7350191 -0.3115872 0.11537687 0.6608425267 0.2236169 -1.09141511
## SO4 1.0079759 -1.4758483 -0.88376270 0.5801038923 -0.5240379 0.28303767
## Ca 1.1624320 -0.3194442 -1.17769679 -1.3622408979 0.1891140 -0.30742577
## K 1.9162096 -0.2738249 0.12852765 0.2652172662 0.6475193 0.33711942
## C 1.9856287 0.4858992 0.20457789 0.0514432351 -0.4341713 0.03357189
## N 1.6680836 0.6502840 0.07896231 -0.1439777369 -1.1513712 0.04983905
## PC7 PC8 PC9
## depth -0.03261878 0.03737931 -0.007488184
## EC 0.19935589 -0.59381610 0.088659975
## pH 0.65297725 -0.16574529 -0.332228748
## PO4 0.11069088 -0.12654129 0.059997060
## SO4 -0.37994898 0.07992706 0.115250299
## Ca 0.04423058 0.01802693 -0.044046796
## K 0.41057448 0.53191781 -0.111001447
## C -0.31953617 -0.07485031 -0.556599613
## N 0.46213824 0.06296055 0.229012959
## attr(,"const")
## [1] 6.593492
This gives all samples scores and species scores. We will walk through the various elements below.
Answer: The first axis (the very important one) PC1 is mostly explained by variation in C, EC, K, pH, PO4 and N (in that order). Along the second axis PC2, SO4 and depth are strongly contributing to explaining variation.
8. Create a biplot and carefully interpret the pattern shown.
We plot the results in a so-called biplot, a specialized chart that allows you to see two different things at the same time, in this case the different soil chemical properties and test sites. The “text” is linked to species to ensure that the species names are provided, while “points” refer to the differen test sites.
# or a slightly prettier version of the biplot
biplot(pcasoil,col=c("black"), cex=c(2.0), display=c("sp"), pch=3,
xlab="PCA axis 1 (52% explained)", ylab="PCA axis 2 (13% explained")
points(pcasoil, cex=c(1.0), display=c("wa"), lwd=0.5, scaling=2)# the scaling statement determines the relative size of the site scores vs the species scores
# scaling can be given 0, 1, 2, or 3 and is mostly a matter of trial and error to find the optimal oneAnswer: The plot shows that N and C are strongly correlated and strongly negatively correlated to pH. These three soil characteristics may be driven by the same factor. EC, K, PO4 and Ca are also strongly correlated among each other, but slightly decoupled from N-C-pH. Depth varies completely independently (nearly orthogonal) from pH-N-C.
9. Evaluate the relationships between pH and N, between pH and SO4, and between N and SO4, with a type-II regression. How do the results compare to what you see in the PCA?
As indicated during the lecture, a PCA is a generalization of a type-II regression (also called a major axis regression). Evaluate these two relationships with PCA analysis.
## Loading required package: smatr
## Call: sma(formula = pH ~ N, data = soil)
##
## Fit using Standardized Major Axis
##
## ------------------------------------------------------------
## Coefficients:
## elevation slope
## estimate 5.670129 -4.722735
## lower limit 5.551334 -5.252119
## upper limit 5.788924 -4.246710
##
## H0 : variables uncorrelated
## R-squared : 0.3906819
## P-value : < 2.22e-16
## slope
## -4.722735
## Call: sma(formula = pH ~ SO4, data = soil)
##
## Fit using Standardized Major Axis
##
## ------------------------------------------------------------
## Coefficients:
## elevation slope
## estimate 8.006209 -0.9726039
## lower limit 7.564841 -1.1141323
## upper limit 8.447578 -0.8490538
##
## H0 : variables uncorrelated
## R-squared : 0.001328701
## P-value : 0.59853
## slope
## -0.9726039
## Call: sma(formula = N ~ SO4, data = soil)
##
## Fit using Standardized Major Axis
##
## ------------------------------------------------------------
## Coefficients:
## elevation slope
## estimate -0.4946457 0.2059408
## lower limit -0.5852332 0.1803365
## upper limit -0.4040582 0.2351805
##
## H0 : variables uncorrelated
## R-squared : 0.04650301
## P-value : 0.0016272
## slope
## 0.2059408
Answer: The results show that pH is indeed strongly negatively related to N (R2 of 0.39), while SO4 is almost not related to N (R2=0.04). This coincides with the patterns shown in the biplot, which will not correspond completely, since it is also based on the correlations with all other variables in the dataset.
10. Interpret each of the figures individually. What are the sample scores, what are the species scores? Which factors will probably be most important? What do you conclude for each of the figures?
Interpretation of the graphics of ordination is key to its understanding.
Therefore, take in addition these two figures from a PCA. Figure 1 relates species coming from different genera of plant species (like “bidens”) to plant traits (like “lignin”). Figure 2 shows microbial communities as associated to intact roots (“roots”), cells sloughed off from roots into the soil environments (“cells”) and controls (i.e. soil without any roots; “control”).
Answer - Figure 1: Describes the coordination of traits for 8 different genera.
- Trait variation across the genera (the samples scores) is mainly
determined by LDMC, phenols, SSD and leaf P (having the highest species
scores on the first axis). Of these, acer, solidago and fraxinus have
the highest values in LDMC, phenolics and SSD, but the lowest values for
leaf P. All traits together allowed clearly differentiating among the
genera, as exemplified by the clustering of the sample scores for each
genus.
- SRL, LS and MRD varied independently from LDMC and phenolics and
contributed most strongly to the second axis (and were also important in
the overall distinction. The extent to which these traits were
important, will depend on the explained variance for the first vs. the
second axis.
- Visually, these relationships are confirmed by the angles of the trait vectors. The arrow for Leaf P points in the exact opposite direction (180 degrees) to LDMC and Phenols, illustrating a direct trade-off where plants invest in one strategy at the expense of the other. In contrast, the traits driving the second axis, such as SRL, form a near 90-degree angle with the first axis, which visually proves that root structure varies independently from leaf toughness. The relative length of these arrows further indicates that these specific traits are the strongest drivers of variation in the dataset.
Answer - Figure 2: Only shows sample scores!
- Also here, the data show a strong clustering of sample scores by
treatment. However, only the ‘day 5 cells’ occupied a different position
in PCA-space compared to the other sample scores, i.e. there the
microbial community was different.
- Visually, this separation is confirmed by the position of the convex hulls (the colored polygons). The solid red polygon representing ‘Day 5 Cells’ is completely isolated on the far left of the plot, with no overlap with the other groups. In contrast, the heavy overlap among all the other polygons—regardless of whether they are ‘Control’, ‘Roots’, or ‘Day 45’—confirms that those microbial communities are statistically similar to one another.