Fri Nov 14 01:46:08 2025

Background

Soils form the foundation to many functions, from nature to agricultural systems and from one of the most important systems for carbon sequestration (in the context of climate change) to the location sustainable food production. Justifiably, soils are receiving increasing attention in the context of sustainability. For instance, Key Performance Indicators (KPIs) are being developed to identify good soil quality for agricultural practices and the World Bank has developed methods to monitor soil carbon and soil quality as an important component of carbon accounting and similar systems. All in all, that indicates the necessity of better understanding soil quality and related soil chemical properties.

We take the situation in which a comprehensive set of soil chemical properties was analysed. This dataset describes soil properties of more than 200 soils in the Amazon. Properties measured include the depth of the soil profile (till bedrock), electric conductivity (i.e. how many salts occur in the soil solution), pH (acidity), and concentrations of PO4 (phosphate), SO4 (sulphate), Ca (calcium), K (potassium), C (carbon/soil organic matter) and N (nitrogen). We want to understand the variation in these properties and the patterns among the various properties involved.

Aims

In this assignment, we will:

Note: If the commands are not provided, they have been introduced in previous assignments.


1. Read the datafile into R and check whether the data have been read properly.

# loading data
soil <- read.table("data-unconstrained-ordination.txt", sep="\t", header=TRUE, row.names="plot")
# row.names sets the plot name row to the index row all in one

# overview of data
head(soil)
##       depth   EC  pH  PO4  SO4   Ca    K    C    N
## pp127  1.79 5.30 3.2 2.65 3.43 1.70 3.16 3.85 0.69
## pp128  2.77 4.44 4.0 2.65 2.92 1.16 2.62 3.61 0.92
## pp137  1.79 3.61 4.1 1.70 2.60 0.18 1.67 1.59 0.10
## pp138  1.79 5.38 2.7 2.14 2.92 0.99 3.15 4.03 0.59
## pp139  0.69 4.37 3.2 2.56 3.26 1.70 2.85 3.93 0.96
## pp143  0.69 3.89 4.2 2.14 2.60 0.47 2.52 2.24 0.26
tail(soil)
##     depth   EC  pH  PO4  SO4   Ca    K    C    N
## r94  4.11 2.27 4.6 0.47 2.71 0.69 1.22 1.10 0.09
## r95  3.76 2.01 6.0 0.53 3.30 0.47 0.83 0.58 0.04
## r96  2.20 2.00 6.1 0.79 2.89 0.18 0.99 0.39 0.03
## r97  3.14 2.21 6.6 0.88 3.09 0.10 0.96 0.46 0.03
## r98  3.04 2.94 5.2 1.03 3.09 0.41 1.16 0.83 0.08
## r99  3.58 2.43 6.1 0.79 3.53 0.26 0.83 0.30 0.04
dim(soil)
## [1] 211   9
summary(soil)
##      depth             EC              pH             PO4            SO4       
##  Min.   :0.690   Min.   :0.100   Min.   :2.700   Min.   :0.18   Min.   :0.590  
##  1st Qu.:2.080   1st Qu.:2.480   1st Qu.:4.400   1st Qu.:0.83   1st Qu.:2.675  
##  Median :2.770   Median :3.140   Median :4.900   Median :1.31   Median :3.220  
##  Mean   :2.794   Mean   :3.025   Mean   :4.947   Mean   :1.40   Mean   :3.145  
##  3rd Qu.:3.430   3rd Qu.:3.750   3rd Qu.:5.500   3rd Qu.:1.86   3rd Qu.:3.660  
##  Max.   :5.300   Max.   :6.210   Max.   :6.600   Max.   :4.57   Max.   :5.500  
##        Ca               K               C               N        
##  Min.   :0.0500   Min.   :0.100   Min.   :0.170   Min.   :0.010  
##  1st Qu.:0.2600   1st Qu.:1.265   1st Qu.:0.810   1st Qu.:0.050  
##  Median :0.5300   Median :1.760   Median :1.280   Median :0.100  
##  Mean   :0.6298   Mean   :1.890   Mean   :1.413   Mean   :0.153  
##  3rd Qu.:0.9000   3rd Qu.:2.480   3rd Qu.:1.785   3rd Qu.:0.180  
##  Max.   :2.8900   Max.   :3.970   Max.   :4.030   Max.   :0.960
# checking distributions of data
par(mfrow=c(3,3))
hist(soil$depth,main="Till bedrock (Depth)")
hist(soil$EC,main="electric conductivity (EC)")
hist(soil$pH,main="Acidity (pH)")
hist(soil$PO4,main="(Phosphate) PO4")
hist(soil$SO4,main="Sulphate (SO4)")
hist(soil$Ca,main="Calcium (Ca)")
hist(soil$K,main="Potassium (K)")
hist(soil$C,main="Carbon/soil organic matter (C)")
hist(soil$N,main="Nitrogen (N)")


2. Check for any missing variables and interpret

Any multivariate analysis assumes a complete dataset. If one variable is missing, a complete row is eliminated from further analysis. This tests whether there are NAs in the dataset:

# checking for missing variables
sum(is.na(soil)) # there appear to be no NAs present
## [1] 0
# this would show which variables the NAs are in
which(is.na(soil))
## integer(0)

Answer: The results are 0 and integer(0), which implies that are no NAs in the dataset. Hence, we can proceed with the analysis.


3. Run the correlation matrix as explained in the assignment on multiple regression and interpret.

As always, we start with exploring the dataset. In a multivariate context, correlations are the most obvious way to preliminary evaluate patterns among the variables.

# defining correlation matrix
cor_matrix <- cor(soil, use="pairwise.complete.obs") # takes NAs into account (even if there are none)
# this only works if all variables are numeric!

# installing corrplot package 
library(corrplot)

# visualizing the correlation matrix
corrplot(cor_matrix, method="circle")

Answer: Very strong correlations between carbon and acidity, electric conductivity and acidity, potassium and electric conductivity, and nitrogen and carbon. Moderately strong correlations between nitrogen and acidity, potassium and acidity, as well as carbon and electric conductivity and potassium and phosphate, carbon and phosphate and potassium and carbon. Till bedrock appears to lack any sort of correlation with any of the other variables.


4. Explore the correlation and interpret the results more closely, also in relation to the previous question.

Another way to explore correlations is through:

# another way to explore correlations - installing required package
library(PerformanceAnalytics)

# creating correlation chart
chart.Correlation(soil[,c(1:9)], histogram=TRUE, pch=19)

Answer: The correlations show the same patterns. In addition though, the plot suggests that nitrogen (especially) is not normally distributed. However, this skewed distribution does not seem to strongly affect the linearity of the relations with other variables. Therefore, no transformation is needed for this analysis (given that we focus on linear relationships among variables). Some other relationships show some curvatures. However, those are mostly related to individual relationships and not to a particular variable. Therefore, no transformations are necessary nor applied.


5. Run the PCA: Why do we refer to columns 2-10 and why do we scale?

We will now run the PCA:

# installing vegan package
library(vegan)
## Loading required package: permute
# running PCA analysis
pcasoil <- rda(soil[,c(1:9)], scale=TRUE)

Answer: The 2:10 range refers to the variables in the columns, essentially you are telling it which dimensions should be considered in the PCA analysis. Originally, the first column provided the site names, which should not be included in the analysis, but this was already taken care of when loading the data. Scaling is important to standardize the data given they each have very different units and ranges of values and most likely also variances.

The output of the PCA is given by:

# summary of PCA output
summary(pcasoil)
## 
## Call:
## rda(X = soil[, c(1:9)], scale = TRUE) 
## 
## Partitioning of correlations:
##               Inertia Proportion
## Total               9          1
## Unconstrained       9          1
## 
## Eigenvalues, and their contribution to the correlations 
## 
## Importance of components:
##                          PC1    PC2    PC3     PC4     PC5    PC6     PC7
## Eigenvalue            4.7199 1.1868 1.0026 0.67973 0.55483 0.3762 0.22979
## Proportion Explained  0.5244 0.1319 0.1114 0.07553 0.06165 0.0418 0.02553
## Cumulative Proportion 0.5244 0.6563 0.7677 0.84323 0.90488 0.9467 0.97220
##                           PC8     PC9
## Eigenvalue            0.14423 0.10593
## Proportion Explained  0.01603 0.01177
## Cumulative Proportion 0.98823 1.00000
# unfortunately this only gives the explained variance by the axes

This gives all samples scores and species scores. We will walk through the various elements.The first element is the overall performance of the axes; i.e. the explained variance by the individual axes.

Note: To fully interpret a PCA, you need three pieces of information, but the summary(pcasoil) output in your screenshot only gives you one of them.

1. What you HAVE: “Explained Variance” (The “How Much”)
The table in your screenshot (under “Importance of components”) tells you how much information is captured by each axis (PC1, PC2, etc.).
- Eigenvalue: The raw amount of variation an axis represents.
- Proportion Explained: PC1 explains 0.5244 (or 52.44%) of the variation in your soil data.
- Cumulative Proportion: PC1 and PC2 together explain 0.6563 (or 65.63%) of the data.
What this tells you: “These axes are doing a good job of summarizing the data.” What this fails to tell you: “What do these axes actually represent biologically or chemically?”

2. What is MISSING: The “What” and the “Where”
To actually understand your soil data, you need the two parts that “unfortunately” aren’t shown in this specific table:
a. Loadings (Species/Variable Scores):
- The Question: “Which soil variables are driving PC1?”
- The Missing Data: You don’t know if PC1 represents a gradient of pH, Nitrogen, or Moisture. Without the loadings, PC1 is just a meaningless math construct.
b. Scores (Site/Sample Scores):
- The Question: “Which soil samples are similar to each other?”
- The Missing Data: You can’t see if Sample A and Sample B are close together on the graph.


6. How much of the soil variation is represented (explained) by the first and second axis?

Answer: The cumulative soil variation explained by PC1 and PC2 is 0.6563, or around 66%, whereof PC1 explains roughly 52% and PC2 explains an additional roughly 13%. This means that more than half of the variation in soil chemical properties is explained by the first axis PC1, and hence this is a good estimate.

The terminology of a multivariate analysis and of the vegan() package in particular may be a bit confusing. Many methods have originally been developed in vegetation science where one wanted to understand the patterns of plant species; which plant species do co-occur and which plant species exclude each other (e.g. due to competition or due to other habitat preferences). This terminology is still used in all multivariate analyses, also in cases where other objects are analysed. That implies that in the context of multivariate analysis, “species” refers to any object of which you want to analyse its relation to other objects. In this case study, “species” refers to the various soil chemical properties analyzed.


7. Retrieve the species scores of the PCA by the “scores” statement. Which elements determine mostly the variation along the first PCA axis? And which along the second axis?

# the following apparently does not give the scores for all axes - only the first two axes
scores(pcasoil, display="sp")
##              PC1        PC2
## depth  0.1155246 -1.3694327
## EC     1.9511760 -0.2335959
## pH    -1.7900292 -0.8316702
## PO4    1.7350191 -0.3115872
## SO4    1.0079759 -1.4758483
## Ca     1.1624320 -0.3194442
## K      1.9162096 -0.2738249
## C      1.9856287  0.4858992
## N      1.6680836  0.6502840
## attr(,"const")
## [1] 6.593492
# scores(pcasoil, display="sites") # removing this one because it's very long, per site

# to have more than only the first two axes
scores(pcasoil, display="sp", choices=c(1:9))
##              PC1        PC2         PC3           PC4        PC5         PC6
## depth  0.1155246 -1.3694327  1.53868195 -0.7083424475 -0.2485411 -0.09047628
## EC     1.9511760 -0.2335959  0.14294505 -0.0006700999  0.4731292  0.56947167
## pH    -1.7900292 -0.8316702 -0.45738859  0.2429270170 -0.3172370 -0.03866796
## PO4    1.7350191 -0.3115872  0.11537687  0.6608425267  0.2236169 -1.09141511
## SO4    1.0079759 -1.4758483 -0.88376270  0.5801038923 -0.5240379  0.28303767
## Ca     1.1624320 -0.3194442 -1.17769679 -1.3622408979  0.1891140 -0.30742577
## K      1.9162096 -0.2738249  0.12852765  0.2652172662  0.6475193  0.33711942
## C      1.9856287  0.4858992  0.20457789  0.0514432351 -0.4341713  0.03357189
## N      1.6680836  0.6502840  0.07896231 -0.1439777369 -1.1513712  0.04983905
##               PC7         PC8          PC9
## depth -0.03261878  0.03737931 -0.007488184
## EC     0.19935589 -0.59381610  0.088659975
## pH     0.65297725 -0.16574529 -0.332228748
## PO4    0.11069088 -0.12654129  0.059997060
## SO4   -0.37994898  0.07992706  0.115250299
## Ca     0.04423058  0.01802693 -0.044046796
## K      0.41057448  0.53191781 -0.111001447
## C     -0.31953617 -0.07485031 -0.556599613
## N      0.46213824  0.06296055  0.229012959
## attr(,"const")
## [1] 6.593492

This gives all samples scores and species scores. We will walk through the various elements below.

Answer: The first axis (the very important one) PC1 is mostly explained by variation in C, EC, K, pH, PO4 and N (in that order). Along the second axis PC2, SO4 and depth are strongly contributing to explaining variation.


8. Create a biplot and carefully interpret the pattern shown.

We plot the results in a so-called biplot, a specialized chart that allows you to see two different things at the same time, in this case the different soil chemical properties and test sites. The “text” is linked to species to ensure that the species names are provided, while “points” refer to the differen test sites.

# creating a biplot
biplot(pcasoil, display=c("species","sites"), type = c("text", "points"))

# or a slightly prettier version of the biplot
biplot(pcasoil,col=c("black"), cex=c(2.0), display=c("sp"), pch=3, 
       xlab="PCA axis 1 (52% explained)", ylab="PCA axis 2 (13% explained")
points(pcasoil, cex=c(1.0), display=c("wa"), lwd=0.5, scaling=2)

# the scaling statement determines the relative size of the site scores vs the species scores
# scaling can be given 0, 1, 2, or 3 and is mostly a matter of trial and error to find the optimal one

Answer: The plot shows that N and C are strongly correlated and strongly negatively correlated to pH. These three soil characteristics may be driven by the same factor. EC, K, PO4 and Ca are also strongly correlated among each other, but slightly decoupled from N-C-pH. Depth varies completely independently (nearly orthogonal) from pH-N-C.


9. Evaluate the relationships between pH and N, between pH and SO4, and between N and SO4, with a type-II regression. How do the results compare to what you see in the PCA?

As indicated during the lecture, a PCA is a generalization of a type-II regression (also called a major axis regression). Evaluate these two relationships with PCA analysis.

# running a major axis regression for both relationships
require(smatr)
## Loading required package: smatr
# acidity (pH) and nitrogen (N)
ma1 <- sma(formula=pH~N, data=soil)
summary(ma1)
## Call: sma(formula = pH ~ N, data = soil) 
## 
## Fit using Standardized Major Axis 
## 
## ------------------------------------------------------------
## Coefficients:
##             elevation     slope
## estimate     5.670129 -4.722735
## lower limit  5.551334 -5.252119
## upper limit  5.788924 -4.246710
## 
## H0 : variables uncorrelated
## R-squared : 0.3906819 
## P-value : < 2.22e-16
plot(ma1)

coef(ma1)[2] # slope
##     slope 
## -4.722735
# acidity (pH) and sulphate (SO4)
ma2 <- sma(formula=pH~SO4, data=soil)
plot(ma2)

summary(ma2)
## Call: sma(formula = pH ~ SO4, data = soil) 
## 
## Fit using Standardized Major Axis 
## 
## ------------------------------------------------------------
## Coefficients:
##             elevation      slope
## estimate     8.006209 -0.9726039
## lower limit  7.564841 -1.1141323
## upper limit  8.447578 -0.8490538
## 
## H0 : variables uncorrelated
## R-squared : 0.001328701 
## P-value : 0.59853
coef(ma2)[2] # slope
##      slope 
## -0.9726039
# nitrogen (N) and sulphate (SO4)
ma3 <- sma(formula=N~SO4, data=soil)
plot(ma3)

summary(ma3)
## Call: sma(formula = N ~ SO4, data = soil) 
## 
## Fit using Standardized Major Axis 
## 
## ------------------------------------------------------------
## Coefficients:
##              elevation     slope
## estimate    -0.4946457 0.2059408
## lower limit -0.5852332 0.1803365
## upper limit -0.4040582 0.2351805
## 
## H0 : variables uncorrelated
## R-squared : 0.04650301 
## P-value : 0.0016272
coef(ma3)[2] # slope
##     slope 
## 0.2059408

Answer: The results show that pH is indeed strongly negatively related to N (R2 of 0.39), while SO4 is almost not related to N (R2=0.04). This coincides with the patterns shown in the biplot, which will not correspond completely, since it is also based on the correlations with all other variables in the dataset.


10. Interpret each of the figures individually. What are the sample scores, what are the species scores? Which factors will probably be most important? What do you conclude for each of the figures?

Interpretation of the graphics of ordination is key to its understanding.

Therefore, take in addition these two figures from a PCA. Figure 1 relates species coming from different genera of plant species (like “bidens”) to plant traits (like “lignin”). Figure 2 shows microbial communities as associated to intact roots (“roots”), cells sloughed off from roots into the soil environments (“cells”) and controls (i.e. soil without any roots; “control”).

Figure 1
Figure 1

Answer - Figure 1: Describes the coordination of traits for 8 different genera.

Figure 2
Figure 2

Answer - Figure 2: Only shows sample scores!