This document is a supplement to the article “Ordination analysis in sedimentology, geochemistry and paleoenvironment - background, current trends and recommendations” https://doi.org/10.31223/X5N31Q

Preparing and cleaning the data

One of the common trivial causes for errors is that variable classes are not correctly assigned. Categorical variables should have the class “factor”, numeric variables should be “numeric” etc. If the original dataset contains text annotations in variables that are otherwise meant to be numeric (e.g. “not measured” or “amount too small” or “<0.01”), the entire variable will be imported as character and the calculations won’t work. This problem also applies to Excel spreadsheets, only variable format cannot be controlled in a reproducible, platform-independent way. In R the best practice is to explicitly state the format of the variable at import - see the example below.

The dataset is first cleaned from columns containing redundant variables, e.g. standard deviations etc. and then of rows containing empty cells. Two categorical variables (Lithological description and packing) are relegated to a separate list describing sample properties.

original <- read.csv(file="S4._Dataset_Bialik_et_al_2018.csv", header=T, colClasses=c("character",rep("factor",2), rep("numeric",20))) #stating variable format
rownames(original)<-original$Sample.ID #the first column is used as row names
original<-original[,-1] #and the dropped from the dataset

colnames(original)<-c("Lithological descripton","Packing",  "Calcite (%)",  "MgCO3 in dolomite (%mol)", "Sr(ppm)",  "IC (%Wt)", "Ti(ppm)",  "Zr(ppm)",  "Mn(ppm)",  "V/Cr", "Ni/Co",    "Ce/Ce*",   "d13C (VPDB)",  "1σ",   "d18O (VPDB)",  "1σ",   "87Sr/86Sr Initial Value",  "1 SE", "d26MgDSM3",    "2σ",   "d25MgDSM3",    "2σ") 

Turns out some R packages have difficulty with Greek letters in variable names so we use d instead of δ. Also it turns out there are only 6 cases that are complete in all variables. This is largely driven by two variables: δ25MgDSM3 and IC (%Wt) that have complementary measurement patterns so let’s try to visualise how they are distributed.

reduced <- original[,-c(14,16,18,20,22)] #dropping SDs
completeCases <- reduced[complete.cases(reduced), ] 
pheatmap(reduced[,3:17], cluster_rows=FALSE, cluster_cols=FALSE)

# Keeping only rows without empty cells

As the variables are not normalized, the heat map doesn’t tell us anything about their relationships (yet) but it shows empty cells. Let’s take δ25MgDSM3 as the main offender and kick it out.

reduced2 <- reduced[,-c(16:17)]
pheatmap(reduced2[,3:15], cluster_rows=FALSE, cluster_cols=FALSE)

This looks much better, but if we now exclude rows containing NAs, we will end up with 14 rows only (14 x 15 matrix), whereas if we kick out IC (%Wt), we will have the following dimensions:

reduced3 <- reduced2[,-6]
dim(reduced3[complete.cases(reduced3),])
## [1] 61 14

We keep this subset for further analysis. The first two variables are categorical and will be used as sample properties to identify groups in ordination plots, but not used in the ordination itself. We can immediately visualise how the variables are distributed between the lithological classes.

reduced3<-reduced3[complete.cases(reduced3),]
sample.prop<-reduced3[,1:2]
reduced3<-reduced3[,3:14]
pairs(reduced3,
      lower.panel = NULL, 
      col = brewer.pal(9, name="RdBu")[sample.prop$`Lithological descripton`])
par(xpd = TRUE)
legend("bottomleft", fill = brewer.pal(9, name="RdBu"), legend = c( levels(sample.prop$`Lithological descripton`)))

Here we see some scientific insights, e.g. that marls seems to be an outlier with respect to most variables, but we also see another common causes for analysis complications: the same levels of the Lithological description factor are spelled in two different was and thus treated as different. We can also order the levels of the factor to impose color coding (similar lithologies will be plotted in similar colors).

levels(sample.prop$`Lithological descripton`)[levels(sample.prop$`Lithological descripton`)=="dolomite"] = "Dolomite"
levels(sample.prop$`Lithological descripton`)[levels(sample.prop$`Lithological descripton`)=="Dolomitic Limestone"] = "Dolomitic limestone"
levels(sample.prop$`Lithological descripton`)[levels(sample.prop$`Lithological descripton`)=="marl"] = "Marl" #For consistency
sample.prop$`Lithological descripton` <- factor(sample.prop$`Lithological descripton`, levels=c("Marl","Marly limestone","Limestone","Dolomitic limestone","Dolomite","Marly dolomite","Dolomitic marl"))

Hopefully the data overview will be more legible now:

We can do the same for the Embry & Klovan categories:

Transforming the data

First we need to see how the variables are distributed.

It’s as bad as it gets! If we run tests for multivariate and univariate normality, we will see that none of the variables neither the dataset as a whole is normally distributed. Let us only display Royston’s test here, but other tests are available in this package.

mvn(reduced3, mvnTest = "royston")
## $multivariateNormality
##      Test        H      p value MVN
## 1 Royston 223.7432 3.459291e-42  NO
## 
## $univariateNormality
##            Test                 Variable Statistic   p value Normality
## 1  Shapiro-Wilk       Calcite (%)           0.5666  <0.001      NO    
## 2  Shapiro-Wilk MgCO3 in dolomite (%mol)    0.8747  <0.001      NO    
## 3  Shapiro-Wilk         Sr(ppm)             0.9172   5e-04      NO    
## 4  Shapiro-Wilk         Ti(ppm)             0.9583  0.0363      NO    
## 5  Shapiro-Wilk         Zr(ppm)             0.9048   2e-04      NO    
## 6  Shapiro-Wilk         Mn(ppm)             0.8511  <0.001      NO    
## 7  Shapiro-Wilk           V/Cr              0.9571  0.0319      NO    
## 8  Shapiro-Wilk          Ni/Co              0.2006  <0.001      NO    
## 9  Shapiro-Wilk          Ce/Ce*             0.8332  <0.001      NO    
## 10 Shapiro-Wilk       d13C (VPDB)           0.9161   5e-04      NO    
## 11 Shapiro-Wilk       d18O (VPDB)           0.9235   0.001      NO    
## 12 Shapiro-Wilk 87Sr/86Sr Initial Value     0.9261  0.0012      NO    
## 
## $Descriptives
##                           n        Mean      Std.Dev     Median        Min
## Calcite (%)              61   0.1424590 2.832529e-01   0.000000   0.000000
## MgCO3 in dolomite (%mol) 61  49.9942623 2.294670e+00  50.430000  44.470000
## Sr(ppm)                  61 218.7191803 9.083095e+01 197.890000 107.990000
## Ti(ppm)                  61 782.9603279 3.755812e+02 741.790000  86.280000
## Zr(ppm)                  61  37.4554098 2.831086e+01  29.280000   0.830000
## Mn(ppm)                  61 590.8063934 3.536703e+02 523.030000 144.460000
## V/Cr                     61   2.4416393 1.138201e+00   2.500000   0.300000
## Ni/Co                    61  11.3395082 2.455367e+01   7.180000   3.320000
## Ce/Ce*                   61   1.1780328 2.847269e-01   1.290000   0.120000
## d13C (VPDB)              61   0.8752459 1.447708e+00   0.840000  -5.330000
## d18O (VPDB)              61  -0.4129508 2.017472e+00  -0.380000  -6.430000
## 87Sr/86Sr Initial Value  61   0.7072992 4.195248e-04   0.707318   0.705767
##                                  Max       25th       75th       Skew
## Calcite (%)                 0.970000   0.000000   0.080000  1.8497380
## MgCO3 in dolomite (%mol)   53.470000  48.000000  51.530000  0.1233786
## Sr(ppm)                   486.120000 138.560000 282.390000  0.6929325
## Ti(ppm)                  2048.990000 584.840000 975.780000  0.5684699
## Zr(ppm)                   131.300000  19.630000  49.290000  1.1268288
## Mn(ppm)                  2266.560000 363.950000 748.390000  1.9219183
## V/Cr                        5.560000   1.580000   3.410000  0.2472945
## Ni/Co                     197.870000   6.010000  10.280000  7.1848876
## Ce/Ce*                      1.530000   1.050000   1.360000 -1.6189409
## d13C (VPDB)                 3.130000   0.140000   1.890000 -1.1681827
## d18O (VPDB)                 4.450000  -1.040000   0.910000 -0.7078789
## 87Sr/86Sr Initial Value     0.708469   0.707126   0.707512 -0.5893981
##                            Kurtosis
## Calcite (%)               1.9943943
## MgCO3 in dolomite (%mol) -1.1520514
## Sr(ppm)                  -0.4073033
## Ti(ppm)                   1.2183028
## Zr(ppm)                   0.9726103
## Mn(ppm)                   6.2905269
## V/Cr                     -0.7638637
## Ni/Co                    51.6058952
## Ce/Ce*                    2.5741393
## d13C (VPDB)               3.5592559
## d18O (VPDB)               1.5210377
## 87Sr/86Sr Initial Value   2.7959107
reduced3t <- reduced3
# Positively skewed variables
reduced3t$`Calcite (%)`<-(reduced3$`Calcite (%)`)^(1/3)
reduced3t$`Sr(ppm)`<-sqrt(reduced3$`Sr(ppm)`)
reduced3t$`Zr(ppm)`<-sqrt(reduced3$`Zr(ppm)`)
reduced3t$`Mn(ppm)`<-sqrt(reduced3$`Mn(ppm)`)
reduced3t$`Ni/Co`<-(reduced3$`Ni/Co`)^(1/3)

par(mfrow = c(3,4))
for (i in c(1,3,5,6,8)){
  plot(density(reduced3[,i]), col="blue", main = colnames(reduced3)[i], xlab='')
  plot(density(reduced3t[,i]), col="red", main = c(colnames(reduced3)[i], "transformed"), xlab='')
}

Did it improve test results?

mvn(reduced3t, mvnTest = "royston")
## $multivariateNormality
##      Test        H      p value MVN
## 1 Royston 166.3988 1.672763e-30  NO
## 
## $univariateNormality
##            Test                 Variable Statistic   p value Normality
## 1  Shapiro-Wilk       Calcite (%)           0.6707  <0.001      NO    
## 2  Shapiro-Wilk MgCO3 in dolomite (%mol)    0.8747  <0.001      NO    
## 3  Shapiro-Wilk         Sr(ppm)             0.9390  0.0045      NO    
## 4  Shapiro-Wilk         Ti(ppm)             0.9583  0.0363      NO    
## 5  Shapiro-Wilk         Zr(ppm)             0.9839  0.6044      YES   
## 6  Shapiro-Wilk         Mn(ppm)             0.9589  0.0389      NO    
## 7  Shapiro-Wilk           V/Cr              0.9571  0.0319      NO    
## 8  Shapiro-Wilk          Ni/Co              0.5075  <0.001      NO    
## 9  Shapiro-Wilk          Ce/Ce*             0.8332  <0.001      NO    
## 10 Shapiro-Wilk       d13C (VPDB)           0.9161   5e-04      NO    
## 11 Shapiro-Wilk       d18O (VPDB)           0.9235   0.001      NO    
## 12 Shapiro-Wilk 87Sr/86Sr Initial Value     0.9261  0.0012      NO    
## 
## $Descriptives
##                           n        Mean      Std.Dev     Median        Min
## Calcite (%)              61   0.2276147 3.507496e-01   0.000000  0.0000000
## MgCO3 in dolomite (%mol) 61  49.9942623 2.294670e+00  50.430000 44.4700000
## Sr(ppm)                  61  14.4884418 2.991818e+00  14.067338 10.3918237
## Ti(ppm)                  61 782.9603279 3.755812e+02 741.790000 86.2800000
## Zr(ppm)                  61   5.6584663 2.351125e+00   5.411100  0.9110434
## Mn(ppm)                  61  23.3885394 6.671756e+00  22.869849 12.0191514
## V/Cr                     61   2.4416393 1.138201e+00   2.500000  0.3000000
## Ni/Co                    61   2.0471333 5.565037e-01   1.929189  1.4918072
## Ce/Ce*                   61   1.1780328 2.847269e-01   1.290000  0.1200000
## d13C (VPDB)              61   0.8752459 1.447708e+00   0.840000 -5.3300000
## d18O (VPDB)              61  -0.4129508 2.017472e+00  -0.380000 -6.4300000
## 87Sr/86Sr Initial Value  61   0.7072992 4.195248e-04   0.707318  0.7057670
##                                   Max       25th        75th       Skew
## Calcite (%)                 0.9898983   0.000000   0.4308869  1.1132226
## MgCO3 in dolomite (%mol)   53.4700000  48.000000  51.5300000  0.1233786
## Sr(ppm)                    22.0481292  11.771151  16.8044637  0.4095604
## Ti(ppm)                  2048.9900000 584.840000 975.7800000  0.5684699
## Zr(ppm)                    11.4586212   4.430576   7.0206837  0.1297347
## Mn(ppm)                    47.6084026  19.077474  27.3567176  0.7440215
## V/Cr                        5.5600000   1.580000   3.4100000  0.2472945
## Ni/Co                       5.8272008   1.818130   2.1743579  5.1621686
## Ce/Ce*                      1.5300000   1.050000   1.3600000 -1.6189409
## d13C (VPDB)                 3.1300000   0.140000   1.8900000 -1.1681827
## d18O (VPDB)                 4.4500000  -1.040000   0.9100000 -0.7078789
## 87Sr/86Sr Initial Value     0.7084690   0.707126   0.7075120 -0.5893981
##                            Kurtosis
## Calcite (%)              -0.4157119
## MgCO3 in dolomite (%mol) -1.1520514
## Sr(ppm)                  -0.9489085
## Ti(ppm)                   1.2183028
## Zr(ppm)                  -0.2702613
## Mn(ppm)                   1.2785022
## V/Cr                     -0.7638637
## Ni/Co                    32.1130943
## Ce/Ce*                    2.5741393
## d13C (VPDB)               3.5592559
## d18O (VPDB)               1.5210377
## 87Sr/86Sr Initial Value   2.7959107

The data does not follow a multinormal distribution and for only one variable (Zr content) univariate normality cannot be rejected. However, we’ve reduced the skewnesses of these variables, which can lead to artefacts in PCA. Skewness in the original variables:

skewness(reduced3[,c(1,3,5,6,8)])
## Calcite (%)     Sr(ppm)     Zr(ppm)     Mn(ppm)       Ni/Co 
##   1.8961736   0.7103278   1.1551166   1.9701659   7.3652561

Skewness in the transformed variables:

skewness(reduced3t[,c(1,3,5,6,8)])
## Calcite (%)     Sr(ppm)     Zr(ppm)     Mn(ppm)       Ni/Co 
##   1.1411688   0.4198420   0.1329915   0.7626993   5.2917590

PCA

Finally we can run PCA on this dataset but it doesn’t look great…

pca1<-princomp(reduced3t, cor = T, scores = T)
ggbiplot(pca1, obs.scale = 1, var.scale = 1,
         groups = sample.prop$`Lithological descripton`, ellipse = TRUE, circle = TRUE) +
  scale_colour_brewer(palette="RdBu") +
  theme(legend.direction = 'horizontal', legend.position = 'top')+
  theme_bw()

The ellipses are a variation of convex hulls that is less sensitive to outliers. The groups are not well separatated using this set of variables, except for Limestone.

ggbiplot(pca1, obs.scale = 1, var.scale = 1,
         groups = sample.prop$Packing, ellipse = TRUE, circle = TRUE) +
  scale_colour_brewer(palette="PuOr") +
  theme(legend.direction = 'horizontal', legend.position = 'top')+
  theme_bw()

Also the first two principal components explain merely ´r 31.7+19.1` of the total variance in the dataset.

normal <- ggscreeplot(pca1)
ggscreeplot(pca1, type="cev")+
  geom_line(aes(y=normal$data$yvar), col ="darkred")+
  geom_point(aes(y=normal$data$yvar), col ="darkred")

And we can use the loadings to interpret the first two components:

save(reduced3, file = "reduced3.Rdata")
save(sample.prop, file = "sample.prop.Rdata")
save(reduced3t, file = "reduced3t.Rdata")