This document is a supplement to the article “Ordination analysis in sedimentology, geochemistry and paleoenvironment - background, current trends and recommendations” https://doi.org/10.31223/X5N31Q
One of the common trivial causes for errors is that variable classes are not correctly assigned. Categorical variables should have the class “factor”, numeric variables should be “numeric” etc. If the original dataset contains text annotations in variables that are otherwise meant to be numeric (e.g. “not measured” or “amount too small” or “<0.01”), the entire variable will be imported as character and the calculations won’t work. This problem also applies to Excel spreadsheets, only variable format cannot be controlled in a reproducible, platform-independent way. In R the best practice is to explicitly state the format of the variable at import - see the example below.
The dataset is first cleaned from columns containing redundant variables, e.g. standard deviations etc. and then of rows containing empty cells. Two categorical variables (Lithological description and packing) are relegated to a separate list describing sample properties.
original <- read.csv(file="S4._Dataset_Bialik_et_al_2018.csv", header=T, colClasses=c("character",rep("factor",2), rep("numeric",20))) #stating variable format
rownames(original)<-original$Sample.ID #the first column is used as row names
original<-original[,-1] #and the dropped from the dataset
colnames(original)<-c("Lithological descripton","Packing", "Calcite (%)", "MgCO3 in dolomite (%mol)", "Sr(ppm)", "IC (%Wt)", "Ti(ppm)", "Zr(ppm)", "Mn(ppm)", "V/Cr", "Ni/Co", "Ce/Ce*", "d13C (VPDB)", "1σ", "d18O (VPDB)", "1σ", "87Sr/86Sr Initial Value", "1 SE", "d26MgDSM3", "2σ", "d25MgDSM3", "2σ")
Turns out some R packages have difficulty with Greek letters in variable names so we use d instead of δ. Also it turns out there are only 6 cases that are complete in all variables. This is largely driven by two variables: δ25MgDSM3 and IC (%Wt) that have complementary measurement patterns so let’s try to visualise how they are distributed.
reduced <- original[,-c(14,16,18,20,22)] #dropping SDs
completeCases <- reduced[complete.cases(reduced), ]
pheatmap(reduced[,3:17], cluster_rows=FALSE, cluster_cols=FALSE)
# Keeping only rows without empty cells
As the variables are not normalized, the heat map doesn’t tell us anything about their relationships (yet) but it shows empty cells. Let’s take δ25MgDSM3 as the main offender and kick it out.
reduced2 <- reduced[,-c(16:17)]
pheatmap(reduced2[,3:15], cluster_rows=FALSE, cluster_cols=FALSE)
This looks much better, but if we now exclude rows containing NAs, we will end up with 14 rows only (14 x 15 matrix), whereas if we kick out IC (%Wt), we will have the following dimensions:
reduced3 <- reduced2[,-6]
dim(reduced3[complete.cases(reduced3),])
## [1] 61 14
We keep this subset for further analysis. The first two variables are categorical and will be used as sample properties to identify groups in ordination plots, but not used in the ordination itself. We can immediately visualise how the variables are distributed between the lithological classes.
reduced3<-reduced3[complete.cases(reduced3),]
sample.prop<-reduced3[,1:2]
reduced3<-reduced3[,3:14]
pairs(reduced3,
lower.panel = NULL,
col = brewer.pal(9, name="RdBu")[sample.prop$`Lithological descripton`])
par(xpd = TRUE)
legend("bottomleft", fill = brewer.pal(9, name="RdBu"), legend = c( levels(sample.prop$`Lithological descripton`)))
Here we see some scientific insights, e.g. that marls seems to be an outlier with respect to most variables, but we also see another common causes for analysis complications: the same levels of the Lithological description factor are spelled in two different was and thus treated as different. We can also order the levels of the factor to impose color coding (similar lithologies will be plotted in similar colors).
levels(sample.prop$`Lithological descripton`)[levels(sample.prop$`Lithological descripton`)=="dolomite"] = "Dolomite"
levels(sample.prop$`Lithological descripton`)[levels(sample.prop$`Lithological descripton`)=="Dolomitic Limestone"] = "Dolomitic limestone"
levels(sample.prop$`Lithological descripton`)[levels(sample.prop$`Lithological descripton`)=="marl"] = "Marl" #For consistency
sample.prop$`Lithological descripton` <- factor(sample.prop$`Lithological descripton`, levels=c("Marl","Marly limestone","Limestone","Dolomitic limestone","Dolomite","Marly dolomite","Dolomitic marl"))
Hopefully the data overview will be more legible now:
We can do the same for the Embry & Klovan categories:
First we need to see how the variables are distributed.
It’s as bad as it gets! If we run tests for multivariate and univariate normality, we will see that none of the variables neither the dataset as a whole is normally distributed. Let us only display Royston’s test here, but other tests are available in this package.
mvn(reduced3, mvnTest = "royston")
## $multivariateNormality
## Test H p value MVN
## 1 Royston 223.7432 3.459291e-42 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk Calcite (%) 0.5666 <0.001 NO
## 2 Shapiro-Wilk MgCO3 in dolomite (%mol) 0.8747 <0.001 NO
## 3 Shapiro-Wilk Sr(ppm) 0.9172 5e-04 NO
## 4 Shapiro-Wilk Ti(ppm) 0.9583 0.0363 NO
## 5 Shapiro-Wilk Zr(ppm) 0.9048 2e-04 NO
## 6 Shapiro-Wilk Mn(ppm) 0.8511 <0.001 NO
## 7 Shapiro-Wilk V/Cr 0.9571 0.0319 NO
## 8 Shapiro-Wilk Ni/Co 0.2006 <0.001 NO
## 9 Shapiro-Wilk Ce/Ce* 0.8332 <0.001 NO
## 10 Shapiro-Wilk d13C (VPDB) 0.9161 5e-04 NO
## 11 Shapiro-Wilk d18O (VPDB) 0.9235 0.001 NO
## 12 Shapiro-Wilk 87Sr/86Sr Initial Value 0.9261 0.0012 NO
##
## $Descriptives
## n Mean Std.Dev Median Min
## Calcite (%) 61 0.1424590 2.832529e-01 0.000000 0.000000
## MgCO3 in dolomite (%mol) 61 49.9942623 2.294670e+00 50.430000 44.470000
## Sr(ppm) 61 218.7191803 9.083095e+01 197.890000 107.990000
## Ti(ppm) 61 782.9603279 3.755812e+02 741.790000 86.280000
## Zr(ppm) 61 37.4554098 2.831086e+01 29.280000 0.830000
## Mn(ppm) 61 590.8063934 3.536703e+02 523.030000 144.460000
## V/Cr 61 2.4416393 1.138201e+00 2.500000 0.300000
## Ni/Co 61 11.3395082 2.455367e+01 7.180000 3.320000
## Ce/Ce* 61 1.1780328 2.847269e-01 1.290000 0.120000
## d13C (VPDB) 61 0.8752459 1.447708e+00 0.840000 -5.330000
## d18O (VPDB) 61 -0.4129508 2.017472e+00 -0.380000 -6.430000
## 87Sr/86Sr Initial Value 61 0.7072992 4.195248e-04 0.707318 0.705767
## Max 25th 75th Skew
## Calcite (%) 0.970000 0.000000 0.080000 1.8497380
## MgCO3 in dolomite (%mol) 53.470000 48.000000 51.530000 0.1233786
## Sr(ppm) 486.120000 138.560000 282.390000 0.6929325
## Ti(ppm) 2048.990000 584.840000 975.780000 0.5684699
## Zr(ppm) 131.300000 19.630000 49.290000 1.1268288
## Mn(ppm) 2266.560000 363.950000 748.390000 1.9219183
## V/Cr 5.560000 1.580000 3.410000 0.2472945
## Ni/Co 197.870000 6.010000 10.280000 7.1848876
## Ce/Ce* 1.530000 1.050000 1.360000 -1.6189409
## d13C (VPDB) 3.130000 0.140000 1.890000 -1.1681827
## d18O (VPDB) 4.450000 -1.040000 0.910000 -0.7078789
## 87Sr/86Sr Initial Value 0.708469 0.707126 0.707512 -0.5893981
## Kurtosis
## Calcite (%) 1.9943943
## MgCO3 in dolomite (%mol) -1.1520514
## Sr(ppm) -0.4073033
## Ti(ppm) 1.2183028
## Zr(ppm) 0.9726103
## Mn(ppm) 6.2905269
## V/Cr -0.7638637
## Ni/Co 51.6058952
## Ce/Ce* 2.5741393
## d13C (VPDB) 3.5592559
## d18O (VPDB) 1.5210377
## 87Sr/86Sr Initial Value 2.7959107
reduced3t <- reduced3
# Positively skewed variables
reduced3t$`Calcite (%)`<-(reduced3$`Calcite (%)`)^(1/3)
reduced3t$`Sr(ppm)`<-sqrt(reduced3$`Sr(ppm)`)
reduced3t$`Zr(ppm)`<-sqrt(reduced3$`Zr(ppm)`)
reduced3t$`Mn(ppm)`<-sqrt(reduced3$`Mn(ppm)`)
reduced3t$`Ni/Co`<-(reduced3$`Ni/Co`)^(1/3)
par(mfrow = c(3,4))
for (i in c(1,3,5,6,8)){
plot(density(reduced3[,i]), col="blue", main = colnames(reduced3)[i], xlab='')
plot(density(reduced3t[,i]), col="red", main = c(colnames(reduced3)[i], "transformed"), xlab='')
}
Did it improve test results?
mvn(reduced3t, mvnTest = "royston")
## $multivariateNormality
## Test H p value MVN
## 1 Royston 166.3988 1.672763e-30 NO
##
## $univariateNormality
## Test Variable Statistic p value Normality
## 1 Shapiro-Wilk Calcite (%) 0.6707 <0.001 NO
## 2 Shapiro-Wilk MgCO3 in dolomite (%mol) 0.8747 <0.001 NO
## 3 Shapiro-Wilk Sr(ppm) 0.9390 0.0045 NO
## 4 Shapiro-Wilk Ti(ppm) 0.9583 0.0363 NO
## 5 Shapiro-Wilk Zr(ppm) 0.9839 0.6044 YES
## 6 Shapiro-Wilk Mn(ppm) 0.9589 0.0389 NO
## 7 Shapiro-Wilk V/Cr 0.9571 0.0319 NO
## 8 Shapiro-Wilk Ni/Co 0.5075 <0.001 NO
## 9 Shapiro-Wilk Ce/Ce* 0.8332 <0.001 NO
## 10 Shapiro-Wilk d13C (VPDB) 0.9161 5e-04 NO
## 11 Shapiro-Wilk d18O (VPDB) 0.9235 0.001 NO
## 12 Shapiro-Wilk 87Sr/86Sr Initial Value 0.9261 0.0012 NO
##
## $Descriptives
## n Mean Std.Dev Median Min
## Calcite (%) 61 0.2276147 3.507496e-01 0.000000 0.0000000
## MgCO3 in dolomite (%mol) 61 49.9942623 2.294670e+00 50.430000 44.4700000
## Sr(ppm) 61 14.4884418 2.991818e+00 14.067338 10.3918237
## Ti(ppm) 61 782.9603279 3.755812e+02 741.790000 86.2800000
## Zr(ppm) 61 5.6584663 2.351125e+00 5.411100 0.9110434
## Mn(ppm) 61 23.3885394 6.671756e+00 22.869849 12.0191514
## V/Cr 61 2.4416393 1.138201e+00 2.500000 0.3000000
## Ni/Co 61 2.0471333 5.565037e-01 1.929189 1.4918072
## Ce/Ce* 61 1.1780328 2.847269e-01 1.290000 0.1200000
## d13C (VPDB) 61 0.8752459 1.447708e+00 0.840000 -5.3300000
## d18O (VPDB) 61 -0.4129508 2.017472e+00 -0.380000 -6.4300000
## 87Sr/86Sr Initial Value 61 0.7072992 4.195248e-04 0.707318 0.7057670
## Max 25th 75th Skew
## Calcite (%) 0.9898983 0.000000 0.4308869 1.1132226
## MgCO3 in dolomite (%mol) 53.4700000 48.000000 51.5300000 0.1233786
## Sr(ppm) 22.0481292 11.771151 16.8044637 0.4095604
## Ti(ppm) 2048.9900000 584.840000 975.7800000 0.5684699
## Zr(ppm) 11.4586212 4.430576 7.0206837 0.1297347
## Mn(ppm) 47.6084026 19.077474 27.3567176 0.7440215
## V/Cr 5.5600000 1.580000 3.4100000 0.2472945
## Ni/Co 5.8272008 1.818130 2.1743579 5.1621686
## Ce/Ce* 1.5300000 1.050000 1.3600000 -1.6189409
## d13C (VPDB) 3.1300000 0.140000 1.8900000 -1.1681827
## d18O (VPDB) 4.4500000 -1.040000 0.9100000 -0.7078789
## 87Sr/86Sr Initial Value 0.7084690 0.707126 0.7075120 -0.5893981
## Kurtosis
## Calcite (%) -0.4157119
## MgCO3 in dolomite (%mol) -1.1520514
## Sr(ppm) -0.9489085
## Ti(ppm) 1.2183028
## Zr(ppm) -0.2702613
## Mn(ppm) 1.2785022
## V/Cr -0.7638637
## Ni/Co 32.1130943
## Ce/Ce* 2.5741393
## d13C (VPDB) 3.5592559
## d18O (VPDB) 1.5210377
## 87Sr/86Sr Initial Value 2.7959107
The data does not follow a multinormal distribution and for only one variable (Zr content) univariate normality cannot be rejected. However, we’ve reduced the skewnesses of these variables, which can lead to artefacts in PCA. Skewness in the original variables:
skewness(reduced3[,c(1,3,5,6,8)])
## Calcite (%) Sr(ppm) Zr(ppm) Mn(ppm) Ni/Co
## 1.8961736 0.7103278 1.1551166 1.9701659 7.3652561
Skewness in the transformed variables:
skewness(reduced3t[,c(1,3,5,6,8)])
## Calcite (%) Sr(ppm) Zr(ppm) Mn(ppm) Ni/Co
## 1.1411688 0.4198420 0.1329915 0.7626993 5.2917590
Finally we can run PCA on this dataset but it doesn’t look great…
pca1<-princomp(reduced3t, cor = T, scores = T)
ggbiplot(pca1, obs.scale = 1, var.scale = 1,
groups = sample.prop$`Lithological descripton`, ellipse = TRUE, circle = TRUE) +
scale_colour_brewer(palette="RdBu") +
theme(legend.direction = 'horizontal', legend.position = 'top')+
theme_bw()
The ellipses are a variation of convex hulls that is less sensitive to outliers. The groups are not well separatated using this set of variables, except for Limestone.
ggbiplot(pca1, obs.scale = 1, var.scale = 1,
groups = sample.prop$Packing, ellipse = TRUE, circle = TRUE) +
scale_colour_brewer(palette="PuOr") +
theme(legend.direction = 'horizontal', legend.position = 'top')+
theme_bw()
Also the first two principal components explain merely ´r 31.7+19.1` of the total variance in the dataset.
normal <- ggscreeplot(pca1)
ggscreeplot(pca1, type="cev")+
geom_line(aes(y=normal$data$yvar), col ="darkred")+
geom_point(aes(y=normal$data$yvar), col ="darkred")
And we can use the loadings to interpret the first two components:
save(reduced3, file = "reduced3.Rdata")
save(sample.prop, file = "sample.prop.Rdata")
save(reduced3t, file = "reduced3t.Rdata")