Literature survey of the use of ordination analyses in geosciences

This document is a supplement to the article “Ordination analysis in sedimentology, geochemistry and paleoenvironment - background, current trends and recommendations” https://doi.org/10.31223/X5N31Q

Importing and cleaning the data

Here we are only importing the survey results, devoid of information allowing to identify the authors (names and DOIs).

articles <- read.csv(file="S3.Survey.csv", header=T)
articles$Number.of.data.points..np.<-as.numeric(articles$Number.of.data.points..np.)
articles$Number.of.groups<-as.numeric(articles$Number.of.groups)
articles$Number.of.variables..nv.<-as.numeric(articles$Number.of.variables..nv.)
articles$nv.np<-as.logical(articles$nv.np)
#We can also analyse the distribution of this relationship by creating a new variable
articles$nv_to_np<-articles$Number.of.variables..nv./articles$Number.of.data.points..np.

#Changing wariables coded as Yes/No to logical:
articles$Test.for.normality<-recode(articles$Test.for.normality, "No" = "FALSE", "Yes" = "TRUE", .default = NA_character_)
articles$Test.for.normality<-as.logical(articles$Test.for.normality)
articles$Groups.are.a.priori<-recode(articles$Groups.are.a.priori, "No" = "FALSE", "Yes" = "TRUE", .default = NA_character_)
articles$Groups.are.a.priori<-as.logical(articles$Groups.are.a.priori)
# First checking what categories we have in ordination methods
levels(as.factor(articles$Ordination.method))

## [1] "DCA"  "MNDS" "NMDS" "PCA"

Accidental typos are everywhere and it is always a good idea to screen for them. Here our factor variable would have four categories (referred to as levels) because NMDS is somewhere spelled as MNDS. Here is how this can be recoded:

articles$Ordination.method[which(articles$Ordination.method == "MNDS")] <- "NMDS"
articles$Ordination.method<-as.factor(articles$Ordination.method) #Now a factor with 3 levels

Data cleaning is not finished but it highlights some typical issues. This will slightly vary between packages, but R Software requires a careful decision on what TYPE of data each variable is. This enforces clean coding of categorical variables (see e.g. “MNDS” vs. “NMDS”) and decisions as to what data is missing vs. not applicable. Missing data is coded as NA and it is not the same as “measured but not applicable” such as distance measure would be in PCA. Let’s do some overview analyses of the data we have so far.

ggplot(articles, aes(x=Year.of.publication)) + 
  geom_histogram(color="black", fill="white")+
  scale_x_continuous(breaks=seq(from=min(articles$Year.of.publication),to=max(articles$Year.of.publication), by=2))+
  geom_vline(aes(xintercept=mean(Year.of.publication)),
            color="blue", linetype="dashed", size=1)+
  geom_vline(aes(xintercept=median(Year.of.publication)),
            color="red", linetype="dotted", size=1)+
  labs(x = "Year of publication")+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, hjust=0.95,vjust=0.2))

#ggsave("Fig.1.eps", width = 80, height = 40, units = "mm")
# All ggsave lines have been kept in RMarkdown to show how figures are exported.
# Anyone wanting to reproduce our figures directly in R and save them should
# remove the comment (#) sign from the line start.

The blue line indicates the mean (2011.5689655) and the red line - the median (2013).

ggplot(articles, aes(x=Year.of.publication)) + 
  geom_histogram(color="black", fill="white")+
  scale_x_continuous(breaks=seq(from=min(articles$Year.of.publication),to=max(articles$Year.of.publication), by=2))+
  labs(x = "Year of publication")+
  facet_wrap(~Journal.type)+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, hjust=0.95,vjust=0.2))

#ggsave("trend_by_journal.eps", width = 80, height = 40, units = "mm")

Puroposes of ordination

articles$Ordination.use.objective<-as.factor(articles$Ordination.use.objective)
ggplot(data = articles, aes(x = reorder(Ordination.use.objective, Ordination.use.objective, function(x)-length(x)))) +
geom_bar(stat = "count", fill="lightblue") + 
stat_count(geom = "text", colour = "black", size = 3.5,
aes(label = ..count..),position=position_stack(vjust=0.5))+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, hjust=0.95,vjust=0.2), axis.title.y=element_blank())+
  coord_flip()

#ggsave("Fig.2.eps", width = 112, height = 70, units = "mm")

What was analysed (types of variables)

Types of variables is a nominal (also referred to as categorical) variable, here coded as an unordered factor. First we need to check what levels of the factor we find in the dataset.

articles$Types.of.data <- as.factor(articles$Types.of.data)
levels(articles$Types.of.data)

##  [1] "Ages"                                            
##  [2] "Ages, Minerology"                                
##  [3] "Angles, Ranks"                                   
##  [4] "Continuous geomorphological measurements"        
##  [5] "Crystallography"                                 
##  [6] "Crystallography, Isotopes"                       
##  [7] "DNA assemblages"                                 
##  [8] "Elemental chemistry"                             
##  [9] "Elemental chemistry, Climate data"               
## [10] "Elemental chemistry, Grain size bins"            
## [11] "Elemental chemistry, Isotopes"                   
## [12] "Elemental chemistry, Isotopes, Organic molecules"
## [13] "Elemental chemistry, Mineralogy"                 
## [14] "Elemental chemistry, Organic molecules "         
## [15] "Elemental chemistry, Physical properties"        
## [16] "Fossil assemblage"                               
## [17] "Fossil assemblage, Climate data"                 
## [18] "Fossil assemblage, Elemental chemistry"          
## [19] "Grain size bins"                                 
## [20] "Grain size statistics"                           
## [21] "Grain type"                                      
## [22] "Isotopes"                                        
## [23] "Isotopes, Climate data"                          
## [24] "Luminescence"                                    
## [25] "Magnetic properties "                            
## [26] "Mineralogy"                                      
## [27] "Mixed"                                           
## [28] "Morphometrics"                                   
## [29] "Organic molecules "                              
## [30] "Organic molecules, Physical properties"          
## [31] "Physical properties"                             
## [32] "Physical properties, Isotopes"                   
## [33] "Remote sensing spectral data"                    
## [34] "Satellite images"                                
## [35] "Spectra"                                         
## [36] "Taphonomy"                                       
## [37] "Values of Fourier harmonics"                     
## [38] "Wavenumbers "

Here again we might want to correct a typo (“Minerology”) and group categories which occur only once into a wastebasket level “Other”.

levels(articles$Types.of.data)[levels(articles$Types.of.data)=="Ages, Minerology"] <- "Ages, Mineralogy"

rare.types <- rownames(table(articles$Types.of.data)[which(table(articles$Types.of.data)<2)])
articles$Types_selected<-articles$Types.of.data
levels(articles$Types_selected) <- c(levels(articles$Types_selected),"Other")
articles$Types_selected[which(articles$Types_selected %in% rare.types)] <- "Other"

ggplot(data = articles, aes(x = reorder(Types_selected, Types_selected, function(x)-length(x)))) +
geom_bar(stat = "count", colour="black") + 
stat_count(geom = "text", colour = "gray", size = 3.5,
aes(label = ..count..),position=position_stack(vjust=0.5))+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, hjust=0.95,vjust=0.2))+
  labs(x = "Types of data")+
  coord_flip()

#ggsave("Fig.3.eps", width = 112, height = 90, units = "mm")

Compositional data

ggplot(data = articles, aes(x = reorder(Compositional.data, Compositional.data, function(x)-length(x)))) +
geom_bar(stat = "count", fill="lightblue") + 
stat_count(geom = "text", colour = "black", size = 3.5,
aes(label = ..count..),position=position_stack(vjust=0.5))+
  theme_bw()+
  labs(x = "Frequency of compositional datasets")

Type of ordination analysis

If we want to see types of ordination analyses by journal, we need to pool together journals that appear in our survey only once, because otherwise the list would be too long:

articles$Journal<-as.factor(articles$Journal)
articles$Journal.type<-as.factor(articles$Journal.type)
rare.journals <- rownames(table(articles$Journal)[which(table(articles$Journal)<2)])
articles$Journal_selected<-articles$Journal
levels(articles$Journal_selected) <- c(levels(articles$Journal_selected), "Other")
articles$Journal_selected[which(articles$Journal_selected %in% rare.journals)] <- "Other"

After that we can see the use of analysis types by journals, together with frequencies of journals covered by the survey:

ggplot(data = articles, aes(x = reorder(Journal_selected, Journal_selected, function(x)-length(x)))) +
  geom_bar(aes(fill = factor(Ordination.method)))+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, hjust=0.95,vjust=0.2))+
  labs(x = "Frequencies of ordination analysis types")

This is still quite long so we can make a breakdown by journal type:

ggplot(data = articles, aes(x = reorder(Journal.type, Journal.type, function(x)-length(x)))) +
  geom_bar(aes(fill = factor(Ordination.method)), color="black")+
  theme_bw()+
  scale_x_discrete(labels = wrap_format(10))+
  scale_fill_brewer(palette="RdYlBu")+
  labs(x = "Type of journal")+
  theme(legend.position = c(0.85, 0.75))+
  theme(legend.title = element_blank())

#ggsave("Fig.3.eps", width = 112, height = 80, units = "mm")

Dataset dimensions

There should be more observations than variables in the analysis ideally. This threshold corresponds to the blue line at the histogram below. We still have 24 studies where there are more variables than observations. This is 0.138 of all studies in our compilation.

ggplot(data=articles, aes(x=nv_to_np)) + 
  geom_histogram() + 
  labs(x = "Number of variables to Number of observations")+
  geom_vline(aes(xintercept=1),
            color="blue", size=1)+
  theme_bw()+
  theme(axis.title=element_text(size=8))

#ggsave("Fig.4.eps", width = 80, height = 70, units = "mm")

Similarity indices

For studies using NMDS, similarity index should also be reported.

ggplot(data = articles[which(articles$Ordination.method=="NMDS"),], aes(x = Distance.metric, fill=Distance.metric)) +
geom_bar(stat = "count") + 
stat_count(geom = "text", colour = "black", size = 3.5,
aes(label = ..count..),position=position_stack(vjust=0.5))

Software

Software packages are also an unordered factor. These are allsoftware packages as in the compilation:

ggplot(data = articles, aes(x = reorder(Software.package, Software.package, function(x)-length(x))))+
  geom_bar(stat = "count", fill="lightblue")+
stat_count(geom = "text", colour = "black", size = 3.5,
aes(label = ..count..),position=position_stack(vjust=0.5))+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, hjust=0.95,vjust=0.2))+
  labs(x = "Software package")

Again some cleaning is necessary owing to differences in spelling and the fact that we have “R, PAST” as well as “PAST, R”.

articles$Software.package[which(articles$Software.package == "CANOCO ")] <- "CANOCO"
articles$Software.package[which(articles$Software.package == "MVSP ")] <- "MVSP"
articles$Software.package[which(articles$Software.package == "STATISTICA ")] <- "STATISTICA"
articles$Software.package[which(articles$Software.package == "R, PAST")] <- "PAST, R"
ggplot(data = articles, aes(x = reorder(Software.package, Software.package, function(x)-length(x)))) +
geom_bar(stat = "count", fill="lightblue") + 
stat_count(geom = "text", colour = "black", size = 3.5,
aes(label = ..count..),position=position_stack(vjust=0.5))+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, hjust=0.95,vjust=0.2))+
  labs(x = "Distribution of software used")

Now the data is cleaned we can collect the counts and pool all packages that appear only once on our list as “Other”. This unfortunately includes also combinations such as “R, PAST”.

rare.software <- rownames(table(articles$Software.package)[which(table(articles$Software.package)<2)])
articles$Soft_selected<-articles$Software.package
articles$Soft_selected[which(articles$Soft_selected %in% rare.software)] <- "Other"
ggplot(data = articles, aes(x = reorder(Soft_selected, Soft_selected, function(x)-length(x)))) +
geom_bar(stat = "count", colour="black") + 
stat_count(geom = "text", colour = "gray", size = 3.5,
aes(label = ..count..),position=position_stack(vjust=0.5))+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, hjust=0.95,vjust=0.2))+
  labs(x = "Software package")

#ggsave("Fig.5.eps", width = 170, height = 80, units = "mm")

Data handling

See explanations of the data handling scores in the article body and in the dataset descriptions.

ggplot(data = articles, aes(x = Data.handling.score)) +
geom_bar(stat = "count", fill="lightblue", colour="black") + 
stat_count(geom = "text", colour = "black", size = 3.5,
aes(label = ..count..),position=position_stack(vjust=0.5))+
  theme_bw()+
  labs(x = "Data handling score")

#ggsave("Fig.6.eps", width = 80, height = 80, units = "mm")

Now we can check whether data handling is improving over time:

score_by_year<-aggregate(x = articles$Data.handling.score, by = list(articles$Year.of.publication), FUN = median)  

ggplot(articles) + 
  geom_bar(aes(x = factor(Year.of.publication), fill = factor(Data.handling.score)), colour="black")+
  theme_bw()+
  scale_fill_brewer(palette="RdYlBu")+
  labs(x = "Year of publication")+
  theme(legend.position = c(0.2, 0.65))+
  theme(axis.text.x = element_text(angle = 90, hjust=0.95,vjust=0.2))

#ggsave("Fig.7.eps", width = 170, height = 80, units = "mm")

And how does it depend on the journal. We single out all the journals that appear only once and categorize them as “Other”.

ggplot(data = articles, aes(x = reorder(Journal_selected, Journal_selected, function(x)-length(x)))) +
  geom_bar(aes(fill = factor(Data.handling.score)))+
  theme_bw()+
  theme(axis.text.x = element_text(angle = 90, hjust=0.95,vjust=0.2))

ggplot(data = articles, aes(x = reorder(Journal.type, Journal.type, function(x)-length(x)))) +
  geom_bar(aes(fill = factor(Data.handling.score)), colour="black")+
  theme_bw()+
  scale_fill_brewer(palette="RdYlBu")+
  theme(legend.position="none")+
  labs(x = "Type of journal")+
  scale_x_discrete(labels = wrap_format(10))

#ggsave("Fig.8.eps", width = 112, height = 80, units = "mm")

Code availability

Now we can also check when code was available. If the software was code-based (“R”,“MatLab”,“SAS”, “SPSS”) but there was no mention of the code, it was coded as absent (FALSE).

#Changing wariables coded as Yes/No to logical:
articles$Code.avilable<-recode(articles$Code.avilable, "No" = "FALSE", "Yes" = "TRUE", .default = NA_character_)
articles$Code.avilable<-as.logical(articles$Code.avilable)
#We create a variable listing those packages where code would make sense to be published
code.relevant <- c("R","MatLab","SAS","PAST, R", "SPSS") 
articles$Code.avilable[which((articles$Software.package %in% code.relevant) & (is.na(articles$Code.avilable)==TRUE))]<-FALSE
ggplot(data = articles[which(articles$Software.package %in% code.relevant),], aes(x = Code.avilable, fill=Code.avilable)) +
geom_bar(stat = "count") + 
stat_count(geom = "text", colour = "black", size = 3.5,
aes(label = ..count..),position=position_stack(vjust=0.5))

That is a rather sad graph.

Test for normality

ggplot(data = articles[which(articles$Ordination.method=="PCA"),], aes(x = Test.for.normality, fill=Test.for.normality)) +
geom_bar(stat = "count") + 
stat_count(geom = "text", colour = "black", size = 3.5,
aes(label = ..count..),position=position_stack(vjust=0.5))