The development of methods to detect how a species’ occurrence probability covaries with the occurrences of other species (e.g. co-occurrence) is a booming and somewhat divisive area of statistical ecology. Many authors (including, sadly, myself) have jumped the gun when interpreting co-occurrence patterns by grandiosely stating that they reflect ‘biotic interactions’. However, as numerous papers have recently pointed out (including Harris 2015 and Dormann et al 2018), non-random co-occurrences could be an important precursor for direct biotic interactions, such as competition or facilitation, but they may also merely be proxies for untested environmental variation. In other words, these methods could tell us a lot about how species’ fundamental niches are similar or dissimilar, but they will not necessarily reflect direct biotic interactions between pairs of species.
This possible discrepancy between evidence of biotic interactions and co-occurrence patterns was prominently pointed out in a recent study published in Ecology by Barner et al. The authors used binary presence-absence data (the primary data structure needed for co-occurrence analyses), calculated co-occurrence probabilities using a broad range of different methods and then compared these to interactions that were identified empirically (based on a literature search). Their results were worrying: none of the nine tested methods performed very well when identifying interactions. This finding has big implications in community ecology, and was hailed by Timothée Poisot as ‘One of the most important papers for community ecology in recent memory’.
Although the authors should be commended for their work (they made their datasets freely available here), I can’t help wondering whether methods that adequately account for indirect associations would perform slightly better. Detecting co-occurrences has primarily been restricted to methods incorporating unrealistic null model randomisations or simple two-species correlations, which have been shown time and again to yield poor results on multivariate community datasets (see David Harris’ nice explanation here). It is hard to know exactly if any of the methods used by Barner et al adequately account for indirect associations, as they did not post their actual step-by-step analysis code.
Here, I will compare some of the more popular co-occurrence methods against likelihood-based methods that do account for indirect associations using the datasets provided by Barner et al. The aim of this is to showcase how different co-occurrence analyses work, not to contradict or question the results put forward by the authors. The workhorse function to fit these analyses is shown in full at the end of this post.
First, we load all of the freely available datasets. I have downloaded these and stored them in a Intertidal/Raw_data directory
# Environmental data
env.data <- read.csv("Intertidal/Raw_data/environmental_variables.csv",
as.is = T)
# Community data (5x5m)
comm.data.5.5 <- read.csv("Intertidal/Raw_data/community_survey_5x5.csv",
as.is = T)
# Community data (10x10m)
comm.data.10.10 <- read.csv("Intertidal/Raw_data/community_survey_10x10.csv",
as.is = T)
# Community data (15x15m)
comm.data.15.15 <- read.csv("Intertidal/Raw_data/community_survey_15x15.csv",
as.is = T)
# Community data (20x20m)
comm.data.20.20 <- read.csv("Intertidal/Raw_data/community_survey_20x20.csv",
as.is = T)
# Community data (25x25m)
comm.data.25.25 <- read.csv("Intertidal/Raw_data/community_survey_25x25.csv",
as.is = T)
# Taxon names and 'true' interactions
taxon.names <- read.csv("Intertidal/Raw_data/taxon_names_key.csv")
true.ints <- read.csv("Intertidal/Raw_data/experimental_interactions.csv",
as.is = T)
We should check out the dimesions of each of these datasets
dim(comm.data.5.5)
## [1] 3325 197
dim(comm.data.10.10)
## [1] 2128 197
dim(comm.data.15.15)
## [1] 1197 197
dim(comm.data.20.20)
## [1] 532 197
dim(comm.data.25.25)
## [1] 133 196
As you can see above, the 25 x 25m plot dataset only has 133 observations, which is not much for detecting co-occurrences. This is especially true when we consider the prevalences of species in each dataset. Here we visualise the prevelance histogram for the largest (5 x 5m) and smallest (25 x 25m) datasets. Note that site-level information columns are removed here
hist(colSums(comm.data.5.5[, 3:ncol(comm.data.5.5)])/nrow(comm.data.5.5[,
3:ncol(comm.data.5.5)]), main = "", xlab = "",
breaks = seq(0, 1, 0.025))
hist(colSums(comm.data.25.25[, 2:ncol(comm.data.25.25)])/nrow(comm.data.25.25[,
2:ncol(comm.data.25.25)]), main = "",
xlab = "", breaks = seq(0, 1, 0.025))
Even when aggregating occurrences at the larger plot level, most species have very low frequencies of occurrence. This should be a warning flag, especially for interpreting co-occurrences from a dataset with only 133 observations. But the authors should not be judged harshly on this point, as they are using exactly the type of data that many, many (many) papers are using to discover biotic ‘interactions’!!
Now we are ready for some analysis. A very key point of difference here is that I’m only including species that have empirical interaction data in my analyses. Barner et al used the full suite of species (including separate columns for adults and juveniles), which is considerably larger. But I’m hoping to make examples that are of reasonable size and computational complexity to facilitate their use by people that are learning about the methods. Here, I create a vector taxa.to.test that includes these key species, which is used by the function co_occur_test to filter out the datasets
taxa.to.test <- unique(c(true.ints$sp_i_name,
true.ints$sp_j_name))
Load the function (again, this is shown in full below) and run the analyses. This function will five different co-occurrence analyses. First, we test Martin Westgate’s Odds-Ratio analysis, which calculates the odds that species 1 will be present given the presence-absence of species 2 (e.g., a pairwise test only). Next, we use the popular null model randomisation test to count frequencies of co-occurrences for each pair and compare those with ‘expected’ co-occurrences from shuffled null matrices. I’ve followed code provided by Barner et al, except that I’ve eliminated some nested for-loops to speed up the analysis. The third test is a Pearson correlation analysis, keeping only those correlations that are found to be ‘significant’. After fitting these three methods that all ignore indirect associations, I test a few that do account for these processes. The first is an approximate Markov Random Fields graph, using functions in my own package MRFcov, which is described in detail in Clark et al 2018. The next is a regularized Partial Correlations analysis where we take the covariance matrix as Barner et al did, but we penalize this matrix using ridge regression to regularize unsupported partial correlations to zero.
I perform two tests overall, first by using the raw data that is filtered to only include key species. Because many of these species have very low prevalence in all datasets, MRFcov may reduce to fitting intercept-only models (this is because the package uses cross-validation to optimise fits, which is difficult or impossible to do when species are very rare). For the first test, I use all species, but I remove species occurring in less than 3 percent of observations for the second test. For each test, models are performed and the proportion of ‘true’ interactions that each analysis identifies is returned. Note, these analyses can take some time, depending on your machine. I’ve elected to run them using 10 processing cores on a supercomputer, which took ~ 10 minutes to complete
source("Intertidal/co_occur_test.R")
# All species included
co.occur.5.5 <- co_occur_test(comm.data = comm.data.5.5,
n.cores = 10)
co.occur.10.10 <- co_occur_test(comm.data = comm.data.10.10,
n.cores = 10)
co.occur.15.15 <- co_occur_test(comm.data = comm.data.15.15,
n.cores = 10)
co.occur.20.20 <- co_occur_test(comm.data = comm.data.20.20,
n.cores = 10)
co.occur.25.25 <- co_occur_test(comm.data = comm.data.25.25,
n.cores = 10)
# Removing rare species
co.occur.5.5.filtered <- co_occur_test(comm.data = comm.data.5.5,
prevalence.cutoff = 0.03, n.cores = 10)
co.occur.10.10.filtered <- co_occur_test(comm.data = comm.data.10.10,
prevalence.cutoff = 0.03, n.cores = 10)
co.occur.15.15.filtered <- co_occur_test(comm.data = comm.data.15.15,
prevalence.cutoff = 0.03, n.cores = 10)
co.occur.20.20.filtered <- co_occur_test(comm.data = comm.data.20.20,
prevalence.cutoff = 0.03, n.cores = 10)
co.occur.25.25.filtered <- co_occur_test(comm.data = comm.data.25.25,
prevalence.cutoff = 0.03, n.cores = 10)
Once run, I gather data on the true predictions for each test for plotting
plot.size <- c(rep(5, 5), rep(10, 5), rep(15,
5), rep(20, 5), rep(25, 5))
model <- rep(c("OR", "Null Randomisation",
"Pearson Correlation", "MRF", "Partial Correlation"),
5)
metrics <- c(unlist(purrr::map(co.occur.5.5,
"prop.correct")), unlist(purrr::map(co.occur.10.10,
"prop.correct")), unlist(purrr::map(co.occur.15.15,
"prop.correct")), unlist(purrr::map(co.occur.20.20,
"prop.correct")), unlist(purrr::map(co.occur.25.25,
"prop.correct")))
plot.data <- data.frame(plot.size = plot.size,
model = model, prop.correct = metrics *
100)
Repeat for the filtered dataset tests
plot.size <- c(rep(5, 5), rep(10, 5), rep(15,
5), rep(20, 5), rep(25, 5))
model <- rep(c("OR", "Null Randomisation",
"Pearson Correlation", "MRF", "Partial Correlation"),
5)
metrics <- c(unlist(purrr::map(co.occur.5.5.filtered,
"prop.correct")), unlist(purrr::map(co.occur.10.10.filtered,
"prop.correct")), unlist(purrr::map(co.occur.15.15.filtered,
"prop.correct")), unlist(purrr::map(co.occur.20.20.filtered,
"prop.correct")), unlist(purrr::map(co.occur.25.25.filtered,
"prop.correct")))
plot.data.filtered <- data.frame(plot.size = plot.size,
model = model, prop.correct = metrics *
100)
Now let’s plot the results. First, the analysis that used the filtered dataset. I also add labels for each plot size to show the number of interactions that were compared (predicted vs truth) for each model
library(ggplot2)
n.interactions.filtered <- c(co.occur.5.5.filtered$num.interactions.tested,
co.occur.10.10.filtered$num.interactions.tested,
co.occur.15.15.filtered$num.interactions.tested,
co.occur.20.20.filtered$num.interactions.tested,
co.occur.25.25.filtered$num.interactions.tested)
ggplot(data = plot.data.filtered, aes(x = plot.size,
y = prop.correct, colour = model)) +
theme_bw() + geom_point() + geom_line() +
annotate("text", x = unique(plot.data.filtered$plot.size),
y = min(plot.data.filtered$prop.correct) -
0.025, label = n.interactions.filtered,
size = 3) + labs(y = "Correctly predicted interactions\n (%)",
x = "Plot size\n (meters)", title = "Including only species with >3% prevalence")
The analyses that account for indirect associations (MRF and Partial Correlations) generally perform better across the board until we get to the 25 x 25m plot level. This is not surprising to me, as MRFcov is a bit data-hungry (it uses cross-validation to optimise the graph and needs more observations when the number of species is this large). MRFcov also uses LASSO penalty, as opposed to the ridge penalty used by the Partial Correlations. LASSO is more conservative generally, which makes for a sparser graph. Interestingly, Pearson correlations do well when the number of observations is low (larger plots). Of course, we should not ignore that the number of tested interactions is fairly small for the smaller plot levels, as we have removed all species that have less than 3 percent prevalence (this removes most species from the 5 x 5m data). Nevertheless, it is encouraging that over 40 percent of ‘true’ interactions can be isolated using nothing but presence-absence data
Repeat for the larger datasets that were not filtered to remove rare species. Note, all of these datasets used the same candidate species, and so the number of tested interactions was identical (189) for each test
ggplot(data = plot.data, aes(x = plot.size,
y = prop.correct, colour = model)) +
theme_bw() + geom_point() + geom_line() +
labs(y = "Correctly predicted interactions\n (%)",
x = "Plot size\n (meters)", title = "Including all species",
subtitle = "189 interactions tested in each plot")
Here we realize the full power of regularized multivariate algorithms. The regularized Partial Correlation is by far the best-performing model across the board, achieving over 50 percent prediction accuracy at the 5 x 5m plot size. The MRF also fits well, until we reach the larger plot sizes. Null randomisations, arguably the most commonly-used co-occurrence metrics, are the worst performing model. This evidence indicates that indirect associations should be accounted for wherever possible in co-occurrence analysis. As the number of species goes up, likelihood-based approaches such as this become more useful for isolating associations from sparse multivariate data.
This function also stores information on model precision, specificity and recall, which are worth exploring further to highlight systematic biases from each method. Here I plot the specificity of each model (e.g. the proportion of observed negatives that were predicted to be negative), which is a very useful indicator since Barner et al found that most methods fail to accurately capture negative associations
specificity <- c(unlist(purrr::map(co.occur.5.5.filtered,
"specificity")), unlist(purrr::map(co.occur.10.10.filtered,
"specificity")), unlist(purrr::map(co.occur.15.15.filtered,
"specificity")), unlist(purrr::map(co.occur.20.20.filtered,
"specificity")), unlist(purrr::map(co.occur.25.25.filtered,
"specificity")))
plot.size <- c(rep(5, 5), rep(10, 5), rep(15,
5), rep(20, 5), rep(25, 5))
model <- rep(c("OR", "Null Randomisation",
"Pearson Correlation", "MRF", "Partial Correlation"),
5)
plot.data <- data.frame(plot.size = plot.size,
model = model, specificity = specificity)
ggplot(data = plot.data, aes(x = plot.size,
y = specificity, colour = model)) + theme_bw() +
geom_point() + geom_line() + labs(y = "Model specificity",
x = "Plot size\n (meters)", title = "Including only species with >3% prevalence")
Is interesting here that the Odds-Ratio test is generally good for identifying negative associations on these smaller datasets that have more frequently-occurring species. What about for the full dataset?
specificity <- c(unlist(purrr::map(co.occur.5.5,
"specificity")), unlist(purrr::map(co.occur.10.10,
"specificity")), unlist(purrr::map(co.occur.15.15,
"specificity")), unlist(purrr::map(co.occur.20.20,
"specificity")), unlist(purrr::map(co.occur.25.25,
"specificity")))
plot.size <- c(rep(5, 5), rep(10, 5), rep(15,
5), rep(20, 5), rep(25, 5))
model <- rep(c("OR", "Null Randomisation",
"Pearson Correlation", "MRF", "Partial Correlation"),
5)
plot.data <- data.frame(plot.size = plot.size,
model = model, specificity = specificity)
ggplot(data = plot.data, aes(x = plot.size,
y = specificity, colour = model)) + theme_bw() +
geom_point() + geom_line() + labs(y = "Model specificity",
x = "Plot size\n (meters)")
Here, the pairwise Odds-Ratio method again does very well at identifying negative associations, while the Partial and Pearson Correlation methods also do fairly well. More exploration can of course be done by investigating precision, recall, positive predictive values etc….
Hopefully the tone of this post shows that I very much agree with Barner et al and Timothée Poisot that assuming co-occurrences always reflect biotic interactions is a bad idea that limits our understanding of community ecology. Non-random co-occurrences may be an important precursor for direct biotic interactions, such as competition or facilitation. Co-occurrence patterns can also point to major mechanisms influencing aspects of community assembly, such as environmental filters that only accommodate species with particular traits or systematic drivers of local abundance variation. But at their most basic level, co-occurrence networks should be thought of as highlighting species that have similar local niches. This can help improve our understanding of which species may respond most strongly to habitat change, for example, but not necessarily define which species are competing directly. As a take-home from this post, users should attempt to account for indirect associations when assessing co-occurrences, either with MRFs (if enough data is available) or regularized Partial Correlations. A huge thanks goes out to Allison Barner and her team for making this dataset open-source. This is how science should always be carried out in my opinion.
References
Barner, A.K., Coblentz, K.E., Hacker, S.D. & Menge, B.A. (2018) Fundamental contradictions among observational and experimental estimates of non‐trophic species interactions. Ecology, doi: https://doi.org/10.1002/ecy.2133.
Clark, N.J., Wells, K. & Lindberg, O. (2018) Unravelling changing interspecific interactions across environmental gradients using Markov random fields. Ecology, doi: 10.1002/ecy.2221.
Harris, D.J. (2015) Generating realistic assemblages with a joint species distribution model. Methods in Ecology and Evolution, 6, 465-473.
Harris, D.J. (2016) Inferring species interactions from co‐occurrence data with Markov networks. Ecology, 97, 3308-3314.
Dormann, C.F., Bobrowski, M., Dehling, D.M., Harris, D.J., Hartig, F., Lischke, H., Moretti, M.D., Pagel, J., Pinkert, S., Schleuning, M., Schmidt, S.I., Sheppard, C.S., Steinbauer, M.J., Zeuss, D. & Kraan, C. (2018) Biotic interactions in species distribution modelling: 10 questions to guide interpretation and avoid false conclusions. Global Ecology and Biogeography, doi: https://doi.org/10.1111/geb.12759.
About me
My name is Nicholas (Nick) Clark, and I am a disease ecologist exploring new ways to (1) understand how natural communities are formed and (2) predict how they will change over time. I completed my PhD with Dr Sonya Clegg and the Griffith University Wildlife Disease Ecology Group. This work blended ideas from disease ecology, evolution and biogeography to yield new insights into how avian malaria and microfilaria parasites disperse and evolve. Currently a Postdoctoral Fellow in the School of Veterinary Science at the University of Queensland (UQ), I develop computational phylogenetic tools and adapt techniques from statistical network theory to study how parasites interact with their hosts and how these interactions change across urbanisation gradients. This work covers a broad range of host-pathogen systems. Please have a look at my website to learn more about my work. http://nicholasjclark.weebly.com/
The full function used for fitting models
#### Begin co_occur_test function ####
co_occur_test = function(comm.data,
prevalence.cutoff = 0,
n.cores = parallel::detectCores() - 1){
# Remove site-level columns to only keep species pres-abs columns
site.level.cols <- which(grepl('quadrat',colnames(comm.data)))
comm.data.all.spec <- comm.data[,-site.level.cols]
#### Keep only species represented in the truth data, this is to
# ensure all names match properly and facilitate comparisons ####
cat('Matching species names in comm.data to truth dataset...\n')
to.keep <- lapply(seq_len(ncol(comm.data.all.spec)), function(x){
keep <- colnames(comm.data.all.spec)[x] %in% taxon.names$survey_name
if(keep){
exp.name <- as.character(taxon.names$experimental_name[which(taxon.names$survey_name %in%
colnames(comm.data.all.spec)[x])])
} else {
exp.name <- 'NA'
}
if(is.na(exp.name)){
keep <- FALSE
} else {
if(exp.name %in% taxa.to.test){
keep <- TRUE
} else {
keep <- FALSE
}
}
list(keep = keep, exp.name = exp.name)
})
keepers <- data.frame(Names = unlist(purrr::map(to.keep,
'exp.name')),
Keep = unlist(purrr::map(to.keep,'keep')))
comm.data.all.spec <- comm.data.all.spec[, unlist(purrr::map(to.keep,
'keep'))]
colnames(comm.data.all.spec) <- keepers$Names[which(keepers$Keep)]
#### Removing duplicates and ultra-rare species ####
# We can't have duplicate names for MRFs, so need to aggregate these by summing
cat('Aggregating duplicate species occurrences...\n')
true.names <- unique(as.character(keepers$Names[which(keepers$Keep)]))
agg.comm.list <- lapply(seq_len(length(true.names)), function(x){
cols.match <- which(grepl(true.names[x],
colnames(comm.data.all.spec)) == T)
if(length(cols.match) > 1){
outcome.vector <- rowSums(comm.data.all.spec[,cols.match])
} else {
outcome.vector <- as.vector(comm.data.all.spec[,cols.match])
}
outcome.vector[outcome.vector > 1] <- 1
list(outcome.vector = outcome.vector, name = true.names[x])
})
agg.comm.data <- data.frame(do.call(cbind, purrr::map(agg.comm.list,
'outcome.vector')))
colnames(agg.comm.data) <- unlist(purrr::map(agg.comm.list, 'name'))
# Some species may be too rare to include for meaningful inference
# Here, we remove species that occurred in less than the specified cutoff
if(prevalence.cutoff == 0){
agg.comm.data <- agg.comm.data
} else {
cat('Removing species whose prevalences are below',
prevalence.cutoff * 100, '%...\n')
too.rare <- which((colSums(agg.comm.data) / nrow(agg.comm.data)) < prevalence.cutoff)
agg.comm.data <- agg.comm.data[, -too.rare]
}
# How many species in total do we have in the test data?
n.species <- ncol(agg.comm.data)
# Store a distribution of prevalences for later exploration
species.prevs <- colSums(agg.comm.data) / nrow(agg.comm.data)
#### Joining occurrence data to covariates ####
# Add quadrat id for joining to environmental variables
agg.comm.data$quadrat_id <- comm.data$quadrat_id
library(dplyr)
joined.data = agg.comm.data %>%
dplyr::left_join(data.frame(env.data %>%
dplyr::select(quadrat_id,
distanceto100m,
rugosity_1,
sand_presence,
airtemp_mean,
watertemp_mean)),
by = "quadrat_id") %>%
dplyr::select(-quadrat_id)
#### The first three methods we test are the Odds-Ratio, Null Randomisation
# and full correlation methods ####
# 1. Martin Westgate's Odds-Ratio example
cat('Fitting the OR pairwise test...\n')
library(sppairs)
or.dat <- sppairs::spaa(joined.data[,1:n.species], method = "or.glm")
# Join the resulting matrix to the 'true.ints' truth dataset
# This is done twice to ensure we capture all of the pairwise comparisons
or.dat %>%
dplyr::mutate(interaction.sign = dplyr::case_when(
# Lane et al use cutoffs of 3 and 1/3 for identifying 'interactions'
value > 3 ~ 1, value < 0.33 ~ -1, TRUE ~ 0)) %>%
dplyr::select(-value) %>%
purrr::set_names(c('sp_i_name','sp_j_name','interaction_sign_MRF')) %>%
dplyr::inner_join(true.ints, by = c("sp_i_name", "sp_j_name")) -> match.1
# Orders of names might not match the truth, try switching them and joining again
or.dat %>%
dplyr::mutate(interaction.sign = dplyr::case_when(
value > 3 ~ 1, value < 0.33 ~ -1, TRUE ~ 0)) %>%
dplyr::select(-value) %>%
purrr::set_names(c('sp_j_name','sp_i_name','interaction_sign_MRF')) %>%
dplyr::inner_join(true.ints, by = c("sp_j_name", "sp_i_name")) -> match.2
# Join the two matched datasets together and remove duplicates
data.frame(plyr::rbind.fill(match.1, match.2)) %>%
dplyr::filter(!is.na(interaction_sign)) %>%
dplyr::filter(!is.na(interaction_sign_MRF)) %>%
dplyr::select(sp_i_name, sp_j_name,interaction_sign_MRF,
interaction_sign) %>%
dplyr::distinct() -> or.test
# 2. Null matrix randomizations
# Function to count number of co-occurrences for each pair
cat('Fitting the CO-OCCUR test with 500 randomisations...\n')
count.cooccurs = function(data){
co.occurs <- crossprod(as.matrix(data))
co.occurs <- as.integer(co.occurs[upper.tri(co.occurs)])
}
# Count observed co-occurrences
obs.cos <- count.cooccurs(joined.data[,1:n.species])
# Compare with 500 null matrices
nul <- vegan::permatswap(joined.data[,1:n.species], fixedmar = "both",
mtype = "prab",
method = "swap", times = 500, burnin = 1000)
# Calculate number of expected co-occurrences, subtract observed for a differential
library(parallel)
cl <- makePSOCKcluster(n.cores)
setDefaultCluster(cl)
# Export necessary datasets
clusterExport(NULL, c('nul','obs.cos','n.species'), envir = environment())
# Export the count.cooccurs function
clusterExport(NULL, c('count.cooccurs'), envir = environment())
# Run in parallel, include a progress bar
exp.co.list <- pbapply::pblapply(seq_len(500), function(x){
comp <- count.cooccurs(data.frame(nul$perm[[x]])) - obs.cos},
cl = cl)
stopCluster(cl)
# Bind all differentials in a dataframe
null.compares <- data.frame(do.call(cbind, exp.co.list))
# Function to calculate 95% quantiles of differentials and
# return a 'significance' value for each comparison
sig.func = function(x){
quant <- quantile(x, probs = c(0.025, 0.975))
# comparison is 'significant' if 95% quantiles don't include zero
if(quant[1] < 0 & quant[2] < 0){
sig <- 1
} else if(quant[1] > 0 & quant[2] > 0){
sig <- -1
} else {
sig <- 0
}
}
# Calculate 'significant' pairwise co-occurrences
sigs <- apply(null.compares, 1, sig.func)
null.test <- matrix(NA, n.species, n.species)
null.test[upper.tri(null.test)] <- sigs
null.test[lower.tri(null.test)] <- t(null.test)[lower.tri(null.test)]
diag(null.test) <- 0
rownames(null.test) <- colnames(null.test) <- colnames(joined.data)[1:n.species]
# 3. Fit full correlations on bootstrapped datasets (shuffled rows)
# Function to calculate Pearson correlation and set non-significant ones to zero
cat('Fitting the Pearson Correlations test...\n')
cor.prob <- function(X, dfr = nrow(X) - 2) {
# Take the Pearson correlation of all pairwise combinations
R <- cor(X)
above <- row(R) < col(R)
# Calculate significance level for ech
r2 <- R[above]^2
Fstat <- r2 * dfr / (1 - r2)
sigs.mat <- R
sigs.mat[above] <- 1 - pf(Fstat, 1, dfr)
sigs.mat[lower.tri(sigs.mat)] <- t(sigs.mat)[lower.tri(sigs.mat)]
# Replace non-significant corrs with zero
R[which(sigs.mat < 0.05)] <- 0
R
}
corr_matrices <- lapply(seq_len(500), function(x){
# Shuffle rows of data for bootstrapping
shuffle_data = data.frame(joined.data[, 1:n.species]) %>%
dplyr::sample_n(., nrow(.), TRUE)
fit <- suppressWarnings(cor.prob(shuffle_data))
})
corr_matrix <- Reduce(`+`, corr_matrices) / length(corr_matrices)
diag(corr_matrix) <- 0
rownames(corr_matrix) <- colnames(corr_matrix) <- colnames(joined.data)[1:n.species]
corr_matrix[is.na(corr_matrix)] <- 0
#### The above three methods ignore indirect associations, where a pair
# of species are associated but this is masked by their shared associations
# with a third species. Now we fit two models that can capture these ####
# 4. Fit a Markov Random Fields model (no covariates) ####
library(MRFcov)
cat('Fitting a Markov Random Fields model...\n')
mrf <- try(MRFcov(data = joined.data[, 1:n.species],
family = 'binomial',
n_cores = n.cores,
n_nodes = n.species), silent = TRUE)
# MRFcov has a built-in test to error if fits cannot take place
if(class(mrf) == "try-error"){
mrf.passed <- FALSE
cat('Some species are too rare, Markov Random Fields model has failed...\n')
} else {
mrf.passed <- TRUE
}
# 5. Calculate a Regularized (ridge regression) Partial Correlations metric
# (code from David Harris)
cat('Fitting a regularized Partial Correlations model...\n')
par_corr_matrices <- lapply(seq_len(500), function(x){
shuffle_data = data.frame(joined.data[, 1:n.species]) %>%
dplyr::sample_n(., nrow(.), TRUE)
# regularization penalty increases sparsity, reduces spurious 'interactions'
par_fit <- -suppressWarnings(rags2ridges::ridgeS(cov(shuffle_data),
lambda = 1 / (5 * nrow(shuffle_data))))
})
par_corr_matrix <- Reduce(`+`,par_corr_matrices) / length(par_corr_matrices)
diag(par_corr_matrix) <- 0
rownames(par_corr_matrix) <- colnames(par_corr_matrix) <- colnames(joined.data)[1:n.species]
#### Assessing how did they all fit the truth data ####
# Function to calculate 'fit' metrics by comparing against 'truth'
mod.metrics = function(graph, reshape){
if(reshape){
reshape2::melt(graph) %>%
dplyr::mutate(interaction.sign = dplyr::case_when(
value > 0 ~ 1,
value < 0 ~ -1,
value == 0 ~ 0
)) %>%
dplyr::select(-value) %>%
purrr::set_names(c('sp_i_name','sp_j_name','interaction_sign_MRF')) %>%
dplyr::inner_join(true.ints, by = c("sp_i_name", "sp_j_name")) -> match.1
reshape2::melt(graph) %>%
dplyr::mutate(interaction.sign = dplyr::case_when(
value > 0 ~ 1,
value < 0 ~ -1,
value == 0 ~ 0
)) %>%
dplyr::select(-value) %>%
purrr::set_names(c('sp_j_name','sp_i_name','interaction_sign_MRF')) %>%
dplyr::inner_join(true.ints, by = c("sp_j_name", "sp_i_name")) -> match.2
data.frame(plyr::rbind.fill(match.1, match.2)) %>%
dplyr::filter(!is.na(interaction_sign)) %>%
dplyr::filter(!is.na(interaction_sign_MRF)) %>%
dplyr::select(sp_i_name, sp_j_name,interaction_sign_MRF,
interaction_sign) %>%
dplyr::distinct() -> compare.data
} else {
compare.data <- graph
}
#model precision = (true positives/ true positives + false positives)
true.pos <- length(which(compare.data$interaction_sign_MRF == 1 &
compare.data$interaction_sign == 1))
false.pos <- length(which(compare.data$interaction_sign_MRF == 1 &
compare.data$interaction_sign != 1))
precision <- true.pos / (true.pos + false.pos)
if(is.na(precision)){
precision <- 0
}
#model recall = (true positives/true positives + false negatives)
false.neg <- length(which(compare.data$interaction_sign_MRF == -1 &
compare.data$interaction_sign != -1))
recall <- true.pos / (true.pos + false.neg)
if(is.na(recall)){
recall <- 0
}
#model specificity = (true negatives/true negatives + false positives)
true.neg <- length(which(compare.data$interaction_sign_MRF == -1 &
compare.data$interaction_sign == -1))
specificity <- true.neg / (true.neg + false.pos)
#rates of false predictions
false.pos.rate <- false.pos / nrow(compare.data)
false.neg.rate <- false.neg / nrow(compare.data)
list(precision = precision,
recall = recall,
specificity = specificity,
prop.correct = length(which(compare.data$interaction_sign_MRF ==
compare.data$interaction_sign)) /
length(compare.data$interaction_sign))
}
cat('Calculating fit metrics for each test...\n')
# OR
or.fit <- suppressWarnings(mod.metrics(or.test, reshape = F))
# Null
null.fit <- suppressWarnings(mod.metrics(null.test, reshape = T))
# Corr fit
corrs.fit <- suppressWarnings(mod.metrics(corr_matrix, reshape = T))
# MRF
if(mrf.passed){
mrf.fit <- suppressWarnings(mod.metrics(mrf$graph, reshape = T))
}
# Partial Corrs
partial.corrs.fit <- suppressWarnings(mod.metrics(par_corr_matrix, reshape = T))
if(mrf.passed){
return(list(species.prevs = species.prevs,
num.interactions.tested = length(or.test$interaction_sign_MRF),
or.fit = or.fit,
null.fit = null.fit,
corrs.fit = corrs.fit,
mrf.fit = mrf.fit,
partial.corrs.fit = partial.corrs.fit))
} else {
return(list(species.prevs = species.prevs,
num.interactions.tested = length(or.test$interaction_sign_MRF),
or.fit = or.fit,
null.fit = null.fit,
corrs.fit = corrs.fit,
partial.corrs.fit = partial.corrs.fit))
}
# End co_occur_test function
}