Estimation of multilevel co-occurrence networks from eDNA using taxonomic information and coupling

Author

Trevor D. Ruiz

Published

April 9, 2026

Note

This report was generated using AI under human direction to summarize results prepared and validated by the author. The contents of the report have been reviewed and edited for accuracy.

Introduction

Co-occurrence networks identify pairs of taxa whose relative abundances co-vary across samples in ways that are unlikely to arise by chance, after accounting for indirect associations. The edges of such a network represent partial correlations — associations between two taxa that hold even after conditioning on all others — and can suggest direct ecological interactions such as competition, predation, or mutualism.

This report documents network inference from an eDNA time series from Morro Bay, CA using a novel taxonomy-aware “coupling” method that borrows information across multiple taxonomic levels to improve network inference in the underpowered/noisy setting and detect otherwise latent potential ecological interactions. The report also includes some preliminary exploratory analysis downstream of network inference.

Data

Water samples were collected from the Cal Poly Pier by the Pasulka Lab monthly at three depths (1, 5, and 9 m) yielding 47 samples in total. Microbial communities were characterised by 18S rRNA amplicon sequencing and reads were processed with QIIME2. Resulting ASV taxonomic annotations were cleaned and partial annotations were removed by propagating the first missing level downward to all finer levels to prevent partially assigned taxa from appearing at spuriously specific ranks. For quality control checks: (1) the per-sample fraction of unassigned reads (no taxonomic annotations at any level) was below 9% for all samples; (2) two samples with more than 20% of reads mapping to Metazoa were flagged and excluded (AV341 and AV342); and (3) all ASVs assigned to Metazoa or Fungi were removed along with fully unassigned ASVs. After quality control, a total of 47 raw ASVs across 51 classes and 125 genera remained. Prior to network inference, ASVs were aggregated to genus and class levels, and genera present in fewer than 15% of samples were excluded to reduce noise from near-absent taxa. This left 40 genus-level and 18 class-level features available for modelling.

Table 1: Dataset dimensions after prevalence filtering. The n/p ratio at genus level is severely underpowered for standard graphical lasso, motivating the coupled approach.
Level Features Samples n/p ratio Possible edges
Class 18 47 2.61 153
Genus 40 47 1.18 780

The n/p ratio at the genus level is 1.18, well below the range where standard graphical lasso is reliable. This underpowered regime is the primary motivation for the novel methodology.

Methods

Graphical lasso

The graphical lasso fits a sparse precision matrix (the inverse of the covariance matrix) by penalising non-zero off-diagonal entries. Each non-zero entry in the estimated precision matrix corresponds to a direct pairwise association; the inferred network is simply a representation of the set of non-zero entries. Thus, network edges inferred via the graphical lasso represent partial correlations, i.e., the association between two taxa after removing the linear effects of all other taxa in the network. Since the method assumes data are Gaussian, these are sometimes called conditional dependence networks. Because amplicon sequencing data are compositional, raw counts are first transformed using a row-adaptive centred log-ratio (CLR) transformation before network inference. The row-adaptive variant uses only features present in a given sample when standardizing counts, which better handles the sparse, zero-inflated structure typical of eDNA data.

Coupling across taxonomic levels: borrowing information to detect weak signal

The central methodological challenge is that the genus-level n/p ratio of 1.18 gives a standard graphical lasso very little power to distinguish genuine from spurious associations. The genus-level network is likely to either miss real edges (false negatives) or produce a noisy mix of real and artefactual ones.

The coupled approach addresses this by jointly fitting genus-level and class-level networks and allowing each to inform the other. The intuition is straightforward: if two classes co-occur consistently across samples, genera that belong to those classes are more likely to also co-occur — and vice versa. Formally, a cross-level consistency penalty is added to the joint estimation objective. Genus-level signals that sit between classes with a stronger aggregate signal receive a reduced penalty, making them easier to select, and genus-level signals that cross a class boundary with weaker class-level signal receive an increased penalty, suppressing associations that lack class-level support. The two levels are fitted simultaneously, so the class-level network is itself shaped by accumulated evidence from the genus level and vice-versa. This information borrowing allows the coupled model to detect associations at both levels that are below independently specified intra-level detection thresholds. Simulation has been used to confirm that the method does lead to improved recovery on synthetic datasets in which ground truth is known.

Following network estimation, edge diagnostics — scatter plots of CLR values for each inferred edge — were inspected visually to confirm that selected edges reflect genuine co-variation across samples. The majority of edges selected by the coupled approach passed that inspection. (The same was not true of uncoupled graphical lasso fits at each taxonomic level, which instead tended to select co-absent taxa.)

Network structure

The coupled network at the class level comprises 26 edges across 15 connected classes, while the genus-level network includes 22 edges. The combined network layout below shows both levels simultaneously. Solid blue edges are inferred edges (present in the stability-selected network); dashed orange borders indicate discrepancies between levels — class pairs that share at least one cross-class genus edge in the genus network but that did not themselves reach the selection threshold at the class level.

Figure 1: Coupled network at class (left heatmap) and genus (right panel) levels. In the heatmap, fill intensity reflects the number of supporting genus-level cross-class edges; black borders mark direct class-level edges; dashed orange borders mark implied (but not estimated) class-level edges. The genus network (right) shows inferred genus-level partial correlations.

Owing to the coupling, there is very little upward-discrepancy (dashed orange edges) – only two genus-level cross-class connections are not reaffirmed at the class level. However, the genus-level network is notably sparse and most genus-level connections involve Dinophyceae. Interestingly, as a result there is substantial downward discrepancy – the heatmap shows many inferred class-level interactions that are not resolved at the genus level. However, the class-level edges passed diagnostic validation and do show real co-occurrence; the most likely explanation for the discrepancy is simply that the genus-level data are too noisy to resolve aggregate interactions.

Influential taxa

Degree centrality — the number of edges incident to a node — is a summary statistic that identifies the most connected taxa in a network. At the class level, Mediophyceae and Filosa-Imbricatea are the most connected; at the genus level, Akashiwo and Tripos (both dinoflagellates) and Prorocentrum (a harmful algal bloom genus) are the most connected.

Table 2: Degree centrality at the class level (coupled fit). Only connected nodes shown.
Class Degree
Mediophyceae 7
Filosa-Imbricatea 6
Bacillariophyceae 5
Bicoecea 5
Dinophyceae 5
Filosa-Thecofilosea 5
Opalozoa 4
Spirotrichea 3
Syndiniales 3
Chrysophyceae 2
Gregarinomorphea 2
Oligohymenophorea 2
Bolidophyceae 1
Pelagophyceae 1
Prostomatea 1
Table 3: Top 10 genus-level nodes by degree centrality (coupled fit).
Genus Degree
Akashiwo 8
Tripos 7
Prorocentrum 5
Thalassiosira 4
Cryothecomonas 3
Minidiscus 3
Ansanella 2
Chaetoceros 2
Ebria 2
Lepidodinium 2

Taxonomic interactions

The five strongest class-level edges, ranked by absolute partial correlation, span a mix of positive (co-occurrence) and negative (exclusion) associations.

Table 4: Top class-level edges by absolute partial correlation. Negative values indicate exclusion; positive values indicate co-occurrence.
Edge ID Class A Class B Partial correlation
1 Filosa-Imbricatea Spirotrichea -0.382
2 Bicoecea Dinophyceae 0.338
3 Bicoecea Mediophyceae -0.330
4 Chrysophyceae Syndiniales 0.279
5 Dinophyceae Filosa-Imbricatea -0.276

The strongest estimated interaction is a negative partial correlation between Filosa-Imbricatea and Spirotrichea, suggesting competitive exclusion or top-down control between two sub-groups. The second-strongest estimated interaction is a positive association between Bicoecea and Dinophyceae (dinoflagellates), suggesting co-occurrence.

To understand which genera drive each class-level edge, we decompose each class node into its genus-level constituents and quantify each genus’ contribution to the aggregate signal via the covariance between the genus signal and the aggregate (class) signal, reported as fraction of the class-level variance. Values of this metric near 1 indicate the genus closely tracks the class node; values near 0 indicate the genus adds little to the class signal; negative values indicate the genus co-varies inversely with the rest of the class.

The top five contributing genera per node are shown below for each of the five strongest class-level edges.

Table 5

Edge 1: Filosa-Imbricatea — Spirotrichea (partial correlation = -0.382; negative association)

Class Genus ASVs Variance proportion
Filosa-Imbricatea Thaumatomonas 4 0.348
Filosa-Imbricatea Paulinella 5 0.198
Filosa-Imbricatea Peregrinia 2 0.027
Spirotrichea Pelagostrobilidium 6 1.044
Spirotrichea Eutintinnus 5 0.387
Spirotrichea Rimostrombidium 3 0.359
Spirotrichea Strombidinopsis 7 0.329
Spirotrichea Tintinnopsis_05 2 0.212

Edge 2: Bicoecea — Dinophyceae (partial correlation = 0.338; positive association)

Class Genus ASVs Variance proportion
Bicoecea Bicosoeca 5 0.137
Bicoecea Caecitellus 2 0.014
Dinophyceae Gyrodinium 120 0.859
Dinophyceae Tripos 83 0.818
Dinophyceae Dinophysis 7 0.812
Dinophyceae Prorocentrum 25 0.701
Dinophyceae Ansanella 5 0.626

Edge 3: Bicoecea — Mediophyceae (partial correlation = -0.330; negative association)

Class Genus ASVs Variance proportion
Bicoecea Bicosoeca 5 0.137
Bicoecea Caecitellus 2 0.014
Mediophyceae Thalassiosira 34 1.015
Mediophyceae Chaetoceros 46 0.808
Mediophyceae Minidiscus 11 0.745
Mediophyceae Skeletonema 3 0.184
Mediophyceae Leptocylindrus 1 0.065

Edge 4: Chrysophyceae — Syndiniales (partial correlation = 0.279; positive association)

Class Genus ASVs Variance proportion
Chrysophyceae Paraphysomonas 16 0.583
Syndiniales Euduboscquella 9 0.433
Syndiniales Syndinium 6 0.431
Syndiniales Hematodinium 5 0.264

Edge 5: Dinophyceae — Filosa-Imbricatea (partial correlation = -0.276; negative association)

Class Genus ASVs Variance proportion
Dinophyceae Gyrodinium 120 0.859
Dinophyceae Tripos 83 0.818
Dinophyceae Dinophysis 7 0.812
Dinophyceae Prorocentrum 25 0.701
Dinophyceae Ansanella 5 0.626
Filosa-Imbricatea Thaumatomonas 4 0.348
Filosa-Imbricatea Paulinella 5 0.198
Filosa-Imbricatea Peregrinia 2 0.027

Several patterns stand out. Within Spirotrichea, Pelagostrobilidium and Eutintinnus dominate the class signal, with variance proportions indicating they closely track the class-level abundance. Dinophyceae signal is spread more broadly, with Gyrodinium, Tripos, Dinophysis, and Prorocentrum all contributing substantially — notably, several of these are known harmful algal bloom (HAB) genera. Mediophyceae is driven primarily by Thalassiosira and Chaetoceros, both abundant diatom genera in coastal systems.

Network coherence

A natural question is whether the inferred network structure is actually expressed in any given sample. A sample where all co-occurring pairs are positively associated and all exclusion pairs are negatively expressed is behaving “as expected” by the network — it is coherent with the inferred structure. A sample where associations are scrambled is not.

We measure this with a coherence t-statistic computed per sample. For each sample, we form the vector of sign-weighted edge products: for each network edge (A, B), we compute sign(ρAB) × CLR(A) × CLR(B), where ρAB is the partial correlation. This sign-weighting aligns the product with the expected direction — positive contributions come from edges where co-occurring taxa are indeed co-occurring, and from exclusion pairs that are indeed negatively expressed. The coherence t-statistic is:

\[t = \sqrt{n_\text{edges}} \cdot \frac{\bar{x}}{s_x}\]

where \(\bar{x}\) and \(s_x\) are the mean and standard deviation of the edge product vector across edges. Under random expression with respect to the network, \(t \approx N(0, 1)\) for \(n_\text{edges} \geq 20\). Values of \(|t| > 2\) indicate that the community is expressing the network structure significantly beyond chance. Importantly, the statistic is computed separately for the class-level network, the genus-level network, and both combined, allowing comparison across levels.

Figure 2: Per-sample network coherence t-statistic over time, faceted by depth. The dashed line marks the approximate N(0,1) threshold of |t| = 2. Values above this line indicate that the community is expressing the network significantly beyond chance.

Coherence is generally positive across samples and time points, indicating that the inferred network is meaningfully expressed in the data rather than being an artefact of the estimation procedure. The class-level and total coherence tracks are broadly similar, while genus-level coherence is noisier — expected given the sparser genus-level network and the greater compositional variability at finer taxonomic resolution.

In general, the coherence t-statistic provides a convenient sample-level summary of network activation that facilitates downstream analyses of community dynamics across environmental conditions without requiring a full multivariate accounting of the compositional data or disaggregated interactions along specific edges.

Potential downstream analyses

CautionPreliminary analysis

The analyses in this section are exploratory and should be treated as hypothesis-generating. With 47 samples and multiple simultaneous comparisons, effect sizes are uncertain and individual p-values should be interpreted with appropriate caution.

Environmental drivers

We can ask whether variation in network coherence tracks measurable environmental conditions, i.e., whether the inferred community is “activated” by particular conditions and thus whether the network captures ecologically meaningful structure.

Spearman correlations were computed between each coherence measure and four continuous environmental variables: depth, temperature, salinity, and chlorophyll. Conditioning on the estimated network, per-sample t-statistics are functions only of that sample’s CLR values, so standard hypothesis tests are applicable (with the large caveat that the unconditional dependence is very difficult to quantify so a high level of caution should be used when interpreting p-values).

Table 6: Spearman correlations between network coherence t-statistics and environmental drivers. Bonferroni threshold for 12 simultaneous tests: p < 0.004.
Network Driver ρ p-value
class depth -0.128 0.390
genus depth 0.081 0.587
total depth -0.056 0.709
class temp 0.192 0.202
genus temp 0.034 0.820
total temp 0.191 0.204
class salinity 0.117 0.437
genus salinity -0.235 0.116
total salinity -0.043 0.778
class chlorophyll 0.364 0.012
genus chlorophyll 0.279 0.058
total chlorophyll 0.450 0.002

Chlorophyll is the only variable with a notable and statistically credible association with network coherence. Depth, temperature, and salinity show no meaningful signal across network levels.

The chlorophyll–coherence association is not uniform across the water column. Stratifying by depth reveals that the relationship is essentially absent at the surface (1 m) and moderate-to-strong at 5 m and 9 m — though individual stratum sample sizes (n ≈ 15–17) limit the precision of these estimates.

Table 7: Spearman correlations between chlorophyll and network coherence, stratified by depth. Low power at n ≈ 15–17 per depth.
Depth (m) Class Genus Total n
1 rho=0.07, p=0.801 rho=0.21, p=0.417 rho=0.23, p=0.374 17
5 rho=0.42, p=0.121 rho=0.45, p=0.092 rho=0.50, p=0.058 15
9 rho=0.43, p=0.108 rho=0.20, p=0.467 rho=0.56, p=0.031 15
Figure 3: Network coherence (total) vs chlorophyll, coloured by depth. Linear trend lines are fitted per depth. The surface layer (1 m) shows a markedly shallower slope, driven in part by two high-chlorophyll samples with unexpectedly low coherence.

One interpretation of the surface anomaly is that near-surface communities are subject to greater physical variability (mixing) that decouples compositional patterns from chlorophyll.

Seasonal variation

Table 8: Kruskal-Wallis test for differences in coherence across astronomical seasons.
Network KW statistic p-value
class 9.981 0.019
genus 7.505 0.057
total 13.052 0.005
Table 9: Median coherence t-statistic by season. Summer and fall show the highest coherence at all network levels.
Season Class Genus Total N
spring 1.43 1.04 1.53 11
summer 2.34 2.87 3.70 11
fall 1.75 3.31 3.60 13
winter 1.05 2.33 2.61 12

Total and class-level coherence differ significantly across seasons (Kruskal-Wallis, p = 0.005 for total coherence). Summer and fall show the highest median coherence, consistent with bloom conditions; winter is the lowest. Season explains only approximately 22% of the variance in chlorophyll, which suggests that the two variables share signal but are far from redundant.

The results above, as mentioned at the top of the document, are exploratory in nature and intended to illustrate potential applications of the inferred networks for identifying potential ecological structure in the data.