| Level | Features | Samples | n/p ratio | Possible edges |
|---|---|---|---|---|
| Class | 18 | 47 | 2.61 | 153 |
| Genus | 40 | 47 | 1.18 | 780 |
Estimation of multilevel co-occurrence networks from eDNA using taxonomic information and coupling
This report was generated using AI under human direction to summarize results prepared and validated by the author. The contents of the report have been reviewed and edited for accuracy.
Introduction
Co-occurrence networks identify pairs of taxa whose relative abundances co-vary across samples in ways that are unlikely to arise by chance, after accounting for indirect associations. The edges of such a network represent partial correlations — associations between two taxa that hold even after conditioning on all others — and can suggest direct ecological interactions such as competition, predation, or mutualism.
This report documents network inference from an eDNA time series from Morro Bay, CA using a novel taxonomy-aware “coupling” method that borrows information across multiple taxonomic levels to improve network inference in the underpowered/noisy setting and detect otherwise latent potential ecological interactions. The report also includes some preliminary exploratory analysis downstream of network inference.
Data
Water samples were collected from the Cal Poly Pier by the Pasulka Lab monthly at three depths (1, 5, and 9 m) yielding 47 samples in total. Microbial communities were characterised by 18S rRNA amplicon sequencing and reads were processed with QIIME2. Resulting ASV taxonomic annotations were cleaned and partial annotations were removed by propagating the first missing level downward to all finer levels to prevent partially assigned taxa from appearing at spuriously specific ranks. For quality control checks: (1) the per-sample fraction of unassigned reads (no taxonomic annotations at any level) was below 9% for all samples; (2) two samples with more than 20% of reads mapping to Metazoa were flagged and excluded (AV341 and AV342); and (3) all ASVs assigned to Metazoa or Fungi were removed along with fully unassigned ASVs. After quality control, a total of 47 raw ASVs across 51 classes and 125 genera remained. Prior to network inference, ASVs were aggregated to genus and class levels, and genera present in fewer than 15% of samples were excluded to reduce noise from near-absent taxa. This left 40 genus-level and 18 class-level features available for modelling.
The n/p ratio at the genus level is 1.18, well below the range where standard graphical lasso is reliable. This underpowered regime is the primary motivation for the novel methodology.
Methods
Graphical lasso
The graphical lasso fits a sparse precision matrix (the inverse of the covariance matrix) by penalising non-zero off-diagonal entries. Each non-zero entry in the estimated precision matrix corresponds to a direct pairwise association; the inferred network is simply a representation of the set of non-zero entries. Thus, network edges inferred via the graphical lasso represent partial correlations, i.e., the association between two taxa after removing the linear effects of all other taxa in the network. Since the method assumes data are Gaussian, these are sometimes called conditional dependence networks. Because amplicon sequencing data are compositional, raw counts are first transformed using a row-adaptive centred log-ratio (CLR) transformation before network inference. The row-adaptive variant uses only features present in a given sample when standardizing counts, which better handles the sparse, zero-inflated structure typical of eDNA data.
Coupling across taxonomic levels: borrowing information to detect weak signal
The central methodological challenge is that the genus-level n/p ratio of 1.18 gives a standard graphical lasso very little power to distinguish genuine from spurious associations. The genus-level network is likely to either miss real edges (false negatives) or produce a noisy mix of real and artefactual ones.
The coupled approach addresses this by jointly fitting genus-level and class-level networks and allowing each to inform the other. The intuition is straightforward: if two classes co-occur consistently across samples, genera that belong to those classes are more likely to also co-occur — and vice versa. Formally, a cross-level consistency penalty is added to the joint estimation objective. Genus-level signals that sit between classes with a stronger aggregate signal receive a reduced penalty, making them easier to select, and genus-level signals that cross a class boundary with weaker class-level signal receive an increased penalty, suppressing associations that lack class-level support. The two levels are fitted simultaneously, so the class-level network is itself shaped by accumulated evidence from the genus level and vice-versa. This information borrowing allows the coupled model to detect associations at both levels that are below independently specified intra-level detection thresholds. Simulation has been used to confirm that the method does lead to improved recovery on synthetic datasets in which ground truth is known.
Following network estimation, edge diagnostics — scatter plots of CLR values for each inferred edge — were inspected visually to confirm that selected edges reflect genuine co-variation across samples. The majority of edges selected by the coupled approach passed that inspection. (The same was not true of uncoupled graphical lasso fits at each taxonomic level, which instead tended to select co-absent taxa.)
Network structure
The coupled network at the class level comprises 26 edges across 15 connected classes, while the genus-level network includes 22 edges. The combined network layout below shows both levels simultaneously. Solid blue edges are inferred edges (present in the stability-selected network); dashed orange borders indicate discrepancies between levels — class pairs that share at least one cross-class genus edge in the genus network but that did not themselves reach the selection threshold at the class level.
Owing to the coupling, there is very little upward-discrepancy (dashed orange edges) – only two genus-level cross-class connections are not reaffirmed at the class level. However, the genus-level network is notably sparse and most genus-level connections involve Dinophyceae. Interestingly, as a result there is substantial downward discrepancy – the heatmap shows many inferred class-level interactions that are not resolved at the genus level. However, the class-level edges passed diagnostic validation and do show real co-occurrence; the most likely explanation for the discrepancy is simply that the genus-level data are too noisy to resolve aggregate interactions.
Influential taxa
Degree centrality — the number of edges incident to a node — is a summary statistic that identifies the most connected taxa in a network. At the class level, Mediophyceae and Filosa-Imbricatea are the most connected; at the genus level, Akashiwo and Tripos (both dinoflagellates) and Prorocentrum (a harmful algal bloom genus) are the most connected.
| Class | Degree |
|---|---|
| Mediophyceae | 7 |
| Filosa-Imbricatea | 6 |
| Bacillariophyceae | 5 |
| Bicoecea | 5 |
| Dinophyceae | 5 |
| Filosa-Thecofilosea | 5 |
| Opalozoa | 4 |
| Spirotrichea | 3 |
| Syndiniales | 3 |
| Chrysophyceae | 2 |
| Gregarinomorphea | 2 |
| Oligohymenophorea | 2 |
| Bolidophyceae | 1 |
| Pelagophyceae | 1 |
| Prostomatea | 1 |
| Genus | Degree |
|---|---|
| Akashiwo | 8 |
| Tripos | 7 |
| Prorocentrum | 5 |
| Thalassiosira | 4 |
| Cryothecomonas | 3 |
| Minidiscus | 3 |
| Ansanella | 2 |
| Chaetoceros | 2 |
| Ebria | 2 |
| Lepidodinium | 2 |
Taxonomic interactions
The five strongest class-level edges, ranked by absolute partial correlation, span a mix of positive (co-occurrence) and negative (exclusion) associations.
| Edge ID | Class A | Class B | Partial correlation |
|---|---|---|---|
| 1 | Filosa-Imbricatea | Spirotrichea | -0.382 |
| 2 | Bicoecea | Dinophyceae | 0.338 |
| 3 | Bicoecea | Mediophyceae | -0.330 |
| 4 | Chrysophyceae | Syndiniales | 0.279 |
| 5 | Dinophyceae | Filosa-Imbricatea | -0.276 |
The strongest estimated interaction is a negative partial correlation between Filosa-Imbricatea and Spirotrichea, suggesting competitive exclusion or top-down control between two sub-groups. The second-strongest estimated interaction is a positive association between Bicoecea and Dinophyceae (dinoflagellates), suggesting co-occurrence.
To understand which genera drive each class-level edge, we decompose each class node into its genus-level constituents and quantify each genus’ contribution to the aggregate signal via the covariance between the genus signal and the aggregate (class) signal, reported as fraction of the class-level variance. Values of this metric near 1 indicate the genus closely tracks the class node; values near 0 indicate the genus adds little to the class signal; negative values indicate the genus co-varies inversely with the rest of the class.
The top five contributing genera per node are shown below for each of the five strongest class-level edges.
Edge 1: Filosa-Imbricatea — Spirotrichea (partial correlation = -0.382; negative association)
| Class | Genus | ASVs | Variance proportion |
|---|---|---|---|
| Filosa-Imbricatea | Thaumatomonas | 4 | 0.348 |
| Filosa-Imbricatea | Paulinella | 5 | 0.198 |
| Filosa-Imbricatea | Peregrinia | 2 | 0.027 |
| Spirotrichea | Pelagostrobilidium | 6 | 1.044 |
| Spirotrichea | Eutintinnus | 5 | 0.387 |
| Spirotrichea | Rimostrombidium | 3 | 0.359 |
| Spirotrichea | Strombidinopsis | 7 | 0.329 |
| Spirotrichea | Tintinnopsis_05 | 2 | 0.212 |
Edge 2: Bicoecea — Dinophyceae (partial correlation = 0.338; positive association)
| Class | Genus | ASVs | Variance proportion |
|---|---|---|---|
| Bicoecea | Bicosoeca | 5 | 0.137 |
| Bicoecea | Caecitellus | 2 | 0.014 |
| Dinophyceae | Gyrodinium | 120 | 0.859 |
| Dinophyceae | Tripos | 83 | 0.818 |
| Dinophyceae | Dinophysis | 7 | 0.812 |
| Dinophyceae | Prorocentrum | 25 | 0.701 |
| Dinophyceae | Ansanella | 5 | 0.626 |
Edge 3: Bicoecea — Mediophyceae (partial correlation = -0.330; negative association)
| Class | Genus | ASVs | Variance proportion |
|---|---|---|---|
| Bicoecea | Bicosoeca | 5 | 0.137 |
| Bicoecea | Caecitellus | 2 | 0.014 |
| Mediophyceae | Thalassiosira | 34 | 1.015 |
| Mediophyceae | Chaetoceros | 46 | 0.808 |
| Mediophyceae | Minidiscus | 11 | 0.745 |
| Mediophyceae | Skeletonema | 3 | 0.184 |
| Mediophyceae | Leptocylindrus | 1 | 0.065 |
Edge 4: Chrysophyceae — Syndiniales (partial correlation = 0.279; positive association)
| Class | Genus | ASVs | Variance proportion |
|---|---|---|---|
| Chrysophyceae | Paraphysomonas | 16 | 0.583 |
| Syndiniales | Euduboscquella | 9 | 0.433 |
| Syndiniales | Syndinium | 6 | 0.431 |
| Syndiniales | Hematodinium | 5 | 0.264 |
Edge 5: Dinophyceae — Filosa-Imbricatea (partial correlation = -0.276; negative association)
| Class | Genus | ASVs | Variance proportion |
|---|---|---|---|
| Dinophyceae | Gyrodinium | 120 | 0.859 |
| Dinophyceae | Tripos | 83 | 0.818 |
| Dinophyceae | Dinophysis | 7 | 0.812 |
| Dinophyceae | Prorocentrum | 25 | 0.701 |
| Dinophyceae | Ansanella | 5 | 0.626 |
| Filosa-Imbricatea | Thaumatomonas | 4 | 0.348 |
| Filosa-Imbricatea | Paulinella | 5 | 0.198 |
| Filosa-Imbricatea | Peregrinia | 2 | 0.027 |
Several patterns stand out. Within Spirotrichea, Pelagostrobilidium and Eutintinnus dominate the class signal, with variance proportions indicating they closely track the class-level abundance. Dinophyceae signal is spread more broadly, with Gyrodinium, Tripos, Dinophysis, and Prorocentrum all contributing substantially — notably, several of these are known harmful algal bloom (HAB) genera. Mediophyceae is driven primarily by Thalassiosira and Chaetoceros, both abundant diatom genera in coastal systems.
Network coherence
A natural question is whether the inferred network structure is actually expressed in any given sample. A sample where all co-occurring pairs are positively associated and all exclusion pairs are negatively expressed is behaving “as expected” by the network — it is coherent with the inferred structure. A sample where associations are scrambled is not.
We measure this with a coherence t-statistic computed per sample. For each sample, we form the vector of sign-weighted edge products: for each network edge (A, B), we compute sign(ρAB) × CLR(A) × CLR(B), where ρAB is the partial correlation. This sign-weighting aligns the product with the expected direction — positive contributions come from edges where co-occurring taxa are indeed co-occurring, and from exclusion pairs that are indeed negatively expressed. The coherence t-statistic is:
\[t = \sqrt{n_\text{edges}} \cdot \frac{\bar{x}}{s_x}\]
where \(\bar{x}\) and \(s_x\) are the mean and standard deviation of the edge product vector across edges. Under random expression with respect to the network, \(t \approx N(0, 1)\) for \(n_\text{edges} \geq 20\). Values of \(|t| > 2\) indicate that the community is expressing the network structure significantly beyond chance. Importantly, the statistic is computed separately for the class-level network, the genus-level network, and both combined, allowing comparison across levels.
Coherence is generally positive across samples and time points, indicating that the inferred network is meaningfully expressed in the data rather than being an artefact of the estimation procedure. The class-level and total coherence tracks are broadly similar, while genus-level coherence is noisier — expected given the sparser genus-level network and the greater compositional variability at finer taxonomic resolution.
In general, the coherence t-statistic provides a convenient sample-level summary of network activation that facilitates downstream analyses of community dynamics across environmental conditions without requiring a full multivariate accounting of the compositional data or disaggregated interactions along specific edges.
Potential downstream analyses
The analyses in this section are exploratory and should be treated as hypothesis-generating. With 47 samples and multiple simultaneous comparisons, effect sizes are uncertain and individual p-values should be interpreted with appropriate caution.
Environmental drivers
We can ask whether variation in network coherence tracks measurable environmental conditions, i.e., whether the inferred community is “activated” by particular conditions and thus whether the network captures ecologically meaningful structure.
Spearman correlations were computed between each coherence measure and four continuous environmental variables: depth, temperature, salinity, and chlorophyll. Conditioning on the estimated network, per-sample t-statistics are functions only of that sample’s CLR values, so standard hypothesis tests are applicable (with the large caveat that the unconditional dependence is very difficult to quantify so a high level of caution should be used when interpreting p-values).
| Network | Driver | ρ | p-value |
|---|---|---|---|
| class | depth | -0.128 | 0.390 |
| genus | depth | 0.081 | 0.587 |
| total | depth | -0.056 | 0.709 |
| class | temp | 0.192 | 0.202 |
| genus | temp | 0.034 | 0.820 |
| total | temp | 0.191 | 0.204 |
| class | salinity | 0.117 | 0.437 |
| genus | salinity | -0.235 | 0.116 |
| total | salinity | -0.043 | 0.778 |
| class | chlorophyll | 0.364 | 0.012 |
| genus | chlorophyll | 0.279 | 0.058 |
| total | chlorophyll | 0.450 | 0.002 |
Chlorophyll is the only variable with a notable and statistically credible association with network coherence. Depth, temperature, and salinity show no meaningful signal across network levels.
The chlorophyll–coherence association is not uniform across the water column. Stratifying by depth reveals that the relationship is essentially absent at the surface (1 m) and moderate-to-strong at 5 m and 9 m — though individual stratum sample sizes (n ≈ 15–17) limit the precision of these estimates.
| Depth (m) | Class | Genus | Total | n |
|---|---|---|---|---|
| 1 | rho=0.07, p=0.801 | rho=0.21, p=0.417 | rho=0.23, p=0.374 | 17 |
| 5 | rho=0.42, p=0.121 | rho=0.45, p=0.092 | rho=0.50, p=0.058 | 15 |
| 9 | rho=0.43, p=0.108 | rho=0.20, p=0.467 | rho=0.56, p=0.031 | 15 |
One interpretation of the surface anomaly is that near-surface communities are subject to greater physical variability (mixing) that decouples compositional patterns from chlorophyll.
Seasonal variation
| Network | KW statistic | p-value |
|---|---|---|
| class | 9.981 | 0.019 |
| genus | 7.505 | 0.057 |
| total | 13.052 | 0.005 |
| Season | Class | Genus | Total | N |
|---|---|---|---|---|
| spring | 1.43 | 1.04 | 1.53 | 11 |
| summer | 2.34 | 2.87 | 3.70 | 11 |
| fall | 1.75 | 3.31 | 3.60 | 13 |
| winter | 1.05 | 2.33 | 2.61 | 12 |
Total and class-level coherence differ significantly across seasons (Kruskal-Wallis, p = 0.005 for total coherence). Summer and fall show the highest median coherence, consistent with bloom conditions; winter is the lowest. Season explains only approximately 22% of the variance in chlorophyll, which suggests that the two variables share signal but are far from redundant.
The results above, as mentioned at the top of the document, are exploratory in nature and intended to illustrate potential applications of the inferred networks for identifying potential ecological structure in the data.