Introduction

The topic of this blog post is negative edges in psychological networks. Negative edges are quite the mystery in substantive applications, because there is a long tradition in psychology of expecting only positive relations, termed a “positive manifold” (e.g., Horn and Cattell 1966). This expectation is also an important aspect of network theory, for example, “[symptoms do] not function as protective factors in the development of other symptoms” (p. 6, Borsboom et al. 2011). Hence, all of the edges in a given network are often expected to be positive.

That said, I never quite understood why anyone would actually be concerned by seeing negative (red) edges in a network plot (after all, is it really good practice to read too much into a plot of point estimates?). My initial apathy towards negative edges was because, even if all true non-zero relations are positive, we would still expect to see some red edges in a network plot. But from now working with substantive researchers, it is clear that I underappreciated how concerning red edges are: Is the method working incorrectly? Has some mistake been made? Are the effects flipping direction?

Although these are important concerns, I would typically be more concerned if there were not any red edges:

This is paradoxical and related to natural sampling variability. This variability is required to compute valid1 confidence intervals and \(p\)-values. This means that methods (e.g., (g)lasso) that reduce variance in the sampling distribution cannot provide actual confidence intervals.

In other words, to make inference about an effect we need the sampling variability. In turn, this requires (1) embracing red edges; or (2) incorporating the expectation of a positive manifold into the network. To clarify these points, I decided to communicate the idea of natural/expected sampling variability and how this can be harnessed to improve statistical inference in partial correlation networks.

This blog is organized as follows:

Together, this provides:

  1. A foundation for thinking about negative edges in networks and, more generally, sampling variability.

  2. A viable strategy that can address the “issue” of red edges completely with non-regularized methods based on null hypothesis significance testing (dare I suggest NHST is useful).

A frequentist approach

It is possible to estimate the network in a straightforward fashion. Indeed, once the partial correlations are obtained, knowing n and p is all that is required to determine the non-zero effects. I do not present equations (too lazy! but see our papers) and instead opt for R code in certain places:2

# these packages are needed to reproduce 
library(BGGM)
library(GGMnonreg)
library(qgraph)
library(ggplot2)
library(BDgraph)
library(dplyr)

# data
Y <- BGGM::ptsd

# number of nodes
p <- ncol(Y)

# adjacency matrix
adj <- matrix(0, p, p)

# number of variables conditioned on
c <- p - 2

# sample size
n <- nrow(Y)

# alpha level
alpha <- 0.05

In the above, we have the data (PTSD symptoms, Armour et al. 2017), n, p, c (the number of variables conditioned on), and a defined \(\alpha\) (type I error) level. Note that this is the exact same procedure used for determining the statistical significance of bivariate correlations. The only difference is the addition of c, because no variables are conditioned on for bivariate as opposed to partial correlations. Next we simply compute the partial correlations, compute the test statistics, and then determine which \(p\)-values are less than alpha (\(\alpha\)):

# covariance matrix
covariance_matrix <- cov(Y)

# inverse of the covariance matrix
precision_matrix <- solve(covariance_matrix)

# partial correlation matrix
pcor_matrix <- -(cov2cor(precision_matrix) - diag(p))

# Fisher z transformation (r to z)
z <- atanh(pcor_matrix[upper.tri(pcor_matrix)])

# test statistic
z_stat <- abs(z) / (1 / sqrt(n - c - 3))

# compute two-sided p-value
pvalues <- pnorm(z_stat, lower.tail = FALSE) * 2

# significant effects (the edges)
adj[upper.tri(adj)] <- ifelse(pvalues < alpha, 1, 0)

# weighted adjacency matrix (symmetrize the selected upper triangle)
pcor_adj <- adj * pcor_matrix
pcor_adj[lower.tri(pcor_adj)] <- t(pcor_adj)[lower.tri(pcor_adj)]

This is essentially material taught in introductory methods: we simply computed the significance of a correlation, but with c included in the standard error, (1 / sqrt(n - c - 3)), which forms the denominator of the test statistic. For controlling false positives or false discoveries, this is the gold standard approach for situations common to psychology (\(p < n\))!
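
To see that this really is the same machinery used for a bivariate correlation, here is a minimal sketch with \(c = 0\) (nothing conditioned on); the object names are just for illustration, and the result closely approximates the \(t\)-based \(p\)-value from cor.test():

# the same test with c = 0, i.e., an ordinary bivariate correlation
r_biv  <- cor(Y[, 1], Y[, 2])
z_biv  <- atanh(r_biv)                 # Fisher z transformation
se_biv <- 1 / sqrt(nrow(Y) - 0 - 3)    # c = 0: nothing is conditioned on
2 * pnorm(abs(z_biv) / se_biv, lower.tail = FALSE)

# compare with the t-based p-value from base R
cor.test(Y[, 1], Y[, 2])$p.value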

It is then customary to plot the results:

qgraph::qgraph(pcor_adj)

This immediately shows the issue of negative (red) edges: they jump right off the graph, considering they are quite unexpected. How can this be, given that the symptoms belong to a scale that was constructed to have positive associations (the bivariate correlations are all positive)? Furthermore, their presence contradicts strong theory in network psychometrics.

Is there something wrong with the gold standard approach?

The answer is no! In fact, about three red edges are expected.

Expected red edges

Buckle up as we journey into the trenches of frequentist logic. First, I emphasize that I am “Bayesian,” but I find frequentist logic quite useful for thinking about sampling variability. For example, to figure out if this number of red edges is cause for concern, we only need to make an assumption about the overall sparsity. Hence, with

  • The sample size n
  • The number of nodes p
  • A defined false positive rate \(\alpha\)
  • Assumed sparsity \(\pi\),

we can actually figure out (exactly) how many red edges there will be. This is because, by assuming a level of sparsity, we know (on average) how many false positives there will be (and half of these will be negative).

To see this, assume that \(p = 20\), which results in \(\frac{1}{2}p(p-1) = 190\) effects in total. Now if we assume that \(\pi = 0.50\), this means that there are \(190 \cdot 0.5 = 95\) true zeros in the network (50% sparsity). And because we have a defined false positive rate, \(\alpha\), we then know how many false positives we will have in a network, that is, \(95 \cdot 0.05 = 4.75\). Obviously there cannot be 4.75 false positives in a given network, which gets back to a key idea of frequentism. Namely, if we repeatedly sampled a population (analogous to a simulation), then, on average, there would indeed be 4.75 false positives. Because there is no reason to assume that false positives are in a certain direction, \(\frac{1}{2} \cdot 4.75 = 2.375\) is the number of expected negative edges. This maps closely to the plot above, where there are 3 red edges. This indicates that the method is not deficient and there is no error. The “problem”, as I show below, is that theory was not incorporated into the analysis.
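
That arithmetic is easy to check in R; here is a quick sketch of the exact numbers used above:

# expected number of red edges for p = 20, 50% sparsity, alpha = 0.05
edges       <- 0.5 * 20 * (20 - 1)    # 190 effects
true_zeros  <- edges * 0.50           # 95 true (zero) relations
expected_fp <- true_zeros * 0.05      # 4.75 false positives (on average)
expected_fp / 2                       # 2.375 expected negative (red) edges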

I made the following plot to show how many false positives are expected for different network sizes and assumed sparsity levels (simulation is not needed):

# number of variables
p <- c(10, 25, 50)
# sparsity
sparsity <-  seq(0.1, 0.9, 0.1)
# number of edges
partials <- 0.5*p*(p-1)
# alpha level
alpha <- 0.05

# simple multiplication
res <- data.frame(FP = c(sparsity * partials[1]  * alpha, 
                         sparsity * partials[2]  * alpha,
                         sparsity * partials[3]  * alpha))
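
The figure itself was generated elsewhere; as a minimal sketch, one way such a plot could be made from res is to first label each row with its sparsity level and network size (plot_df is just an illustrative name):

# label each row of res with its sparsity level and network size, then plot
plot_df <- data.frame(res,
                      sparsity = rep(sparsity, times = length(p)),
                      p = rep(p, each = length(sparsity)))

ggplot(plot_df, aes(x = sparsity, y = FP, color = factor(p))) +
  geom_line() +
  labs(x = "Sparsity", y = "Expected false positives", color = "Nodes (p)")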

The take home from this plot (of the object res) is that the number of expected red edges is a function of how many true zeros there are. Hence, when we have larger networks, we expect many red edges. For example, with 50% sparsity and 50 nodes, we would expect to see about 15 red edges in the plot (even though the true network has only positive relations). This number becomes larger as sparsity increases. In my experience, psychological networks are not all that sparse, but we would still expect to see red edges. Note also that there would be more (fewer) red edges for a larger (smaller) \(\alpha\) level.

Before proceeding to show that we can also determine the size of the negative edges, it is important to consider the variability we might expect around the long run average. Here I used simulation.

First, it is informative to note that the results are essentially the same as the Figure titled Calculator (compare the lines between plots). The ribbon captures the minimum and maximum number of false positives. For example, with 50% sparsity and 50 nodes, there could be 5 or 25 red edges in a given graph. I emphasize that a key to frequentist inference is thinking in terms of a long run average (the line). Although some see this as a limitation (e.g., “in the long run we are all dead”, link), I think this perspective provides valuable information. In this case, by going deep into frequentist probability it was possible to use simple multiplication to calculate the expected number of red edges. Pretty cool!
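
For some intuition about that spread without running the full simulation, here is a rough back-of-the-envelope sketch: treating each of the roughly 612 true zeros in the 50-node, 50% sparsity case (half of 1225 effects) as an approximately independent test at \(\alpha = 0.05\), the number of false positives behaves like a binomial draw, and about half of those will be red:

# approximate the spread in false positives with a binomial draw
# (the tests are not exactly independent, so this is only a rough guide)
set.seed(1)
fp_draws <- rbinom(10000, size = 612, prob = 0.05)

mean(fp_draws) / 2     # long-run average of red edges (about 15)
range(fp_draws) / 2    # the kind of spread the ribbon captures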

Expected red edge size

Another concern that has been raised about non-regularized methods is the size of the red edges, as they can be quite large. This is because, when using NHST, the effects must be “significant” to be included in the graph. This also has a simple solution: given \(n\) and \(p\), we can compute the minimum size of a red edge.

To see this, consider that using \(\alpha = 0.05\) translates into the partial correlation needing to be 1.96 standard errors (\(SE\)) away from zero to be “significant”. This means that an effect must satisfy \(|z_{ij}| > 1.96 \cdot SE\). Because the standard error is simply \(1 / \sqrt{n - c - 3}\), we know exactly the minimum edge weight of a red edge. Note that this is on the Fisher z scale; when back-transforming, the values will be slightly smaller.

Here is an example with \(p = 20\). I have also run a simulation, from which I collected all of the false positives that were detected. Here is the simple solution for the lower bound (in absolute size) of a negative edge.

# sample sizes
n <- seq(50, 1000, 50)

# number of variables conditioned on (p = 20)
c <- 20 - 2

# analytic: minimum |partial correlation| for a "significant" edge,
# back-transformed from the Fisher z scale
res1 <- data.frame(r = tanh((1 / sqrt(n - c - 3)) * 1.96), 
                   n = n)
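
The simulated points in the figure were collected elsewhere; a minimal sketch of just the analytic (“calculator”) line from res1 might look like this:

# the analytic ("calculator") lower bound for a red edge across sample sizes
ggplot(res1, aes(x = n, y = r)) +
  geom_line(color = "red") +
  labs(x = "Sample size (n)", y = "Minimum |partial correlation|")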

The take home here is that red edges will be quite large with small sample sizes. Indeed, the absolute minimum for a red edge with \(n = 50\) is larger than 0.30! This is because the standard error is large, which, in turn, requires a false positive to be quite large to be “significant” (included in the graph). This quickly dissipates as \(n\) increases. The simulated results (the points) match the solution based on a “calculator” (the red line corresponding to the object res1). And, no, this is not magic: the basic ideas are derived from the very foundations of statistical inference (in this case of the frequentist variety).

Dealing with red edges

At this point, it may seem a bit unsatisfying that I have simply reaffirmed what some might already know (at least those who have tried non-regularized methods). The point was that seeing red edges is not indicative of a problem with the method; instead, the underlying “issue” is not incorporating theory into the analysis.

In this section, I show how to completely solve the red edge conundrum by incorporating the theoretical expectation of a positive manifold into the network. This is done with a one-sided hypothesis test for positive edges (note positive_manifold = TRUE):

# incorporate theory
fit <-  GGMnonreg::GGM_fisher_z(Y, 
                                alpha = 0.05, 
                                positive_manifold = TRUE)
# add column names
colnames(fit$pcor_selected) <- colnames(Y)
# plot
qgraph::qgraph(fit$pcor_selected)

This is the same data that was used previously (there were three red edges). Now there are no red edges in the network: this is a direct result of incorporating the theoretical expectation of positive relations into the analysis. Hence, the pressing concern of negative edges has been completely solved using frequentist methods.
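
For intuition only, here is a minimal sketch of what such a one-sided test amounts to, reusing the partial correlation matrix computed earlier (this is not the internals of GGMnonreg, just the underlying logic; the object names are illustrative):

# one-sided test: H1 is that the partial correlation is positive
n_obs  <- nrow(Y)
p_vars <- ncol(Y)
z_signed   <- atanh(pcor_matrix[upper.tri(pcor_matrix)])
p_onesided <- pnorm(z_signed * sqrt(n_obs - (p_vars - 2) - 3), lower.tail = FALSE)

# only positive partial correlations can now be "significant"
adj_pos <- matrix(0, p_vars, p_vars)
adj_pos[upper.tri(adj_pos)] <- ifelse(p_onesided < 0.05, 1, 0)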

What about glasso?

I did not spend much time talking about regularization. This is because I have quite a few papers now critiquing the widespread use of (g)lasso for network analysis and do not want to belabor the point. I will, however, elaborate on something I stated earlier regarding glasso. Computing \(p\)-values and confidence intervals requires sampling variance, whereas (g)lasso introduces bias and reduces variance. Hence, even the bootstrap will not work for constructing confidence intervals. I will not demonstrate this here, as it is well known in the statistical literature (pp. 7-8, Bühlmann, Kalisch, and Meier 2014):

The (limiting) distribution of such a sparse estimator is non-Gaussian with point mass at zero, and this is the reason why standard bootstrap or sub-sampling techniques do not provide valid confidence regions or \(p\)-values.

Thus the “confidence” intervals commonly computed from (g)lasso are not confidence intervals (i.e., they do not cover the true value, say, for a 95 % interval, 95 % of the time).

On the other hand, the variance reduction and bias towards zero are what make the graphs seem more “interpretable” when using glasso. This is because the false positives have a narrow sampling distribution around zero. This translates into red edges being quite small and not all that apparent in a plot, for example:

# glasso with EBIC tuning (n = 221 for the PTSD data)
fit <- qgraph::EBICglasso(cor(Y), n = nrow(Y))
qgraph::qgraph(fit)

Here the red edges are far less pronounced than in the non-regularized method above. However, the graph still raises the concern of red edges, because they are there. A naive perspective would be to think (g)lasso has magical powers, for example, that surviving the spell of \(\ell_1\)-regularization suggests there really is a negative effect. I have seen this reasoning when reviewing papers and I think it is misguided (to say the least). But I digress. The important thing is that the very reason glasso has less pronounced negative edges (reducing variance, introducing bias, and inducing sparsity) is the very reason bootstrapping does not work correctly: a methodological quagmire indeed.

Error control

In this section, I demonstrate a key advantage of using non-regularized methods based on \(p\)-values: namely, that it is possible to control the error rate. No such thing is possible with (g)lasso. To understand this, it is important to know that commonly used measures for assessing method performance in simulations can be directly controlled by defining an \(\alpha\) level. For example,

This indicates that we can estimate a network that controls either. This does not actually require a simulation, as it is a fact. However, I will simulate to compare to glasso (using the default settings in the R package qgraph, Epskamp et al. 2012).

Specificity control

Specificity can be controlled by setting it to \(1 - \alpha\). Hence, for \(\alpha = 0.05\) this corresponds to specificity of 0.95, etc.

I simulated networks that included 20 nodes and 50% sparsity, with partial correlations mostly between -0.4 and 0.4 (see Tables 1 and 2 in Wysocki and Rhemtulla 2019). Further, I looked at four \(\alpha\) levels, 0.50, 0.25, 0.10, and 0.01, that correspond to specificity of 0.50, 0.75, 0.90, and 0.99. The scores were averaged across 250 simulated datasets.
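
To make the setup concrete, here is a rough sketch of what a single replication could look like, assuming BDgraph::bdgraph.sim for data generation and the GGMnonreg::GGM_fisher_z call used above; this is illustrative, not the exact simulation code:

# one replication: simulate a 20-node network with ~50% sparsity,
# estimate it with the NHST approach, and compute specificity
set.seed(1)
sim <- BDgraph::bdgraph.sim(n = 500, p = 20, graph = "random", prob = 0.5)

fit_sim <- GGMnonreg::GGM_fisher_z(sim$data, alpha = 0.05)
est_adj <- ifelse(fit_sim$pcor_selected != 0, 1, 0)

# specificity: proportion of true zeros estimated as zero (should be near 0.95)
true_zero <- sim$G[upper.tri(sim$G)] == 0
mean(est_adj[upper.tri(est_adj)][true_zero] == 0)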

I have intentionally highlighted glasso with the black line. A common misconception in the network literature is that (g)lasso reduces the chances of making a false positive and that it converges on the true model. Both are incorrect. This is well known in the statistics literature: statisticians have explicitly stated that lasso should not be used for model selection (p. 278, Tibshirani 2011):

The lasso is doing variable screening and, hence, I suggest that we interpret the second ‘s’ in lasso as ‘screening’ rather than ‘selection’…Once we have the screening property, the task is to remove the false positive selections…

This was stated by Dr. Peter Bühlmann in the discussion of the paper, “Regression shrinkage and selection via the lasso: A retrospective” (link to paper). This sentiment can be seen in the plot, as specificity actually decreases with larger \(n\). Said another way, the false positive rate actually increases with more data!

Now focus on the NHST methods. Here we see perfectly horizontal lines at the desired \(1 - \alpha\) level. This is because when using NHST we have formal error control. Hence, we can calibrate the network to the desired level of specificity. Note that sensitivity is essentially “power.” The light blue line makes clear that power is about the same as glasso but specificity is much higher. Furthermore, if the goal is in fact to limit false positives, this can be achieved by setting \(\alpha = 0.01\) (i.e., specificity of 99 %).

Side note:

I think there is much confusion surrounding the term “specificity” in the network literature. When talking about detecting effects (as opposed to diagnosing patients), specificity is the same as \(1 - \alpha\). Now, when suggesting that specificity is 75%, this might sound rather impressive. But, in fact, this corresponds to a false positive rate of 25%! That is clearly much higher than the conventional level of 5%.

Precision control

Note that precision is equal to 1 - FDR (false discovery rate). I personally do not like this metric for evaluating a method, because it also depends on the assumed sparsity, network size, edge size, etc. This is because the FDR is the expected proportion of detected effects that are false, which is obviously influenced by, say, how many effects there are in the true matrix to be detected. I included precision to demonstrate that it, too, can be controlled.

Here there are again horizontal lines. These are not perfectly at the desired level of precision. For example, setting \(\alpha = 0.50\) results in a precision of around 0.75 (or an FDR of 0.25). This is an FDR less than \(\alpha\), which reflects the fact that using an FDR correction results in the FDR being less than or equal to the chosen \(\alpha\) level. Hence, precision is indeed controlled in this simulation. We also see that precision decreases for glasso as \(n\) increases. This unusual and troubling result is due to (g)lasso only performing well when the true model is (very) sparse (which is hardly the case in psychology).
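
As a minimal sketch, one way to apply an FDR correction with the non-regularized approach is to adjust the \(p\)-values computed earlier using the Benjamini-Hochberg procedure in base R (the simulation may have used a different implementation):

# Benjamini-Hochberg adjustment of the p-values computed earlier
adj_fdr <- matrix(0, ncol(Y), ncol(Y))
adj_fdr[upper.tri(adj_fdr)] <- ifelse(p.adjust(pvalues, method = "BH") < 0.05, 1, 0)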

Conclusion

I hope this post provided some insights into thinking about sampling variability. Because networks do have (many) more effects than other models, considering sampling variability is that much more important:

We should not run from sampling variability (e.g., by using methods that reduce it). Rather, we should embrace variability with arms wide open.

I am not the first to suggest that we should think more about sampling variability in networks. Recently, this same logic has been applied to network replication (Jones, Williams, and McNally 2019; Williams 2020). Hence, thinking about sampling variability is useful both for replicability and for wrestling with negative edges. As I demonstrated, there is a simple approach for directly incorporating the theoretical expectation of a positive manifold into network analysis. This avoids searching for a method that “works correctly”, and, in doing so, it is possible to control either specificity or precision.

Some thoughts

We have written many papers arguing against the widespread use of (g)lasso. Yet, regularization is still more popular than ever, even though we have demonstrated that (1) it does not have a low false positive rate (see the plots in this blog as well; Williams et al. 2019; Williams and Rast 2019); and (2) it is not needed to mitigate overfitting (Williams and Rodriguez 2020). I now think (without evidence) that (g)lasso is used not because of (1) or (2) but because of how the results look in a plot3. This is not particularly rigorous and it is also unnecessary if the goal is to have a model that reflects a positive manifold. This theoretical expectation should simply be included in the analysis.

What about ordinal data?

I focused on continuous data because there are simple equations to work with.4 The same idea also applies to ordinal data with polychoric partial correlations. However, because the standard error is sometimes much larger for polychoric partial correlations, the (false positive) red edges will be larger as well. But this is easily addressed with the approach presented in this blog (see the GGMnonreg R package).

True models?

One obvious concern with this blog is that I used the word “exact” quite often. All of these computations assume repeated sampling from a population with a true network structure. In my view, it is science fiction to think we actually randomly sample a population, and that, in nature, there is a multivariate normal random number generator. However, I have zero doubt that assuming these things provides valuable information for thinking about sampling variability and inference.

Moreover, I want to emphasize that error control is for the procedure and not the model. This is essentially just like a simulation. Envision the necessary ingredients to reproduce a simulation in reality. This would require a sampling plan/procedure that includes deciding on the model, the number of variables, and a defined population, and not changing anything from random sample to random sample (e.g., everything is exactly the same for each iteration in a simulation). Here each random sample (iteration) is analogous to a Bernoulli trial that results in success or failure (determined by statistical significance). This is a useful fiction.

Bayesian edition

In a follow-up blog, I will describe a Bayesian approach for dealing with negative edges.

Take home

  • (g)lasso is not needed to ensure a plot looks palatable.

  • NHST allows for directly incorporating scientific expectations into network analysis. This completely solves the negative edge conundrum.

  • Red edges are expected, due to natural sampling variability, when theory is absent from the model. A case of “theoretical amnesia” (link):

…at least try to remember what theory is and what it’s good for [emphasis added], so that we don’t fall into theoretical amnesia (Borsboom, 2013).

  • NHST allows for formal error control (e.g., with actual confidence intervals), which is not possible with (g)lasso.5

References

Armour, Cherie, Eiko I. Fried, Marie K. Deserno, Jack Tsai, and Robert H. Pietrzak. 2017. “A network analysis of DSM-5 posttraumatic stress disorder symptoms and correlates in U.S. military veterans.” Journal of Anxiety Disorders 45 (May 2013): 49–59. https://doi.org/10.1016/j.janxdis.2016.11.008.

Borsboom, Denny, Angélique O. J. Cramer, Verena D. Schmittmann, Sacha Epskamp, and Lourens J. Waldorp. 2011. “The Small World of Psychopathology.” Edited by Rochelle E. Tractenberg. PLoS ONE 6 (11): e27407. https://doi.org/10.1371/journal.pone.0027407.

Bühlmann, Peter, Markus Kalisch, and Lukas Meier. 2014. “High-Dimensional Statistics with a View Toward Applications in Biology.” Annual Review of Statistics and Its Application 1 (1): 255–78. https://doi.org/10.1146/annurev-statistics-022513-115545.

Epskamp, Sacha, Angélique O. J. Cramer, Lourens J. Waldorp, Verena D. Schmittmann, and Denny Borsboom. 2012. “qgraph: Network Visualizations of Relationships in Psychometric Data.” Journal of Statistical Software 48 (4). https://doi.org/10.18637/jss.v048.i04.

Horn, John L, and Raymond B Cattell. 1966. “Refinement and test of the theory of fluid and crystallized general intelligences.” Journal of Educational Psychology 57 (5): 253.

Jones, Payton J., Donald R. Williams, and Richard J. McNally. 2019. “Sampling Variability is not Nonreplication: A Bayesian Reanalysis of Forbes, Wright, Markon, & Krueger.” PsyArXiv. https://doi.org/10.31234/OSF.IO/EGWFJ.

Tibshirani, Robert. 2011. “Regression shrinkage and selection via the lasso: A retrospective.” Journal of the Royal Statistical Society. Series B: Statistical Methodology 73 (3): 273–82. https://doi.org/10.1111/j.1467-9868.2011.00771.x.

Williams, Donald R. 2020. “Learning to Live with Sampling Variability: Expected Replicability in Partial Correlation Networks.” PsyArXiv. https://doi.org/10.31234/OSF.IO/FB4SA.

Williams, Donald R., and Philippe Rast. 2019. “Back to the basics: Rethinking partial correlation network methodology.” British Journal of Mathematical and Statistical Psychology. https://doi.org/10.1111/bmsp.12173.

Williams, Donald R., Mijke Rhemtulla, Anna C. Wysocki, and Philippe Rast. 2019. “On Nonregularized Estimation of Psychological Networks.” Multivariate Behavioral Research 54 (5): 1–23. https://doi.org/10.1080/00273171.2019.1575716.

Williams, Donald R., and Josue E. Rodriguez. 2020. “Why Overfitting is Not (Usually) a Problem in Partial Correlation Networks.” PsyArXiv. https://doi.org/10.31234/OSF.IO/8PR9B.

Wysocki, Anna C., and Mijke Rhemtulla. 2019. “On Penalty Parameter Selection for Estimating Network Models.” Multivariate Behavioral Research, 1–15. https://doi.org/10.1080/00273171.2019.1672516.

Zhang, Rong, Zhao Ren, and Wei Chen. 2018. “SILGGM: An extensive R package for efficient statistical inference in large-scale gene networks.” Edited by Manja Marz. PLoS Computational Biology 14 (August): e1006369. https://doi.org/10.1371/journal.pcbi.1006369.


  1. Valid refers to the definition of a confidence interval and \(p\)-value. See Wikipedia: (https://en.wikipedia.org/wiki/Confidence_interval) (https://en.wikipedia.org/wiki/P-value#Distribution)

  2. Most code has been omitted because it was making the blog hard to follow. The .Rmd file is provided at my github.

  3. And do not get me started on the notion of making “inference” from a plot :-)

  4. There are also rather simple equations for Spearman and Kendall rank correlations. The standard errors are a bit larger, which translates into larger red edges in the graph.

  5. More recent lasso-based methods can construct confidence intervals (but not with the bootstrap). This is accomplished by removing the bias and sparsity, which restores the sampling variance needed for the correct properties. These methods are not needed in psychology (e.g., Zhang, Ren, and Chen 2018), because we already have confidence intervals with the correct properties.