Characterizing degree sequences of protein interaction networks (possibly as power laws)

Introduction

In the ongoing effort to understand the topology of network models of biomolecular systems, a great deal of attention has centered on the possible power-law behavior of networks’ degree sequences and the implications for their underlying structure.¹

Briefly, the degree sequence of a network model is just the sequence of degrees of its nodes, i.e. the number of neighboring nodes to which each node is linked. Here, as in most places, degree sequences will be understood to proceed in non-increasing order. Such a degree sequence, or any sequence $(x_1\geq x_2\geq x_3\geq\cdots\geq x_n)$ of positive integers, is said to exhibit power-law behavior if it roughly satisfies the scaling relationship $k\propto {x_k}^{-\alpha+1}$ for some scaling parameter $\alpha$. In this case, the corresponding frequency distribution satisfies the relationship $p_x\propto x^{-\alpha}$.² (Both of these relationships are “power laws”.) In most applications, this power-law behavior will only manifest in the “heavy tail” of the distribution, when $k$ is small or when $x$ is large, e.g. above some lower bound $x_{\sf min}$. We therefore talk about such sequences exhibiting power-law behavior when \[p_x=Cx^{-\alpha}\ \ \ \ \text{for}\ \ x\geq x_{\sf min}\text{.}\]

Whereas the emergence of power laws is peculiar to phase transitions in statistical physics, widespread power-law characterizations of biomolecular networks have generated the impression that these network structures are tightly constrained by biological processes. Furthermore, power-law or “scaling” degree sequences are widely associated with a constellation of topological properties that have been assimilated into the term “scale-free”, leading many researchers to infer these properties from degree sequence analyses.³

As it turns out, there is vast topological diversity among graphs with scaling degree sequences, with respect both to their architecture (such graphs may be hierarchical and self-similar or distributed and scale-rich) and to their capacity to channel flows (they may be optimized for low and for high efficiency).⁴ Moreover, power law probability distributions are stable under several mathematical transformations (including linear combinations),⁵ and power-law–distributed data may be generated from a variety of simple processes.⁶ This calls into question almost all of the causal and structural inferences made about graphs based on scaling degree sequences (the exception being the existence of highly-connected hubs).

Even then, a crucial problem remains: The most common evidence used to conclude that a dataset follows a power-law distribution is the appearance of linearity on a doubly-logarithmic plot, sometimes aided by linear regression,⁷ from which the fitted slope provides an estimate of the scaling parameter. This procedure neither reliably distinguishes power laws from other behavior⁸ nor provides accurate estimates.⁹ A systematic assessment of 24 purported power-law relationships in several domains found that only one was unambiguously better-described by a power law than by either of two alternatives (exponential and log-normal).¹⁰

This tutorial walks through the procedural steps described by Clauset, Shalizi, and Newman (hereafter “CSN”) for recognizing, describing, and testing power law relationships in data, using Gillespie’s poweRlaw package in R. We take as an example a protein interaction network maintained at the Reactome database.

Data collection

Wrangling

We begin by using the rvest package to see the available lists of protein–protein interaction pairs, by listing all of the hyperlinks at the downloads webpage:

library(rvest)
net_list <- "http://www.reactome.org/download/all_interactions.html"
nets <-
  read_html(net_list) %>%
  html_nodes("a")
net_names <- html_text(nets)
print(net_names)

##  [1] "Homo sapiens protein-protein interaction pairs"                  
##  [2] "Arabidopsis thaliana protein-protein interaction pairs"          
##  [3] "Bos Taurus protein-protein interaction pairs"                    
##  [4] "Caenorhabditis elegans protein-protein interaction pairs"        
##  [5] "Canis Familiaris protein-protein interaction pairs"              
##  [6] "Danio Rerio protein-protein interaction pairs"                   
##  [7] "Drosophila melanogaster protein-protein interaction pairs"       
##  [8] "Gallus gallus protein-protein interaction pairs"                 
##  [9] "Human immunodeficiency virus 1 protein-protein interaction pairs"
## [10] "Influenza a virus protein-protein interaction pairs"             
## [11] "Mus musculus protein-protein interaction pairs"                  
## [12] "Mycobacterium tuberculosis protein-protein interaction pairs"    
## [13] "Oryza sativa protein-protein interaction pairs"                  
## [14] "Plasmodium falciparum protein-protein interaction pairs"         
## [15] "Rattus norvegicus protein-protein interaction pairs"             
## [16] "Saccharomyces cerevisiae protein-protein interaction pairs"      
## [17] "Schizosaccharomyces pombe protein-protein interaction pairs"     
## [18] "Sus Scrofa protein-protein interaction pairs"                    
## [19] "Taeniopygia Guttata protein-protein interaction pairs"           
## [20] "Xenopus Tropicalis protein-protein interaction pairs"            
## [21] "here"

I invite the reader to pick the dataset of their choice. For this exercise, i’ve selected S. cerevisiae. The following code chunks download the dataset and read the key columns (the two interacting proteins and the type of interaction) into the active R session as a data frame.

i <- grep("S.*cerevisiae", net_names)
net_file <- "network.gz"
net_dir <- gsub("(^.*/)[^/]+$", "\\1", net_list)
net_url <- paste0(net_dir, html_attr(nets[i], "href"))
download.file(net_url, destfile = net_file)

net <- read.table(gzfile(net_file), sep = "\t", header = FALSE,
                  skip = 1)[, c(1, 4, 7)]
names(net) <- c("Protein1", "Protein2", "Type")
knitr::kable(head(net))

Protein1	Protein2	Type
UniProt:A6ZMQ6	UniProt:P0CS91	neighbouring_reaction
UniProt:A6ZMQ6	UniProt:P0CS91	neighbouring_reaction
UniProt:B5VDL3	UniProt:B5VDL3	reaction
UniProt:B5VDL3	UniProt:P0CS91	neighbouring_reaction
UniProt:B5VDL3	UniProt:P0CS91	neighbouring_reaction
UniProt:O13516	UniProt:O13516	neighbouring_reaction

This dataset contains the following types: neighbouring_reaction, reaction, indirect_complex, direct_complex. They are described at the protein interaction dataset README. Depending on our objectives, we might want to restrict to a subset of these. In fact, for the purpose of illustration it may be helpful to have a smaller network whose degree sequence is less straightforward to rule out as any specific type of distribution. In the S. cerevisiae dataset, more than half (0.55) of the recorded interactions are neighboring reactions, whose pairs may be better understood in a graph-theoretic framework as a number of hops away from each other, according as the reactions that link them proceed. The following chunk restricts the data to other interaction types.

net <- net[net$Type != "neighbouring_reaction", ]

Pre-processing

Now that we have a somewhat cleaned edgelist, we can calculate the degree sequence of the graph it yields—importantly, without constructing a graph object using a package like network or igraph, although i encourage anyone interested in further exploring the topology of the network to familiarize themselves with at least one of those packages! For now, we instead simply calculate the number of times each protein appears as the source or target of a link (i.e., as the first or second member of a pair).¹¹ This list of frequencies, arranged in descending order, is the degree sequence.

sc_dt <- table(c(as.character(net$Protein1), as.character(net$Protein2)))
sc_ds <- as.vector(sc_dt)
names(sc_ds) <- names(sc_dt)
sc_ds <- sort(sc_ds, decreasing = TRUE)
print(sc_ds[1:5])

## UniProt:P05759 UniProt:P0CH08 UniProt:P0CH09 UniProt:P06839 UniProt:P32776 
##           1705           1646           1646           1367           1367

Analysis

Exploration

Now that we have a dataset suitable for a discrete power law approach, our first step is to visualize it. Size–frequency plots, which depict the discrete density function on log–log axes, are perhaps the most common way to visualize large discrete frequency data. However, size–rank plots, which instead depict the cumulative distribution (sideways and in reverse), are superior, for the purpose of recognizing power-law behavior or simply for comparing different datasets.¹² This is because size–frequency plots actually shrink the size of the data being visualized: instead of one point for each observation, as in a size–rank plot, we are left with one point for each distinct value. One consequence of this is that the “long tail” of high-value, low-frequency observations become messy and their progression more difficult to discern. Here are both types of plot for the S. cerevisiae protein interaction network:

par(mfrow = c(1, 2))
plot(x = sc_ds, y = 1:length(sc_ds), log = "xy",
     xlab = "Number of interaction partners",
     ylab = "Rank")
plot(x = unique(rev(sc_ds)), y = table(rev(sc_ds)), log = "xy",
     xlab = "Number of interaction partners",
     ylab = "Frequency")
par(mfrow = c(1, 1))
mtext(paste("Size-rank and size-frequency plots",
            "for the S. cerevisiae protein interaction network"),
      line = 1)

While the size–rank plot is clearly nonlinear, the size–frequency plot is, for me, too messy to confidently judge its linearity (with the exception of the handful of unexpectedly high-frequency numbers between 100 and 500). We could stop here; certainly the degree sequence as a whole violates power-law behavior. On the other hand, most power-law behavior, whether in the real world or among popular probability distributions, is confined to the tail of the distribution, after some threshold $x_{\sf min}$. It might be that the relatively high-degree nodes—more than 400 interaction partners, say—are indeed power-law–distributed. The rest of this section tests that hypothesis.

Model-fitting

Now comes the model-fitting process, where we rely on the procedural implementation in poweRlaw. Taking the degree sequence as input, we use displ$new() to create a discrete power-law distribution object with no pre-specified parameters.

library(poweRlaw)
sc_pl <- displ$new(dat = sc_ds)

The poweRlaw documentation for the inner workings of the key distributions is limited (execute ?displ to see an overview of the available distributions and some examples), but we can peek into its structure with the str() function:

print(str(sc_pl))

## Reference class 'displ' [package "poweRlaw"] with 5 fields
##  $ dat     : Named int [1:1048] 1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "names")= chr [1:1048] "UniProt:P11154" "UniProt:P11491" "UniProt:P13856" "UniProt:P14680" ...
##  $ internal:List of 9
##   ..$ freq   : int [1:170] 49 71 49 29 12 44 31 18 13 18 ...
##   ..$ values : num [1:170] 1 2 3 4 5 6 7 8 9 10 ...
##   ..$ cum_slx: num [1:170] 3517 3517 3468 3414 3374 ...
##   ..$ cum_n  : int [1:170] 1048 999 928 879 850 838 794 763 745 732 ...
##   ..$ dat    : Named int [1:1048] 1 1 1 1 1 1 1 1 1 1 ...
##   .. ..- attr(*, "names")= chr [1:1048] "UniProt:P11154" "UniProt:P11491" "UniProt:P13856" "UniProt:P14680" ...
##   ..$ xmin   : num 1
##   ..$ v      : int [1:2] 1 0
##   ..$ slx    : num 3517
##   ..$ n      : int 1048
##  $ xmin    : num 1
##  $ pars    : NULL
##  $ no_pars : num 1
##  and 24 methods, of which 10 are  possibly relevant:
##    getDat, getNo_pars, getPars, getXmin, initialize, mle, setDat,
##    setNo_pars, setPars, setXmin
## NULL

The distribution object has five fields:

the dataset dat (which we provided)
a list internal of transformations and summaries of this data, including the number of observations n, the distinct values values, their frequencies freq, their cumulative frequencies cum_n (in reverse), and the cumulative log-transformed values cum_slx (also in reverse, recorded only for each distinct value, to be used in MLE calculations)
the assignment xmin for the lower bound, currently 1 (in general, this is set to the smallest value in dat)
the assignment pars for the scaling parameter (plural for consistency with other distributions governed by more than one shape parameter), currently NULL
the number of shape parameters no_pars

The key step is to use the CSN procedure to estimate the unknown parameters from the data. I want to make one observation before starting the process: In poweRlaw, it is possible to estimate the scaling parameter $\alpha$ for any fixed value of $x_{\sf min}$, but the iterative process of estimating $x_{\sf min}$ involves calculating each corresponding value of $\alpha$. To check the current values assigned to xmin and pars at any time, execute sc_pl$xmin or sc_pl$pars. For more insight into the process, check out the poweRlaw vignettes by executing vignette(package = "poweRlaw").

Following CSN, we first estimate the lower bound:

xmin_est <- estimate_xmin(sc_pl)
sc_pl$setXmin(xmin_est)
writeLines(paste0(
  "Lower bound parameter is estimated at ", xmin_est$xmin, "\n",
  "Scaling parameter is estimated at ", round(xmin_est$pars, 2)
))

## Lower bound parameter is estimated at 20
## Scaling parameter is estimated at 1.59

The lower bound estimate, 20, is way lower than i would have expected from the plots. I’d therefore like to see how well that estimate holds up—that is, how robust it is to slight changes in the data. A curated protein interaction network, constructed as it is from a massive and growing scientific literature, cannot be treated as a complete catalogue of the reactions that make up an organism’s proteome. It is therefore important to do some kind of sensitivity analysis for any quantitative summary.

poweRlaw comes with a bootstrapping method for assessing the uncertainty of the parameter estimates from the data. The procedure, called by the bootstrap() function, generates a bunch of alternative datasets by sampling with replacement from the original dataset, then performs the same fitting procedures to estimate $x_{\sf min}$ and $\alpha$. This doesn’t make perfect sense for a degree sequence—a more appropriate sampling procedure might generate the kind of degree sequences that would come from randomly adding and removing nodes and links from the network—but it’s a quick and easy check that the fitting process itself is making sense. The following code chunk performs the bootstrap procedure with 1000 simulations.

set.seed(5356)
sc_pl_bs <- bootstrap(sc_pl,
                      no_of_sims = 1000, threads = parallel::detectCores())
print(str(sc_pl_bs))

## List of 6
##  $ gof            : num 0.102
##  $ bootstraps     :'data.frame': 1000 obs. of  4 variables:
##   ..$ gof  : num [1:1000] 0.1022 0.107 0.1064 0.0993 0.1118 ...
##   ..$ xmin : num [1:1000] 20 18 17 21 24 19 20 20 18 18 ...
##   ..$ pars : num [1:1000] 1.64 1.57 1.57 1.61 1.61 ...
##   ..$ ntail: num [1:1000] 586 628 608 597 571 645 582 594 630 632 ...
##  $ sim_time       : num 1.39
##  $ seed           : NULL
##  $ package_version:Classes 'package_version', 'numeric_version'  hidden list of 1
##   ..$ : int [1:3] 0 70 0
##  $ distance       : chr "ks"
##  - attr(*, "class")= chr "bs_xmin"
## NULL

The bootstrap result contains six fields including

distance, the goodness-of-fit distance measure used, in this case the default Kolmogorov–Smirnov measure
the goodness-of-fit distance gof for the original data and model
a data frame bootstraps containing, for each bootstrapped dataset, the GOF distance gof, the $x_{\sf min}$ estimate xmin, the $\alpha$ estimate pars, and the number ntail of observations in the tail, meaning greater than or equal to xmin

The most convenient way to evaluate the results is to watch the parameter estimates settle down as the runs accumulate. This is done with the plot() function (specialized to bootstrap objects). Each frame contains a parameter estimate or its estimated standard error (black) with 95% confidence bounds (red). Also shown are estimates and confidence bounds for the tail size ntail.

plot(sc_pl_bs, trim = .1)

The results are worrisome. While most samples produce behavior very similar to the original dataset, on rare occasions the fitting procedure yields a dramatically different model, with large estimates of $x_{\sf min}$ and $\alpha$ that skew the cumulative means. Below we plot the distributions of the parameter estimates directly:

par(mfrow = c(1, 3))
hist(sc_pl_bs$bootstraps$xmin, breaks = "fd",
     xlab = expression(paste("Estimates of ", x[min])), main = "")
abline(v = sc_pl$xmin, lty = 2, lwd = 3, col = "grey")
hist(sc_pl_bs$bootstraps$pars, breaks = "fd",
     xlab = expression(paste("Estimates of ", alpha)), main = "")
abline(v = sc_pl$pars, lty = 2, lwd = 3, col = "grey")
hist(sc_pl_bs$bootstraps$ntail, breaks = "fd",
     xlab = "Estimates of tail size", main = "")
abline(v = sum(sc_pl$dat >= sc_pl$xmin), lty = 2, lwd = 4, col = "grey")
par(mfrow = c(1, 1))
mtext("Histograms of bootstrapped estimates of each parameter", line = 1.5)

While the vast majority of bootstrapped samples produce estimates that congregate around those of the original dataset, a handfull of outliers produce wildly different results. This suggests that our estimates are highly sensitive to certain deviations in the data that, as i speculated above, could flip the best-fit model to a much larger choice of $x_{\sf min}$. We should therefore not put too much stock in the original parameter estimates.

Goodness of fit

The next and central question is whether the power-law model is a decent fit to the data in the first place. To answer this question, we can generate a p-value for our MLE model using a different poweRlaw bootstrapping procedure, called by bootstrap_p(). This function generates artificial samples from the combined data and model and assesses their distance from the MLE estimates using the Kolmogorov–Smirnov statistic. The artificial values greater than xmin are generated from the power-law model, whereas those smaller than xmin are generated by bootstrapping from the original dataset.

The bootstrap_p() function keeps track not only of the parameter estimates but also of the p-value estimate. The following code chunk performs the runs, reports the p-value, and plots the cumulative estimates. Since 1000 runs would take a while, this time we only perform 100. The power-law model–based half of the procedure is computationally intensive and may take several minutes to run, so for illustration purposes we perform only 250 simulations here.

set.seed(0600)
sc_pl_bsp <- bootstrap_p(sc_pl,
                         no_of_sims = 250, threads = parallel::detectCores())
plot(sc_pl_bsp)

The object produced by bootstrap_p() contains an additional field, the p-value, retrievable as sc_pl_bsp$p. As we might have expected from the highly nonlinear size–rank plot and the preceding sensitivity analysis, the S. cerevisiae degree sequence—more specifically, it’s tail $x\geq 20$—is ill-suited to a power-law model. Not a single bootstrapped sample was further from the theoretical distribution than the data itself, leaving us with an estimated p-value of 0.

Model comparison

While (the tail of) the S. cerevisiae degree sequence is discernibly different from the most likely power-law model, other heavy-tailed discrete distributions may prove more appropriate. The following code chunk fits three competing discrete model families to the S. cerevisiae degree sequence: Poisson, discrete exponential, and discrete log-normal. (These are the other three discrete families implemented in poweRlaw.) Each fit begins by assigning the value of $x_{\sf min}$ from the power-law model, so that the resulting collection of models can be meaningfully compared.

# Poisson fit
sc_pois <- dispois$new(dat = sc_ds)
sc_pois$setXmin(sc_pl$getXmin())
sc_pois$setPars(estimate_pars(sc_pois))
# discrete exponential fit
sc_exp <- disexp$new(dat = sc_ds)
sc_exp$setXmin(sc_pl$getXmin())
sc_exp$setPars(estimate_pars(sc_exp))
# discrete log-normal fit
sc_lnorm <- dislnorm$new(dat = sc_ds)
sc_lnorm$setXmin(sc_pl$getXmin())
sc_lnorm$setPars(estimate_pars(sc_lnorm))

Having fit the models using a common lower bound, we can perform likelihood-ratio tests (LRTs) to determine whether any of them outperforms the power-law. This test produces a statistic $R$ called the log-likelihood ratio, which is the ratio of the logarithms of the likelihoods of the two models. When $R>0$, the former model is judged to be a better fit, because its likelihood is greater than that of the latter; when $R<0$, the latter model is judged superior; and, when $R=0$ (when $R$ cannot be statistically discerned from zero), the models are judged equally good (or bad).

sc_pl_pois <- compare_distributions(sc_pl, sc_pois)
sc_pl_exp <- compare_distributions(sc_pl, sc_exp)
sc_pl_lnorm <- compare_distributions(sc_pl, sc_lnorm)
lrts <- data.frame(vs.Poisson = sc_pl_pois$test_statistic,
                   vs.exponential = sc_pl_exp$test_statistic,
                   vs.lognormal = sc_pl_lnorm$test_statistic)
rownames(lrts) <- "R"
knitr::kable(signif(lrts, 4))

	vs.Poisson	vs.exponential	vs.lognormal
R	14.17	3.178	-6.372

While the Poisson and exponential families provide worse fits than the power-law, the data appear to be much better suited to a discrete log-normal model. Each comparison comes with a two-sided p-value (use str() to see how the LRT objects are structured); in this case the p-values are $0$, $0.0015$, and $1.9\times 10^{-10}$, respectively. We can compare the four models visually as overlaid curves on a log–log plot of the data:

cols <- RColorBrewer::brewer.pal(4, "Set2")
plot(sc_pl,
     xlab = "Number of interaction partners",
     ylab = "Rank",
     main = "Four model families fitted to the S. cerevisiae degree sequence")
lines(sc_pl, col = cols[1], lty = 1, lwd = 4)
lines(sc_pois, col = cols[2], lty = 2, lwd = 4)
lines(sc_exp, col = cols[3], lty = 3, lwd = 4)
lines(sc_lnorm, col = cols[4], lty = 4, lwd = 4)
legend(x = "bottomleft", box.lty = 0, col = cols, lty = 1:4, lwd = 4,
       legend = c("power-law", "Poisson", "exponential", "log-normal"))

Indeed, the log-normal fit is much closer than the others. In particular, in contrast to the others (including the power-law fit), the log-normal does not tend to overestimate at one end of the data and underestimate at the other—a pattern in the residuals that is a classic sign of a poor fit. To assess the log-normal fit on its own merits, we can calculate a bootstrapped p-value, just as we did for the power-law fit:

set.seed(8285)
sc_lnorm_bsp <- bootstrap_p(sc_lnorm, no_of_sims = 250,
                            threads = parallel::detectCores())

The resulting p-value is again 0, suggesting that, while a much better fit than the power-law, the log-normal model remains a rough approximation to the degree sequence of the S. cerevisiae protein network, and may give way to a still better approximation (even, possibly, a power-law) as the network is more completely described.

Conclusion

I was prompted to dig (back) into the critical literature on power-law distributions, and especially scale-free networks, when, around the same time, a colleague shared an example of regression line–fitting from her own background reading¹³ and i came across another example in my own.¹⁴ While these papers are several years old, a search for “power law” in PubMed returns many more recent examples. Importantly, these examples cannot be singled out for special criticism when the zeitgeist of professional biologists is still infused with the enthusiasm for power laws that took hold in the early aughts.

While revisiting this literature, i was delighted to find the poweRlaw package itself, which enabled the examination in this tutorial, much more thorough than what i achieved by adapting the code provided with the CSN paper (which, in turn, speaks foremost to my own limitations). The more widely these tools are known, the more quickly these techniques will be adopted and improved.

We’ve only examined one biomolecular network in this tutorial, but the same process could be used to describe and test the shapes of any other networks maintained at Reactome or elsewhere. We also haven’t discussed the “inner” topology of the network, which is a much greater task than should be performed using only a single signature like the degree sequence.¹⁵ Indeed, once topological expectations are freed from the constraints imposed by “scale-freeness”, a world of possibilities presents itself.

Barabasi & Oltvai, 2004; Albert, 2005 ↩
Some sources designate the scaling parameter $\alpha$ the power used in the (cumulative) scaling relationship, rather than the one in the frequency relationship, which here is $\alpha-1$.↩
Keller, 2005 ↩
Li et al, 2005 ↩
Willinger et al, 2004 ↩
Mitzenmacher, 2004 ↩
See the references cited in Barabasi & Oltvai (2004) and Albert (2005).↩
Clauset et al, 2009 ↩
Goldstein, 2004 ↩
Clauset et al, 2009 ↩
It is important to note that this procedure will omit isolates—proteins contained in the full dataset but not linked to any others in our subset—from the degree sequence. For some purposes it may be necessary to preserve them with zero frequencies. For present purposes, zero values cause problems, both with visualization (in log–log plots) and with model-fitting.↩
Li et al, 2005 ↩
Zhang & Horvath, 2005 ↩
Ball & Botsis, 2011 ↩
https://link.springer.com/article/10.1007/s11693-012-9094-y ↩