A Note: For the below questions, I will ignore the isolated single nodes (i.e. nodes with degree 0).

1. Is this degree distribution consistent with a network generated by the ErdÃ¶s-Renyi network model? Why?

No. From the results obtained in the last homework, the average degree \(\bar{k}\) = \(23.998573\), thus if the network was generated by the ErdÃ¶s-Renyi model, the degree distribution can be approximated by \(\text{Poisson}(23.998573)\). When plotted together with the actual degree distribution as below, we see that they do not agree well to each other.

ks <- ks[-1]
pk <- pk[-1]

2. Fitting the degree distribution

Implement and apply the procedures of Section 4.13 in the Barabasi book to estimate the degree exponent of this network. Include a plot equivalent to Image 4.24b of the Barabasi book.

# a function to estimate gamma based on Equation (4.41) in Barabasi
estimate.gamma <- function(ksat) {
    k1 <- k[k >= ksat]  # k is the vector for the degree of each node
    n1 <- length(k1)
    1 + n1/(sum(log(k1)) - n1 * log(ksat - 0.5))
}

# a function giving CDF of PowerLawDist(gamma,ksat) at the unique k(>=ksat) values in the data, where
# gamma is estimated from the given ksat
pld.cdf <- function(ksat) {
    gamma <- estimate.gamma(ksat)
    ksg <- ks^(-gamma)  # ks is the vector for all unique k values in the network
    ksg <- ksg[ks >= ksat]
    # cumsum(ksg) / sum(ksg) # this CDF=1 at the largest k, we do not use this
    tmp <- rev(cumsum(rev(ksg)))
    1 - tmp/tmp[1]  # this CDF=0 at the first k>=ksat, we use this according to Equation (4.43) in Barabasi
}

# a function giving CDF of the data (real k) adjusted for ksat, at the unique k(>=ksat) values in the
# data
kd.cdf <- function(ksat) {
    pk1 <- pk[ks >= ksat]  # pk is the probablitity values for the k's
    cumsum(pk1)/sum(pk1)
    # tmp <- rev(cumsum(rev(pk1))) 1 - tmp / tmp[1]
}

# a function to compute the maximum distance D between the CDF of the data and the fitted model with
# (gamma,ksat)
get.D <- function(ksat) {
    max(abs(kd.cdf(ksat) - pld.cdf(ksat)))
}

# estimate D for ksat in 1:100 ksats <- seq(range(ks)[2]) # ks is the vector for all unique k values
# in the network
ksats <- 1:100
Ds <- sapply(ksats, get.D)
# min D
Dmin <- min(Ds, na.rm = TRUE)
# argmin_ksat D
ksat.opt <- ksats[which.min(Ds)]
# estimated gamma at ksat.opt
gamma.opt <- estimate.gamma(ksat.opt)

A plot of the maximum distance \(D\) between the CDF of the actual degree distribution and that of the fitted power law distribution with parameter \(\gamma\) (estimated from selected \(K_{min}\)) versus \(K_{min}\text{'s}\). Based on the log-log degree distribution plot shown above, we computed for \(K_{min}\) from 1 to 100:

qplot(ksats, Ds, xlab = "K_min", ylab = "D") + theme_classic()

From the plot, we get select \(K_{min}^{opt}=\underset{K_{min}}{\mathrm{argmin}}D = 21\), and the corresponding estimated \(\gamma=2.4325935\).

Extra challenge: Implement the goodness of fit method using the synthetic D statistic approach. How well does the power law with your estimated degree exponent fit the observed degree distribution of the genetic interaction network model?

Simulate \(B=10^4\) instances of node degrees from the power law distribution estimated from the selected optimal \(K_{min}^{opt}\), and plot the histogram of \(D_{synthetic}\):

# get the probability distribution and the CDF of the best-fit power law distribution
ks.opt <- ks[ks >= ksat.opt]  # ks is the vector for all unique k values in the network
tmp <- ks.opt^(-gamma.opt)
pk.opt <- tmp/sum(tmp)  # probability distribution

# a function giving CDF of PowerLawDist(gamma.opt,ksat.opt) at the the given unique
# ks(already>=ksat.opt) values, used for the simulated data
pld.cdf2 <- function(ks) {
    ksg <- ks^(-gamma.opt)
    tmp <- rev(cumsum(rev(ksg)))
    1 - tmp/tmp[1]
}

# simulate the degrees of nv nodes, nv is the number of nodes in the data; do the simulation 1e4
# times; for each simulation get the D_synthetic
Dsyns <- sapply(1:10000, function(i) {
    set.seed(i)
    k.sim <- sample(ks.opt, nv, replace = TRUE, prob = pk.opt)  # simulated degrees
    pk.sim <- table(k.sim)/nv  # probability distribution of the simulated degrees
    k.sim.cdf <- cumsum(pk.sim)  # CDF for simulated data
    # k.sim.cdf <- 1 - rev(cumsum(rev(pk.sim)))
    ks.sim <- as.integer(names(pk.sim))  # all unique k values present in simulated data, in increasing order
    k.opt.cdf <- pld.cdf2(ks.sim)  # CDF for best-fit model
    max(abs(k.sim.cdf - k.opt.cdf))  # D_synthetic
})

# histogram for Dsyns
qplot(Dsyns, xlab = "D_synthetic", bins = 50) + geom_vline(xintercept = Dmin, linetype = "dashed") + 
    geom_text(aes(x = Dmin - 0.003, y = 750, label = "D_real")) + theme_classic()

The value of \(D_{real}\) is noted by the dashed verticle line. We can compute a p-number based on Equation (4.46) in Barabasi, which is \(p=0.0017\), thus a pure power law is not a suitable model for the degree distribution of the data.

Extra-extra challenge: Extend your estimation procedure to include parameters k_{sat} and k_{cut}. What is the estimated degree exponent? Do you observe better goodness of fit?

(I chose to skip this one.)

3. Network evolution model

Choose what you think is the most appropriate network evolution model for the genetic interaction network from the taxonomy of network evolution models (Image 6.15). Explain why you chose it. If you can formulate an alternative model not in this taxonomy, please explain it and explain why it is a better model.

Here considering the fact that this is a network for positive and negative genetic interactions in yeast, the various extensions of the BarabÃ¡si-Albert (BA) model presented in Image 6.15 of the Barabasi book may mot be the most natural choices. Many of the biologically relevant mechanisms underlying the actual evolution of genetic interaction networks are not directly accounted for by preferential attachment (PA), which is the basis of the BA-extending models.

Instead, models based on vertex copying can be more appropriate for the network here. Vertex copying is based on the idea that a new node will “copy” (with some specific probability) the edges of some random existent node, which is a close modelling of the gene duplication events. Such models have been shown to be a reasonable model for protein-protein interaction (PPI) networks (explained in Section 14.5 of the Newman book). Genetic interaction networks share apparently similar nature with PPI networks, and we propose that vertex copying-based models may also be suitable for modeling genetic interaction networks.

Here we choose to use exactly the same model as described in Peterson GJ et al. 2012. The network generation process and the biological interpretations are given below:

Beginning from a current network, at each time step, either one node is added, or some internal links are added. To add a node:

1. Duplicate a randomly-chosen node with probability \(d\).
1. Choose either the original (50%) or duplicated (50%) node, and delete each of its links with probability \(\phi\).

The adding (duplication) of a node is parallel to gene duplication of a gene in evolution, where the new gene is similar to the original one, thus sharing some of the genetic interactions of the latter.

For adding internal links:

1. Choose a random node with probability \(\mu\).
1. Link the node to another random node (a target node).
1. Add a second link to one of the targetâs first-neighbors, chosen randomly, with probability \(a^2\).
1. Add a link to one of the targetâs second-neighbors, with probability \(a^3\), etc.

The adding of internal links this way is parallel to gene mutation, which results in novel genetic interactions to some other genes and potentially their interaction partners.

Plot the predicted degree distribution for your model of choice (you may need to do this by simulating network evolution under your chosen model). How closely can you capture the observed degree distribution of the genetic interaction network? Please answer quantitatively and qualitatively.

We are using the matlab code (https://github.com/tinybike/DUNE) provided by Peterson GJ et al. 2012 to simulate networks under the above model with the parameters proposed in Peterson GJ et al. 2012 for the yeast PPI network. For each simulation, we begin from two connected nodes, and run for time \(t\) such that the number of edges just reaches that of the yeast genetic interaction network. We performed 10 independent simulations, then plot all 10 log-log degree distributions together on the same plot. And only for the purpose of seeing the overall shape of the distribution, a LOESS curve is added.

# install.packages('R.matlab')
library(R.matlab)
sim.res <- readMat("C:/Users/Cyrus/Downloads/res.mat")
sim.pk <- lapply(sim.res$res, function(k) {
    k <- unlist(k)
    pk <- table(k)/sum(k)
    ks <- as.integer(names(pk))
    pk <- as.vector(pk)
    data.frame(k = ks, pk = pk)
})
sim.pk <- do.call(rbind, sim.pk)
qplot(k, pk, data = log10(sim.pk), xlab = "log k", ylab = "log p(k)") + geom_smooth(method = "loess") + 
    theme_classic()

Qualitatively, the distributions of the simulated network and our real network have similar shapes (compared on log-log scale). The proposed model is able to produce the low degree saturation (less frequent low-degree nodes compared to the BA model) as well as the high degree cut-off (less frequent high-degree nodes compared to the BA model, i.e. the tail falls more quickly than linear). Thus the proposed model seems more capable of producing the actual characteristics in degree distribution than the BA model. However, the actual data show an even “lower-than-saturation” probability values for very low degrees (\(k=1\text{-}5\)) that is not captured by the proposed model.

A quantitative analysis may be approached by optimizing the various parameters of network generation to fit the model to the actual data, then simulate to get the degree distribution of the best-fit model, compute a Kormogorov-Smirnov-based distance \(D_{new model}\) between the data and the best-fit model similarly as in Question 1, and then compare \(D_{new model}\) with \(D_{BA model}\) (i.e. the \(D_{real}\) as obtained in Question 1) through some statistical test procedure. If \(D_{new model}\) < \(D_{BA model}\) in a significant manner, then the proposed model can explain the data better than the BA model. But we don’t implement this quantitative analysis here due to the complexity in optimizing the parameters.

Reference

Peterson GJ, PressÃ© S, Peterson KS, Dill KA (2012) Simulated Evolution of Protein-Protein Interaction Networks with Realistic Topology. PLoS ONE 7(6): e39052. https://doi.org/10.1371/journal.pone.0039052

CMSC 828O Homework 02: Network Models

Sanju Sinha, Kuoyuan Cheng, Kunal Kundu, Javier Ponce

September 25, 2017