A Note: For the below questions, I will ignore the isolated single nodes (i.e. nodes with degree 0).
No. From the results obtained in the last homework, the average degree \(\bar{k}\) = \(23.998573\), thus if the network was generated by the Erdös-Renyi model, the degree distribution can be approximated by \(\text{Poisson}(23.998573)\). When plotted together with the actual degree distribution as below, we see that they do not agree well to each other.
ks <- ks[-1]
pk <- pk[-1]
# a function to estimate gamma based on Equation (4.41) in Barabasi
estimate.gamma <- function(ksat) {
k1 <- k[k >= ksat] # k is the vector for the degree of each node
n1 <- length(k1)
1 + n1/(sum(log(k1)) - n1 * log(ksat - 0.5))
}
# a function giving CDF of PowerLawDist(gamma,ksat) at the unique k(>=ksat) values in the data, where
# gamma is estimated from the given ksat
pld.cdf <- function(ksat) {
gamma <- estimate.gamma(ksat)
ksg <- ks^(-gamma) # ks is the vector for all unique k values in the network
ksg <- ksg[ks >= ksat]
# cumsum(ksg) / sum(ksg) # this CDF=1 at the largest k, we do not use this
tmp <- rev(cumsum(rev(ksg)))
1 - tmp/tmp[1] # this CDF=0 at the first k>=ksat, we use this according to Equation (4.43) in Barabasi
}
# a function giving CDF of the data (real k) adjusted for ksat, at the unique k(>=ksat) values in the
# data
kd.cdf <- function(ksat) {
pk1 <- pk[ks >= ksat] # pk is the probablitity values for the k's
cumsum(pk1)/sum(pk1)
# tmp <- rev(cumsum(rev(pk1))) 1 - tmp / tmp[1]
}
# a function to compute the maximum distance D between the CDF of the data and the fitted model with
# (gamma,ksat)
get.D <- function(ksat) {
max(abs(kd.cdf(ksat) - pld.cdf(ksat)))
}
# estimate D for ksat in 1:100 ksats <- seq(range(ks)[2]) # ks is the vector for all unique k values
# in the network
ksats <- 1:100
Ds <- sapply(ksats, get.D)
# min D
Dmin <- min(Ds, na.rm = TRUE)
# argmin_ksat D
ksat.opt <- ksats[which.min(Ds)]
# estimated gamma at ksat.opt
gamma.opt <- estimate.gamma(ksat.opt)
A plot of the maximum distance \(D\) between the CDF of the actual degree distribution and that of the fitted power law distribution with parameter \(\gamma\) (estimated from selected \(K_{min}\)) versus \(K_{min}\text{'s}\). Based on the log-log degree distribution plot shown above, we computed for \(K_{min}\) from 1 to 100:
qplot(ksats, Ds, xlab = "K_min", ylab = "D") + theme_classic()
From the plot, we get select \(K_{min}^{opt}=\underset{K_{min}}{\mathrm{argmin}}D = 21\), and the corresponding estimated \(\gamma=2.4325935\).
Simulate \(B=10^4\) instances of node degrees from the power law distribution estimated from the selected optimal \(K_{min}^{opt}\), and plot the histogram of \(D_{synthetic}\):
# get the probability distribution and the CDF of the best-fit power law distribution
ks.opt <- ks[ks >= ksat.opt] # ks is the vector for all unique k values in the network
tmp <- ks.opt^(-gamma.opt)
pk.opt <- tmp/sum(tmp) # probability distribution
# a function giving CDF of PowerLawDist(gamma.opt,ksat.opt) at the the given unique
# ks(already>=ksat.opt) values, used for the simulated data
pld.cdf2 <- function(ks) {
ksg <- ks^(-gamma.opt)
tmp <- rev(cumsum(rev(ksg)))
1 - tmp/tmp[1]
}
# simulate the degrees of nv nodes, nv is the number of nodes in the data; do the simulation 1e4
# times; for each simulation get the D_synthetic
Dsyns <- sapply(1:10000, function(i) {
set.seed(i)
k.sim <- sample(ks.opt, nv, replace = TRUE, prob = pk.opt) # simulated degrees
pk.sim <- table(k.sim)/nv # probability distribution of the simulated degrees
k.sim.cdf <- cumsum(pk.sim) # CDF for simulated data
# k.sim.cdf <- 1 - rev(cumsum(rev(pk.sim)))
ks.sim <- as.integer(names(pk.sim)) # all unique k values present in simulated data, in increasing order
k.opt.cdf <- pld.cdf2(ks.sim) # CDF for best-fit model
max(abs(k.sim.cdf - k.opt.cdf)) # D_synthetic
})
# histogram for Dsyns
qplot(Dsyns, xlab = "D_synthetic", bins = 50) + geom_vline(xintercept = Dmin, linetype = "dashed") +
geom_text(aes(x = Dmin - 0.003, y = 750, label = "D_real")) + theme_classic()
The value of \(D_{real}\) is noted by the dashed verticle line. We can compute a p-number based on Equation (4.46) in Barabasi, which is \(p=0.0017\), thus a pure power law is not a suitable model for the degree distribution of the data.
(I chose to skip this one.)
Here considering the fact that this is a network for positive and negative genetic interactions in yeast, the various extensions of the Barabási-Albert (BA) model presented in Image 6.15 of the Barabasi book may mot be the most natural choices. Many of the biologically relevant mechanisms underlying the actual evolution of genetic interaction networks are not directly accounted for by preferential attachment (PA), which is the basis of the BA-extending models.
Instead, models based on vertex copying can be more appropriate for the network here. Vertex copying is based on the idea that a new node will “copy” (with some specific probability) the edges of some random existent node, which is a close modelling of the gene duplication events. Such models have been shown to be a reasonable model for protein-protein interaction (PPI) networks (explained in Section 14.5 of the Newman book). Genetic interaction networks share apparently similar nature with PPI networks, and we propose that vertex copying-based models may also be suitable for modeling genetic interaction networks.
Here we choose to use exactly the same model as described in Peterson GJ et al. 2012. The network generation process and the biological interpretations are given below:
Beginning from a current network, at each time step, either one node is added, or some internal links are added. To add a node:
The adding (duplication) of a node is parallel to gene duplication of a gene in evolution, where the new gene is similar to the original one, thus sharing some of the genetic interactions of the latter.
For adding internal links:
The adding of internal links this way is parallel to gene mutation, which results in novel genetic interactions to some other genes and potentially their interaction partners.
We are using the matlab code (https://github.com/tinybike/DUNE) provided by Peterson GJ et al. 2012 to simulate networks under the above model with the parameters proposed in Peterson GJ et al. 2012 for the yeast PPI network. For each simulation, we begin from two connected nodes, and run for time \(t\) such that the number of edges just reaches that of the yeast genetic interaction network. We performed 10 independent simulations, then plot all 10 log-log degree distributions together on the same plot. And only for the purpose of seeing the overall shape of the distribution, a LOESS curve is added.
# install.packages('R.matlab')
library(R.matlab)
sim.res <- readMat("C:/Users/Cyrus/Downloads/res.mat")
sim.pk <- lapply(sim.res$res, function(k) {
k <- unlist(k)
pk <- table(k)/sum(k)
ks <- as.integer(names(pk))
pk <- as.vector(pk)
data.frame(k = ks, pk = pk)
})
sim.pk <- do.call(rbind, sim.pk)
qplot(k, pk, data = log10(sim.pk), xlab = "log k", ylab = "log p(k)") + geom_smooth(method = "loess") +
theme_classic()
Qualitatively, the distributions of the simulated network and our real network have similar shapes (compared on log-log scale). The proposed model is able to produce the low degree saturation (less frequent low-degree nodes compared to the BA model) as well as the high degree cut-off (less frequent high-degree nodes compared to the BA model, i.e. the tail falls more quickly than linear). Thus the proposed model seems more capable of producing the actual characteristics in degree distribution than the BA model. However, the actual data show an even “lower-than-saturation” probability values for very low degrees (\(k=1\text{-}5\)) that is not captured by the proposed model.
A quantitative analysis may be approached by optimizing the various parameters of network generation to fit the model to the actual data, then simulate to get the degree distribution of the best-fit model, compute a Kormogorov-Smirnov-based distance \(D_{new model}\) between the data and the best-fit model similarly as in Question 1, and then compare \(D_{new model}\) with \(D_{BA model}\) (i.e. the \(D_{real}\) as obtained in Question 1) through some statistical test procedure. If \(D_{new model}\) < \(D_{BA model}\) in a significant manner, then the proposed model can explain the data better than the BA model. But we don’t implement this quantitative analysis here due to the complexity in optimizing the parameters.
Peterson GJ, Pressé S, Peterson KS, Dill KA (2012) Simulated Evolution of Protein-Protein Interaction Networks with Realistic Topology. PLoS ONE 7(6): e39052. https://doi.org/10.1371/journal.pone.0039052