Load packages.
library(SpiecEasi)
library(Matrix)
library(reshape2)
library(igraph)
# library(parallel)
devtools::install_github("berryni/mDINGO")
* installing *source* package ‘mDINGO’ ...
** R
** data
** preparing package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded
* DONE (mDINGO)
library(mDINGO)
availcores = 1#parallel::detectCores() - 1
Load data from american gut project. This is a very small subset of the dataset to make this run quickly for testing purposes.
data(amgut1.filt)
depths <- rowSums(amgut1.filt)
amgut1.filt.n <- t(apply(amgut1.filt, 1, norm_to_total))
amgut1 <- round(amgut1.filt.n * min(depths))
p = dim(amgut1)[2]
n = dim(amgut1)[1]
The dimensions of this data are (289, 127).
To show it has been filtered, a summary of the row sums
print(summary(rowSums(amgut1)))
Min. 1st Qu. Median Mean 3rd Qu. Max.
11989 12006 12009 12009 12012 12024
And the column sums:
print(summary(colSums(amgut1)))
Min. 1st Qu. Median Mean 3rd Qu. Max.
538 2670 7098 27327 16572 1148740
Now, we’ll generate a dataset using a form of the code from the SpiecEasi demo. Unlike that demo, we will generate two different datasets, one for each level of the covariate, and combine them to form a covariate dependent dataset.
Use function from mDINGO to create underlying graph structure. Then plot those graphs. Looking to use SpiecEasi to recreate Global Graph.
graphList = gen_graphs_sf(p, 1, .6, .05)
l_fr = layout_with_fr(graphList$disease)
par(mfrow = c(2,2), mar=c(.5,.5,1,.5))
plot(graphList$disease, vertex.size= 0, main = "Disease Graph", layout = l_fr, vertex.label = NA)
plot(graphList$change, vertex.size= 0, main = "Change Graph", layout = l_fr, vertex.label = NA)
plot(graphList$control, vertex.size= 0, main = "Control Graph", layout = l_fr, vertex.label = NA)
plot(graphList$global, vertex.size= 0, main = "Global Graph", layout = l_fr, vertex.label = NA)
Use built in SpiecEasi functions to create datasets from the underlying graphs using our imported dataset as example. The posThetaLims option sets the strenght (weight) of the connections. We’ll make them easy to detect for testing purposes.
prec_dis = igraph2prec(graphList$disease)
cor_dis = cov2cor(prec2cov(prec_dis))
prec_con = igraph2prec(graphList$control)
cor_con = cov2cor(prec2cov(prec_con))
Y_dis = synth_comm_from_counts(amgut1, mar=2, distr='zinegbin', Sigma=cor_dis, n=n)
Y_con = synth_comm_from_counts(amgut1, mar=2, distr='zinegbin', Sigma=cor_con, n=n)
Y = rbind(Y_dis, Y_con)
colnames(Y) = paste0("OTU", 1:dim(Y)[2])
Assign the levels of covariate to vector X
X = as.factor(c(rep(c(1,-1), each = dim(Y)[1]/2)))
We now have a dataset in which the first 289 subjects are in the disease group and the last 289 subjects are controls.
We now use SpiecEasi to perform global estimation. Purpose of this is to get the graph that the two groups have in common so we can remove that graph and later estimate the differences between the group graphs. This is the estimation of G from the dingo paper.
seRes = spiec.easi(Y, verbose = TRUE, sel.criterion = 'ebic')
Normalizing/clr transformation of data with pseudocount ...
Inverse Covariance Estimation with glasso ...
Model selection with ebic ...
Done!
We used the extended BIC model selection strategy, rather than the default ``Stars’’ method. This could be changed, but we have a clearer idea of what eBIC is doing, so we use it here for demonstrative purposes.
temp = rbind(round(seRes$lambda, 4), round(seRes$ebic.score, 0), round(seRes$sparsity, 2))
row.names(temp) = c("Lambda", "eBIC", "Sparsity")
knitr::kable(temp, col.names = as.character(1:dim(temp)[2]))
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | |
|---|---|---|---|---|---|---|---|---|---|---|
| Lambda | 0.4184 | 1.942e-01 | 0.0901 | 4.180e-02 | 0.0194 | 0.009 | 0.0042 | 0.0019 | 9.000e-04 | 0.0004 |
| eBIC | 73406.0000 | 7.456e+04 | 75829.0000 | 9.539e+04 | 132431.0000 | 161923.000 | 179351.0000 | 187441.0000 | 1.904e+05 | 191106.0000 |
| Sparsity | 0.0000 | 1.000e-02 | 0.0400 | 2.100e-01 | 0.5200 | 0.760 | 0.9000 | 0.9600 | 9.900e-01 | 1.0000 |
huge::huge.roc(seRes$path, as.matrix(graphList$global[]))
Computing F1 scores, false positive rates and true positive rates....done.
True Postive Rate: from 0 to 1
False Positive Rate: from 0 to 0.9988625
Area under Curve: 0.9885659
Maximum F1 Score: 0.9729656
The chosen lambda is the first column in this case. This gives us a Sparcity of 0, meaning that there are no connections in the global graph. Below we compare the global basis graph (truth) with the estimated global graph.
se_igraph = adj2igraph(as.matrix(seRes$refit))
par(mfrow = c(1,2), mar=c(1, 1, 1, 1) + 0.1)
plot(graphList$global, layout = l_fr, vertex.size= 0, vertex.label=NA, main = "Global Basis Graph")
plot(se_igraph, layout= l_fr, vertex.size= 0, vertex.label=NA, main = "Global Spiec-Easi Inferred Graph")
As stated, there are no connections in the estimated global graph (sparsity is 0).
Lets look at what happens if we take the tuning parameter 1 step smaller than the chosen one (column 2 in the table above).
se_igraph = adj2igraph(as.matrix(seRes$path[[2]]))
par(mfrow = c(1,2), mar=c(1, 1, 1, 1) + 0.1)
plot(graphList$global, layout = l_fr, vertex.size= 0, vertex.label=NA, main = "Global Basis Graph")
plot(se_igraph, layout= l_fr, vertex.size= 0, vertex.label=NA, main = "Global Spiec-Easi Inferred Graph")
There are way too many connections in this inferred graph. We need somewhere in the middle. The current steps in the process are too big.
Preliminarily, lets try to add more points to the grid search of SpiecEasi and see if that does better.
seFine = spiec.easi(Y, nlambda = 50, verbose = TRUE, sel.criterion = 'ebic')
Normalizing/clr transformation of data with pseudocount ...
Inverse Covariance Estimation with glasso ...
Model selection with ebic ...
Done!
temp = rbind(round(seFine$lambda, 4), round(seFine$ebic.score, 0), round(seFine$sparsity, 2))
row.names(temp) = c("Lambda", "eBIC", "Sparsity")
knitr::kable(temp, col.names = as.character(1:dim(temp)[2]))
| 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 | 16 | 17 | 18 | 19 | 20 | 21 | 22 | 23 | 24 | 25 | 26 | 27 | 28 | 29 | 30 | 31 | 32 | 33 | 34 | 35 | 36 | 37 | 38 | 39 | 40 | 41 | 42 | 43 | 44 | 45 | 46 | 47 | 48 | 49 | 50 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Lambda | 0.4184 | 0.3634 | 0.3156 | 0.2741 | 0.2381 | 0.2068 | 0.1796 | 0.156 | 0.1355 | 0.1176 | 0.1022 | 0.0887 | 0.0771 | 0.0669 | 0.0581 | 0.0505 | 0.0439 | 0.0381 | 0.0331 | 0.0287 | 0.025 | 0.0217 | 0.0188 | 0.0163 | 0.0142 | 0.0123 | 0.0107 | 0.0093 | 0.0081 | 0.007 | 0.0061 | 0.0053 | 0.0046 | 0.004 | 0.0035 | 0.003 | 0.0026 | 0.0023 | 0.002 | 0.0017 | 0.0015 | 0.0013 | 0.0011 | 0.001 | 0.0008 | 0.0007 | 0.0006 | 0.0006 | 0.0005 | 0.0004 |
| eBIC | 73406.0000 | 73524.0000 | 73698.0000 | 73833.0000 | 74123.0000 | 74354.0000 | 74442.0000 | 74327.000 | 74234.0000 | 74552.0000 | 75136.0000 | 76052.0000 | 77633.0000 | 79772.0000 | 83548.0000 | 87646.0000 | 93083.0000 | 99841.0000 | 106332.0000 | 112908.0000 | 120375.000 | 127128.0000 | 134044.0000 | 140655.0000 | 146785.0000 | 151521.0000 | 156606.0000 | 161093.0000 | 164915.0000 | 168821.000 | 172318.0000 | 174947.0000 | 177884.0000 | 180102.000 | 181991.0000 | 183414.000 | 184721.0000 | 186111.0000 | 187431.000 | 187958.0000 | 188775.0000 | 189278.0000 | 189556.0000 | 190057.000 | 190293.0000 | 190805.0000 | 190826.0000 | 191124.0000 | 190937.0000 | 191106.0000 |
| Sparsity | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0100 | 0.0100 | 0.0100 | 0.020 | 0.0200 | 0.0300 | 0.0300 | 0.0400 | 0.0600 | 0.0800 | 0.1100 | 0.1500 | 0.1900 | 0.2500 | 0.3000 | 0.3600 | 0.420 | 0.4700 | 0.5300 | 0.5900 | 0.6400 | 0.6700 | 0.7200 | 0.7500 | 0.7800 | 0.810 | 0.8400 | 0.8600 | 0.8900 | 0.910 | 0.9200 | 0.930 | 0.9400 | 0.9500 | 0.960 | 0.9700 | 0.9800 | 0.9800 | 0.9800 | 0.990 | 0.9900 | 0.9900 | 1.0000 | 1.0000 | 1.0000 | 1.0000 |
huge::huge.roc(seFine$path, as.matrix(graphList$global[]))
Computing F1 scores, false positive rates and true positive rates....done.
True Postive Rate: from 0 to 1
False Positive Rate: from 0 to 0.9988625
Area under Curve: 0.9913756
Maximum F1 Score: 0.9812361
This is again the sparsity 0 model, so the inferred global graph has no edges. Looking at the ROC, we see potentially better models, so lets look at model 6, for examples sake.
seFine_igraph = adj2igraph(as.matrix(seFine$path[[6]]))
par(mfrow = c(1,2), mar=c(1, 1, 1, 1) + 0.1)
plot(graphList$global, layout = l_fr, vertex.size= 0, vertex.label=NA, main = "Global Basis Graph")
plot(seFine_igraph, layout= l_fr, vertex.size= 0, vertex.label=NA, main = "Global Spiec-Easi Inferred Graph")
Do model selection with Stars instead. Stars resamples and refits to get edge importance metrics, and chooses sparsity metric of best network. Should be related to most predictive graph.
seFine2 = spiec.easi(Y, method = 'glasso', lambda.min.ratio=.01, nlambda = 50)
Normalizing/clr transformation of data with pseudocount ...
Inverse Covariance Estimation with glasso ...
Model selection with stars ...
Done!
#temp = rbind(round(seFine2$lambda, 4), round(seFine2$ebic.score), round(seFine2$sparsity, 2))
#row.names(temp) = c("Lambda", "eBIC", "Sparsity")
#knitr::kable(temp, col.names = as.character(1:dim(temp)[2]))
huge::huge.roc(seFine2$path, as.matrix(graphList$global[]))
Computing F1 scores, false positive rates and true positive rates....done.
True Postive Rate: from 0 to 1
False Positive Rate: from 0 to 0.8979398
Area under Curve: 0.8905552
Maximum F1 Score: 0.9824202
Stars suggests choosing the 0.1634711 sparsity parameter, which is the 11 model in the ROC path.
Lets look at the graphs created and see if they do a better job at matching the global basis graph.
se_igraph = adj2igraph(as.matrix(seFine2$refit))
par(mfrow = c(1,2), mar=c(1, 1, 1, 1) + 0.1)
plot(graphList$global, layout = l_fr, vertex.size= 0, vertex.label=NA, main = "Global Basis Graph")
plot(se_igraph, layout= l_fr, vertex.size= 0, vertex.label=NA, main = "Global Spiec-Easi Inferred Graph")
INstead of choosing that parameter and worrying, we can use the default model selection tool, called Stars. Stars resamples and refits to get edge importance metrics, and chooses sparsity metric of best network. Should be related to most predictive graph.