suppressMessages(library(igraph))
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
suppressMessages(library(RColorBrewer))
suppressMessages(library(huge))
suppressMessages(library(xlsx))
set.seed(42)
fl.sources <- list.files("../../scripts/utils/", full.names = T)
tmp <- sapply(fl.sources,source)
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
Since each context (TRM or TexTerm) has several networks in different replicate samples, we first calculated the mean edge weight of TF-regulatee pair across samples. We kept regulatees with moderate variation across TFs (sd>1).
edge_combined.csv in each network/
foldermean_edge_weight.csv: rows are regulatees
(regulated downstream genes) and columns are TFs We’ll skip this part
for demo purpose. The output file is provided.# sample <- 'TRM' # 'TRM' or 'TexTerm'
# tag <- '_subset_TFs_v2'
#
# L <- readLines(paste0(sample,'_samples.txt'))
# TFs <- readLines(paste0(sample,'_TFs_for_network_v3.txt'))
# TFs <- lapply(TFs, toupper)
#
#
# edge_weight <- do.call("rbind", lapply(L, function(x) read.table(paste0(x,'/edges_combined.csv'), sep = ",", header = T) %>% select(c(1,2,3)) %>% setNames(c("TF","regulatee","weight")))) %>% dplyr::filter(TF %in% TFs) %>%
# group_by(TF, regulatee) %>% dplyr::summarize(weight = mean(weight, na.rm=TRUE)) %>%
# reshape2::dcast(regulatee ~ TF, value.var = "weight", fill=0) %>%
# tibble::column_to_rownames(var = "regulatee")
#
# df <- edge_weight %>% rowwise() %>% mutate(sd=sd(c_across(everything()))) %>% ungroup() %>%
# filter(sd>1) %>% select(!matches("MOUSE$|[0-9\\.]{6}$")) %>% select(-sd) %>%
# rename_with(firstup)
# write.csv(df,file = paste0("mean_",sample,"_edge_weight",tag,".csv")) # write to file
sample <- 'TexTerm' ### 'TRM' or 'TexTerm'
tag <- '_subset_TFs_v2'
df <- read.csv(paste0("mean_",sample,"_edge_weight",tag,".csv"), row.names = 1)
colnames(df) <- firstup(colnames(df))
knitr::kable(df[10000:10006,1:6], caption = 'TF-gene edge weight') |> kableExtra::kable_styling(latex_options = 'scale_down')
| Ahr | Arid3a | Arnt | Arntl | Atf1 | Atf2 | |
|---|---|---|---|---|---|---|
| MICU1 | 0 | 0.000000 | 7.977929 | 10.103853 | 0.000000 | 0.00000 |
| MICU2 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 |
| MICU3 | 0 | 2.876754 | 0.000000 | 0.000000 | 0.000000 | 0.00000 |
| MID1 | 0 | 3.793183 | 0.000000 | 0.000000 | 0.000000 | 0.00000 |
| MID1IP1 | 0 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.00000 |
| MIDN | 0 | 2.554116 | 5.850666 | 2.887889 | 5.682970 | 14.71426 |
| MIEF1 | 0 | 0.000000 | 0.000000 | 0.000000 | 4.736144 | 0.00000 |
print(dim(df))
## [1] 17270 170
We used R package huge to construct Gaussian graphical
models (GGM) with copula modification. A little bit background to help
with selction of hyper-parameters.
GGM assumes that the observations have a multivariate Gaussian distribution with mean \(µ\), and covariance matrix \(Σ\). The conditional independence can be implied by the inverse covariance (concentration) matrix \(Ω = Σ^{−1}\). If \(Ω_{jk} = 0\), then the i-th variable and j-th variables are conditional independent given all other variables.
This important property serves as the foundation for GGM to infer direct interactions from high-dimensional data. Unlike relevance networks or correlation networks, in which edges are determined based on marginal correlations, GGM provides a stronger criterion of dependency, and thus further reduces the false positive rate.
However, a great limitation of classic learning methods for GGM is the lack of sparsity in the resulting graph. A dense graph not only complicates downstream analysis but also raises the issue of overfitting the data. Thus it makes sense to impose an \(ℓ\) penalty for the estimation of \(Ω\), to increase its sparsity, and the sparse pattern of \(Ω\) is essentially the same as the adjacency matrix of the underlying undirected graph.
Additionally, Liu et al. developed a data transformation method called Copula that can be used with the graphical Lasso algorithm to relax the normality assumption of GGM. The intuition of Copula is to infer the the relationship of the variables without being influenced by their individual distributions
For more details about package huge, check the vignette.
For instructions on selecting the hyper-parameters like
npn.func, check the manual
npn.func <- 'shrinkage'
df.npn <- huge.npn(df, npn.func = npn.func) # default
## Conducting the nonparanormal (npn) transformation via shrunkun ECDF....done.
out.npn <- huge(df.npn, method = "glasso", nlambda = 30, lambda.min.ratio = 0.01) # fit glasso model to data
##
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 3%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 6%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 9%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 13%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 16%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 19%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 23%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 26%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 30%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 33%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 36%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 40%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 43%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 46%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 50%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 53%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 56%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 60%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 63%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 66%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 70%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 73%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 76%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 80%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 83%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 86%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 90%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 93%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 96%
## Conducting the graphical lasso (glasso)....done.
saveRDS(out.npn,paste0("result_copula_",sample,'_',npn.func,tag,".rds"))
The program automatically sets up a sequence of 30 regularization parameters and estimates the corresponding graph path.
# default visualization provided by huge package
plot(out.npn)
From the above plot, we can see the local minimum of sparsity is 0.15. We can infer the precision matrix when sparsity=0.15
sparsity <- 0.15
idx = which(out.npn$sparsity>=sparsity & out.npn$sparsity<=sparsity+0.01)[1]
print(paste0("lambda: ", out.npn$lambda[idx]))
## [1] "lambda: 0.0903400786315669"
print(paste0("sparsity: ", out.npn$sparsity[idx]))
## [1] "sparsity: 0.151653324051514"
# 1. Get the Precision Matrix (Inverse Covariance)
theta <- as.matrix(out.npn$icov[[idx]])
# 2. Convert Precision Matrix to Partial Correlation Matrix
# Formula: rho_ij = - theta_ij / sqrt(theta_ii * theta_jj)
diag_theta <- diag(theta)
m <- -theta / sqrt(outer(diag_theta, diag_theta, "*"))
diag(m) <- 1 # The diagonal of a correlation matrix is always 1
# 3. Apply names
rownames(m) <- colnames(df)
colnames(m) <- colnames(df)
knitr::kable(head(m[,1:6]), caption = 'sparse TF-TF correlation matrix') |> kableExtra::kable_styling(latex_options = 'scale_down')
| Ahr | Arid3a | Arnt | Arntl | Atf1 | Atf2 | |
|---|---|---|---|---|---|---|
| Ahr | 1.0000000 | 0 | 0.0292973 | 0.0000000 | 0 | 0 |
| Arid3a | 0.0000000 | 1 | 0.0000000 | 0.0000000 | 0 | 0 |
| Arnt | 0.0292973 | 0 | 1.0000000 | 0.2880293 | 0 | 0 |
| Arntl | 0.0000000 | 0 | 0.2880254 | 1.0000000 | 0 | 0 |
| Atf1 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 1 | 0 |
| Atf2 | 0.0000000 | 0 | 0.0000000 | 0.0000000 | 0 | 1 |
## write to file
write.csv(m, paste0("partial_correlation_copula_",sample,"_",npn.func,"_sparsity_",sparsity,tag,".csv"))
df <- data.frame(row=rownames(m)[row(m)[upper.tri(m)]],
col=colnames(m)[col(m)[upper.tri(m)]],
corr=m[upper.tri(m)])
### remove zero entries
df <- df[df$corr!=0,]
knitr::kable(head(df), caption = 'long form of TF correlation') |> kableExtra::kable_styling(latex_options = 'scale_down')
| row | col | corr | |
|---|---|---|---|
| 2 | Ahr | Arnt | 0.0292973 |
| 6 | Arnt | Arntl | 0.2880293 |
| 20 | Atf1 | Atf7 | 0.0258843 |
| 21 | Atf2 | Atf7 | 0.5207381 |
| 23 | Arid3a | Batf | 0.0252175 |
| 30 | Arid3a | Bbx | 0.0830651 |
print(dim(df))
## [1] 2179 3
print(paste('percent of correlation score >','0',':',sum(df$corr > 0) *100 / nrow(df),"%"))
## [1] "percent of correlation score > 0 : 99.7705369435521 %"
print(paste('percent of correlation score <','0',':',sum(df$corr < 0) *100 / nrow(df),"%"))
## [1] "percent of correlation score < 0 : 0.229463056447912 %"
df$corr = abs(df$corr)
write.csv(df, paste0("correlation_intensity_table_copula_",sample,"_",npn.func,"_sparsity_",sparsity,tag,".csv"))
Node annotation came from the results of Fig. S3a.
TRM_Tex: multi-tasker in both TRM and TexTermTex_specific/TRM_specific: single-tasker
in TexTerm/TRMTex_important/TRM_important: important in
TexTerm/TRMHousekeeping: 54 universal TFs with small variation
across cell typesnodes <- read.xlsx('TRM_Tex_TF_list_20240121.xlsx', sheetName = sample) %>% tidyr::drop_na(TF) %>% dplyr::mutate(group=ifelse(is.na(Specificity), Important, Specificity)) %>% select(TF, group)
names(nodes) <- c('name','group')
knitr::kable(head(nodes), caption = 'node meta') |> kableExtra::kable_styling(latex_options = 'scale_down')
| name | group |
|---|---|
| Bhlhe40 | TRM_Tex |
| Nr4a2 | TRM_Tex |
| Snai3 | TRM_Tex |
| Xbp1 | TRM_Tex |
| Egr1 | TRM_Tex |
| Nfil3 | TRM_Tex |
print(dim(nodes))
## [1] 170 2
g = graph_from_data_frame(df, directed=FALSE, vertices = nodes)
print(g, e=TRUE, v=TRUE)
## IGRAPH 68035fe UN-- 170 2179 --
## + attr: name (v/c), group (v/c), corr (e/n)
## + edges from 68035fe (vertex names):
## [1] Ahr --Arnt Arnt --Arntl Atf1 --Atf7 Atf2 --Atf7
## [5] Arid3a --Batf Arid3a --Bbx Atf2 --Bbx Bcl6 --Bbx
## [9] Bhlhe40--Arnt Bhlhe40--Arntl Bhlhe41--Arntl Bhlhe40--Bhlhe41
## [13] Cbfb --Bbx Cebpb --Bcl6 Creb1 --Atf1 Creb1 --Atf2
## [17] Creb1 --Atf7 Crem --Atf1 Crem --Atf2 Crem --Atf7
## [21] Crem --Creb1 Arid3a --Ctcf Ctcf --Arnt Ctcf --Bbx
## [25] Ctcf --Bcl6 Ctcf --Cbfb Cebpb --Ctcf Ddit3 --Arid3a
## [29] Ddit3 --Atf2 Ddit3 --Bbx Ddit3 --Bcl6 Ddit3 --Cbfb
## + ... omitted several edges
Leiden clustering generates communities. Adjust the res
to get a reasonable number of groups and high modularity score. High
res gives more groups. Through experimenting, res=0.9 suits
the best.
set.seed(42)
res <- 1.1
clustering = cluster_leiden(g, objective_function="modularity",resolution_parameter=res)
print(clustering)
## IGRAPH clustering leiden, groups: 8, mod: NA
## + groups:
## $`1`
## [1] "Bhlhe40" "Xbp1" "Heyl" "Bhlhe41" "Tfeb" "Usf1" "Mlx"
## [8] "Srebf1" "Arntl" "Tfe3"
##
## $`2`
## [1] "Nr4a2" "Nfil3" "Gfi1" "Zbtb3" "Ddit3" "Jun" "Irf4"
## [8] "Hmg20b" "Meox2" "Fos" "Isl2" "Prdm1" "Nfe2l1" "Dmrt2"
## [15] "Zfp768" "Smad4" "Eomes" "Zbtb20" "Irf8" "Arid3a" "Foxj2"
## [22] "Batf" "Nfatc1" "Jund" "Cebpb" "Vax2" "Foxd2" "Zscan20"
## [29] "Stat6" "Etv6" "Tcf3" "Irf3" "Stat4" "Stat3" "Gata3"
## + ... omitted several groups/vertices
print(sizes(clustering))
## Community sizes
## 1 2 3 4 5 6 7 8
## 10 66 25 51 15 1 1 1
print(paste0('modularity:',modularity(g, membership(clustering))))
## [1] "modularity:0.350524037176596"
saveRDS(clustering,paste0('clustering_copula_',sample,'_',npn.func,'_res',res,'_s',sparsity,tag,'.rds'))
## write clustering result
clust <- data.frame(name = names(membership(clustering)),cluster = as.numeric(membership(clustering))) %>%
dplyr::inner_join(nodes, by = "name") %>%
dplyr::arrange(group, cluster)
knitr::kable(head(clust), caption = 'community clustering') |> kableExtra::kable_styling(latex_options = 'scale_down')
| name | cluster | group |
|---|---|---|
| Stat6 | 2 | Housekeeping |
| Etv6 | 2 | Housekeeping |
| Tcf3 | 2 | Housekeeping |
| Irf3 | 2 | Housekeeping |
| Stat4 | 2 | Housekeeping |
| Stat3 | 2 | Housekeeping |
write.csv(clust,paste0('clustering_membership_copula_',sample,'_',npn.func,'_res',res,'_s',sparsity,tag,'.csv'), row.names=F)
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS 15.5
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.11.0
##
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
##
## time zone: America/Los_Angeles
## tzcode source: internal
##
## attached base packages:
## [1] grid stats graphics grDevices utils datasets methods
## [8] base
##
## other attached packages:
## [1] factoextra_1.0.7 ggpubr_0.6.0 xlsx_0.6.5 huge_1.3.5
## [5] RColorBrewer_1.1-3 dplyr_1.1.2 ggplot2_3.4.2 igraph_1.5.0
##
## loaded via a namespace (and not attached):
## [1] xlsxjars_0.6.1 sass_0.4.6 utf8_1.2.3 generics_0.1.3
## [5] tidyr_1.3.0 xml2_1.3.4 rstatix_0.7.2 stringi_1.7.12
## [9] lattice_0.21-8 digest_0.6.31 magrittr_2.0.3 evaluate_0.21
## [13] fastmap_1.1.1 jsonlite_1.8.7 Matrix_1.6-5 ggrepel_0.9.3
## [17] backports_1.4.1 httr_1.4.6 rvest_1.0.3 purrr_1.0.1
## [21] fansi_1.0.4 viridisLite_0.4.2 scales_1.2.1 jquerylib_0.1.4
## [25] abind_1.4-5 cli_3.6.1 rlang_1.1.1 munsell_0.5.0
## [29] withr_2.5.0 cachem_1.0.8 yaml_2.3.7 tools_4.3.1
## [33] ggsignif_0.6.4 colorspace_2.1-0 webshot_0.5.5 kableExtra_1.3.4
## [37] broom_1.0.5 vctrs_0.6.3 R6_2.5.1 lifecycle_1.0.3
## [41] stringr_1.5.0 car_3.1-2 MASS_7.3-60 pkgconfig_2.0.3
## [45] rJava_1.0-6 pillar_1.9.0 bslib_0.5.0 gtable_0.3.3
## [49] glue_1.6.2 Rcpp_1.0.10 systemfonts_1.0.4 highr_0.10
## [53] xfun_0.39 tibble_3.2.1 tidyselect_1.2.0 rstudioapi_0.14
## [57] knitr_1.43 htmltools_0.5.5 svglite_2.1.1 carData_3.0-5
## [61] rmarkdown_2.23 compiler_4.3.1