1 import packages and functions

suppressMessages(library(igraph))
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
suppressMessages(library(RColorBrewer))
suppressMessages(library(huge))
suppressMessages(library(xlsx))

set.seed(42)
fl.sources <- list.files("../../scripts/utils/", full.names = T)
tmp <- sapply(fl.sources,source)

2 prepare TF-gene edge weight input

Since each context (TRM or TexTerm) has several networks from different replicate samples, we first computed the mean edge weight of each TF-regulatee pair across samples. We then kept regulatees with moderate variation across TFs (sd > 1).

  • input: edges_combined.csv in each network/ folder
  • optional argument: a pre-defined TF list; here we focus only on TFs important in each context
  • output file: mean_edge_weight.csv, where rows are regulatees (regulated downstream genes) and columns are TFs. We skip this step for demo purposes; the output file is provided, and a toy sketch of the sd > 1 filter follows the skipped chunk below.
# sample <- 'TRM' # 'TRM' or 'TexTerm'
# tag <- '_subset_TFs_v2'
# 
# L <- readLines(paste0(sample,'_samples.txt'))
# TFs <- readLines(paste0(sample,'_TFs_for_network_v3.txt')) 
# TFs <- toupper(TFs) # vectorized; keeps TFs a character vector for the %in% filter below
#  
# 
# edge_weight <- do.call("rbind", lapply(L, function(x) read.table(paste0(x,'/edges_combined.csv'), sep = ",", header = T) %>% select(c(1,2,3)) %>% setNames(c("TF","regulatee","weight")))) %>% dplyr::filter(TF %in% TFs) %>% 
#                group_by(TF, regulatee) %>% dplyr::summarize(weight = mean(weight, na.rm=TRUE)) %>% 
#                reshape2::dcast(regulatee ~ TF, value.var = "weight", fill=0) %>% 
#                tibble::column_to_rownames(var = "regulatee") 
#                                        
# df <- edge_weight %>% rowwise() %>% mutate(sd=sd(c_across(everything()))) %>% ungroup() %>% 
#                 filter(sd>1) %>% select(!matches("MOUSE$|[0-9\\.]{6}$")) %>% select(-sd) %>%
#                 rename_with(firstup)
# write.csv(df,file = paste0("mean_",sample,"_edge_weight",tag,".csv")) # write to file
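
Since that chunk is skipped here, the following is a minimal toy sketch (hypothetical gene/TF names and values) of the sd > 1 filter applied in its last step:

# toy sketch of the sd filter (hypothetical values, not real edge weights)
toy <- data.frame(TF1 = c(0, 5.0, 0.1), TF2 = c(0.2, 0.4, 0.1),
                  row.names = c("geneA", "geneB", "geneC"))
keep <- apply(toy, 1, sd) > 1  # sd of edge weights across TFs, per regulatee
toy[keep, , drop = FALSE]      # only geneB passes the sd > 1 cutoff
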
sample <- 'TexTerm' ### 'TRM' or 'TexTerm'
tag <- '_subset_TFs_v2'
df <- read.csv(paste0("mean_",sample,"_edge_weight",tag,".csv"), row.names = 1)
colnames(df) <- firstup(colnames(df))
knitr::kable(df[10000:10006,1:6], caption = 'TF-gene edge weight') |> kableExtra::kable_styling(latex_options = 'scale_down')
TF-gene edge weight
        Ahr   Arid3a     Arnt     Arntl     Atf1     Atf2
MICU1     0 0.000000 7.977929 10.103853 0.000000  0.00000
MICU2     0 0.000000 0.000000  0.000000 0.000000  0.00000
MICU3     0 2.876754 0.000000  0.000000 0.000000  0.00000
MID1      0 3.793183 0.000000  0.000000 0.000000  0.00000
MID1IP1   0 0.000000 0.000000  0.000000 0.000000  0.00000
MIDN      0 2.554116 5.850666  2.887889 5.682970 14.71426
MIEF1     0 0.000000 0.000000  0.000000 4.736144  0.00000
print(dim(df))
## [1] 17270   170

3 copula pipeline to infer graph

We used the R package huge to construct Gaussian graphical models (GGM) with a copula modification. A bit of background helps with the selection of hyper-parameters.

GGM assumes that the observations follow a multivariate Gaussian distribution with mean \(µ\) and covariance matrix \(Σ\). Conditional independence is encoded by the inverse covariance (concentration) matrix \(Ω = Σ^{−1}\): if \(Ω_{jk} = 0\), then the j-th and k-th variables are conditionally independent given all other variables.

This important property serves as the foundation for GGM to infer direct interactions from high-dimensional data. Unlike relevance networks or correlation networks, in which edges are determined based on marginal correlations, GGM provides a stronger criterion of dependency, and thus further reduces the false positive rate.
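
As a minimal toy illustration of this distinction (synthetic data, not part of the pipeline): two variables driven by a common third are strongly correlated marginally, yet their partial correlation, read off the precision matrix, is near zero.

# toy example: marginal vs partial correlation
set.seed(1)
z  <- rnorm(1000)                      # common driver
x1 <- z + rnorm(1000, sd = 0.5)
x2 <- z + rnorm(1000, sd = 0.5)
cor(x1, x2)                            # strong marginal correlation (~0.8)
omega <- solve(cov(cbind(z, x1, x2)))  # precision matrix Omega = Sigma^-1
-cov2cor(omega)[2, 3]                  # partial correlation of x1, x2 given z: ~0
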

However, a major limitation of classical learning methods for GGM is the lack of sparsity in the resulting graph. A dense graph not only complicates downstream analysis but also raises the risk of overfitting. It therefore makes sense to impose an \(ℓ_1\) penalty on the estimation of \(Ω\) to increase its sparsity; the sparsity pattern of \(Ω\) is then essentially the adjacency matrix of the underlying undirected graph.

Additionally, Liu et al. developed a data transformation, the copula (nonparanormal) transformation, that can be combined with the graphical lasso algorithm to relax the normality assumption of GGM. The intuition behind the copula is to infer the relationships among variables without being influenced by their individual marginal distributions.

For more details about the huge package, check its vignette. For instructions on selecting hyper-parameters such as npn.func, check the manual.
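
As a quick sanity check of what huge.npn does (a sketch on synthetic data, not the real input): a monotone transform such as exp() leaves ranks intact but breaks normality, and the npn transform restores approximately Gaussian margins while preserving those ranks.

# sketch: npn transform on synthetic, monotonically distorted Gaussian data
set.seed(1)
x <- matrix(rnorm(200 * 3), 200, 3)
x.skewed <- exp(x)                         # heavily right-skewed margins
x.npn <- huge.npn(x.skewed, npn.func = "shrinkage", verbose = FALSE)
cor(rank(x.skewed[, 1]), rank(x.npn[, 1])) # ranks preserved: exactly 1
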

npn.func <- 'shrinkage'
df.npn <- huge.npn(df, npn.func = npn.func) # default                      
## Conducting the nonparanormal (npn) transformation via shrunkun ECDF....done.
out.npn <- huge(df.npn, method = "glasso", nlambda = 30, lambda.min.ratio = 0.01) # fit glasso model to data
## Conducting the graphical lasso (glasso) wtih lossless screening....in progress (repeated progress lines omitted)
## Conducting the graphical lasso (glasso)....done.                                          
saveRDS(out.npn,paste0("result_copula_",sample,'_',npn.func,tag,".rds"))

The program automatically sets up a sequence of 30 regularization parameters and estimates the corresponding graph path.

# default visualization provided by huge package
plot(out.npn)
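
The same information can be read off the fitted object directly; a small sketch, assuming out.npn from above:

# inspect the regularization path: sparsity level per lambda
path <- data.frame(lambda = out.npn$lambda, sparsity = out.npn$sparsity)
head(path)
plot(path$lambda, path$sparsity, type = "b", log = "x",
     xlab = "lambda", ylab = "sparsity")
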

3.1 get precision matrix

From the plot above, sparsity ≈ 0.15 is a reasonable operating point on the regularization path, so we infer the precision matrix at sparsity = 0.15.

sparsity <- 0.15
idx <- which(out.npn$sparsity >= sparsity & out.npn$sparsity <= sparsity + 0.01)[1]
print(paste0("lambda: ", out.npn$lambda[idx]))
## [1] "lambda: 0.0903400786315669"
print(paste0("sparsity: ", out.npn$sparsity[idx]))
## [1] "sparsity: 0.151653324051514"
# 1. Get the Precision Matrix (Inverse Covariance)
theta <- as.matrix(out.npn$icov[[idx]])

# 2. Convert Precision Matrix to Partial Correlation Matrix
# Formula: rho_ij = - theta_ij / sqrt(theta_ii * theta_jj)
diag_theta <- diag(theta)
m <- -theta / sqrt(outer(diag_theta, diag_theta, "*"))
diag(m) <- 1 # The diagonal of a correlation matrix is always 1
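# Note: stats::cov2cor() performs the same scaling, so an equivalent shortcut
# would be: m <- -cov2cor(theta); diag(m) <- 1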

# 3. Apply names
rownames(m) <- colnames(df)
colnames(m) <- colnames(df)

knitr::kable(head(m[,1:6]), caption = 'sparse TF-TF correlation matrix') |> kableExtra::kable_styling(latex_options = 'scale_down')
sparse TF-TF correlation matrix
             Ahr Arid3a      Arnt     Arntl Atf1 Atf2
Ahr    1.0000000      0 0.0292973 0.0000000    0    0
Arid3a 0.0000000      1 0.0000000 0.0000000    0    0
Arnt   0.0292973      0 1.0000000 0.2880293    0    0
Arntl  0.0000000      0 0.2880254 1.0000000    0    0
Atf1   0.0000000      0 0.0000000 0.0000000    1    0
Atf2   0.0000000      0 0.0000000 0.0000000    0    1
## write to file
write.csv(m, paste0("partial_correlation_copula_",sample,"_",npn.func,"_sparsity_",sparsity,tag,".csv"))
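
Two optional sanity checks on m (the tiny Arnt/Arntl mismatch in the table above shows that the glasso estimate of \(Ω\) can be very slightly asymmetric):

# optional sanity checks
max(abs(m - t(m)))          # asymmetry is tiny (~1e-5), a numerical artifact
mean(m[upper.tri(m)] != 0)  # realized off-diagonal density, near the 0.15 target
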

4 create igraph plot

4.1 create dataframe for igraph

df <- data.frame(row=rownames(m)[row(m)[upper.tri(m)]], 
                 col=colnames(m)[col(m)[upper.tri(m)]], 
                 corr=m[upper.tri(m)])
### remove zero entries
df <- df[df$corr!=0,]
knitr::kable(head(df), caption = 'long form of TF correlation') |> kableExtra::kable_styling(latex_options = 'scale_down')
long form of TF correlation
      row   col      corr
2     Ahr  Arnt 0.0292973
6    Arnt Arntl 0.2880293
20   Atf1  Atf7 0.0258843
21   Atf2  Atf7 0.5207381
23 Arid3a  Batf 0.0252175
30 Arid3a   Bbx 0.0830651
print(dim(df))
## [1] 2179    3
print(paste('percent of correlation score >','0',':',sum(df$corr > 0) *100 / nrow(df),"%"))
## [1] "percent of correlation score > 0 : 99.7705369435521 %"
print(paste('percent of correlation score <','0',':',sum(df$corr < 0) *100 / nrow(df),"%"))
## [1] "percent of correlation score < 0 : 0.229463056447912 %"
df$corr <- abs(df$corr) # keep edge magnitude; negative partial correlations are <0.3% of edges here
write.csv(df, paste0("correlation_intensity_table_copula_",sample,"_",npn.func,"_sparsity_",sparsity,tag,".csv"))

4.2 import node meta

Node annotation came from the results of Fig. S3a.

  • TRM_Tex: multi-tasker in both TRM and TexTerm
  • Tex_specific/TRM_specific: single-tasker in TexTerm/TRM
  • Tex_important/TRM_important: important in TexTerm/TRM
  • Housekeeping: 54 universal TFs with small variation across cell types
nodes <- read.xlsx('TRM_Tex_TF_list_20240121.xlsx', sheetName = sample) %>% 
        tidyr::drop_na(TF) %>% 
        dplyr::mutate(group = ifelse(is.na(Specificity), Important, Specificity)) %>% 
        select(TF, group)
names(nodes) <- c('name','group')
knitr::kable(head(nodes), caption = 'node meta') |> kableExtra::kable_styling(latex_options = 'scale_down')
node meta
name    group
Bhlhe40 TRM_Tex
Nr4a2   TRM_Tex
Snai3   TRM_Tex
Xbp1    TRM_Tex
Egr1    TRM_Tex
Nfil3   TRM_Tex
print(dim(nodes))   
## [1] 170   2

4.3 graph generation

g <- graph_from_data_frame(df, directed = FALSE, vertices = nodes)
print(g, e=TRUE, v=TRUE)
## IGRAPH 68035fe UN-- 170 2179 -- 
## + attr: name (v/c), group (v/c), corr (e/n)
## + edges from 68035fe (vertex names):
##  [1] Ahr    --Arnt    Arnt   --Arntl   Atf1   --Atf7    Atf2   --Atf7   
##  [5] Arid3a --Batf    Arid3a --Bbx     Atf2   --Bbx     Bcl6   --Bbx    
##  [9] Bhlhe40--Arnt    Bhlhe40--Arntl   Bhlhe41--Arntl   Bhlhe40--Bhlhe41
## [13] Cbfb   --Bbx     Cebpb  --Bcl6    Creb1  --Atf1    Creb1  --Atf2   
## [17] Creb1  --Atf7    Crem   --Atf1    Crem   --Atf2    Crem   --Atf7   
## [21] Crem   --Creb1   Arid3a --Ctcf    Ctcf   --Arnt    Ctcf   --Bbx    
## [25] Ctcf   --Bcl6    Ctcf   --Cbfb    Cebpb  --Ctcf    Ddit3  --Arid3a 
## [29] Ddit3  --Atf2    Ddit3  --Bbx     Ddit3  --Bcl6    Ddit3  --Cbfb   
## + ... omitted several edges
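
Before clustering, a quick look at the best-connected TFs is a useful sanity check; a minimal sketch using igraph::degree:

# top hub TFs by number of partners in the TF-TF graph
head(sort(degree(g), decreasing = TRUE), 10)
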

4.4 Leiden clustering

Leiden clustering generates communities. Adjust res to get a reasonable number of groups and a high modularity score; a higher res yields more groups. Through experimentation, res = 1.1 worked best for this sample. A small sketch of a resolution scan follows below.
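
This is a minimal sketch, assuming g from above; it simply reruns cluster_leiden over a few resolution values and reports group counts and modularity:

# scan a few resolutions and compare group counts and modularity
for (r in c(0.5, 0.7, 0.9, 1.1, 1.3)) {
  set.seed(42)
  cl <- cluster_leiden(g, objective_function = "modularity",
                       resolution_parameter = r)
  cat(sprintf("res = %.1f: %d groups, modularity = %.3f\n",
              r, length(cl), modularity(g, membership(cl))))
}
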

set.seed(42)
res <- 1.1
clustering = cluster_leiden(g, objective_function="modularity",resolution_parameter=res)
print(clustering)
## IGRAPH clustering leiden, groups: 8, mod: NA
## + groups:
##   $`1`
##    [1] "Bhlhe40" "Xbp1"    "Heyl"    "Bhlhe41" "Tfeb"    "Usf1"    "Mlx"    
##    [8] "Srebf1"  "Arntl"   "Tfe3"   
##   
##   $`2`
##    [1] "Nr4a2"   "Nfil3"   "Gfi1"    "Zbtb3"   "Ddit3"   "Jun"     "Irf4"   
##    [8] "Hmg20b"  "Meox2"   "Fos"     "Isl2"    "Prdm1"   "Nfe2l1"  "Dmrt2"  
##   [15] "Zfp768"  "Smad4"   "Eomes"   "Zbtb20"  "Irf8"    "Arid3a"  "Foxj2"  
##   [22] "Batf"    "Nfatc1"  "Jund"    "Cebpb"   "Vax2"    "Foxd2"   "Zscan20"
##   [29] "Stat6"   "Etv6"    "Tcf3"    "Irf3"    "Stat4"   "Stat3"   "Gata3"  
##   + ... omitted several groups/vertices
print(sizes(clustering))
## Community sizes
##  1  2  3  4  5  6  7  8 
## 10 66 25 51 15  1  1  1
print(paste0('modularity:',modularity(g, membership(clustering))))
## [1] "modularity:0.350524037176596"
saveRDS(clustering,paste0('clustering_copula_',sample,'_',npn.func,'_res',res,'_s',sparsity,tag,'.rds'))
## write clustering result
clust <- data.frame(name = names(membership(clustering)),cluster = as.numeric(membership(clustering))) %>%
        dplyr::inner_join(nodes, by = "name") %>%
        dplyr::arrange(group, cluster)
knitr::kable(head(clust), caption = 'community clustering') |> kableExtra::kable_styling(latex_options = 'scale_down')
community clustering
name  cluster group
Stat6       2 Housekeeping
Etv6        2 Housekeeping
Tcf3        2 Housekeeping
Irf3        2 Housekeeping
Stat4       2 Housekeeping
Stat3       2 Housekeeping
write.csv(clust,paste0('clustering_membership_copula_',sample,'_',npn.func,'_res',res,'_s',sparsity,tag,'.csv'), row.names=F)
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS 15.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] factoextra_1.0.7   ggpubr_0.6.0       xlsx_0.6.5         huge_1.3.5        
## [5] RColorBrewer_1.1-3 dplyr_1.1.2        ggplot2_3.4.2      igraph_1.5.0      
## 
## loaded via a namespace (and not attached):
##  [1] xlsxjars_0.6.1    sass_0.4.6        utf8_1.2.3        generics_0.1.3   
##  [5] tidyr_1.3.0       xml2_1.3.4        rstatix_0.7.2     stringi_1.7.12   
##  [9] lattice_0.21-8    digest_0.6.31     magrittr_2.0.3    evaluate_0.21    
## [13] fastmap_1.1.1     jsonlite_1.8.7    Matrix_1.6-5      ggrepel_0.9.3    
## [17] backports_1.4.1   httr_1.4.6        rvest_1.0.3       purrr_1.0.1      
## [21] fansi_1.0.4       viridisLite_0.4.2 scales_1.2.1      jquerylib_0.1.4  
## [25] abind_1.4-5       cli_3.6.1         rlang_1.1.1       munsell_0.5.0    
## [29] withr_2.5.0       cachem_1.0.8      yaml_2.3.7        tools_4.3.1      
## [33] ggsignif_0.6.4    colorspace_2.1-0  webshot_0.5.5     kableExtra_1.3.4 
## [37] broom_1.0.5       vctrs_0.6.3       R6_2.5.1          lifecycle_1.0.3  
## [41] stringr_1.5.0     car_3.1-2         MASS_7.3-60       pkgconfig_2.0.3  
## [45] rJava_1.0-6       pillar_1.9.0      bslib_0.5.0       gtable_0.3.3     
## [49] glue_1.6.2        Rcpp_1.0.10       systemfonts_1.0.4 highr_0.10       
## [53] xfun_0.39         tibble_3.2.1      tidyselect_1.2.0  rstudioapi_0.14  
## [57] knitr_1.43        htmltools_0.5.5   svglite_2.1.1     carData_3.0-5    
## [61] rmarkdown_2.23    compiler_4.3.1