1 import packages and functions

suppressMessages(library(igraph))
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
suppressMessages(library(RColorBrewer))
suppressMessages(library(huge))
suppressMessages(library(xlsx))

set.seed(42)
fl.sources <- list.files("../../scripts/utils/", full.names = T)
tmp <- sapply(fl.sources,source)

2 prepare TF-TF correlation input

Since each context (TRM or TexTerm) has several networks in different replicate samples, we first calculated the mean edge weight of TF-regulatee pair across samples. We kept regulatees with moderate variation across TFs (sd>1). Then we calculated the spearman correlation of TFs based on the edge weight profile.

  • input: edge_combined.csv in each network/ folder
  • optional argument: pre-defined TF list, here, we only focus on TFs important in each context
  • intermediate file: mean_edge_weight.csv: rows are regulatees (regulated downstream genes) and columns are TFs
  • output: TF_corr.csv: rows and columns are TFs, symmetric matrix with diagonal as 1.

We’ll skip this part for demo purpose. The output file is provided.

# sample <- 'TRM' # 'TRM' or 'TexTerm'
# tag <- '_subset_TFs_v2'
# 
# L <- readLines(paste0(sample,'_samples.txt'))
# TFs <- readLines(paste0(sample,'_TFs_for_network_v3.txt')) 
# TFs <- lapply(TFs, toupper)
#  
# 
# edge_weight <- do.call("rbind", lapply(L, function(x) read.table(paste0(x,'/edges_combined.csv'), sep = ",", header = T) %>% select(c(1,2,3)) %>% setNames(c("TF","regulatee","weight")))) %>% dplyr::filter(TF %in% TFs) %>% 
#                group_by(TF, regulatee) %>% dplyr::summarize(weight = mean(weight, na.rm=TRUE)) %>% 
#                reshape2::dcast(regulatee ~ TF, value.var = "weight", fill=0) %>% 
#                tibble::column_to_rownames(var = "regulatee") 
#                                        
# df <- edge_weight %>% rowwise() %>% mutate(sd=sd(c_across(everything()))) %>% ungroup() %>% 
#                 filter(sd>1) %>% select(!matches("MOUSE$|[0-9\\.]{6}$")) %>% select(-sd) %>%
#                 rename_with(firstup) %>% as.matrix %>% cor(method = "spearman") %>% as.data.frame
# write.csv(df,file = paste0("TF_corr_",sample,tag,".csv")) # write to file
# load TF-TF correlation file
sample <- 'TexTerm' ### 'TRM' or 'TexTerm'
tag <- '_subset_TFs_v2'
df <- read.csv(paste0("TF_corr_",sample,tag,".csv"), row.names = 1)
knitr::kable(head(df[,1:6]), caption = 'TF-TF spearman correlation') |> kableExtra::kable_styling(latex_options = 'scale_down')
TF-TF spearman correlation
Ahr Arid3a Arnt Arntl Atf1 Atf2
Ahr 1.0000000 0.0159095 0.1552732 0.0674772 0.0967936 0.0653700
Arid3a 0.0159095 1.0000000 0.1085567 0.0820837 0.1034220 0.1501552
Arnt 0.1552732 0.1085567 1.0000000 0.8563119 0.1860339 0.1763554
Arntl 0.0674772 0.0820837 0.8563119 1.0000000 0.1636281 0.1480576
Atf1 0.0967936 0.1034220 0.1860339 0.1636281 1.0000000 0.5753676
Atf2 0.0653700 0.1501552 0.1763554 0.1480576 0.5753676 1.0000000
print(dim(df))
## [1] 170 170

3 copula pipeline to infer graph

We used R package huge to construct Gaussian graphical models (GGM) with copula modification. A little bit background to help with selction of hyper-parameters.

GGM assumes that the observations have a multivariate Gaussian distribution with mean \(µ\), and covariance matrix \(Σ\). The conditional independence can be implied by the inverse covariance (concentration) matrix \(Ω = Σ^{−1}\). If \(Ω_{jk} = 0\), then the i-th variable and j-th variables are conditional independent given all other variables.

This important property serves as the foundation for GGM to infer direct interactions from high-dimensional data. Unlike relevance networks or correlation networks, in which edges are determined based on marginal correlations, GGM provides a stronger criterion of dependency, and thus further reduces the false positive rate.

However, a great limitation of classic learning methods for GGM is the lack of sparsity in the resulting graph. A dense graph not only complicates downstream analysis but also raises the issue of overfitting the data. Thus it makes sense to impose an \(ℓ\) penalty for the estimation of \(Ω\), to increase its sparsity, and the sparse pattern of \(Ω\) is essentially the same as the adjacency matrix of the underlying undirected graph.

Additionally, Liu et al. developed a data transformation method called Copula that can be used with the graphical Lasso algorithm to relax the normality assumption of GGM. The intuition of Copula is to infer the the relationship of the variables without being influenced by their individual distributions

For more details about package huge, check the vignette. For instructions on selecting the hyper-parameters like npn.func, check the manual

npn.func <- 'shrinkage'
df.npn <- huge.npn(df, npn.func = npn.func) # default                      
## Conducting the nonparanormal (npn) transformation via shrunkun ECDF....done.
out.npn <- huge(df.npn, method = "glasso", nlambda = 30, lambda.min.ratio = 0.01) # fit glasso model to data
## 
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 3%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 6%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 9%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 13%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 16%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 19%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 23%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 26%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 30%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 33%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 36%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 40%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 43%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 46%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 50%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 53%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 56%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 60%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 63%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 66%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 70%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 73%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 76%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 80%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 83%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 86%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 90%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 93%
Conducting the graphical lasso (glasso) wtih lossless screening....in progress: 96%
## Conducting the graphical lasso (glasso)....done.                                          
saveRDS(out.npn,paste0("result_copula_",sample,'_',npn.func,tag,".rds"))

The program automatically sets up a sequence of 30 regularization parameters and estimates the corresponding graph path.

# default visualization provided by huge package
plot(out.npn)

3.1 get precision matrix

From the above plot, we can see the local minimum of sparsity is 0.15. We can infer the precision matrix when sparsity=0.15

sparsity <- 0.15
idx = which(out.npn$sparsity>=sparsity & out.npn$sparsity<=sparsity+0.01)[1]
print(paste0("lambda: ", out.npn$lambda[idx]))
## [1] "lambda: 0.0572582995483069"
print(paste0("sparsity: ", out.npn$sparsity[idx]))                       
## [1] "sparsity: 0.157117995127045"
m = out.npn$icov[[idx]]
rownames(m) <- rownames(df)
colnames(m) <- colnames(df)                       
knitr::kable(head(m[,1:6]), caption = 'sparse TF-TF correlation matrix') |> kableExtra::kable_styling(latex_options = 'scale_down')
sparse TF-TF correlation matrix
Ahr Arid3a Arnt Arntl Atf1 Atf2
Ahr 3.1226310 0.0143988 -0.0456422 0.0000000 -0.1233884 0.0000000
Arid3a 0.0142164 4.9156918 0.0000000 0.0000000 0.0000000 0.0000000
Arnt -0.0457246 0.0000000 7.0673992 -0.9341928 0.0000000 0.0000000
Arntl 0.0000000 0.0000000 -0.9345756 7.2953688 0.0000000 0.0000000
Atf1 -0.1234237 0.0000000 0.0000000 0.0000000 5.7669647 -0.0311552
Atf2 0.0000000 0.0000000 0.0000000 0.0000000 -0.0309058 5.5925174
## write to file
write.csv(m, paste0("precision_copula_",sample,"_",npn.func,"_sparsity_",sparsity,tag,".csv"))

4 create igraph plot

4.1 create dataframe for igraph

df <- data.frame(row=rownames(m)[row(m)[upper.tri(m)]], 
                 col=colnames(m)[col(m)[upper.tri(m)]], 
                 corr=m[upper.tri(m)])
### remove zero entries
df <- df[df$corr!=0,]
knitr::kable(head(df), caption = 'long form of TF correlation') |> kableExtra::kable_styling(latex_options = 'scale_down')
long form of TF correlation
row col corr
1 Ahr Arid3a 0.0143988
2 Ahr Arnt -0.0456422
6 Arnt Arntl -0.9341928
7 Ahr Atf1 -0.1233884
15 Atf1 Atf2 -0.0311552
20 Atf1 Atf7 -0.5549663
print(dim(df))
## [1] 2257    3
print(paste('percent of correlation score >','0',':',sum(df$corr > 0) *100 / nrow(df),"%"))
## [1] "percent of correlation score > 0 : 13.4692069118299 %"
print(paste('percent of correlation score <','0',':',sum(df$corr < 0) *100 / nrow(df),"%"))
## [1] "percent of correlation score < 0 : 86.5307930881701 %"
df$corr = abs(df$corr)
write.csv(df, paste0("correlation_intensity_table_copula_",sample,"_",npn.func,"_sparsity_",sparsity,tag,".csv"))

4.2 import node meta

Node annotation came from the results of Fig. S3a.

  • TRM_Tex: multi-tasker in both TRM and TexTerm
  • Tex_specific/TRM_specific: single-tasker in TexTerm/TRM
  • Tex_important/TRM_important: important in TexTerm/TRM
  • Housekeeping: 54 universal TFs with small variation across cell types
nodes <- read.xlsx('TRM_Tex_TF_list_20240121.xlsx', sheetName = sample) %>% tidyr::drop_na(TF) %>% dplyr::mutate(group=ifelse(is.na(Specificity), Important, Specificity)) %>% select(TF, group)
names(nodes) <- c('name','group')
knitr::kable(head(nodes), caption = 'node meta') |> kableExtra::kable_styling(latex_options = 'scale_down')
node meta
name group
Bhlhe40 TRM_Tex
Nr4a2 TRM_Tex
Snai3 TRM_Tex
Xbp1 TRM_Tex
Egr1 TRM_Tex
Nfil3 TRM_Tex
print(dim(nodes))   
## [1] 170   2

4.3 graph generation

g = graph_from_data_frame(df, directed=FALSE, vertices = nodes)
print(g, e=TRUE, v=TRUE)
## IGRAPH 051863b UN-- 170 2257 -- 
## + attr: name (v/c), group (v/c), corr (e/n)
## + edges from 051863b (vertex names):
##  [1] Ahr    --Arid3a  Ahr    --Arnt    Arnt   --Arntl   Ahr    --Atf1   
##  [5] Atf2   --Atf1    Atf1   --Atf7    Atf2   --Atf7    Arid3a --Bbx    
##  [9] Bhlhe40--Arnt    Bhlhe40--Arntl   Bhlhe41--Arid3a  Bhlhe41--Arnt   
## [13] Bhlhe41--Arntl   Bhlhe40--Bhlhe41 Ahr    --Cbfb    Cbfb   --Atf1   
## [17] Batf   --Cbfb    Cebpb  --Atf2    Ahr    --Creb1   Creb1  --Atf1   
## [21] Creb1  --Atf2    Creb1  --Atf7    Bhlhe41--Creb1   Crem   --Ahr    
## [25] Crem   --Atf1    Crem   --Atf2    Crem   --Atf7    Crem   --Bhlhe41
## [29] Crem   --Creb1   Ctcf   --Atf1    Ctcf   --Bbx     Cebpb  --Ctcf   
## + ... omitted several edges

4.4 Leiden clustering

Leiden clustering generates communities. Adjust the res to get a reasonable number of groups and high modularity score. High res gives more groups. Through experimenting, res=0.9 suits the best.

set.seed(42)
res <- 0.9
clustering = cluster_leiden(g, objective_function="modularity",resolution_parameter=res)
print(clustering)
## IGRAPH clustering leiden, groups: 5, mod: NA
## + groups:
##   $`1`
##    [1] "Bhlhe40" "Xbp1"    "Max"     "Setbp1"  "Mxd1"    "Tigd2"   "Heyl"   
##    [8] "Bhlhe41" "Tfeb"    "Usf1"    "Arnt"    "Mlx"     "Srebf1"  "Arntl"  
##   [15] "Tfe3"   
##   
##   $`2`
##    [1] "Nr4a2"   "Nfil3"   "Gfi1"    "Crem"    "Tef"     "Zbtb3"   "Homez"  
##    [8] "Thrb"    "Ddit3"   "Jun"     "Irf4"    "Meox2"   "Fos"     "Isl2"   
##   [15] "Prdm1"   "Nfe2l1"  "Dmrt2"   "Eomes"   "Zbtb20"  "Irf8"    "Arid3a" 
##   [22] "Foxj2"   "Batf"    "Nfatc1"  "Jund"    "Cebpb"   "Vax2"    "Foxd2"  
##   + ... omitted several groups/vertices
print(sizes(clustering))
## Community sizes
##  1  2  3  4  5 
## 15 66 24 45 20
print(paste0('modularity:',modularity(g, membership(clustering))))
## [1] "modularity:0.372509275038383"
saveRDS(clustering,paste0('clustering_copula_',sample,'_',npn.func,'_res',res,'_s',sparsity,tag,'.rds'))
## write clustering result
clust <- data.frame(name = names(membership(clustering)),cluster = as.numeric(membership(clustering))) %>%
        dplyr::inner_join(nodes, by = "name") %>%
        dplyr::arrange(group, cluster)
knitr::kable(head(clust), caption = 'community clustering') |> kableExtra::kable_styling(latex_options = 'scale_down')
community clustering
name cluster group
Creb1 2 Housekeeping
Stat6 2 Housekeeping
Irf3 2 Housekeeping
Stat4 2 Housekeeping
Gata3 2 Housekeeping
Irf1 2 Housekeeping
write.csv(clust,paste0('clustering_membership_copula_',sample,'_',npn.func,'_res',res,'_s',sparsity,tag,'.csv'), row.names=F)

4.5 visualization of graph

We need to set some parameters before plot.

# load genes which we want to highlight in the plot
label_targets <- readLines(paste0(sample,'_gene_annotation.txt'))
common_targets <- readLines('common_genes_annotated.txt')

## generate top degree nodes for each community
clust <- data.frame(name = names(membership(clustering)),
                cluster = as.numeric(membership(clustering)),
                degree=degree(g)) %>%
dplyr::inner_join(nodes, by = "name") %>%
dplyr::group_by(cluster) %>% slice_max(order_by=degree, n = 5) %>% pull(name) %>% 
writeLines(paste0('top_degree_nodes_per_community_',sample,'_',npn.func,'_res',res,"_s",sparsity,"_",sample,tag,'.txt'))

# combine top degree nodes to labeled list as well
tmp <- readLines(paste0('top_degree_nodes_per_community_',sample,'_',npn.func,'_res',res,"_s",sparsity,"_",sample,tag,'.txt'))
label_targets <- union(label_targets, tmp)

## add cluster info
V(g)$cluster <- clustering$membership # add community info to graph
V(g)$group <- ifelse(V(g)$name %in% label_targets,sample,"other")
V(g)$group2 <- ifelse(V(g)$name %in% common_targets,"common","specific")

# generate colors based on cluster
colors <- RColorBrewer::brewer.pal(length(clustering),"Set2")

V(g)$color <- colors[V(g)$cluster]
V(g)$color <- ifelse(V(g)$group2=='common',V(g)$color,paste0(V(g)$color,'7f')) # if common TFs, use darker shade

V(g)$frame.color <- "NA"

V(g)$label <- ifelse(V(g)$group == "other",NA,V(g)$name)
V(g)$label.color <- 'black'
V(g)$size <- 0.3*degree(g)
V(g)$label.cex <- 0.75

# set distance of label
V(g)$label.dist <- 0.4

# set edge color
E(g)$color <- "gray80"
# E(g)$size <- 0.1
E(g)$width <- 1.5*E(g)$corr
cut_off <- 0.9
E(g)$lty <- ifelse(E(g)$corr>=unname(quantile(E(g)$corr, cut_off)), 1, 0) # only show top edges

# set the within-module edges to some large weight, and the between module edges to some small weight
# and then choose 'layout_with_fr' to make the grouped layout 
edge.weights <- function(community, network, weight.within = 10, weight.between = 1) {
    bridges <- crossing(communities = community, graph = network)
    weights <- ifelse(test = bridges, yes = weight.between, no = weight.within)
    return(weights) 
}
E(g)$weight <- edge.weights(clustering, g)
plot(g, layout=layout_with_fr)

## write graph to object
saveRDS(g, paste0("graph_",npn.func,'_res',res,"_s",sparsity,"_",sample,"_",tag,".rds"))

5 session info

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS Sonoma 14.6.1
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] ggpubr_0.6.0       xlsx_0.6.5         huge_1.3.5         RColorBrewer_1.1-3
## [5] dplyr_1.1.2        ggplot2_3.4.2      igraph_1.5.0      
## 
## loaded via a namespace (and not attached):
##  [1] xlsxjars_0.6.1    sass_0.4.6        utf8_1.2.3        generics_0.1.3   
##  [5] tidyr_1.3.0       xml2_1.3.4        rstatix_0.7.2     stringi_1.7.12   
##  [9] lattice_0.21-8    digest_0.6.31     magrittr_2.0.3    evaluate_0.21    
## [13] fastmap_1.1.1     jsonlite_1.8.7    Matrix_1.6-5      backports_1.4.1  
## [17] httr_1.4.6        rvest_1.0.3       purrr_1.0.1       fansi_1.0.4      
## [21] viridisLite_0.4.2 scales_1.2.1      jquerylib_0.1.4   abind_1.4-5      
## [25] cli_3.6.1         rlang_1.1.1       munsell_0.5.0     withr_2.5.0      
## [29] cachem_1.0.8      yaml_2.3.7        tools_4.3.1       ggsignif_0.6.4   
## [33] colorspace_2.1-0  webshot_0.5.5     kableExtra_1.3.4  broom_1.0.5      
## [37] vctrs_0.6.3       R6_2.5.1          lifecycle_1.0.3   stringr_1.5.0    
## [41] car_3.1-2         MASS_7.3-60       pkgconfig_2.0.3   rJava_1.0-6      
## [45] pillar_1.9.0      bslib_0.5.0       gtable_0.3.3      glue_1.6.2       
## [49] Rcpp_1.0.10       systemfonts_1.0.4 highr_0.10        xfun_0.39        
## [53] tibble_3.2.1      tidyselect_1.2.0  rstudioapi_0.14   knitr_1.43       
## [57] htmltools_0.5.5   svglite_2.1.1     carData_3.0-5     rmarkdown_2.23   
## [61] compiler_4.3.1