1 import packages and functions

suppressMessages(library(igraph))
suppressMessages(library(ggplot2))
suppressMessages(library(dplyr))
suppressMessages(library(RColorBrewer))
suppressMessages(library(huge))
suppressMessages(library(xlsx))

set.seed(42)
fl.sources <- list.files("../../scripts/utils/", full.names = T)
tmp <- sapply(fl.sources,source)

2 prepare TF-gene edge weight input

Since each context (TRM or TexTerm) has several networks from different replicate samples, we first computed the mean edge weight of each TF-regulatee pair across samples. We then kept regulatees with moderate variation across TFs (sd > 1).

  • input: edges_combined.csv in each network/ folder
  • optional argument: a pre-defined TF list; here we focus only on TFs important in each context
  • output file: mean_edge_weight.csv, where rows are regulatees (regulated downstream genes) and columns are TFs. We skip this step for demo purposes; the output file is provided, and a toy sketch of the sd > 1 filter follows the skipped chunk below.
# sample <- 'TRM' # 'TRM' or 'TexTerm'
# tag <- '_subset_TFs_v2'
# 
# L <- readLines(paste0(sample,'_samples.txt'))
# TFs <- readLines(paste0(sample,'_TFs_for_network_v3.txt')) 
# TFs <- toupper(TFs) # vectorized; keeps TFs a character vector for the %in% filter below
#  
# 
# edge_weight <- do.call("rbind", lapply(L, function(x) read.table(paste0(x,'/edges_combined.csv'), sep = ",", header = T) %>% select(c(1,2,3)) %>% setNames(c("TF","regulatee","weight")))) %>% dplyr::filter(TF %in% TFs) %>% 
#                group_by(TF, regulatee) %>% dplyr::summarize(weight = mean(weight, na.rm=TRUE)) %>% 
#                reshape2::dcast(regulatee ~ TF, value.var = "weight", fill=0) %>% 
#                tibble::column_to_rownames(var = "regulatee") 
#                                        
# df <- edge_weight %>% rowwise() %>% mutate(sd=sd(c_across(everything()))) %>% ungroup() %>% 
#                 filter(sd>1) %>% select(!matches("MOUSE$|[0-9\\.]{6}$")) %>% select(-sd) %>%
#                 rename_with(firstup)
# write.csv(df,file = paste0("mean_",sample,"_edge_weight",tag,".csv")) # write to file
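
Since that chunk is skipped here, the following is a minimal toy sketch (hypothetical gene/TF names and values) of the sd > 1 filter applied in its last step:

# toy sketch of the sd filter (hypothetical values, not real edge weights)
toy <- data.frame(TF1 = c(0, 5.0, 0.1), TF2 = c(0.2, 0.4, 0.1),
                  row.names = c("geneA", "geneB", "geneC"))
keep <- apply(toy, 1, sd) > 1  # sd of edge weights across TFs, per regulatee
toy[keep, , drop = FALSE]      # only geneB passes the sd > 1 cutoff
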
sample <- 'TexTerm' ### 'TRM' or 'TexTerm'
tag <- '_subset_TFs_v2'
df <- read.csv(paste0("mean_",sample,"_edge_weight",tag,".csv"), row.names = 1)
colnames(df) <- firstup(colnames(df))
knitr::kable(df[10000:10006,1:6], caption = 'TF-gene edge weight') |> kableExtra::kable_styling(latex_options = 'scale_down')
TF-gene edge weight
        Ahr   Arid3a     Arnt     Arntl     Atf1     Atf2
MICU1     0 0.000000 7.977929 10.103853 0.000000  0.00000
MICU2     0 0.000000 0.000000  0.000000 0.000000  0.00000
MICU3     0 2.876754 0.000000  0.000000 0.000000  0.00000
MID1      0 3.793183 0.000000  0.000000 0.000000  0.00000
MID1IP1   0 0.000000 0.000000  0.000000 0.000000  0.00000
MIDN      0 2.554116 5.850666  2.887889 5.682970 14.71426
MIEF1     0 0.000000 0.000000  0.000000 4.736144  0.00000
print(dim(df))
## [1] 17270   170

3 copula pipeline to infer graph

We used the R package huge to construct Gaussian graphical models (GGM) with a copula modification. A bit of background helps with the selection of hyper-parameters.

GGM assumes that the observations follow a multivariate Gaussian distribution with mean \(µ\) and covariance matrix \(Σ\). Conditional independence is encoded by the inverse covariance (concentration) matrix \(Ω = Σ^{−1}\): if \(Ω_{jk} = 0\), then the j-th and k-th variables are conditionally independent given all other variables.

This important property serves as the foundation for GGM to infer direct interactions from high-dimensional data. Unlike relevance networks or correlation networks, in which edges are determined based on marginal correlations, GGM provides a stronger criterion of dependency, and thus further reduces the false positive rate.
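
As a minimal toy illustration of this distinction (synthetic data, not part of the pipeline): two variables driven by a common third are strongly correlated marginally, yet their partial correlation, read off the precision matrix, is near zero.

# toy example: marginal vs partial correlation
set.seed(1)
z  <- rnorm(1000)                      # common driver
x1 <- z + rnorm(1000, sd = 0.5)
x2 <- z + rnorm(1000, sd = 0.5)
cor(x1, x2)                            # strong marginal correlation (~0.8)
omega <- solve(cov(cbind(z, x1, x2)))  # precision matrix Omega = Sigma^-1
-cov2cor(omega)[2, 3]                  # partial correlation of x1, x2 given z: ~0
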

However, a major limitation of classical learning methods for GGM is the lack of sparsity in the resulting graph. A dense graph not only complicates downstream analysis but also raises the risk of overfitting. It therefore makes sense to impose an \(ℓ_1\) penalty on the estimation of \(Ω\) to increase its sparsity; the sparsity pattern of \(Ω\) is then essentially the adjacency matrix of the underlying undirected graph.

Additionally, Liu et al. developed a data transformation, the copula (nonparanormal) transformation, that can be combined with the graphical lasso algorithm to relax the normality assumption of GGM. The intuition behind the copula is to infer the relationships among variables without being influenced by their individual marginal distributions.

For more details about the huge package, check its vignette. For instructions on selecting hyper-parameters such as npn.func, check the manual.
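
As a quick sanity check of what huge.npn does (a sketch on synthetic data, not the real input): a monotone transform such as exp() leaves ranks intact but breaks normality, and the npn transform restores approximately Gaussian margins while preserving those ranks.

# sketch: npn transform on synthetic, monotonically distorted Gaussian data
set.seed(1)
x <- matrix(rnorm(200 * 3), 200, 3)
x.skewed <- exp(x)                         # heavily right-skewed margins
x.npn <- huge.npn(x.skewed, npn.func = "shrinkage", verbose = FALSE)
cor(rank(x.skewed[, 1]), rank(x.npn[, 1])) # ranks preserved: exactly 1
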

npn.func <- 'shrinkage'
df.npn <- huge.npn(df, npn.func = npn.func) # default                      
## Conducting the nonparanormal (npn) transformation via shrunkun ECDF....done.
out.npn <- huge(df.npn, method = "glasso", nlambda = 30, lambda.min.ratio = 0.01) # fit glasso model to data
## Conducting the graphical lasso (glasso) wtih lossless screening....in progress (repeated progress lines omitted)
## Conducting the graphical lasso (glasso)....done.                                          
saveRDS(out.npn,paste0("result_copula_",sample,'_',npn.func,tag,".rds"))

The program automatically sets up a sequence of 30 regularization parameters and estimates the corresponding graph path.

# default visualization provided by huge package
plot(out.npn)
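
The same information can be read off the fitted object directly; a small sketch, assuming out.npn from above:

# inspect the regularization path: sparsity level per lambda
path <- data.frame(lambda = out.npn$lambda, sparsity = out.npn$sparsity)
head(path)
plot(path$lambda, path$sparsity, type = "b", log = "x",
     xlab = "lambda", ylab = "sparsity")
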

3.1 get precision matrix

From the plot above, sparsity ≈ 0.15 is a reasonable operating point on the regularization path, so we infer the precision matrix at sparsity = 0.15.

sparsity <- 0.15
idx <- which(out.npn$sparsity >= sparsity & out.npn$sparsity <= sparsity + 0.01)[1]
print(paste0("lambda: ", out.npn$lambda[idx]))
## [1] "lambda: 0.0903400786315669"
print(paste0("sparsity: ", out.npn$sparsity[idx]))
## [1] "sparsity: 0.151653324051514"
# 1. Get the Precision Matrix (Inverse Covariance)
theta <- as.matrix(out.npn$icov[[idx]])

# 2. Convert Precision Matrix to Partial Correlation Matrix
# Formula: rho_ij = - theta_ij / sqrt(theta_ii * theta_jj)
diag_theta <- diag(theta)
m <- -theta / sqrt(outer(diag_theta, diag_theta, "*"))
diag(m) <- 1 # The diagonal of a correlation matrix is always 1
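# Note: stats::cov2cor() performs the same scaling, so an equivalent shortcut
# would be: m <- -cov2cor(theta); diag(m) <- 1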

# 3. Apply names
rownames(m) <- colnames(df)
colnames(m) <- colnames(df)

knitr::kable(head(m[,1:6]), caption = 'sparse TF-TF correlation matrix') |> kableExtra::kable_styling(latex_options = 'scale_down')
sparse TF-TF correlation matrix
             Ahr Arid3a      Arnt     Arntl Atf1 Atf2
Ahr    1.0000000      0 0.0292973 0.0000000    0    0
Arid3a 0.0000000      1 0.0000000 0.0000000    0    0
Arnt   0.0292973      0 1.0000000 0.2880293    0    0
Arntl  0.0000000      0 0.2880254 1.0000000    0    0
Atf1   0.0000000      0 0.0000000 0.0000000    1    0
Atf2   0.0000000      0 0.0000000 0.0000000    0    1
## write to file
write.csv(m, paste0("partial_correlation_copula_",sample,"_",npn.func,"_sparsity_",sparsity,tag,".csv"))
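
Two optional sanity checks on m (the tiny Arnt/Arntl mismatch in the table above shows that the glasso estimate of \(Ω\) can be very slightly asymmetric):

# optional sanity checks
max(abs(m - t(m)))          # asymmetry is tiny (~1e-5), a numerical artifact
mean(m[upper.tri(m)] != 0)  # realized off-diagonal density, near the 0.15 target
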

4 create igraph plot

4.1 create dataframe for igraph

df <- data.frame(row=rownames(m)[row(m)[upper.tri(m)]], 
                 col=colnames(m)[col(m)[upper.tri(m)]], 
                 corr=m[upper.tri(m)])
### remove zero entries
df <- df[df$corr!=0,]
knitr::kable(head(df), caption = 'long form of TF correlation') |> kableExtra::kable_styling(latex_options = 'scale_down')
long form of TF correlation
      row   col      corr
2     Ahr  Arnt 0.0292973
6    Arnt Arntl 0.2880293
20   Atf1  Atf7 0.0258843
21   Atf2  Atf7 0.5207381
23 Arid3a  Batf 0.0252175
30 Arid3a   Bbx 0.0830651
print(dim(df))
## [1] 2179    3
print(paste('percent of correlation score >','0',':',sum(df$corr > 0) *100 / nrow(df),"%"))
## [1] "percent of correlation score > 0 : 99.7705369435521 %"
print(paste('percent of correlation score <','0',':',sum(df$corr < 0) *100 / nrow(df),"%"))
## [1] "percent of correlation score < 0 : 0.229463056447912 %"
df$corr <- abs(df$corr) # keep edge magnitude; negative partial correlations are <0.3% of edges here
write.csv(df, paste0("correlation_intensity_table_copula_",sample,"_",npn.func,"_sparsity_",sparsity,tag,".csv"))

4.2 import node meta

Node annotation came from the results of Fig. S3a.

  • TRM_Tex: multi-tasker in both TRM and TexTerm
  • Tex_specific/TRM_specific: single-tasker in TexTerm/TRM
  • Tex_important/TRM_important: important in TexTerm/TRM
  • Housekeeping: 54 universal TFs with small variation across cell types
nodes <- read.xlsx('TRM_Tex_TF_list_20240121.xlsx', sheetName = sample) %>% 
        tidyr::drop_na(TF) %>% 
        dplyr::mutate(group = ifelse(is.na(Specificity), Important, Specificity)) %>% 
        select(TF, group)
names(nodes) <- c('name','group')
knitr::kable(head(nodes), caption = 'node meta') |> kableExtra::kable_styling(latex_options = 'scale_down')
node meta
name    group
Bhlhe40 TRM_Tex
Nr4a2   TRM_Tex
Snai3   TRM_Tex
Xbp1    TRM_Tex
Egr1    TRM_Tex
Nfil3   TRM_Tex
print(dim(nodes))   
## [1] 170   2

4.3 graph generation

g <- graph_from_data_frame(df, directed = FALSE, vertices = nodes)
print(g, e=TRUE, v=TRUE)
## IGRAPH 68035fe UN-- 170 2179 -- 
## + attr: name (v/c), group (v/c), corr (e/n)
## + edges from 68035fe (vertex names):
##  [1] Ahr    --Arnt    Arnt   --Arntl   Atf1   --Atf7    Atf2   --Atf7   
##  [5] Arid3a --Batf    Arid3a --Bbx     Atf2   --Bbx     Bcl6   --Bbx    
##  [9] Bhlhe40--Arnt    Bhlhe40--Arntl   Bhlhe41--Arntl   Bhlhe40--Bhlhe41
## [13] Cbfb   --Bbx     Cebpb  --Bcl6    Creb1  --Atf1    Creb1  --Atf2   
## [17] Creb1  --Atf7    Crem   --Atf1    Crem   --Atf2    Crem   --Atf7   
## [21] Crem   --Creb1   Arid3a --Ctcf    Ctcf   --Arnt    Ctcf   --Bbx    
## [25] Ctcf   --Bcl6    Ctcf   --Cbfb    Cebpb  --Ctcf    Ddit3  --Arid3a 
## [29] Ddit3  --Atf2    Ddit3  --Bbx     Ddit3  --Bcl6    Ddit3  --Cbfb   
## + ... omitted several edges
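
Before clustering, a quick look at the best-connected TFs is a useful sanity check; a minimal sketch using igraph::degree:

# top hub TFs by number of partners in the TF-TF graph
head(sort(degree(g), decreasing = TRUE), 10)
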

4.4 Leiden clustering

Leiden clustering generates communities. Adjust res to get a reasonable number of groups and a high modularity score; a higher res yields more groups. Through experimentation, res = 1.1 worked best for this sample. A small sketch of a resolution scan follows below.
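
This is a minimal sketch, assuming g from above; it simply reruns cluster_leiden over a few resolution values and reports group counts and modularity:

# scan a few resolutions and compare group counts and modularity
for (r in c(0.5, 0.7, 0.9, 1.1, 1.3)) {
  set.seed(42)
  cl <- cluster_leiden(g, objective_function = "modularity",
                       resolution_parameter = r)
  cat(sprintf("res = %.1f: %d groups, modularity = %.3f\n",
              r, length(cl), modularity(g, membership(cl))))
}
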

set.seed(42)
res <- 1.1
clustering = cluster_leiden(g, objective_function="modularity",resolution_parameter=res)
print(clustering)
## IGRAPH clustering leiden, groups: 8, mod: NA
## + groups:
##   $`1`
##    [1] "Bhlhe40" "Xbp1"    "Heyl"    "Bhlhe41" "Tfeb"    "Usf1"    "Mlx"    
##    [8] "Srebf1"  "Arntl"   "Tfe3"   
##   
##   $`2`
##    [1] "Nr4a2"   "Nfil3"   "Gfi1"    "Zbtb3"   "Ddit3"   "Jun"     "Irf4"   
##    [8] "Hmg20b"  "Meox2"   "Fos"     "Isl2"    "Prdm1"   "Nfe2l1"  "Dmrt2"  
##   [15] "Zfp768"  "Smad4"   "Eomes"   "Zbtb20"  "Irf8"    "Arid3a"  "Foxj2"  
##   [22] "Batf"    "Nfatc1"  "Jund"    "Cebpb"   "Vax2"    "Foxd2"   "Zscan20"
##   [29] "Stat6"   "Etv6"    "Tcf3"    "Irf3"    "Stat4"   "Stat3"   "Gata3"  
##   + ... omitted several groups/vertices
print(sizes(clustering))
## Community sizes
##  1  2  3  4  5  6  7  8 
## 10 66 25 51 15  1  1  1
print(paste0('modularity:',modularity(g, membership(clustering))))
## [1] "modularity:0.350524037176596"
saveRDS(clustering,paste0('clustering_copula_',sample,'_',npn.func,'_res',res,'_s',sparsity,tag,'.rds'))
## write clustering result
clust <- data.frame(name = names(membership(clustering)),cluster = as.numeric(membership(clustering))) %>%
        dplyr::inner_join(nodes, by = "name") %>%
        dplyr::arrange(group, cluster)
knitr::kable(head(clust), caption = 'community clustering') |> kableExtra::kable_styling(latex_options = 'scale_down')
community clustering
name  cluster group
Stat6       2 Housekeeping
Etv6        2 Housekeeping
Tcf3        2 Housekeeping
Irf3        2 Housekeeping
Stat4       2 Housekeeping
Stat3       2 Housekeeping
write.csv(clust,paste0('clustering_membership_copula_',sample,'_',npn.func,'_res',res,'_s',sparsity,tag,'.csv'), row.names=F)
sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: aarch64-apple-darwin20 (64-bit)
## Running under: macOS 15.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRblas.0.dylib 
## LAPACK: /Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.11.0
## 
## locale:
## [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
## 
## time zone: America/Los_Angeles
## tzcode source: internal
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
## [1] factoextra_1.0.7   ggpubr_0.6.0       xlsx_0.6.5         huge_1.3.5        
## [5] RColorBrewer_1.1-3 dplyr_1.1.2        ggplot2_3.4.2      igraph_1.5.0      
## 
## loaded via a namespace (and not attached):
##  [1] xlsxjars_0.6.1    sass_0.4.6        utf8_1.2.3        generics_0.1.3   
##  [5] tidyr_1.3.0       xml2_1.3.4        rstatix_0.7.2     stringi_1.7.12   
##  [9] lattice_0.21-8    digest_0.6.31     magrittr_2.0.3    evaluate_0.21    
## [13] fastmap_1.1.1     jsonlite_1.8.7    Matrix_1.6-5      ggrepel_0.9.3    
## [17] backports_1.4.1   httr_1.4.6        rvest_1.0.3       purrr_1.0.1      
## [21] fansi_1.0.4       viridisLite_0.4.2 scales_1.2.1      jquerylib_0.1.4  
## [25] abind_1.4-5       cli_3.6.1         rlang_1.1.1       munsell_0.5.0    
## [29] withr_2.5.0       cachem_1.0.8      yaml_2.3.7        tools_4.3.1      
## [33] ggsignif_0.6.4    colorspace_2.1-0  webshot_0.5.5     kableExtra_1.3.4 
## [37] broom_1.0.5       vctrs_0.6.3       R6_2.5.1          lifecycle_1.0.3  
## [41] stringr_1.5.0     car_3.1-2         MASS_7.3-60       pkgconfig_2.0.3  
## [45] rJava_1.0-6       pillar_1.9.0      bslib_0.5.0       gtable_0.3.3     
## [49] glue_1.6.2        Rcpp_1.0.10       systemfonts_1.0.4 highr_0.10       
## [53] xfun_0.39         tibble_3.2.1      tidyselect_1.2.0  rstudioapi_0.14  
## [57] knitr_1.43        htmltools_0.5.5   svglite_2.1.1     carData_3.0-5    
## [61] rmarkdown_2.23    compiler_4.3.1