1 Installation

You can install the development version of vssHiC from Github.

is_installed = "vssHiC" %in% rownames(installed.packages())
if (! is_installed){
  library(devtools)
  devtools::install_github("nedashokraneh/vssHiC")
}
library(vssHiC)
library(coolR)
library(ggplot2)
library(ggpubr)
library(HiContacts)

2 Background

Genome-wide chromosome conformation capture assay Hi-C provides insights into chromatin 3D structures and their functional implications. Read counts from the Hi-C assay do not have constant variance, meaning we expect an observed interaction between a pair of genomic bins with an observed interaction of 1000 reads to have different (usually higher) uncertainty than another pair of genomic bins with 100 reads. This property of heteroskedasticity impedes visualization and downstream analyses. vssHiC is a statistical method to transform read counts so transformed signals have a unit variance across a dynamic range given a Hi-C experiment with two biological replicates available. We show that the heatmap visualization of pairwise contacts within a genomic region is more visually appealing given vssHiC signals compared to the raw and log-transformed signals. Furthermore, vssHiC signals improve the performance of subcompartment caller relying on Gaussian observed variables. vssHiC works with a well-developed R class (InteractionSet) and file format (.cool) for Hi-C data.

3 Load data

vssHiC takes two processed Hi-C data in .cool format as input and vssHiC::cool2IntSet loads interactions corresponding to one pair of chromosomes into InteractionSet object. vssHiC::cool2IntSet is adapted from coolR::cool2IntSet with two small modifications to load InteractionSet object corresponding to only one pair of chromosomes and handle inter-chromosomal interactions.

files = list("../sample_data/GM12878_rep1.cool",
             "../sample_data/GM12878_rep2.cool")
chr = "21"
res = NULL

iset = vssHiC::cool2IntSet(files, chr, chr, res)
iset
## class: InteractionSet 
## dim: 60667 2 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames(2): IF1 IF2
## colData names(1): totals
## type: GInteractions
## regions: 357

3.1 Variance instability of raw Hi-C signals

Estimated mean and standard deviations (stdev) across a dynamic range based on VSS algorithm show the dependency between mean and variance such that higher signal intensities have higher variance. This dependency is robust to auxiliary type choice (whether replicate 1 or replicate 2 is used as an auxiliary signal) as long as total read counts of two replicates are not very different.

# total number of read read counts per replicate

print(colSums(assay(iset)))
##      IF1      IF2 
## 14944983 12594605
train.result1 = vssHiC::vssHiC.train(iset, "rep1")
train.result2 = vssHiC::vssHiC.train(iset, "rep2")

ggplot() +
  geom_line(data = train.result1$vss_mean_sd[train.result1$vss_mean_sd$mean < 500, ],
            aes(x = mean, y = pred_sd, colour = "rep1")) +
  geom_line(data = train.result2$vss_mean_sd[train.result2$vss_mean_sd$mean < 500, ],
            aes(x = mean, y = pred_sd, colour = "rep2")) +
  xlab("Mean") + ylab("Fitted stdev") +
  theme(axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12)) +
  scale_color_manual(name = "Aux type", 
                     values = c("rep1" = "darkblue", 
                                "rep2" = "red"))

4 Variance stabilization of Hi-C signals

vss.iset = vssHiC::vssHiC.transform(iset, train.result1$vss_fit)
vss.iset
## class: InteractionSet 
## dim: 60667 2 
## metadata(0):
## assays(1): counts
## rownames: NULL
## rowData names(0):
## colnames(2): rep1 rep2
## colData names(0):
## type: GInteractions
## regions: 357

4.1 vssHiC signals have a unit variance

After variance stabilization, vssHiC signals have an approximate unit variance (stdev = 1) across a dynamic range.

vss.result = vssHiC::vssHiC.train(vss.iset, "rep1")
ggplot(data = vss.result$vss_mean_sd) +
  geom_point(aes(x = mean, y = sd)) +
  geom_line(aes(x = mean, y = pred_sd), colour = "green") +
  xlab("Mean") + ylab("Stdev") +
  theme(axis.title.x = element_text(size = 12),
        axis.title.y = element_text(size = 12)) 

5 Downstream analysis

5.1 Visualization

One application of signals with a unit variance is their visualization. For example, the heatmap visualization of chromatin contact maps should reflect differences across a dynamic range of signals including small intensity of long-range interactions and large intensity of short-range interactions. A common heuristic variance stabilizing transformation, log function, inflates the variance of low-intensity signals while stabilizing the variance of larger signals. Therefore, the visualization of log-scaled contact maps results in a blurry expected plaid pattern of long-range interactions. The plotMatrix function from the HiContacts package designs a manual color map (coolerColors) carefully picked for Hi-C contact heatmaps for visualization of log-scaled signals. We show that signals with a unit variance alleviate the need for manual color maps and avoid blurry plaid patterns even with default color maps like bgrColors.

vss_GI = iset2GI(vss.iset, 1)
vss_GI_sub = subset_GI(vss_GI, 100000, 30000000, 42000000)
vss_mat_plot1 = plotMatrix(vss_GI_sub, scale = "linear", cmap = bgrColors())
vss_mat_plot2 = plotMatrix(vss_GI_sub, scale = "linear", cmap = coolerColors())
ggarrange(vss_mat_plot1,
          vss_mat_plot2,
          ncol=2)

raw_GI = iset2GI(iset, 1)
raw_GI_sub = subset_GI(raw_GI, 100000, 30000000, 42000000)
log_mat_plot1 = plotMatrix(raw_GI_sub, scale = "log10", cmap = bgrColors())
log_mat_plot2 = plotMatrix(raw_GI_sub, scale = "log10", cmap = coolerColors())
ggarrange(log_mat_plot1,
          log_mat_plot2,
          ncol=2)

5.2 Wrapper functions for TAD callers given InteractionSet object

vssHiC has wrapper functions for three TAD callers, TopDom, SpectralTAD, and HiCseg to take InteractionSet as input, call TADs, and return GRanges object including identified TAD ranges. GRanges objects can be easily visualized on top of the plotMatrix output.

rep_num = 1
hicseg_res = iset_hicseg(iset, 
                         rep_num, 
                         max_cp = 40, 
                         model = "D", 
                         dist = "G")

window_size = 2000000
topdom_res = iset_topdom(iset,
                         rep_num,
                         100000,
                         window_size)

spectral_res = iset_spectralTAD(iset,
                                rep_num,
                                100000,
                                window_size)

spectral_plot = plotMatrixBorder(vss_GI_sub, spectral_res, 
                                 scale = "linear", title = "SpectralTAD")
topdom_plot = plotMatrixBorder(vss_GI_sub, topdom_res,
                               scale = "linear", title = "TopDom")
hicseg_plot = plotMatrixBorder(vss_GI_sub, hicseg_res,
                               scale = "linear", title = "HiCseg")
ggarrange(spectral_plot,
          topdom_plot,
          hicseg_plot,
          ncol=2)
## $`1`
## 
## $`2`
## 
## attr(,"class")
## [1] "list"      "ggarrange"

6 Citation

Neda Shokraneh Kenari, Faezeh Bayat, Maxwell Libbrecht. VSS-Hi-C: Variance-stabilized signals for chromatin contacts. bioRxiv 2021.10.19.

sessionInfo()
## R version 4.3.1 (2023-06-16)
## Platform: x86_64-pc-linux-gnu (64-bit)
## Running under: Ubuntu 20.04.6 LTS
## 
## Matrix products: default
## BLAS:   /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0 
## LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
## 
## locale:
##  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C              
##  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8    
##  [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8   
##  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C                 
##  [9] LC_ADDRESS=C               LC_TELEPHONE=C            
## [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C       
## 
## time zone: America/Vancouver
## tzcode source: system (glibc)
## 
## attached base packages:
## [1] stats4    stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] HiContacts_1.5.0            HiCExperiment_1.2.0        
##  [3] ggpubr_0.6.0                ggplot2_3.5.0              
##  [5] coolR_0.0.0.9000            rhdf5_2.46.1               
##  [7] InteractionSet_1.30.0       SummarizedExperiment_1.32.0
##  [9] Biobase_2.62.0              MatrixGenerics_1.14.0      
## [11] matrixStats_1.2.0           GenomicRanges_1.54.1       
## [13] GenomeInfoDb_1.38.8         IRanges_2.36.0             
## [15] S4Vectors_0.40.2            BiocGenerics_0.48.1        
## [17] vssHiC_0.0.0.9000          
## 
## loaded via a namespace (and not attached):
##  [1] bitops_1.0-7            rlang_1.1.3             magrittr_2.0.3         
##  [4] compiler_4.3.1          vctrs_0.6.5             stringr_1.5.1          
##  [7] pkgconfig_2.0.3         crayon_1.5.2            fastmap_1.1.1          
## [10] backports_1.4.1         XVector_0.42.0          labeling_0.4.3         
## [13] utf8_1.2.4              rmarkdown_2.26          tzdb_0.4.0             
## [16] ggbeeswarm_0.7.2        strawr_0.0.91           purrr_1.0.2            
## [19] bit_4.0.5               xfun_0.43               haarfisz_4.5.4         
## [22] zlibbioc_1.48.2         cachem_1.0.8            wavethresh_4.7.2       
## [25] jsonlite_1.8.8          highr_0.10              rhdf5filters_1.14.1    
## [28] DelayedArray_0.28.0     Rhdf5lib_1.24.2         BiocParallel_1.36.0    
## [31] jpeg_0.1-10             cluster_2.1.6           broom_1.0.5            
## [34] parallel_4.3.1          R6_2.5.1                bslib_0.6.1            
## [37] stringi_1.8.3           car_3.1-2               jquerylib_0.1.4        
## [40] Rcpp_1.0.12             knitr_1.45              readr_2.1.5            
## [43] Matrix_1.6-5            tidyselect_1.2.1        rstudioapi_0.16.0      
## [46] abind_1.4-5             yaml_2.3.8              codetools_0.2-20       
## [49] lattice_0.22-6          tibble_3.2.1            withr_3.0.0            
## [52] evaluate_0.23           ggrastr_1.0.2           PRIMME_3.2-6           
## [55] pillar_1.9.0            carData_3.0-5           generics_0.1.3         
## [58] vroom_1.6.5             RCurl_1.98-1.12         hms_1.1.3              
## [61] munsell_0.5.1           scales_1.3.0            glue_1.7.0             
## [64] tools_4.3.1             BiocIO_1.12.0           data.table_1.15.4      
## [67] RSpectra_0.16-1         locfit_1.5-9.9          ggsignif_0.6.4         
## [70] Cairo_1.6-2             cowplot_1.1.3           grid_4.3.1             
## [73] tidyr_1.3.1             colorspace_2.1-0        GenomeInfoDbData_1.2.11
## [76] beeswarm_0.4.0          vipor_0.4.7             cli_3.6.2              
## [79] fansi_1.0.6             S4Arrays_1.2.1          dplyr_1.1.4            
## [82] gtable_0.3.4            rstatix_0.7.2           DESeq2_1.42.1          
## [85] sass_0.4.4              digest_0.6.35           SparseArray_1.2.4      
## [88] farver_2.1.1            htmltools_0.5.8.1       lifecycle_1.0.4        
## [91] HiCseg_1.1              bit64_4.0.5             MASS_7.3-60.0.1