The modelling of data obtained from chromatin assays can provide a visual landscape of the genetic and epigenetic interactions occurring in cells or tissues at a specific time point1. Several computational programs have been developed to process these datasets, fill the gaps, find constants, and predict the possible role of epigenetic elements in gene regulation. This review explores how chromatin datasets can predict transcription factor activity in different contexts and how this information has been leveraged to provide insights into normal, developmental and disease conditions.
DNA is packaged in the cell nucleus as 146 base pairs (bp) coiled around an octamer of histone proteins (two copies each of H2A, H2B, H3 and H4). The resulting structure is referred to as the nucleosome2, which can form different arrangements. The basic level, the 10 nm fibre or 'beads on a string', is loose and consists of nucleosomes connected in series, separated by about 60 bp of linker DNA. Further compaction is tighter and involves histone H1 (the linker histone), heterochromatin protein 1 (HP1), Mg2+ and other positively charged molecules 3. This compaction may generate a 30 nm fibre arrangement (though discredited by some authors 4,5) or nucleosome clutches, probably depending on the external ion concentration 6. Ultimately, these chromatin arrangements lead to condensation and the formation of chromosomes during cell division (Figure 1).
“Figure 1. Chromatin structure and arrangement. Image modified from 7”
Nucleosome arrangements determine the chromatin structure, which is categorised into two principal interchangeable states: euchromatin, which is open, contains nucleosome-depleted regions (NDR) or nucleosome-free regions (NFR) and is relaxed for transcription; and heterochromatin, which is compacted, closed, repressed or transcriptionally inactive. These states have an essential role in the accessibility of DNA to the regulators that modulate transcription, DNA replication, recombination and repair 8. Changes in chromatin structure can be regulated by histone post-translational modifications (PTM). The histone N-terminal tails are rich in positively charged amino acids such as lysine and arginine, which are prone to biochemical modifications such as acetylation, methylation, phosphorylation, sumoylation or ubiquitination 9. These PTM affect chromatin compaction. For example, it has been proposed that acetyl groups neutralise the positive charges of the amino acids, loosening the DNA-histone interaction and making the DNA more accessible for transcription, whereas deacetylation tends to compact the chromatin, making it less accessible 10. RNA-chromatin interactions have shown similar outcomes 11. Besides the change of charge that histone PTM may generate, their main role is to act as docking sites for proteins that recognise them (readers) and recruit ATP-dependent enzymes (chromatin remodellers) 12,13. Remodellers alter nucleosome arrangements by reorganising, ejecting or sliding nucleosomes, or by interacting with other proteins to replace histones with variants (H2A.X/Z/B/1H Macro-H2A, H3.3)14,15. These alterations allow the binding of transcription factors (TFs) to the DNA 16, regulate the enhancer landscape 17 or provide genome stability after incidental damage 8. The wide range of enzymes that lay down (writers), read (readers) and remove (erasers) the functional groups are group- and site-specific. 
Group-specific refers to enzymes interacting with specific biochemical marks. For instance, acetylation is laid down by histone acetyltransferases (HATs), removed by histone deacetylases (HDACs) and read by proteins containing ‘bromodomains’ 18, whereas histone methylation is laid down by methyltransferases, removed by histone demethylases and read by chromodomains. In addition, methylation laid down by a specific complex (polycomb repressive complex 2, PRC2) is associated with heterochromatin states 19. Site-specific refers to the amino acid to which a specific PTM is attached and its potential functional role, alone or in combination with others. Histone acetylation has been widely correlated with active transcriptional outcomes: some marks are highly correlated with active gene promoters (H3K9ac), enhancers (H3K27ac) and early replication states (H3K9ac, H3K27ac). In contrast, histone methylation has been correlated with both active and repressive transcription states. Some active transcription marks are H3K4me3, H3K36me3, H3K9me3 and H4K20me. Marks associated with enhancer regions are H3K4me1,2, H2K36me, H3K27me and H2K5me. Some associated with early replication are H3K4me1,2,3 and H4K20me, while some found in late replication are H3K9me2 and H3K9me3 19. Understanding the position and dynamics of histone patterns and their correlation with regulation has allowed the prediction and categorisation of functional regions in developmental stages, cell differentiation 20 or diseases. Furthermore, human PTM and their associated functions have been compiled by different authors 21-23 and integrated into databases 24.
The information that regulates transcription comprises two kinds of elements: cis and trans. Cis-regulatory elements (CREs) are DNA sequences containing transcription factor binding sites (TFBS). Promoters and proximal CREs are located near the transcription start site (TSS, the point where the transcription machinery assembles)25. In contrast, distal CREs such as enhancers, super-enhancers, insulators, silencers and locus control regions (LCR) lie tens to thousands of kilobases upstream or downstream of the TSS, sometimes within the target gene26. CREs are recognised by trans elements, referred to as general transcription factors (GTF), which involve a broad range of promoter-activator proteins (activators), coactivators and mediators. To allow RNA polymerase II (Pol II) transcription of protein-coding genes, a transcription complex must form at the TSS. Since nucleosome occupancy or other DNA-binding proteins may occlude the CREs, the TSS and promoters of active genes are often characterised by NFR. Silent genes have also shown open and accessible CREs, but they have been associated with regulation through repressive histone marks such as H3K27me3 27 and key silencer TFs 28. It has therefore been suggested that nucleosome occupancy is not strictly correlated with repressive or active states 29; instead, the states are influenced by histone variants (H2A.Z and H3.3), histone PTM correlated with nucleosome turnover (H3K4me), or short unstable RNA transcripts associated with enhancer activity (eRNAs)30. In 2019, four models were proposed 26 to explain how accessibility is established by the action of TFs and remodellers on DNA. The first is parsimonious competition between the TF and the nucleosome, in which a TF at higher concentration can bind DNA and recruit stabilising cofactors (possible only in euchromatin). 
The second is cis-proximal displacement of H1 and architectural proteins, where TFs bind to linker DNA and destabilise these proteins, spreading the effect to proximal nucleosomes. The third is ‘TF-trans control’, where TFs bind to distal CREs and recruit other cofactors that evict nucleosomes in trans, forming a loop for enhancer-promoter interactions 31. The fourth is pioneer TF binding to nucleosomal DNA, which promotes histone core displacement directly or recruits chromatin remodellers to establish an open chromatin state. It has also been proposed that torsional stress may trigger the nucleosome eviction needed for transcription to proceed 32. Chromatin is not static but is dynamically condensed and decondensed to enable transcriptional activity. Transcription requires the formation of an initiation complex and the availability and interaction of CREs, TFs and coactivators, with the participation of histone PTM and chromatin remodeller complexes. Gene transcription regulation is therefore mediated in part by the chromatin 31.
Chromatin has been studied by analytic experiments that reflect its organisation33, biochemical marks, associated proteins and arrangements34. Its organisation has been studied using chromosome conformation capture-based techniques; for example, the Hi-C method captures interactions between all chromatin fragments simultaneously, producing genome-wide interaction maps35. Associated proteins and histone marks have been studied using ChIP-seq experiments36 (although mass spectrometry and SELEX methods can also investigate them). Accessible chromatin can be studied by ATAC-seq, DNase-seq, NicE-seq and FAIRE-seq experiments. Finally, nucleosome positioning can be visualised using ATAC-seq and MNase-seq 1,37. Integrating the information provided by these assays leads to the identification of CREs and the subsequent creation of chromatin state maps38 that, according to the histone marks, associated proteins or accessible regions, reveal gene activity and its variability among cell types, tissues, developmental stages or disease conditions. These assays and their improvements are reviewed in the following paragraphs.
3D Chromatin assay
One widely used technique to investigate the spatial organisation of chromatin is Hi-C. The traditional technique uses formaldehyde to cross-link interacting DNA regions, followed by restriction digestion and incorporation of biotin-linked nucleotides to repair the cut ends. The ends are then proximity ligated, the cross-linking is reversed and the biotin is removed. The resulting fragments are paired-end sequenced. Modifications and improvements that remove steps such as the biotin labelling have been proposed 35, using specialised computational software for the data processing 33. 
Similarly, protocols for single-cell analysis are available 39, with improvements to the computational analysis that account for their possible bias40. In general terms, the data from all Hi-C assays must undergo quality control in which low-quality and erroneous signals are filtered out, after which the reads are aligned to a reference genome, assigned to bins and normalised. The programs used to analyse and visualise Hi-C data often already integrate these steps. Although the alignment can be performed using common alignment software such as Bowtie, dedicated Hi-C packages are available that prioritise visualisation resolution (Juicebox) 41, multiscale contact maps (HiGlass)42 or mitigation of sequencing depth limitations when detecting loops (Coolpup.py) 43, or that are adapted to work with sc-Hi-C data (Galaxy-HiC Explorer)40. Hi-C analysis displays the interactions, and how they may affect genome function, as rectangular or triangular heatmaps or as arc or circular plots, with static or interactive zooming depending on the program used. Hi-C research has allowed chromatin to be categorised into compartments (A and B) related to active and inactive transcription states, and has revealed the chromosome territories and the interactions of genes in relation to their spatial position44.
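The binning and normalisation steps described above can be illustrated with a minimal sketch. It assumes read pairs have already been aligned and filtered, and that each pair is reduced to two positions on a single chromosome. The function names, the toy coordinates and the square-root coverage correction are illustrative assumptions, not the procedure of any of the packages cited; dedicated tools typically apply more elaborate balancing.

```python
from collections import Counter


def bin_contacts(pairs, bin_size):
    """Count read pairs per (bin_i, bin_j) with bin_i <= bin_j."""
    counts = Counter()
    for pos_a, pos_b in pairs:
        i, j = sorted((pos_a // bin_size, pos_b // bin_size))
        counts[(i, j)] += 1
    return counts


def coverage_normalise(counts):
    """Divide each count by the square root of the product of the
    two bins' total coverage (a simple one-pass correction)."""
    coverage = Counter()
    for (i, j), n in counts.items():
        coverage[i] += n
        if j != i:
            coverage[j] += n
    return {
        (i, j): n / (coverage[i] * coverage[j]) ** 0.5
        for (i, j), n in counts.items()
    }


# Toy input: positions (bp) of the two ends of each ligated pair.
pairs = [(100, 250), (120, 900), (950, 130), (400, 450)]
counts = bin_contacts(pairs, bin_size=200)
normalised = coverage_normalise(counts)
```

The resulting dictionary is a sparse upper-triangular contact map, which is the kind of matrix that heatmap viewers display.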
Regions of DNA hypersensitive to the enzyme DNase I (EC 3.1.21.1) have been used to identify regulatory elements such as promoters, insulators, enhancers, locus control regions and silencers45 within open chromatin regions. The method is based on the digestion of chromatin by DNase I and the subsequent attachment of a biotinylated linker to the cleaved ends. The DNA fragments can then be identified by next-generation sequencing (NGS). Some protocols work at the single-cell level (scDNase-seq)45,46. The limitations of this method relate mostly to the millions of cells it requires and to data processing, although many bioinformatics tools have addressed the latter.
The assay for transposase-accessible chromatin followed by sequencing (ATAC-seq) was developed 47 to detect open, accessible chromatin regions using Tn5 transposase, which cuts accessible regions of chromatin and attaches adapters to them. Although initially created to process bulk populations of cells, it has been rapidly and continually improved 48,49: for the analysis of single cells (scATAC-seq) 50,51, for use on mitochondrial genetic material (mtsATAC-seq)52, for diverse frozen cell and tissue samples (Omni-ATAC) 53, and for greater efficiency by using intact nuclei during droplet barcoding (dscATAC-seq) 54. In addition, simulation programs have been released (simATAC-seq) to generate in silico scATAC-seq samples; they estimate the parameters of the read distributions and generate a count array that depicts the regulatory landscape of cells sharing similar characteristics55.
The Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE) protocol was introduced in 2009 56 to isolate genomic regions depleted of nucleosomes. It is based on cross-linking proteins to DNA with formaldehyde, shearing the chromatin by sonication and extracting the DNA, followed by NGS (qPCR is suggested for identifying individual loci). The limitation of this method is the large number of cells it requires. In 2018, Segorbe et al. 57 released a protocol that works with 100‐fold fewer cells in yeast; however, it has not been tested on human tissues. Given the low cost of this assay, Seuter et al. 58 described the laboratory issues it may present and potential solutions to them. The authors also made recommendations on when FAIRE-seq is more appropriate than ATAC-seq, which depends on the experiment, since FAIRE-seq allows neither footprinting nor nucleosome positioning analysis.
Nicking enzyme assisted sequencing (NicE-seq) is another assay proposed for high-resolution open chromatin profiling in living and formaldehyde-fixed cells. Fixed cells require cross-linking with formaldehyde and biotin labelling, whereas native cells do not. The samples are then incubated with the nicking enzyme Nt.CviPII, which nicks the DNA with sequence specificity (CC followed by A, G or T). After that step, the genomic DNA containing the putative open chromatin regions is purified, fragmented and captured for library construction. The method was compared with ATAC-seq and DNase-seq data, showing similar peaks and overlaps above 70%. Its advantage is its robustness across different cell preparations 59. It has also been improved for a wide variety of mammalian cells and tissues (UniNicE-seq)60.
Micrococcal nuclease digestion followed by NGS (MNase-seq) is an assay that has been used to identify nucleosome positioning 61. It uses the endo-exonuclease MNase (EC 3.1.31.1) to digest the naked DNA found between nucleosomes. After digestion, the nucleosomes are released from chromatin and the fragments are sequenced. The sequences obtained provide nucleosome occupancy profiles37 and can also be related to areas of chromatin accessibility29. However, this method has been criticised because the results depend on the level of digestion and might not reveal an accurate relation between nucleosome occupancy and DNA accessibility; to correct this issue and improve quality control (QC), the MACC protocol 29 and the CAM pipeline 62 have been proposed. Furthermore, recent protocols have been published to decipher the nucleosome landscape more accurately 37, though their use on human cells is still untested.
Chromatin immunoprecipitation coupled with sequencing (ChIP-seq) is a technique used to profile histone modifications, transcription factors and nucleosome positioning 63. Protein-DNA interactions are cross-linked with formaldehyde; the complexes are then sheared by sonication or nuclease treatment to obtain small DNA fragments, and antibodies against the DNA-binding protein or histone modification are added. After that, the DNA is released from the proteins and analysed using NGS 64. Several protocols modifying the traditional method have been published. Some eliminate the cross-linking or sonication steps (CUT&RUN)65 or adopt the Tn5 enzyme (CUT&Tag) 66; others use lower cell numbers, from 1000 cells (ChIL–seq) 67 or fewer than 100 cells (itChIP-seq 34, MOWChIP 68, nanoChIP-seq 69, ULI-NChIP). Others study marks or TFs at the individual cell level (sc-ChIP-seq 70, coBATCH 71), or reduce potential false negatives or positives (Exo-ChIP-seq) 72 and improve quality control 73. For the detection of histone marks, specific adaptations of the protocol have emerged that use microfluidic chambers, such as LIFE-ChIP 74, MOWChIP 75 and SurfaceChIP-seq 76. The choice of method depends on the regulatory mark to be studied, the availability of reagents, the number of cells available for analysis and the bias that each assay may show. Nevertheless, DNase-seq and ATAC-seq for chromatin accessibility profiling, and ChIP-seq, have been the most often used.
NGS equipment generates several gigabases of data per experimental run, which are processed according to the desired analysis. However, all NGS data are initially processed in the same way. The data obtained may contain artefacts: read mistakes, base-calling errors, insertions, deletions, poor-quality reads and contamination with primers or adaptors. These errors affect downstream analysis, so the reads are first subjected to quality control. Once this is done, the reads are aligned to the reference genome. Alignment is the process in which two or more sequences (here, the reference sequence and the sequences obtained by NGS) are compared and edited using gaps and substitutions to maximise their similarity 77, which is measured by assigning positive scores to matches and penalties (negative scores) to mismatches 78. The University of California Santa Cruz (UCSC) Genome Browser and the Genome Reference Consortium (GRC) are sources for the human reference genome. After alignment, a second QC analysis is performed in which improperly paired reads, reads in blacklisted regions 79,80 and duplicated reads arising from PCR artefacts are removed. Once the sequences are aligned and filtered, a ‘peak calling’ analysis is performed. Depending on the assay, peak analysis normalises and counts the fragments and compares them with others to find enrichments related to promoters, enhancers or TFs.
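The scoring idea behind alignment (positive scores for matches, penalties for mismatches and gaps) can be made concrete with a minimal global alignment sketch in the style of Needleman-Wunsch dynamic programming. The function name and the scoring values (+1, -1, -2) are illustrative assumptions; production aligners such as Bowtie rely on genome indexing and heuristics rather than this exhaustive computation.

```python
def global_alignment_score(ref, read, match=1, mismatch=-1, gap=-2):
    """Best global alignment score between two sequences.

    Only one row of the dynamic programming table is kept at a time.
    """
    # Row 0: aligning an empty reference prefix against read prefixes.
    prev = [j * gap for j in range(len(read) + 1)]
    for i in range(1, len(ref) + 1):
        curr = [i * gap]  # read prefix is empty: i gaps
        for j in range(1, len(read) + 1):
            same = ref[i - 1] == read[j - 1]
            diagonal = prev[j - 1] + (match if same else mismatch)
            gap_in_read = prev[j] + gap
            gap_in_ref = curr[j - 1] + gap
            curr.append(max(diagonal, gap_in_read, gap_in_ref))
        prev = curr
    return prev[-1]


identical = global_alignment_score("ACGT", "ACGT")      # four matches
substitution = global_alignment_score("ACGT", "ACCT")   # one mismatch
deletion = global_alignment_score("ACGT", "AGT")        # one gap
```

With these toy scores, a single substitution costs less than a single gap, which is why the choice of penalties shapes which alignment is reported.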
The goal of experiments that analyse protein-DNA interactions, such as ChIP-seq and SELEX, is to discover the sequences for which the TF has affinity. This is the task of the peak calling process. Active binding sites are seen as peaks in enrichment profiles, and the intensity of a specific peak reflects the frequency of interaction between the TF and the nucleotides it binds (interaction strength). Some software uses the frequency of nucleotides at each position of a site, in the form of position weight matrices (PWM), which can subsequently be visualised as a sequence logo (Figure 2). Although MEME-ChIP 81 and HOMER 82 have been widely used, current software for de novo motif discovery includes MotifHyades, based on expectation maximisation 83; BaMM, based on a Bayesian Markov model 84; SamSelect 85; and STREME. These have demonstrated highly accurate performance compared with previous state-of-the-art software such as MEME-ChIP and HOMER 86.
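As a minimal sketch of how a PWM relates to nucleotide frequencies, the following example builds a frequency matrix from a handful of aligned binding sites, converts it to log-odds weights and computes the per-position information content that sets the letter heights in a sequence logo. The site sequences, the pseudocount and the uniform background of 0.25 are assumptions chosen for illustration; this is not the algorithm of any motif discovery tool named above.

```python
import math

BASES = "ACGT"


def position_frequency_matrix(sites, pseudocount=1.0):
    """Per-position base frequencies from equal-length aligned sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    total = len(sites) + 4 * pseudocount
    matrix = []
    for pos in range(length):
        column = [site[pos] for site in sites]
        matrix.append(
            {b: (column.count(b) + pseudocount) / total for b in BASES}
        )
    return matrix


def position_weight_matrix(freqs, background=0.25):
    """Log2 odds of each base relative to a uniform background."""
    return [
        {b: math.log2(col[b] / background) for b in BASES} for col in freqs
    ]


def information_content(freqs):
    """Bits per position (0 = uninformative, 2 = fully conserved)."""
    return [2.0 + sum(p * math.log2(p) for p in col.values()) for col in freqs]


# Hypothetical aligned sites, for illustration only.
sites = ["TGACTCA", "TGAGTCA", "TGACTCA", "TTACTCA"]
freqs = position_frequency_matrix(sites)
pwm = position_weight_matrix(freqs)
bits = information_content(freqs)
```

Positions where one base dominates receive high weights and high information content, whereas variable positions approach zero bits.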
Data from experiments that assess the binding affinity of TFs to DNA sequences (SELEX, ChIP-seq, MNase-seq) are stored in databases such as JASPAR87, CIS-BP, TRANSFAC and HOCOMOCO88. Moreover, Gene Ontology, the Reactome Pathway Database and the Broad Institute's MSigDB contain sets of genes associated with their functions.
“Figure 2. Motif discovery process from ChIP-seq and SELEX. Own elaboration, created in Biorender.”
Accessible chromatin assays can provide information about the transcription factor binding sites (TFBS) present in chromatin open regions (COR). Many methods now exist to identify and predict them. One method screens the COR for putative TFBS using known transcription factor binding motifs in PWM form as input89. These reference motifs are obtained from the JASPAR, HOCOMOCO or TRANSFAC databases and describe the binding affinity of the TF for the DNA sequence (motif), which allows putative TFBS to be predicted indirectly from accessible DNA experiments 84,90. Nevertheless, this method has limitations: the length of the PWM is limited and might be insufficient when incomplete sites or half sites are identified, shorter sequences are prone to false positives, and degenerate regions are usually not considered91.
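The PWM-based screen described above can be sketched as a sliding-window scan of both strands of an accessible region. The three-position matrix, its weights, the threshold and the sequence below are toy assumptions for illustration; real reference motifs would come from a database such as JASPAR, and the sketch assumes sequences containing only A, C, G and T.

```python
def reverse_complement(seq):
    """Reverse complement of an ACGT-only sequence."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def scan(sequence, pwm, threshold):
    """Return (forward_start, strand, matched_window, score) for every
    window on either strand scoring at or above the threshold."""
    width = len(pwm)
    hits = []
    strands = (("+", sequence), ("-", reverse_complement(sequence)))
    for strand, seq in strands:
        for start in range(len(seq) - width + 1):
            window = seq[start:start + width]
            score = sum(pwm[k][base] for k, base in enumerate(window))
            if score >= threshold:
                # Report coordinates relative to the forward strand.
                if strand == "+":
                    pos = start
                else:
                    pos = len(sequence) - start - width
                hits.append((pos, strand, window, score))
    return hits


# Toy log-odds matrix favouring the motif TGA (hypothetical weights).
toy_pwm = [
    {"A": -2.0, "C": -2.0, "G": -2.0, "T": 2.0},
    {"A": -2.0, "C": -2.0, "G": 2.0, "T": -2.0},
    {"A": 2.0, "C": -2.0, "G": -2.0, "T": -2.0},
]
hits = scan("CCTGACCTCAGG", toy_pwm, threshold=6.0)
```

A three-position matrix like this one matches by chance very often in a real genome, which mirrors the false-positive problem of short motifs noted above.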
A second method searches for footprints within the COR and matches the sites to putative TFs. A footprint is the sequence pattern that a TF leaves when it binds to DNA, preventing cleavage by Tn5 or DNase I. The sites can be discovered by visualising the footprint mark across peaks and matching known motifs or identifying novel ones 92 (Figure 3). However, some authors consider that a drawback of this method is the ambiguity it may create when assigning TFs to individual footprints93.
“Figure 3. Motif identification from chromatin accessibility datasets. Own elaboration, created in Biorender.”
Currently, the majority of methods integrate additional data to discover TFBS, such as motif match score, sequence conservation length or CpG islands (Mocap)91. Others add genome annotations of gene expression, such as FactorNet 94, DeFcom 95 and CENTIPEDE 96; associate eRNAs 30; consider changes in DNA accessibility (BagFoot) 97; or infer TF activity (DAStk) 98. Methods that consider gene expression have even won awards in the ENCODE-DREAM challenge (Anchor) 99 100. Although the previous programs can work with DNase-seq and ATAC-seq, some have been developed to process only ATAC-seq reads (HINT-ATAC) 101 because of the bias that each protocol may have. A global footprint analysis of the human genome has recently described around 4.5 million sequences 93. Although it used an incomplete sampling of tissues, it might serve as a good reference for the potential presence of TFBS in distinct cell types. However, a complete footprint analysis that covers the most accurate sites across several tissues and cells is still lacking. Integrating the prediction of multiple regulatory elements has allowed the development of computational programs that consider potential motif mutations (MAGGIE) 102 or that model and infer gene regulatory interactions from high-throughput data, such as ISMARA 103, LISA 104 or SMITE105. However, SMITE requires additional data, such as an interaction network, gene annotations and statistical tests from epigenomic profiles, to model network modules.
TFBS identification allows the subsequent prediction of enhancers and super-enhancers. Experimentally, enhancers have been identified by knocking out the enhancer sequence and observing the resulting reduction in gene expression 106, or by applying protocols such as STARR-seq, in which DNA is sheared and the fragments are inserted into plasmids to detect enhancer activity 107. The resulting categorisation is then used to make predictions 108. 
ENCODE phase 3 has identified 926,535 potential CREs109. CRE prediction, on the other hand, uses computational methods focused on identifying motif sequences conserved through evolution110 or on using known, experimentally obtained CREs as a reference. However, since enhancer activity is associated with histone modifications and chromatin accessibility, some software predicts enhancers from combinations of features such as DNase I hypersensitive sites (DHS), the histone modification code 111 or known TFs. For example, super-enhancers (those that regulate cell identity genes and confer high expression) have been predicted by identifying enrichment of H3K27ac or binding of specific TFs112.
ATAC-seq and MNase-seq experiments yield fragments that reflect nucleosome positioning. In ATAC-seq, fragments shorter than 100 bp are expected to cluster upstream of the TSS, whereas fragments of ~200, ~400 or ~600 bp, corresponding to mono-, di- and tri-nucleosomes, are expected to be depleted from the TSS and to display periodic peaks upstream or downstream of it 113. Nucleosomes can provide information about occupancy within regulatory regions across a single population of cells, because such cells can exhibit heterogeneity in expression114. Regulatory sequences may thus be hidden by nucleosomes at certain moments, while at others they are uncovered and affect transcription. Although some authors suggest that software developed for MNase-seq, such as DANPOS2 115 and PuFFIN 116, can be used for ATAC-seq90, some tools are specific to it, such as NucleoATAC and HMMRATAC 117 118. The data are visualised in V-plots, where the fragments are ordered according to their length around the TFBS (Figure 4).
“Figure 4. Representation of a V-plot graph. A illustrates an arrangement in which two pairs of nucleosomes delimit a chromatin open region truncated by a TF, and the order of the fragments in that situation. B represents the characteristic V-plot pattern chart, where 0 corresponds to the TFBS. Obtained from 1”
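The fragment-length classes used to interpret ATAC-seq data can be sketched as a simple binning of fragment lengths. The window boundaries below are hypothetical tolerances placed around the sub-100 bp, ~200, ~400 and ~600 bp classes mentioned in the text; dedicated tools such as NucleoATAC or HMMRATAC model these distributions rather than applying fixed cut-offs.

```python
from collections import Counter

# Hypothetical half-open length windows in bp: (name, low, high).
FRAGMENT_CLASSES = [
    ("nucleosome_free", 0, 100),
    ("mono_nucleosome", 150, 250),
    ("di_nucleosome", 350, 450),
    ("tri_nucleosome", 550, 650),
]


def classify_fragment(length):
    """Assign a fragment length to a class, or 'unassigned'."""
    for name, low, high in FRAGMENT_CLASSES:
        if low <= length < high:
            return name
    return "unassigned"


def fragment_profile(lengths):
    """Count fragments per class for a list of fragment lengths."""
    return Counter(classify_fragment(n) for n in lengths)


# Toy fragment lengths, for illustration only.
profile = fragment_profile([45, 80, 195, 210, 405, 610, 120])
```

Plotting such classes against distance from a TFBS or TSS is what produces the periodic pattern described for ATAC-seq fragments.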
Once the ChIP-seq datasets for histone marks are processed, the histone PTM are often depicted as piled-up peaks across the genome. Read counts in promoter regions, which are predefined as ±2 kb around the TSS, identify marked promoters. Bivalent promoters show overlapping marks such as H3K4me3 and H3K27me3 for at least 400 bp 76. Enhancers and super-enhancers are predicted using computer programs such as ROSE, which stitches together enhancers within 15 kb of each other and excludes those within 2 kb of annotated TSS. There are also dedicated tools for processing histone modification data, such as CODA 119, which learn mapping patterns to provide high-quality histone ChIP-seq data.
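The promoter window and the enhancer stitching rule can be sketched as interval operations on a single chromosome. The function names and the toy coordinates are assumptions for illustration, and the distances are the values quoted in the text; this is not the ROSE implementation, which additionally ranks stitched regions by signal to separate super-enhancers from typical enhancers.

```python
def promoter_window(tss, flank=2000):
    """Promoter region defined as +/- flank bp around a TSS."""
    return (max(0, tss - flank), tss + flank)


def stitch_enhancers(enhancers, tss_positions, max_gap=15000,
                     tss_exclusion=2000):
    """Drop enhancers within tss_exclusion bp of any TSS, then merge
    the remaining ones separated by at most max_gap bp."""
    kept = [
        (start, end)
        for start, end in sorted(enhancers)
        if not any(
            start - tss_exclusion <= tss <= end + tss_exclusion
            for tss in tss_positions
        )
    ]
    stitched = []
    for start, end in kept:
        if stitched and start - stitched[-1][1] <= max_gap:
            previous_start, previous_end = stitched[-1]
            stitched[-1] = (previous_start, max(previous_end, end))
        else:
            stitched.append((start, end))
    return stitched


# Toy enhancer peaks (start, end) and one annotated TSS.
enhancers = [(10000, 11000), (20000, 21000), (50000, 51000), (99000, 99500)]
stitched = stitch_enhancers(enhancers, tss_positions=[100000])
promoter = promoter_window(100000)
```

In this toy case the first two peaks are merged, the third is too far away to be stitched, and the last is discarded because it lies inside the TSS exclusion zone.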
The identification of PTM (especially acetylation and methylation) and their correlation with specific CREs, such as enhancers and promoters, and with trans factors, such as TBP-associated factor 1 (TAF1), RNA polymerase II (RNAPII) and the p300 enzyme, laid the basis for the categorisation of data in the form of chromatin states 120 and showed their potential to predict regulatory elements in the human genome 121. As these studies were extended, specific features came to characterise regulatory elements: for example, the H3K4me1 mark for primed enhancers, H3K4me and H3K27ac for active enhancers, H3K4me3 to H3K4me1 for promoters, H3K36me3 with RNAPII for transcribed regions, and H3K27me3 and H3K9me3 for repressive chromatin states36 (Figure 5). Even the use of one simple signature122 has allowed differences in gene expression to be found between disease samples and normal ones 123. However, the integration of more regulatory elements could reveal the biological differences between pathological and non-affected tissues more accurately. Combining all the epigenetic marks has allowed wider systematic annotations 124 that reflect the probability of each state at promoters, TSS, transcribed promoters, transcribed regions, strong enhancers, enhancers, heterochromatin, insulator regions and satellite repeats, providing a landscape of states that change across cell fate. Given their utility, improved computational methods that depict chromatin states more accurately have since been developed 125-130 so far 38,51,131
“Figure 5. Representation of chromatin states according to their representative marks. Image obtained from 36, unedited”
Chromatin states have allowed the visualisation and discrimination of epigenetic features: those that differ between sexes 128, the epigenetic landscape of X chromosome inactivation across tissues and the genes affected by it 132, and the epigenomic features that differentiate tissues, such as gastrointestinal versus brain, foetal versus adult tissues, or tumours versus normal samples 128. Chromatin profiling has also allowed the identification of epigenetic differences in diseased cells with or without treatment. For example, Grosselin et al. found loss of H3K27me3 both in chemotherapy-resistant cells and in some untreated cells (‘resistant-like cells’), suggesting that the epigenetic features typical of treatment-resistant cells are already present in some tumour cells 133. This feature could therefore predict the prognosis of some tumours under treatment before the treatment begins.
Public research consortia have enabled the storage and identification of functional elements discovered throughout the human genome. ENCODE 134, Roadmap Epigenomics135 and the International Human Epigenome Consortium (IHEC)136 maintain databases containing the data of several chromatin experiments, and multiple analyses and predictions of regulatory regions have been conducted from their data. Database and data browser (DB) platforms focus on storing predicted epigenetic marks in order to standardise and process human epigenetic regulatory data effectively and to classify it by cell, tissue or disease condition38. Some DB have been specifically developed to explore trans-regulatory elements (TFs): the hTFtarget database contains 3.4 million predicted records of candidate TF-target regulations obtained from ChIP-seq data alone 137, and KnockTF provides, besides TFs, details about binding at CREs, although from fewer TF samples 138. Others focus on cis-regulatory elements, covering lncRNAs 139, enhancers (ENdb) 140 or super-enhancers (SEanalysis)141. One of particular relevance is ATACdb, since it contains chromatin accessibility data processed together with enhancers, super-enhancers, TFs, SNPs, eQTLs, methylation sites, chromatin interactions, TF footprints and their relation to gene expression, in order to annotate and illustrate the potential roles of CREs per tissue or cell 142. Although the specific analysis of trans- or cis-regulatory elements is relevant, the integration of multiple omics techniques to show a wider regulatory landscape has also been proposed. The Cistrome DB was launched to provide curated and processed ChIP-seq, DNase-seq and ATAC-seq data143 that depict cis-regulatory information 144. Nonetheless, its QC has not been very robust, and misannotation, ambiguity or incompleteness of the data is to be expected 142. 
Another database that includes several omics experiments is the Gene Transcription Regulation Database (GTRD), which contains processed data from RNA-seq, MNase-seq, ChIP-seq and DNase-seq experiments across cell lines and tissues and comprises TFBS, histone marks, nucleosome landscapes and gene ontology across cells and tissues 25. The human epigenome reference, EpiMap, provides a compendium of 10,000 epigenomic maps across 800 samples. This epigenome integration database contains chromatin states and cis-regulatory element predictions throughout the developmental stages of different cells and tissues, with quality control already applied. The limitations of this project are that the tissue samples were not studied at the single-cell level and that some tissues, environmental conditions and developmental stages are missing for comparison 38. Nonetheless, it appears to be the most complete DB, with integrated regulatory elements and chromatin state landscapes per tissue and condition.
The relationship between TFs, gene expression and chromatin accessibility has enabled the study of gene expression in normal conditions, human developmental stages and diseases, and has given insights into therapeutics. In studies of normal conditions, chromatin datasets provide insights into the potential epigenetic mechanisms that lead to dynamic changes in gene expression. Su et al. performed a study to decipher the shifts in the epigenetic landscape before and after stimulation, finding the typical enrichment of the gained open sites and their relation to changes in gene expression 145. Moreover, Tyssowski et al. analysed DHS, histone PTM and CpG content in mouse neurons, revealing the neuronal activity patterns and the separate activation of enhancers by eRNA and H3K27ac marks146. The creation of atlases of chromatin accessibility and foetal gene expression 147 has identified 657 cell subtypes and potential new TFs that regulate cell fate specification, illustrating expression dynamics from embryonic to foetal developmental stages. Moreover, since the study is based on scATAC-seq and scRNA-seq, the expression patterns reveal differences in expression within the same organ. Since the atlas illustrates the regulatory landscape of human development in normal conditions, it can help predict gene regulation in other stages of development or in disease conditions. Regarding disease conditions, epigenetic profiling has been used to compare the transcriptional responses of myeloid cells in 69 different neurodegenerative diseases, providing 336 expression profiles and categorising them according to the genes that show similar responses. This profiling has suggested a possible association between human Alzheimer's disease and inflammatory signalling (based on the function of the genes activated)148. 
Current studies have shown that profiling chromatin accessibility, integrated with other omics assays (RNA-seq, ChIP-seq), has made it possible to categorise cancer subtypes according to their DNA regulatory elements and to discover putative non-coding mutations related to clinical prognosis 146. Likewise, the integration of epigenetic marks such as DNA methylation patterns with ChIP-seq, histone PTM and single-cell transcriptome analysis has helped to decipher how the coordination of epigenetic modifications (chromatin states) allows cells to activate alternative gene regulatory pathways that lead to transcriptional heterogeneity in chronic lymphocytic leukaemia (CLL), contributing to knowledge of the clinical behaviour of this disease 149. Lastly, epigenetic profiling also helps to reveal the functional consequences of gene mutations. The integration of RNA-seq, ATAC-seq and ChIP-seq datasets clarified the circumstances in which specific gene mutations (SY242CS) can open distinct chromatin regions and activate alternative transcriptomes. In contrast, others (Wing2) do not affect chromatin accessibility but can increase protein binding at oestrogen receptor loci under oestrogen therapy, providing insights into the mechanism by which FOXA1 mutations perturb its function and influence breast cancer progression and response to therapy 150.
The human body contains trillions of specialised cells that carry out distinct functions despite having the same genetic information. Since chromatin structure and its alterations allow gene regulation, studying the epigenome helps determine gene expression outcomes and cell fate 129. Chromatin assays and computational software have allowed a more integrative way to visualise and understand the regulatory network that directs cell functions in distinct conditions: normal states, disease, developmental stages or treatment 22,38. The experimental assays depend on several laboratory conditions, such as sample collection, storage and processing, that can affect the results. These sources of variation are addressed by computational tools that aim to provide the most accurate interpretation of the regulatory mechanisms and prediction of the dynamics that may occur 36,113. Although each tool has its limitations, together they have the potential to provide schemes and frameworks for finding genetic and epigenetic targets for therapeutics. For instance, medicines targeting epigenetic marks, such as the DNA methylation patterns characteristic of certain cancers 151, have already been developed and commercialised. The epigenetic field has been widely studied. Several protocols have been steadily improved to shorten lengthy steps and to work with fewer cells and under single-cell conditions 152,153. In the same way, a growing number of improved computational programs aim to increase accuracy and to process the data while reducing experimental bias.
However, these tools still need to be leveraged in further studies. Such studies should implement single-cell experiments that capture the differences between all tissues, and should draw on the information in the multiple databases to build a better understanding of the epigenetic landscapes in developmental stages, normal adult stages and a broader range of diseases. Investigations into therapeutic approaches to the epigenome are also needed, together with their possible implementation in clinical trials, for the subsequent development of medicines that target epigenetic features beyond DNA methylation.
Figure 1. Chromatin structure and arrangement. Image modified from 7↩︎
Figure 2. Motif discovery process from ChIP-seq and SELEX. Own elaboration, created in Biorender.↩︎
Figure 3. Motif identification from chromatin accessibility datasets. Own elaboration, created in Biorender.↩︎
Figure 4. Representation of a V-plot. It illustrates an arrangement in which two pairs of nucleosomes delimit an open chromatin region interrupted by a TF, and the order of the fragments in that situation. Panel B represents the characteristic V-plot pattern, where 0 corresponds to the TFBS. Obtained from 1↩︎
Figure 5. Representation of chromatin states according to their representative marks. Image obtained unedited from 36.↩︎