Introduction to ChIP-Seq

Chromatin Immunoprecipitation followed by high throughput sequencing, colloquially known as ChIP-Seq, identifies the location DNA-protein binding sites on the genome. This collection of ChIP-Seq data is done through the procedure outlined in the diagram below.

An Outline of ChIP-Seq Data Collection

An Outline of ChIP-Seq Data Collection

As seen in the diagram, the protein-of-interest (e.g., histone-modification or transcription factor) is glued to the chromatin and then an antibody, that is known to bind with the protein-of-interest, is injected into the chromatin such that the genomic regions with the protein-of-interest are extracted. This extraction process generates reads which are then aligned and mapped to the genome (StatQuest, n.d.).

Once the ChIP-Seq data is processed, it must undergo some preliminary analysis before biological conclusions can be drawn. For instance, peaks must be called. In calling peaks, background noise is supposedly separated from true biological signal. This step is vital to generate the consensus peak set, which is a union of all called-peaks across conditions. Oftentimes in consensus peak sets, called-peaks that overlap in a sufficient proportion of replicates are combined into one peak. The consensus peak set is then used for downstream ChIP-Seq analysis of protein-binding sites in the genome (Höllbacher et al. 2020).

Recently, ChIP-Seq analyses have been focused on comparing binding sites for proteins-of-interest between different biological conditions. This characterization of the change in the binding sites between samples is known as “Differential Binding Analysis.” By the nature of Differential Binding Analysis, normalization techniques are required to make an apples-to-apples comparison of reads between biological conditions (Höllbacher et al. 2020).

Focus for my SURP

For my Summer Undergraduate Research Project (SURP), I will be analyzing between-sample ChIP-Seq normalization techniques through the lens of their biological assumptions. This project mirrors Evans, Hardin, and Stoebel (2016)’s paper, Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions, which discusses the biological assumptions in between-sample normalization techniques for RNA-Seq, another high throughput sequencing method (Evans, Hardin, and Stoebel 2016). By the end of my SURP, I hope to understand the following:

  1. What are the normalization techniques for ChIP-Seq? How are they different from those of RNA-Seq?

  2. What are the biological assumptions associated with ChIP-Seq normalization techniques? Again, how are they different from the biological assumptions for RNA-Seq normalization techniques?

  3. How can we use coding software, such as R (and potentially MatLab) to demonstrate the biological assumptions each ChIP-Seq normalization technique makes (e.g., simulating ChIP-Seq maps)?

In this write-up, I will summarize my findings from the past three weeks and provide insight into where I hope to take this project in the coming weeks in order to better understand the answers to the questions posed above.

Common ChIP-Seq Normalization Techniques

Listed below are the normalization techniques that I have found so far in literature. These are described in detail in the next section, “Summary of Normalization Methods.”

The following definitions are important to ChIP-Seq normalization and come from the DiffBind vignette, a popular bioconductor package for ChIP-Seq analysis (Stark and Brown 2022a).

  IMPORTANT DEFINITIONS:
  
  FULL: uses all the reads in each sample.
  
  READS IN PEAKS: just uses the reads in each sample that overlap the consensus peak set.
  
  BACKGROUND BINS: just uses the reads in the chromosomes that contain peaks in the consensus peak set. These chromsomes are then partitioned into segments that are ~15,000bp.
  1. RPKM

  2. Library Size

    2a. Full

    2b. Reads in Peaks

    2c. Background Bins

  3. Trimmed Mean of M-Values (TMM)

    3a. Full

    3b. Reads in Peaks

    3c. Background Bins

  4. Relative Log Expression (RLE)

    4a. Full

    4b. Reads in Peaks

    4c. Background Bins

  5. Quantile Normalization

  6. ChIPNorm

  7. MANorm

  8. Loess Normalization (Taslim et al.)

  9. Parallel Factors

  10. Spike-ins

Summary of Normalization Techniques

Once collecting a curated list of ChIP-Seq normalization techniques, I then found literature which defined each technique and provided the mathematical algorithm to normalize raw ChIP-Seq Data. This section is my first draft of the summary of ChIP-Seq Normalization methods common in literature.

The following important definition also comes from the DiffBind Vignette (Stark and Brown 2022a) as well as the DiffBind Manual (Stark and Brown 2022b).

IMPORTANT DEFINITIONS:

CONSENSUS PEAKS = peaks that overlap across a certain proportion of replicates (different samples under the same condition) and then typically the determined peaks in each condition are unioned to create a complete list of consensus peaks.

Overlapping can be defined in many different ways If one wants to be more confident in consensus peaks then they may require more overlap (e.g., 10 bps rather than 1 bp) or require a higher proportion of replicates having this magnitude of overlap (e.g., 50% of replicates versus 75% of replicates). This is done to minimize the number of false positives (i.e. FDR).

Normalization by Library Size

RPKM

RPKM stands for “reads per kilobase per million mapped reads” RPKM normalized ChIP-Seq data by taking every read count and dividing by the number of reads in a sample (in millions) and the region check length (in kilobases). By dividing by both metrics, RPKM normalizes by region length and total number of reads in each sample (Evans, Hardin, and Stoebel 2016).

LIBRARY SIZE (FULL)

Library Size (Full) normalizes between-sample ChIP-Seq data by dividing each sample by its full library size. The full library size of each sample is the total number of aligned reads in the sample, which is alternatively known as sequencing depth (Evans, Hardin, and Stoebel 2016).

LIBRARY SIZE (READS IN PEAKS)

Library Size (Reads in Peaks) is a variation of Library Size (Full) that calculates library size by summing the reads that overlap consensus peaks in each sample. Each sample is then normalized by dividing every read count in a sample by its number of reads in consensus peak regions. Unlike, full library size, library size found with reads peaks also is affected by the ChIP’s “efficiency” (i.e., the proportion of reads in enriched peaks; a high proportion of reads not in enriched peaks signifies an inefficient ChIP experiment) (Stark and Brown 2022a).

LIBRARY SIZE (BACKGROUND BINS)

Library Size (Background Bins) is another variation of the library size normalization technique. In Library Size using Background Bins – which are large regions (approximately 15,000 bp) on chromosomes with a peak in sample j. The reads in these background bins for sample j are summed and then serve divisor for every read count in sample j (Stark and Brown 2022b).

Normalization by Distribution/Testing

TMM (FULL)

TMM stands for Trimmed Mean of M-Values and is a normalization technique native to the R-package “EdgeR.” TMM sets on sample as a reference and compares it to every other sample, estimating the ratio of sequencing depths between each sample and the reference one (Robinson and Oshlack 2010).

The trimming of sections of the genome sequenced with ChIP-seq occurs twice: (1) based on fold-change in the signal in a specific region between samples. (2) based on signal magnitudes for a specific genomic region between samples. TMM trims twice in order to remove a majority of differentially bound regions so that the mean is only over non-differentially bound genomic regions. The ratio of read counts in a sample compared to the reference sample can be expressed as a ratio of sequencing depths for the non-differentially bound genomic regions (Robinson and Oshlack 2010).

\(k_{ij}\) is the number of reads aligned to region i under sample j

\(u_{ij}\) is the true signal magnitude of region i under sample j.

\(N_j\) is the total number of reads for sample j. Thus, \(N_j = \sum_{i} K_{ij}\)

Let r be the fixed reference sample.

Then, the fold-changes are defined as followed for regions:

\[M_{ij}^r = \log_2\frac{k_{ij}/N_j}{k_{ir}/N_r}\]

Additionally, the absolute signal magnitude is given as:

\[A_{ij}^r = \frac{1}{2}\log_2(\frac{k_{ij}}{N_j}\cdot\frac{k_{ir}}{N_r})\] As alluded to above, the absolute signal magnitude and fold-change of a genomic region (i.e. \(A_{ij}^r\) and \(M_{ij}^r\), respectively) are trimmed independently. This creates a finalized list of regions such that neither its \(A_{ij}^r\) or \(M_{ij}^r\) was removed; denote this sample as \(R\). smallest largest values??

Set \(R\) is then used to calculate the scaling factor, \(TMM_j^{(r)}\) for sample j through a weighted mean:

\[\log_2(TMM_j^{(r)})=\frac{\sum_{i \in R}w_{ij}^rM_{ij}^r}{\sum_{i \in R}w_{ij}^r}\]

where,

\[w_{ij}^r = \frac{N_j - k_{ij}}{N_jk_{ij}} - \frac{N_r - k_{ir}}{N_rk_{ir}}\] As we see in the \(\log_2(TMM_j^{(r)})\) calculation, to find the normalization factors we divide by library size of each sample (i.e., \(N_j\)). Consequently, the relative magnitude of the normalization factor is positively correlated to the relative library size of sample j relative to the other samples. Thus, we divide each sample by \(TMM_j^{(r)}\cdot\frac{N_j}{N_r}\) to normalize the ChIP-seq data, where \(N_r\) is the library size of the reference sample (Robinson and Oshlack 2010).

TMM (READS IN PEAKS):

TMM (Trimmed Mean of the M-values) on peak reads extends the same algorithm as TMM on the full library described in Robinson and Oshlack (2010) but only looks at the subset of reads that overlap with the consensus peaks (at least in DiffBind). Like on TMM full, one sample is selected as a reference sample and then the peak regions are trimmed twice. First, they are trimmed by fold-change between consensus peak in the sample versus peak in sample j. Secondly, the regions are trimmed based on true signal magnitude of some consensus peak region in sample j compared to the reference sample.

One reason to use TMM on only peak reads is because studies, such as Tu et al. (2020), suggest that there is high regularity at common peak (i.e. consensus peak) regions. Therefore, the scaling factors (\(TMM_j^{(r)}\)) may be more reliable when only reads in consensus peak regions are used rather than the full library of reads. In turn, this would leads to increased confidence in our normalization. For sites that are differentially bound, we would expect large ratios between their sequencing depths. Thus, the smallest \(A_{ij}^r\) and \(M_{ij}^r\) are independently removed, leaving us with a list of regions we will denote as \(R_p\), which has the same functional role as \(R\) in the calculations for “TMM on Full Library” (Tu et al. 2020).

The mathematical process presented in “TMM on Full Library” to normalize ChIP-seq data also applies to “TMM on Peak Reads.” The only difference is only the subset of reads that overlap with the consensus peaks are considered. Therefore, \(k_{ij}\) is the number of reads aligned to consensus peak region i under sample j; \(u_{ij}\) is the true signal magnitude of consensus peak region i under sample j; \(N_j\) is the total number of reads that overlap consensus peak regions for sample j. r still denotes the fixed reference sample for “TMM on Peak Reads” just as it did for “TMM on Full Library.”

TMM (BACKGROUND BINS)

TMM on Background Bins is another variation of TMM for normalizing ChIP-Seq Data. Background bins use all chromosomes for which there are peaks in the consensus set. These chromosomes are then partitioned into large bins (DiffBind’s default size is 15,000bp) and the reads that overlap these bins are counted (Stark and Brown 2022b). These large stretches in the chromosome are expected to not have differential enrichment between samples and therefore help ensure the mean is calculated over only non-differentially bound genomic regions for which the ratio of read counts in a sample compared to the reference sample can be expressed as a ratio of sequencing depths for the non-differentially bound genomic regions (Stark and Brown 2022a).

Like in the two TMM methods above, the TMM on background bins trims twice: (1) based on fold-change in the signal in a specific “background” bin between samples. (2) based on signal magnitudes for a specific “background” bins between samples [Robinson and Oshlack (2010)}. These trims are independently and quantified via \(A_{ij}^r\) (absolute signal magnitude) and \(M_{ij}^r\) (fold-change). Like before, we are left with list of background bins we will denote as \(R_b\) with sufficient \(A_{ij}^r\) and \(M_{ij}^r\) values. \(R_b\) has the same functional role as \(R\) in the calculations for “TMM on Full Library.”

The mathematical process presented in “TMM on Full Library” to normalize ChIP-seq data also applies to “TMM on Background Bins.” The only difference is that the background bins are used as opposed to potential enriched regions in the genome. Therefore, \(k_{ij}\) is the number of reads aligned to background bin i under sample j; \(u_{ij}\) is the true signal magnitude of background bin i under sample j; \(N_j\) is the total number of reads in background bins for sample j. r still denotes the fixed reference sample for “TMM on Background” just as it did for “TMM on Full Library.”

RLE (FULL)

RLE (Relative Log Expression) on the full library is a normalization technique native to the Bioconductor packages DESeq and DESeq2. It finds the ratio of full read count for a sample to the geometric mean of all read counts for the specified region across all samples. The median of these ratios for a sample is then set as the size factor that is then applied to the full library of reads for that sample to account for differences in sequencing depths between-samples (Love, Huber, and Anders 2014).

The idea is that most regions are not differentially bound and thus have no significant fold-changes between-samples. So, the median of the ratio of total reads in a region between a sample and a pseudo-sample serves as a good estimate of sequencing depth (Evans, Hardin, and Stoebel 2016).

Let \(s_j\) denote the size factor for sample j (found through median-of-ratios method). \(K_{ij}\) represents the total read count for region i under sample j. m is the total number of samples across every condition.

\[s_{j} = (median)_{i}[\frac{K_{ij}}{(\prod_{j =1}^{m} K_{ij})^{1/m}}]\]

The denominator inside \(median_{i}\) represents an arbitrary reference sample that each sample j can be compared to in the ratio above.

RLE (READS IN PEAKS)

Similar to the distinction between TMM on Full Library and TMM on Reads in Peaks, the math for RLE on Reads in Peaks is the same as the math of RLE on Full Library provided in Love, Huber, and Anders (2014). The only distinction is the set of reads considered in the ratio. In RLE on Reads in Peaks, only reads that overlap consensus peaks in the sample j are considered in the median to find the size factor for the sample j (Stark and Brown 2022a). Like before, let \(s_j\) denote the size factor for sample j. \(K_{ij}\) now represents the number of reads that overlap with consensus peak region i under sample j. m is still the total number of samples across every condition. The denominator, \((\prod_{j =1}^{m} K_{ij})^{1/m}\), still represents a pseudo-reference sample. However, now this geometric mean is only over consensus peak region i present across all samples.

RLE (BACKGROUND BINS)

As aforementioned, background bins use all chromosomes for which there are peaks in the consensus set. These chromosomes are then partitioned into large bins (DiffBind’s default size is 15,000bp) and the reads that overlap these bins are counted (Stark and Brown 2022b). These large stretches in the chromosome are expected to not have differential enrichment between samples and therefore help ensure the mean is calculated over only non-differentially bound genomic regions for which the ratio of read counts in a sample compared to the reference sample can be expressed as a ratio of sequencing depths for the non-differentially bound genomic regions (Stark and Brown 2022a).

Therefore, in RLE on Background Bins, only the reads on these chromosomes with peaks are considered. Again, the mathematical algorithm is the same it is in Love, Huber, and Anders (2014). \(s_j\) denotes the size factor for sample j. \(K_{ij}\) now represents the number of reads in background bin i (i.e., a region that is in some chromosome for which there is a peak) under sample j. m is still the total number of samples across every condition. The denominator, \((\prod_{j =1}^{m} K_{ij})^{1/m}\), still represents a pseudo-reference sample. However, now this geometric mean is only takes into account the number of reads in background bin i across all samples.

QUANTILE

Quantile normalization is a technique adapted from normalization of microarray data (Kruczyk et al. 2013). Quantile normalization uses the read count matrix of the consensus peaks and then sorts the columns of the matrix such that every row corresponds to a specific quantile of signal magnitude. We then find the average of each row and replace each entry in the sorted consensus peak matrix with its respective row’s average. The matrix is finally unsorted such that the entries are back in their original position in the matrix (Kruczyk et al. 2013).

Note that other measures of center, like medians, can be the replacement value in quantile normalization (Evans, Hardin, and Stoebel 2016). However, our analysis focuses on the mean as the measure of center in quantile normalization.

ChIPNorm

ChIPNorm is a normalization technique developed by Nair et al. (2012). While ChIPNorm can be applied to any ChIP-seq data set, its focus is on locating differentially-bound histone-modification sites.

ChIPNorm applies an iterative normalization process to identify differentially bound sites. First, the ChIP-seq data is compared to the”Amplified Binomial Distribution”, which is the estimated null distribution with parameters \(N_{bino}\) (total number of reads) and \(\alpha\) (amplification factor). ChIP-seq bins with a false discovery rate (FDR) less than or equal to 0.05 are considered significant.

Next, multiple instances of quantile normalization are applied. The first instance of quantile normalization classifies bins with a fold-change > 1 as outliers. The non-outlier bins then undergo a second round of quantile normalization to finalize the normalization function.

This normalization factor is then applied to normalize (via quantiles) all the bins, including the “outlier” ones. A bin is considered enriched if its ChIP-seq data and quantile normalized control data both have fold changes > 1.

If a bin is significant and enriched, then it is called “enriched-significant.” The enriched-significant bins of each library are moved into the “stage two” where they undergo another round of normalization with their original bin counts. Once again, quantile normalization (with a method similar to Bolstad et al. 2003) is used to give the libraries the same probability distribution so they can be directly compared to determine differentially bound bins.

According to Nair et al. (2012), ChIPNorm’s normalization process reduces the effect of bias when comparing data sets with different levels of background noise and/or signal-to-noise ratios. Furthermore, the mathematics of ChIPNorm were developed such that it be extended to compare more than two data sets, which methods, such as quantile normalization, are unable to do.

Normalization by Linear/Polynomial Regression

MAnorm

MAnorm is a between-sample normalization technique that utilizes the common peaks across samples to create a rescaling model for normalization (Shao et al. 2012). A common peak refers to a peak that overlaps by at least nucleotide in differential binding analysis for two samples (Shao et al. 2012). To create the normalization scaling equation, the MA values for each peak (unique and common) are calculated, where \(R_1\) is the read density at this peak region in sample 1 and \(R_2\) sample 2’s associated read density.

\[M = log2(R_1/R_2)\]

\[A = log2(R_1*R_2)/2\]

Note that 1 was addeed to the real number of reads for each peak to avoid an input of 0 into the log function.

“An iterative re-weighted least squares with a bi-square function” was then applied to the MA values for the common peaks (Shao et al. 2012). This led to a linear model to fit “global dependence” for the common peak’s MA values:

\[ M = a + b*A\]

The linear model was then applied to all the peaks, including those unique to a sample. Alongside the normalized MA value outputted by MAnorm, a corresponding p-value of each peak is found using a Bayesian model by Audic and Claverie which quantifies the magnitude of differential binding at the peak’s locus on the chromosome (Shao et al. 2012).

\[p(y|x) = (x + y)!/x!y!2^{x=y=1}\] where x and y denote the normalized the read count at a common peak for sample 1 and 2, respectively.

LOESS NORMALIZATION (Taslim et al.)

While there are many possible LOESS normalization techniques, one possible LOESS (locally weighted smoothing regression) normalization technique for ChIP-Seq was introduced by Taslim et al. (2009). Their LOESS normalization method does not assume that the total number of DNA bindings is similar across conditions nor that the noise is linear. Rather, their LOESS normalization assumes that the difference in means of non-differential reads will be zero (Taslim et al. 2009). Their method involves a two-stage LOESS normalization. In stage one, the fitted values are estimated by a LOESS regression with span 60% on the observed difference in mean counts:

\[\hat{Y}_{mean} = loess[(x_{i2} - x_{i1}) \sim (\frac{x_{i2} + x_{i1}}{2})] \] \(x_{i1}\) and \(x_{i2}\) represent the observed fragment counts in control and treatment samples. \(\hat{Y}_{mean}\) is then used the find \(D_{meannorm}\), or the data after correction with respect to the mean.

\[D_{meannorm} = (x_{i2} - x_{i1}) - \hat{Y}_{mean}\] After the normalization with respect to the mean is calculated, the next step is to conduct a loess regression with span 10% to find the fitted Y-value by regressing the transformed data on the average of the observed fragment counts.

\[\hat{Y}_{var} = loess[|D_{meannorm}|\sim (\frac{x_{i2} + x_{i1}}{2})]\] This then creates, \(D_{meanvanorm}\), which is the data normalized with respect to mean and variance, where:

\[D_{meanvanorm} = \frac{D_{meannorm}}{\hat{Y}_{var}}\]

Normalization by Controls

PARALLEL FACTORS:

Parallel factors use a second antibody to serve as an internal control “to normalize the signal” such that there are unchanged peaks between-samples that can be used as reference (Guertin et al. 2018). To use parallel factors, it must be a priori known that most of the binding sites for the secondary antibody are unaltered between the compared conditions. For example, Guertin et al. (2018), use CTCF (CCCTC binding factor) as a control antibody for ER (estrogen receptors) given CTCF affects a small proportion of ER binding sites. The peaks which are associated with the control antibody in each condition are then compared and a normalization factor is developed using typical normalization techniques. The obtained normalization factor is then applied to the reads associated with the transcription factor or histone modification of interest to normalize its interesting regions such that we can determine differential binding sites.

SPIKE-IN CONTROLS

Spike-in controls insert chromatin from another organism into the genome with the transcription factor or histone of interest to serve as an external control across the sample conditions (Evans, Hardin, and Stoebel 2016). In spike-in controls, it is important that enough of the external chromatin is added to conduct normalization but not too much such that the total tag counts change. Similarly, for the external chromatin to serve as a control, it must not affect the antibody for the transcription factor or histone of interest (Bonhoure et al. 2014). Similar to parallel factors, the external control is then used to generate a normalization factor and then this factor is applied to the chromosomal reads with the transcription factor or histone of interest (Stark and Brown 2022a).

Conducting Normalizing Techniques on Real-Life Data

I owe the the following code chunks in large part to Ashby (2021), who supplied me with this code via an email on 05/20/2022. Additionally, I would like to recognize Danae Schulz and her student, Lucy, who collected the Cut-and-Run data that is used in the consequent real-life data analysis.

Set-up Code

To see the normalization techniques in action we applied some common techniques (supported by the DiffBind package) to Cut-and-Run data produced by Danae Schulz and Lucy at Harvey Mudd College.

To replicate the data analysis steps provided below for yourself, please refer to "file_directory_walkthrough.md" in the github repo.

The first step was to read the peaks and supplementary information into a sample sheet.

#####
# The goal of this chunk: to make a sample sheet containing file paths to all relevant files

#identify names of sorted spike-in BAM files
spikes<-list.files("data/bam_spikein",  ".bam")
spikes<-spikes[str_detect(spikes, "sort")]
spikes<-spikes[!str_detect(spikes, "bam.bai")]

#create sample sheet
sample_sheet<-data.frame(
  #SampleID denotes unique identifier for each sample
  SampleID= c(paste(1:3, paste(rep(c(0, 1, 3, 24, 76), each=3), "h", sep=""), rep(1:3, 3), sep="_")),
  #CONDITION denotes timepoint
  Condition= rep(c('0', '1', '3', '24', '76'), each=3),
  #Treatment included just for yucks
  Treatment= rep("Bromodomain Inhibition", 15),
  #Replicate denotes replicate number
  Replicate= rep(1:3, 5),
  #bamReads denotes paths to BAM files containing reads aligned to T brucei genome
  bamReads= paste("reads/", list.files("./reads", pattern="*.bam")[!str_detect(list.files("./reads", pattern="*.bam"), "bam.bai")] %>% head(15), sep=""),
  #Spikein denotes file paths to BAM files containing spike-in reads
  Spikein=paste("data/bam_spikein/", spikes[c(1,6,11,2,7,12,3,8,13,4,9,14,5,10,15)], sep=""),
  #Unique identifier for control sample
  ControlID= rep("16_IgG_1", 15),
  #bamControl contains the file path to the control BAM file
  bamControl= rep(paste("reads/", list.files("./reads", pattern="*.bam")[!str_detect(list.files("./reads", pattern="*.bam"), "bam.bai")] %>% tail(1), sep=""), 15),
  #Peaks contains the file paths to the peak files produced by MACS
  Peaks= paste("nomodel_broad_Res/", list.files("./nomodel_broad_Res/", pattern="*.broadPeak"), sep="") %>% head(15), 
  #PeakCaller denotes the file type of the files in `Peaks`
  PeakCaller= rep("narrow", 15)
)

#set up DBA object... most of all these settings are defaults. Make sure you check doGreyList=TRUE
res_s<- dba(sampleSheet=sample_sheet, config=data.frame(AnalysisMethod=DBA_DESEQ2, th=0.05,
                      DataType=DBA_DATA_GRANGES, RunParallel=TRUE, 
                      minQCth=15, 
                      bCorPlot=FALSE, reportInit="DBA", 
                      bUsePval=FALSE, design=TRUE,
                      doBlacklist=FALSE, doGreylist=TRUE))

Below is what the head of the sample sheet generated by the last step of code looks like:

sample_sheet %>%
    head()

After generating the sample sheet, Ashby (2021) then made a greylist which included all regions for which there is a disproportionate degree of signal. These regions are then excluded from future analysis to avoid biasing differential binding analysis.

###### Goal of this chunk: greylist suspect regions of the genome with high tag
###### counts in the control sample

set.seed(4747)

karyo <- read.table("karyo_file.txt")

# Step 1: create greylist object
gl <- new("GreyList", karyoFile = "./karyo_file.txt")
# Step 2: count the reads from the control file
gl <- countReads(gl, "./reads/sort_16_IgG_1_CR_diff_comb_140804_tb927_v5.1_m1_v2.bam")  #where sorted T brucei BAM file is
# Step 3: calculate read count threshold
gl <- calcThreshold(gl, reps = 100, sampleSize = 1000, p = 0.99, cores = 1)
# Step 4:
gl <- makeGreyList(gl, maxGap = 10000)
## coverage: 175103 bp (0.66%)
export(gl, con = "greylist.bed")

res_s <- invisible(suppressMessages(dba.blacklist(res_s, greylist = gl@regions)))

Next, Ashby (2021) wrote code to generate a consensus peak set. Please refer to the top of the previous section describing the mathematics behind selected ChIP-Seq normalization techniques for a definition of consensus peaks.

##### The goal of this chunk: generate the consensus peakset for peaks
##### identified in all 3 replicates from each timepoint No need to change
##### paths OUTPUT: writes 'consensus.bed' file to home directory containing
##### the positions of all the consensus peaks

# adds consensus peaks: those that overlap with all three
res_consensus <- dba.peakset(res_s, consensus = c(DBA_CONDITION), minOverlap = 3)
# removes other peaksets, focuses only on consensus
res_consensus <- dba(res_consensus, mask = res_consensus$masks$Consensus, minOverlap = 1)
# creates an object containing the consensus peaks, and writes it out as a bed
# file
consensus_peaks <- dba.peakset(res_consensus, bRetrieve = TRUE, writeFile = "consensus.bed",
    numCols = 3)

Below is a Jaccard distance plot which gives the similarity between consensus peaks at each recorded time interval in the experiment.

##### The goal of this chunk: create dendrogram illustrating Jaccard distance
##### between consensus peaks between timepoints You don't need to change a
##### thing!

# creates list of consensus peaks per TP
ll <- list(res_consensus$called[, 1], res_consensus$called[, 2], res_consensus$called[,
    3], res_consensus$called[, 4], res_consensus$called[, 5])

# calculates jaccard similarities between all combinations of timepoints
dist <- unlist(lapply(combn(ll, 2, simplify = FALSE), function(x) {
    sum(x[[1]] == x[[2]])/(sum(x[[1]] != 0) + sum(x[[1]] != 0))
}))

# initialize matrix to store distances
to_make_mat <- cbind(t(combn(5, 2)), dist)
d_mat <- matrix(ncol = 5, nrow = 5)
# make diagonal entries 1
diag(d_mat) <- 1
# loop through rows of the matrix
for (i in 1:dim(to_make_mat)[1]) {
    # fill d_mat with jaccard similarities at each particular index
    d_mat[as.numeric(to_make_mat[i, 2]), as.numeric(to_make_mat[i, 1])] <- to_make_mat[i,
        3]
}
colnames(d_mat) <- c("0hr", "1hr", "3hr", "24hr", "76hr")
rownames(d_mat) <- c("0hr", "1hr", "3hr", "24hr", "76hr")
# convert jaccard similarity to jaccard DISTANCE by taking 1-sim.
d_mat <- as.dist(1 - d_mat)

# use hierarchical clustering with complete linkage and plot
hc <- hclust(d_mat)
plot(hc, main = "Consensus Peaks", sub = "", xlab = "", ylab = "Jaccard Distance")

Next, Ashby (2021) counted the number of reads that align to summit (explain) windows within peaks. These reads are counted twice, once with normalized reads and once with RPKM reads.

##### The goal of this chunk: count number of reads that align to summit
##### windows within peaks (and greylist the resulting objects) NOTE: I perform
##### this step twice. First with score=DBA_SCORE_NORMALIZED, specifying that
##### normalized reads should be stored in the resulting object, second with
##### score=DBA_SCORE_RPKM, specifying that RPKMs should be stored.

# This code prints a summary of the peak widths in our dataset. We want a
# summits value that lies within [1/2 * min peak width, 1/2 * 1st quartile peak
# width]. summits=200 fits the bill!
summary(res_s$binding[, 3] - res_s$binding[, 2])
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   200.0   402.0   688.5  3246.8  1957.5 37882.0
# count normalized reads and greylist
res_c <- dba.count(res_s, score = DBA_SCORE_NORMALIZED, peaks = consensus_peaks,
    summits = 200)
res_c <- invisible(suppressMessages(dba.blacklist(res_c, greylist = gl@regions)))

# count RPKMs and greylist
res_rpkm <- dba.count(res_s, score = DBA_SCORE_RPKM, peaks = consensus_peaks, summits = 200)
res_rpkm <- invisible(suppressMessages(dba.blacklist(res_rpkm, greylist = gl@regions)))

As Ashby (2021) states, “the following code constructs contrasts between 0 hr and subsequent time points within the object containing normalized reads and rpkm reads separately.”

##### The goal of this chunk: specify contrasts between 0 hr and subsequent
##### timepoints NOTE: I perform this step twice. First with on the normalized
##### read count data and second on the rpkm data

res_c <- dba.contrast(res_c, contrast = c("Condition", "1", "0"))
res_c <- dba.contrast(res_c, contrast = c("Condition", "3", "0"))
res_c <- dba.contrast(res_c, contrast = c("Condition", "24", "0"))
res_c <- dba.contrast(res_c, contrast = c("Condition", "76", "0"))

res_rpkm <- dba.contrast(res_rpkm, contrast = c("Condition", "1", "0"))
res_rpkm <- dba.contrast(res_rpkm, contrast = c("Condition", "3", "0"))
res_rpkm <- dba.contrast(res_rpkm, contrast = c("Condition", "24", "0"))
res_rpkm <- dba.contrast(res_rpkm, contrast = c("Condition", "76", "0"))

Then, Ashby (2021) generated a .csv data file that contains the spike-in reads.

###### Spike-in lib size
res_lib <- dba.normalize(res_c, spikein = TRUE)
res_lib <- dba.analyze(res_lib)
dba.peakset(res_lib, bRetrieve = TRUE, writeFile = "spikeinlibcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
##            seqnames          ranges strand |   X1_0h_1   X2_0h_2   X3_0h_3
##               <Rle>       <IRanges>  <Rle> | <numeric> <numeric> <numeric>
##     1 Tb927_01_v5.1       7538-7938      * |   293.759   449.689  1807.758
##     2 Tb927_01_v5.1     46067-46467      * |   287.283   501.620  1682.085
##     3 Tb927_01_v5.1     47790-48190      * |   439.020   525.588  1282.509
##     4 Tb927_01_v5.1     49105-49505      * |   477.417   350.962   610.642
##     5 Tb927_01_v5.1     52601-53001      * |  1191.691   805.216  2297.560
##   ...           ...             ...    ... .       ...       ...       ...
##   495 Tb927_11_v5.1 4828256-4828656      * |   160.527   344.685  1330.845
##   496 Tb927_11_v5.1 4831133-4831533      * |   141.560   223.703   918.380
##   497 Tb927_11_v5.1 4833088-4833488      * |   265.540   240.252   718.592
##   498 Tb927_11_v5.1 4835964-4836364      * |   154.975   291.042  1071.443
##   499 Tb927_11_v5.1 4841270-4841670      * |   344.184   430.286  1725.587
##         X1_1h_1   X2_1h_2   X3_1h_3   X1_3h_1   X2_3h_2   X3_3h_3  X1_24h_1
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   849.837   527.300  1229.037  1180.024  1421.388   839.623   592.209
##     2   831.100   588.194  1143.595   847.805   861.891   974.356   792.958
##     3  1270.071   616.299   871.937  1380.650  1422.720  1127.164   989.246
##     4  1381.152   411.535   415.156  1103.441  1099.011  1229.037   828.647
##     5  3447.526   944.189  1562.037  3875.526  4177.574  5990.731  3440.612
##   ...       ...       ...       ...       ...       ...       ...       ...
##   495   464.399   404.174   904.799   618.056   651.414   483.070   430.495
##   496   409.528   262.312   624.377   575.990   598.128   532.363   388.114
##   497   768.199   281.718   488.547  1188.653  1330.803  1501.791   788.497
##   498   448.339   341.273   728.440  1388.200  1542.612  1610.235   692.583
##   499   995.714   504.549  1173.171  2067.738  2527.059  2906.638  1853.581
##        X2_24h_2  X3_24h_3  X1_76h_1  X2_76h_2  X3_76h_3
##       <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   380.299  1269.326   397.037   417.356   383.932
##     2   622.938  1633.624   658.010   685.825   516.092
##     3   584.314  1301.998   623.437   670.584   780.411
##     4   774.464   668.152   347.965   434.942   498.526
##     5  3876.282  2896.415  3421.652  3691.729  3370.907
##   ...       ...       ...       ...       ...       ...
##   495   281.263  1471.895   428.264   459.561   338.764
##   496   249.572   883.790   274.357   255.572   234.207
##   497   845.770   668.152   481.797   562.728   523.620
##   498   843.789  1323.235  1051.701  1145.386   841.472
##   499  2010.437  1870.499  1566.956  1672.943  1291.484
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths

PCA Plots

For the continuation of normalization analysis Ashby (2021) conducted, I examined the following normalization techniques on the Cut-and_Run Data:

  1. Library Size

    1a. Full

    1b. Background Bins

    1c. Reads in Peaks

    1d. Spike-ins

  2. Relative Log Expression (RLE)

    2a. Full

    2b. Background Bins

    2c. Reads in Peaks

    2d. Spike-ins

  3. Reads per Kilobase per Million Mapped Reads (RPKM)

The following code generates a PCA plot for normalization using spike-ins with library size. Ideally, we want the same colors to be grouped together and the two axes to explain a high proportion of the variability.

# retrieve list of differentially bound sites (padj<0.05)
spikeinlibsize <- rbind(res_lib$contrasts[[1]]$DESeq2$de, res_lib$contrasts[[2]]$DESeq2$de,
    res_lib$contrasts[[3]]$DESeq2$de, res_lib$contrasts[[4]]$DESeq2$de) %>%
    dplyr::filter(padj < 0.05) %>%
    .$id %>%
    unique()

# Generate PCA plot
p <- dba.plotPCA(res_lib, label = DBA_REPLICATE, vColors = c("navy", "magenta", "cyan",
    "red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
    sub = "Spike-in lib. size")

The PCA plots were then generated for other normalization methods, which are denoted in the bolded title below the x-axis.

###### Spike-in RLE
res_rle <- dba.normalize(res_c, normalize = DBA_NORM_RLE, spikein = TRUE)
res_rle <- dba.analyze(res_rle)
dba.peakset(res_rle, bRetrieve = TRUE, writeFile = "spikeinRLEcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
##            seqnames          ranges strand |   X1_0h_1   X2_0h_2   X3_0h_3
##               <Rle>       <IRanges>  <Rle> | <numeric> <numeric> <numeric>
##     1 Tb927_01_v5.1       7538-7938      * |   293.759   449.689  1807.758
##     2 Tb927_01_v5.1     46067-46467      * |   287.283   501.620  1682.085
##     3 Tb927_01_v5.1     47790-48190      * |   439.020   525.588  1282.509
##     4 Tb927_01_v5.1     49105-49505      * |   477.417   350.962   610.642
##     5 Tb927_01_v5.1     52601-53001      * |  1191.691   805.216  2297.560
##   ...           ...             ...    ... .       ...       ...       ...
##   495 Tb927_11_v5.1 4828256-4828656      * |   160.527   344.685  1330.845
##   496 Tb927_11_v5.1 4831133-4831533      * |   141.560   223.703   918.380
##   497 Tb927_11_v5.1 4833088-4833488      * |   265.540   240.252   718.592
##   498 Tb927_11_v5.1 4835964-4836364      * |   154.975   291.042  1071.443
##   499 Tb927_11_v5.1 4841270-4841670      * |   344.184   430.286  1725.587
##         X1_1h_1   X2_1h_2   X3_1h_3   X1_3h_1   X2_3h_2   X3_3h_3  X1_24h_1
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   849.837   527.300  1229.037  1180.024  1421.388   839.623   592.209
##     2   831.100   588.194  1143.595   847.805   861.891   974.356   792.958
##     3  1270.071   616.299   871.937  1380.650  1422.720  1127.164   989.246
##     4  1381.152   411.535   415.156  1103.441  1099.011  1229.037   828.647
##     5  3447.526   944.189  1562.037  3875.526  4177.574  5990.731  3440.612
##   ...       ...       ...       ...       ...       ...       ...       ...
##   495   464.399   404.174   904.799   618.056   651.414   483.070   430.495
##   496   409.528   262.312   624.377   575.990   598.128   532.363   388.114
##   497   768.199   281.718   488.547  1188.653  1330.803  1501.791   788.497
##   498   448.339   341.273   728.440  1388.200  1542.612  1610.235   692.583
##   499   995.714   504.549  1173.171  2067.738  2527.059  2906.638  1853.581
##        X2_24h_2  X3_24h_3  X1_76h_1  X2_76h_2  X3_76h_3
##       <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   380.299  1269.326   397.037   417.356   383.932
##     2   622.938  1633.624   658.010   685.825   516.092
##     3   584.314  1301.998   623.437   670.584   780.411
##     4   774.464   668.152   347.965   434.942   498.526
##     5  3876.282  2896.415  3421.652  3691.729  3370.907
##   ...       ...       ...       ...       ...       ...
##   495   281.263  1471.895   428.264   459.561   338.764
##   496   249.572   883.790   274.357   255.572   234.207
##   497   845.770   668.152   481.797   562.728   523.620
##   498   843.789  1323.235  1051.701  1145.386   841.472
##   499  2010.437  1870.499  1566.956  1672.943  1291.484
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths
# retrieve list of differentially bound sites (padj<0.05)
spikeinrle <- rbind(res_rle$contrasts[[1]]$DESeq2$de, res_rle$contrasts[[2]]$DESeq2$de,
    res_rle$contrasts[[3]]$DESeq2$de, res_rle$contrasts[[4]]$DESeq2$de) %>%
    dplyr::filter(padj < 0.05) %>%
    .$id %>%
    unique()

# generate PCA plot
q <- dba.plotPCA(res_rle, label = DBA_REPLICATE, vColors = c("navy", "magenta", "cyan",
    "red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
    sub = "RLE spike-in")

###### Background RLE

res_background <- dba.normalize(res_c, normalize = DBA_NORM_RLE, background = TRUE,
    spikein = FALSE)
res_background <- dba.analyze(res_background)
dba.peakset(res_background, bRetrieve = TRUE, writeFile = "backgroundRLEcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
##            seqnames          ranges strand |   X1_0h_1   X2_0h_2   X3_0h_3
##               <Rle>       <IRanges>  <Rle> | <numeric> <numeric> <numeric>
##     1 Tb927_01_v5.1       7538-7938      * |   746.392   672.272   583.236
##     2 Tb927_01_v5.1     46067-46467      * |   729.936   749.907   542.690
##     3 Tb927_01_v5.1     47790-48190      * |  1115.474   785.739   413.775
##     4 Tb927_01_v5.1     49105-49505      * |  1213.033   524.679   197.011
##     5 Tb927_01_v5.1     52601-53001      * |  3027.882  1203.776   741.261
##   ...           ...             ...    ... .       ...       ...       ...
##   495 Tb927_11_v5.1 4828256-4828656      * |   407.871   515.295   429.370
##   496 Tb927_11_v5.1 4831133-4831533      * |   359.679   334.430   296.296
##   497 Tb927_11_v5.1 4833088-4833488      * |   674.691   359.171   231.839
##   498 Tb927_11_v5.1 4835964-4836364      * |   393.766   435.100   345.679
##   499 Tb927_11_v5.1 4841270-4841670      * |   874.512   643.265   556.725
##         X1_1h_1   X2_1h_2   X3_1h_3   X1_3h_1   X2_3h_2   X3_3h_3  X1_24h_1
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   746.392   672.272   583.236   1546.69   1799.16   726.660   728.727
##     2   729.936   749.907   542.690   1111.24   1090.96   843.267   975.753
##     3  1115.474   785.739   413.775   1809.65   1800.84   975.516  1217.290
##     4  1213.033   524.679   197.011   1446.31   1391.10  1063.682  1019.669
##     5  3027.882  1203.776   741.261   5079.75   5287.86  5184.741  4233.753
##   ...       ...       ...       ...       ...       ...       ...       ...
##   495   407.871   515.295   429.370   810.102   824.543   418.078   529.734
##   496   359.679   334.430   296.296   754.964   757.095   460.739   477.584
##   497   674.691   359.171   231.839  1557.997  1684.495  1299.740   970.264
##   498   393.766   435.100   345.679  1819.548  1952.598  1393.595   852.240
##   499   874.512   643.265   556.725  2710.235  3198.686  2515.581  2280.874
##        X2_24h_2  X3_24h_3  X1_76h_1  X2_76h_2  X3_76h_3
##       <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   497.554   399.523   398.829   379.656   501.203
##     2   815.005   514.186   660.981   623.873   673.730
##     3   764.472   409.806   626.251   610.009  1018.784
##     4  1013.249   210.302   349.535   395.653   650.799
##     5  5071.427   911.652  3437.099  3358.248  4400.537
##   ...       ...       ...       ...       ...       ...
##   495   367.983   463.282   430.198   418.048   442.238
##   496   326.520   278.175   275.595   232.486   305.744
##   497  1106.540   210.302   483.972   511.896   683.557
##   498  1103.949   416.491  1056.449  1041.921  1098.496
##   499  2630.301   588.743  1574.030  1521.823  1685.962
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths
# get list of db sites
background <- rbind(res_background$contrasts[[1]]$DESeq2$de, res_background$contrasts[[2]]$DESeq2$de,
    res_background$contrasts[[3]]$DESeq2$de, res_background$contrasts[[4]]$DESeq2$de) %>%
    dplyr::filter(padj < 0.05) %>%
    .$id %>%
    unique()

# generate pca plot
r <- dba.plotPCA(res_background, label = DBA_REPLICATE, vColors = c("navy", "magenta",
    "cyan", "red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
    sub = "RLE background")

###### RPKM 'Normalization' was already done in counting step

res_rpkm <- dba.analyze(res_rpkm)
dba.peakset(res_rpkm, bRetrieve = TRUE, writeFile = "rpkmcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
##            seqnames          ranges strand |   X1_0h_1   X2_0h_2   X3_0h_3
##               <Rle>       <IRanges>  <Rle> | <numeric> <numeric> <numeric>
##     1 Tb927_01_v5.1       7538-7938      * |   157.615   151.126  143.1016
##     2 Tb927_01_v5.1     46067-46467      * |   154.140   168.578  133.1533
##     3 Tb927_01_v5.1     47790-48190      * |   235.554   176.633  101.5230
##     4 Tb927_01_v5.1     49105-49505      * |   256.156   117.947   48.3382
##     5 Tb927_01_v5.1     52601-53001      * |   639.397   270.608  181.8742
##   ...           ...             ...    ... .       ...       ...       ...
##   495 Tb927_11_v5.1 4828256-4828656      * |   86.1299  115.8378  105.3493
##   496 Tb927_11_v5.1 4831133-4831533      * |   75.9532   75.1795   72.6986
##   497 Tb927_11_v5.1 4833088-4833488      * |  142.4743   80.7412   56.8835
##   498 Tb927_11_v5.1 4835964-4836364      * |   83.1514   97.8100   84.8151
##   499 Tb927_11_v5.1 4841270-4841670      * |  184.6705  144.6054  136.5969
##         X1_1h_1   X2_1h_2   X3_1h_3   X1_3h_1   X2_3h_2   X3_3h_3  X1_24h_1
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   157.615   151.126  143.1016   169.823   187.087   93.0972   84.6144
##     2   154.140   168.578  133.1533   122.012   113.444  108.0364  113.2972
##     3   235.554   176.633  101.5230   198.696   187.262  124.9798  141.3427
##     4   256.156   117.947   48.3382   158.801   144.655  136.2753  118.3964
##     5   639.397   270.608  181.8742   557.746   549.863  664.2510  491.5921
##   ...       ...       ...       ...       ...       ...       ...       ...
##   495   86.1299  115.8378  105.3493   88.9474   85.7407   53.5628   61.5088
##   496   75.9532   75.1795   72.6986   82.8934   78.7271   59.0283   55.4535
##   497  142.4743   80.7412   56.8835  171.0647  175.1635  166.5182  112.6598
##   498   83.1514   97.8100   84.8151  199.7825  203.0424  178.5425   98.9558
##   499  184.6705  144.6054  136.5969  297.5781  332.6178  322.2875  264.8383
##        X2_24h_2  X3_24h_3  X1_76h_1  X2_76h_2  X3_76h_3
##       <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   64.6001   92.4897   59.1065   56.5911   76.2605
##     2  105.8164  119.0344   97.9574   92.9937  102.5115
##     3   99.2554   94.8704   92.8105   90.9272  155.0133
##     4  131.5555   48.6851   51.8012   58.9755   99.0224
##     5  658.4505  211.0480  509.3787  500.5765  669.5643
##   ...       ...       ...       ...       ...       ...
##   495   47.7772  107.2500   63.7554   62.3137   67.2887
##   496   42.3938   64.3976   40.8433   34.6541   46.5206
##   497  143.6680   48.6851   71.7248   76.3025  104.0068
##   498  143.3316   96.4178  156.5659  155.3075  167.1419
##   499  341.5060  136.2944  233.2715  226.8411  256.5279
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths
# generate list of db sites
rpkm <- rbind(res_rpkm$contrasts[[1]]$DESeq2$de, res_rpkm$contrasts[[2]]$DESeq2$de,
    res_rpkm$contrasts[[3]]$DESeq2$de, res_rpkm$contrasts[[4]]$DESeq2$de) %>%
    dplyr::filter(padj < 0.05) %>%
    .$id %>%
    unique()

# generate pca plot
s <- dba.plotPCA(res_rpkm, label = DBA_REPLICATE, vColors = c("navy", "magenta",
    "cyan", "red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_RPKM,
    sub = "RPKM")

## library size full, code based on
## https://bioconductor.org/packages/devel/bioc/manuals/DiffBind/man/DiffBind.pdf

res_lib_background <- dba.normalize(res_c, normalize = DBA_NORM_LIB, background = TRUE,
    spikein = FALSE)
res_lib_background <- dba.analyze(res_lib_background)
dba.peakset(res_lib_background, bRetrieve = TRUE, writeFile = "backgroundlibcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
##            seqnames          ranges strand |   X1_0h_1   X2_0h_2   X3_0h_3
##               <Rle>       <IRanges>  <Rle> | <numeric> <numeric> <numeric>
##     1 Tb927_01_v5.1       7538-7938      * |   953.621   914.359   865.808
##     2 Tb927_01_v5.1     46067-46467      * |   932.596  1019.952   805.619
##     3 Tb927_01_v5.1     47790-48190      * |  1425.175  1068.686   614.246
##     4 Tb927_01_v5.1     49105-49505      * |  1549.822   713.618   292.461
##     5 Tb927_01_v5.1     52601-53001      * |  3868.548  1637.260  1100.395
##   ...           ...             ...    ... .       ...       ...       ...
##   495 Tb927_11_v5.1 4828256-4828656      * |   521.113   700.854   637.395
##   496 Tb927_11_v5.1 4831133-4831533      * |   459.540   454.859   439.849
##   497 Tb927_11_v5.1 4833088-4833488      * |   862.013   488.509   344.163
##   498 Tb927_11_v5.1 4835964-4836364      * |   503.091   591.781   513.157
##   499 Tb927_11_v5.1 4841270-4841670      * |  1117.314   874.907   826.453
##         X1_1h_1   X2_1h_2   X3_1h_3   X1_3h_1   X2_3h_2   X3_3h_3  X1_24h_1
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   953.621   914.359   865.808  1027.481  1131.931   563.266   511.943
##     2   932.596  1019.952   805.619   738.208   686.373   653.654   685.483
##     3  1425.175  1068.686   614.246  1202.171  1132.992   756.166   855.167
##     4  1549.822   713.618   292.461   960.798   875.205   824.507   716.335
##     5  3868.548  1637.260  1100.395  3374.532  3326.838  4018.923  2974.283
##   ...       ...       ...       ...       ...       ...       ...       ...
##   495   521.113   700.854   637.395   538.159   518.758   324.071   372.147
##   496   459.540   454.859   439.849   501.531   476.323   357.140   335.511
##   497   862.013   488.509   344.163  1034.994  1059.793  1007.486   681.627
##   498   503.091   591.781   513.157  1208.745  1228.469  1080.237   598.713
##   499  1117.314   874.907   826.453  1800.439  2012.440  1949.938  1602.353
##        X2_24h_2  X3_24h_3  X1_76h_1  X2_76h_2  X3_76h_3
##       <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   390.851   559.591   357.613   342.393   461.400
##     2   640.222   720.195   592.673   562.641   620.226
##     3   600.526   573.995   561.532   550.137   937.878
##     4   795.951   294.560   313.413   356.820   599.116
##     5  3983.828  1276.905  3081.898  3028.642  4051.070
##   ...       ...       ...       ...       ...       ...
##   495   289.067   648.895   385.739   377.017   407.117
##   496   256.496   389.625   247.114   209.668   281.464
##   497   869.236   294.560   433.957   461.654   629.273
##   498   867.200   583.358   947.272   939.658  1011.260
##   499  2066.216   824.623  1411.365  1372.458  1552.072
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths
background_lib <- rbind(res_lib_background$contrasts[[1]]$DESeq2$de, res_lib_background$contrasts[[2]]$DESeq2$de,
    res_lib_background$contrasts[[3]]$DESeq2$de, res_lib_background$contrasts[[4]]$DESeq2$de) %>%
    dplyr::filter(padj < 0.05) %>%
    .$id %>%
    unique()

t <- dba.plotPCA(res_lib_background, label = DBA_REPLICATE, vColors = c("navy", "magenta",
    "cyan", "red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
    sub = "Library Size Background")

## total RLE

res_RLE <- dba.normalize(res_c, normalize = DBA_NORM_RLE, background = FALSE, spikein = FALSE)
res_RLE <- dba.analyze(res_RLE)
dba.peakset(res_RLE, bRetrieve = TRUE, writeFile = "fullRLEcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
##            seqnames          ranges strand |   X1_0h_1   X2_0h_2   X3_0h_3
##               <Rle>       <IRanges>  <Rle> | <numeric> <numeric> <numeric>
##     1 Tb927_01_v5.1       7538-7938      * |   1096.77  1200.759  1385.132
##     2 Tb927_01_v5.1     46067-46467      * |   1072.59  1339.425  1288.839
##     3 Tb927_01_v5.1     47790-48190      * |   1639.11  1403.425   982.678
##     4 Tb927_01_v5.1     49105-49505      * |   1782.47   937.141   467.883
##     5 Tb927_01_v5.1     52601-53001      * |   4449.27  2150.090  1760.426
##   ...           ...             ...    ... .       ...       ...       ...
##   495 Tb927_11_v5.1 4828256-4828656      * |   599.338   920.379  1019.714
##   496 Tb927_11_v5.1 4831133-4831533      * |   528.523   597.332   703.677
##   497 Tb927_11_v5.1 4833088-4833488      * |   991.413   641.522   550.596
##   498 Tb927_11_v5.1 4835964-4836364      * |   578.612   777.141   820.956
##   499 Tb927_11_v5.1 4841270-4841670      * |  1285.037  1148.950  1322.172
##         X1_1h_1   X2_1h_2   X3_1h_3   X1_3h_1   X2_3h_2   X3_3h_3  X1_24h_1
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   1096.77  1200.759  1385.132   687.443   735.043   382.562   479.747
##     2   1072.59  1339.425  1288.839   493.904   445.710   443.951   642.374
##     3   1639.11  1403.425   982.678   804.321   735.732   513.576   801.386
##     4   1782.47   937.141   467.883   642.829   568.332   559.993   671.285
##     5   4449.27  2150.090  1760.426  2257.755  2160.352  2729.589  2787.233
##   ...       ...       ...       ...       ...       ...       ...       ...
##   495   599.338   920.379  1019.714   360.059   336.866   220.104   348.743
##   496   528.523   597.332   703.677   335.553   309.311   242.564   314.411
##   497   991.413   641.522   550.596   692.470   688.199   684.269   638.760
##   498   578.612   777.141   820.956   808.720   797.732   733.680   561.061
##   499  1285.037  1148.950  1322.172  1204.597  1306.820  1324.367  1501.582
##        X2_24h_2  X3_24h_3  X1_76h_1  X2_76h_2  X3_76h_3
##       <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   278.975   791.943   289.201   276.936   365.574
##     2   456.967  1019.232   479.294   455.078   491.414
##     3   428.633   812.328   454.111   444.965   743.094
##     4   568.121   416.866   253.457   288.605   474.688
##     5  2843.509  1807.098  2492.331  2449.643  3209.721
##   ...       ...       ...       ...       ...       ...
##   495   206.325   918.328   311.948   304.941   322.565
##   496   183.077   551.405   199.841   169.585   223.008
##   497   620.428   416.866   350.941   373.397   498.582
##   498   618.975   825.578   766.059   760.019   801.236
##   499  1474.789  1167.021  1141.371  1110.080  1229.729
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths
# get list of db sites
RLE_full <- rbind(res_RLE$contrasts[[1]]$DESeq2$de, res_RLE$contrasts[[2]]$DESeq2$de,
    res_RLE$contrasts[[3]]$DESeq2$de, res_RLE$contrasts[[4]]$DESeq2$de) %>%
    dplyr::filter(padj < 0.05) %>%
    .$id %>%
    unique()

# generate pca plot
u <- dba.plotPCA(res_RLE, label = DBA_REPLICATE, vColors = c("navy", "magenta", "cyan",
    "red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
    sub = "RLE full")

## library size full, looks similar to lib background

## not sure diff between library=DBA_LIBSIZE_FULL and no library command

res_lib <- dba.normalize(res_c, normalize = DBA_NORM_LIB, background = FALSE, spikein = FALSE)
res_lib <- dba.analyze(res_lib)
dba.peakset(res_lib, bRetrieve = TRUE, writeFile = "fulllibcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
##            seqnames          ranges strand |   X1_0h_1   X2_0h_2   X3_0h_3
##               <Rle>       <IRanges>  <Rle> | <numeric> <numeric> <numeric>
##     1 Tb927_01_v5.1       7538-7938      * |   953.621   914.359   865.808
##     2 Tb927_01_v5.1     46067-46467      * |   932.596  1019.952   805.619
##     3 Tb927_01_v5.1     47790-48190      * |  1425.175  1068.686   614.246
##     4 Tb927_01_v5.1     49105-49505      * |  1549.822   713.618   292.461
##     5 Tb927_01_v5.1     52601-53001      * |  3868.548  1637.260  1100.395
##   ...           ...             ...    ... .       ...       ...       ...
##   495 Tb927_11_v5.1 4828256-4828656      * |   521.113   700.854   637.395
##   496 Tb927_11_v5.1 4831133-4831533      * |   459.540   454.859   439.849
##   497 Tb927_11_v5.1 4833088-4833488      * |   862.013   488.509   344.163
##   498 Tb927_11_v5.1 4835964-4836364      * |   503.091   591.781   513.157
##   499 Tb927_11_v5.1 4841270-4841670      * |  1117.314   874.907   826.453
##         X1_1h_1   X2_1h_2   X3_1h_3   X1_3h_1   X2_3h_2   X3_3h_3  X1_24h_1
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   953.621   914.359   865.808  1027.481  1131.931   563.266   511.943
##     2   932.596  1019.952   805.619   738.208   686.373   653.654   685.483
##     3  1425.175  1068.686   614.246  1202.171  1132.992   756.166   855.167
##     4  1549.822   713.618   292.461   960.798   875.205   824.507   716.335
##     5  3868.548  1637.260  1100.395  3374.532  3326.838  4018.923  2974.283
##   ...       ...       ...       ...       ...       ...       ...       ...
##   495   521.113   700.854   637.395   538.159   518.758   324.071   372.147
##   496   459.540   454.859   439.849   501.531   476.323   357.140   335.511
##   497   862.013   488.509   344.163  1034.994  1059.793  1007.486   681.627
##   498   503.091   591.781   513.157  1208.745  1228.469  1080.237   598.713
##   499  1117.314   874.907   826.453  1800.439  2012.440  1949.938  1602.353
##        X2_24h_2  X3_24h_3  X1_76h_1  X2_76h_2  X3_76h_3
##       <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   390.851   559.591   357.613   342.393   461.400
##     2   640.222   720.195   592.673   562.641   620.226
##     3   600.526   573.995   561.532   550.137   937.878
##     4   795.951   294.560   313.413   356.820   599.116
##     5  3983.828  1276.905  3081.898  3028.642  4051.070
##   ...       ...       ...       ...       ...       ...
##   495   289.067   648.895   385.739   377.017   407.117
##   496   256.496   389.625   247.114   209.668   281.464
##   497   869.236   294.560   433.957   461.654   629.273
##   498   867.200   583.358   947.272   939.658  1011.260
##   499  2066.216   824.623  1411.365  1372.458  1552.072
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths
lib_full <- rbind(res_lib$contrasts[[1]]$DESeq2$de, res_lib$contrasts[[2]]$DESeq2$de,
    res_lib$contrasts[[3]]$DESeq2$de, res_lib$contrasts[[4]]$DESeq2$de) %>%
    dplyr::filter(padj < 0.05) %>%
    .$id %>%
    unique()

v <- dba.plotPCA(res_lib, label = DBA_REPLICATE, vColors = c("navy", "magenta", "cyan",
    "red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
    sub = "Library Size Full")

# RLE in peaks
res_RLE_peaks <- dba.normalize(res_c, normalize = DBA_NORM_RLE, library = DBA_LIBSIZE_PEAKREADS,
    background = FALSE, spikein = FALSE)
res_RLE_peaks <- dba.analyze(res_RLE_peaks)
dba.peakset(res_RLE_peaks, bRetrieve = TRUE, writeFile = "peakRLEcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
##            seqnames          ranges strand |   X1_0h_1   X2_0h_2   X3_0h_3
##               <Rle>       <IRanges>  <Rle> | <numeric> <numeric> <numeric>
##     1 Tb927_01_v5.1       7538-7938      * |   1096.77  1200.759  1385.132
##     2 Tb927_01_v5.1     46067-46467      * |   1072.59  1339.425  1288.839
##     3 Tb927_01_v5.1     47790-48190      * |   1639.11  1403.425   982.678
##     4 Tb927_01_v5.1     49105-49505      * |   1782.47   937.141   467.883
##     5 Tb927_01_v5.1     52601-53001      * |   4449.27  2150.090  1760.426
##   ...           ...             ...    ... .       ...       ...       ...
##   495 Tb927_11_v5.1 4828256-4828656      * |   599.338   920.379  1019.714
##   496 Tb927_11_v5.1 4831133-4831533      * |   528.523   597.332   703.677
##   497 Tb927_11_v5.1 4833088-4833488      * |   991.413   641.522   550.596
##   498 Tb927_11_v5.1 4835964-4836364      * |   578.612   777.141   820.956
##   499 Tb927_11_v5.1 4841270-4841670      * |  1285.037  1148.950  1322.172
##         X1_1h_1   X2_1h_2   X3_1h_3   X1_3h_1   X2_3h_2   X3_3h_3  X1_24h_1
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   1096.77  1200.759  1385.132   687.443   735.043   382.562   479.747
##     2   1072.59  1339.425  1288.839   493.904   445.710   443.951   642.374
##     3   1639.11  1403.425   982.678   804.321   735.732   513.576   801.386
##     4   1782.47   937.141   467.883   642.829   568.332   559.993   671.285
##     5   4449.27  2150.090  1760.426  2257.755  2160.352  2729.589  2787.233
##   ...       ...       ...       ...       ...       ...       ...       ...
##   495   599.338   920.379  1019.714   360.059   336.866   220.104   348.743
##   496   528.523   597.332   703.677   335.553   309.311   242.564   314.411
##   497   991.413   641.522   550.596   692.470   688.199   684.269   638.760
##   498   578.612   777.141   820.956   808.720   797.732   733.680   561.061
##   499  1285.037  1148.950  1322.172  1204.597  1306.820  1324.367  1501.582
##        X2_24h_2  X3_24h_3  X1_76h_1  X2_76h_2  X3_76h_3
##       <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   278.975   791.943   289.201   276.936   365.574
##     2   456.967  1019.232   479.294   455.078   491.414
##     3   428.633   812.328   454.111   444.965   743.094
##     4   568.121   416.866   253.457   288.605   474.688
##     5  2843.509  1807.098  2492.331  2449.643  3209.721
##   ...       ...       ...       ...       ...       ...
##   495   206.325   918.328   311.948   304.941   322.565
##   496   183.077   551.405   199.841   169.585   223.008
##   497   620.428   416.866   350.941   373.397   498.582
##   498   618.975   825.578   766.059   760.019   801.236
##   499  1474.789  1167.021  1141.371  1110.080  1229.729
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths
RLE_peaks <- rbind(res_RLE_peaks$contrasts[[1]]$DESeq2$de, res_RLE_peaks$contrasts[[2]]$DESeq2$de,
    res_RLE_peaks$contrasts[[3]]$DESeq2$de, res_RLE_peaks$contrasts[[4]]$DESeq2$de) %>%
    dplyr::filter(padj < 0.05) %>%
    .$id %>%
    unique()

w <- dba.plotPCA(res_RLE_peaks, label = DBA_REPLICATE, vColors = c("navy", "magenta",
    "cyan", "red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
    sub = "RLE Reads in Peaks")

## library size reads in peaks

res_lib_peaks <- dba.normalize(res_c, normalize = DBA_NORM_LIB, library = DBA_LIBSIZE_PEAKREADS,
    background = FALSE, spikein = FALSE)
res_lib_peaks <- dba.analyze(res_lib_peaks)
dba.peakset(res_lib_peaks, bRetrieve = TRUE, writeFile = "peaklibcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
##            seqnames          ranges strand |   X1_0h_1   X2_0h_2   X3_0h_3
##               <Rle>       <IRanges>  <Rle> | <numeric> <numeric> <numeric>
##     1 Tb927_01_v5.1       7538-7938      * |   1097.45   1287.78  1493.182
##     2 Tb927_01_v5.1     46067-46467      * |   1073.26   1436.49  1389.378
##     3 Tb927_01_v5.1     47790-48190      * |   1640.13   1505.13  1059.334
##     4 Tb927_01_v5.1     49105-49505      * |   1783.57   1005.05   504.382
##     5 Tb927_01_v5.1     52601-53001      * |   4452.02   2305.90  1897.752
##   ...           ...             ...    ... .       ...       ...       ...
##   495 Tb927_11_v5.1 4828256-4828656      * |   599.710   987.077  1099.259
##   496 Tb927_11_v5.1 4831133-4831533      * |   528.851   640.619   758.569
##   497 Tb927_11_v5.1 4833088-4833488      * |   992.027   688.012   593.547
##   498 Tb927_11_v5.1 4835964-4836364      * |   578.971   833.459   884.997
##   499 Tb927_11_v5.1 4841270-4841670      * |  1285.833  1232.212  1425.310
##         X1_1h_1   X2_1h_2   X3_1h_3   X1_3h_1   X2_3h_2   X3_3h_3  X1_24h_1
##       <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   1097.45   1287.78  1493.182   751.197   798.939   413.939   502.809
##     2   1073.26   1436.49  1389.378   539.708   484.455   480.363   673.252
##     3   1640.13   1505.13  1059.334   878.914   799.688   555.699   839.908
##     4   1783.57   1005.05   504.382   702.444   617.737   605.922   703.553
##     5   4452.02   2305.90  1897.752  2467.139  2348.148  2953.465  2921.214
##   ...       ...       ...       ...       ...       ...       ...       ...
##   495   599.710   987.077  1099.259   393.451   366.149   238.157   365.507
##   496   528.851   640.619   758.569   366.672   336.199   262.458   329.524
##   497   992.027   688.012   593.547   756.690   748.023   740.391   669.465
##   498   578.971   833.459   884.997   883.720   867.078   793.855   588.030
##   499  1285.833  1232.212  1425.310  1316.311  1420.420  1432.989  1573.762
##        X2_24h_2  X3_24h_3  X1_76h_1  X2_76h_2  X3_76h_3
##       <numeric> <numeric> <numeric> <numeric> <numeric>
##     1   290.605   865.928   284.315   274.116   363.082
##     2   476.017  1114.450   471.196   450.444   488.064
##     3   446.502   888.217   446.439   440.434   738.029
##     4   591.805   455.810   249.175   285.666   471.453
##     5  2962.050  1975.920  2450.221  2424.697  3187.844
##   ...       ...       ...       ...       ...       ...
##   495   214.926  1004.119   306.677   301.836   320.366
##   496   190.709   602.917   196.465   167.858   221.488
##   497   646.293   455.810   345.012   369.595   495.184
##   498   644.779   902.705   753.115   752.280   795.775
##   499  1536.270  1276.045  1122.086  1098.775  1221.348
##   -------
##   seqinfo: 11 sequences from an unspecified genome; no seqlengths
lib_peaks <- rbind(res_lib_peaks$contrasts[[1]]$DESeq2$de, res_lib_peaks$contrasts[[2]]$DESeq2$de,
    res_lib_peaks$contrasts[[3]]$DESeq2$de, res_lib_peaks$contrasts[[4]]$DESeq2$de) %>%
    dplyr::filter(padj < 0.05) %>%
    .$id %>%
    unique()

y <- dba.plotPCA(res_lib_peaks, label = DBA_REPLICATE, vColors = c("navy", "magenta",
    "cyan", "red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
    sub = "LIbrary Size Reads in Peaks")

Jaccard Distance Plot of Normalization Techniques

## jaccard distance plot

# list of peak calls per method
ll_2 <- list(spikeinlibsize, spikeinrle, background, rpkm, lib_peaks, RLE_peaks,
    RLE_full, lib_full, background_lib)

# enumerate combinations of peak calling methods and calculate jaccard
# similarities

dist <- unlist(lapply(combn(ll_2, 2, simplify = FALSE), function(x) {
    length(intersect(x[[1]], x[[2]]))/length(union(x[[1]], x[[2]]))
}))
# initialize matrix to store results
to_make_mat <- cbind(t(combn(9, 2)), dist)
d_mat <- matrix(ncol = 9, nrow = 9)
diag(d_mat) <- 1


# add jaccard similarities to matrix
for (i in 1:dim(to_make_mat)[1]) {
    d_mat[as.numeric(to_make_mat[i, 2]), as.numeric(to_make_mat[i, 1])] <- to_make_mat[i,
        3]
}
colnames(d_mat) <- c("Spike-in\nLib. Size", "Spike-in\nRLE", "Background\nRLE", "RPKM",
    "Library Peak", "RLE Peak", "RLE Full", "Library Full", "Background\nLibrary")
rownames(d_mat) <- c("Spike-in\nLib. Size", "Spike-in\nRLE", "Background\nRLE", "RPKM",
    "Library Peak", "RLE Peak", "RLE Full", "Library Full", "Background\nLibrary")
# convert jaccard similarity to jaccard distance

d_mat <- as.dist(1 - d_mat)

hc <- hclust(d_mat)
plot(hc, main = "", sub = "", xlab = "", ylab = "Jaccard Distance")

Biological Assumptions for ChIP-Seq Normalization

In Ciaran’s Paper, Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions, he talks about to main biological assumptions made in RNA-Seq normalization Techniques (Evans, Hardin, and Stoebel 2016).

  1. Symmetry in Differential Expression

  2. Same total amount of mRNA/cell

Figure 1: Assumptions Diagram

As illustrated in the diagram above, symmetry focuses on the number of genes up-regulated between two conditions. For instance, cond. A in (a) above has three up-regulated whereas cond. B only has one so it is asymmetric differential expression (Evans, Hardin, and Stoebel 2016).

Meanwhile, total amount of mRNA/cell is the summation of the bars under one condition in the diagram. This is because total mRNA/cell quantifies the total level of expression of all genes considered under some condition. For example, summing up the bars in cond. A in picture (a) is equivalent to the sum of the bars under cond. B in (a) even though its differential expression between the conditions is asymmetric (Evans, Hardin, and Stoebel 2016).

Parallel Assumptions in ChIP-Seq to RNA-Seq

Now, we can translate these RNA-seq assumptions to ChIP-Seq data given ChIP-Seq and RNA-Seq have the same structure once the data is collected and aligned.

As an example, here are mapped reads for ChIP-Seq:

ChIP

These have the same structure as the mapped reads for RNA-Seq:

RNA The key difference between RNA-Seq and ChIP-Seq is what these maps signify. In ChIP-Seq, the map is over a chromosomes, therefore each read is mapped to a genomic region (i.e. region on a chromosome). Ideally, peaks represent regions for which there is a binding site present for our transcription factor or histone of interest (StatQuest, n.d.). Rather than looking for between-sample differential expression, like RNA-Seq, ChIP-seq wants to find between-sample differentially-bound regions, that is, the regions in which the signal significantly changes between conditions (Höllbacher et al. 2020).

Using this biological description of ChIP-Seq, we can then translate the RNA-Seq assumptions that Evans, Hardin, and Stoebel (2016) talks about in his paper.

Symmetry

Symmetry is virtually the same as it was for RNA-seq. For ChIP-Seq, symmetry measures if the number of up-regulated genomic regions is consistent between conditions. When this number is not consistent, there is asymmetry in the differentially-bound regions between conditions.

mRNA/cell

There is no assumption of same total number of mRNA/cell for ChIP-Seq. This is because ChIP-Seq aims to find binding sites on DNA. Therefore, there is no mRNA present in ChIP-seq like there is for RNA-seq. As previously stated, amount of mRNA/cell quantifies the magnitude of gene expression in RNA-Seq (Wu et al. 2015).

Therefore, an equivalent assumption in ChIP-Seq is that the amount of DNA binding is equivalent across conditions. That is, the summation of signal at a specified region (equivalent to gene expression level in RNA-seq) is the same across conditions).

ChIP-Seq Assumptions Diagram

Using RNA-Seq for ChIP-Seq Normalization Categorization

Now that it has been established that similar assumptions about total signal magnitude and symmetry in differentially-bound regions exist in ChIP-Seq, we can begin to easily categorize ChIP-Seq normalization techniques by their biological assumptions. Note that Evans, Hardin, and Stoebel (2016) categorized RNA-seq normalization techniques by their biological assumptions using their mathematical algorithms which he gives in the “Supplementary data” section. Many RNA-Seq normalization techniques were adopted in ChIP-Seq, as evident through Stark and Brown (2022a)’s DiffBind bioconductor package. As such, the mathematics of these ChIP-Seq normalization techniques are the same as RNA-Seq (though the context has changed).

Thus, we can use Ciaran’s work to establish the following:

Full Library Size Normalization

Assumptions

  1. Same total DNA-protein binding: the amount of total DNA binding is the same between different experimental conditions.

RPKM

Assumptions

  1. Same total DNA-protein binding: the amount of total DNA binding is the same between different experimental conditions.

Quantile Normalization

Assumptions

  1. Technical effects are the same for differentially bound and non-differentially bound genomic regions

  2. Roughly symmetric differentially-bound regions across conditions (same number of up and down-regulated genomic regions)

Full Relative Log Expression (DESeq and DESeq2)

DESeq2 and DESeq’s normalization technique is synonymous to RLE (Relative Log Expression) normalization. So, we can use Evans, Hardin, and Stoebel (2016)’s discussion of DESeq’s assumptions for RLE on full library.

Assumptions

  1. Technical effects are the same for differentially bound and non-differentially bound genomic regions

  2. Roughly symmetric differentially-bound regions across conditions (same number of up and down-regulated genomic regions)

Full Trimmed Mean of the M-Values (TMM)

Assumptions

  1. Technical effects are the same for differentially bound and non-differentially bound genomic regions

  2. Roughly symmetric differentially-bound regions across conditions (same number of up and down-regulated genomic regions)

For some RNA-Seq techniques, two additional variations were added when translating it to ChIP-Seq normalization. These involved looking at the “Reads in Peaks” and “Background Bins” subsets of the full libraries. As these appear to be fundamentally different than the standard RNA-Seq normalization techniques that use the full library, their assumptions will not be derived from Evans, Hardin, and Stoebel (2016)’s paper.

We can also normalize using experimental controls. These controls are experimental and are used to normalize the data. With controls, other standard normalization techniques can be used (e.g., quantile normalization, library size normalization, or RLE normalization).

Spike-Ins

Like in RNA-Seq, there are also spike-in controls for ChIP-Seq. Spike-ins are external chromatin added to the experiment that are supposed to be unaffected by the antibody added, so there are no biological peaks from the spike-in chromatin (Bonhoure et al. 2014); the only enriched loci for the spike-in control would be from technical effects. In this sense, the operate equivalently to spike-ins for RNA-seq.

Assumptions

  1. Controls do exist! Their signal also behaves as expected (e.g., adding a negative control that is supposed to have no enriched peaks across conditions actually does have no enriched peaks across conditions)

  2. Controls behave like non-control chromatin (i.e., genome): The technical effects on the controls represent the technical effects to all chromatin in the experiment. Thus, the controls can help us normalize the data.

Parallel Factors

Parallel factors add a second antibody, which has known peaks, to the experiment (Guertin et al. 2018). As the peaks for parallel factors are known a priori, parallel factors can be used to detect technical effects. The impetus behind parallel factors aligns with the mission of Housekeeping genes in RNA-seq to use an internal a priori control to detect technical effects. Thus, the assumptions listed below align with those for RNA-Seq’s house keeping genes.

Assumptions

  1. Controls do exist! The second antibody behaves as expected and generates peaks of known location and magnitude.

  2. The control reads (i.e. those associated with the second antibody) behave like the reads associated with the non-control antibody: the technical effects on the controls represent the technical effects to all chromatin in the experiment. Thus, the controls can help us normalize the data.

Biological Assumptions for Novel ChIP-Seq Normalization Techniques

Loess Normalization (Taslim et al.)

Loess normalization tethers points to a Loess smoothing curve and then taughts the loess curve such that it is centered at 0. In Taslim et al. (2009)’s Loess Normalization, they create two loess curves: (1) with respect to the mean and (2) with respect to the variance, to normalize both statistics.

Assumptions

  1. Roughly symmetric differentially-bound regions across conditions (same number of up and down-regulated genomic regions)

MAnorm

MAnorm is a normalization technique specific for ChIP-Seq data that uses the common peaks to create a linear model to normalize the data and then applies this model to all the peaks (Shao et al. 2012). As such, MAnorm is a varation of “Reads in Peaks” normalization that only looks at the common peaks.

Assumptions

  1. Most common/shared peaks are not differentially bound. This assumption relates to the x-axis value being less than 50% in the simulation diagram Evans, Hardin, and Stoebel (2016) provides:

  1. Technical effects are the same for differentially bound and non-differentially bound genomic regions (Shao et al. 2012)

    “Second, the observed differences in sequence read density in common peaks are presumed to reflect the scaling relationship of ChIP-Seq signals between two samples, which can thus be applied to all peaks.”

  2. Same total DNA-protein binding: the amount of total DNA binding is the same between different experimental conditions.

Reads in Peaks

Reads in peaks is a normalization technqiue unique to ChIP-Seq data. Reads in Peak normalization techniques only look at the subset of reads that overlap with the consensus peak set in each replicate/sample (Stark and Brown 2022a). An example of this is Shao et al. (2012), which only looks at common peaks to form a normalization equation. The goal of normalizing with reads in peaks is that not only sequencing depth will be corrected, but also ChIP efficiency (that is measured through the proportion of reads in enriched peak regions) (Stark and Brown 2022a). As such, Stark and Brown (2022a) recommends using “RiP” library sizes for native normalization methods, developed originally for RNA-seq.

However, efficiency may not always be an issue for ChIP-Seq. For example, if different replicates have fairly consistent FRiP numbers, ChIP efficiency does not seem to be an issue (Stark and Brown 2022a).

Effective library size and reads in peaks are equivalent for ChIP-Seq as both look at consensus peak set (Wu et al. 2015). It appears that effective library size/reads in peaks assumes comparable DNA-protein binding between conditions according to a paper by Wu et al. (2015) and Stark and Brown (2022a). So…

Assumptions:

  1. Same total DNA-protein binding: the amount of total DNA binding is the same between different experimental conditions.

Background Bins

Using only background bins to normalize data is also novel to ChIP-Seq data. As aforementioned, normalization using background bins only considers reads that are in chromosomes with peaks in the consensus peak set. These chromosomes are then partitioned into segments that are ~15,000 base pairs long (Stark and Brown 2022b). Background bins are considered a more “neutral” set of reads (Stark and Brown 2022a). 15,000bp segments are chosen because peaks in ChIP-Seq are expected to be around 100-600bp (Stark and Brown 2022a). Thus, it is assumed no global change between biological conditions will occur over a 15,000bp interval. Thus, any large-scale changes we see in these intervals are expected to be from technical artifacts rather than biological shifts.

Assumptions

  1. Same total DNA-protein binding: the amount of total DNA binding is the same between different experimental conditions (i.e., there is no global shift between conditions)

  2. Technical effects are the same for differentially bound and non-differentially bound (i.e. background bin) genomic regions.

Math Memes (bc why not)

This one reminded me of the joy that was trying to cross-validate for Ridge Regression and Lasso lol

References

Ashby, Ethan. 2021. 01-02-2022-DiffBind-Analysis-Methods.
Bonhoure, Nicolas, Gergana Bounova, David Bernasconi, Viviane Praz, Fabienne Lammers, Donatella Canella, Ian M Willis, et al. 2014. “Quantifying ChIP-Seq Data: A Spiking Method Providing an Internal Reference for Sample-to-Sample Normalization.” Genome Res. 7 (July). https://doi.org/10.1101/gr.168260.113.
Evans, Ciaran, Johanna Hardin, and Daniel Stoebel. 2016. “Selecting Between-Sample RNA-Seq Normalization Methods from the Perspective of Their Assumptions.” Briefings in Bioinformatics 19 (September). https://doi.org/10.1093/bib/bbx008.
Guertin, Michael, Amy Cullen, Florian Markowetz, and Andrew Holding. 2018. “Parallel Factor ChIP Provides Essential Internal Control for Quantitative Differential ChIP-Seq.” Nucleic Acids Research 46 (April). https://doi.org/10.1093/nar/gky252.
Höllbacher, Barbara, Kinga Balázsa, Matthias Heinig, and N. Henriette Uhlenhaut. 2020. “Seq-Ing Answers: Current Data Integration Approaches to Uncover Mechanisms of Transcriptional Regulation.” Computational and Structural Biotechnology Journal 18. https://doi.org/10.1016/j.csbj.2020.05.018.
Kruczyk, Marcin, Husen M Umer, Stefan Enroth, and Jan Komorowski. 2013. “Peak Finder Metaserver - a Novel Application for Finding Peaks in ChIP-Seq Data.” BMC Bioinformatics 14 (March). https://doi.org/10.1186/1471-2105-14-280.
Love, Michael I, Wolfgang Huber, and Simon Anders. 2014. “Moderated Estimation of Fold Change and Dispersion for RNA-Seq Data with DESeq2.” Genome Biology 15 (December). https://doi.org/10.1186/s13059-014-0550-8.
Nair, Nishanth Ulhas, Avinash Das Sahu, Philipp Bucher, and Bernard M. E. Moret. 2012. “ChIPnorm: A Statistical Method for Normalizing and Identifying Differential Regions in Histone Modification ChIP-Seq Libraries.” PLos One 7 (August). https://doi.org/10.1371/journal.pone.0039573.
Robinson, Mark D, and Alicia Oshlack. 2010. “A Scaling Normalization Method for Differential Expression Analysis of RNA-Seq Data.” Genome Biology 11 (March). https://doi.org/10.1186/gb-2010-11-3-r25.
Shao, Zhen, Yijing Zhang, Guo-Cheng Yuan, Stuart H Orkin, and David J Waxman. 2012. “MAnorm: A Robust Model for Quantitative Comparison of ChIP-Seq Data Sets.” Genome Biology 13 (March). https://doi.org/10.1186/gb-2012-13-3-r16.
Stark, Rory, and Gord Brown. 2022a. DiffBind: Differential Binding Analysis of ChIPSeq Peak Data. Cancer Research UK - Cambridge Institute. https://bioconductor.org/packages/devel/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf.
———. 2022b. Differential Binding Analysis of ChIP-Seq Peak Data. Cancer Research UK - Cambridge Institute. https://bioconductor.org/packages/devel/bioc/manuals/DiffBind/man/DiffBind.pdf.
StatQuest. n.d. “StatQuest: A Gentle Introduction to ChIP-Seq.” YouTube. https://www.youtube.com/watch?v=nkWGmaYRues.
Taslim, Cenny, Jiejun Wu, Pearlly Yan, Greg Singer, Jeffrey Parvin, Tim HUang, Shili Lin, and Kun Huang. 2009. “Comparative Study on ChIP-Seq Data: Normalization and Binding Pattern Characterization.” Bioinformatics 25 (September): 2334–40. https://doi.org/10.1093/bioinformatics/btp384.
Tu, Shiqi, Mushan Li, Chen Haojie, Fengxiang Tan, Jian Xu, David Waxman, Yijing Zhang, and Zhen Shao. 2020. “MAnorm2 for Quantitatively Comparing Groups of ChIP-Seq Samples.” Genome Research 31 (November): gr.262675.120. https://doi.org/10.1101/gr.262675.120.
Wu, Dai-Ying, Danielle Bittencourt, Michael R. Stallcup, and Kimberly D. Siegmund. 2015. “Identifying Differential Transcription Factor Binding in ChIP-Seq.” Frontiers in Genetics 6 (April). https://doi.org/10.3389/fgene.2015.00169.