Chromatin Immunoprecipitation followed by high-throughput sequencing, colloquially known as ChIP-Seq, identifies the locations of DNA-protein binding sites on the genome. ChIP-Seq data are collected through the procedure outlined in the diagram below.
An Outline of ChIP-Seq Data Collection
As seen in the diagram, the protein-of-interest (e.g., a histone modification or transcription factor) is cross-linked to the chromatin, and an antibody known to bind the protein-of-interest is then used to pull down the genomic regions where the protein-of-interest is bound. This extraction process generates reads, which are then aligned and mapped to the genome (StatQuest, n.d.).
Once the ChIP-Seq data are processed, they must undergo some preliminary analysis before biological conclusions can be drawn. For instance, peaks must be called. In calling peaks, background noise is separated from true biological signal. This step is vital for generating the consensus peak set, which is a union of the called peaks across conditions. Oftentimes, called peaks that overlap in a sufficient proportion of replicates are combined into one consensus peak. The consensus peak set is then used for downstream ChIP-Seq analysis of protein-binding sites in the genome (Höllbacher et al. 2020).
Recently, ChIP-Seq analyses have increasingly focused on comparing binding sites for proteins-of-interest between different biological conditions. This characterization of the change in binding sites between samples is known as “Differential Binding Analysis.” By its nature, Differential Binding Analysis requires normalization to make an apples-to-apples comparison of reads between biological conditions (Höllbacher et al. 2020).
For my Summer Undergraduate Research Project (SURP), I will be analyzing between-sample ChIP-Seq normalization techniques through the lens of their biological assumptions. This project mirrors the paper by Evans, Hardin, and Stoebel (2016), Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions, which discusses the biological assumptions behind between-sample normalization techniques for RNA-Seq, another high-throughput sequencing method. By the end of my SURP, I hope to understand the following:
What are the normalization techniques for ChIP-Seq? How are they different from those of RNA-Seq?
What are the biological assumptions associated with ChIP-Seq normalization techniques? Again, how are they different from the biological assumptions for RNA-Seq normalization techniques?
How can we use software such as R (and potentially MATLAB) to demonstrate the biological assumptions each ChIP-Seq normalization technique makes (e.g., by simulating ChIP-Seq read maps)?
In this write-up, I will summarize my findings from the past three weeks and provide insight into where I hope to take this project in the coming weeks in order to better understand the answers to the questions posed above.
Listed below are the normalization techniques that I have found so far in literature. These are described in detail in the next section, “Summary of Normalization Methods.”
The following definitions are important to ChIP-Seq normalization and come from the vignette for DiffBind, a popular Bioconductor package for ChIP-Seq analysis (Stark and Brown 2022a).
IMPORTANT DEFINITIONS:
FULL: uses all the reads in each sample.
READS IN PEAKS: just uses the reads in each sample that overlap the consensus peak set.
BACKGROUND BINS: just uses the reads in the chromosomes that contain peaks in the consensus peak set. These chromosomes are then partitioned into segments of ~15,000 bp.
1. RPKM
2. Library Size
2a. Full
2b. Reads in Peaks
2c. Background Bins
3. Trimmed Mean of M-Values (TMM)
3a. Full
3b. Reads in Peaks
3c. Background Bins
4. Relative Log Expression (RLE)
4a. Full
4b. Reads in Peaks
4c. Background Bins
5. Quantile Normalization
6. ChIPNorm
7. MAnorm
8. Loess Normalization (Taslim et al.)
9. Parallel Factors
10. Spike-ins
After collecting a curated list of ChIP-Seq normalization techniques, I found literature that defines each technique and provides the mathematical algorithm used to normalize raw ChIP-Seq data. This section is my first draft of a summary of the ChIP-Seq normalization methods common in the literature.
The following important definition also comes from the DiffBind Vignette (Stark and Brown 2022a) as well as the DiffBind Manual (Stark and Brown 2022b).
IMPORTANT DEFINITIONS:
CONSENSUS PEAKS = peaks that overlap across a certain proportion of replicates (different samples under the same condition); the peaks determined for each condition are then typically unioned to create a complete list of consensus peaks.
Overlapping can be defined in many different ways. If one wants to be more confident in the consensus peaks, one may require more overlap (e.g., 10 bp rather than 1 bp) or require a higher proportion of replicates to show this magnitude of overlap (e.g., 75% of replicates rather than 50%). This is done to minimize the number of false positives (i.e., to control the false discovery rate, FDR).
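To make the overlap requirement concrete, below is a minimal sketch (not part of the original analysis) using the Bioconductor package GenomicRanges with two small, made-up replicate peak sets; the objects rep1 and rep2 and the 10 bp threshold are hypothetical.
library(GenomicRanges)
# two toy replicate peak sets on the same chromosome
rep1 <- GRanges("chr1", IRanges(start = c(100, 500, 900), end = c(250, 650, 1000)))
rep2 <- GRanges("chr1", IRanges(start = c(240, 700, 950), end = c(400, 800, 1100)))
# peaks in rep1 supported by at least 10 bp of overlap in rep2
hits <- countOverlaps(rep1, rep2, minoverlap = 10)
rep1[hits > 0]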
RPKM stands for “reads per kilobase per million mapped reads.” RPKM normalizes ChIP-Seq data by taking every read count and dividing it by the number of reads in the sample (in millions) and by the length of the region (in kilobases). By dividing by both quantities, RPKM normalizes for region length and for the total number of reads in each sample (Evans, Hardin, and Stoebel 2016).
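As a quick illustration, here is a minimal R sketch of the RPKM calculation; the objects counts (a regions-by-samples matrix of read counts) and region_lengths (region widths in bp) are hypothetical stand-ins, not objects from the analysis below.
# RPKM: count / (region length in kb) / (library size in millions)
rpkm <- function(counts, region_lengths) {
  lib_sizes <- colSums(counts)                  # library size per sample (approximated here by column totals)
  per_kb    <- counts / (region_lengths / 1000) # scale each region by its length in kb
  t(t(per_kb) / (lib_sizes / 1e6))              # scale each sample by its depth in millions
}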
Library Size (Full) normalizes between-sample ChIP-Seq data by dividing each sample by its full library size. The full library size of each sample is the total number of aligned reads in the sample, which is alternatively known as sequencing depth (Evans, Hardin, and Stoebel 2016).
Library Size (Reads in Peaks) is a variation of Library Size (Full) that calculates library size by summing the reads that overlap consensus peaks in each sample. Each sample is then normalized by dividing every read count by the sample’s number of reads in consensus peak regions. Unlike the full library size, the library size based on reads in peaks is also affected by the ChIP’s “efficiency” (i.e., the proportion of reads in enriched peaks; a high proportion of reads outside enriched peaks signifies an inefficient ChIP experiment) (Stark and Brown 2022a).
Library Size (Background Bins) is another variation of the library-size normalization technique. Here, library size is computed from background bins, which are large regions (approximately 15,000 bp) on chromosomes containing a peak. The reads in these background bins for sample j are summed, and that sum serves as the divisor for every read count in sample j (Stark and Brown 2022b).
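A minimal sketch of the full and reads-in-peaks variants is given below; counts (a regions-by-samples count matrix), lib_full (total aligned reads per sample), and in_peaks (a logical vector flagging consensus-peak rows) are hypothetical objects used only for illustration.
# reads-in-peaks library size: sum only the counts that fall in consensus peaks
lib_peaks <- colSums(counts[in_peaks, ])
# divide each sample's counts by its library size, rescaled to the mean depth
norm_full  <- sweep(counts, 2, lib_full  / mean(lib_full),  "/")
norm_peaks <- sweep(counts, 2, lib_peaks / mean(lib_peaks), "/")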
TMM stands for Trimmed Mean of M-values and is a normalization technique native to the R package edgeR. TMM sets one sample as a reference and compares every other sample to it, estimating the ratio of sequencing depths between each sample and the reference (Robinson and Oshlack 2010).
The trimming of the sequenced genomic regions occurs twice: (1) based on the fold-change in signal in a specific region between samples, and (2) based on the signal magnitude in a specific genomic region between samples. TMM trims twice in order to remove the majority of differentially bound regions, so that the mean is taken only over non-differentially bound genomic regions. For those regions, the ratio of read counts in a sample relative to the reference sample can be interpreted as a ratio of sequencing depths (Robinson and Oshlack 2010).
\(k_{ij}\) is the number of reads aligned to region i under sample j
\(u_{ij}\) is the true signal magnitude of region i under sample j.
\(N_j\) is the total number of reads for sample j. Thus, \(N_j = \sum_{i} k_{ij}\)
Let r be the fixed reference sample.
Then, the fold-changes for the regions are defined as follows:
\[M_{ij}^r = \log_2\frac{k_{ij}/N_j}{k_{ir}/N_r}\]
Additionally, the absolute signal magnitude is given as:
\[A_{ij}^r = \frac{1}{2}\log_2\left(\frac{k_{ij}}{N_j}\cdot\frac{k_{ir}}{N_r}\right)\] As alluded to above, the fold-changes and absolute signal magnitudes of the genomic regions (i.e., \(M_{ij}^r\) and \(A_{ij}^r\), respectively) are trimmed independently: the most extreme (smallest and largest) values of each are removed (by default, Robinson and Oshlack (2010) trim the \(M\)-values by 30% and the \(A\)-values by 5%). This leaves a finalized list of regions for which neither \(A_{ij}^r\) nor \(M_{ij}^r\) was removed; denote this set as \(R\).
Set \(R\) is then used to calculate the scaling factor, \(TMM_j^{(r)}\) for sample j through a weighted mean:
\[\log_2(TMM_j^{(r)})=\frac{\sum_{i \in R}w_{ij}^rM_{ij}^r}{\sum_{i \in R}w_{ij}^r}\]
where,
\[w_{ij}^r = \left(\frac{N_j - k_{ij}}{N_jk_{ij}} + \frac{N_r - k_{ir}}{N_rk_{ir}}\right)^{-1}\] Here the weights are the inverse of the approximate (delta-method) variance of \(M_{ij}^r\), so noisier regions contribute less to the weighted mean. Because the \(M_{ij}^r\) values in the \(\log_2(TMM_j^{(r)})\) calculation are already scaled by each sample’s library size (\(N_j\)), \(TMM_j^{(r)}\) captures composition bias beyond raw sequencing depth. To normalize the ChIP-seq data, each sample’s counts are therefore divided by its effective library size relative to the reference, \(TMM_j^{(r)}\cdot\frac{N_j}{N_r}\), where \(N_r\) is the library size of the reference sample (Robinson and Oshlack 2010).
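In practice, these TMM factors are usually computed with edgeR rather than by hand. Below is a hedged sketch assuming a hypothetical regions-by-samples count matrix counts (full-library, peak-read, or background-bin counts depending on the variant).
library(edgeR)
dge <- DGEList(counts = counts)
dge <- calcNormFactors(dge, method = "TMM")   # one TMM factor per sample
# effective library sizes used for normalization: N_j * TMM_j^(r)
eff_lib <- dge$samples$lib.size * dge$samples$norm.factors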
TMM (Trimmed Mean of M-values) on peak reads applies the same algorithm described for the full library in Robinson and Oshlack (2010), but only to the subset of reads that overlap the consensus peaks (at least in DiffBind). As with TMM on the full library, one sample is selected as the reference and the peak regions are trimmed twice: first by the fold-change of each consensus peak region between sample j and the reference, and second by the signal magnitude of each consensus peak region in sample j relative to the reference.
One reason to use TMM on only peak reads is that studies such as Tu et al. (2020) suggest there is high regularity at common (i.e., consensus) peak regions. Therefore, the scaling factors (\(TMM_j^{(r)}\)) may be more reliable when only reads in consensus peak regions are used rather than the full library of reads, which in turn increases confidence in the normalization. For sites that are differentially bound, we would expect large fold-changes relative to the reference, so the most extreme \(A_{ij}^r\) and \(M_{ij}^r\) values are independently removed, leaving a list of regions we will denote as \(R_p\), which plays the same role as \(R\) in the calculations for “TMM on Full Library” (Tu et al. 2020).
The mathematical process presented in “TMM on Full Library” to normalize ChIP-seq data also applies to “TMM on Peak Reads.” The only difference is that only the subset of reads overlapping the consensus peaks is considered. Therefore, \(k_{ij}\) is the number of reads aligned to consensus peak region i under sample j; \(u_{ij}\) is the true signal magnitude of consensus peak region i under sample j; \(N_j\) is the total number of reads that overlap consensus peak regions for sample j. r still denotes the fixed reference sample for “TMM on Peak Reads,” just as it did for “TMM on Full Library.”
TMM on Background Bins is another variation of TMM for normalizing ChIP-Seq Data. Background bins use all chromosomes for which there are peaks in the consensus set. These chromosomes are then partitioned into large bins (DiffBind’s default size is 15,000bp) and the reads that overlap these bins are counted (Stark and Brown 2022b). These large stretches in the chromosome are expected to not have differential enrichment between samples and therefore help ensure the mean is calculated over only non-differentially bound genomic regions for which the ratio of read counts in a sample compared to the reference sample can be expressed as a ratio of sequencing depths for the non-differentially bound genomic regions (Stark and Brown 2022a).
Like the two TMM methods above, TMM on background bins trims twice: (1) based on the fold-change in signal in a specific “background” bin between samples, and (2) based on the signal magnitude of a specific “background” bin between samples (Robinson and Oshlack 2010). These trims are performed independently and quantified via \(A_{ij}^r\) (absolute signal magnitude) and \(M_{ij}^r\) (fold-change). As before, we are left with a list of background bins, denoted \(R_b\), whose \(A_{ij}^r\) and \(M_{ij}^r\) values survived trimming. \(R_b\) plays the same role as \(R\) in the calculations for “TMM on Full Library.”
The mathematical process presented in “TMM on Full Library” to normalize ChIP-seq data also applies to “TMM on Background Bins.” The only difference is that the background bins are used as opposed to potential enriched regions in the genome. Therefore, \(k_{ij}\) is the number of reads aligned to background bin i under sample j; \(u_{ij}\) is the true signal magnitude of background bin i under sample j; \(N_j\) is the total number of reads in background bins for sample j. r still denotes the fixed reference sample for “TMM on Background” just as it did for “TMM on Full Library.”
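Within DiffBind, this variant corresponds to combining TMM with background bins; below is a hedged sketch assuming a hypothetical DBA object dba_obj that has already been through dba.count().
library(DiffBind)
# TMM computed over large background bins rather than consensus peaks
dba_obj <- dba.normalize(dba_obj, normalize = DBA_NORM_TMM, background = TRUE)
dba.normalize(dba_obj, bRetrieve = TRUE)   # inspect the resulting normalization factors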
RLE (Relative Log Expression) on the full library is a normalization technique native to the Bioconductor packages DESeq and DESeq2. For each region, it computes the ratio of the region’s read count in a sample to the geometric mean of that region’s read counts across all samples. The median of these ratios within a sample is then taken as the size factor, which is applied to the sample’s full library of reads to account for differences in sequencing depth between samples (Love, Huber, and Anders 2014).
The idea is that most regions are not differentially bound and thus have no significant fold-changes between-samples. So, the median of the ratio of total reads in a region between a sample and a pseudo-sample serves as a good estimate of sequencing depth (Evans, Hardin, and Stoebel 2016).
Let \(s_j\) denote the size factor for sample j (found through median-of-ratios method). \(K_{ij}\) represents the total read count for region i under sample j. m is the total number of samples across every condition.
\[s_{j} = \operatorname{median}_{i}\left[\frac{K_{ij}}{\left(\prod_{v=1}^{m} K_{iv}\right)^{1/m}}\right]\]
The denominator inside \(\operatorname{median}_{i}\) represents a pseudo-reference sample to which each sample j is compared in the ratio above.
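Below is a minimal median-of-ratios sketch of this size-factor calculation; the regions-by-samples matrix counts is a hypothetical stand-in.
rle_size_factors <- function(counts) {
  log_geo_means <- rowMeans(log(counts))     # log of the geometric mean for each region
  apply(counts, 2, function(k) {
    ratios <- log(k) - log_geo_means         # log ratio of the sample to the pseudo-reference
    exp(median(ratios[is.finite(ratios)]))   # size factor s_j (regions with zero counts are dropped)
  })
}
# normalized counts: divide each sample's column by its size factor
# norm_counts <- sweep(counts, 2, rle_size_factors(counts), "/")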
Similar to the distinction between TMM on Full Library and TMM on Reads in Peaks, the math for RLE on Reads in Peaks is the same as the math for RLE on Full Library provided in Love, Huber, and Anders (2014). The only distinction is the set of reads considered in the ratio. In RLE on Reads in Peaks, only reads that overlap consensus peaks in sample j are considered in the median used to find the size factor for sample j (Stark and Brown 2022a). Like before, let \(s_j\) denote the size factor for sample j. \(K_{ij}\) now represents the number of reads that overlap consensus peak region i under sample j. m is still the total number of samples across every condition. The denominator, \((\prod_{v=1}^{m} K_{iv})^{1/m}\), still represents a pseudo-reference sample; however, this geometric mean is now taken only over the counts for consensus peak region i across all samples.
As aforementioned, background bins use all chromosomes for which there are peaks in the consensus set. These chromosomes are then partitioned into large bins (DiffBind’s default size is 15,000bp) and the reads that overlap these bins are counted (Stark and Brown 2022b). These large stretches in the chromosome are expected to not have differential enrichment between samples and therefore help ensure the mean is calculated over only non-differentially bound genomic regions for which the ratio of read counts in a sample compared to the reference sample can be expressed as a ratio of sequencing depths for the non-differentially bound genomic regions (Stark and Brown 2022a).
Therefore, in RLE on Background Bins, only the reads in these background bins are considered. Again, the mathematical algorithm is the same as in Love, Huber, and Anders (2014). \(s_j\) denotes the size factor for sample j. \(K_{ij}\) now represents the number of reads in background bin i (i.e., a bin on a chromosome that contains a consensus peak) under sample j. m is still the total number of samples across every condition. The denominator, \((\prod_{v=1}^{m} K_{iv})^{1/m}\), still represents a pseudo-reference sample; however, the geometric mean now only takes into account the number of reads in background bin i across all samples.
Quantile normalization is a technique adapted from normalization of microarray data (Kruczyk et al. 2013). Quantile normalization uses the read count matrix of the consensus peaks and then sorts the columns of the matrix such that every row corresponds to a specific quantile of signal magnitude. We then find the average of each row and replace each entry in the sorted consensus peak matrix with its respective row’s average. The matrix is finally unsorted such that the entries are back in their original position in the matrix (Kruczyk et al. 2013).
Note that other measures of center, like medians, can be the replacement value in quantile normalization (Evans, Hardin, and Stoebel 2016). However, our analysis focuses on the mean as the measure of center in quantile normalization.
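A minimal sketch of mean-based quantile normalization is shown below; counts is again a hypothetical consensus-peak read-count matrix (regions by samples).
quantile_normalize <- function(counts) {
  ranks    <- apply(counts, 2, rank, ties.method = "first")  # original position of each value
  sorted   <- apply(counts, 2, sort)                         # sort every column
  row_mean <- rowMeans(sorted)                               # mean at each quantile (row)
  apply(ranks, 2, function(r) row_mean[r])                   # put the row means back in original order
}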
ChIPNorm is a normalization technique developed by Nair et al. (2012). While ChIPNorm can be applied to any ChIP-seq data set, its focus is on locating differentially-bound histone-modification sites.
ChIPNorm applies an iterative normalization process to identify differentially bound sites. First, the ChIP-seq data is compared to the “Amplified Binomial Distribution,” which is the estimated null distribution with parameters \(N_{bino}\) (total number of reads) and \(\alpha\) (amplification factor). ChIP-seq bins with a false discovery rate (FDR) of at most 0.05 are considered significant.
Next, multiple instances of quantile normalization are applied. The first instance of quantile normalization classifies bins with a fold-change > 1 as outliers. The non-outlier bins then undergo a second round of quantile normalization to finalize the normalization function.
This normalization factor is then applied to normalize (via quantiles) all the bins, including the “outlier” ones. A bin is considered enriched if its ChIP-seq data and quantile normalized control data both have fold changes > 1.
If a bin is significant and enriched, then it is called “enriched-significant.” The enriched-significant bins of each library are moved into the “stage two” where they undergo another round of normalization with their original bin counts. Once again, quantile normalization (with a method similar to Bolstad et al. 2003) is used to give the libraries the same probability distribution so they can be directly compared to determine differentially bound bins.
According to Nair et al. (2012), ChIPNorm’s normalization process reduces the effect of bias when comparing data sets with different levels of background noise and/or signal-to-noise ratios. Furthermore, the mathematics of ChIPNorm were developed such that it can be extended to compare more than two data sets, which methods such as quantile normalization are unable to do.
MAnorm is a between-sample normalization technique that utilizes the common peaks across samples to create a rescaling model for normalization (Shao et al. 2012). A common peak refers to a peak that overlaps by at least one nucleotide in the differential binding analysis of two samples (Shao et al. 2012). To create the normalization scaling equation, the MA values for each peak (unique and common) are calculated, where \(R_1\) is the read density at the peak region in sample 1 and \(R_2\) is sample 2’s associated read density.
\[M = \log_2(R_1/R_2)\]
\[A = \frac{1}{2}\log_2(R_1 \cdot R_2)\]
Note that 1 is added to the raw read count of each peak to avoid passing 0 to the log function.
“An iterative re-weighted least squares with a bi-square function” was then applied to the MA values for the common peaks (Shao et al. 2012). This led to a linear model to fit “global dependence” for the common peak’s MA values:
\[M = a + bA\]
The linear model was then applied to all the peaks, including those unique to a sample. Alongside the normalized MA value outputted by MAnorm, a corresponding p-value of each peak is found using a Bayesian model by Audic and Claverie which quantifies the magnitude of differential binding at the peak’s locus on the chromosome (Shao et al. 2012).
\[p(y \mid x) = \frac{(x+y)!}{x!\,y!\,2^{x+y+1}}\] where x and y denote the normalized read counts at a common peak for samples 1 and 2, respectively.
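For intuition, the Audic-Claverie probability above can be evaluated on the log scale to avoid overflowing the factorials; x and y below are hypothetical normalized read counts at one common peak.
audic_claverie <- function(x, y) {
  # p(y | x) = (x + y)! / (x! y! 2^(x + y + 1)) = choose(x + y, x) / 2^(x + y + 1)
  exp(lchoose(x + y, x) - (x + y + 1) * log(2))
}
audic_claverie(10, 30)   # counts this far apart give a small probability (~4e-4)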
While there are many possible LOESS normalization techniques, one possible LOESS (locally weighted smoothing regression) normalization technique for ChIP-Seq was introduced by Taslim et al. (2009). Their LOESS normalization method does not assume that the total number of DNA bindings is similar across conditions nor that the noise is linear. Rather, their LOESS normalization assumes that the difference in means of non-differential reads will be zero (Taslim et al. 2009). Their method involves a two-stage LOESS normalization. In stage one, the fitted values are estimated by a LOESS regression with span 60% on the observed difference in mean counts:
\[\hat{Y}_{mean} = loess[(x_{i2} - x_{i1}) \sim (\frac{x_{i2} + x_{i1}}{2})] \] \(x_{i1}\) and \(x_{i2}\) represent the observed fragment counts in the control and treatment samples. \(\hat{Y}_{mean}\) is then used to find \(D_{meannorm}\), the data after correction with respect to the mean.
\[D_{meannorm} = (x_{i2} - x_{i1}) - \hat{Y}_{mean}\] After the normalization with respect to the mean is calculated, the next step is to conduct a loess regression with span 10% to find the fitted Y-value by regressing the transformed data on the average of the observed fragment counts.
\[\hat{Y}_{var} = loess[|D_{meannorm}|\sim (\frac{x_{i2} + x_{i1}}{2})]\] This then creates \(D_{meanvanorm}\), the data normalized with respect to mean and variance, where:
\[D_{meanvanorm} = \frac{D_{meannorm}}{\hat{Y}_{var}}\]
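Below is a minimal sketch of this two-stage loess normalization using base R’s loess(); x1 and x2 are hypothetical vectors of fragment counts per region in the control and treatment samples, and the spans follow the 60% and 10% values quoted above.
loess_normalize <- function(x1, x2) {
  A <- (x1 + x2) / 2                                   # average observed counts
  D <- x2 - x1                                         # observed difference in counts
  fit_mean   <- loess(D ~ A, span = 0.6)               # stage 1: correct the mean
  D_meannorm <- D - predict(fit_mean)
  fit_var    <- loess(abs(D_meannorm) ~ A, span = 0.1) # stage 2: correct the variance
  D_meannorm / predict(fit_var)                        # data normalized for mean and variance
}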
Parallel factors use a second antibody as an internal control “to normalize the signal,” providing peaks that are unchanged between samples and can serve as a reference (Guertin et al. 2018). To use parallel factors, it must be known a priori that most of the binding sites for the secondary antibody are unaltered between the compared conditions. For example, Guertin et al. (2018) use CTCF (CCCTC-binding factor) as a control antibody for ER (estrogen receptor), given that CTCF affects a small proportion of ER binding sites. The peaks associated with the control antibody in each condition are then compared and a normalization factor is derived using typical normalization techniques. The obtained normalization factor is then applied to the reads associated with the transcription factor or histone modification of interest, so that differentially bound sites can be determined.
Spike-in controls add chromatin from another organism to each sample alongside the chromatin carrying the transcription factor or histone of interest, serving as an external control across the sample conditions (Evans, Hardin, and Stoebel 2016). For spike-in controls, it is important that enough external chromatin is added to support normalization, but not so much that it substantially changes the total tag counts. Similarly, for the external chromatin to serve as a control, it must not interfere with the antibody for the transcription factor or histone of interest (Bonhoure et al. 2014). As with parallel factors, the external control is used to generate a normalization factor, which is then applied to the chromosomal reads associated with the transcription factor or histone of interest (Stark and Brown 2022a).
I owe the following code chunks in large part to Ashby (2021), who supplied me with this code via an email on 05/20/2022. Additionally, I would like to recognize Danae Schulz and her student, Lucy, who collected the Cut-and-Run data used in the real-data analysis that follows.
To see the normalization techniques in action, we applied some common techniques (supported by the DiffBind package) to Cut-and-Run data produced by Danae Schulz and Lucy at Harvey Mudd College.
To replicate the data analysis steps provided below for yourself, please refer to "file_directory_walkthrough.md" in the github repo.
The first step was to read the peaks and supplementary information into a sample sheet.
#####
# The goal of this chunk: to make a sample sheet containing file paths to all relevant files
library(DiffBind)   # differential binding analysis (dba functions)
library(stringr)    # string matching on file names
library(magrittr)   # pipe operator (%>%)

#identify names of sorted spike-in BAM files
spikes<-list.files("data/bam_spikein", ".bam")
spikes<-spikes[str_detect(spikes, "sort")]
spikes<-spikes[!str_detect(spikes, "bam.bai")]
#create sample sheet
sample_sheet<-data.frame(
#SampleID denotes unique identifier for each sample
SampleID= c(paste(1:3, paste(rep(c(0, 1, 3, 24, 76), each=3), "h", sep=""), rep(1:3, 3), sep="_")),
#CONDITION denotes timepoint
Condition= rep(c('0', '1', '3', '24', '76'), each=3),
#Treatment included just for yucks
Treatment= rep("Bromodomain Inhibition", 15),
#Replicate denotes replicate number
Replicate= rep(1:3, 5),
#bamReads denotes paths to BAM files containing reads aligned to T brucei genome
bamReads= paste("reads/", list.files("./reads", pattern="*.bam")[!str_detect(list.files("./reads", pattern="*.bam"), "bam.bai")] %>% head(15), sep=""),
#Spikein denotes file paths to BAM files containing spike-in reads
Spikein=paste("data/bam_spikein/", spikes[c(1,6,11,2,7,12,3,8,13,4,9,14,5,10,15)], sep=""),
#Unique identifier for control sample
ControlID= rep("16_IgG_1", 15),
#bamControl contains the file path to the control BAM file
bamControl= rep(paste("reads/", list.files("./reads", pattern="*.bam")[!str_detect(list.files("./reads", pattern="*.bam"), "bam.bai")] %>% tail(1), sep=""), 15),
#Peaks contains the file paths to the peak files produced by MACS
Peaks= paste("nomodel_broad_Res/", list.files("./nomodel_broad_Res/", pattern="*.broadPeak"), sep="") %>% head(15),
#PeakCaller denotes the file type of the files in `Peaks`
PeakCaller= rep("narrow", 15)
)
#set up DBA object... most of these settings are defaults. Make sure doGreylist=TRUE is set
res_s<- dba(sampleSheet=sample_sheet, config=data.frame(AnalysisMethod=DBA_DESEQ2, th=0.05,
DataType=DBA_DATA_GRANGES, RunParallel=TRUE,
minQCth=15,
bCorPlot=FALSE, reportInit="DBA",
bUsePval=FALSE, design=TRUE,
doBlacklist=FALSE, doGreylist=TRUE))
Below is what the head of the sample sheet generated by the last step of code looks like:
sample_sheet %>%
head()
After generating the sample sheet, Ashby (2021) then made a greylist, which includes all regions with a disproportionately high amount of signal in the control sample. These regions are then excluded from future analysis to avoid biasing the differential binding analysis.
###### Goal of this chunk: greylist suspect regions of the genome with high tag
###### counts in the control sample
library(GreyListChIP)  # GreyList class and greylist-building functions

set.seed(4747)
karyo <- read.table("karyo_file.txt")
# Step 1: create greylist object
gl <- new("GreyList", karyoFile = "./karyo_file.txt")
# Step 2: count the reads from the control file
gl <- countReads(gl, "./reads/sort_16_IgG_1_CR_diff_comb_140804_tb927_v5.1_m1_v2.bam") #where sorted T brucei BAM file is
# Step 3: calculate read count threshold
gl <- calcThreshold(gl, reps = 100, sampleSize = 1000, p = 0.99, cores = 1)
# Step 4: merge the high-count regions (within maxGap bp of each other) into greylist regions
gl <- makeGreyList(gl, maxGap = 10000)
## coverage: 175103 bp (0.66%)
export(gl, con = "greylist.bed")
res_s <- invisible(suppressMessages(dba.blacklist(res_s, greylist = gl@regions)))
Next, Ashby (2021) wrote code to generate a consensus peak set. Please refer to the top of the previous section describing the mathematics behind selected ChIP-Seq normalization techniques for a definition of consensus peaks.
##### The goal of this chunk: generate the consensus peakset for peaks
##### identified in all 3 replicates from each timepoint No need to change
##### paths OUTPUT: writes 'consensus.bed' file to home directory containing
##### the positions of all the consensus peaks
# adds consensus peaks: those that overlap with all three
res_consensus <- dba.peakset(res_s, consensus = c(DBA_CONDITION), minOverlap = 3)
# removes other peaksets, focuses only on consensus
res_consensus <- dba(res_consensus, mask = res_consensus$masks$Consensus, minOverlap = 1)
# creates an object containing the consensus peaks, and writes it out as a bed
# file
consensus_peaks <- dba.peakset(res_consensus, bRetrieve = TRUE, writeFile = "consensus.bed",
numCols = 3)
Below is a dendrogram based on Jaccard distance, which shows the similarity between the consensus peak sets at each recorded time point in the experiment.
##### The goal of this chunk: create dendrogram illustrating Jaccard distance
##### between consensus peaks between timepoints You don't need to change a
##### thing!
# creates list of consensus peaks per TP
ll <- list(res_consensus$called[, 1], res_consensus$called[, 2], res_consensus$called[,
3], res_consensus$called[, 4], res_consensus$called[, 5])
# calculates jaccard similarities between all combinations of timepoints
dist <- unlist(lapply(combn(ll, 2, simplify = FALSE), function(x) {
    sum(x[[1]] == x[[2]])/(sum(x[[1]] != 0) + sum(x[[2]] != 0))
}))
# initialize matrix to store distances
to_make_mat <- cbind(t(combn(5, 2)), dist)
d_mat <- matrix(ncol = 5, nrow = 5)
# make diagonal entries 1
diag(d_mat) <- 1
# loop through rows of the matrix
for (i in 1:dim(to_make_mat)[1]) {
# fill d_mat with jaccard similarities at each particular index
d_mat[as.numeric(to_make_mat[i, 2]), as.numeric(to_make_mat[i, 1])] <- to_make_mat[i,
3]
}
colnames(d_mat) <- c("0hr", "1hr", "3hr", "24hr", "76hr")
rownames(d_mat) <- c("0hr", "1hr", "3hr", "24hr", "76hr")
# convert jaccard similarity to jaccard DISTANCE by taking 1-sim.
d_mat <- as.dist(1 - d_mat)
# use hierarchical clustering with complete linkage and plot
hc <- hclust(d_mat)
plot(hc, main = "Consensus Peaks", sub = "", xlab = "", ylab = "Jaccard Distance")
Next, Ashby (2021) counted the number of reads that align to summit windows within peaks, i.e., fixed-width windows centered on each peak’s summit (the point of highest read pile-up). The counting is performed twice: once storing normalized read counts and once storing RPKM values.
##### The goal of this chunk: count number of reads that align to summit
##### windows within peaks (and greylist the resulting objects) NOTE: I perform
##### this step twice. First with score=DBA_SCORE_NORMALIZED, specifying that
##### normalized reads should be stored in the resulting object, second with
##### score=DBA_SCORE_RPKM, specifying that RPKMs should be stored.
# This code prints a summary of the peak widths in our dataset. We want a
# summits value that lies within [1/2 * min peak width, 1/2 * 1st quartile peak
# width]. summits=200 fits the bill!
summary(res_s$binding[, 3] - res_s$binding[, 2])
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 200.0 402.0 688.5 3246.8 1957.5 37882.0
# count normalized reads and greylist
res_c <- dba.count(res_s, score = DBA_SCORE_NORMALIZED, peaks = consensus_peaks,
summits = 200)
res_c <- invisible(suppressMessages(dba.blacklist(res_c, greylist = gl@regions)))
# count RPKMs and greylist
res_rpkm <- dba.count(res_s, score = DBA_SCORE_RPKM, peaks = consensus_peaks, summits = 200)
res_rpkm <- invisible(suppressMessages(dba.blacklist(res_rpkm, greylist = gl@regions)))
As Ashby (2021) states, “the following code constructs contrasts between 0 hr and subsequent time points within the object containing normalized reads and rpkm reads separately.”
##### The goal of this chunk: specify contrasts between 0 hr and subsequent
##### timepoints NOTE: I perform this step twice. First with on the normalized
##### read count data and second on the rpkm data
res_c <- dba.contrast(res_c, contrast = c("Condition", "1", "0"))
res_c <- dba.contrast(res_c, contrast = c("Condition", "3", "0"))
res_c <- dba.contrast(res_c, contrast = c("Condition", "24", "0"))
res_c <- dba.contrast(res_c, contrast = c("Condition", "76", "0"))
res_rpkm <- dba.contrast(res_rpkm, contrast = c("Condition", "1", "0"))
res_rpkm <- dba.contrast(res_rpkm, contrast = c("Condition", "3", "0"))
res_rpkm <- dba.contrast(res_rpkm, contrast = c("Condition", "24", "0"))
res_rpkm <- dba.contrast(res_rpkm, contrast = c("Condition", "76", "0"))
Then, Ashby (2021) normalized the data using the spike-in library sizes, ran the differential analysis, and wrote the resulting normalized counts to a .csv file.
###### Spike-in lib size
res_lib <- dba.normalize(res_c, spikein = TRUE)
res_lib <- dba.analyze(res_lib)
dba.peakset(res_lib, bRetrieve = TRUE, writeFile = "spikeinlibcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
## seqnames ranges strand | X1_0h_1 X2_0h_2 X3_0h_3
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
## 1 Tb927_01_v5.1 7538-7938 * | 293.759 449.689 1807.758
## 2 Tb927_01_v5.1 46067-46467 * | 287.283 501.620 1682.085
## 3 Tb927_01_v5.1 47790-48190 * | 439.020 525.588 1282.509
## 4 Tb927_01_v5.1 49105-49505 * | 477.417 350.962 610.642
## 5 Tb927_01_v5.1 52601-53001 * | 1191.691 805.216 2297.560
## ... ... ... ... . ... ... ...
## 495 Tb927_11_v5.1 4828256-4828656 * | 160.527 344.685 1330.845
## 496 Tb927_11_v5.1 4831133-4831533 * | 141.560 223.703 918.380
## 497 Tb927_11_v5.1 4833088-4833488 * | 265.540 240.252 718.592
## 498 Tb927_11_v5.1 4835964-4836364 * | 154.975 291.042 1071.443
## 499 Tb927_11_v5.1 4841270-4841670 * | 344.184 430.286 1725.587
## X1_1h_1 X2_1h_2 X3_1h_3 X1_3h_1 X2_3h_2 X3_3h_3 X1_24h_1
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 849.837 527.300 1229.037 1180.024 1421.388 839.623 592.209
## 2 831.100 588.194 1143.595 847.805 861.891 974.356 792.958
## 3 1270.071 616.299 871.937 1380.650 1422.720 1127.164 989.246
## 4 1381.152 411.535 415.156 1103.441 1099.011 1229.037 828.647
## 5 3447.526 944.189 1562.037 3875.526 4177.574 5990.731 3440.612
## ... ... ... ... ... ... ... ...
## 495 464.399 404.174 904.799 618.056 651.414 483.070 430.495
## 496 409.528 262.312 624.377 575.990 598.128 532.363 388.114
## 497 768.199 281.718 488.547 1188.653 1330.803 1501.791 788.497
## 498 448.339 341.273 728.440 1388.200 1542.612 1610.235 692.583
## 499 995.714 504.549 1173.171 2067.738 2527.059 2906.638 1853.581
## X2_24h_2 X3_24h_3 X1_76h_1 X2_76h_2 X3_76h_3
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 380.299 1269.326 397.037 417.356 383.932
## 2 622.938 1633.624 658.010 685.825 516.092
## 3 584.314 1301.998 623.437 670.584 780.411
## 4 774.464 668.152 347.965 434.942 498.526
## 5 3876.282 2896.415 3421.652 3691.729 3370.907
## ... ... ... ... ... ...
## 495 281.263 1471.895 428.264 459.561 338.764
## 496 249.572 883.790 274.357 255.572 234.207
## 497 845.770 668.152 481.797 562.728 523.620
## 498 843.789 1323.235 1051.701 1145.386 841.472
## 499 2010.437 1870.499 1566.956 1672.943 1291.484
## -------
## seqinfo: 11 sequences from an unspecified genome; no seqlengths
Continuing the normalization analysis that Ashby (2021) conducted, I examined the following normalization techniques on the Cut-and-Run data:
1. Library Size
1a. Full
1b. Background Bins
1c. Reads in Peaks
1d. Spike-ins
2. Relative Log Expression (RLE)
2a. Full
2b. Background Bins
2c. Reads in Peaks
2d. Spike-ins
3. Reads per Kilobase per Million Mapped Reads (RPKM)
The following code generates a PCA plot for normalization using spike-ins with library size. Ideally, we want samples from the same time point (the same color) to cluster together and the first two principal components to explain a high proportion of the variability.
# retrieve list of differentially bound sites (padj<0.05)
spikeinlibsize <- rbind(res_lib$contrasts[[1]]$DESeq2$de, res_lib$contrasts[[2]]$DESeq2$de,
res_lib$contrasts[[3]]$DESeq2$de, res_lib$contrasts[[4]]$DESeq2$de) %>%
dplyr::filter(padj < 0.05) %>%
.$id %>%
unique()
# Generate PCA plot
p <- dba.plotPCA(res_lib, label = DBA_REPLICATE, vColors = c("navy", "magenta", "cyan",
"red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
sub = "Spike-in lib. size")
PCA plots were then generated for the other normalization methods, each denoted by the bolded title below the x-axis of its plot.
###### Spike-in RLE
res_rle <- dba.normalize(res_c, normalize = DBA_NORM_RLE, spikein = TRUE)
res_rle <- dba.analyze(res_rle)
dba.peakset(res_rle, bRetrieve = TRUE, writeFile = "spikeinRLEcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
## seqnames ranges strand | X1_0h_1 X2_0h_2 X3_0h_3
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
## 1 Tb927_01_v5.1 7538-7938 * | 293.759 449.689 1807.758
## 2 Tb927_01_v5.1 46067-46467 * | 287.283 501.620 1682.085
## 3 Tb927_01_v5.1 47790-48190 * | 439.020 525.588 1282.509
## 4 Tb927_01_v5.1 49105-49505 * | 477.417 350.962 610.642
## 5 Tb927_01_v5.1 52601-53001 * | 1191.691 805.216 2297.560
## ... ... ... ... . ... ... ...
## 495 Tb927_11_v5.1 4828256-4828656 * | 160.527 344.685 1330.845
## 496 Tb927_11_v5.1 4831133-4831533 * | 141.560 223.703 918.380
## 497 Tb927_11_v5.1 4833088-4833488 * | 265.540 240.252 718.592
## 498 Tb927_11_v5.1 4835964-4836364 * | 154.975 291.042 1071.443
## 499 Tb927_11_v5.1 4841270-4841670 * | 344.184 430.286 1725.587
## X1_1h_1 X2_1h_2 X3_1h_3 X1_3h_1 X2_3h_2 X3_3h_3 X1_24h_1
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 849.837 527.300 1229.037 1180.024 1421.388 839.623 592.209
## 2 831.100 588.194 1143.595 847.805 861.891 974.356 792.958
## 3 1270.071 616.299 871.937 1380.650 1422.720 1127.164 989.246
## 4 1381.152 411.535 415.156 1103.441 1099.011 1229.037 828.647
## 5 3447.526 944.189 1562.037 3875.526 4177.574 5990.731 3440.612
## ... ... ... ... ... ... ... ...
## 495 464.399 404.174 904.799 618.056 651.414 483.070 430.495
## 496 409.528 262.312 624.377 575.990 598.128 532.363 388.114
## 497 768.199 281.718 488.547 1188.653 1330.803 1501.791 788.497
## 498 448.339 341.273 728.440 1388.200 1542.612 1610.235 692.583
## 499 995.714 504.549 1173.171 2067.738 2527.059 2906.638 1853.581
## X2_24h_2 X3_24h_3 X1_76h_1 X2_76h_2 X3_76h_3
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 380.299 1269.326 397.037 417.356 383.932
## 2 622.938 1633.624 658.010 685.825 516.092
## 3 584.314 1301.998 623.437 670.584 780.411
## 4 774.464 668.152 347.965 434.942 498.526
## 5 3876.282 2896.415 3421.652 3691.729 3370.907
## ... ... ... ... ... ...
## 495 281.263 1471.895 428.264 459.561 338.764
## 496 249.572 883.790 274.357 255.572 234.207
## 497 845.770 668.152 481.797 562.728 523.620
## 498 843.789 1323.235 1051.701 1145.386 841.472
## 499 2010.437 1870.499 1566.956 1672.943 1291.484
## -------
## seqinfo: 11 sequences from an unspecified genome; no seqlengths
# retrieve list of differentially bound sites (padj<0.05)
spikeinrle <- rbind(res_rle$contrasts[[1]]$DESeq2$de, res_rle$contrasts[[2]]$DESeq2$de,
res_rle$contrasts[[3]]$DESeq2$de, res_rle$contrasts[[4]]$DESeq2$de) %>%
dplyr::filter(padj < 0.05) %>%
.$id %>%
unique()
# generate PCA plot
q <- dba.plotPCA(res_rle, label = DBA_REPLICATE, vColors = c("navy", "magenta", "cyan",
"red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
sub = "RLE spike-in")
###### Background RLE
res_background <- dba.normalize(res_c, normalize = DBA_NORM_RLE, background = TRUE,
spikein = FALSE)
res_background <- dba.analyze(res_background)
dba.peakset(res_background, bRetrieve = TRUE, writeFile = "backgroundRLEcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
## seqnames ranges strand | X1_0h_1 X2_0h_2 X3_0h_3
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
## 1 Tb927_01_v5.1 7538-7938 * | 746.392 672.272 583.236
## 2 Tb927_01_v5.1 46067-46467 * | 729.936 749.907 542.690
## 3 Tb927_01_v5.1 47790-48190 * | 1115.474 785.739 413.775
## 4 Tb927_01_v5.1 49105-49505 * | 1213.033 524.679 197.011
## 5 Tb927_01_v5.1 52601-53001 * | 3027.882 1203.776 741.261
## ... ... ... ... . ... ... ...
## 495 Tb927_11_v5.1 4828256-4828656 * | 407.871 515.295 429.370
## 496 Tb927_11_v5.1 4831133-4831533 * | 359.679 334.430 296.296
## 497 Tb927_11_v5.1 4833088-4833488 * | 674.691 359.171 231.839
## 498 Tb927_11_v5.1 4835964-4836364 * | 393.766 435.100 345.679
## 499 Tb927_11_v5.1 4841270-4841670 * | 874.512 643.265 556.725
## X1_1h_1 X2_1h_2 X3_1h_3 X1_3h_1 X2_3h_2 X3_3h_3 X1_24h_1
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 746.392 672.272 583.236 1546.69 1799.16 726.660 728.727
## 2 729.936 749.907 542.690 1111.24 1090.96 843.267 975.753
## 3 1115.474 785.739 413.775 1809.65 1800.84 975.516 1217.290
## 4 1213.033 524.679 197.011 1446.31 1391.10 1063.682 1019.669
## 5 3027.882 1203.776 741.261 5079.75 5287.86 5184.741 4233.753
## ... ... ... ... ... ... ... ...
## 495 407.871 515.295 429.370 810.102 824.543 418.078 529.734
## 496 359.679 334.430 296.296 754.964 757.095 460.739 477.584
## 497 674.691 359.171 231.839 1557.997 1684.495 1299.740 970.264
## 498 393.766 435.100 345.679 1819.548 1952.598 1393.595 852.240
## 499 874.512 643.265 556.725 2710.235 3198.686 2515.581 2280.874
## X2_24h_2 X3_24h_3 X1_76h_1 X2_76h_2 X3_76h_3
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 497.554 399.523 398.829 379.656 501.203
## 2 815.005 514.186 660.981 623.873 673.730
## 3 764.472 409.806 626.251 610.009 1018.784
## 4 1013.249 210.302 349.535 395.653 650.799
## 5 5071.427 911.652 3437.099 3358.248 4400.537
## ... ... ... ... ... ...
## 495 367.983 463.282 430.198 418.048 442.238
## 496 326.520 278.175 275.595 232.486 305.744
## 497 1106.540 210.302 483.972 511.896 683.557
## 498 1103.949 416.491 1056.449 1041.921 1098.496
## 499 2630.301 588.743 1574.030 1521.823 1685.962
## -------
## seqinfo: 11 sequences from an unspecified genome; no seqlengths
# get list of db sites
background <- rbind(res_background$contrasts[[1]]$DESeq2$de, res_background$contrasts[[2]]$DESeq2$de,
res_background$contrasts[[3]]$DESeq2$de, res_background$contrasts[[4]]$DESeq2$de) %>%
dplyr::filter(padj < 0.05) %>%
.$id %>%
unique()
# generate pca plot
r <- dba.plotPCA(res_background, label = DBA_REPLICATE, vColors = c("navy", "magenta",
"cyan", "red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
sub = "RLE background")
###### RPKM 'Normalization' was already done in counting step
res_rpkm <- dba.analyze(res_rpkm)
dba.peakset(res_rpkm, bRetrieve = TRUE, writeFile = "rpkmcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
## seqnames ranges strand | X1_0h_1 X2_0h_2 X3_0h_3
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
## 1 Tb927_01_v5.1 7538-7938 * | 157.615 151.126 143.1016
## 2 Tb927_01_v5.1 46067-46467 * | 154.140 168.578 133.1533
## 3 Tb927_01_v5.1 47790-48190 * | 235.554 176.633 101.5230
## 4 Tb927_01_v5.1 49105-49505 * | 256.156 117.947 48.3382
## 5 Tb927_01_v5.1 52601-53001 * | 639.397 270.608 181.8742
## ... ... ... ... . ... ... ...
## 495 Tb927_11_v5.1 4828256-4828656 * | 86.1299 115.8378 105.3493
## 496 Tb927_11_v5.1 4831133-4831533 * | 75.9532 75.1795 72.6986
## 497 Tb927_11_v5.1 4833088-4833488 * | 142.4743 80.7412 56.8835
## 498 Tb927_11_v5.1 4835964-4836364 * | 83.1514 97.8100 84.8151
## 499 Tb927_11_v5.1 4841270-4841670 * | 184.6705 144.6054 136.5969
## X1_1h_1 X2_1h_2 X3_1h_3 X1_3h_1 X2_3h_2 X3_3h_3 X1_24h_1
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 157.615 151.126 143.1016 169.823 187.087 93.0972 84.6144
## 2 154.140 168.578 133.1533 122.012 113.444 108.0364 113.2972
## 3 235.554 176.633 101.5230 198.696 187.262 124.9798 141.3427
## 4 256.156 117.947 48.3382 158.801 144.655 136.2753 118.3964
## 5 639.397 270.608 181.8742 557.746 549.863 664.2510 491.5921
## ... ... ... ... ... ... ... ...
## 495 86.1299 115.8378 105.3493 88.9474 85.7407 53.5628 61.5088
## 496 75.9532 75.1795 72.6986 82.8934 78.7271 59.0283 55.4535
## 497 142.4743 80.7412 56.8835 171.0647 175.1635 166.5182 112.6598
## 498 83.1514 97.8100 84.8151 199.7825 203.0424 178.5425 98.9558
## 499 184.6705 144.6054 136.5969 297.5781 332.6178 322.2875 264.8383
## X2_24h_2 X3_24h_3 X1_76h_1 X2_76h_2 X3_76h_3
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 64.6001 92.4897 59.1065 56.5911 76.2605
## 2 105.8164 119.0344 97.9574 92.9937 102.5115
## 3 99.2554 94.8704 92.8105 90.9272 155.0133
## 4 131.5555 48.6851 51.8012 58.9755 99.0224
## 5 658.4505 211.0480 509.3787 500.5765 669.5643
## ... ... ... ... ... ...
## 495 47.7772 107.2500 63.7554 62.3137 67.2887
## 496 42.3938 64.3976 40.8433 34.6541 46.5206
## 497 143.6680 48.6851 71.7248 76.3025 104.0068
## 498 143.3316 96.4178 156.5659 155.3075 167.1419
## 499 341.5060 136.2944 233.2715 226.8411 256.5279
## -------
## seqinfo: 11 sequences from an unspecified genome; no seqlengths
# generate list of db sites
rpkm <- rbind(res_rpkm$contrasts[[1]]$DESeq2$de, res_rpkm$contrasts[[2]]$DESeq2$de,
res_rpkm$contrasts[[3]]$DESeq2$de, res_rpkm$contrasts[[4]]$DESeq2$de) %>%
dplyr::filter(padj < 0.05) %>%
.$id %>%
unique()
# generate pca plot
s <- dba.plotPCA(res_rpkm, label = DBA_REPLICATE, vColors = c("navy", "magenta",
"cyan", "red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_RPKM,
sub = "RPKM")
## library size with background bins, code based on
## https://bioconductor.org/packages/devel/bioc/manuals/DiffBind/man/DiffBind.pdf
res_lib_background <- dba.normalize(res_c, normalize = DBA_NORM_LIB, background = TRUE,
spikein = FALSE)
res_lib_background <- dba.analyze(res_lib_background)
dba.peakset(res_lib_background, bRetrieve = TRUE, writeFile = "backgroundlibcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
## seqnames ranges strand | X1_0h_1 X2_0h_2 X3_0h_3
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
## 1 Tb927_01_v5.1 7538-7938 * | 953.621 914.359 865.808
## 2 Tb927_01_v5.1 46067-46467 * | 932.596 1019.952 805.619
## 3 Tb927_01_v5.1 47790-48190 * | 1425.175 1068.686 614.246
## 4 Tb927_01_v5.1 49105-49505 * | 1549.822 713.618 292.461
## 5 Tb927_01_v5.1 52601-53001 * | 3868.548 1637.260 1100.395
## ... ... ... ... . ... ... ...
## 495 Tb927_11_v5.1 4828256-4828656 * | 521.113 700.854 637.395
## 496 Tb927_11_v5.1 4831133-4831533 * | 459.540 454.859 439.849
## 497 Tb927_11_v5.1 4833088-4833488 * | 862.013 488.509 344.163
## 498 Tb927_11_v5.1 4835964-4836364 * | 503.091 591.781 513.157
## 499 Tb927_11_v5.1 4841270-4841670 * | 1117.314 874.907 826.453
## X1_1h_1 X2_1h_2 X3_1h_3 X1_3h_1 X2_3h_2 X3_3h_3 X1_24h_1
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 953.621 914.359 865.808 1027.481 1131.931 563.266 511.943
## 2 932.596 1019.952 805.619 738.208 686.373 653.654 685.483
## 3 1425.175 1068.686 614.246 1202.171 1132.992 756.166 855.167
## 4 1549.822 713.618 292.461 960.798 875.205 824.507 716.335
## 5 3868.548 1637.260 1100.395 3374.532 3326.838 4018.923 2974.283
## ... ... ... ... ... ... ... ...
## 495 521.113 700.854 637.395 538.159 518.758 324.071 372.147
## 496 459.540 454.859 439.849 501.531 476.323 357.140 335.511
## 497 862.013 488.509 344.163 1034.994 1059.793 1007.486 681.627
## 498 503.091 591.781 513.157 1208.745 1228.469 1080.237 598.713
## 499 1117.314 874.907 826.453 1800.439 2012.440 1949.938 1602.353
## X2_24h_2 X3_24h_3 X1_76h_1 X2_76h_2 X3_76h_3
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 390.851 559.591 357.613 342.393 461.400
## 2 640.222 720.195 592.673 562.641 620.226
## 3 600.526 573.995 561.532 550.137 937.878
## 4 795.951 294.560 313.413 356.820 599.116
## 5 3983.828 1276.905 3081.898 3028.642 4051.070
## ... ... ... ... ... ...
## 495 289.067 648.895 385.739 377.017 407.117
## 496 256.496 389.625 247.114 209.668 281.464
## 497 869.236 294.560 433.957 461.654 629.273
## 498 867.200 583.358 947.272 939.658 1011.260
## 499 2066.216 824.623 1411.365 1372.458 1552.072
## -------
## seqinfo: 11 sequences from an unspecified genome; no seqlengths
background_lib <- rbind(res_lib_background$contrasts[[1]]$DESeq2$de, res_lib_background$contrasts[[2]]$DESeq2$de,
res_lib_background$contrasts[[3]]$DESeq2$de, res_lib_background$contrasts[[4]]$DESeq2$de) %>%
dplyr::filter(padj < 0.05) %>%
.$id %>%
unique()
t <- dba.plotPCA(res_lib_background, label = DBA_REPLICATE, vColors = c("navy", "magenta",
"cyan", "red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
sub = "Library Size Background")
## total RLE
res_RLE <- dba.normalize(res_c, normalize = DBA_NORM_RLE, background = FALSE, spikein = FALSE)
res_RLE <- dba.analyze(res_RLE)
dba.peakset(res_RLE, bRetrieve = TRUE, writeFile = "fullRLEcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
## seqnames ranges strand | X1_0h_1 X2_0h_2 X3_0h_3
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
## 1 Tb927_01_v5.1 7538-7938 * | 1096.77 1200.759 1385.132
## 2 Tb927_01_v5.1 46067-46467 * | 1072.59 1339.425 1288.839
## 3 Tb927_01_v5.1 47790-48190 * | 1639.11 1403.425 982.678
## 4 Tb927_01_v5.1 49105-49505 * | 1782.47 937.141 467.883
## 5 Tb927_01_v5.1 52601-53001 * | 4449.27 2150.090 1760.426
## ... ... ... ... . ... ... ...
## 495 Tb927_11_v5.1 4828256-4828656 * | 599.338 920.379 1019.714
## 496 Tb927_11_v5.1 4831133-4831533 * | 528.523 597.332 703.677
## 497 Tb927_11_v5.1 4833088-4833488 * | 991.413 641.522 550.596
## 498 Tb927_11_v5.1 4835964-4836364 * | 578.612 777.141 820.956
## 499 Tb927_11_v5.1 4841270-4841670 * | 1285.037 1148.950 1322.172
## X1_1h_1 X2_1h_2 X3_1h_3 X1_3h_1 X2_3h_2 X3_3h_3 X1_24h_1
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 1096.77 1200.759 1385.132 687.443 735.043 382.562 479.747
## 2 1072.59 1339.425 1288.839 493.904 445.710 443.951 642.374
## 3 1639.11 1403.425 982.678 804.321 735.732 513.576 801.386
## 4 1782.47 937.141 467.883 642.829 568.332 559.993 671.285
## 5 4449.27 2150.090 1760.426 2257.755 2160.352 2729.589 2787.233
## ... ... ... ... ... ... ... ...
## 495 599.338 920.379 1019.714 360.059 336.866 220.104 348.743
## 496 528.523 597.332 703.677 335.553 309.311 242.564 314.411
## 497 991.413 641.522 550.596 692.470 688.199 684.269 638.760
## 498 578.612 777.141 820.956 808.720 797.732 733.680 561.061
## 499 1285.037 1148.950 1322.172 1204.597 1306.820 1324.367 1501.582
## X2_24h_2 X3_24h_3 X1_76h_1 X2_76h_2 X3_76h_3
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 278.975 791.943 289.201 276.936 365.574
## 2 456.967 1019.232 479.294 455.078 491.414
## 3 428.633 812.328 454.111 444.965 743.094
## 4 568.121 416.866 253.457 288.605 474.688
## 5 2843.509 1807.098 2492.331 2449.643 3209.721
## ... ... ... ... ... ...
## 495 206.325 918.328 311.948 304.941 322.565
## 496 183.077 551.405 199.841 169.585 223.008
## 497 620.428 416.866 350.941 373.397 498.582
## 498 618.975 825.578 766.059 760.019 801.236
## 499 1474.789 1167.021 1141.371 1110.080 1229.729
## -------
## seqinfo: 11 sequences from an unspecified genome; no seqlengths
# get list of db sites
RLE_full <- rbind(res_RLE$contrasts[[1]]$DESeq2$de, res_RLE$contrasts[[2]]$DESeq2$de,
res_RLE$contrasts[[3]]$DESeq2$de, res_RLE$contrasts[[4]]$DESeq2$de) %>%
dplyr::filter(padj < 0.05) %>%
.$id %>%
unique()
# generate pca plot
u <- dba.plotPCA(res_RLE, label = DBA_REPLICATE, vColors = c("navy", "magenta", "cyan",
"red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
sub = "RLE full")
## library size full, looks similar to lib background
## not sure diff between library=DBA_LIBSIZE_FULL and no library command
res_lib <- dba.normalize(res_c, normalize = DBA_NORM_LIB, background = FALSE, spikein = FALSE)
res_lib <- dba.analyze(res_lib)
dba.peakset(res_lib, bRetrieve = TRUE, writeFile = "fulllibcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
## seqnames ranges strand | X1_0h_1 X2_0h_2 X3_0h_3
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
## 1 Tb927_01_v5.1 7538-7938 * | 953.621 914.359 865.808
## 2 Tb927_01_v5.1 46067-46467 * | 932.596 1019.952 805.619
## 3 Tb927_01_v5.1 47790-48190 * | 1425.175 1068.686 614.246
## 4 Tb927_01_v5.1 49105-49505 * | 1549.822 713.618 292.461
## 5 Tb927_01_v5.1 52601-53001 * | 3868.548 1637.260 1100.395
## ... ... ... ... . ... ... ...
## 495 Tb927_11_v5.1 4828256-4828656 * | 521.113 700.854 637.395
## 496 Tb927_11_v5.1 4831133-4831533 * | 459.540 454.859 439.849
## 497 Tb927_11_v5.1 4833088-4833488 * | 862.013 488.509 344.163
## 498 Tb927_11_v5.1 4835964-4836364 * | 503.091 591.781 513.157
## 499 Tb927_11_v5.1 4841270-4841670 * | 1117.314 874.907 826.453
## X1_1h_1 X2_1h_2 X3_1h_3 X1_3h_1 X2_3h_2 X3_3h_3 X1_24h_1
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 953.621 914.359 865.808 1027.481 1131.931 563.266 511.943
## 2 932.596 1019.952 805.619 738.208 686.373 653.654 685.483
## 3 1425.175 1068.686 614.246 1202.171 1132.992 756.166 855.167
## 4 1549.822 713.618 292.461 960.798 875.205 824.507 716.335
## 5 3868.548 1637.260 1100.395 3374.532 3326.838 4018.923 2974.283
## ... ... ... ... ... ... ... ...
## 495 521.113 700.854 637.395 538.159 518.758 324.071 372.147
## 496 459.540 454.859 439.849 501.531 476.323 357.140 335.511
## 497 862.013 488.509 344.163 1034.994 1059.793 1007.486 681.627
## 498 503.091 591.781 513.157 1208.745 1228.469 1080.237 598.713
## 499 1117.314 874.907 826.453 1800.439 2012.440 1949.938 1602.353
## X2_24h_2 X3_24h_3 X1_76h_1 X2_76h_2 X3_76h_3
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 390.851 559.591 357.613 342.393 461.400
## 2 640.222 720.195 592.673 562.641 620.226
## 3 600.526 573.995 561.532 550.137 937.878
## 4 795.951 294.560 313.413 356.820 599.116
## 5 3983.828 1276.905 3081.898 3028.642 4051.070
## ... ... ... ... ... ...
## 495 289.067 648.895 385.739 377.017 407.117
## 496 256.496 389.625 247.114 209.668 281.464
## 497 869.236 294.560 433.957 461.654 629.273
## 498 867.200 583.358 947.272 939.658 1011.260
## 499 2066.216 824.623 1411.365 1372.458 1552.072
## -------
## seqinfo: 11 sequences from an unspecified genome; no seqlengths
lib_full <- rbind(res_lib$contrasts[[1]]$DESeq2$de, res_lib$contrasts[[2]]$DESeq2$de,
res_lib$contrasts[[3]]$DESeq2$de, res_lib$contrasts[[4]]$DESeq2$de) %>%
dplyr::filter(padj < 0.05) %>%
.$id %>%
unique()
v <- dba.plotPCA(res_lib, label = DBA_REPLICATE, vColors = c("navy", "magenta", "cyan",
"red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
sub = "Library Size Full")
# RLE in peaks
res_RLE_peaks <- dba.normalize(res_c, normalize = DBA_NORM_RLE, library = DBA_LIBSIZE_PEAKREADS,
background = FALSE, spikein = FALSE)
res_RLE_peaks <- dba.analyze(res_RLE_peaks)
dba.peakset(res_RLE_peaks, bRetrieve = TRUE, writeFile = "peakRLEcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
## seqnames ranges strand | X1_0h_1 X2_0h_2 X3_0h_3
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
## 1 Tb927_01_v5.1 7538-7938 * | 1096.77 1200.759 1385.132
## 2 Tb927_01_v5.1 46067-46467 * | 1072.59 1339.425 1288.839
## 3 Tb927_01_v5.1 47790-48190 * | 1639.11 1403.425 982.678
## 4 Tb927_01_v5.1 49105-49505 * | 1782.47 937.141 467.883
## 5 Tb927_01_v5.1 52601-53001 * | 4449.27 2150.090 1760.426
## ... ... ... ... . ... ... ...
## 495 Tb927_11_v5.1 4828256-4828656 * | 599.338 920.379 1019.714
## 496 Tb927_11_v5.1 4831133-4831533 * | 528.523 597.332 703.677
## 497 Tb927_11_v5.1 4833088-4833488 * | 991.413 641.522 550.596
## 498 Tb927_11_v5.1 4835964-4836364 * | 578.612 777.141 820.956
## 499 Tb927_11_v5.1 4841270-4841670 * | 1285.037 1148.950 1322.172
## X1_1h_1 X2_1h_2 X3_1h_3 X1_3h_1 X2_3h_2 X3_3h_3 X1_24h_1
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 1096.77 1200.759 1385.132 687.443 735.043 382.562 479.747
## 2 1072.59 1339.425 1288.839 493.904 445.710 443.951 642.374
## 3 1639.11 1403.425 982.678 804.321 735.732 513.576 801.386
## 4 1782.47 937.141 467.883 642.829 568.332 559.993 671.285
## 5 4449.27 2150.090 1760.426 2257.755 2160.352 2729.589 2787.233
## ... ... ... ... ... ... ... ...
## 495 599.338 920.379 1019.714 360.059 336.866 220.104 348.743
## 496 528.523 597.332 703.677 335.553 309.311 242.564 314.411
## 497 991.413 641.522 550.596 692.470 688.199 684.269 638.760
## 498 578.612 777.141 820.956 808.720 797.732 733.680 561.061
## 499 1285.037 1148.950 1322.172 1204.597 1306.820 1324.367 1501.582
## X2_24h_2 X3_24h_3 X1_76h_1 X2_76h_2 X3_76h_3
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 278.975 791.943 289.201 276.936 365.574
## 2 456.967 1019.232 479.294 455.078 491.414
## 3 428.633 812.328 454.111 444.965 743.094
## 4 568.121 416.866 253.457 288.605 474.688
## 5 2843.509 1807.098 2492.331 2449.643 3209.721
## ... ... ... ... ... ...
## 495 206.325 918.328 311.948 304.941 322.565
## 496 183.077 551.405 199.841 169.585 223.008
## 497 620.428 416.866 350.941 373.397 498.582
## 498 618.975 825.578 766.059 760.019 801.236
## 499 1474.789 1167.021 1141.371 1110.080 1229.729
## -------
## seqinfo: 11 sequences from an unspecified genome; no seqlengths
RLE_peaks <- rbind(res_RLE_peaks$contrasts[[1]]$DESeq2$de, res_RLE_peaks$contrasts[[2]]$DESeq2$de,
res_RLE_peaks$contrasts[[3]]$DESeq2$de, res_RLE_peaks$contrasts[[4]]$DESeq2$de) %>%
dplyr::filter(padj < 0.05) %>%
.$id %>%
unique()
w <- dba.plotPCA(res_RLE_peaks, label = DBA_REPLICATE, vColors = c("navy", "magenta",
"cyan", "red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
sub = "RLE Reads in Peaks")
## library size reads in peaks
res_lib_peaks <- dba.normalize(res_c, normalize = DBA_NORM_LIB, library = DBA_LIBSIZE_PEAKREADS,
background = FALSE, spikein = FALSE)
res_lib_peaks <- dba.analyze(res_lib_peaks)
dba.peakset(res_lib_peaks, bRetrieve = TRUE, writeFile = "peaklibcounts.csv")
## GRanges object with 499 ranges and 15 metadata columns:
## seqnames ranges strand | X1_0h_1 X2_0h_2 X3_0h_3
## <Rle> <IRanges> <Rle> | <numeric> <numeric> <numeric>
## 1 Tb927_01_v5.1 7538-7938 * | 1097.45 1287.78 1493.182
## 2 Tb927_01_v5.1 46067-46467 * | 1073.26 1436.49 1389.378
## 3 Tb927_01_v5.1 47790-48190 * | 1640.13 1505.13 1059.334
## 4 Tb927_01_v5.1 49105-49505 * | 1783.57 1005.05 504.382
## 5 Tb927_01_v5.1 52601-53001 * | 4452.02 2305.90 1897.752
## ... ... ... ... . ... ... ...
## 495 Tb927_11_v5.1 4828256-4828656 * | 599.710 987.077 1099.259
## 496 Tb927_11_v5.1 4831133-4831533 * | 528.851 640.619 758.569
## 497 Tb927_11_v5.1 4833088-4833488 * | 992.027 688.012 593.547
## 498 Tb927_11_v5.1 4835964-4836364 * | 578.971 833.459 884.997
## 499 Tb927_11_v5.1 4841270-4841670 * | 1285.833 1232.212 1425.310
## X1_1h_1 X2_1h_2 X3_1h_3 X1_3h_1 X2_3h_2 X3_3h_3 X1_24h_1
## <numeric> <numeric> <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 1097.45 1287.78 1493.182 751.197 798.939 413.939 502.809
## 2 1073.26 1436.49 1389.378 539.708 484.455 480.363 673.252
## 3 1640.13 1505.13 1059.334 878.914 799.688 555.699 839.908
## 4 1783.57 1005.05 504.382 702.444 617.737 605.922 703.553
## 5 4452.02 2305.90 1897.752 2467.139 2348.148 2953.465 2921.214
## ... ... ... ... ... ... ... ...
## 495 599.710 987.077 1099.259 393.451 366.149 238.157 365.507
## 496 528.851 640.619 758.569 366.672 336.199 262.458 329.524
## 497 992.027 688.012 593.547 756.690 748.023 740.391 669.465
## 498 578.971 833.459 884.997 883.720 867.078 793.855 588.030
## 499 1285.833 1232.212 1425.310 1316.311 1420.420 1432.989 1573.762
## X2_24h_2 X3_24h_3 X1_76h_1 X2_76h_2 X3_76h_3
## <numeric> <numeric> <numeric> <numeric> <numeric>
## 1 290.605 865.928 284.315 274.116 363.082
## 2 476.017 1114.450 471.196 450.444 488.064
## 3 446.502 888.217 446.439 440.434 738.029
## 4 591.805 455.810 249.175 285.666 471.453
## 5 2962.050 1975.920 2450.221 2424.697 3187.844
## ... ... ... ... ... ...
## 495 214.926 1004.119 306.677 301.836 320.366
## 496 190.709 602.917 196.465 167.858 221.488
## 497 646.293 455.810 345.012 369.595 495.184
## 498 644.779 902.705 753.115 752.280 795.775
## 499 1536.270 1276.045 1122.086 1098.775 1221.348
## -------
## seqinfo: 11 sequences from an unspecified genome; no seqlengths
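# significant sites (padj < 0.05) across all contrasts for library size on reads in peaks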
lib_peaks <- rbind(res_lib_peaks$contrasts[[1]]$DESeq2$de, res_lib_peaks$contrasts[[2]]$DESeq2$de,
res_lib_peaks$contrasts[[3]]$DESeq2$de, res_lib_peaks$contrasts[[4]]$DESeq2$de) %>%
dplyr::filter(padj < 0.05) %>%
.$id %>%
unique()
y <- dba.plotPCA(res_lib_peaks, label = DBA_REPLICATE, vColors = c("navy", "magenta",
"cyan", "red", "green"), labelSize = 0.7, alpha = 0.75, score = DBA_SCORE_NORMALIZED,
sub = "LIbrary Size Reads in Peaks")
## jaccard distance plot
# list of peak calls per method
ll_2 <- list(spikeinlibsize, spikeinrle, background, rpkm, lib_peaks, RLE_peaks,
RLE_full, lib_full, background_lib)
# enumerate combinations of peak calling methods and calculate jaccard
# similarities
dist <- unlist(lapply(combn(ll_2, 2, simplify = FALSE), function(x) {
length(intersect(x[[1]], x[[2]]))/length(union(x[[1]], x[[2]]))
}))
# initialize matrix to store results
to_make_mat <- cbind(t(combn(9, 2)), dist)
d_mat <- matrix(ncol = 9, nrow = 9)
diag(d_mat) <- 1
# add jaccard similarities to matrix
for (i in 1:dim(to_make_mat)[1]) {
d_mat[as.numeric(to_make_mat[i, 2]), as.numeric(to_make_mat[i, 1])] <- to_make_mat[i,
3]
}
colnames(d_mat) <- c("Spike-in\nLib. Size", "Spike-in\nRLE", "Background\nRLE", "RPKM",
"Library Peak", "RLE Peak", "RLE Full", "Library Full", "Background\nLibrary")
rownames(d_mat) <- c("Spike-in\nLib. Size", "Spike-in\nRLE", "Background\nRLE", "RPKM",
"Library Peak", "RLE Peak", "RLE Full", "Library Full", "Background\nLibrary")
# convert jaccard similarity to jaccard distance
d_mat <- as.dist(1 - d_mat)
hc <- hclust(d_mat)
plot(hc, main = "", sub = "", xlab = "", ylab = "Jaccard Distance")
In their paper, Selecting between-sample RNA-Seq normalization methods from the perspective of their assumptions, Evans, Hardin, and Stoebel (2016) discuss two main biological assumptions made by RNA-Seq normalization techniques:
Symmetry in Differential Expression
Same total amount of mRNA/cell
Figure 1: Assumptions Diagram
As illustrated in the diagram above, symmetry concerns the number of genes up-regulated between two conditions. For instance, cond. A in (a) has three up-regulated genes whereas cond. B has only one, so the differential expression is asymmetric (Evans, Hardin, and Stoebel 2016).
Meanwhile, the total amount of mRNA/cell corresponds to the sum of the bars under one condition in the diagram, since total mRNA/cell quantifies the total level of expression of all genes considered under that condition. For example, the bars under cond. A in (a) sum to the same value as the bars under cond. B in (a), even though the differential expression between the conditions is asymmetric (Evans, Hardin, and Stoebel 2016).
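To make this distinction concrete, here is a small toy example (the numbers are made up purely for illustration and do not come from any dataset) in which the totals match even though the differential expression is asymmetric:
# toy example: five genes, hypothetical expression levels per cell
cond_A <- c(10, 20, 30, 25, 15)  # genes 1-3 are up-regulated relative to cond. B
cond_B <- c(5, 10, 20, 25, 40)   # only gene 5 is up-regulated relative to cond. A
sum(cond_A) == sum(cond_B)  # TRUE: same total mRNA/cell despite asymmetric differential expression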
Now, we can translate these RNA-seq assumptions to ChIP-Seq data given ChIP-Seq and RNA-Seq have the same structure once the data is collected and aligned.
As an example, here are mapped reads for ChIP-Seq:
ChIP-Seq Mapped Reads
These have the same structure as the mapped reads for RNA-Seq.
The key difference between RNA-Seq and ChIP-Seq is what these maps signify. In ChIP-Seq, the map is over a chromosome, so each read is mapped to a genomic region (i.e., a region on a chromosome). Ideally, peaks represent regions where a binding site is present for our transcription factor or histone modification of interest (StatQuest, n.d.). Rather than looking for between-sample differential expression, as in RNA-Seq, ChIP-Seq aims to find between-sample differentially-bound regions, that is, the regions in which the signal changes significantly between conditions (Höllbacher et al. 2020).
Using this biological description of ChIP-Seq, we can translate the RNA-Seq assumptions that Evans, Hardin, and Stoebel (2016) discuss in their paper.
Symmetry is virtually the same as it was for RNA-Seq. For ChIP-Seq, symmetry measures whether the number of up-regulated genomic regions is consistent between conditions. When this number is not consistent, there is asymmetry in the differentially-bound regions between conditions.
There is no assumption of the same total amount of mRNA/cell for ChIP-Seq, because ChIP-Seq aims to find binding sites on DNA; there is no mRNA measured in ChIP-Seq as there is in RNA-Seq. As previously stated, the amount of mRNA/cell quantifies the magnitude of gene expression in RNA-Seq (Wu et al. 2015).
Therefore, an equivalent assumption in ChIP-Seq is that the total amount of DNA-protein binding is the same across conditions. That is, the signal summed over all binding regions (the analogue of total gene expression level in RNA-Seq) is the same across conditions.
ChIP-Seq Assumptions Diagram
Now that it has been established that similar assumptions about total signal magnitude and symmetry in differentially-bound regions exist for ChIP-Seq, we can begin to categorize ChIP-Seq normalization techniques by their biological assumptions. Note that Evans, Hardin, and Stoebel (2016) categorized RNA-Seq normalization techniques by their biological assumptions using the mathematical algorithms given in their "Supplementary data" section. Many RNA-Seq normalization techniques have been adopted for ChIP-Seq, as is evident from Stark and Brown (2022a)'s DiffBind Bioconductor package. As such, the mathematics of these ChIP-Seq normalization techniques is the same as for RNA-Seq (though the context has changed).
Thus, we can use Evans, Hardin, and Stoebel (2016)'s work to establish the following:
Assumptions
Technical effects are the same for differentially bound and non-differentially bound genomic regions
Roughly symmetric differentially-bound regions across conditions (same number of up and down-regulated genomic regions)
DESeq2's and DESeq's normalization technique is synonymous with RLE (Relative Log Expression) normalization, so we can use Evans, Hardin, and Stoebel (2016)'s discussion of DESeq's assumptions for RLE on the full library (a minimal sketch of the RLE size-factor calculation is given after the assumptions below).
Assumptions
Technical effects are the same for differentially bound and non-differentially bound genomic regions
Roughly symmetric differentially-bound regions across conditions (same number of up and down-regulated genomic regions)
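For reference, here is a minimal sketch of the RLE (median-of-ratios) size-factor calculation that DESeq/DESeq2 performs, worked out by hand on a made-up count matrix; in practice DiffBind/DESeq2 computes this internally:
# toy count matrix: rows = genomic regions (or genes), columns = samples
counts <- matrix(c(100, 200, 50,
                   150, 300, 75,
                   80, 160, 40), nrow = 3, byrow = TRUE)
# 1. pseudo-reference: geometric mean of each region across samples
geo_means <- exp(rowMeans(log(counts)))
# 2. ratio of each sample's counts to the pseudo-reference
ratios <- counts/geo_means
# 3. RLE size factor = per-sample median ratio; divide counts by it to normalize
size_factors <- apply(ratios, 2, median)
norm_counts <- sweep(counts, 2, size_factors, "/")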
Assumptions
Technical effects are the same for differentially bound and non-differentially bound genomic regions
Roughly symmetric differentially-bound regions across conditions (same number of up and down-regulated genomic regions)
For some RNA-Seq techniques, two additional variations were added when translating them to ChIP-Seq normalization. These involve looking at the "Reads in Peaks" and "Background Bins" subsets of the full libraries. As these appear to be fundamentally different from the standard RNA-Seq normalization techniques that use the full library, their assumptions will not be derived from Evans, Hardin, and Stoebel (2016)'s paper.
We can also normalize using experimental controls. Once a control is available, other standard normalization techniques can still be applied (e.g., quantile normalization, library size normalization, or RLE normalization).
Like RNA-Seq, ChIP-Seq also has spike-in controls. Spike-ins are external chromatin added to the experiment that is supposed to be unaffected by the antibody, so no biological peaks arise from the spike-in chromatin (Bonhoure et al. 2014); the only enriched loci for the spike-in control would come from technical effects. In this sense, they operate equivalently to spike-ins for RNA-Seq (a toy sketch of spike-in scaling follows the assumptions below).
Assumptions
Controls do exist! Their signal also behaves as expected (e.g., adding a negative control that is supposed to have no enriched peaks across conditions actually does have no enriched peaks across conditions)
Controls behave like non-control chromatin (i.e., genome): the technical effects on the controls represent the technical effects on all chromatin in the experiment. Thus, the controls can help us normalize the data.
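As intuition, here is a toy sketch of spike-in scaling; the counts and the split between spike-in and experimental reads are hypothetical (within DiffBind, spike-in normalization is requested through the spikein argument of dba.normalize, as seen in the calls earlier in this write-up):
# hypothetical read counts mapped to the spike-in genome in each sample;
# under the assumptions above, differences here reflect only technical effects
spike_counts <- c(A = 2e5, B = 4e5)
# scale factors that equalize the spike-in signal across samples
scale_factors <- min(spike_counts)/spike_counts
# hypothetical experimental counts in two peaks, rescaled by the spike-in factors
exp_counts <- matrix(c(1000, 2400, 500, 1100), nrow = 2,
                     dimnames = list(c("peak1", "peak2"), c("A", "B")))
norm_counts <- sweep(exp_counts, 2, scale_factors, "*")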
Parallel factors add a second antibody, which has known peaks, to the experiment (Guertin et al. 2018). As the peaks for parallel factors are known a priori, parallel factors can be used to detect technical effects. The idea behind parallel factors parallels the role of housekeeping genes in RNA-Seq: use an internal, a priori control to detect technical effects. Thus, the assumptions listed below align with those for RNA-Seq's housekeeping genes.
Assumptions
Controls do exist! The second antibody behaves as expected and generates peaks of known location and magnitude.
The control reads (i.e., those associated with the second antibody) behave like the reads associated with the non-control antibody: the technical effects on the controls represent the technical effects on all chromatin in the experiment. Thus, the controls can help us normalize the data.
Loess normalization fits a loess smoothing curve to the data and then subtracts the fitted curve so that the data are re-centered at 0. In Taslim et al. (2009)'s loess normalization, two loess curves are fit, (1) with respect to the mean and (2) with respect to the variance, so that both statistics are normalized.
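Below is a minimal sketch of the basic idea, normalizing only with respect to the mean (Taslim et al. (2009) additionally adjust the variance); the simulated signal is a placeholder, not real data:
# simulated log2 signal for one set of regions in two samples (placeholder data)
set.seed(1)
x <- rnorm(500, mean = 8)
y <- x + 0.3 + 0.1 * x + rnorm(500, sd = 0.2)
A <- (x + y)/2  # average log signal per region
M <- y - x      # log difference between samples
# fit a loess curve of M on A and subtract the fitted values, centering M at 0
fit <- loess(M ~ A)
M_norm <- M - fitted(fit)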
Assumptions
MAnorm is a normalization technique specific to ChIP-Seq data that uses the common peaks to fit a linear model for normalization and then applies this model to all the peaks (Shao et al. 2012). As such, MAnorm is a variation of "Reads in Peaks" normalization that only looks at the common peaks (a minimal sketch of this idea is given after the assumptions below).
Assumptions
Technical effects are the same for differentially bound and non-differentially bound genomic regions (Shao et al. 2012)
“Second, the observed differences in sequence read density in common peaks are presumed to reflect the scaling relationship of ChIP-Seq signals between two samples, which can thus be applied to all peaks.”
Same total DNA-protein binding: the amount of total DNA binding is the same between different experimental conditions.
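Here is a minimal sketch of the MAnorm idea under the assumptions above; the counts and the common-peak labels are hypothetical, and the published method uses a robust regression rather than the plain lm() shown here:
# hypothetical log2 read counts for all peaks in two samples
set.seed(2)
x_all <- rnorm(300, mean = 9)
y_all <- 0.9 * x_all + 1 + rnorm(300, sd = 0.3)
common <- sample(c(TRUE, FALSE), 300, replace = TRUE, prob = c(0.6, 0.4))  # hypothetical common-peak flags
A <- (x_all + y_all)/2
M <- y_all - x_all
# fit the M-vs-A model on common peaks only (assumed non-differential)...
fit <- lm(M ~ A, subset = common)
# ...then apply it to every peak, so that common peaks are centered at M = 0
M_norm <- M - predict(fit, newdata = data.frame(A = A))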
Reads in Peaks is a normalization approach unique to ChIP-Seq data. Reads in Peaks normalization techniques only look at the subset of reads that overlap the consensus peak set in each replicate/sample (Stark and Brown 2022a). An example of this is Shao et al. (2012), which only looks at common peaks to form a normalization equation. The goal of normalizing with reads in peaks is to correct not only for sequencing depth but also for ChIP efficiency, which is measured through the proportion of reads in enriched peak regions (Stark and Brown 2022a). As such, Stark and Brown (2022a) recommend using "RiP" library sizes for the native normalization methods originally developed for RNA-Seq.
However, ChIP efficiency may not always be an issue. For example, if different replicates have fairly consistent FRiP (Fraction of Reads in Peaks) values, ChIP efficiency does not appear to be a concern (Stark and Brown 2022a).
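For intuition, FRiP is simply the fraction of a sample's mapped reads that fall inside the consensus peaks, and the "RiP" library size is the numerator of that fraction. Here is a toy sketch with made-up totals (DiffBind reports FRiP after counting, so this is only for illustration):
# hypothetical per-sample totals: reads overlapping the consensus peaks vs. all mapped reads
reads_in_peaks <- c(s1 = 1.2e6, s2 = 9e5, s3 = 1.1e6)
total_reads <- c(s1 = 2e7, s2 = 1.5e7, s3 = 2.1e7)
frip <- reads_in_peaks/total_reads  # roughly consistent FRiP suggests comparable ChIP efficiency
# 'reads in peaks' library-size factors for normalization
rip_factors <- reads_in_peaks/mean(reads_in_peaks)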
Effective library size and reads in peaks are equivalent for ChIP-Seq, as both look at the consensus peak set (Wu et al. 2015). According to Wu et al. (2015) and Stark and Brown (2022a), effective library size/reads in peaks normalization assumes comparable DNA-protein binding between conditions. So…
Assumptions:
Same total DNA-protein binding within the consensus peak set: the amount of binding captured by the consensus peaks is comparable between conditions.
Normalizing using only background bins is also unique to ChIP-Seq. As mentioned above, background-bin normalization only considers reads on chromosomes that contain peaks in the consensus peak set; these chromosomes are then partitioned into segments of roughly 15,000 base pairs (Stark and Brown 2022b). Background bins are considered a more "neutral" set of reads (Stark and Brown 2022a). Segments of ~15,000 bp are chosen because peaks in ChIP-Seq are expected to be around 100-600 bp (Stark and Brown 2022a), so it is assumed that no global change between biological conditions occurs over a 15,000 bp interval. Any large-scale changes seen in these intervals are therefore expected to come from technical artifacts rather than biological shifts (a sketch of the corresponding DiffBind call is given at the end of this section).
Assumptions
Same total DNA-protein binding: the amount of total DNA binding is the same between different experimental conditions (i.e., there is no global shift between conditions)
Technical effects are the same for differentially bound and non-differentially bound (i.e. background bin) genomic regions.
This one reminded me of the joy that was trying to cross-validate for Ridge Regression and Lasso lol
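For reference, the background-bin normalizations compared above (e.g., "Background RLE" and "Background Library") were presumably produced with calls like the following, mirroring the dba.normalize calls earlier in this write-up but with background = TRUE in place of a peak-based library:
# background-bin normalization: RLE computed on ~15kb bins rather than on the consensus peaks
res_bg_RLE <- dba.normalize(res_c, normalize = DBA_NORM_RLE, background = TRUE,
    spikein = FALSE)
res_bg_RLE <- dba.analyze(res_bg_RLE)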