Contents

0.0.1 ChIPseq via GUI:

Tutorial experiment downloadable here:

+ Two ChIPseq one IgG control and an other Prdm5 TF moAb (mouse)

+ single-end mode sequencing, 

+ 1 million reads for each sample.

Experiment description:

+ PRDM family members are transcriptional regulators involved in tissue specific differentiation. PRDM5 has been reported to predominantly repress transcription, but a characterization of its molecular functions in a relevant biological context is lacking. Prdm5 controls both Collagen I transcription and fibrillogenesis by binding inside the Col1a1 gene body and maintaining RNA polymerase II occupancy (Galli et al. PLoS Genet. 2012,e1002711).

+ The toy experiment is organized in three folders: i. one for IgG (igg, containing the igg pool down fastq); ii. one for Prdm5 (prdm5, containing the Prdm5 pool down fastq) and iii. the other for analysis output prdm5.igg, where MACS results will be located. The execution of the ChIPseq workflow using GUI or line command will provide the final annotated table of peaks (mypeaks.xls)

The chipseq workflow can be run using 4SeqGUI graphical interface:

ChIPseq workflow

The ChIPseq is made of two main steps:

0.0.2 Creating a BWA index file for Chipseq:

The index can be easily created using the graphical interface:

Creating a BWA index with Genome indexing BWA

bwaIndexUcsc(group="sudo",genome.folder="/sto2/data/scratch/mm10bwa", uscs.urlgenome=
"http://hgdownload.cse.ucsc.edu/goldenPath/mm10/bigZips/chromFa.tar.gz",
gatk=FALSE)

In brief, bwaIndexUcsc uses UCSC genomic data. User has to provide the URL (uscs.urlgenome) for the file chromFa.tar.gz related to the organism of interest and the path to the folder where the index will be generated (genome.folder). The parameter gatk has to be set to FALSE, it is not required for ChIPseq genomic index creation.

Precompiled index folders are available:

0.0.3 Calling peaks and annotating:

All the parameters needed to run MACS or SICER can be setup using 4SeqGUI

MACS and SICER analysis

A detailed description of the parameters is given below.

0.0.4 Chipseq workflow by line command

The chipseq workflow can be also executed using R and it is completely embedded in a unique function:

system("wget 130.192.119.59/public/test.chipseqCounts.zip")
unzip("test.chipseqCounts.zip")
setwd("test.chipseqCounts")
library(docker4seq)
chipseqCounts(group = "docker", output.folder = "./prdm51.igg",
  mock.folder="./igg", test.folder="./prdm51", scratch.folder=getwd(),
  adapter5 = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
  adapter3 = "AATGATACGGCGACCACCGAGATCTACACTCTTTCCCTACACGACGCTCTTCCGATCT",
  threads = 8, min.length = 30, genome.folder,
  mock.id = "igg", test.id = "tf", genome, read.size = 50,
  tool = "macs", macs.min.mfold = 10, macs.max.mfold = 30,
  macs.pval = "1e-5", sicer.wsize = 200, sicer.gsize = 200,
  sicer.fdr = 0.1, tss.distance = 0, max.upstream.distance = 10000,
  remove.duplicates = "N")

Specifically user needs to create three folders:

+ mock.folder, where the fastq.gz file for the control sample is located. For control sample we refer to ChIP with IgG only or input DNA.
+ test.folder, where the fastq.gz file for the ChIP of the sample to be analysed.
+ output.folder, where the R script embedding the above script is located.

The scratch.folder can be the same as the output.folder. However, if the system has a high speed disk for temporary calculation, e.g. a SSD disk, the location of the scratch.folder on the SSD will reduce significantly the computing time.

User needs to provide also the sequence of the sequencing adapters, adapter5 and adapter3 parameters. In case Illumina platform the adapters sequences can be easily recovered here.

Threads indicates the max number of cores used by skewer and bwa, all the other steps are done on a single core. The min.length refers to the minimal length that a reads should have after adapters trimming. Since today the average read length for a ChIP experiment is 50 or 75 nts would be better to bring to 40 nts the min.length parameter to increase the precision in assigning the correct position on the genome.

The genome.folder parameter refers to the location of the genomic index generated by bwa using the docker4seq function bwaIndexUcsc.

mock.id and test.id identify the type of sample and are assigned to the ID parameter in the RG field of the bam file.

genome is the parameter referring to the annotation used to associate ChIP peaks with genes. In the present implementation hg38, hg19 for human and mm10 and mm9 for mouse annotations are available.

read.size is a parameter requested by MACS and SICER for their analysis. macs.min.mfold, macs.max.mfold, macs.pval are the default parameters requested for peaks definition for more info please refer to the documentation of MACS 1.4. sicer.wsize, sicer.gsize, sicer.fdr are the default parameters requested for peaks definition for more info please refer to the documentation of SICER 1.1. IMPORTANT: The optimal value for sicer.gsize in case of H3K4Me3 ChIP is 200 and in case of ChIP H3K27Me3 is 600.

tss.distance and max.upstream.distance are parameters required by ChIPseqAnno, which is the Bioconductor package used to assign the peaks to specific genes. Specifically max.upstream.distance refers to the max distance in nts that allows the association of a peak to a specific gene.

remove.duplicates is the parameter that indicates if duplicates have to be removed or not. It has two options: N duplicates are not removed, Y duplicates are removed.

0.0.5 Chipseq workflow output files

The chipseq workflow produces the following output files:

+ README: A file describing the content of the data folder
+ mypeaks.xls: All detected peaks alongside the nearest gene and its annotation
+ mytreat.counts: The total reads count for the provided treatment file
+ mycontrol.counts: The total reads count for the provided control/background file
+ peak_report.xls: Aggregate information regarding the peak and their position relative to the nearest gene
+ chromosome_distribution.pdf: Barplot of the distribution of the peaks on the chromosomes
+ relative_position_distribution.pdf: Barplot of the distribution of the peaks positions relative to their nearest gene
+ peak_width_distribution.pdf: Histogram of the distribution of the width of the peaks
+ distance_from_nearest_gene_distribution.pdf: Histogram of the distribution of the distance of each peak from its nearest gene
+ cumulative_coverage_total.pdf: Cumulative normalized gene coverage
+ cumulative_coverage_chrN.pdf: Cumulative normalized gene coverage for the specific chromosome
+ mycontrol_sorted.bw: bigWig file for UCSC Genome Browser visualization
+ mytreat_sorted.bw: bigWig file for UCSC Genome Browser visualization