ChIP-Seq (Chromatin Immunoprecipitation followed by Sequencing) data analysis aims to identify and understand the binding sites of DNA-associated proteins (such as transcription factors, histones, or chromatin remodelers) across the genome. This information is crucial for studying gene regulation, chromatin states, and epigenetic mechanisms. The specific purposes of ChIP-Seq data analysis include:
Identify Protein-DNA Binding Sites
Understand Gene Regulation
Study Chromatin States
Discover DNA Motifs
Explore Epigenetic Changes
Integrate with Other Genomic Data
Compare Conditions or Treatments
The purpose of this exercise is to investigate the promoter regions in the gene targeted by the DNA-directed RNA polymerase II subunit RPB1 during gene transcription. The raw data consists of four single-end FASTQ files generated by Illumina Genome Analyzer and available at ENCODE database with the accession numbers: ENCFF000XJP, ENCFF000XJS, and ENCFF000XKD, and the accession number of the input data (control) is ENCSR000XGP. The raw data is from a ChIP-Seq experiment with an accession number ENCSR000EZL that includes three samples from HeLa-S3 cell line established from cervical adenocarcinoma.
Data is downloaded from ENCODE and the reference genome hg19 from UCSC.
# getting data
system("wget \
-O 'data/ENCFF000XJP_chp1.fastq.gz' \
'https://www.encodeproject.org/files/ENCFF000XJP/@@download/ENCFF000XJP.fastq.gz' ")
system("wget \
-O "data/ENCFF000XJS_chp2.fastq.gz" \
'https://www.encodeproject.org/files/ENCFF000XJS/@@download/ENCFF000XJS.fastq.gz' ")
system("wget \
-O "data/ENCFF000XKD_chp3.fastq.gz" \
'https://www.encodeproject.org/files/ENCFF000XKD/@@download/ENCFF000XKD.fastq.gz' ")
system("wget \
-O "data/ENCFF000XGP_inp0.fastq.gz" \
"https://www.encodeproject.org/files/ENCFF000XGP/@@download/ENCFF000XGP.fastq.gz' ")
The output file are:
./data/ENCFF000XGP_inp0.fastq
./data/ENCFF000XJS_chp2.fastq
./data/ENCFF000XJP_chp1.fastq
./data/ENCFF000XKD_chp3.fastq
You can also embed plots, for example:
system("cd data")
system("fastqc \
ENCFF000XJP_chp1.fastq \
ENCFF000XJS_chp2.fastq \
ENCFF000XKD_chp3.fastq \
ENCFF000XGP_inp0.fastq ")
It is hg19 from USCS
system("wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz")
system("gunzip -d hg19.fa.gz")
system("samtools faidx hg19.fa")
system("bowtie2-build hg19.fa hg19")
# alignment to reference genome
system("bowtie2 \
-p 4 \
-x /Users/nnthieu/genome_ref/hg19/hg19 \
-U ./data/ENCFF000XGP_inp0.fastq \
-S ./bam/ENCFF000XGP_inp0.sam \
2> ./bam/inp0.log")
system("bowtie2 \
-p 4 \
-x /Users/nnthieu/genome_ref/hg19/hg19 \
-U ./data/ENCFF000XJP_chp1.fastq \
-S ./bam/ENCFF000XJP_chp1.sam \
2> ./bam/chp1.log")
system("bowtie2 \
-p 4 \
-x /Users/nnthieu/genome_ref/hg19/hg19 \
-U ./data/ENCFF000XJS_chp2.fastq \
-S ./bam/ENCFF000XJS_chp2.sam \
2> ./bam/chp2.log")
system("bowtie2 \
-p 4 \
-x /Users/nnthieu/genome_ref/hg19/hg19 \
-U ./data/ENCFF000XKD_chp3.fastq \
-S ./bam/ENCFF000XKD_chp3.sam \
2> ./bam/chp3.log")
The output files are:
ENCFF000XJP_chp1.sam
ENCFF000XJS_chp2.sam
ENCFF000XKD_chp3.sam
ENCFF000XGP_inp0.sam
These files are defined as “chrM” in the chromosome field of the SAM file and the unidentified, random, and haploid reads, which are defined as “chrUn”, “random”, and “hap”, respectively, keeping only the reads aligned to the human chromosomes. I use “sed” Linux command to do that and the filtered alignments are saved in new files.
system("cd bam")
system("sed '/chrM/d;/random/d;/chrUn/d;/hap/d' ENCFF000XGP_inp0.sam > ENCFF000XGP_inp0_filt.sam")
system("sed '/chrM/d;/random/d;/chrUn/d;/hap/d' ENCFF000XJP_chp1.sam > ENCFF000XJP_chp1_filt.sam")
system("sed '/chrM/d;/random/d;/chrUn/d;/hap/d' ENCFF000XJS_chp2.sam > ENCFF000XJS_chp2_filt.sam")
system("sed '/chrM/d;/random/d;/chrUn/d;/hap/d' ENCFF000XKD_chp3.sam > ENCFF000XKD_chp3_filt.sam")
system("samtools view -S -b ENCFF000XGP_inp0_filt.sam > ENCFF000XGP_inp0_filt.bam")
system("samtools view -S -b ENCFF000XJP_chp1_filt.sam > ENCFF000XJP_chp1_filt.bam")
system("samtools view -S -b ENCFF000XJS_chp2_filt.sam > ENCFF000XJS_chp2_filt.bam")
system("samtools view -S -b ENCFF000XKD_chp3_filt.sam > ENCFF000XKD_chp3_filt.bam")
Calculate the number of alignments in each file
# the number of alignments in each file
system("samtools view -c ENCFF000XGP_inp0_filt.bam") # 30802094
system("samtools view -c ENCFF000XJP_chp1_filt.bam") # 8898435
system("samtools view -c ENCFF000XJS_chp2_filt.bam") # 12682086
system("samtools view -c ENCFF000XKD_chp3_filt.bam") # 13148726
the number of alignments in the input file is 30802094 that should be adjusted for equivalenting to each ChIP files.
system(" inpc=$(samtools view -c ENCFF000XGP_inp0_filt.bam) ")
system(" chp1=$(samtools view -c ENCFF000XJP_chp1_filt.bam) ")
system(" fact1=$(echo "scale=6; $chp1/$inpc" | bc) ")
system(" samtools view -b -s $fact1 ENCFF000XGP_inp0_filt.bam > ENCFF000XGP_inp0_filt_inp1.bam ")
system(" chp2=$(samtools view -c ENCFF000XJS_chp2_filt.bam) ")
system(" fact2=$(echo "scale=6; $chp2/$inpc" | bc) ")
system(" samtools view -b -s $fact2 ENCFF000XGP_inp0_filt.bam > ENCFF000XGP_inp0_filt_inp2.bam ")
system(" chp3=$(samtools view -c ENCFF000XKD_chp3_filt.bam) ")
system(" fact3=$(echo "scale=6; $chp3/$inpc" | bc) ")
system(" samtools view -b -s $fact3 ENCFF000XGP_inp0_filt.bam > ENCFF000XGP_inp0_filt_inp3.bam ")
check the counts of new input (control) files
system(" samtools view -c ENCFF000XGP_inp0_filt_inp1.bam") # 8897472
system(" samtools view -c ENCFF000XGP_inp0_filt_inp2.bam") # 12677935
system(" samtools view -c ENCFF000XGP_inp0_filt_inp3.bam") # 13143890
The new inpt files are equivalent to each ChIP-seq files to avoid library coverage bias.
Up to this point, I have three ChIP-Seq BAM files and three control BAM files, one for each ChIP-Seq file.
View the alignment in the BAM files
system(" samtools view -c ENCFF000XGP_inp0_filt.bam")
system(" samtools view -c ENCFF000XJP_chp1_filt.bam")
system(" samtools view -c ENCFF000XJS_chp2_filt.bam")
system(" samtools view -c ENCFF000XKD_chp3_filt.bam")
# chip1
system(" samtools sort ENCFF000XJP_chp1_filt.bam -o ENCFF000XJP_chp1_filt_so.bam")
system(" samtools index ENCFF000XJP_chp1_filt_so.bam")
# chip2
system(" samtools sort ENCFF000XJS_chp2_filt.bam -o ENCFF000XJS_chp2_filt_so.bam")
system(" samtools index ENCFF000XJS_chp2_filt_so.bam")
#chip3
system(" samtools sort ENCFF000XKD_chp3_filt.bam -o ENCFF000XKD_chp3_filt_so.bam")
system(" samtools index ENCFF000XKD_chp3_filt_so.bam")
system(" samtools tview \
ENCFF000XJP_chp1_filt_so.bam \
-p chr17:56084561-56084661 \
/Users/nnthieu/genome_ref/hg19.fa")
system(" macs3 callpeak \
-t ./bam/ENCFF000XJP_chp1_filt_so.bam \
-c ./bam/ENCFF000XGP_inp0_filt_inp1.bam \
-f BAM \
-g hs \
-n chip1 \
-q 0.05 \
--bdg \
--outdir ./macs3output")
system(" macs3 callpeak \
-t bam/ENCFF000XJS_chp2_filt_so.bam \
-c bam/ENCFF000XGP_inp0_filt_inp2.bam \
-f BAM \
-g hs \
-n chip2 \
-q 0.05 \
--bdg \
--outdir macs3output")
system(" macs3 callpeak \
-t bam/ENCFF000XKD_chp3_filt_so.bam \
-c bam/ENCFF000XGP_inp0_filt_inp3.bam \
-f BAM \
-g hs \
-n chip3 \
-q 0.05 \
--bdg \
--outdir macs3output")
Several output files for each ChIP-Seq data have been saved in the specified output directory ‘macs3output’.
chip1_peaks.narrowPeak
chip1_treat_pileup.bdg
chip1_control_lambda.bdg
chip1_peaks.xls
chip1_model.r
chip1_summits.bed
These files are ready for analysis and visualization.
See the part 2 on next post.
#—