ChIP-seq Data Analysis

1. Introduction

ChIP-Seq (Chromatin Immunoprecipitation followed by Sequencing) data analysis aims to identify and understand the binding sites of DNA-associated proteins (such as transcription factors, histones, or chromatin remodelers) across the genome. This information is crucial for studying gene regulation, chromatin states, and epigenetic mechanisms. The specific purposes of ChIP-Seq data analysis include:

Identify Protein-DNA Binding Sites
Understand Gene Regulation
Study Chromatin States
Discover DNA Motifs
Explore Epigenetic Changes
Integrate with Other Genomic Data
Compare Conditions or Treatments

The purpose of this exercise is to investigate the promoter regions in the gene targeted by the DNA-directed RNA polymerase II subunit RPB1 during gene transcription. The raw data consists of four single-end FASTQ files generated by Illumina Genome Analyzer and available at ENCODE database with the accession numbers: ENCFF000XJP, ENCFF000XJS, and ENCFF000XKD, and the accession number of the input data (control) is ENCSR000XGP. The raw data is from a ChIP-Seq experiment with an accession number ENCSR000EZL that includes three samples from HeLa-S3 cell line established from cervical adenocarcinoma.

2. Download data

Data is downloaded from ENCODE and the reference genome hg19 from UCSC.

# getting data
system("wget \
    -O 'data/ENCFF000XJP_chp1.fastq.gz' \
    'https://www.encodeproject.org/files/ENCFF000XJP/@@download/ENCFF000XJP.fastq.gz' ")

system("wget \
    -O "data/ENCFF000XJS_chp2.fastq.gz" \
    'https://www.encodeproject.org/files/ENCFF000XJS/@@download/ENCFF000XJS.fastq.gz' ")

system("wget \
    -O "data/ENCFF000XKD_chp3.fastq.gz" \
    'https://www.encodeproject.org/files/ENCFF000XKD/@@download/ENCFF000XKD.fastq.gz' ")

system("wget \
    -O "data/ENCFF000XGP_inp0.fastq.gz" \
    "https://www.encodeproject.org/files/ENCFF000XGP/@@download/ENCFF000XGP.fastq.gz' ")

The output file are:

./data/ENCFF000XGP_inp0.fastq

./data/ENCFF000XJS_chp2.fastq

./data/ENCFF000XJP_chp1.fastq

./data/ENCFF000XKD_chp3.fastq

3. Fastq file quality check

You can also embed plots, for example:

system("cd data")
system("fastqc \
    ENCFF000XJP_chp1.fastq \
    ENCFF000XJS_chp2.fastq \
    ENCFF000XKD_chp3.fastq \
    ENCFF000XGP_inp0.fastq ")

Reference files

It is hg19 from USCS

system("wget https://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/hg19.fa.gz")
system("gunzip -d hg19.fa.gz")
system("samtools faidx hg19.fa")
system("bowtie2-build hg19.fa hg19")

4. ChIP-Seq and Input Read Mapping

# alignment to reference genome
system("bowtie2 \
    -p 4 \
    -x /Users/nnthieu/genome_ref/hg19/hg19 \
    -U ./data/ENCFF000XGP_inp0.fastq \
    -S ./bam/ENCFF000XGP_inp0.sam \
    2> ./bam/inp0.log")

system("bowtie2 \
    -p 4 \
    -x /Users/nnthieu/genome_ref/hg19/hg19 \
    -U ./data/ENCFF000XJP_chp1.fastq \
    -S ./bam/ENCFF000XJP_chp1.sam \
    2> ./bam/chp1.log")

system("bowtie2 \
    -p 4 \
    -x /Users/nnthieu/genome_ref/hg19/hg19 \
    -U ./data/ENCFF000XJS_chp2.fastq \
    -S ./bam/ENCFF000XJS_chp2.sam \
    2> ./bam/chp2.log")

system("bowtie2 \
    -p 4 \
    -x /Users/nnthieu/genome_ref/hg19/hg19 \
    -U ./data/ENCFF000XKD_chp3.fastq \
    -S ./bam/ENCFF000XKD_chp3.sam \
    2> ./bam/chp3.log")

The output files are:

ENCFF000XJP_chp1.sam

ENCFF000XJS_chp2.sam

ENCFF000XKD_chp3.sam

ENCFF000XGP_inp0.sam

remove the mitochondrion read alignments

These files are defined as “chrM” in the chromosome field of the SAM file and the unidentified, random, and haploid reads, which are defined as “chrUn”, “random”, and “hap”, respectively, keeping only the reads aligned to the human chromosomes. I use “sed” Linux command to do that and the filtered alignments are saved in new files.

system("cd bam")

system("sed '/chrM/d;/random/d;/chrUn/d;/hap/d' ENCFF000XGP_inp0.sam > ENCFF000XGP_inp0_filt.sam")

system("sed '/chrM/d;/random/d;/chrUn/d;/hap/d' ENCFF000XJP_chp1.sam > ENCFF000XJP_chp1_filt.sam")

system("sed '/chrM/d;/random/d;/chrUn/d;/hap/d' ENCFF000XJS_chp2.sam > ENCFF000XJS_chp2_filt.sam")

system("sed '/chrM/d;/random/d;/chrUn/d;/hap/d' ENCFF000XKD_chp3.sam > ENCFF000XKD_chp3_filt.sam")

convert the SAM files into BAM files

system("samtools view -S -b ENCFF000XGP_inp0_filt.sam > ENCFF000XGP_inp0_filt.bam")

system("samtools view -S -b ENCFF000XJP_chp1_filt.sam > ENCFF000XJP_chp1_filt.bam")

system("samtools view -S -b ENCFF000XJS_chp2_filt.sam > ENCFF000XJS_chp2_filt.bam")

system("samtools view -S -b ENCFF000XKD_chp3_filt.sam > ENCFF000XKD_chp3_filt.bam")

Calculate the number of alignments in each file

# the number of alignments in each file

system("samtools view -c ENCFF000XGP_inp0_filt.bam")  # 30802094
system("samtools view -c ENCFF000XJP_chp1_filt.bam")  # 8898435
system("samtools view -c ENCFF000XJS_chp2_filt.bam")  # 12682086
system("samtools view -c ENCFF000XKD_chp3_filt.bam")  # 13148726

the number of alignments in the input file is 30802094 that should be adjusted for equivalenting to each ChIP files.

Subsample files for balancing reads counts

system(" inpc=$(samtools view -c ENCFF000XGP_inp0_filt.bam) ")

system(" chp1=$(samtools view -c ENCFF000XJP_chp1_filt.bam) ")
system(" fact1=$(echo "scale=6; $chp1/$inpc" | bc) ")
system(" samtools view -b -s $fact1 ENCFF000XGP_inp0_filt.bam > ENCFF000XGP_inp0_filt_inp1.bam ")

system(" chp2=$(samtools view -c ENCFF000XJS_chp2_filt.bam) ")
system(" fact2=$(echo "scale=6; $chp2/$inpc" | bc) ")
system(" samtools view -b -s $fact2 ENCFF000XGP_inp0_filt.bam > ENCFF000XGP_inp0_filt_inp2.bam ")

system(" chp3=$(samtools view -c ENCFF000XKD_chp3_filt.bam) ")
system(" fact3=$(echo "scale=6; $chp3/$inpc" | bc) ")
system(" samtools view -b -s $fact3 ENCFF000XGP_inp0_filt.bam > ENCFF000XGP_inp0_filt_inp3.bam ")

check the counts of new input (control) files

system(" samtools view -c ENCFF000XGP_inp0_filt_inp1.bam")  # 8897472
system(" samtools view -c ENCFF000XGP_inp0_filt_inp2.bam")  # 12677935
system(" samtools view -c ENCFF000XGP_inp0_filt_inp3.bam")  # 13143890

The new inpt files are equivalent to each ChIP-seq files to avoid library coverage bias.

Up to this point, I have three ChIP-Seq BAM files and three control BAM files, one for each ChIP-Seq file.

View the alignment in the BAM files

system(" samtools view -c ENCFF000XGP_inp0_filt.bam")
system(" samtools view -c ENCFF000XJP_chp1_filt.bam")
system(" samtools view -c ENCFF000XJS_chp2_filt.bam")
system(" samtools view -c ENCFF000XKD_chp3_filt.bam")

sort and indexing files

# chip1
system(" samtools sort ENCFF000XJP_chp1_filt.bam -o ENCFF000XJP_chp1_filt_so.bam")
system(" samtools index ENCFF000XJP_chp1_filt_so.bam")

# chip2
system(" samtools sort ENCFF000XJS_chp2_filt.bam -o ENCFF000XJS_chp2_filt_so.bam")
system(" samtools index ENCFF000XJS_chp2_filt_so.bam")

#chip3
system(" samtools sort ENCFF000XKD_chp3_filt.bam -o ENCFF000XKD_chp3_filt_so.bam")
system(" samtools index ENCFF000XKD_chp3_filt_so.bam")

quick way to visualize the alignments in a BAM file and to inspect the binding site in a specific location

system(" samtools tview \
    ENCFF000XJP_chp1_filt_so.bam \
    -p chr17:56084561-56084661 \
    /Users/nnthieu/genome_ref/hg19.fa")

5. ChIP-Seq Peak Calling with MACS3

system(" macs3 callpeak \
    -t ./bam/ENCFF000XJP_chp1_filt_so.bam \
    -c ./bam/ENCFF000XGP_inp0_filt_inp1.bam \
    -f BAM \
    -g hs \
    -n chip1 \
    -q 0.05 \
    --bdg \
    --outdir ./macs3output")

system(" macs3 callpeak \
    -t bam/ENCFF000XJS_chp2_filt_so.bam \
    -c bam/ENCFF000XGP_inp0_filt_inp2.bam \
    -f BAM \
    -g hs \
    -n chip2 \
    -q 0.05 \
    --bdg \
    --outdir macs3output")

system(" macs3 callpeak \
    -t bam/ENCFF000XKD_chp3_filt_so.bam \
    -c bam/ENCFF000XGP_inp0_filt_inp3.bam \
    -f BAM \
    -g hs \
    -n chip3 \
    -q 0.05 \
    --bdg \
    --outdir macs3output")

Several output files for each ChIP-Seq data have been saved in the specified output directory ‘macs3output’.

chip1_peaks.narrowPeak

chip1_treat_pileup.bdg

chip1_control_lambda.bdg

chip1_peaks.xls

chip1_model.r

chip1_summits.bed

These files are ready for analysis and visualization.

See the part 2 on next post.

#—

ChIP-seq Data Analysis - Part 1

Thieu Nguyen

2024-11-24