Except for the sequencing applications that use de novo genome assembly, read alignment/mapping to a reference genome is the most fundamental step in the workflow of the sequencing data analysis.
# setwd("/Users/nnthieu/refgenome")
system("mkdir data")
system("cd data")
system("prefetch SRR769545")
system("fasterq-dump SRR769545.sra")
# instead of fasterq-dump --verbose SRR769545 causing some code 3 errors
The outputs are 2 file: SRR769545_1.fastq and SRR769545_2.fastq, total 22G. It is large, so I zip them to save storing capacity.
system("gzip SRR769545_1.fastq")
system("gzip SRR769545_2.fastq")
The output files are:
SRR769545_1.fastq.gz SRR769545_2.fastq.gz
system("fastqc SRR769545_1.fastq.gz SRR769545_2.fastq.gz")
system("bwa index GRCh38.p13_ref.fna")
The output files stored in directory ‘BWAIndex’ are:
GRCh38.p13_ref.fna.amb
GRCh38.p13_ref.fna.ann
GRCh38.p13_ref.fna.bwt
GRCh38.p13_ref.fna.pac
GRCh38.p13_ref.fna.sa
# first input file is indexing 5 files in BWAIndex folder
# second and third input files are .gz files
system("bwa mem \
-t 4 \
./BWAIndex/GRCh38.p13_ref.fna \
./data/SRR769545_1.fastq.gz \
./data/SRR769545_2.fastq.gz \
> ./sam/SRR769545_mem.sam 2> ./sam/SRR769545_mem.log")
Display SAM file
system("cd sam")
system("less -S SRR769545_mem.sam")
Convert sam file to bam file then remove sam file to save storing space and memmory.
system("samtools view \
-uS \
-o SRR769545_mem.bam \
SRR769545_mem.sam")
system("rm SRR769545_mem.sam")
A new version of bwa-mem is bwa-mem2 that runs for the same algorithm but faster and the indexing occupies less storage space and memory.
# install bwa-mem2
system("conda install -c bioconda bwa-mem2")
# runing bwa
system("bwa bwasw \
./BWAIndex/GRCh38.p13_ref.fna \
./data/SRR769545_1.fastq.gz \
./data/SRR769545_2.fastq.gz \
> ./sam/SRR769545_bwasw.sam 2> sam/SRR769545_bwasw.log")
use “bwa aln” to perform the first step of the alignment
system("cd ../")
system("bwa aln \
./BWAIndex/GRCh38.p13_ref.fna \
./data/SRR769545_1.fastq.gz \
> ./data/SRR769545_1.sai")
system("bwa aln \
./BWAIndex/GRCh38.p13_ref.fna \
./data/SRR769545_2.fastq.gz \
> ./data/SRR769545_2.sai ")
Then, I can use “bwa sampe” to generate the SAM file for the alignments.
system("bwa sampe \
./BWAIndex/GRCh38.p13_ref.fna \
./data/SRR769545_?.sai \
./data/SRR769545_?.fastq.gz \
> ./sam/SRR769545_aln.sam 2> ./sam/SRR769545_aln.log")
system("bowtie2-build \
--threads 4 \
./BWAIndex/GRCh38.p13_ref.fna \
./bowtie2")
system("bowtie2 -x ./bowtie2 \
-1 ./data/SRR769545_1.fastq.gz \
-2 ./data/SRR769545_2.fastq.gz \
-S ./sam/SRR769545_bowtie2.sam")
# or create bam file wif using -b option
system("bowtie2 -x ./bowtie2 \
-1 ./data/SRR769545_1.fastq.gz \
-2 ./data/SRR769545_2.fastq.gz \
-b ./sam/SRR769545_bowtie2.bam")
The output is /sam/SRR769545_bowtie2.sam.
Then I convert this bowtie2 sam file to bam file and remove the sam file to save store space as follow:
system("samtools view \
-uS \
-o SRR769545_bowtie2.bam \
SRR769545_bowtie2.sam")
system("rm SRR769545_bowtie2.sam")
sort the bam file by TAGSs (or by read names -n)
system("samtools sort \
-@ 4 \
-T mem.tmp.sort \
-o .sam/SRR769545_mem_sorted.bam \
./sam/SRR769545_mem.bam")
system("rm ./sam/SRR769545_mem.bam")
Indexing the sorted.bam file
system("samtools index ./sam/SRR769545_mem_sorted.bam")
Extracting Alignments of a Chromosome With the following command, you can split the alignments of a chromosome 11 in a separate file using “samtools view”
system("samtools view .sam/SRR769545_mem_sorted.bam NC_000011.10 > # .sam/chr11_human.sam")
Filtering and Counting Alignment in SAM/BAM Files
# search for the alignments with chimeric reads, which are tagged as “SA:”
system("samtools view ./sam/SRR769545_mem_sorted.bam | grep ‘SA:’ | less -S")
system("samtools view ./sam/SRR769545_mem_sorted.bam | grep ‘SA:’ | wc -l")
Count total reads of the file
# counting number of reads
system("samtools view -c ./sam/SRR769545_mem_sorted.bam")
To count mapped reads, use the “-F” option
system("samtools view -c -F 0x4 ./sam/SRR769545_mem_sorted.bam")
To count unmapped reads, use the “-f” option instead of -F
system("samtools view -c -f 0x4 ./sam/SRR769545_mem_sorted.bam")
Count I (insertions), D (deletions) in CIGAR column (6th coloumn).
“samtools view -F 0x4 SRR769545_mem_sorted.bam” is used to extract the mapped records. Then, the output is tranfered to “cut -f 6” using the pipe symbol “|” to separate the sixth column. The output is then transferred to “grep” to select only the strings that have either the character “D” or “I” using the class pattern “[ID]” to match any of the two characters. Then, the output is transferred to the “tr” command to delete any characters other than “I” and “D”. Finally, the output is transferred to the “wc -c” command to count the remaining characters:
system("samtools view -F 0x4 ./sam/SRR769545_mem_sorted.bam \
| cut -f 6 \
| grep '[ID]' \
| tr -cd 'ID' \
| wc -c ")
To count insertions and deletions separately, use the following, respectively:
Insertions:
system("samtools view -F 0x4 ./sam/SRR769545_mem_sorted.bam \
| cut -f 6 \
| grep 'I' \
| tr -cd 'I' \
| wc -c")
Deletions
system("samtools view \
-F 0x4 ./sam/SRR769545_mem_sorted.bam \
| cut -f 6 \
| grep 'D' \
| tr -cd 'D' \
| wc -c")
Removing Duplicate Reads
system("samtools rmdup \
./sam/SRR769545_mem_sorted.bam \
./sam/SRR769545_mem_rmdup.bam 2> ./sam/SRR769545_mem_rmdup.log")
Descriptive Statistics
# system("samtools flagstat ./sam/SRR769545_mem_sorted.bam")
# system("samtools coverage ./sam/SRR769545_mem_sorted.bam > coverage.txt")
# system("samtools depth ./sam/SRR769545_mem_sorted.bam > depth.txt")
system("brew install bcftools")
system("bcftools mpileup -f ./GRCh38.p13_ref.fna \
./sam/SRR769545_mem_sorted.bam \
| bcftools call -mv -Ob \
-o ./sam/SRR769545_mem.bcf")
system("bcftools convert -O v \
-o ./SRR769545_mem.vcf \
.sam//SRR769545_mem.bcf")
system("brew install tabix")
system("bgzip -c SRR769545.vcf > SRR769545.vcf.gz")
system("tabix -p vcf SRR769545.vcf.gz")
system("cat ./GRCh38.p13_ref.fna \
| bcftools consensus SRR769545.vcf.gz \
> ./SRR769545_genome.fasta ")