Except for the sequencing applications that use de novo genome assembly, read alignment/mapping to a reference genome is the most fundamental step in the workflow of the sequencing data analysis.

# setwd("/Users/nnthieu/refgenome")

Download data

system("mkdir data")
system("cd data")
system("prefetch SRR769545")
system("fasterq-dump SRR769545.sra")

# instead of fasterq-dump --verbose SRR769545 causing some code 3 errors

The outputs are 2 file: SRR769545_1.fastq and SRR769545_2.fastq, total 22G. It is large, so I zip them to save storing capacity.

system("gzip SRR769545_1.fastq")
system("gzip SRR769545_2.fastq")

The output files are:

SRR769545_1.fastq.gz SRR769545_2.fastq.gz

Sequence quality checking

system("fastqc SRR769545_1.fastq.gz SRR769545_2.fastq.gz")

Pereparing reference data

system("bwa index GRCh38.p13_ref.fna")

The output files stored in directory ‘BWAIndex’ are:

GRCh38.p13_ref.fna.amb

GRCh38.p13_ref.fna.ann

GRCh38.p13_ref.fna.bwt

GRCh38.p13_ref.fna.pac

GRCh38.p13_ref.fna.sa

BWA mem

# first input file is indexing 5 files in BWAIndex folder
# second and third input files are .gz files

system("bwa mem \
   -t 4 \
   ./BWAIndex/GRCh38.p13_ref.fna \   
   ./data/SRR769545_1.fastq.gz \       
   ./data/SRR769545_2.fastq.gz \
   > ./sam/SRR769545_mem.sam 2> ./sam/SRR769545_mem.log")

Display SAM file

system("cd sam")
system("less -S SRR769545_mem.sam")

Convert sam file to bam file then remove sam file to save storing space and memmory.

system("samtools view \
-uS \
-o SRR769545_mem.bam \
SRR769545_mem.sam")

system("rm SRR769545_mem.sam")

bwa-mem2

A new version of bwa-mem is bwa-mem2 that runs for the same algorithm but faster and the indexing occupies less storage space and memory.

# install bwa-mem2
system("conda install -c bioconda bwa-mem2")

# runing bwa
system("bwa bwasw \
   ./BWAIndex/GRCh38.p13_ref.fna \   
   ./data/SRR769545_1.fastq.gz \       
   ./data/SRR769545_2.fastq.gz \
   > ./sam/SRR769545_bwasw.sam 2> sam/SRR769545_bwasw.log")

bwa aln

use “bwa aln” to perform the first step of the alignment

system("cd ../")

system("bwa aln \
./BWAIndex/GRCh38.p13_ref.fna \
./data/SRR769545_1.fastq.gz \
> ./data/SRR769545_1.sai")

system("bwa aln \
./BWAIndex/GRCh38.p13_ref.fna \
./data/SRR769545_2.fastq.gz \
> ./data/SRR769545_2.sai ")

Then, I can use “bwa sampe” to generate the SAM file for the alignments.

system("bwa sampe \
./BWAIndex/GRCh38.p13_ref.fna \
./data/SRR769545_?.sai \
./data/SRR769545_?.fastq.gz \
> ./sam/SRR769545_aln.sam 2> ./sam/SRR769545_aln.log")

bowtie2

system("bowtie2-build \
--threads 4 \
./BWAIndex/GRCh38.p13_ref.fna \
./bowtie2")

system("bowtie2 -x ./bowtie2 \
-1 ./data/SRR769545_1.fastq.gz \
-2 ./data/SRR769545_2.fastq.gz \
-S ./sam/SRR769545_bowtie2.sam")

# or create bam file wif using -b option
system("bowtie2 -x ./bowtie2 \
-1 ./data/SRR769545_1.fastq.gz \
-2 ./data/SRR769545_2.fastq.gz \
-b ./sam/SRR769545_bowtie2.bam")

The output is /sam/SRR769545_bowtie2.sam.

Then I convert this bowtie2 sam file to bam file and remove the sam file to save store space as follow:

system("samtools view \
-uS \
-o SRR769545_bowtie2.bam \
SRR769545_bowtie2.sam")

system("rm SRR769545_bowtie2.sam")

Manupulating bam files

sort the bam file by TAGSs (or by read names -n)

system("samtools sort \
-@ 4 \
-T mem.tmp.sort \
-o .sam/SRR769545_mem_sorted.bam \
./sam/SRR769545_mem.bam")

system("rm ./sam/SRR769545_mem.bam")

Indexing the sorted.bam file

system("samtools index ./sam/SRR769545_mem_sorted.bam")

Extracting Alignments of a Chromosome With the following command, you can split the alignments of a chromosome 11 in a separate file using “samtools view”

system("samtools view .sam/SRR769545_mem_sorted.bam NC_000011.10 > # .sam/chr11_human.sam")

Filtering and Counting Alignment in SAM/BAM Files

# search for the alignments with chimeric reads, which are tagged as “SA:”
system("samtools view ./sam/SRR769545_mem_sorted.bam | grep ‘SA:’ | less -S")
system("samtools view ./sam/SRR769545_mem_sorted.bam | grep ‘SA:’ | wc -l")

Count total reads of the file

# counting number of reads
system("samtools view -c ./sam/SRR769545_mem_sorted.bam") 

To count mapped reads, use the “-F” option

system("samtools view -c -F 0x4 ./sam/SRR769545_mem_sorted.bam")

To count unmapped reads, use the “-f” option instead of -F

system("samtools view -c -f 0x4 ./sam/SRR769545_mem_sorted.bam")

Count I (insertions), D (deletions) in CIGAR column (6th coloumn).

“samtools view -F 0x4 SRR769545_mem_sorted.bam” is used to extract the mapped records. Then, the output is tranfered to “cut -f 6” using the pipe symbol “|” to separate the sixth column. The output is then transferred to “grep” to select only the strings that have either the character “D” or “I” using the class pattern “[ID]” to match any of the two characters. Then, the output is transferred to the “tr” command to delete any characters other than “I” and “D”. Finally, the output is transferred to the “wc -c” command to count the remaining characters:

system("samtools view -F 0x4 ./sam/SRR769545_mem_sorted.bam \
| cut -f 6 \
| grep '[ID]' \
| tr -cd 'ID' \
| wc -c ")

To count insertions and deletions separately, use the following, respectively:

Insertions:

system("samtools view -F 0x4 ./sam/SRR769545_mem_sorted.bam \
| cut -f 6 \
| grep 'I' \
| tr -cd 'I' \
| wc -c")

Deletions

system("samtools view \
-F 0x4 ./sam/SRR769545_mem_sorted.bam \
| cut -f 6 \
| grep 'D' \
| tr -cd 'D' \
| wc -c")

Removing Duplicate Reads

system("samtools rmdup \
./sam/SRR769545_mem_sorted.bam \
./sam/SRR769545_mem_rmdup.bam 2> ./sam/SRR769545_mem_rmdup.log")

Descriptive Statistics

# system("samtools flagstat ./sam/SRR769545_mem_sorted.bam")
# system("samtools coverage ./sam/SRR769545_mem_sorted.bam > coverage.txt")
# system("samtools depth ./sam/SRR769545_mem_sorted.bam > depth.txt")

Reference-guided genome assembly

system("brew install bcftools")

system("bcftools mpileup -f ./GRCh38.p13_ref.fna \
./sam/SRR769545_mem_sorted.bam \
| bcftools call -mv -Ob \
-o ./sam/SRR769545_mem.bcf")

system("bcftools convert -O v \
-o ./SRR769545_mem.vcf \
.sam//SRR769545_mem.bcf")
system("brew install tabix")
system("bgzip -c SRR769545.vcf > SRR769545.vcf.gz")
system("tabix -p vcf SRR769545.vcf.gz")
system("cat ./GRCh38.p13_ref.fna \
| bcftools consensus SRR769545.vcf.gz \
> ./SRR769545_genome.fasta ")