Introduction to Linux and command lines for NGS data analysis

0.1 NGS

0.1.1 Downloading files samples from NCBI

NCBI offers a number of tools to download files. In most cases if we have a few files, we can just use the web interfaces to download them. You can also use the batchentrez tool which allows you to upload a text file with the sample IDs and you can initiate a batch download.

However, sometimes the destination of the files is a remote server. We could save time by downloading directly to the remote server via the terminal.

Example:

Normally, it starts it starts by reading an article of interest, for example;

Detection of a SARS-CoV-2 variant of concern in South Africa

Research article

As we read, we realize that the article has made the data publicly available

Research article data availability

We click on the BioProject accession, in this case PRJNA694014 and this takes us to the project page on NCBI SRA.

Steps…

Now NCBI provides the SRA toolkit which is a suite of tools to interact with SRA. We need to ensure that we have downloaded an installed it on our system. If not yet installed, follow the URL https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?view=software to download and install the appropriate version of the SRA Toolkit for your operating system.

The link takes us to NCBI’s Github repository, which has among other tools the NCBI SRA Toolkit that we are interested in. Download the toolkit to you system.

To be able to access the executables from any location, we will add the path to our PATH variable using the export keyword.


echo "export PATH=/Applications/biotools/sratoolkit/bin:$PATH" >> ~/.bashrc

source ~/.bashr     #run once for an already active shell. On next login, ~/.bashrc will already have been read.

export -p           #we can check the currently exported variables here.

Now that our environment is fully set up and we have the accession list, let us initiate download from SRA in our terminal.


[ ! -d "sra_download " ] && mkdir sra_download

cd sra_download

prefetch --option-file SRR_Acc_List5.txt

The downloaded files are in the form sra-id.sra, let us convert them into a more accessible format.


ls SRR13620072

ls -1d SRR136200* > files.txt  #list directories as files

# fastq-dump --split-files SRR13620072/SRR13620072.sra
parallel 'fastq-dump --split-files {}/{}.sra' ::: $(cat files.txt)

## Our files now have now been split. Lets check them quickls to see that they are all fine.
ls -alh *.fastq

gzip *.fastq           #compress fastqs to save on storage

rename -- _ _R *.gz   #Rename to add the read bit. Some pipelines are designed to require it.

0.1.1.1 Exercise

Create a commandline script to download some data from Ensembl and Uniprot.

0.2 Raw-reads (fastq) pre-processing

0.2.1 Generating sequence quality reports

To generate sequence quality statistics for our sequence, we will use fastqc and multiqc.

## cod

cd sra_donwload

fastqc -o qc *.gz

cd qc && multiqc .

0.2.2 Filter and trim sequences

Poor confidence base calls can lead to the detection of false-positive variants, so they need to be removed. Reads that are too short are likely to align to multiple regions in the genome and cause poor mapping metrics. Generally, it is good practice to filter out the very short reads e.g. less than 50 base pairs long or if the sequence depth is high enough, anything less than 100 base pairs.

Adaptor sequences ligated to the ends of libraries during the library preparation process also need to be removed from the sequencing reads as they can interfere with mapping and assembly. And ass you will have noticed, sequencing qualiy tends to drop at the tail ends of the sequence. We need to remove these poor-quality bases from the ends of reads.

There are several tools that can be used for trim the low quality reads and adapter sequences including Trimmomatic, TrimGalore and Cutadapt. Trimgalore performs adapter removal and quality filtering via calling the Cutadapt and Fastqc tools. Trimmomatic has two trimming modes: ‘adapters and SW’ mode and ‘adapters and MI’ mode. In ‘adapters and SW’ mode, a sliding window approach is used to remove read bases that have a low sequencing quality

## Trimgalore
trim_galore -h #access help documentation.
trim_galore --paired --three_prime_clip_R1 15 --three_prime_clip_R2 15 *.fastq.gz *.R2.fastq.gz ## Primer sequences are autodetected if not provided


## Cutadapt
cutadapt -a AACCGGTT -o output.fastq.gz input.fastq.gz

## Trimmomatic - trims any risidual adapter sequences off the reads and performs basic filtering of reads to remove low quality reads and short reads

trimmomatic_path=/biotools/Trimmomatic-0.38
parallel 'java -Xmx4000m -jar ${trimmomatic_path}/trimmomatic-0.38.jar PE -threads 20 {}_1.fastq.gz {}_2.fastq.gz {}_pair_F.fastq.gz {}_unpair_F.fastq.gz {}_pair_R.fastq.gz {}_unpair_R.fastq.gz ILLUMINACLIP:/biotools/Trimmomatic-0.38/adapters/NexteraPE-PE.fa:2:30:10:2 LEADING:3 TRAILING:3 SLIDINGWINDOW:4:15 MINLEN:36' ::: $(ls *.fastq.gz | rev | cut -c 12- | rev | uniq)

#NexteraPE - could not identify any specific adaptors (grep for known Illumina adaptors or with AdapterRemoval. This application has a function where you can identify adapter sequences if unknown using paired alignment of reads |THIS WORKS ONLY FOR SINGLE ADAPTER SEQUENCES| $AdapterRemoval --identify-adapters --file1 *read1.fq --file2 *read2.fq - No output to disk only to screen)

#LEADING and TRAILING 3 - This is the minimum quality required to keep a base either at the 5' or 3' (LEAD or TAIL) ends of reads. Why? Because the base quality drops off at the ends of reads.

#SLIDINGWINDOW 4:15 - This scans each read with a 4bp sliding window and cuts from the left most base when the average quality per base drops below 15.

#MINLEN 36 - this drops the read if the minimum length of the read is below 36. Why? Short reads are BAD as they can align to multiple sites. You want longer reads so that your reads align at the right place. 

#ILLUMINACLIP: Cut adapter and other Illumina-specific sequences from the read.
#LEADING: Cut bases off the start of a read, if below a threshold quality
#TRAILING: Cut bases off the end of a read, if below a threshold quality
# CROP: Cut the read to a specified length
# HEADCROP: Cut the specified number of bases from the start of the read
#TOPHRED33: Convert quality scores to Phred-33
#TOPHRED64: Convert quality scores to Phred-64

After trimming your sequences, it is necessary to run the QC steps again to ensure that everything went well. i.e. You did not loose too many of your reads or entire sequences altogether. Where necessary, you may relax the threshold to allow more reads to pass the QC steps.

0.3 Working with BAM files

0.3.1 SeqPanther - Generate summary statistiscs from BAM files

Generate summary statistiscs from BAM files using SeqPanther. SeqPanther summarises both mutation and codon counts and percentage frequencies, generates a visualization of the genome coverage and finally, produces a file that can be used to modify sequences.

Note: SeqPanther requires thate the reference sequence id (rid), specified by –rid, is the same in the reference sequence, BAM file and GFF file. It is easiest to modify the reference sequence and GFF so use samtools as spefied below to determine the reference to use. Check BAM for reference

samtools view -h Ko48924_K03455_HIVHXB2CG.bam | head -n 5

seqpanther  codoncounter -bam Ko48924_K03455_HIVHXB2CG.bam -rid "K03455|HIVHXB2CG" -ref K03455.fasta -gff K03455.gff -e0

column -s, -t < codon_output.csv | less -#2 -N -S

0.3.2 Generate minor alleles

Often we have mutations that we are expecting in our consensus sequences and incase these are missing we would like to check for there existence as minority variants.

unzip K02463738-bam.zip
#parallel 'unzip {}-bam.zip' ::: $(ls -1 *.zip | cut -d "-" -f1)

bamtools split -in K02463738-consensus_alignment_sorted.bam -reference
#parallel 'bamtools split -in {}-consensus_alignment_sorted.bam -reference' ::: $(ls -1 *-consensus_alignment_sorted.bam |  cut -d "-" -f1)

## check for indel qualities
samtools view K003734_S64_L001001_sequence-alignment.bam | grep 'B[ID]:Z:'

## if not, run to add quality information needed by lofreq
lofreq indelqual --dindel -f NC_045512.2.fasta  -o K02463738-consensus_alignment_sorted_iq.bam  K02463738-consensus_alignment_sorted.bam
#parallel 'lofreq indelqual --dindel -f {2}  -o {1}-NC_045512.2_1-consensus_alignment_sorted_iq.bam {1}-NC_045512.2_1-consensus_alignment_sorted.bam' ::: $(ls *NC_045512.2*.bam | cut -d "-" -f1)  ::: "NC_045512.2.fasta" #illumina


lofreq call --call-indels --use-orphan -B --min-cov 10 --sig 0.01 -f NC_045512.2.fasta -o K02463738.lf.vcf K02463738-consensus_alignment_sorted_iq.bam  --force-overwrite
#parallel 'lofreq call --call-indels --use-orphan -B --no-default-filter -f {2} -o {1}.lf.vcf {1}-NC_045512.2_1-consensus_alignment_sorted_iq.bam  --force-overwrite' ::: $(ls *_iq.bam | cut -d "-" -f1)  ::: "NC_045512.2_all.fasta"

#Strand Bias (SB):
#  Note  variants are only filtered if their SB pvalue is below the threshold
#  AND 85% of variant bases are on one strand (toggled with --sb-no-compound).


snpEff -ud 0 NC_045512.2 K02463738.lf.vcf > K02463738.lf_ann.vcf
#parallel 'snpEff -ud 0 NC_045512.2 {}.lf.vcf > {}.lf_ann.vcf' ::: $(ls *.lf.vcf | cut -d "." -f 1)

SnpSift extractFields -s " " -e "." K02463738.lf_ann.vcf CHROM POS REF ALT AF DP SB DP4 "ANN[0].EFFECT" "ANN[0].IMPACT" "ANN[0].GENE" > K02463738_sf.table
#parallel 'SnpSift extractFields -s " " -e "." {}.lf_ann.vcf CHROM POS REF ALT AF DP SB DP4 "ANN[0].EFFECT" "ANN[0].IMPACT" "ANN[0].GENE" > {}_sf.table' ::: $(ls *lf_ann.vcf | cut -d "." -f 1)

0.3.3 Regenerate FASTQ from BAM

You may want to perform an analysis that that requires FASTQ as in put but you have only been provided with BAM files. One example is if you would like to remap the files based on different reference. You can convert these to FASTQ as below.

## For Miseq Paired end data
samtools fastq ./K011236-NC_045512.2_1-consensus_alignment_sorted.bam -0 K011236.R0.fq.gz -1 K011236.R1.fq.gz -2 K011236.R2.fq.gz
repair.sh in=K011236.R1.fq.gz in2=K011236.R2.fq.gz out=K011236_R1.rep.fastq.gz out2=K011236_R2.rep.fastq.gz
bwa mem -t 4 -R '@RG\tID:foo\tSM:bar' NC_045512.2.fasta ./K011236_R{1 2}.rep.fastq.gz | samtools sort -@ 4 -o K011236.bwa.sorted.bam



## For ONT data
samtools fastq K010971-NC_045512.2_1-consensus_alignment_sorted.bam -0 K010971.R0.fq.gz -1 K010971.R1.fq.gz -2 K010971.R2.fq.gz
repair.sh in=K010971.R1.fq.gz in2=K010971.R2.fq.gz out=K010971_R1.rep.fastq.gz out2=K010971_R2.rep.fastq.gz
bwa mem -t 4 -R '@RG\tID:foo\tSM:bar' /analyses2/houriiyah/nCoV_genomes/run50_51/NC_045512.2.fasta K010971_R{1 2}.rep.fastq.gz | samtools sort -@ 4 -o ../K010971-NC_045512.2_1-consensus_alignment_sorted.bam

Exercise: Create the parallel version of these scripts your bam files.

0.4 Working with Fasta sequences

0.4.1 Generate QC report of fasta sequences.

There are several tools to generate QC statistics. Different tools have been applied to different pathogens. Since we are working with SARS-CoV-2 data, we will use Nextclade. Nextclade is developed by the Nextstrain team. It has been used extensively during the pandemic. Nextclade supports the analysis of sequences of various pathogens including….You can run it from the commandline or from the web. To run it from the web, visit clades.nextstrain.org. Here I will show you how to run it from the commandline.

Goal: Take the fasta file in folder run469. Generate QC statistics. Use the statics to determine which sequences are of low quality. See Nextclades quality control criteria. Select only the good sequences and rename them.


# How many sequences are in my multifasta
grep ">"run_consensus_v1.fasta | wc -l

# How are they named? 
grep ">"run_consensus_v1.fasta | less

#install nextclade


nextclade dataset list #or visit https://github.com/nextstrain/nextclade_data/tree/release/data/datasets to see available datasets

nextclade dataset get --name 'sars-cov-2' --output-dir '/Applications/data/sars-cov-2'   #Download Nextclade dataset. Change location.

nextclade run \
   --input-dataset sars-cov-2 \
   --output-tsv=output/nextclade.tsv \
     run469/run_consensus_v1.fasta #change to your dataset

0.4.2 Subset multifasta

Once we have determined which sequences are of good quality, we will use seqkit to subset the fasta.


seqkit grep -r -f goodSeqs.txt run_consensus_v1.fasta > run_consensus_v2.fasta

0.4.3 Remove duplicates from fasta file

There are two kinds of duplicates we can expect to find in our files. We can find sequences with exactly same name or sequences that are identical but have different names. It is unlikely that two unrelated sequences will have same name or that two sequences that are not related will be exactly identical. We can use seqkit as below to remove the duplicates.


seqkit rmdup --ignore-case run_consensus_v2.fasta\
        -o run_consensus_v2_uniq.fasta 
        --dup-seqs-file run_consensus_v2_uniq.fasta\
        --dup-num-file run_consensus_v2_uniq.txt

Seqkitalso has the -n or –by-name to find and remove duplicates by full name instead of just id and -s or –by-seq by seq. This is useful if you know that certain sequences meet either of these criteria and you still would like to keep them in the sequence set.

0.4.4 Rename sequences


  cp  run_consensus_v2_uniq.fasta  run_consensus_v3 ## Preserve provenance.
  
  for i in `cat rename.txt`;  
  do  on=$(echo $i | cut -d "," -f1);  
      nn=$(echo $i | cut -d "," -f2);  
      sed -i '' "s#$on#$nn#g" run_consensus_v3.fasta;  
  done

0.5 Resources for further study

Biopython - Python tools for computational molecular biology
SamTools Tutorial
Babraham Bioinformatics training course. Various courses on R, Python, Linux, Perl and Machine Learning.

Introduction to Linux and command lines for NGS data analysis - Part 2

San Emmanuel James

July 04, 2023