Explanation

This is an R Markdown document where I noted each bioinformatics step I took to process the BgBAR illumina sequence reads to it’s reference genome.

Question I am adressing:

Are there new or previously known genomic region that control resistance in the BgBAR snail?

Pipeline Begins

  1. DNA extraction and sample preparation
  1. S.mansoni DNA filtering from Bg snail DNA.

Note: - CIGAR stands for Compact Idiosyncratic Gapped Alignment Report. Its is a string stored in a SAM/BAM file that describes how each read aligns to the reference genome. It tells how many best matched and all.

S.mansoni DNA filtering from Bg snail DNA

  1. GENOME: directory path to the location of the S.mansoni reference genome
  2. BWA: Path to where the sample outputs will be saved
  3. DATA: Path to the folder where I hadto folder to where I was working
  4. DIRPATH: Path to where I ran all the codes. Current directory for the schistofilter
  5. ADAPTOR: path to where cutadapt is, we used v4.9
  6. ADAPTOR OPTIONS: “-a AAGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG -A GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT”
  7. FLTER: Path to where the filter script was
  8. SAMTOOLS: Path to where the samtools software is located, v 1.3 on cqls local
  9. BEDTOOLS: Path to where the bedtools software is located, v 2.31.1 on cqls local
  10. TRIMPATH: path to where trimmomatic is, reads trimmed in v0.39
  11. TRIMOPTIONS1: “PE -threads 4 -phred33 -trimlog trim.log -summary summary.txt”
  12. TRIMOPTIONS2: “LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:50”
  13. BWAINDEX: Path to the bwa-v0.7.17/bwa index on cluster
  14. {sample}: represents the whole 18 reads

## Ran a couple of rules in a snakefile which has a config file.

# 1. run command for adaptor to remove adaptors using cutadapt, adaptors are short artifacts sequence added to DNA fragments before sequencing.
SGE_Batch -c "snakemake adaptor_list.txt --core 4" -P 4 -q beagle -r adaptor_Output

 input:
        DATA + "{sample}_R1_001.fastq",
        DATA + "{sample}_R2_001.fastq"
    output:
        DIRPATH + "noA_{sample}_R1.fastq",
        DIRPATH + "noA_{sample}_R2.fastq"
    params:
        a=ADAPTOR,
        b=ADAPTOROPTIONS,
        o=DIRPATH + "noA_{sample}_R1.fastq",
        p=DIRPATH + "noA_{sample}_R2.fastq",
        c=DATA + "{sample}_R1_001.fastq",
        d=DATA + "{sample}_R2_001.fastq",
    shell:
        "{params.a} {params.b} -o {params.o} -p {params.p} {params.c} {params.d}"


# 2. run command for trimming the noAdaptor reads (Reads without adaptors)
SGE_Batch -c "snakemake trimmed_list.txt --cores 4" -P 4 -q beagle -r trimmed_Output1

input:
        DIRPATH + "noA_{sample}_R1.fastq",
        DIRPATH + "noA_{sample}_R2.fastq"
    output:
        DIRPATH + "Trimmed_{sample}_R1.fastq",
        DIRPATH + "Trimmed_{sample}_unpaired_R1.fastq",
        DIRPATH + "Trimmed_{sample}_R2.fastq",
        DIRPATH + "Trimmed_{sample}_unpaired_R2.fastq"
    params:
        t=TRIMPATH,
        o=TRIMOPTIONS1,
        a=DIRPATH + "noA_{sample}_R1.fastq",
        b=DIRPATH + "noA_{sample}_R2.fastq",
        c=DIRPATH + "Trimmed_{sample}_R1.fastq",
        d=DIRPATH + "Trimmed_{sample}_unpaired_R1.fastq",
        e=DIRPATH + "Trimmed_{sample}_R2.fastq",
        f=DIRPATH + "Trimmed_{sample}_unpaired_R2.fastq",
        p=TRIMOPTIONS2
    shell:
        " {params.t} {params.o} {params.a} {params.b} {params.c} {params.d} "
            "{params.e} {params.f} {params.p}"
            
#(Unpaired R1 and R2 exist in the output file because filtering removes reads independently, breaking some original pairs—but the remaining reads still contain valid biological information and should be kept.)


# 3. run this command for alignment
SGE_Batch -c "snakemake align_sam_list.txt --core 4" -P 4 -q beagle -r align_sam_Output

# Alignment of short DNA reads [The BgBAR reads "(R1.fastq, R2.fastq)] + reference genome dependencies" which outputs "{sample}.sam file"
   input:
        GENOME,
        GENOME + ".amb",
        GENOME + ".ann",
        GENOME + ".bwt",
        GENOME + ".pac",
        GENOME + ".sa",
        DIRPATH + "Trimmed_{sample}_R1.fastq", 
        DIRPATH + "Trimmed_{sample}_R2.fastq"
    output:
        BWA + "{sample}.sam" 
    params:
        b=BWAPATH,
        o=BWAOPTIONS,
        g=GENOME,
        d=DIRPATH + "Trimmed_{sample}_R1.fastq",
        e=DIRPATH + "Trimmed_{sample}_R2.fastq",
        s=BWA + "{sample}.sam"
    shell:
        "{params.b} "
            "{params.o} "
            "{params.g} "
            "{params.d} "
            "{params.e} > "
            "{params.s}"
            
# SAM is an output file which means Sequence Alignment/Map. A human readable, tab-limited file used to store DNA sequencing read alignments against a reference genome.
            

# 4. This filters the .sam file to schistofiltered.bam file. We removed secondary and supplementary reads using samtools v1.3 
SGE_Batch -c "snakemake schistofiltered_bam_list.txt --core 4" -P 4 -q beagle -r schistofiltered_bam_Output

# FILTER {sample}.sam {sample}_filtered.sam
    input:
        BWA + "{sample}.sam"
    output:
        BWA + "{sample}_schistofiltered.bam"
    params:
        f=FILTER,
        i=BWA + "{sample}.sam",
        o=BWA + "{sample}_schistofiltered.bam",
        s=SAMTOOLS
    shell:
        "{params.s} view -h {params.i} | "
            "perl {params.f} -m 70 | "
            "{params.s} view -f 1 -h -F 2304 -bS -o " #options to retain only Bg snail reads
            "{params.o}"
          
#BAM is an output file which means Binary Alignment Map, its a compressed SAM file, and it's machine-readable


# 5. We converted the schistofiltered.bam files to fastq files (read 1 and 2 because its paired-end reads) using bedtools.

# First convert to .qsortedbam files because it is essential that the BAM file is sorted by query name rather than by chromosomal position, ensuring that mate pairs are adjacent in the BAM file, which allows bedtools to correctly pair them into separate R1 and R2.
 
# Then qsortedbam was converted to R1.fastq and R2.fastq because its paired-end reads as a diploid organism.

#I made a shell script to convert bam to fastq file. And that was incorporated into the snakefile, located here:
/nfs6/IB/Blouin_Lab/users/seun/data/biomphalaria/BgBAR_GWAS/Schisto_filter/results/bam2fastq.sh

# ran this code:
SGE_Batch -c "snakemake bamtofastq_list.txt --core 4" -P 4 -q beagle -r bamtofastq_output
 
 rule sorted_schistofiltered_bam:
# 5a. sort the bam files using samtools sort -n -o input_qsort.bam input.bam
    input:
        BWA + "{sample}_schistofiltered.bam"
    output:
        BWA + "{sample}_schistofiltered.qsort.bam"
    params:
        s=SAMTOOLS
    shell:
        "{params.s} sort -n -o {output} {input}"

# 5b. convert to fastq with bedtools bamtofastq -i input_qsort.bam -fq _R1.fq -fq2 _R2.fq
    input:
        bam=BWA + "{sample}_schistofiltered.qsort.bam"
    output:
        r1=BWA + "{sample}_R1.fastq",
        r2=BWA + "{sample}_R2.fastq"
    params:
        b=BEDTOOLS
    shell:
        "{params.b} bamtofastq -i {input.bam} -fq {output.r1} -fq2 {output.r2}"
        
NOW WE HAVE A CLEAN BgBAR SNAIL DNA SEQUENCES

Pooled-Sequencing GWAS

  1. Alignment of Filtered BgBAR DNA Sequence from previous step to BS90 reference genome
# 1. set up dependencies of the reference genome for alignment, which is the bwa index of the genome
rule index:
    """Setup bwa index of genome"""
    input:
        GENOME
    output:
        GENOME + ".amb",
        GENOME + ".ann",
        GENOME + ".bwt",
        GENOME + ".pac",
        GENOME + ".sa"
    params:
        i=BWAINDEX,
        g=GENOME
    shell:
        "{params.i} "
            "{params.g}"
            
            
# 2.  Align to reference genome and outputs sam files
rule align_sam: # assuming Illumina data already filtered in previous alignment
    # BWAPATH BWAOPTIONS GENOME DATAPATH/Trimmed_{sample}_R1_schistofiltered.fastq DATAPATH/Trimmed_{sample}_R2_schistofiltered.fastq > {sample}.sam
    # -P 4
    input:
        GENOME,
        GENOME + ".amb",
        GENOME + ".ann",
        GENOME + ".bwt",
        GENOME + ".pac",
        GENOME + ".sa",
        TRIM + "{sample}_R1.fastq",
        TRIM + "{sample}_R2.fastq"
    output:
        BWA + "{sample}.sam"
    params:
        b=BWAPATH,
        o=BWAOPTIONS,
        g=GENOME,
        d=TRIM + "{sample}_R1.fastq",
        e=TRIM + "{sample}_R2.fastq",
        s=BWA + "{sample}.sam"
    shell:
        "{params.b} "
            "{params.o} "
            "{params.g} "
            "{params.d} "
            "{params.e} > "
            "{params.s}"
            
            
# 3. """Filter out secondary alignments with perl"""            
rule filter:
    # FILTER {sample}.sam {sample}_filtered.sam
    input:
        BWA + "{sample}.sam"
    output:
        BWA + "{sample}_filtered.sam"
    params:
        f=FILTER,
        i=BWA + "{sample}.sam",
        o=BWA + "{sample}_filtered.sam"
    shell:
        "{params.f} "
            "{params.i} "
            "{params.o}"
            
            
# 4. """Filtered sam to bam"""
rule samtobam_filter:
    # SAMTOOLS view -bT GENOME {sample}_filtered.sam > {sample}_filtered.bam
    input:
        BWA + "{sample}_filtered.sam"
    output:
        BWA + "{sample}_filtered.bam"
    params:
        t=SAMTOOLS,
        g=GENOME,
        i=BWA + "{sample}_filtered.sam",
        o=BWA + "{sample}_filtered.bam"
    shell:
        "{params.t} view -bT "
        "{params.g} "
            "{params.i} > "
            "{params.o} "
            
            
 # 5. """Sort filtered bam files...""" needed for pileup and downstream analysis because they require coordinate-sorted data.
rule sort_filter:
    # SAMTOOLS sort -o {sample}_filtered_sort.bam -T {sample}_filtered_sort {sample}_filtered.bam
    input:
        BWA + "{sample}_filtered.bam"
    output:
        BWA + "{sample}_filtered_sort.bam"
    params:
        s=SAMTOOLS,
        o=BWA + "{sample}_filtered_sort.bam",
        t=BWA + "{sample}_filtered_sort",
        i=BWA + "{sample}_filtered.bam"
    shell:
        "{params.s} sort -o "
            "{params.o} -T "
            "{params.t} "
            "{params.i}" 
            
            
# 6. Optional but recommended: index the sorted bam files, which allows for fast random access to the data, which is essential for efficient retrieval of alignments overlapping specific genomic regions during pileup and variant calling.

rule samtools_index:
    input:
        BWA + "{sample}_filtered_sort.bam"
    output:
        BWA + "{sample}_filtered_sort.bam.bai"
    params:
        s=SAMTOOLS
    shell:
        "{params.s} index {input}"
        

# 7. Create mpileup file generated by samtools mpileup. it transposes a bam file from reads oriented to genome-coordinate oriented rows, with information about all reads bp underlying that position.

rule pileup:
    # SAMTOOLS mpileup -t DP -A -o {sample}.pileup.txt -f GENOME BAMLIST
    input:
        bam=expand(BWA + "{sample}_filtered_sort.bam", sample=SAMPLES),
    output:
        "Bu_BS90_2022_masked_GWAS.pileup.txt"
    params:
        s=SAMTOOLS,
        o="Bu_BS90_2022_masked_GWAS.pileup.txt",
        g=GENOME,
    shell:
        "{params.s} mpileup -t DP -A -o "
            "{params.o} -f "
            "{params.g} "
            "{input.bam}" 


# 8. Make frequency table from the pileup file, which is used for Fst calculation. It calculates allele frequencies at each position in the pileup file, which is essential for downstream population genetic analyses like Fst calculation.

rule freq_table2:
    # perl FREQTABLE -g {sample}.pileup.txt FREQTABLEOPTIONS2  -n {sample_list} -o Freqs_{sample}.pileup.txt
    input:
        "Bu_BS90_2022_masked_GWAS.pileup.txt",
    output:
        "Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    params:
        f=FREQTABLE,
        p="Bu_BS90_2022_masked_GWAS.pileup.txt",
        q=FREQTABLEOPTIONS2,
        s=expand("{sample}", sample=SAMPLES),
        o="Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    shell:
        "perl {params.f} -g "
            "{params.p} "
            "{params.q} {params.s}"
            " -o {params.o}"
            
NOW WE HAVE THE FREQUENCY TABLE

Making of the FST 10kb by 5kb sliding window


# 1. Make the Mean Fst_window table

rule fst_window:
    # perl FstFromFreqTableWindow.pl -f /path/to/Freqs_GWAS_data.pileup.txt -a 4,12 -b 2,6
    # R = "4,12,14,16,18,20,22,30,34,36"
    # S = "2,6,8,10,24,26,28,32"
    input:
        "noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    params:
        p=FSTW,
        f="noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt",
        a=R,
        b=S,
    output:
        "Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    shell:
        "perl {params.p} -f "
            "{params.f} -a {params.a} -b {params.b}"

Merging of mean Fst window to refrence genome chromosome

cat Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt | sort -k8,8n -k 6,6d > Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt &

#k8 means column with Chromosome number
#k6 means column with Chrom with Chr1a (scaffold labeling)

# This final file was downloaded and used for R analysis

Running randomized samples

Making Individual fst values for each SNP named Fst and Fst all.


#Fst no window (Individual Fst with 0.2 threshold)
rule fst:
    input:
        "noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    params:
        p=FST,
        f="noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt",
        a=R,
        b=S,
    output:
        "High_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    shell:
        "perl {params.p} -f "
            "{params.f} -a {params.a} -b {params.b} "
            "-d 5 -t 0.2 -n 5"


#Fst_all no window (Individual Fst, no threshold for Fst reporting all SNPs)
rule fst_all:
    input:
        "noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    params:
        p=FSTA,
        f="noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt",
        a=R,
        b=S,
    output:
        "All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    shell:
        "perl {params.p} -f "
            "{params.f} -a {params.a} -b {params.b} "
            "-d 5 -n 5" 
 #This was very useful for the identification of the SNPs peak after finding out the window with the highest peak from our Plot on R           
            

Identifying the window or coordinate with the highest Fst value

  1. To get this done, we first used the (Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt.sorted), that was merged with fst window output to figure out the high fsts (in fst window) so that I can get the window which i would be looking into (in the fst no window output)

  2. print out the lines of SNPs in the region encompassing my peak.

# 1. Firstly: use this file to access the highest snps with the highest fst value to locate the window
Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt.sorted.txt

#running this code:
sort -k3,3nr -k8,8d Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS_RS.pileup.txt.sorted.txt | awk '{if ($3>0.05) {print $0}}' | less 

#where k3 and k8 represents columns 3 and 8 with the header title fst value and chromosome number

#output looks like this: 
Output:
JAKZJK010000037.1       8760000 0.185518663660096       9       0.332774614545347       Chr16b  25      16

#This output tells us the contig, coordinate/window and chromosome to look into


# 2. Secondly: Find the highest fst within contig 37 (in the fst window file), and the window is 8760000 and save file as:
peak_37.1_Chr16_0.05thresh_Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS_RS.pileup.txt.sorted.txt

#running this code and saved it:
sort -k3,3nr -k8,8d Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS_RS.pileup.txt.sorted.txt | grep 'JAKZJK010000037.1' | awk '{if ($3>0.05) {print $0}}' | less >  peak_37.1_Chr16_0.05thresh_Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS_RS.pileup.txt.sorted.txt


# 3. Thirdly: I checked the same window 8760000 in the All Fst file, since that's the one with all the snps. 

# running this code and saved it:
less All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt | grep "JAKZJK010000037.1" > peak_37.1_All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt

# 4. I went into the contig 37 All_Fsts file just created, and grep out the snps within window 8720000 and 8820000 because that's the window with the highest fsts.

# running this code and saved it: 
awk '$2 >= 8720000 && $2 <= 8820000' peak_37.1_All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt | less > peak_37.1_8760000_All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt

# 5. Then open the output file on excel for easier sorting and visualization. And then use primer web3 to design primers.

Design Primers using primer web3