Explanation
Question I am adressing:
Pipeline Begins
S.mansoni DNA filtering from Bg snail DNA
Pooled-Sequencing GWAS
Making of the FST 10kb by 5kb sliding window
Merging of mean Fst window to refrence genome chromosome
Running randomized samples
Making Individual fst values for each SNP named Fst and Fst all.
Identifying the window or coordinate with the highest Fst value
Design Primers using primer web3

Explanation

This is an R Markdown document where I noted each bioinformatics step I took to process the BgBAR illumina sequence reads to it’s reference genome.

Question I am adressing:

Are there new or previously known genomic region that control resistance in the BgBAR snail?

Pipeline Begins

DNA extraction and sample preparation

We did the DNA extraction of 324 snails which includes 144 infected snails and 180 uninfected snails as case and control groups. DNA was extracted from the head foot tissue by CTAB extraction and was quantified by using Qubit fluorometric assay (Thermo Fisher).
We pooled DNA from 18 snails in equal DNA concentration to create each independent subpools.
This made us create 8 subpools of infected and 10 subpools of uninfected snails.
Each subpool was independently barcoded, and the 18 libraries were combined in equimolar concentrations and sent for illumina sequencing at the OSU center for quantitative life science (https://cqls.oregonstate.edu/)

S.mansoni DNA filtering from Bg snail DNA.

We removed the S.mansoni reads from the DNA libraries of BgBAR snails, by aligning our data to the S.mansoni V9 genome (GCA_000237925.5) using bwa v0.7.17 mem
Removing reads that mapped with CIGAR value > 70 using a custom Perl script
Converted the filtered reads to FASTQ files using BEDtools (version 2.31.1)
We then aligned initially to BS90 reference genome by Bu et al., 2022.

Note: - CIGAR stands for Compact Idiosyncratic Gapped Alignment Report. Its is a string stored in a SAM/BAM file that describes how each read aligns to the reference genome. It tells how many best matched and all.

CIGAR > 70 means we filtered out reads where a large portion of read was clipped or poorly aligned, it should be kept as snail DNA. i.e if (more than half of the read) does not align well with S.mansoni reference genome, then it should be kept as snail DNA.

S.mansoni DNA filtering from Bg snail DNA

Things to watch out for:

GENOME: directory path to the location of the S.mansoni reference genome
BWA: Path to where the sample outputs will be saved
DATA: Path to the folder where I hadto folder to where I was working
DIRPATH: Path to where I ran all the codes. Current directory for the schistofilter
ADAPTOR: path to where cutadapt is, we used v4.9
ADAPTOR OPTIONS: “-a AAGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG -A GATCGTCGGACTGTAGAACTCTGAACGTGTAGATCTCGGTGGTCGCCGTATCATT”
FLTER: Path to where the filter script was
SAMTOOLS: Path to where the samtools software is located, v 1.3 on cqls local
BEDTOOLS: Path to where the bedtools software is located, v 2.31.1 on cqls local
TRIMPATH: path to where trimmomatic is, reads trimmed in v0.39
TRIMOPTIONS1: “PE -threads 4 -phred33 -trimlog trim.log -summary summary.txt”
TRIMOPTIONS2: “LEADING:20 TRAILING:20 SLIDINGWINDOW:5:20 MINLEN:50”
BWAINDEX: Path to the bwa-v0.7.17/bwa index on cluster
{sample}: represents the whole 18 reads


## Ran a couple of rules in a snakefile which has a config file.

# 1. run command for adaptor to remove adaptors using cutadapt, adaptors are short artifacts sequence added to DNA fragments before sequencing.
SGE_Batch -c "snakemake adaptor_list.txt --core 4" -P 4 -q beagle -r adaptor_Output

 input:
        DATA + "{sample}_R1_001.fastq",
        DATA + "{sample}_R2_001.fastq"
    output:
        DIRPATH + "noA_{sample}_R1.fastq",
        DIRPATH + "noA_{sample}_R2.fastq"
    params:
        a=ADAPTOR,
        b=ADAPTOROPTIONS,
        o=DIRPATH + "noA_{sample}_R1.fastq",
        p=DIRPATH + "noA_{sample}_R2.fastq",
        c=DATA + "{sample}_R1_001.fastq",
        d=DATA + "{sample}_R2_001.fastq",
    shell:
        "{params.a} {params.b} -o {params.o} -p {params.p} {params.c} {params.d}"


# 2. run command for trimming the noAdaptor reads (Reads without adaptors)
SGE_Batch -c "snakemake trimmed_list.txt --cores 4" -P 4 -q beagle -r trimmed_Output1

input:
        DIRPATH + "noA_{sample}_R1.fastq",
        DIRPATH + "noA_{sample}_R2.fastq"
    output:
        DIRPATH + "Trimmed_{sample}_R1.fastq",
        DIRPATH + "Trimmed_{sample}_unpaired_R1.fastq",
        DIRPATH + "Trimmed_{sample}_R2.fastq",
        DIRPATH + "Trimmed_{sample}_unpaired_R2.fastq"
    params:
        t=TRIMPATH,
        o=TRIMOPTIONS1,
        a=DIRPATH + "noA_{sample}_R1.fastq",
        b=DIRPATH + "noA_{sample}_R2.fastq",
        c=DIRPATH + "Trimmed_{sample}_R1.fastq",
        d=DIRPATH + "Trimmed_{sample}_unpaired_R1.fastq",
        e=DIRPATH + "Trimmed_{sample}_R2.fastq",
        f=DIRPATH + "Trimmed_{sample}_unpaired_R2.fastq",
        p=TRIMOPTIONS2
    shell:
        " {params.t} {params.o} {params.a} {params.b} {params.c} {params.d} "
            "{params.e} {params.f} {params.p}"
            
#(Unpaired R1 and R2 exist in the output file because filtering removes reads independently, breaking some original pairs—but the remaining reads still contain valid biological information and should be kept.)


# 3. run this command for alignment
SGE_Batch -c "snakemake align_sam_list.txt --core 4" -P 4 -q beagle -r align_sam_Output

# Alignment of short DNA reads [The BgBAR reads "(R1.fastq, R2.fastq)] + reference genome dependencies" which outputs "{sample}.sam file"
   input:
        GENOME,
        GENOME + ".amb",
        GENOME + ".ann",
        GENOME + ".bwt",
        GENOME + ".pac",
        GENOME + ".sa",
        DIRPATH + "Trimmed_{sample}_R1.fastq", 
        DIRPATH + "Trimmed_{sample}_R2.fastq"
    output:
        BWA + "{sample}.sam" 
    params:
        b=BWAPATH,
        o=BWAOPTIONS,
        g=GENOME,
        d=DIRPATH + "Trimmed_{sample}_R1.fastq",
        e=DIRPATH + "Trimmed_{sample}_R2.fastq",
        s=BWA + "{sample}.sam"
    shell:
        "{params.b} "
            "{params.o} "
            "{params.g} "
            "{params.d} "
            "{params.e} > "
            "{params.s}"
            
# SAM is an output file which means Sequence Alignment/Map. A human readable, tab-limited file used to store DNA sequencing read alignments against a reference genome.
            

# 4. This filters the .sam file to schistofiltered.bam file. We removed secondary and supplementary reads using samtools v1.3 
SGE_Batch -c "snakemake schistofiltered_bam_list.txt --core 4" -P 4 -q beagle -r schistofiltered_bam_Output

# FILTER {sample}.sam {sample}_filtered.sam
    input:
        BWA + "{sample}.sam"
    output:
        BWA + "{sample}_schistofiltered.bam"
    params:
        f=FILTER,
        i=BWA + "{sample}.sam",
        o=BWA + "{sample}_schistofiltered.bam",
        s=SAMTOOLS
    shell:
        "{params.s} view -h {params.i} | "
            "perl {params.f} -m 70 | "
            "{params.s} view -f 1 -h -F 2304 -bS -o " #options to retain only Bg snail reads
            "{params.o}"
          
#BAM is an output file which means Binary Alignment Map, its a compressed SAM file, and it's machine-readable


# 5. We converted the schistofiltered.bam files to fastq files (read 1 and 2 because its paired-end reads) using bedtools.

# First convert to .qsortedbam files because it is essential that the BAM file is sorted by query name rather than by chromosomal position, ensuring that mate pairs are adjacent in the BAM file, which allows bedtools to correctly pair them into separate R1 and R2.
 
# Then qsortedbam was converted to R1.fastq and R2.fastq because its paired-end reads as a diploid organism.

#I made a shell script to convert bam to fastq file. And that was incorporated into the snakefile, located here:
/nfs6/IB/Blouin_Lab/users/seun/data/biomphalaria/BgBAR_GWAS/Schisto_filter/results/bam2fastq.sh

# ran this code:
SGE_Batch -c "snakemake bamtofastq_list.txt --core 4" -P 4 -q beagle -r bamtofastq_output
 
 rule sorted_schistofiltered_bam:
# 5a. sort the bam files using samtools sort -n -o input_qsort.bam input.bam
    input:
        BWA + "{sample}_schistofiltered.bam"
    output:
        BWA + "{sample}_schistofiltered.qsort.bam"
    params:
        s=SAMTOOLS
    shell:
        "{params.s} sort -n -o {output} {input}"

# 5b. convert to fastq with bedtools bamtofastq -i input_qsort.bam -fq _R1.fq -fq2 _R2.fq
    input:
        bam=BWA + "{sample}_schistofiltered.qsort.bam"
    output:
        r1=BWA + "{sample}_R1.fastq",
        r2=BWA + "{sample}_R2.fastq"
    params:
        b=BEDTOOLS
    shell:
        "{params.b} bamtofastq -i {input.bam} -fq {output.r1} -fq2 {output.r2}"
        
NOW WE HAVE A CLEAN BgBAR SNAIL DNA SEQUENCES

Pooled-Sequencing GWAS

Alignment of Filtered BgBAR DNA Sequence from previous step to BS90 reference genome

input file were R1.fastq and R2.fastq files
We then generated pileup files using Samtools version 1.3 [Li et al., 2009], and variants for Fst analysis were filtered for a minimum mean allele frequency of 0.05 and a minimum mean depth of 5.
Overlapping windows of 10 kb by 5kb with at least 150 SNPs in each window were used for visualization of Fst and heterozygosity (expected heterozygosity per segregating site, calculated using all the samples together).

# 1. set up dependencies of the reference genome for alignment, which is the bwa index of the genome
rule index:
    """Setup bwa index of genome"""
    input:
        GENOME
    output:
        GENOME + ".amb",
        GENOME + ".ann",
        GENOME + ".bwt",
        GENOME + ".pac",
        GENOME + ".sa"
    params:
        i=BWAINDEX,
        g=GENOME
    shell:
        "{params.i} "
            "{params.g}"
            
            
# 2.  Align to reference genome and outputs sam files
rule align_sam: # assuming Illumina data already filtered in previous alignment
    # BWAPATH BWAOPTIONS GENOME DATAPATH/Trimmed_{sample}_R1_schistofiltered.fastq DATAPATH/Trimmed_{sample}_R2_schistofiltered.fastq > {sample}.sam
    # -P 4
    input:
        GENOME,
        GENOME + ".amb",
        GENOME + ".ann",
        GENOME + ".bwt",
        GENOME + ".pac",
        GENOME + ".sa",
        TRIM + "{sample}_R1.fastq",
        TRIM + "{sample}_R2.fastq"
    output:
        BWA + "{sample}.sam"
    params:
        b=BWAPATH,
        o=BWAOPTIONS,
        g=GENOME,
        d=TRIM + "{sample}_R1.fastq",
        e=TRIM + "{sample}_R2.fastq",
        s=BWA + "{sample}.sam"
    shell:
        "{params.b} "
            "{params.o} "
            "{params.g} "
            "{params.d} "
            "{params.e} > "
            "{params.s}"
            
            
# 3. """Filter out secondary alignments with perl"""            
rule filter:
    # FILTER {sample}.sam {sample}_filtered.sam
    input:
        BWA + "{sample}.sam"
    output:
        BWA + "{sample}_filtered.sam"
    params:
        f=FILTER,
        i=BWA + "{sample}.sam",
        o=BWA + "{sample}_filtered.sam"
    shell:
        "{params.f} "
            "{params.i} "
            "{params.o}"
            
            
# 4. """Filtered sam to bam"""
rule samtobam_filter:
    # SAMTOOLS view -bT GENOME {sample}_filtered.sam > {sample}_filtered.bam
    input:
        BWA + "{sample}_filtered.sam"
    output:
        BWA + "{sample}_filtered.bam"
    params:
        t=SAMTOOLS,
        g=GENOME,
        i=BWA + "{sample}_filtered.sam",
        o=BWA + "{sample}_filtered.bam"
    shell:
        "{params.t} view -bT "
        "{params.g} "
            "{params.i} > "
            "{params.o} "
            
            
 # 5. """Sort filtered bam files...""" needed for pileup and downstream analysis because they require coordinate-sorted data.
rule sort_filter:
    # SAMTOOLS sort -o {sample}_filtered_sort.bam -T {sample}_filtered_sort {sample}_filtered.bam
    input:
        BWA + "{sample}_filtered.bam"
    output:
        BWA + "{sample}_filtered_sort.bam"
    params:
        s=SAMTOOLS,
        o=BWA + "{sample}_filtered_sort.bam",
        t=BWA + "{sample}_filtered_sort",
        i=BWA + "{sample}_filtered.bam"
    shell:
        "{params.s} sort -o "
            "{params.o} -T "
            "{params.t} "
            "{params.i}" 
            
            
# 6. Optional but recommended: index the sorted bam files, which allows for fast random access to the data, which is essential for efficient retrieval of alignments overlapping specific genomic regions during pileup and variant calling.

rule samtools_index:
    input:
        BWA + "{sample}_filtered_sort.bam"
    output:
        BWA + "{sample}_filtered_sort.bam.bai"
    params:
        s=SAMTOOLS
    shell:
        "{params.s} index {input}"
        

# 7. Create mpileup file generated by samtools mpileup. it transposes a bam file from reads oriented to genome-coordinate oriented rows, with information about all reads bp underlying that position.

rule pileup:
    # SAMTOOLS mpileup -t DP -A -o {sample}.pileup.txt -f GENOME BAMLIST
    input:
        bam=expand(BWA + "{sample}_filtered_sort.bam", sample=SAMPLES),
    output:
        "Bu_BS90_2022_masked_GWAS.pileup.txt"
    params:
        s=SAMTOOLS,
        o="Bu_BS90_2022_masked_GWAS.pileup.txt",
        g=GENOME,
    shell:
        "{params.s} mpileup -t DP -A -o "
            "{params.o} -f "
            "{params.g} "
            "{input.bam}" 


# 8. Make frequency table from the pileup file, which is used for Fst calculation. It calculates allele frequencies at each position in the pileup file, which is essential for downstream population genetic analyses like Fst calculation.

rule freq_table2:
    # perl FREQTABLE -g {sample}.pileup.txt FREQTABLEOPTIONS2  -n {sample_list} -o Freqs_{sample}.pileup.txt
    input:
        "Bu_BS90_2022_masked_GWAS.pileup.txt",
    output:
        "Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    params:
        f=FREQTABLE,
        p="Bu_BS90_2022_masked_GWAS.pileup.txt",
        q=FREQTABLEOPTIONS2,
        s=expand("{sample}", sample=SAMPLES),
        o="Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    shell:
        "perl {params.f} -g "
            "{params.p} "
            "{params.q} {params.s}"
            " -o {params.o}"
            
NOW WE HAVE THE FREQUENCY TABLE

Making of the FST 10kb by 5kb sliding window

The FST window compares the pools of 10 subpools of uninfected snail with 8 subpools of infected snails, like a pair-wise calculation, to provide the mean fst window, 10Kb by 5Kb, meaning it starts from 0-10,000bps, then 5,000 to 15,000bps, then 10,000bps to 20,000bps and 15kb to 25kb and so on.This way, there is overlap across the genome, so that we don’t miss any narrow peaks.
Converted the Freq_Bu_BS90_2022_masked_GWAS.pileup.txt file to exclude the header


# 1. Make the Mean Fst_window table

rule fst_window:
    # perl FstFromFreqTableWindow.pl -f /path/to/Freqs_GWAS_data.pileup.txt -a 4,12 -b 2,6
    # R = "4,12,14,16,18,20,22,30,34,36"
    # S = "2,6,8,10,24,26,28,32"
    input:
        "noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    params:
        p=FSTW,
        f="noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt",
        a=R,
        b=S,
    output:
        "Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    shell:
        "perl {params.p} -f "
            "{params.f} -a {params.a} -b {params.b}"

Merging of mean Fst window to refrence genome chromosome

After the mean FST window is complete, we merged it to Bu_BS90 chromosome (Ref genome chromosome) inorder to know which contig from the Mean_Fst file has the right chromosome and then we are able to grapth it (Making the Fst Plot on R).
Then we sorted the output file into chromsome order.

cat Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt | sort -k8,8n -k 6,6d > Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt &

#k8 means column with Chromosome number
#k6 means column with Chrom with Chr1a (scaffold labeling)

# This final file was downloaded and used for R analysis

Running randomized samples

We run randomized samples which is the (mix of infected and uninfected samples). Going back to the Mean_Fsts_Windows steps of the pipeline but using randomized mixed libraries.
The final files were also downloaded and used on R.

Making Individual fst values for each SNP named Fst and Fst all.

To create the signature SNPs at Chr 16, we needed to run Fst and Fst all without window.
Fst reports more significant SNPs, because we had an Fst threshold of 0.2 and above. While Fst all reported all SNPs.


#Fst no window (Individual Fst with 0.2 threshold)
rule fst:
    input:
        "noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    params:
        p=FST,
        f="noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt",
        a=R,
        b=S,
    output:
        "High_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    shell:
        "perl {params.p} -f "
            "{params.f} -a {params.a} -b {params.b} "
            "-d 5 -t 0.2 -n 5"


#Fst_all no window (Individual Fst, no threshold for Fst reporting all SNPs)
rule fst_all:
    input:
        "noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    params:
        p=FSTA,
        f="noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt",
        a=R,
        b=S,
    output:
        "All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
    shell:
        "perl {params.p} -f "
            "{params.f} -a {params.a} -b {params.b} "
            "-d 5 -n 5" 
 #This was very useful for the identification of the SNPs peak after finding out the window with the highest peak from our Plot on R

Identifying the window or coordinate with the highest Fst value

We grep for SNP lines in my coordinate of interest (which is within window 8,760,000)

To get this done, we first used the (Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt.sorted), that was merged with fst window output to figure out the high fsts (in fst window) so that I can get the window which i would be looking into (in the fst no window output)
print out the lines of SNPs in the region encompassing my peak.

# 1. Firstly: use this file to access the highest snps with the highest fst value to locate the window
Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt.sorted.txt

#running this code:
sort -k3,3nr -k8,8d Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS_RS.pileup.txt.sorted.txt | awk '{if ($3>0.05) {print $0}}' | less 

#where k3 and k8 represents columns 3 and 8 with the header title fst value and chromosome number

#output looks like this: 
Output:
JAKZJK010000037.1       8760000 0.185518663660096       9       0.332774614545347       Chr16b  25      16

#This output tells us the contig, coordinate/window and chromosome to look into


# 2. Secondly: Find the highest fst within contig 37 (in the fst window file), and the window is 8760000 and save file as:
peak_37.1_Chr16_0.05thresh_Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS_RS.pileup.txt.sorted.txt

#running this code and saved it:
sort -k3,3nr -k8,8d Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS_RS.pileup.txt.sorted.txt | grep 'JAKZJK010000037.1' | awk '{if ($3>0.05) {print $0}}' | less >  peak_37.1_Chr16_0.05thresh_Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS_RS.pileup.txt.sorted.txt


# 3. Thirdly: I checked the same window 8760000 in the All Fst file, since that's the one with all the snps. 

# running this code and saved it:
less All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt | grep "JAKZJK010000037.1" > peak_37.1_All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt

# 4. I went into the contig 37 All_Fsts file just created, and grep out the snps within window 8720000 and 8820000 because that's the window with the highest fsts.

# running this code and saved it: 
awk '$2 >= 8720000 && $2 <= 8820000' peak_37.1_All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt | less > peak_37.1_8760000_All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt

# 5. Then open the output file on excel for easier sorting and visualization. And then use primer web3 to design primers.

Design Primers using primer web3

Bioinformatics and data processing

Oluwaseun Adu

2025-02-11