This is an R Markdown document where I noted each bioinformatics step I took to process the BgBAR illumina sequence reads to it’s reference genome.
Are there new or previously known genomic region that control resistance in the BgBAR snail?
We did the DNA extraction of 324 snails which includes 144 infected snails and 180 uninfected snails as case and control groups. DNA was extracted from the head foot tissue by CTAB extraction and was quantified by using Qubit fluorometric assay (Thermo Fisher).
We pooled DNA from 18 snails in equal DNA concentration to create each independent subpools.
This made us create 8 subpools of infected and 10 subpools of uninfected snails.
Each subpool was independently barcoded, and the 18 libraries were combined in equimolar concentrations and sent for illumina sequencing at the OSU center for quantitative life science (https://cqls.oregonstate.edu/)
We removed the S.mansoni reads from the DNA libraries of BgBAR snails, by aligning our data to the S.mansoni V9 genome (GCA_000237925.5) using bwa v0.7.17 mem
Removing reads that mapped with CIGAR value > 70 using a custom Perl script
Converted the filtered reads to FASTQ files using BEDtools (version 2.31.1)
We then aligned initially to BS90 reference genome by Bu et al., 2022.
Note: - CIGAR stands for Compact Idiosyncratic Gapped Alignment Report. Its is a string stored in a SAM/BAM file that describes how each read aligns to the reference genome. It tells how many best matched and all.
## Ran a couple of rules in a snakefile which has a config file.
# 1. run command for adaptor to remove adaptors using cutadapt, adaptors are short artifacts sequence added to DNA fragments before sequencing.
SGE_Batch -c "snakemake adaptor_list.txt --core 4" -P 4 -q beagle -r adaptor_Output
input:
DATA + "{sample}_R1_001.fastq",
DATA + "{sample}_R2_001.fastq"
output:
DIRPATH + "noA_{sample}_R1.fastq",
DIRPATH + "noA_{sample}_R2.fastq"
params:
a=ADAPTOR,
b=ADAPTOROPTIONS,
o=DIRPATH + "noA_{sample}_R1.fastq",
p=DIRPATH + "noA_{sample}_R2.fastq",
c=DATA + "{sample}_R1_001.fastq",
d=DATA + "{sample}_R2_001.fastq",
shell:
"{params.a} {params.b} -o {params.o} -p {params.p} {params.c} {params.d}"
# 2. run command for trimming the noAdaptor reads (Reads without adaptors)
SGE_Batch -c "snakemake trimmed_list.txt --cores 4" -P 4 -q beagle -r trimmed_Output1
input:
DIRPATH + "noA_{sample}_R1.fastq",
DIRPATH + "noA_{sample}_R2.fastq"
output:
DIRPATH + "Trimmed_{sample}_R1.fastq",
DIRPATH + "Trimmed_{sample}_unpaired_R1.fastq",
DIRPATH + "Trimmed_{sample}_R2.fastq",
DIRPATH + "Trimmed_{sample}_unpaired_R2.fastq"
params:
t=TRIMPATH,
o=TRIMOPTIONS1,
a=DIRPATH + "noA_{sample}_R1.fastq",
b=DIRPATH + "noA_{sample}_R2.fastq",
c=DIRPATH + "Trimmed_{sample}_R1.fastq",
d=DIRPATH + "Trimmed_{sample}_unpaired_R1.fastq",
e=DIRPATH + "Trimmed_{sample}_R2.fastq",
f=DIRPATH + "Trimmed_{sample}_unpaired_R2.fastq",
p=TRIMOPTIONS2
shell:
" {params.t} {params.o} {params.a} {params.b} {params.c} {params.d} "
"{params.e} {params.f} {params.p}"
#(Unpaired R1 and R2 exist in the output file because filtering removes reads independently, breaking some original pairs—but the remaining reads still contain valid biological information and should be kept.)
# 3. run this command for alignment
SGE_Batch -c "snakemake align_sam_list.txt --core 4" -P 4 -q beagle -r align_sam_Output
# Alignment of short DNA reads [The BgBAR reads "(R1.fastq, R2.fastq)] + reference genome dependencies" which outputs "{sample}.sam file"
input:
GENOME,
GENOME + ".amb",
GENOME + ".ann",
GENOME + ".bwt",
GENOME + ".pac",
GENOME + ".sa",
DIRPATH + "Trimmed_{sample}_R1.fastq",
DIRPATH + "Trimmed_{sample}_R2.fastq"
output:
BWA + "{sample}.sam"
params:
b=BWAPATH,
o=BWAOPTIONS,
g=GENOME,
d=DIRPATH + "Trimmed_{sample}_R1.fastq",
e=DIRPATH + "Trimmed_{sample}_R2.fastq",
s=BWA + "{sample}.sam"
shell:
"{params.b} "
"{params.o} "
"{params.g} "
"{params.d} "
"{params.e} > "
"{params.s}"
# SAM is an output file which means Sequence Alignment/Map. A human readable, tab-limited file used to store DNA sequencing read alignments against a reference genome.
# 4. This filters the .sam file to schistofiltered.bam file. We removed secondary and supplementary reads using samtools v1.3
SGE_Batch -c "snakemake schistofiltered_bam_list.txt --core 4" -P 4 -q beagle -r schistofiltered_bam_Output
# FILTER {sample}.sam {sample}_filtered.sam
input:
BWA + "{sample}.sam"
output:
BWA + "{sample}_schistofiltered.bam"
params:
f=FILTER,
i=BWA + "{sample}.sam",
o=BWA + "{sample}_schistofiltered.bam",
s=SAMTOOLS
shell:
"{params.s} view -h {params.i} | "
"perl {params.f} -m 70 | "
"{params.s} view -f 1 -h -F 2304 -bS -o " #options to retain only Bg snail reads
"{params.o}"
#BAM is an output file which means Binary Alignment Map, its a compressed SAM file, and it's machine-readable
# 5. We converted the schistofiltered.bam files to fastq files (read 1 and 2 because its paired-end reads) using bedtools.
# First convert to .qsortedbam files because it is essential that the BAM file is sorted by query name rather than by chromosomal position, ensuring that mate pairs are adjacent in the BAM file, which allows bedtools to correctly pair them into separate R1 and R2.
# Then qsortedbam was converted to R1.fastq and R2.fastq because its paired-end reads as a diploid organism.
#I made a shell script to convert bam to fastq file. And that was incorporated into the snakefile, located here:
/nfs6/IB/Blouin_Lab/users/seun/data/biomphalaria/BgBAR_GWAS/Schisto_filter/results/bam2fastq.sh
# ran this code:
SGE_Batch -c "snakemake bamtofastq_list.txt --core 4" -P 4 -q beagle -r bamtofastq_output
rule sorted_schistofiltered_bam:
# 5a. sort the bam files using samtools sort -n -o input_qsort.bam input.bam
input:
BWA + "{sample}_schistofiltered.bam"
output:
BWA + "{sample}_schistofiltered.qsort.bam"
params:
s=SAMTOOLS
shell:
"{params.s} sort -n -o {output} {input}"
# 5b. convert to fastq with bedtools bamtofastq -i input_qsort.bam -fq _R1.fq -fq2 _R2.fq
input:
bam=BWA + "{sample}_schistofiltered.qsort.bam"
output:
r1=BWA + "{sample}_R1.fastq",
r2=BWA + "{sample}_R2.fastq"
params:
b=BEDTOOLS
shell:
"{params.b} bamtofastq -i {input.bam} -fq {output.r1} -fq2 {output.r2}"
NOW WE HAVE A CLEAN BgBAR SNAIL DNA SEQUENCES
input file were R1.fastq and R2.fastq files
We then generated pileup files using Samtools version 1.3 [Li et al., 2009], and variants for Fst analysis were filtered for a minimum mean allele frequency of 0.05 and a minimum mean depth of 5.
Overlapping windows of 10 kb by 5kb with at least 150 SNPs in each window were used for visualization of Fst and heterozygosity (expected heterozygosity per segregating site, calculated using all the samples together).
# 1. set up dependencies of the reference genome for alignment, which is the bwa index of the genome
rule index:
"""Setup bwa index of genome"""
input:
GENOME
output:
GENOME + ".amb",
GENOME + ".ann",
GENOME + ".bwt",
GENOME + ".pac",
GENOME + ".sa"
params:
i=BWAINDEX,
g=GENOME
shell:
"{params.i} "
"{params.g}"
# 2. Align to reference genome and outputs sam files
rule align_sam: # assuming Illumina data already filtered in previous alignment
# BWAPATH BWAOPTIONS GENOME DATAPATH/Trimmed_{sample}_R1_schistofiltered.fastq DATAPATH/Trimmed_{sample}_R2_schistofiltered.fastq > {sample}.sam
# -P 4
input:
GENOME,
GENOME + ".amb",
GENOME + ".ann",
GENOME + ".bwt",
GENOME + ".pac",
GENOME + ".sa",
TRIM + "{sample}_R1.fastq",
TRIM + "{sample}_R2.fastq"
output:
BWA + "{sample}.sam"
params:
b=BWAPATH,
o=BWAOPTIONS,
g=GENOME,
d=TRIM + "{sample}_R1.fastq",
e=TRIM + "{sample}_R2.fastq",
s=BWA + "{sample}.sam"
shell:
"{params.b} "
"{params.o} "
"{params.g} "
"{params.d} "
"{params.e} > "
"{params.s}"
# 3. """Filter out secondary alignments with perl"""
rule filter:
# FILTER {sample}.sam {sample}_filtered.sam
input:
BWA + "{sample}.sam"
output:
BWA + "{sample}_filtered.sam"
params:
f=FILTER,
i=BWA + "{sample}.sam",
o=BWA + "{sample}_filtered.sam"
shell:
"{params.f} "
"{params.i} "
"{params.o}"
# 4. """Filtered sam to bam"""
rule samtobam_filter:
# SAMTOOLS view -bT GENOME {sample}_filtered.sam > {sample}_filtered.bam
input:
BWA + "{sample}_filtered.sam"
output:
BWA + "{sample}_filtered.bam"
params:
t=SAMTOOLS,
g=GENOME,
i=BWA + "{sample}_filtered.sam",
o=BWA + "{sample}_filtered.bam"
shell:
"{params.t} view -bT "
"{params.g} "
"{params.i} > "
"{params.o} "
# 5. """Sort filtered bam files...""" needed for pileup and downstream analysis because they require coordinate-sorted data.
rule sort_filter:
# SAMTOOLS sort -o {sample}_filtered_sort.bam -T {sample}_filtered_sort {sample}_filtered.bam
input:
BWA + "{sample}_filtered.bam"
output:
BWA + "{sample}_filtered_sort.bam"
params:
s=SAMTOOLS,
o=BWA + "{sample}_filtered_sort.bam",
t=BWA + "{sample}_filtered_sort",
i=BWA + "{sample}_filtered.bam"
shell:
"{params.s} sort -o "
"{params.o} -T "
"{params.t} "
"{params.i}"
# 6. Optional but recommended: index the sorted bam files, which allows for fast random access to the data, which is essential for efficient retrieval of alignments overlapping specific genomic regions during pileup and variant calling.
rule samtools_index:
input:
BWA + "{sample}_filtered_sort.bam"
output:
BWA + "{sample}_filtered_sort.bam.bai"
params:
s=SAMTOOLS
shell:
"{params.s} index {input}"
# 7. Create mpileup file generated by samtools mpileup. it transposes a bam file from reads oriented to genome-coordinate oriented rows, with information about all reads bp underlying that position.
rule pileup:
# SAMTOOLS mpileup -t DP -A -o {sample}.pileup.txt -f GENOME BAMLIST
input:
bam=expand(BWA + "{sample}_filtered_sort.bam", sample=SAMPLES),
output:
"Bu_BS90_2022_masked_GWAS.pileup.txt"
params:
s=SAMTOOLS,
o="Bu_BS90_2022_masked_GWAS.pileup.txt",
g=GENOME,
shell:
"{params.s} mpileup -t DP -A -o "
"{params.o} -f "
"{params.g} "
"{input.bam}"
# 8. Make frequency table from the pileup file, which is used for Fst calculation. It calculates allele frequencies at each position in the pileup file, which is essential for downstream population genetic analyses like Fst calculation.
rule freq_table2:
# perl FREQTABLE -g {sample}.pileup.txt FREQTABLEOPTIONS2 -n {sample_list} -o Freqs_{sample}.pileup.txt
input:
"Bu_BS90_2022_masked_GWAS.pileup.txt",
output:
"Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
params:
f=FREQTABLE,
p="Bu_BS90_2022_masked_GWAS.pileup.txt",
q=FREQTABLEOPTIONS2,
s=expand("{sample}", sample=SAMPLES),
o="Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
shell:
"perl {params.f} -g "
"{params.p} "
"{params.q} {params.s}"
" -o {params.o}"
NOW WE HAVE THE FREQUENCY TABLE
The FST window compares the pools of 10 subpools of uninfected snail with 8 subpools of infected snails, like a pair-wise calculation, to provide the mean fst window, 10Kb by 5Kb, meaning it starts from 0-10,000bps, then 5,000 to 15,000bps, then 10,000bps to 20,000bps and 15kb to 25kb and so on.This way, there is overlap across the genome, so that we don’t miss any narrow peaks.
Converted the Freq_Bu_BS90_2022_masked_GWAS.pileup.txt file to exclude the header
# 1. Make the Mean Fst_window table
rule fst_window:
# perl FstFromFreqTableWindow.pl -f /path/to/Freqs_GWAS_data.pileup.txt -a 4,12 -b 2,6
# R = "4,12,14,16,18,20,22,30,34,36"
# S = "2,6,8,10,24,26,28,32"
input:
"noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
params:
p=FSTW,
f="noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt",
a=R,
b=S,
output:
"Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
shell:
"perl {params.p} -f "
"{params.f} -a {params.a} -b {params.b}"
After the mean FST window is complete, we merged it to Bu_BS90 chromosome (Ref genome chromosome) inorder to know which contig from the Mean_Fst file has the right chromosome and then we are able to grapth it (Making the Fst Plot on R).
Then we sorted the output file into chromsome order.
cat Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt | sort -k8,8n -k 6,6d > Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt &
#k8 means column with Chromosome number
#k6 means column with Chrom with Chr1a (scaffold labeling)
# This final file was downloaded and used for R analysis
We run randomized samples which is the (mix of infected and uninfected samples). Going back to the Mean_Fsts_Windows steps of the pipeline but using randomized mixed libraries.
The final files were also downloaded and used on R.
To create the signature SNPs at Chr 16, we needed to run Fst and Fst all without window.
Fst reports more significant SNPs, because we had an Fst threshold of 0.2 and above. While Fst all reported all SNPs.
#Fst no window (Individual Fst with 0.2 threshold)
rule fst:
input:
"noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
params:
p=FST,
f="noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt",
a=R,
b=S,
output:
"High_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
shell:
"perl {params.p} -f "
"{params.f} -a {params.a} -b {params.b} "
"-d 5 -t 0.2 -n 5"
#Fst_all no window (Individual Fst, no threshold for Fst reporting all SNPs)
rule fst_all:
input:
"noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
params:
p=FSTA,
f="noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt",
a=R,
b=S,
output:
"All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt"
shell:
"perl {params.p} -f "
"{params.f} -a {params.a} -b {params.b} "
"-d 5 -n 5"
#This was very useful for the identification of the SNPs peak after finding out the window with the highest peak from our Plot on R
To get this done, we first used the (Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt.sorted), that was merged with fst window output to figure out the high fsts (in fst window) so that I can get the window which i would be looking into (in the fst no window output)
print out the lines of SNPs in the region encompassing my peak.
# 1. Firstly: use this file to access the highest snps with the highest fst value to locate the window
Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt.sorted.txt
#running this code:
sort -k3,3nr -k8,8d Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS_RS.pileup.txt.sorted.txt | awk '{if ($3>0.05) {print $0}}' | less
#where k3 and k8 represents columns 3 and 8 with the header title fst value and chromosome number
#output looks like this:
Output:
JAKZJK010000037.1 8760000 0.185518663660096 9 0.332774614545347 Chr16b 25 16
#This output tells us the contig, coordinate/window and chromosome to look into
# 2. Secondly: Find the highest fst within contig 37 (in the fst window file), and the window is 8760000 and save file as:
peak_37.1_Chr16_0.05thresh_Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS_RS.pileup.txt.sorted.txt
#running this code and saved it:
sort -k3,3nr -k8,8d Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS_RS.pileup.txt.sorted.txt | grep 'JAKZJK010000037.1' | awk '{if ($3>0.05) {print $0}}' | less > peak_37.1_Chr16_0.05thresh_Bu_B_Chrs_Mean_Fsts_Windows_10000_noheader_Freqs_Bu_BS90_2022_masked_GWAS_RS.pileup.txt.sorted.txt
# 3. Thirdly: I checked the same window 8760000 in the All Fst file, since that's the one with all the snps.
# running this code and saved it:
less All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt | grep "JAKZJK010000037.1" > peak_37.1_All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt
# 4. I went into the contig 37 All_Fsts file just created, and grep out the snps within window 8720000 and 8820000 because that's the window with the highest fsts.
# running this code and saved it:
awk '$2 >= 8720000 && $2 <= 8820000' peak_37.1_All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt | less > peak_37.1_8760000_All_Fsts_noheader_Freqs_Bu_BS90_2022_masked_GWAS.pileup.txt
# 5. Then open the output file on excel for easier sorting and visualization. And then use primer web3 to design primers.