1 Introduction

This is an R Markdown document written mainly for the population genetic analyses of Perna canaliculus. The documentation also includes useful guidelines on setting up your Ubuntu environment, installing the Stacks pipeline on your local computer, running trial scripts, and sample shell scripts for running jobs on your hpc servers.

1.1 Setting up your Ubuntu

This part will help you set up the Windows Subsystem for Linux (Ubuntu) on your Windows 10 or higher computer. Set up your working environment by installing and updating packages.

  1. Download the Ubuntu application from the Microsoft Store.
  2. Set up your username and password (the password is invisible while you type).
  3. After setting up, install the Windows Terminal application on your computer. This application allows you to have multiple Ubuntu terminal windows.
  4. Open Ubuntu in your Windows Terminal and check your present working directory (run the command pwd).
  5. You will notice that your present working directory looks something like /mnt/c/Users/mendiomi
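Note that WSL exposes your Windows drives under /mnt, so the path above is simply your Windows user folder seen from Ubuntu. A quick sketch (substitute your own username):

    # the Windows C: drive is mounted under /mnt/c
    cd /mnt/c/Users/<your-windows-username>
    pwd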

1.1.1 Set up your environment

  1. Run these commands:

    sudo apt-get update -y

    sudo apt install build-essential

  2. These commands install all the updates and the essential build packages on your Ubuntu. Answer YES to all installation questions.

  3. Install some of the essential packages for Stacks-2.60.

    Install gpp:

    sudo apt-get install -y gpp

    Install the zlib compression library:

    sudo apt-get install zlib1g-dev

    Now, your Ubuntu is ready.
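As a quick sanity check (not part of the original steps), you can confirm the compiler toolchain is in place before building Stacks:

    # both commands should print version information if build-essential installed correctly
    gcc --version
    g++ --version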

1.2 Installing Stacks-2.60 on your local computer

Follow these easy steps to install Stacks-2.60 on your local computer. This will help you run simple commands using the program and run trial executions for your data processing.

  1. Download the tar.gz file from the Stacks website.

  2. Move the tar.gz file to your installation directory.

  3. Open your Ubuntu terminal and change to the directory where you placed the tar.gz file.

  4. Decompress the file:

    tar xfvz stacks-2.60.tar.gz

  5. Change directory:

    cd stacks-2.60/

  6. Configure:

    ./configure --prefix=/mnt/path/to/your/desired/directory

  7. Compile the program, then install it to the --prefix directory you configured above:

    make -j 8

    make install

  8. Check the executables and other files in your directory:

    ls

  9. You should see all of the files and executables in the directory (executables are highlighted in green).

  10. Try to call one of the executables:

    ustacks

    If it does not work, try the full installation path:

    /mnt/path/to/directory/where/you/installed/stacks-2.60/ustacks

    You should see the program information/manual appearing on your terminal.

1.2.1 Set up your $PATH variable

This part will help you set up your $PATH environment variable so you can call your programs (executables) even when you are not in the directory where they are located.

Temporarily manipulate your $PATH variable:

  1. Check your $PATH:

    echo $PATH

    You will see something like this:

    /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/mnt/c/WINDOWS/system32:/mnt/c/WINDOWS:/mnt/c/WINDOWS/System32/Wbem:/mnt/c/WINDOWS/System32/WindowsPowerShell/v1.0/:/mnt/c/GCsolution/Program:/mnt/c/GCMSSO~1/Program:/mnt/c/WINDOWS/System32/OpenSSH/:/mnt/c/Program Files/PuTTY/:/mnt/c/Users/mendiomi/AppData/Local/Microsoft/WindowsApps:/snap/bin

  2. Add /opt/bin to your $PATH:

    PATH=$PATH:/opt/bin

  3. Now, echo your $PATH

    echo $PATH

    You will see something like this:

    /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/mnt/c/WINDOWS/system32:/mnt/c/WINDOWS:/mnt/c/WINDOWS/System32/Wbem:/mnt/c/WINDOWS/System32/WindowsPowerShell/v1.0/:/mnt/c/GCsolution/Program:/mnt/c/GCMSSO~1/Program:/mnt/c/WINDOWS/System32/OpenSSH/:/mnt/c/Program Files/PuTTY/:/mnt/c/Users/mendiomi/AppData/Local/Microsoft/WindowsApps:/snap/bin:/opt/bin

    You will see that /opt/bin has been appended at the end.

  4. Now, do the same with the directory where you installed Stacks:

    PATH=$PATH:/mnt/c/Users/mendiomi/installation/stacks-2.60

    echo $PATH

    /usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/mnt/c/WINDOWS/system32:/mnt/c/WINDOWS:/mnt/c/WINDOWS/System32/Wbem:/mnt/c/WINDOWS/System32/WindowsPowerShell/v1.0/:/mnt/c/GCsolution/Program:/mnt/c/GCMSSO~1/Program:/mnt/c/WINDOWS/System32/OpenSSH/:/mnt/c/Program Files/PuTTY/:/mnt/c/Users/mendiomi/AppData/Local/Microsoft/WindowsApps:/snap/bin:/opt/bin:/mnt/c/Users/mendiomi/installation/stacks-2.60

    You have temporarily added the Stacks directory to your $PATH variable.

  5. Go to your home directory:

    cd ~

  6. Call one of the Stacks programs:

    process_radtags

    You should be able to see the manual of process_radtags running.

Manipulate your $PATH permanently:

  1. Open the .bashrc file using the nano text editor:

    nano .bashrc

  2. Go to the bottom of the file and paste the path of the directory where the executables are located:

    PATH=$PATH:/mnt/c/Users/mendiomi/installation/stacks-2.60

  3. Write out (^O) the file to save it, then exit (^X).

  4. Open another Ubuntu window and try to call your programs.

  5. You can edit the .bashrc file in the same way for your other programs.
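If you prefer not to edit .bashrc by hand, the same permanent change can be made from the shell (a sketch; adjust the path to your own installation directory):

    # append the Stacks directory to $PATH in .bashrc, then reload the file
    echo 'export PATH=$PATH:/mnt/c/Users/mendiomi/installation/stacks-2.60' >> ~/.bashrc
    source ~/.bashrc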

1.3 Mounting external drive (network drives) on WSL Ubuntu

This part guides you through mounting your external or network drives on WSL Ubuntu.

  1. Map your network drives following these instructions.

  2. Create a mount point directory:

    sudo mkdir /mnt/Z

  3. Mount the drive using your network drive link

    sudo mount -t drvfs '\\vuwstocoissrin1.vuw.ac.nz\SBS_Moana_01' /mnt/Z/

  4. Finalize the mounting

    sudo mount -e

  5. You can now access the files on your network drive from Ubuntu WSL.
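To avoid remounting after every restart, you can add the share to /etc/fstab, which WSL processes at startup (a sketch, assuming the same share and mount point as above):

    # append an fstab entry so the share is mounted automatically (run once)
    echo '\\vuwstocoissrin1.vuw.ac.nz\SBS_Moana_01 /mnt/Z drvfs defaults 0 0' | sudo tee -a /etc/fstab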

2 Running the Stacks pipeline

This section will help you process your data in the Stacks pipeline. Specifically, it runs programs such as process_radtags and denovo_map.pl from your local computer. It will also cover ways to demultiplex and clean your data, remove adapter contamination from your sequence data, assess your data with fastqc and multiqc, and optimize parameters in denovo_map.pl.

2.1 Know your data

It is important to know how your data were sequenced and prepared before feeding them into the Stacks pipeline. Basic questions include: Are your sequences single-end or paired-end? Demultiplexed or not? If not, do you have the list of barcodes? What restriction enzymes were used? What is the expected read length? Do you know the adapter used in the sequencing? How many lanes of sequences do you have? And so on. General knowledge of these will give you hints on how to move forward in processing your data in the Stacks pipeline.
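A quick look at the raw reads helps answer several of these questions at once. A minimal sketch (the file name is hypothetical; use one of your own lanes):

    # print the first two FASTQ records; look for inline barcodes and the
    # restriction cut site at the start of each read
    zcat sequence.lane.1.fastq.txt.gz | head -8

    # tally the read lengths of the first 100 records
    zcat sequence.lane.1.fastq.txt.gz | head -400 | awk 'NR%4==2 {print length($0)}' | sort -n | uniq -c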

2.2 Prepare your data input

Before running process_radtags, have the following ready so it runs smoothly:

  1. concatenated sequence lanes
  2. a directory where you can output your demultiplexed data
  3. a barcode file
  4. the restriction enzymes that were used to cut your DNA
  5. the command that you will use (familiarize yourself here)

2.2.1 Concatenate sequence lanes

If you have two or more sequence lanes, it is better to concatenate them into a single file, as processing separate lanes in the Stacks pipeline is more difficult.

  1. First, merge your sequence data according to how they were generated. Single-end sequencing lanes merge into a single file, whereas paired-end sequencing produces forward and reverse reads, so you need to concatenate all forward reads and all reverse reads separately. Some workflows merge them into a single file and truncate the reads to a specific length at the end of the process.

  2. Here is an example of how to concatenate two single-end sequencing lanes:

    cat {sequence.lane.1.fastq.txt.gz} {sequence.lane.2.fastq.txt.gz} > concatenated.lanes.fq.gz

  3. This concatenated file of your sequence reads is now ready for process_radtags.
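For paired-end data the same idea applies, done once per read direction (a sketch with hypothetical file names):

    # concatenate forward (R1) and reverse (R2) reads separately
    cat lane1_R1.fq.gz lane2_R1.fq.gz > concatenated_R1.fq.gz
    cat lane1_R2.fq.gz lane2_R2.fq.gz > concatenated_R2.fq.gz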

2.2.2 Create your barcode file

The barcode file should contain the individual barcodes in the first column and the individual ID names in the second column, delimited by a tab (\t).

An example of a barcode file looks something like this:

GAACTATC	INDIV_1
GAATCGAA	INDIV_2
ATCTTGCG	INDIV_3

Here is an example of an R script to create a barcode file from your Excel or text spreadsheet:

library(readr)

#read your tab-delimited text file containing your barcodes and ID names
barcode <- read_delim("barcode.txt", delim = "\t", 
                      escape_double = FALSE, 
                      trim_ws = TRUE) 

#subset the columns of your barcode and their unique ID tags
barcode_key <- subset(barcode, select=c("barcode", "uidtag")) 

#check
barcode_key 

#if you have two lanes, remove the duplicates
barcode_key <- unique(barcode_key) 

#save your file (tab-delimited, no quotes, no headers or row names)
write.table(barcode_key, file = "barcode_key.txt", 
            append = FALSE, quote = F, sep="\t", 
            eol = "\n", na = "NA", dec = ".", 
            row.names = F, col.names = F, qmethod = c("escape", "double")) 
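If you would rather stay in the shell, an equivalent barcode file can be produced with standard tools (a sketch, assuming the barcode and uidtag values sit in the first two tab-separated columns of barcode.txt; adjust the column numbers to your spreadsheet):

    # drop the header line, keep columns 1-2, and remove duplicate rows
    tail -n +2 barcode.txt | cut -f 1,2 | sort -u > barcode_key.txt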

2.3 Run process_radtags

This section will introduce you to the first program you will encounter in Stacks-2.60. Particularly, it discusses useful parameters for (1) demultiplexing your sequence data; (2) removing low-quality sequence reads; and (3) removing unwanted sequences with attached adapters.

2.3.1 Useful parameters in process_radtags

This section lists the parameters that I found useful for my own dataset, and likely for yours too. The list of all parameters that you can use can be found here.

-p : path to a directory of files 
-o : path to output the processed files
-b : path to a file containing barcodes for this run, omit to ignore any barcoding
-e, --renz_1 : provide the restriction enzyme used (cut site occurs on single-end read)
--renz_2 : if a double digest was used, provide the second restriction enzyme used (cut site occurs on the paired-end read)
-c : clean data, remove any read with an uncalled base
-q : discard reads with low quality scores
-r : rescue barcodes and RAD-Tags
-t : truncate final read length to this value
--adapter_1 : provide adaptor sequence that may occur on the single-end read for filtering

2.3.2 Playing with process_radtags

There are several ways to start playing with your data in process_radtags. Here, I list a few ways to run process_radtags on your local computer so you can familiarize yourself with the parameters.

  1. Demultiplex your data with the read filtering options (-r, -c, -q), using one enzyme:

    process_radtags -p /path/to/your/input/dir -o /path/to/your/output/dir -b path/barcode_key.txt -e pstI -r -c -q

    Expected result:

    728826350 total sequences
    42944821 barcode not found drops (5.9%)
    1660759 low quality read drops (0.2%)
    1854212 RAD cutsite not found drops (0.3%)
    682366558 retained reads (93.6%)

    Note: a good run normally retains more than 80% of the reads.
  2. Demultiplex your data with the read filtering options (-r, -c, -q), using two enzymes:

    process_radtags -p /path/to/your/input/dir -o /path/to/your/output/dir -b path/barcode_key.txt --renz_1 pstI --renz_2 mspI -r -c -q

  3. Demultiplex your data with the read filtering options (-r, -c, -q), using two enzymes, and remove reads with adapter contamination:

    process_radtags -p /path/to/your/input/dir -o /path/to/your/output/dir -b path/barcode_key.txt --renz_1 pstI --renz_2 mspI -r -c -q --adapter_1 AGATCGGAAGAG

    Note: the --adapter_1 option uses the universal Illumina adapter sequence for adapter cleanup.

    Expected result:

    728826350 total sequences
    277056779 reads contained adapter sequence (37.9%) #reads with attached adapter sequences removed
    66186288 barcode not found drops (9.1%)
    1603974 low quality read drops (0.2%)
    1877921 RAD cutsite not found drops (0.3%)
    382101388 retained reads (52.3%)

    Note: the recovery of retained reads is lower (52.3%) because reads with adapter contamination were removed.
  4. Run fastqc to check your reads. Go to Running fastqc and multiqc.

  5. Demultiplex your data as in step 3, and additionally truncate the reads to a uniform length:

    process_radtags -p /path/to/your/input/dir -o /path/to/your/output/dir -b path/barcode_key.txt --renz_1 pstI --renz_2 mspI -r -c -q --adapter_1 AGATCGGAAGAG -t 91

    Note: 91 bp is the most abundant read length in these sequences. Results show about 35% retained reads (280M reads for each sequence file), with about 30% dropped as low-quality reads.
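If you are unsure which read length dominates your own data, a quick tally over one demultiplexed sample can tell you (a sketch with a hypothetical file name):

    # count reads at each length and list the most common lengths first
    zcat /your/output/dir/INDIV_1.fq.gz | awk 'NR%4==2 {print length($0)}' | sort -n | uniq -c | sort -rn | head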

2.3.3 Running fastqc and multiqc

Running fastqc before and after any filtering step is recommended to check the quality and status of your sequences (e.g. adapter contamination, read quality). Here, I will show you how to set up fastqc and multiqc to visualize the quality and status of your sequences.

  1. Install fastqc:

    sudo apt install fastqc

    Check if you have installed it:

    fastqc

    If you are getting a Java error, install the Xming software.

  2. Run fastqc on one of your samples:

    fastqc /your/input/dir/yoursample.gz -o /your/output/dir

  3. Check that your samples have no adapter contamination; and

  4. Check that your sequences are all the same length, because this is important for the denovo_map.pl pipeline.

  5. If your sequences happen to have different read lengths, go to step 5 of Playing with process_radtags. Continue to step 6 if all your sequence reads have the same length.

  6. Install multiqc:

    git clone https://github.com/ewels/MultiQC.git
    cd MultiQC
    pip install .

  7. cd into the MultiQC/ folder and run the multiqc command to check the executable. Check the multiqc documentation here.

  8. Run multiqc on all your fastqc results:

    multiqc data/*_fastqc.zip
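To assess every sample rather than one file at a time, the two tools chain together naturally (a sketch; the directory names are hypothetical):

    # run fastqc on every demultiplexed sample, then summarize all reports with multiqc
    mkdir -p fastqc_results
    for f in /your/output/dir/*.fq.gz; do fastqc "$f" -o fastqc_results; done
    multiqc fastqc_results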

2.4 Parameter optimization (denovo_map.pl)

This part will demonstrate how to get the optimized parameters (the -m, -M, and -n options) for the denovo_map.pl pipeline following the protocols of Paris et al. and Catchen et al. This section is divided into two parts: (1) identify the ideal -m parameter, where the coverage depth of all (or a subset of) samples is more than 10x; and (2) identify which values of the -M and -n parameters recover the highest number of loci and variant sites (SNPs) using the optimal -m value, keeping only the first variant site of every locus (--write_single_snp) and loci present in 80% of samples (-r 0.80). At the end of this segment, we will have optimized values for -m, -M, and -n.

2.4.1 Optimizing -m parameter

2.4.1.1 Using your local computer

  1. Copy the denovo_map.pl executable from the installation subdirectory scripts/denovo_map.pl into the main installation directory /mnt/c/Users/mendiomi/installation/stacks-2.60/. This makes denovo_map.pl available through the $PATH you set up earlier.

  2. Edit the denovo_map.pl file:

    nano denovo_map.pl

  3. Change my $exe_path to the installation directory path:

my $dry_run      = false;
my $exe_path     = "/mnt/c/Users/mendiomi/installation/stacks-2.60/";
my $out_path     = "";
  4. Now, run your commands in the terminal using a popmap that contains a list of subsamples (e.g. 2 samples per location, all assigned to one population) from your dataset:

    for i in {1..10}; do mkdir -p 03_optimization/m$i; denovo_map.pl --samples 02_clean/optimization_samples --popmap 02_clean/optimization_samples/popmap_optimization.txt -o 03_optimization/m$i -m $i -M 2 -n 2 -T 10; done

    Here, we use a value of 2 for both the -M and -n options.

2.4.1.2 Using hpc

  1. Submit a shell script to your hpc using sbatch <file.sh>. Below is a sample script.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --mem=64G
#SBATCH --partition=parallel
#SBATCH --time=2-00:00
#SBATCH --job-name=pcan_opt
#SBATCH -o /nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization/m_optimization.out
#SBATCH -e /nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization/m_optimization.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=michaeljohn.mendiola@vuw.ac.nz

module load stacks/2.2

IN=/nfs/scratch2/mendiomi/01_pcanaliculus/02_clean/01_pcanaliculus_1349
OUT=/nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization
POPMAP=/nfs/home/mendiomi/00_scripts

# running m for 1 to 10

for i in {1..10}; 
do
mkdir -p $OUT/m$i;
denovo_map.pl --samples $IN --popmap $POPMAP/popmap_optimization.txt -o $OUT/m$i -m $i -M 2 -n 2 -T 24; 
done

2.4.1.3 Checking the outputs

  1. Open the denovo_map.log file from every -m run and check the Depths of Coverage for Processed Samples section, where you can find the depth of coverage for every sample.

  2. Extract the information using this simple command:

    cat 03_optimization/m1/denovo_map.log | perl -ne 'BEGIN { $/="\n\n" }; print if $_ =~ /Depths/;'

    Basically, the command extracts the section starting with 'Depths' and prints the depth of coverage for all your samples. You can write it to a text file by adding > out.txt at the end of the command, and/or create a for loop to obtain the depths of coverage from all your m runs, as sketched below.
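    A sketch of such a loop, assuming the directory layout used above (it appends the Depths section of every m run to a single file):

    # collect the Depths of Coverage section from every m run
    for i in {1..10}; do
      echo "== m$i ==" >> all_depths.txt
      cat 03_optimization/m$i/denovo_map.log | perl -ne 'BEGIN { $/="\n\n" }; print if $_ =~ /Depths/;' >> all_depths.txt
    done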

  3. Create a .csv file using this format:

    opt,depth
    1,10.13
    1,10.02
    1,2.72
    1,6.18
    ...
    10,10.31
    10,7.8

    where opt is the value of m and depth is the depth of coverage per sample.

  4. Identify the optimal value of m by checking at which value all samples reach more than 10x depth of coverage.
  5. Plot the depths of coverage in R:
#set working directory (adjust the path to your own machine)
setwd("C:/Users/mendiomi/OneDrive - Victoria University of Wellington - STAFF/Documents/Moana Project/Data/01_optimization/")

library(tidyverse)
library(hrbrthemes)
library(viridis)
library(readr)

m_covDepth <- read_csv("m_covDepth.csv")
m_covDepth$opt <- as.character(m_covDepth$opt)
#View(m_covDepth) 

# ggPlot depth of coverage
m_covDepth %>%
  ggplot( aes(x=fct_inorder(opt), y=depth, fill=opt)) +
  geom_boxplot(outlier.size = 0.5) +
  scale_fill_viridis(discrete = TRUE, alpha=0.6) +
  geom_jitter(color="black", size=0, alpha=0) +
  theme_classic() +
  theme(
    legend.position="none",
    plot.title = element_text(size=11)) +
  xlab(expression(~italic(m)~'parameter')) + ylab("depth of coverage (x)")

2.4.2 Optimizing -M and -n parameters

  1. Ideally, this portion should be run on the hpc, as it takes a lot of memory.
  2. Varying -M and -n affects the number of loci and SNPs recovered.
  3. Here, M and n are optimized in three variations: n=M, n=M-1, and n=M+1.
  4. To do this, I made a custom script that executes all these optimizations in one run. Below is a sample script for the n and M parameters.
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --mem=64G
#SBATCH --partition=parallel
#SBATCH --time=3-00:00
#SBATCH --job-name=pcan_opt
#SBATCH -o /nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization/Mn_optimization.out
#SBATCH -e /nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization/Mn_optimization.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=michaeljohn.mendiola@vuw.ac.nz

module load stacks/2.2

IN=/nfs/scratch2/mendiomi/01_pcanaliculus/02_clean/01_pcanaliculus_1349
OUT=/nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization
POPMAP=/nfs/home/mendiomi/00_scripts

#running n=M
for i in {1..10};
do
mkdir -p $OUT/M$i\n$i;
denovo_map.pl --samples $IN --popmap $POPMAP/popmap_optimization.txt -o $OUT/M$i\n$i -m 5 -M $i -n $i -T 24 -X "populations: --write_single_snp -r 0.80";
done

#running n=M-1
for i in {2..10}; 
do 
mkdir -p $OUT/M$i\n$[i-1];
denovo_map.pl --samples $IN --popmap $POPMAP/popmap_optimization.txt -o $OUT/M$i\n$[i-1] -m 5 -M $i -n $[i-1] -T 24 -X "populations: --write_single_snp -r 0.80";
done

#running n=M+1
for i in {1..9}; 
do 
mkdir -p $OUT/M$i\n$[i+1];
denovo_map.pl --samples $IN --popmap $POPMAP/popmap_optimization.txt -o $OUT/M$i\n$[i+1] -m 5 -M $i -n $[i+1] -T 24 -X "populations: --write_single_snp -r 0.80";
done
  5. Count the number of loci and SNPs using the custom script below. Note that for each run the first command counts the number of kept loci (haplotypes.tsv), and the second counts the number of retained SNPs (sumstats.tsv).
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --partition=quicktest
#SBATCH --time=10:00
#SBATCH --job-name=count
#SBATCH -o /nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization/count.out
#SBATCH -e /nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization/count.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=michaeljohn.mendiola@vuw.ac.nz

OUT=/nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization

# n=M
for i in {1..10};
do 
cat $OUT/M$i\n$i/populations.haplotypes.tsv | sed '1d' | wc -l >> $OUT/countloci.txt; 
cat $OUT/M$i\n$i/populations.sumstats.tsv | grep -v "^#" | cut -f 3 | sort -n | uniq | wc -l >> $OUT/countloci.txt; 
done

# n=M-1
for i in {2..10};
do 
cat $OUT/M$i\n$[i-1]/populations.haplotypes.tsv | sed '1d' | wc -l >> $OUT/countloci_minus1.txt; 
cat $OUT/M$i\n$[i-1]/populations.sumstats.tsv | grep -v "^#" | cut -f 3 | sort -n | uniq | wc -l >> $OUT/countloci_minus1.txt; 
done

# n=M+1
for i in {1..9}; 
do 
cat $OUT/M$i\n$[i+1]/populations.haplotypes.tsv | sed '1d' | wc -l >> $OUT/countloci_plus1.txt; 
cat $OUT/M$i\n$[i+1]/populations.sumstats.tsv | grep -v "^#" | cut -f 3 | sort -n | uniq | wc -l >> $OUT/countloci_plus1.txt; 
done
  6. Make a .csv file containing all these numbers, where A = number of loci and B = number of SNPs for n=M; the m suffix (Am, Bm) marks the n=M-1 runs and the p suffix (Ap, Bp) the n=M+1 runs. A sample file is below:

M    A     B     Am    Bm    Ap    Bp
1    3548  3379              3899  3738
2    3811  3678  3606  3472  3962  3820
3    3918  3782  3857  3727  3980  3835
4    3949  3809  3917  3785  3979  3835
5    3924  3781  3906  3763  3938  3796
6    3897  3751  3910  3771  3903  3763
7    3892  3733  3890  3731  3899  3741
8    3890  3734  3877  3723  3873  3717
9    3848  3689  3861  3701  3859  3690
10   3839  3670  3827  3661
  7. Note that there is no n=M-1 run for M=1 and no n=M+1 run for M=10.
  8. Plot your M and n results in R:
setwd("C:/Users/mendiomi/OneDrive - Victoria University of Wellington - STAFF/Documents/Moana Project/Data/01_optimization/")

#tidyverse (ggplot2, readr, forcats) should still be loaded from the previous chunk
library(ggalt)

nM_countloci <- read_csv("nM_countloci.csv")
#View(nM_countloci)

# fit a smooth spline curve through the n=M SNP counts
spline_B <- as.data.frame(spline(nM_countloci$M, nM_countloci$B))

ggplot(nM_countloci, aes(x = M)) + 
  geom_point(aes(y = B, colour = 'n=M'), size = 2) +
  geom_point(aes(y = Bm, colour = 'n=M-1'), size = 2) + 
  geom_point(aes(y = Bp, colour = 'n=M+1'), size = 2) +
  geom_line(data = spline_B, aes(x = x, y = y)) +
  scale_color_manual(name = expression(~italic(n)~'parameter'), 
                     breaks = c('n=M+1', 'n=M', 'n=M-1'),  #set legend order so labels match the values below
                     labels = c('n = M + 1', 'n = M', 'n = M - 1'), 
                     values = c('n=M+1'='#00BFC4', 'n=M'='black', 'n=M-1'='#F8666D')) +
  theme_classic() +
  scale_x_continuous(breaks = seq(1, 10, by = 1)) + 
  scale_y_continuous(limits = c(3200, 4000)) +
  labs(x = expression(~italic(M)~'parameter'), y = expression(~italic(r80)~'loci'))