This is an R Markdown document crafted mainly for the analyses of Perna canaliculus population genetics. This documentation also includes useful guidelines from setting up your Ubuntu environment, installing Stacks pipeline on your local computer, running trial scripts, and sample shell scripting for running onto your hpc servers.
This part will help you set up your Windows Subsystem for Linux (Ubuntu) in your Windows 10 or higher computer. Set up you working environment by installing and updating packages.
pwd)./mnt/c/Users/mendiomiRun these commands:
sudo apt-get update -y
sudo apt install build-essential
Basically install all the updates and essential packages in your Ubuntu. YES to all installation questions.
Install some of the essential packages for Stacks-2.60.
Install gpp: sudo apt-get install -y gpp
Install Zlib compression sudo apt-get install zlib1g-dev
Now, your Ubuntu is ready.
Follow these easy steps to install Stacks-2.60 on your local computer. This will help you run simple commands using the program and run trial executions for your data processing.
Download the the tar.gz file from the Stacks website
Move tar.gz file on your installation directory.
Go to your Ubuntu terminal and go to the directory where you placed your tar.gz
Decompress the file:
tar xfvz stacks-2.60.tar.gz
Change directory:
cd stacks-2.60/
Configure
./configure --prefix=/mnt/path/to/your/desired/directory
Make install:
make -j 8
Check executable and other files in your directory:
ls
You should see all of the files and executable in the directory (highlighted in green)
Try to call one of the executable:
ustacks
If if does not work, try including the root directories:
/mnt/path/to/directory/where/you/installed/stacks-2.60/ustacks
You should see the program information/manual appearing on your terminal.
$PATH variableThis part will help you to set up your $PATH or environment so you can call your programs (executable) even you are not in the current directory where you put them.
Temporarily manipulate your $PATH variable:
Check your$PATH
echo $PATH
You will see something like this:
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/mnt/c/WINDOWS/system32:/mnt/c/WINDOWS:/mnt/c/WINDOWS/System32/Wbem:/mnt/c/WINDOWS/System32/WindowsPowerShell/v1.0/:/mnt/c/GCsolution/Program:/mnt/c/GCMSSO~1/Program:/mnt/c/WINDOWS/System32/OpenSSH/:/mnt/c/Program Files/PuTTY/:/mnt/c/Users/mendiomi/AppData/Local/Microsoft/WindowsApps:/snap/bin
Add /opt/bin on your $PATH
PATH=$PATH:/opt/bin
Now, echo your $PATH
echo $PATH
You will see something like this:
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/mnt/c/WINDOWS/system32:/mnt/c/WINDOWS:/mnt/c/WINDOWS/System32/Wbem:/mnt/c/WINDOWS/System32/WindowsPowerShell/v1.0/:/mnt/c/GCsolution/Program:/mnt/c/GCMSSO~1/Program:/mnt/c/WINDOWS/System32/OpenSSH/:/mnt/c/Program Files/PuTTY/:/mnt/c/Users/mendiomi/AppData/Local/Microsoft/WindowsApps:/snap/bin:/opt/bin
You will see at that you added something at the end.
Now, try it using your current working directory:
PATH=$PATH:/mnt/c/Users/mendiomi/installation/stacks-2.60
echo $PATH
/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/usr/games:/usr/local/games:/mnt/c/WINDOWS/system32:/mnt/c/WINDOWS:/mnt/c/WINDOWS/System32/Wbem:/mnt/c/WINDOWS/System32/WindowsPowerShell/v1.0/:/mnt/c/GCsolution/Program:/mnt/c/GCMSSO~1/Program:/mnt/c/WINDOWS/System32/OpenSSH/:/mnt/c/Program Files/PuTTY/:/mnt/c/Users/mendiomi/AppData/Local/Microsoft/WindowsApps:/snap/bin:/opt/bin:/mnt/c/Users/mendiomi/installation/stacks-2.60
You temporarily set the$PATH variable on your current directory.
Go to your root directory:
cd ~
Call one of the Stacks programs:
process_radtags
You should be able to see the manual of process_radtags running.
Manipulate your $PATH permanently:
Open the .bashrc file using nano text file editor and paste the path of the directory where the executable are located:
nano .bashrc
Go to the bottom of the file and paste the path
PATH=$PATH:/mnt/c/Users/mendiomi/installation/stacks-2.60
Write out (^O) the file (save) and exit.
Create another Ubuntu windows and try to call your programs.
You can manipulate the.bashrc file for your other programs.
This part guides you how to mount your external drive (network drives) on WSL Ubuntu.
Create a directory
mkdir /mnt/Z
Mount the drive using your network drive link
sudo mount -t drvfs'\\vuwstocoissrin1.vuw.ac.nz\SBS_Moana_01' /mnt/Z/
Finalize the mounting
sudo mount -e
You can know access your files in the network drive using your Ubuntu WSL.
Stacks pipelineThis section will help you process your data in the Stacks pipeline. Specifically, run programs such as process_radtagsand denovo_map.plfrom your local computer. This will also cover ways to demultiplex and clean your data, remove adapter contamination from your sequence data, assess your data in fastqcandmultiqc, and optimize parameters in denovo_map.pl.
It is important to know your data on how it were sequenced and prepared before feeding them into the Stacks pipeline. Basic questions includes: are your sequences single-end or paired-end? demultiplexed or not? If no, do you have the list of barcodes? what restriction enzymes were used? what is the expected read length? do you know the adapter used in the sequencing? how many lanes of sequences do you have? and so on. General knowledge about these will give you hits on how you will move forward in processing your data in the Stacks pipeline.
Before running the process_radtags, you need the following documents or directories ready to run it smoothly:
If you have two or more sequence lanes, it is better to concatenate them in a single file as it is more difficult to process them in the Stacks pipeline
First, you need to merge sequence data depending on their nature. Single-end sequencing lanes should result to a single sequencing data. Whereas, paired-end sequencing will have forward and reverse reads. With this, you need to concatenate all forward and reverse reads separately. Some cases merge them into a single file and truncate them into a specific length at the end of the process.
Here is an example on how to concatenate your two single-end sequencing lanes:
cat {sequence.lane.1.fastq.txt.gz} {sequence.lane.2.fastq.txt.gz} > concatenated.lanes.fq.gz
This concatenated file of your sequence reads is now ready for process_radtags.
The barcode file should contain the individual barcodes in the first column and individual ID names in the second column delimited by tab (/t)
An example of a barcofe file looks something like this:
GAACTATC INDIV_1
GAATCGAA INDIV_2
ATCTTGCG INDIV_3
Here is an example of an R script to create a barcode file from your excel of text spreadsheet:
library(readr)
#read your text file containing your barcodes and ID names
barcode <- read.delim("barcode.txt", delim = "\t",
escape_double = FALSE,
trim_ws=TRUE)
#subset the comlumns of your barcode and their unique ID tags
barcode_key <- subset(barcode, select=c("barcode", "uidtag"))
#check
barcode_key
#if they have two lanes, remove the duplicates
barcode_key <- unique(barcode_key)
#save your file
write.table(barcode_key, file = "barcode_key.txt",
append = FALSE, quote = F, sep="\t",
eol = "\n", na = "NA", dec = ".",
row.names = F, col.names = F, qmethod = c("escape", "double"))
denovo_map.pl)This part will demonstrate how to get the optimized parameters (-m, -M, -n options) for the denovo_map.pl pipeline following the protocol of Paris et. and Catchen et al. This section is divided into two parts: (1) identify the ideal -m parameter where the coverage depth of all or subset of samples are more than 10x; and (2) identify which values of -M and -n parameters provide the highest number of loci and variant sites (SNPs) using the optimal -m value, utilizing only the first variant site for every locus (–write_single_snp) with loci present to 80% of samples (-r 0.80). At the end of this segment, we will be able get the optimized values for -m, -M, and -n, appropriately.
Copy denovo_map.pl executable from installtion directory scripts/denovo_map.pl into the main installation directory /mnt/c/Users/mendiomi/installation/stacks-2.60/. This will make your denovo_map.pl executable to the global environment you assigned here.
Edit the denovo_map.pl file
nano deno_map.pl
Change the my $exe_path with the installation directory path:
my $dry_run = false;
my $exe_path = "/mnt/c/Users/mendiomi/installation/stacks-2.60/;
my $out_path = "";
Now, run your commands in the terminal using a popmap which contains a list of subsamples (e.g. 2 samples per location all assigned to one population) from your dataset:
for i in {1..10}; do mkdir -p m$i; denovo_map.pl --samples 02_clean/optimization_samples --popmap 02_clean/optimization_samples/popmap_optimization.txt -o 03_optimization/m$i -m $i -M 2 -n 2 -T 10; done
Here, we will be using a value of 2 for both -M and -n options.
sbatch <file.sh>. Below is a sample script.#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --mem=64G
#SBATCH --partition=parallel
#SBATCH --time=2-00:00
#SBATCH --job-name=pcan_opt
#SBATCH -o /nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization/m_optimization.out
#SBATCH -e /nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization/m_optimization.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=michaeljohn.mendiola@vuw.ac.nz
module load stacks/2.2
IN=/nfs/scratch2/mendiomi/01_pcanaliculus/02_clean/01_pcanaliculus_1349
OUT=/nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization
POPMAP=/nfs/home/mendiomi/00_scripts
# running m for 1 to 10
for i in {1..10};
do
mkdir -p $OUT/m$i;
denovo_map.pl --samples $IN --popmap $POPMAP/popmap_optimization.txt -o $OUT/m$i -m $i -M 2 -n 2 -T 24;
done
Open a denovo_map.log file from every -m run and the check the Depths of Coverage for Processed Samples section where you can find the depth of coverage for every samples.
Extract the information using the this simple commands:
cat 03_optimization/m1/denovo_map.log | perl -ne 'BEGIN { $/="\n\n" }; print if $_ =~ /Depths/;'
Basically, the command will extract the section starting with ‘Depths’ and will print the depth of coverage for all your samples. You can print it in a text file by adding > out.txt at the end of the command, and/or create a for loop to obtain all the depth of coverage from all you m runs.
Create a .csv file using this format:
opt depth
1 10.13
1 10.02
1 2.72
1 6.18
. .
. .
. .
. .
. .
10 10.31
10 7.8
Where optis the value of m and depth is the depth of coverage per sample.
#setwd("D:/Documents/01_PhD MOANA/Data/01_optimization/")
#be sure to load the .Rmd file
#set working directory
setwd("C:/Users/mendiomi/OneDrive - Victoria University of Wellington - STAFF/Documents/Moana Project/Data/01_optimization/")
library(tidyverse)
library(hrbrthemes)
library(viridis)
library(readr)
m_covDepth <- read_csv("m_covDepth.csv")
m_covDepth$opt <- as.character(m_covDepth$opt)
#View(m_covDepth)
# ggPlot depth of coverage
m_covDepth %>%
ggplot( aes(x=fct_inorder(opt), y=depth, fill=opt)) +
geom_boxplot(outlier.size = 0.5) +
scale_fill_viridis(discrete = TRUE, alpha=0.6) +
geom_jitter(color="black", size=0, alpha=0) +
theme_classic() +
theme(
legend.position="none",
plot.title = element_text(size=11)) +
xlab(expression(~italic(m)~'parameter')) + ylab("depth of coverage (x)")
n=M, n=M-1, and n=M+1.#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=12
#SBATCH --mem=64G
#SBATCH --partition=parallel
#SBATCH --time=3-00:00
#SBATCH --job-name=pcan_opt
#SBATCH -o /nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization/Mn_optimization.out
#SBATCH -e /nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization/Mn_optimization.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=michaeljohn.mendiola@vuw.ac.nz
module load stacks/2.2
IN=/nfs/scratch2/mendiomi/01_pcanaliculus/02_clean/01_pcanaliculus_1349
OUT=/nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization
POPMAP=/nfs/home/mendiomi/00_scripts
#running n=M
for i in {1..10};
do
mkdir -p $OUT/M$i\n$i;
denovo_map.pl --samples $IN --popmap $POPMAP/popmap_optimization.txt -o $OUT/M$i\n$i -m 5 -M $i -n $i -T 24 -X "populations: --write_single_snp -r 0.80";
done
#running n=M-1
for i in {2..10};
do
mkdir -p $OUT/M$i\n$[i-1];
denovo_map.pl --samples $IN --popmap $POPMAP/popmap_optimization.txt -o $OUT/M$i\n$[i-1] -m 5 -M $i -n $[i-1] -T 24 -X "populations: --write_single_snp -r 0.80";
done
#running n=M+1
for i in {1..9};
do
mkdir -p $OUT/M$i\n$[i+1];
denovo_map.pl --samples $IN --popmap $POPMAP/popmap_optimization.txt -o $OUT/M$i\n$[i+1] -m 5 -M $i -n $[i+1] -T 24 -X "populations: --write_single_snp -r 0.80";
done
#!/bin/bash
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
#SBATCH --partition=quicktest
#SBATCH --time=10:00
#SBATCH --job-name=count
#SBATCH -o /nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization/count.out
#SBATCH -e /nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization/count.err
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=michaeljohn.mendiola@vuw.ac.nz
OUT=/nfs/scratch2/mendiomi/01_pcanaliculus/03_optimization
n=M
for i in {1..10};
do
cat $OUT/M$i\n$i/populations.haplotypes.tsv | sed '1d' | wc -l >> $OUT/countloci.txt;
cat $OUT/M$i\n$i/populations.sumstats.tsv | grep -v "^#" | cut -f 3 | sort -n | uniq | wc -l >> $OUT/countloci.txt;
d#one
n=M-1
for i in {2..10};
do
cat $OUT/M$i\n$[i-1]/populations.haplotypes.tsv | sed '1d' | wc -l >> $OUT/countloci_minus1.txt;
cat $OUT/M$i\n$[i-1]/populations.sumstats.tsv | grep -v "^#" | cut -f 3 | sort -n | uniq | wc -l >> $OUT/countloci_minus1.txt;
done
n=M+1
for i in {1..9};
do
cat $OUT/M$i\n$[i+1]/populations.haplotypes.tsv | sed '1d' | wc -l >> $OUT/countloci_plus1.txt;
cat $OUT/M$i\n$[i+1]/populations.sumstats.tsv | grep -v "^#" | cut -f 3 | sort -n | uniq | wc -l >> $OUT/countloci_plus1.txt;
done
M A B Am Bm Ap Bp
1 3548 3379 3899 3738
2 3811 3678 3606 3472 3962 3820
3 3918 3782 3857 3727 3980 3835
4 3949 3809 3917 3785 3979 3835
5 3924 3781 3906 3763 3938 3796
6 3897 3751 3910 3771 3903 3763
7 3892 3733 3890 3731 3899 3741
8 3890 3734 3877 3723 3873 3717
9 3848 3689 3861 3701 3859 3690
10 3839 3670 3827 3661
setwd("C:/Users/mendiomi/OneDrive - Victoria University of Wellington - STAFF/Documents/Moana Project/Data/01_optimization/")
library(ggalt)
nM_countloci <- read_csv("nM_countloci.csv")
#View(nM_countloci)
# make spline - line curve
spline_B <- as.data.frame(spline(nM_countloci$M, nM_countloci$B))
ggplot(nM_countloci, aes (x=M)) +
geom_point(aes(y = B, colour = 'n=M'), size = 2) +
geom_point(aes(y = Bm, colour = 'n=M-1'), size = 2) +
geom_point(aes(y = Bp, colour = 'n=M+1'), size = 2) +
geom_line(data = spline_B, aes(x = x, y = y)) +
scale_color_manual(name=expression(~italic(n)~'parameter'),
labels=c('n = M + 1', 'n = M', 'n = M - 1'),
values=c('n=M+1'='#00BFC4','n=M'='black','n=M-1'='#F8666D')) +
theme_classic() +
scale_x_continuous(breaks = seq(1, 10, by = 1)) +
scale_y_continuous(limits=c(3200,4000)) +
labs(x=expression(~italic(M)~'parameter'), y=expression(~italic(r80)~'loci'))