Notes available online at http://rpubs.com/lcollado/GFS-2013-09-24
If you have not used the cluster, please check http://www.biostat.jhsph.edu/bit/SGE_lecture.ppt.pdf and http://www.biostat.jhsph.edu/bit/cluster-usage.html for more information.
## Log into enigma2
ssh lcollado@biostat.enigma2.jhsph.edu
## Ask for an interactive session
qrsh
Add the version of R and high-throughput sequencing tools, including SAMtools.
# Sequencing cluster tools, includes updated R. It's run by Kasper
if [ -f /hpscc/usr/local/setup/setup.all.sh ]; then
source /hpscc/usr/local/setup/setup.all.sh
fi
## Go to where the data is stored
cd /home/bst/student/lcollado/GFS-2013-09-24
Files from 2 experiments
$ ls -lh
total 5.5G
drwxr-xr-x+ 3 lcollado biostats 3 Sep 23 16:00 Caenorhabditis_elegans
-rw-r--r--+ 1 lcollado biostats 551M Sep 23 14:36 Caenorhabditis_elegans_Ensembl_WBcel215.tar.gz
-rwxr-xr-x+ 1 lcollado biostats 36M Sep 23 16:01 genome.1.bt2
-rwxr-xr-x+ 1 lcollado biostats 24M Sep 23 16:01 genome.2.bt2
-rwxr-xr-x+ 1 lcollado biostats 71 Sep 23 16:01 genome.3.bt2
-rwxr-xr-x+ 1 lcollado biostats 24M Sep 23 16:01 genome.4.bt2
-rwxr-xr-x+ 1 lcollado biostats 98M Sep 23 16:01 genome.fa
-rwxr-xr-x+ 1 lcollado biostats 36M Sep 23 16:01 genome.rev.1.bt2
-rwxr-xr-x+ 1 lcollado biostats 24M Sep 23 16:02 genome.rev.2.bt2
-rw-r--r--+ 1 lcollado biostats 1.2K Sep 23 16:03 README.md
-rw-r--r--+ 1 lcollado biostats 996M Sep 23 13:43 Sample_NoIndex_L002_R1_001.fastq
-r--r--r--+ 1 lcollado biostats 388M Sep 23 13:37 Sample_NoIndex_L002_R1_001.fastq.gz
-rw-r--r--+ 1 lcollado biostats 996M Sep 23 13:44 Sample_NoIndex_L002_R2_001.fastq
-r--r--r--+ 1 lcollado biostats 396M Sep 23 13:37 Sample_NoIndex_L002_R2_001.fastq.gz
-rw-r--r--+ 1 lcollado biostats 1.5G Sep 23 13:45 Sherlock-IsogenicWT-080716_GA-EAS46_0001_209DH_L5.fastq
-rw-r--r--+ 1 lcollado biostats 533M Sep 23 13:38 Sherlock-IsogenicWT-080716_GA-EAS46_0001_209DH_L5.fastq.gz
$ head Sherlock-IsogenicWT-080716_GA-EAS46_0001_209DH_L5.fastq
@GA-EAS46_1_209DH:5:1:889:471
CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+GA-EAS46_1_209DH:5:1:889:471
Uh@[hhheYTdchhhShhaWhJhhhhhhVhhOh^\K
@GA-EAS46_1_209DH:5:1:744:748
ACGCTATCGGTCTCTCGCCAATATTTAGCTTTAGAT
+GA-EAS46_1_209DH:5:1:744:748
hhhhhhhhhhhhhhhhhhhbEhWhhhMf\hhhSEOh
@GA-EAS46_1_209DH:5:1:709:882
GTTGTGTTAAAAGCGACAACCTAGCTGCTGTCTTTG
Full definition here.
Bowtie2 is a fast short-read aligner. It maps the reads to the reference genome. Developed by Ben Langmead.
iGenome downloaded from http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn 1. Bowtie2 index files were copied to the main directory. They start with genome*
SAM is a commonly used format for reporting aligments of short reads. Most aligners have adopted it and include options for reporting alignments in this format. SAMtools is a suite of
http://samtools.sourceforge.net/
By Kasper.
Print every 4th line, starting with number 2:
awk '{ if (NR % 4 == 2) print}' FILE
Splits a string into singletons, keeps columns 1-3, sorts and tabulates them (combine with awk snippet)
cut -c1-3 FILE | sort | uniq -c
Counts the number of characters in line 2, including newline
head -2 FILE | tail -1 | wc -c
Counts the number of lines with 'N' (combine with awk snippet)
grep N FILE | wc -l
Quality values are a conination of "scale" and "encoding". Scale can be either "Illumina/Solexa" or "Sanger". Sanger is now standard. The scale has to do with the interpretation of the 0-40 integer number as a probability. Encoding is done using ASCII and has two forms, depending on offset: "Illumina" or "Sanger".
FASTQ / FASTA files may have additional strings in the header field, separated by spaces. The first string is the record name, and other strings are sometimes discarded by various programs.
The read name typically has the following information machine, flow cell id, lane, tile, (x,y) coordinates on tile. The values (lane, tile, (x,y)) gives you the spatial location of the read on the flowcell.
(By Leo) Below is an example of aligning to the C. elegans genome. Takes around 6 minutes.
$ bowtie2 -x genome -1 Sample_NoIndex_L002_R1_001.fastq -2 Sample_NoIndex_L002_R2_001.fastq -N 0 -k 1 -S Sample-noMismatch.sam
4000000 reads; of these:
4000000 (100.00%) were paired; of these:
3998681 (99.97%) aligned concordantly 0 times
1319 (0.03%) aligned concordantly exactly 1 time
0 (0.00%) aligned concordantly >1 times
----
3998681 pairs aligned concordantly 0 times; of these:
5 (0.00%) aligned discordantly 1 time
----
3998676 pairs aligned 0 times concordantly or discordantly; of these:
7997352 mates make up the pairs; of these:
7993619 (99.95%) aligned 0 times
3719 (0.05%) aligned exactly 1 time
14 (0.00%) aligned >1 times
0.08% overall alignment rate
Next we can convert the SAM file to a BAM file using SAMtools. Takes around 4 minutes.
$ samtools view -bS Sample-noMismatch.sam > Sample-noMismatch.bam
[samopen] SAM header is present: 7 sequences.
Now that we have a BAM file, we can create the index file. To do so, we first sort the alignment file.
This will allow other programs to quickly access parts of the BAM file very quickly. This is great for visualization tools and for extracting pieces of the information. In R we use the Rsamtools package available in Bioconductor to interact with these files.
## Sorting takes 6 min
$ samtools sort Sample-noMismatch.bam Sample-noMismatch-sorted.bam
[bam_sort_core] merging from 4 files...
## Indexing takes less than a min
$ samtools index Sample-noMismatch-sorted.bam