Exploring HTS files

Notes available online at http://rpubs.com/lcollado/GFS-2013-09-24

Setup

Cluster

If you have not used the cluster, please check http://www.biostat.jhsph.edu/bit/SGE_lecture.ppt.pdf and http://www.biostat.jhsph.edu/bit/cluster-usage.html for more information.

## Log into enigma2
ssh lcollado@biostat.enigma2.jhsph.edu

## Ask for an interactive session
qrsh

.bash_profile

Add the version of R and high-throughput sequencing tools, including SAMtools.

# Sequencing cluster tools, includes updated R. It's run by Kasper
if [ -f /hpscc/usr/local/setup/setup.all.sh ]; then
source  /hpscc/usr/local/setup/setup.all.sh
fi

Data

Location

## Go to where the data is stored
cd /home/bst/student/lcollado/GFS-2013-09-24

Description

Files from 2 experiments

$ ls -lh
total 5.5G
drwxr-xr-x+ 3 lcollado biostats    3 Sep 23 16:00 Caenorhabditis_elegans
-rw-r--r--+ 1 lcollado biostats 551M Sep 23 14:36 Caenorhabditis_elegans_Ensembl_WBcel215.tar.gz
-rwxr-xr-x+ 1 lcollado biostats  36M Sep 23 16:01 genome.1.bt2
-rwxr-xr-x+ 1 lcollado biostats  24M Sep 23 16:01 genome.2.bt2
-rwxr-xr-x+ 1 lcollado biostats   71 Sep 23 16:01 genome.3.bt2
-rwxr-xr-x+ 1 lcollado biostats  24M Sep 23 16:01 genome.4.bt2
-rwxr-xr-x+ 1 lcollado biostats  98M Sep 23 16:01 genome.fa
-rwxr-xr-x+ 1 lcollado biostats  36M Sep 23 16:01 genome.rev.1.bt2
-rwxr-xr-x+ 1 lcollado biostats  24M Sep 23 16:02 genome.rev.2.bt2
-rw-r--r--+ 1 lcollado biostats 1.2K Sep 23 16:03 README.md
-rw-r--r--+ 1 lcollado biostats 996M Sep 23 13:43 Sample_NoIndex_L002_R1_001.fastq
-r--r--r--+ 1 lcollado biostats 388M Sep 23 13:37 Sample_NoIndex_L002_R1_001.fastq.gz
-rw-r--r--+ 1 lcollado biostats 996M Sep 23 13:44 Sample_NoIndex_L002_R2_001.fastq
-r--r--r--+ 1 lcollado biostats 396M Sep 23 13:37 Sample_NoIndex_L002_R2_001.fastq.gz
-rw-r--r--+ 1 lcollado biostats 1.5G Sep 23 13:45 Sherlock-IsogenicWT-080716_GA-EAS46_0001_209DH_L5.fastq
-rw-r--r--+ 1 lcollado biostats 533M Sep 23 13:38 Sherlock-IsogenicWT-080716_GA-EAS46_0001_209DH_L5.fastq.gz

Questions

  1. What is the structure of each record.
  2. How many reads are there in the file? How many nucleotides per read?
  3. How many reads contains one or more 'N'. This is usually a sign of a 'bad' read.
  4. Compute the distribution of the 3 first nucleotides in the reads
  5. How are the quality values encoded? (Talk about different formats for this).
  6. What does the information in the read name signify?
  7. Align the yeast reads to the genome using Bowtie.
  8. Convert the SAM format to BAM and index the file using samtools.
  9. Discuss the SAM/BAM format.

FASTQ format

$ head Sherlock-IsogenicWT-080716_GA-EAS46_0001_209DH_L5.fastq
@GA-EAS46_1_209DH:5:1:889:471
CAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA
+GA-EAS46_1_209DH:5:1:889:471
Uh@[hhheYTdchhhShhaWhJhhhhhhVhhOh^\K
@GA-EAS46_1_209DH:5:1:744:748
ACGCTATCGGTCTCTCGCCAATATTTAGCTTTAGAT
+GA-EAS46_1_209DH:5:1:744:748
hhhhhhhhhhhhhhhhhhhbEhWhhhMf\hhhSEOh
@GA-EAS46_1_209DH:5:1:709:882
GTTGTGTTAAAAGCGACAACCTAGCTGCTGTCTTTG

Full definition here.

Bowtie2

Bowtie2 is a fast short-read aligner. It maps the reads to the reference genome. Developed by Ben Langmead.

Reference genome

iGenome downloaded from http://support.illumina.com/sequencing/sequencing_software/igenome.ilmn 1. Bowtie2 index files were copied to the main directory. They start with genome*

SAMtools

SAM is a commonly used format for reporting aligments of short reads. Most aligners have adopted it and include options for reporting alignments in this format. SAMtools is a suite of

http://samtools.sourceforge.net/

Partial solutions

By Kasper

By Kasper.

Print every 4th line, starting with number 2:

awk '{ if (NR % 4 == 2) print}' FILE

Splits a string into singletons, keeps columns 1-3, sorts and tabulates them (combine with awk snippet)

cut -c1-3 FILE | sort | uniq -c

Counts the number of characters in line 2, including newline

head -2 FILE | tail -1 | wc -c

Counts the number of lines with 'N' (combine with awk snippet)

grep N FILE | wc -l

Quality values are a conination of "scale" and "encoding". Scale can be either "Illumina/Solexa" or "Sanger". Sanger is now standard. The scale has to do with the interpretation of the 0-40 integer number as a probability. Encoding is done using ASCII and has two forms, depending on offset: "Illumina" or "Sanger".

FASTQ / FASTA files may have additional strings in the header field, separated by spaces. The first string is the record name, and other strings are sometimes discarded by various programs.

The read name typically has the following information machine, flow cell id, lane, tile, (x,y) coordinates on tile. The values (lane, tile, (x,y)) gives you the spatial location of the read on the flowcell.

Aligning

(By Leo) Below is an example of aligning to the C. elegans genome. Takes around 6 minutes.

$ bowtie2 -x genome -1 Sample_NoIndex_L002_R1_001.fastq -2 Sample_NoIndex_L002_R2_001.fastq -N 0 -k 1 -S Sample-noMismatch.sam
4000000 reads; of these:
  4000000 (100.00%) were paired; of these:
    3998681 (99.97%) aligned concordantly 0 times
    1319 (0.03%) aligned concordantly exactly 1 time
    0 (0.00%) aligned concordantly >1 times
    ----
    3998681 pairs aligned concordantly 0 times; of these:
      5 (0.00%) aligned discordantly 1 time
    ----
    3998676 pairs aligned 0 times concordantly or discordantly; of these:
      7997352 mates make up the pairs; of these:
        7993619 (99.95%) aligned 0 times
        3719 (0.05%) aligned exactly 1 time
        14 (0.00%) aligned >1 times
0.08% overall alignment rate

Next we can convert the SAM file to a BAM file using SAMtools. Takes around 4 minutes.

$ samtools view -bS Sample-noMismatch.sam > Sample-noMismatch.bam
[samopen] SAM header is present: 7 sequences.

Now that we have a BAM file, we can create the index file. To do so, we first sort the alignment file.

This will allow other programs to quickly access parts of the BAM file very quickly. This is great for visualization tools and for extracting pieces of the information. In R we use the Rsamtools package available in Bioconductor to interact with these files.

## Sorting takes 6 min
$ samtools sort Sample-noMismatch.bam Sample-noMismatch-sorted.bam
[bam_sort_core] merging from 4 files...

## Indexing takes less than a min
$ samtools index Sample-noMismatch-sorted.bam