Module 2: Sequences & Genomic Features
Microbiology Primer
Genomes
- one copy lives in nucleus
Eukaryotic gene expression
- gene goes through transcription
- Pre-mRNA strand created by transcription
- Pre-mRNA exons spliced into one mRNA strand (introns removed)
- mRNA translated into protein
different exon combinations produced during transcription will produce different protein products
- called splice variants or transcripts
Major computational biology questions to ask:
- What are the genes?
- What are the proteins?
- What are their sequences?
- Quantify their abundance.
- What is their role in the cell?
Sequence Representation & Generation
Questions:
- What do sequences look like?
- What do they do?
- How many are there?
Genomic sequences
- Adenine (A)
- Guanine (G)
- Thyamine (T)
- Cytosine (C)
mRNA sequences
- Uracil replaces Thyamine (U)
Sequence Generation
- shear DNA / RNA to roughly same length fragments
- amplifiy fragments if more material needed
- sequence from one end only = single-end reads
- sequence from both ends = paired-end reads
- Assembly graph
- node fragments are called reads
- algorithm needed to put these reads together so that all are included
Sequencing methods
- Sanger (used until 2008)
- FASTA / Pearson format
- 1 line header starts with “>” followed by unique identifier and metadata
- be careful not to include any blank lines
- Next Generation Sequencing (NGS; 2008 - present)
- FASTQ format
- 4 line record
- 1st line starts with “@” and ends with “/1” or “/2” to indicate single / paired reads
- 2nd line is gene sequence; N codes for unknown
- 3rd line starts with “+”
- 4th line is base quality score (codes for confidence of matching line 2 gene)
Base quality scores (FASTQ format; line 4)
- p = probability that call at bse is correct
- quality value = Q = -10log(base10)p
- represented by ASCII symbols (33 - 126)
- values below 20 are low
- max value is ~ 40
Annotation
Genomic features
- genome annotation: determine precise location and structure of genomic features along genome
- genomic features:
- genes
- promoters
- protein binding sites
- translation start / stop sites
- SNasel sites
- et cetera
BED format
- browser extensible data (BED) format
- coordinates are 0-based
- basic format
- column 1: chromosome / scaffold
- column 2: start of feature (0-based; count from 0 instead of 1)
- column 3: end of feature
- extended format
- columns 1-3 plus…
- name: exon position
- score: value b/w 0 - 1000 to indicate color intensity
- strand direction (+/-)
- thick_start: start region of protein coding
- thick_end: end region of protein coding
- rgb: (red, green, blue)
- block coordinates can be included
GTF format
- genomic transfer format (GTF)
- each interval feature takes 1 line
- columns are separeted by “”
- fields within column 9 separated by " "
- coordinates are 1-based (count starting with 1)
- column 1: chromosome / scaffold ID
- column 2: source (program / website / etc)
- column 3: type of feature
- column 4: beginning of feature along axis
- column 5: end of feature
- column 6: score (confidence)
- column 7: strand (forward + / reverse -)
- column 8: frame (0/1/2)
- column 9: composite with two fields
- field 1: gene ID followed by name of gene
- field 2: transcript ID
- et cetera…
GFF3 format
- genomic feature format version 3 (GFF3)
- column 1: chromosome / scaffold ID
- column 2: source (cufflinks / URL / etc)
- column 3: type of feature (mRNA / exon / etc)
- column 4: start coordinate of feature on genome
- column 5: end coordinate of feature on genome
- column 6: score
- column 7: strand
- column 8: coding frame (“.” = unknown)
- column 9: fields (e.g., ID, name, parent cluster)
Alignment
- sequence a fragment of the gene (RNA) or region (DNA), then map (align) it to the genome
- Alignment: a mapping between the letters of the two sequences, with some spaces (indels)
- Alignment adjusts for differences caused by:
- polymorphisms
- sequencing errors
- introns
- insertion / deletion / substitution
Alignment types
- contiguous alignment e.g., DNA fragments can be aligned to genome contiguously
- spliced alignment e.g., mRNA splices info that’s located at exons
- if read exists entirely within an exon, it’ll be contiguous alignment (one piece)
- if read spans boundary b/w two exons, it needs to be divided (two pieces)
Next generation sequencing alignments
- NGS produces normally distributed fragment lengths
- mapping should have same distance between reads & same orientation
- concordant (properly paired): distance boundary same & same orientation
- non-concordant: reads not mapping in same orientations or distance between reads are too far apart
SAM format
- header lines start with “@”
- @HD: tells what file type is
- @SQ: tells sequence (SN) info & length (LN)
- @PG: tells program version (VN), name (ID), etc
- alignments are broken into 1 line each
- each alignment field is separated with ‘’ (tab)
- Field 1: Read ID - pulled from FASTQ file header
- Field 2: FLAG
- Field 3: scaffold / chromosome ID
- Field 4: Start position of alignment
- Field 5: mapping quality
- Field 6: CIGAR (compressed representation of alignment)
- Field 7: mate’s chromosome location; “=” codes for same; “*" codes for mate not mapped (e.g., concordant)
- Field 8: mate start location
- Field 9: mate distance measured along genomic axis
- Field 10: query sequence
- Field 11: query base quality
- Field 12: tags (program specific)
sample SAM tags
- AS: alignment score
- NM: edit distance to reference
- NH: number of hits
- XS: strand
- HI: hit index for this alignment
SAM field 2: FLAG
- binary represented data
- 0x1: multiple segments (mates)
- 0x2: each segment properly aligned
- 0x4: segment unmapped
- 0x8: next segment unmapped
- 0x10: SEQ is reverse completed in the alignment
- 0x20: SEQ of next segment is reverse complemented
- 0x40: first segment (mate)
- 0x80: last segment (mate)
- 0x100: secondary alignment
- 0x200: not passing quality checks
- 0x400: PCR or optical duplicate
- 0x800: supplementary alignment
- e.g., 0000 0110 0011 (base 2)
- paired, proper pair, mapped, mate mapped
- forward, mate reverse, first in pair, not second (last) in pair,
- passed quality check, not PCR duplicate, not a suppl. alignment
SAM field 6: CIGAR
- sequence of short strings coding editing operations
- M: match (sequence match or subsitution)
- I: insertion to the reference
- D: deletion from the reference
- N: skipped region (intron); very special case
- S: soft clipping (sequence start or end not aligned; seq appears in SEQ)
- H: hard clipping (seq not in SEQ)
- P: padding first segment (mate)
- =: sequence match
- X: sequence mismatch
Recreating Sequences & Features Retrival
NCBI Data Repository
- ncbi.nlm.nih.gov
- example transcript data search
- search for IL-2 homo sapiens mRNA (nucleotide database type)
- click FASTA to get format
- send to file & download in FASTA format
- example RNA-seq data search
- search for strawberry RNA-seq species (SRA - short read archive database type)
- click data file to go to the data table
- download tab -> downloading SRA data using command line utilities
- write down the ID for the data
- use “wget” URL in terminal window
- paste wget command into terminal
- modify URL to be just the ID for the URL
- nohup fastq-dump filename.sra &
- head filename.fastq
- fastq-dump –help (gives info on parameters)
UCSC Genomic Data Repository
- genome.ucsc.edu
- gene prediction data example
- clade: mammal
- genome: human
- assembly: latest version of genome
- group: genes & gene predictions
- track: refseq genes
- table: refGene
- region: position chr22
- output format: GTF
- file -> save as… -> filename.gtf.txt
- output format: BED -> whole gene
- file -> save as… -> filename.bed.txt
- repeats data example
- clade: mammal
- genome: human
- assembly: latest version of genome
- group: repeats
- track: repeatmaster
- position: chr22
- filter: repName -> match: “Alu*"
- output: BED
- file -> save as… -> alus.bed.txt
Exam 3
As part of the effort to catalog genetic variation in the plant Arabidopsis thaliana, you re-sequenced the genome of one strain (‘wu_0_A’; genome file: ‘wu_0.v7.fas’), to determine genetic variants in this organism. The sequencing reads produced are in the file ‘wu_0_A_wgs.fastq’. Using the tools bowtie2, samtools and bcftools, develop a pipeline for variant calling in this genome. NOTE: Genome and re-sequencing data have been obtained and modified from those generated by the 1001 Genomes Project, accession ‘Wu_0_A’.
Apply to questions 1 - 5:
- Generate a bowtie2 index of the wu_0_A genome using bowtie2-build, with the prefix ‘wu_0’.
Apply to questions 6 - 10:
- Run bowtie2 to align the reads to the genome, under two scenarios: first, to report only full-length matches of the reads; and second, to allow partial (local) matches. All other parameters are as set by default.
For the following set of questions (11 - 20), use the set of full-length alignments calculated under scenario 1 only. Convert this SAM file to BAM, then sort the resulting BAM file.
Apply to questions 11 - 15:
Compile candidate sites of variation using SAMtools mpileup for further evaluation with BCFtools. Provide the reference fasta genome and use the option “-uv” to generate the output in uncompressed VCF format for easy examination.
Apply to questions 16 - 20:
Call variants using ‘BCFtools call’ with the multiallelic-caller model. For this, you will need to first re-run SAMtools mpileup with the BCF output format option (‘-g’) to generate the set of candidate sites to be evaluated by BCFtools. In the output to BCFtools, select to show only the variant sites, in uncompressed VCF format for easy examination.
1 - How many sequences were in the genome? 2 - What was the name of the third sequence in the genome file? Give the name only, without the “>” sign. 3 - What was the name of the last sequence in the genome file? Give the name only, without the “>” sign. 4 - How many index files did the operation create? 5 - What is the 3-character extension for the index files created? 6 - How many reads were in the original fastq file? 7 - How many matches (alignments) were reported for: i) the original (full-match) setting? and ii) with the local-match setting? Exclude lines in the file containing unmapped reads. Give these two numbers separated by a space (e.g., 23 53). 8 - How many reads were mapped, in each scenario? Use the format above. 9 - How many reads had multiple matches, in each scenario? Use the format above. You can find this in the bowtie2 summary; note that by default bowtie2 only reports the best match for each read. 10 -How many alignments contained insertions and/or deletions, in each scenario? Use the format above. 11 - How many entries were reported for Chr3? 12 - How many entries have ‘A’ as the corresponding genome letter? 13 - How many entries have exactly 20 supporting reads (read depth)? 14 - How many entries represent indels? 15 - How many entries are reported for position 175672 on Chr1? 16 - How many variants are called on Chr3? 17 - How many variants represent an A->T SNP? If useful, you can use ‘grep –P’ to allow tabular spaces in the search term. 18 - How many entries are indels? 19 - How many entries have precisely 20 supporting reads (read depth)? 20 - What type of variant (i.e., SNP or INDEL) is called at position 11937923 on Chr3?