Command Line Tools Lecture Notes

Module 1: Basic Unix Commands

Unix commands:

  • date
  • echo hello
  • pwd (print working directory)
  • cd (change directory)
  • cd . (current directory)
  • cd .. (parent directory)
  • ls (list files in directory)
  • ls -l (list files in directory + additional file info)
  • ls -lt (list files in directory + additional file info in chron order)
  • man ls (info re: command ls)

File Naming:

  • ls apple.* (lists all files with apple)
  • ls ?pple.genome (returns A/apple.genome)
  • ls [a-z]*.genome (returns all files with extensions .genome)
  • ls p*.genome (returns all files with p.genome in title)
  • ls {pear, peach}.genome (lists all files with pear or peach)

Content Creation:

  • mkdir apple pear peach (makes directories named apple, pear, and peach)
  • ls apple.* (lists all apple. file types)
  • cp apple.* apple (copies all apple file types into apple directory)
  • mv pear.* pear (moves all pear file types into pear directory)
  • rm -i apple.genome (removes apple.genes with a prompt to remove file)
  • rmdir Old.stuff (removes empty directories)
  • rm -r Old.stuff (removes all sub-directories & itself, recursively)

Accessing Content in Unix:

  • more apple.genome (shows file’s content one page at time (space bar))
  • more apple.genome /> (shows file’s content one page at time & looks for next sequence (major page heading)) [fast.a files?]
  • less apple.genome (shows file’s content one page at time and in reverse (space bar + up arrow))
  • head peach.genome (gives first ten lines of file)
  • head -50 peach.genome (gives first 50 lines of file)
  • tail peach.genome (gives last ten lines of file)
  • tail -15 peach.genome (gives last 15 lines of file)
  • cat /.genes (combines all directories genes files)

redirecting content:

  • wc apple/apple.genome (prints file’s #lines; #words; #char; file name)
  • wc -l apple/apple.genome (prints #lines in file)
  • wc -l apple/apple.genome > nlines (stores info into nlines)
  • wc -l < apple/apple.genes (stores info)
  • ls | wc -l (pipes input of wc to be output of ls command)
  • cat /.genome | more (allows us to see one page of file at a time)

querying content:

  • sort months (lists contents of file in alphabetical order)
  • sort -r months (lists contents of file in reverse alphabetical order)
  • vi months (in command prompt text editor)
  • sort -k2 months (sorts file by column 2)
  • sort -kn2 months (sorts file by column 2 numerically)
  • sort -k 2nr months (sorts file by column 2 numerically, in reverse)
  • sort -k 3 -k 2n months (sorts column 3 first then column 2 numerically)
  • cut -f1 months (prints column 1 only)
  • cut -f1-3 months (prints columns 1-3 only)
  • cut -d ‘’ -f1 months (prints column 1 by the defined delimiter (e.g., space))
  • cut -d ‘’ -f3 months > seasons (stores space delimited column 3 into seasons)
  • sort -u seasons (prints unique sorted list)
  • uniq seasons | sort -u (gives unique seaons in alphabetical order)
  • uniq -c seasons (gives number of occurances)
  • grep root /.samples (prints the parents of the path)
  • grep " 12 winter" months (tells where this appears)
  • grep -n “12 winter” months (tells the content of line with the line number)

comparing content:

  • diff orchard orchard.1 (tells all differences b/w files & their locations)
  • comm apple/apple.samples pear/pear.samples
  • comm -1 -2 apple/apple.samples pear/pear.samples (ignores items in lines 1 &2)

archiving content:

  • gzip apple.genome (creates .gz compressed file)
  • gunzip apple.genome.gz (uncompresses .gz file)
  • bzip2 apple.genome (creates .bz2 compressed file [better compression than gzip])
  • bunzip apple.genome.bz2 (uncompresses .bz2 file)
  • tar -cvf Apple.tar apple.genes apple.genome apple.samples (c = compress; v = vost; f = files; combines files into archive)

  • gzip Apple.tar
  • gunzip Apple.tar.gz
  • tar -xvf Apple.tar
  • tar -cvf AppleD.tar apple/
  • bzip2 AppleD.tar
  • tar -xvf AppleD.tar
  • gzip apple.genome
  • apple.genome.gz

Module 2: Sequences & Genomic Features

Microbiology Primer

Genomes

  • one copy lives in nucleus

Eukaryotic gene expression

  • gene goes through transcription
  • Pre-mRNA strand created by transcription
  • Pre-mRNA exons spliced into one mRNA strand (introns removed)
  • mRNA translated into protein

different exon combinations produced during transcription will produce different protein products

  • called splice variants or transcripts

Major computational biology questions to ask:

  • What are the genes?
  • What are the proteins?
  • What are their sequences?
  • Quantify their abundance.
  • What is their role in the cell?

Sequence Representation & Generation

Questions:

  • What do sequences look like?
  • What do they do?
  • How many are there?

Genomic sequences

  • Adenine (A)
  • Guanine (G)
  • Thyamine (T)
  • Cytosine (C)

mRNA sequences

  • Uracil replaces Thyamine (U)

Sequence Generation

  • shear DNA / RNA to roughly same length fragments
  • amplifiy fragments if more material needed
  • sequence from one end only = single-end reads
  • sequence from both ends = paired-end reads
  • Assembly graph
    • node fragments are called reads
    • algorithm needed to put these reads together so that all are included

Sequencing methods

  • Sanger (used until 2008)
    • FASTA / Pearson format
    • 1 line header starts with “>” followed by unique identifier and metadata
    • be careful not to include any blank lines
  • Next Generation Sequencing (NGS; 2008 - present)
    • FASTQ format
    • 4 line record
    • 1st line starts with “@” and ends with “/1” or “/2” to indicate single / paired reads
    • 2nd line is gene sequence; N codes for unknown
    • 3rd line starts with “+”
    • 4th line is base quality score (codes for confidence of matching line 2 gene)

Base quality scores (FASTQ format; line 4)

  • p = probability that call at bse is correct
  • quality value = Q = -10log(base10)p
  • represented by ASCII symbols (33 - 126)
  • values below 20 are low
  • max value is ~ 40

Annotation

Genomic features

  • genome annotation: determine precise location and structure of genomic features along genome
  • genomic features:
    • genes
    • promoters
    • protein binding sites
    • translation start / stop sites
    • SNasel sites
    • et cetera

BED format

  • browser extensible data (BED) format
  • coordinates are 0-based
  • basic format
    • column 1: chromosome / scaffold
    • column 2: start of feature (0-based; count from 0 instead of 1)
    • column 3: end of feature
  • extended format
    • columns 1-3 plus…
    • name: exon position
    • score: value b/w 0 - 1000 to indicate color intensity
    • strand direction (+/-)
    • thick_start: start region of protein coding
    • thick_end: end region of protein coding
    • rgb: (red, green, blue)
    • block coordinates can be included

GTF format

  • genomic transfer format (GTF)
  • each interval feature takes 1 line
  • columns are separeted by “”
  • fields within column 9 separated by " "
  • coordinates are 1-based (count starting with 1)
  • column 1: chromosome / scaffold ID
  • column 2: source (program / website / etc)
  • column 3: type of feature
  • column 4: beginning of feature along axis
  • column 5: end of feature
  • column 6: score (confidence)
  • column 7: strand (forward + / reverse -)
  • column 8: frame (0/1/2)
  • column 9: composite with two fields
    • field 1: gene ID followed by name of gene
    • field 2: transcript ID
    • et cetera…

GFF3 format

  • genomic feature format version 3 (GFF3)
  • column 1: chromosome / scaffold ID
  • column 2: source (cufflinks / URL / etc)
  • column 3: type of feature (mRNA / exon / etc)
  • column 4: start coordinate of feature on genome
  • column 5: end coordinate of feature on genome
  • column 6: score
  • column 7: strand
  • column 8: coding frame (“.” = unknown)
  • column 9: fields (e.g., ID, name, parent cluster)

Alignment

  • sequence a fragment of the gene (RNA) or region (DNA), then map (align) it to the genome
  • Alignment: a mapping between the letters of the two sequences, with some spaces (indels)
  • Alignment adjusts for differences caused by:
    • polymorphisms
    • sequencing errors
    • introns
    • insertion / deletion / substitution

Alignment types

  • contiguous alignment e.g., DNA fragments can be aligned to genome contiguously
  • spliced alignment e.g., mRNA splices info that’s located at exons
    • if read exists entirely within an exon, it’ll be contiguous alignment (one piece)
    • if read spans boundary b/w two exons, it needs to be divided (two pieces)

Next generation sequencing alignments

  • NGS produces normally distributed fragment lengths
    • mapping should have same distance between reads & same orientation
    • concordant (properly paired): distance boundary same & same orientation
    • non-concordant: reads not mapping in same orientations or distance between reads are too far apart

SAM format

  • header lines start with “@”
  • @HD: tells what file type is
  • @SQ: tells sequence (SN) info & length (LN)
  • @PG: tells program version (VN), name (ID), etc
  • alignments are broken into 1 line each
  • each alignment field is separated with ‘’ (tab)
    • Field 1: Read ID - pulled from FASTQ file header
    • Field 2: FLAG
    • Field 3: scaffold / chromosome ID
    • Field 4: Start position of alignment
    • Field 5: mapping quality
    • Field 6: CIGAR (compressed representation of alignment)
    • Field 7: mate’s chromosome location; “=” codes for same; “*" codes for mate not mapped (e.g., concordant)
    • Field 8: mate start location
    • Field 9: mate distance measured along genomic axis
    • Field 10: query sequence
    • Field 11: query base quality
    • Field 12: tags (program specific)

sample SAM tags

  • AS: alignment score
  • NM: edit distance to reference
  • NH: number of hits
  • XS: strand
  • HI: hit index for this alignment

SAM field 2: FLAG

  • binary represented data
  • 0x1: multiple segments (mates)
  • 0x2: each segment properly aligned
  • 0x4: segment unmapped
  • 0x8: next segment unmapped
  • 0x10: SEQ is reverse completed in the alignment
  • 0x20: SEQ of next segment is reverse complemented
  • 0x40: first segment (mate)
  • 0x80: last segment (mate)
  • 0x100: secondary alignment
  • 0x200: not passing quality checks
  • 0x400: PCR or optical duplicate
  • 0x800: supplementary alignment
  • e.g., 0000 0110 0011 (base 2)
    • paired, proper pair, mapped, mate mapped
    • forward, mate reverse, first in pair, not second (last) in pair,
    • passed quality check, not PCR duplicate, not a suppl. alignment

SAM field 6: CIGAR

  • sequence of short strings coding editing operations
  • M: match (sequence match or subsitution)
  • I: insertion to the reference
  • D: deletion from the reference
  • N: skipped region (intron); very special case
  • S: soft clipping (sequence start or end not aligned; seq appears in SEQ)
  • H: hard clipping (seq not in SEQ)
  • P: padding first segment (mate)
  • =: sequence match
  • X: sequence mismatch

Recreating Sequences & Features Retrival

NCBI Data Repository

  • ncbi.nlm.nih.gov
  • example transcript data search
    • search for IL-2 homo sapiens mRNA (nucleotide database type)
    • click FASTA to get format
    • send to file & download in FASTA format
  • example RNA-seq data search
    • search for strawberry RNA-seq species (SRA - short read archive database type)
    • click data file to go to the data table
    • download tab -> downloading SRA data using command line utilities
    • write down the ID for the data
    • use “wget” URL in terminal window
    • paste wget command into terminal
    • modify URL to be just the ID for the URL
    • nohup fastq-dump filename.sra &
    • head filename.fastq
    • fastq-dump –help (gives info on parameters)

UCSC Genomic Data Repository

  • genome.ucsc.edu
  • gene prediction data example
    • clade: mammal
    • genome: human
    • assembly: latest version of genome
    • group: genes & gene predictions
    • track: refseq genes
    • table: refGene
    • region: position chr22
    • output format: GTF
    • file -> save as… -> filename.gtf.txt
    • output format: BED -> whole gene
    • file -> save as… -> filename.bed.txt
  • repeats data example
    • clade: mammal
    • genome: human
    • assembly: latest version of genome
    • group: repeats
    • track: repeatmaster
    • position: chr22
    • filter: repName -> match: “Alu*"
    • output: BED
    • file -> save as… -> alus.bed.txt