Command Line Tools Lecture Notes

Module 1: Basic Unix Commands

Unix commands:

date
echo hello
pwd (print working directory)
cd (change directory)
cd . (current directory)
cd .. (parent directory)
ls (list files in directory)
ls -l (list files in directory + additional file info)
ls -lt (list files in directory + additional file info in chron order)
man ls (info re: command ls)

File Naming:

ls apple.* (lists all files with apple)
ls ?pple.genome (returns A/apple.genome)
ls [a-z]*.genome (returns all files with extensions .genome)
ls p*.genome (returns all files with p.genome in title)
ls {pear, peach}.genome (lists all files with pear or peach)

Content Creation:

mkdir apple pear peach (makes directories named apple, pear, and peach)
ls apple.* (lists all apple. file types)
cp apple.* apple (copies all apple file types into apple directory)
mv pear.* pear (moves all pear file types into pear directory)
rm -i apple.genome (removes apple.genes with a prompt to remove file)
rmdir Old.stuff (removes empty directories)
rm -r Old.stuff (removes all sub-directories & itself, recursively)

Accessing Content in Unix:

more apple.genome (shows file’s content one page at time (space bar))
more apple.genome /> (shows file’s content one page at time & looks for next sequence (major page heading)) [fast.a files?]
less apple.genome (shows file’s content one page at time and in reverse (space bar + up arrow))
head peach.genome (gives first ten lines of file)
head -50 peach.genome (gives first 50 lines of file)
tail peach.genome (gives last ten lines of file)
tail -15 peach.genome (gives last 15 lines of file)
cat /.genes (combines all directories genes files)

redirecting content:

wc apple/apple.genome (prints file’s #lines; #words; #char; file name)
wc -l apple/apple.genome (prints #lines in file)
wc -l apple/apple.genome > nlines (stores info into nlines)
wc -l < apple/apple.genes (stores info)
ls | wc -l (pipes input of wc to be output of ls command)
cat /.genome | more (allows us to see one page of file at a time)

querying content:

sort months (lists contents of file in alphabetical order)
sort -r months (lists contents of file in reverse alphabetical order)
vi months (in command prompt text editor)
sort -k2 months (sorts file by column 2)
sort -kn2 months (sorts file by column 2 numerically)
sort -k 2nr months (sorts file by column 2 numerically, in reverse)
sort -k 3 -k 2n months (sorts column 3 first then column 2 numerically)
cut -f1 months (prints column 1 only)
cut -f1-3 months (prints columns 1-3 only)
cut -d ‘’ -f1 months (prints column 1 by the defined delimiter (e.g., space))
cut -d ‘’ -f3 months > seasons (stores space delimited column 3 into seasons)
sort -u seasons (prints unique sorted list)
uniq seasons | sort -u (gives unique seaons in alphabetical order)
uniq -c seasons (gives number of occurances)
grep root /.samples (prints the parents of the path)
grep " 12 winter" months (tells where this appears)
grep -n “12 winter” months (tells the content of line with the line number)

comparing content:

diff orchard orchard.1 (tells all differences b/w files & their locations)
comm apple/apple.samples pear/pear.samples
comm -1 -2 apple/apple.samples pear/pear.samples (ignores items in lines 1 &2)

archiving content:

gzip apple.genome (creates .gz compressed file)
gunzip apple.genome.gz (uncompresses .gz file)
bzip2 apple.genome (creates .bz2 compressed file [better compression than gzip])
bunzip apple.genome.bz2 (uncompresses .bz2 file)
tar -cvf Apple.tar apple.genes apple.genome apple.samples (c = compress; v = vost; f = files; combines files into archive)
gzip Apple.tar
gunzip Apple.tar.gz
tar -xvf Apple.tar
tar -cvf AppleD.tar apple/
bzip2 AppleD.tar
tar -xvf AppleD.tar
gzip apple.genome
apple.genome.gz

Module 2: Sequences & Genomic Features

Microbiology Primer

Genomes

one copy lives in nucleus

Eukaryotic gene expression

gene goes through transcription
Pre-mRNA strand created by transcription
Pre-mRNA exons spliced into one mRNA strand (introns removed)
mRNA translated into protein

different exon combinations produced during transcription will produce different protein products

called splice variants or transcripts

Major computational biology questions to ask:

What are the genes?
What are the proteins?
What are their sequences?
Quantify their abundance.
What is their role in the cell?

Sequence Representation & Generation

Questions:

What do sequences look like?
What do they do?
How many are there?

Genomic sequences

Adenine (A)
Guanine (G)
Thyamine (T)
Cytosine (C)

mRNA sequences

Uracil replaces Thyamine (U)

Sequence Generation

shear DNA / RNA to roughly same length fragments
amplifiy fragments if more material needed
sequence from one end only = single-end reads
sequence from both ends = paired-end reads
Assembly graph
- node fragments are called reads
- algorithm needed to put these reads together so that all are included

Sequencing methods

Sanger (used until 2008)
- FASTA / Pearson format
- 1 line header starts with “>” followed by unique identifier and metadata
- be careful not to include any blank lines
Next Generation Sequencing (NGS; 2008 - present)
- FASTQ format
- 4 line record
- 1st line starts with “@” and ends with “/1” or “/2” to indicate single / paired reads
- 2nd line is gene sequence; N codes for unknown
- 3rd line starts with “+”
- 4th line is base quality score (codes for confidence of matching line 2 gene)

Base quality scores (FASTQ format; line 4)

p = probability that call at bse is correct
quality value = Q = -10log(base10)p
represented by ASCII symbols (33 - 126)
values below 20 are low
max value is ~ 40

Annotation

Genomic features

genome annotation: determine precise location and structure of genomic features along genome
genomic features:
- genes
- promoters
- protein binding sites
- translation start / stop sites
- SNasel sites
- et cetera

BED format

browser extensible data (BED) format
coordinates are 0-based
basic format
- column 1: chromosome / scaffold
- column 2: start of feature (0-based; count from 0 instead of 1)
- column 3: end of feature
extended format
- columns 1-3 plus…
- name: exon position
- score: value b/w 0 - 1000 to indicate color intensity
- strand direction (+/-)
- thick_start: start region of protein coding
- thick_end: end region of protein coding
- rgb: (red, green, blue)
- block coordinates can be included

GTF format

genomic transfer format (GTF)
each interval feature takes 1 line
columns are separeted by “”
fields within column 9 separated by " "
coordinates are 1-based (count starting with 1)
column 1: chromosome / scaffold ID
column 2: source (program / website / etc)
column 3: type of feature
column 4: beginning of feature along axis
column 5: end of feature
column 6: score (confidence)
column 7: strand (forward + / reverse -)
column 8: frame (0/1/2)
column 9: composite with two fields
- field 1: gene ID followed by name of gene
- field 2: transcript ID
- et cetera…

GFF3 format

genomic feature format version 3 (GFF3)
column 1: chromosome / scaffold ID
column 2: source (cufflinks / URL / etc)
column 3: type of feature (mRNA / exon / etc)
column 4: start coordinate of feature on genome
column 5: end coordinate of feature on genome
column 6: score
column 7: strand
column 8: coding frame (“.” = unknown)
column 9: fields (e.g., ID, name, parent cluster)

Alignment

sequence a fragment of the gene (RNA) or region (DNA), then map (align) it to the genome
Alignment: a mapping between the letters of the two sequences, with some spaces (indels)
Alignment adjusts for differences caused by:
- polymorphisms
- sequencing errors
- introns
- insertion / deletion / substitution

Alignment types

contiguous alignment e.g., DNA fragments can be aligned to genome contiguously
spliced alignment e.g., mRNA splices info that’s located at exons
- if read exists entirely within an exon, it’ll be contiguous alignment (one piece)
- if read spans boundary b/w two exons, it needs to be divided (two pieces)

Next generation sequencing alignments

NGS produces normally distributed fragment lengths
- mapping should have same distance between reads & same orientation
- concordant (properly paired): distance boundary same & same orientation
- non-concordant: reads not mapping in same orientations or distance between reads are too far apart

SAM format

header lines start with “@”
@HD: tells what file type is
@SQ: tells sequence (SN) info & length (LN)
@PG: tells program version (VN), name (ID), etc
alignments are broken into 1 line each
each alignment field is separated with ‘’ (tab)
- Field 1: Read ID - pulled from FASTQ file header
- Field 2: FLAG
- Field 3: scaffold / chromosome ID
- Field 4: Start position of alignment
- Field 5: mapping quality
- Field 6: CIGAR (compressed representation of alignment)
- Field 7: mate’s chromosome location; “=” codes for same; “*" codes for mate not mapped (e.g., concordant)
- Field 8: mate start location
- Field 9: mate distance measured along genomic axis
- Field 10: query sequence
- Field 11: query base quality
- Field 12: tags (program specific)

sample SAM tags

AS: alignment score
NM: edit distance to reference
NH: number of hits
XS: strand
HI: hit index for this alignment

SAM field 2: FLAG

binary represented data
0x1: multiple segments (mates)
0x2: each segment properly aligned
0x4: segment unmapped
0x8: next segment unmapped
0x10: SEQ is reverse completed in the alignment
0x20: SEQ of next segment is reverse complemented
0x40: first segment (mate)
0x80: last segment (mate)
0x100: secondary alignment
0x200: not passing quality checks
0x400: PCR or optical duplicate
0x800: supplementary alignment
e.g., 0000 0110 0011 (base 2)
- paired, proper pair, mapped, mate mapped
- forward, mate reverse, first in pair, not second (last) in pair,
- passed quality check, not PCR duplicate, not a suppl. alignment

SAM field 6: CIGAR

sequence of short strings coding editing operations
M: match (sequence match or subsitution)
I: insertion to the reference
D: deletion from the reference
N: skipped region (intron); very special case
S: soft clipping (sequence start or end not aligned; seq appears in SEQ)
H: hard clipping (seq not in SEQ)
P: padding first segment (mate)
=: sequence match
X: sequence mismatch

Recreating Sequences & Features Retrival

NCBI Data Repository

ncbi.nlm.nih.gov
example transcript data search
- search for IL-2 homo sapiens mRNA (nucleotide database type)
- click FASTA to get format
- send to file & download in FASTA format
example RNA-seq data search
- search for strawberry RNA-seq species (SRA - short read archive database type)
- click data file to go to the data table
- download tab -> downloading SRA data using command line utilities
- write down the ID for the data
- use “wget” URL in terminal window
- paste wget command into terminal
- modify URL to be just the ID for the URL
- nohup fastq-dump filename.sra &
- head filename.fastq
- fastq-dump –help (gives info on parameters)

UCSC Genomic Data Repository

genome.ucsc.edu
gene prediction data example
- clade: mammal
- genome: human
- assembly: latest version of genome
- group: genes & gene predictions
- track: refseq genes
- table: refGene
- region: position chr22
- output format: GTF
- file -> save as… -> filename.gtf.txt
- output format: BED -> whole gene
- file -> save as… -> filename.bed.txt
repeats data example
- clade: mammal
- genome: human
- assembly: latest version of genome
- group: repeats
- track: repeatmaster
- position: chr22
- filter: repName -> match: “Alu*"
- output: BED
- file -> save as… -> alus.bed.txt

Cmd Line Tools Lecture Notes

Jeanna Clark

August 3, 2015