Phylogenetics

Lycett et al. found a contradiction between behavioral data and genetic data

Accession #: unique ID for a sequence. Can search for the accession number on the NCBI website
PopSet through NCBI – lists all of the accession numbers for a given paper

NCBI can tell you where the organism is from, the organelle that the genome is from, the name, etc.

NCBI links to the BLAST page. Can BLAST in reference to a query sequence, e.g. bonobo. Can also limit the number of results returned

Week 8

BLAST - Biotechnology

Compares amino acid (protein) sequences or nucleotide sequences of DNA / RNA to other known sequences

Looks for sequences that are similar to the query

Looking for similarities may prove helpful for identifying the function of a gene if it is known in another organism

Multiple Sequence Alignment

Aligns 3+ sequences of genes to show homologous nucleotides

Required for molecular phylogenetics

By identifying homologous regions, we can isolate important coding sequences that don’t really change

Also useful in designing PCR primers

Bioinformatics

Combination of fields to analyze biological data

Used to identify genes and SNPs, and is useful in understanding diseases and unique adaptations

In Silico

Performed on a computer or via computer simulation

Often used in biology

Models real-life scenarios on biological organisms

Protein Domain

Compares sequence motifs and structural motifs

Protein Domain - conserved part of a protein’s sequence and tertiary structure

Can evolve, function, and exist independently of the rest of the protein chain

Domains may appear in multiple different proteins - fold independently of the rest of the protein chain

Domains can be rearranged

Domains typically have a function, which is why they are conserved

Sequence Motif

Nucleotide / amino acid sequence that has / is believed to have biological significance

A sequence motif differs from a structural motif because it doesn’t describe a three-dimensional arrangement

Sequence Annotation

DNA Annotation

Determining location and function of genes

Annotation is a note that explains the findings

Annotation includes

  • genomic position
  • intron and exon boundaries
  • regulatory sequences
  • repeats

Steps

  • Identify introns (non-coding)
  • Gene prediction – elements of the genome
  • Attach biological info to elements

Structural Annotation – looks for genomic elements

  • ORFs (open reading frames)
  • gene structure
  • coding regions
  • location of regulatory motifs

Functional Annotation – attaches biological information to genomic elements

  • Biochemical function
  • Biological function
  • Involved regulation and interactions
  • Expression

May use biological and in silico analysis

Homology

Similarity due to shared ancestry between structures or genes in different taxa

Homologous Structures – different purposes due to descent with modification from a common ancestor

Sequence Homology – similar protein or DNA sequence as a result of shared ancestry

Shared ancestry either through a speciation event or a gene duplication event

Homology is inferred from sequence similarity that stems from a common ancestor

Conserved Sequences

Conserved Sequence – similar in nucleic acids / proteins across species or within genome

Highly Conserved – a sequence that has remained relatively unchanged far back up the phylogenetic tree

  • RNA components
  • Ribosomes

Mechanisms

Change can occur as a result of

  • Insertions and deletions
  • Recombination / deletion due to chromosomal rearrangements

Conserved sequences persist even if the above occur because, if an organism had the above in an important region, it likely wouldn’t be able to survive

Selection pressures determine extent of conservation… how strongly the environment selects for the specific functionality of that gene

Coding Sequences

Mutations in amino acid and nucleic acid sequences differ because of silent mutations: a silent mutation can affect the nucleic acid sequence but not the amino acid sequence

May be replacements with amino acids that maintain the 3D structure
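A minimal sketch of the silent vs. non-silent distinction, in Python. The codon table below is deliberately partial (only the four codons used here, taken from the standard genetic code), and the function name `classify_substitution` is made up for illustration.

```python
# Partial codon table: only the codons needed for this example.
# The full standard genetic code has 64 entries.
CODON_TABLE = {
    "GCU": "Ala", "GCC": "Ala",  # two synonymous codons for alanine
    "GAU": "Asp", "GAA": "Glu",  # aspartate vs. glutamate
}

def classify_substitution(codon_before: str, codon_after: str) -> str:
    """Label a single-codon change as 'no change', 'silent' or 'missense'."""
    if codon_before == codon_after:
        return "no change"
    aa_before = CODON_TABLE[codon_before]
    aa_after = CODON_TABLE[codon_after]
    # Silent: the nucleotides differ but the encoded amino acid does not.
    return "silent" if aa_before == aa_after else "missense"

# The nucleotide sequence changes, the amino acid stays alanine.
print(classify_substitution("GCU", "GCC"))
# Asp -> Glu: the amino acid changes, but both are acidic residues,
# the kind of replacement that may preserve the 3D structure.
print(classify_substitution("GAU", "GAA"))
```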

Non-Coding

Usually less conserved than coding RNAs

Structure and Function (Coding) usually conserved more than Non-Coding

Identification

Conserved sequences are identified using bioinformatics based on sequence alignment

Can be used with high-throughput DNA sequencing and protein mass spectrometry data

Multiple Sequence Alignment

This form of analysis shows letters at increasing sizes as a visual representation of the level of conservation (a sequence logo)
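One rough way to put a number on "level of conservation" is to ask, for each column of the alignment, what fraction of sequences share the most common letter. The sketch below does that for a made-up toy alignment; real sequence logos usually scale letters by information content (bits) rather than by this simple fraction, so treat it as an approximation of the idea.

```python
from collections import Counter

# Hypothetical toy alignment; every row has the same length.
msa = ["ACGTA", "ACGTC", "ACGAA", "ACTTA"]

def column_conservation(alignment):
    """Return (most common letter, fraction of rows sharing it) per column."""
    result = []
    for column in zip(*alignment):  # zip(*rows) walks the alignment column by column
        letter, count = Counter(column).most_common(1)[0]
        result.append((letter, count / len(column)))
    return result

conservation = column_conservation(msa)
# Joining the most common letter of each column gives a simple consensus sequence.
consensus = "".join(letter for letter, _ in conservation)

for position, (letter, fraction) in enumerate(conservation, start=1):
    print(position, letter, fraction)
print("consensus:", consensus)
```

Columns with a fraction of 1.0 are fully conserved in this toy data; lower fractions mark more variable sites.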

Genome Alignment

WGA - whole genome alignments

Difficult to do with larger organisms due to the computational complexity

Can be done with multiple smaller organisms

Applications

Medical Research

Highly conserved sequences – important biological functions – useful in identifying genetic diseases

Congenital metabolic disorders and lysosomal storage diseases are a result of changes to conserved genes – missing / faulty enzymes

Functional Annotation

Identifying conserved sequences can allow for functional prediction

Sequence Alignment

Arranging the sequences of a gene that is shared between species and comparing the amino acids to look for conserved and non-conserved domains

Interpretation

If sequences have a common ancestor, the mismatches can be shown as point mutations and gaps as indels
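To make this interpretation concrete, here is a small Python sketch that walks two already-aligned sequences column by column and counts matches, mismatches (candidate point mutations) and gap columns (candidate indels). The sequences are made up, and the percent identity uses the full alignment length as the denominator, which is only one of several conventions.

```python
def summarize_alignment(seq_a: str, seq_b: str) -> dict:
    """Count matches, mismatches and gap columns in a pairwise alignment."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    matches = mismatches = gaps = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            gaps += 1          # a hyphen marks an insertion or deletion
        elif a == b:
            matches += 1
        else:
            mismatches += 1    # different letters: a point mutation
    return {
        "matches": matches,
        "mismatches": mismatches,
        "gaps": gaps,
        # Assumption: percent identity = matches / alignment length.
        "pid": 100 * matches / len(seq_a),
    }

# Hypothetical aligned pair with one mismatch and two gap columns.
summary = summarize_alignment("ACGT-ACGT", "ACCTTAC-T")
print(summary)
```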

Global alignments – force the alignment to span the entire length of the sequences

Local alignments – find and align smaller regions that match well … better matches … e.g. sequences that are 50% similar overall, but the parts that are similar have a 100% conservation rate

Pairwise alignment

Used on only two query sequences at a time to find the best-matching piecewise (local or global) alignment

Efficient … used when extreme precision is not required

3 main methods

  • dot-matrix methods
  • dynamic programming
  • word methods

These methods have difficulty with highly repetitive sequences
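Of the three methods, dynamic programming is the easiest to show in a few lines. The sketch below computes the score of the best global alignment in the style of Needleman-Wunsch. The scoring values (+1 match, -1 mismatch, -1 per gap position) are assumptions chosen for illustration, not values from the course, and the function returns only the score, not the alignment itself.

```python
def global_alignment_score(a: str, b: str, match=1, mismatch=-1, gap=-1) -> int:
    """Score of the best global alignment of a and b (simple linear gap cost)."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per character.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            diagonal = score[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            up = score[i - 1][j] + gap      # gap in b
            left = score[i][j - 1] + gap    # gap in a
            score[i][j] = max(diagonal, up, left)
    return score[-1][-1]

print(global_alignment_score("GATTACA", "GATTACA"))  # identical sequences
print(global_alignment_score("ACGT", "AGT"))         # one deletion needed
```

The table has one cell per pair of prefixes, so time and memory grow with the product of the two sequence lengths.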

MUM - maximal unique match – longest subsequence that occurs in both query sequences

Longer MUM - closer relatedness

Multiple Sequence Alignment

3+ queries that are assumed to come from a common ancestor

MSA can be used to infer sequence homology and allows for phylogenetic analysis to see shared evolutionary origins

Point mutations – different characters

Indels – hyphens

Often used to assess 3D structure of a protein by looking at changes in the underlying amino acid sequence

Heuristics used to maximize scores – heuristics give insight into evolutionary process

Using heuristics means that there is a higher likelihood of errors

More sequences introduce more error because there are more indels that could mess up the algorithm

MSA used for phylogenetics – compares important, highly conserved regions between species to see how closely related they are and can be used in an evolutionary setting

Indel

Mutation that involves an insertion or a deletion of one or more bases in the genome. Can result in a frameshift

Point Mutation – replaces a single nucleotide without changing the number of nucleotides in the gene

If the indel length is a multiple of 3, there will be no frameshift
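A quick way to see the multiple-of-3 rule is to split a sequence into codons before and after a deletion. The sequence below is made up, and the helper names `causes_frameshift` and `codons` are illustrative only.

```python
def causes_frameshift(indel_length: int) -> bool:
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    if indel_length <= 0:
        raise ValueError("indel length must be positive")
    return indel_length % 3 != 0

def codons(seq: str) -> list:
    """Split a sequence into complete codons, dropping any trailing partial codon."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]

original = "ATGGCCAAATGA"                    # hypothetical: ATG GCC AAA TGA
one_deleted = original[:3] + original[4:]    # delete 1 base after the first codon
three_deleted = original[:3] + original[6:]  # delete a whole codon

print(codons(original))
print(codons(one_deleted))    # every downstream codon changes
print(codons(three_deleted))  # downstream codons are unchanged
```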

Phylogenetic Tree

Shows evolutionary relationships between species – gives phylogeny based on similarities and differences in physical and genetic characteristics

Rooted Phylogenetic Tree - the root node is the most recent common ancestor of all the leaves – branch lengths can be time estimates since divergence from the common ancestor. Often use an outgroup to root

Unrooted trees – show relatedness of leaf nodes, don’t require knowledge of the ancestral root. Relatedness without ancestry. Can convert a rooted tree to an unrooted one by omitting the root

Construction

Distance Matrix Methods – neighbor joining / UPGMA. Genetic distance calculated from multiple sequence alignments. Simplest. Do not invoke an evolutionary model

Maximum Parsimony – implies an implicit model of evolution

Optimality criterion of maximum likelihood – often within a Bayesian framework – applies an explicit model of evolution to tree estimation

Genetic Distance

Measure of genetic divergence between species
Populations with similar alleles = smaller genetic distance – closely related populations have a small distance

Computational Phylogenetics

Applying algorithms to phylogenetic analysis with the goal of assembling a phylogenetic tree showing the hypothesized relationships between genes, species, or taxa

Data can be morphological, molecular, or genetic. MSA used for molecular and genetic data

Types of Phylogenetic Trees

Parsimony – used for morphological data but not really for genetic data… tries to minimize the number of evolutionary steps

Genetic Data Phylogeny

Must start with MSA

Genetic Distance Approach

Converts MSA to distance matrix. Simplest: just count the differences in the MSA

Computationally cheap… take MSA, convert to basic counts of similarities and differences
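The counting approach can be sketched in a few lines of Python. The three sequences are made up; they were chosen only so that the pairwise counts come out to 2, 6 and 4, the bonobo / chimp / human values used in the worked matrix in these notes. A gap aligned against a base is counted as a difference here, which is an assumption rather than a rule.

```python
from itertools import combinations

# Hypothetical aligned sequences (same length), not real data.
msa = {
    "Bonobo": "AAAAAAAA",
    "Chimp":  "CCAAAAAA",
    "Human":  "CCGGGGAA",
}

def count_differences(seq_a: str, seq_b: str) -> int:
    """Number of alignment columns where the two sequences differ."""
    return sum(1 for a, b in zip(seq_a, seq_b) if a != b)

def distance_matrix(alignment: dict) -> dict:
    """Symmetric matrix of pairwise difference counts, with 0 on the diagonal."""
    names = list(alignment)
    dist = {row: {col: 0 for col in names} for row in names}
    for a, b in combinations(names, 2):
        d = count_differences(alignment[a], alignment[b])
        dist[a][b] = d
        dist[b][a] = d  # mirror the value across the diagonal
    return dist

dist = distance_matrix(msa)
for name, row in dist.items():
    print(name, row)
```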

Molecular evolution (model based phylogenetic methods)

Model how species evolve and use this to build a tree

Computationally expensive

BLAST uses a hybrid called minimum evolution

Evolutionary Distance and Distance Matrices

Distance-based methods convert the MSA to a distance matrix. Simplest: convert the MSA to counts of similarities and differences, then convert these to distances

Distance-based methods use pairwise differences to give pairwise distances, i.e. the distance / difference between each pair of sequences

This is the one with a box comparing species to each other. The diagonals get 0’s because each sequence is identical to itself

       | Bonobo | Chimp | Human
Bonobo |   0    |       |
Chimp  |        |   0   |
Human  |        |       |   0

Compare Chimp and Bonobo… note the differences

       | Bonobo | Chimp | Human
Bonobo |   0    |   2   |
Chimp  |   2    |   0   |
Human  |        |       |   0

Bonobo versus Human

       | Bonobo | Chimp | Human
Bonobo |   0    |   2   |   6
Chimp  |   2    |   0   |
Human  |   6    |       |   0

Chimp Versus Human

       | Bonobo | Chimp | Human
Bonobo |   0    |   2   |   6
Chimp  |   2    |   0   |   4
Human  |   6    |   4   |   0

The above is a symmetrical matrix… the two sides of the diagonal are mirror images

We can get rid of the info on one side and use the space for other information, e.g. similarities or percentage difference rather than count
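Because the matrix is symmetrical, only one triangle carries unique information: with n sequences there are n(n-1)/2 unique pairs. The sketch below lists those pairs and converts the counts from the matrix above (2, 6, 4) into percentage differences. The alignment length of 8 is an assumption made up for the example, since the notes do not give one.

```python
from itertools import combinations

def unique_pairs(n: int) -> int:
    """Number of unique pairwise comparisons among n sequences."""
    return n * (n - 1) // 2

names = ["Bonobo", "Chimp", "Human"]

# Upper triangle only: one entry per unique pair (counts from the notes).
counts = {
    ("Bonobo", "Chimp"): 2,
    ("Bonobo", "Human"): 6,
    ("Chimp", "Human"): 4,
}

alignment_length = 8  # assumption: hypothetical alignment length

# Same information expressed as a percentage instead of a raw count.
percent_difference = {
    pair: 100 * count / alignment_length for pair, count in counts.items()
}

print(unique_pairs(len(names)), "unique pairs:", list(combinations(names, 2)))
print(percent_difference)
```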

Phylogeny uses a distance matrix made of pairwise distances based on differences

Distance Matrix Methods

Use genetic distance between the sequences in question… use MSA as input

Distance matrix can be used to construct rooted or unrooted trees

Neighbor Joining

Applies data clustering to the sequences, using genetic distance as the clustering metric

Makes unrooted trees

Doesn’t assume a constant rate of evolution
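Neighbor joining does not simply join the closest pair. At each step it computes a corrected value, usually written Q(i, j) = (n - 2) d(i, j) - sum of row i - sum of row j, and joins the pair with the smallest Q. The sketch below shows only that first selection step on a made-up five-taxon distance matrix; a full implementation would then compute branch lengths, shrink the matrix and repeat.

```python
from itertools import combinations

# Hypothetical symmetric distance matrix for five taxa.
taxa = ["a", "b", "c", "d", "e"]
dist = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]

def q_value(i: int, j: int, d: list) -> int:
    """Neighbor-joining Q value for taxa i and j."""
    n = len(d)
    return (n - 2) * d[i][j] - sum(d[i]) - sum(d[j])

def first_pair_to_join(names: list, d: list) -> tuple:
    """Pair of taxa with the lowest Q value (the first pair NJ would join)."""
    i, j = min(combinations(range(len(names)), 2),
               key=lambda pair: q_value(pair[0], pair[1], d))
    return names[i], names[j]

print(first_pair_to_join(taxa, dist))
```

In this toy matrix d and e have the smallest raw distance (3), yet a and b have the lowest Q value and are joined first. The correction accounts for how far each taxon is from everything else, which is why no constant rate of evolution has to be assumed.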

UPGMA and WPGMA

UPGMA - Unweighted Pair Group Method with Arithmetic Mean
WPGMA - Weighted Pair Group Method with Arithmetic Mean

Produce rooted trees

Require constant-rate assumption – assumes the distances from the root to every branch tip are equal

Fitch - Margoliash Method

Weighted least squares method based on genetic distance. Closely related sequences are given greater weight. This compensates for the increased inaccuracy in measuring distances between distantly related sequences

Cluster Analysis

Grouping objects so that objects in the same cluster are more similar to each other than to objects in other clusters

Form of data mining

Used in ecology, transcriptomics, and sequence analysis

Hierarchical Clustering

Builds a hierarchy of clusters, merging the clusters with the greatest similarity between them

Cluster dissimilarity / similarity

A certain level of dissimilarity or similarity allows for combining of clusters

Creating a Phylogenetic Tree – YouTube Video

Steps

  • Sequence Alignment
  • Multiple Sequence Alignment
  • Distance matrix
  • UPGMA

Shows inferred evolutionary relationship between a set of organisms

Shows descent with modification

Sequences separated by shorter evolutionary distances are expected to be more similar

Distance Matrix Method 1

Start with a multiple alignment, construct a pairwise difference matrix, and construct a tree from the pairwise distances

Branch length indicates the number of mutations that have occurred on that branch

Patristic distance… this is the one that kind of looks like an X with a space between the two halves… here you add the length of the branch from the first tip to its internal node, the length of the middle branch, and the length of the branch from the other internal node to the second tip
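The patristic distance is just the sum of branch lengths along the path between two tips. The sketch below encodes a made-up four-tip unrooted tree (tips A and B hang off internal node X, tips C and D hang off internal node Y, and X and Y are joined by the middle branch) and adds up the branches on the path. All names and branch lengths are hypothetical.

```python
# Hypothetical branch lengths for an unrooted four-tip tree.
edges = {
    ("A", "X"): 1.0,
    ("B", "X"): 2.0,
    ("X", "Y"): 3.0,  # the middle branch
    ("C", "Y"): 4.0,
    ("D", "Y"): 5.0,
}

def build_adjacency(edge_lengths: dict) -> dict:
    """Turn the edge list into a node -> {neighbor: branch length} mapping."""
    adjacency = {}
    for (u, v), length in edge_lengths.items():
        adjacency.setdefault(u, {})[v] = length
        adjacency.setdefault(v, {})[u] = length
    return adjacency

def patristic_distance(adjacency: dict, start: str, end: str) -> float:
    """Sum of branch lengths on the unique path between two nodes of a tree."""
    stack = [(start, None, 0.0)]
    while stack:
        node, parent, total = stack.pop()
        if node == end:
            return total
        for neighbor, length in adjacency[node].items():
            if neighbor != parent:  # a tree has no cycles, so never step back
                stack.append((neighbor, node, total + length))
    raise ValueError("no path between the two nodes")

adjacency = build_adjacency(edges)
print(patristic_distance(adjacency, "A", "C"))  # tip branch + middle branch + tip branch
```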

Distance Matrix Method 2

UPGMA – distance based phylogeny method

Simple, can be done by hand, distance based
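A compact sketch of UPGMA in Python, run on the bonobo / chimp / human matrix from these notes (distances 2, 6 and 4). It repeatedly merges the closest pair of clusters, places the new node at half the merge distance, and replaces the two old rows with a size-weighted average. The function name, the Newick-style output and the data layout are choices made for this example, not a reference implementation.

```python
from itertools import combinations

def upgma(dist: dict):
    """Return (Newick-style string, list of merges) for a dict-of-dicts distance matrix."""
    size = {name: 1 for name in dist}        # number of tips in each cluster
    height = {name: 0.0 for name in dist}    # node height (tips sit at 0)
    d = {frozenset(p): dist[p[0]][p[1]] for p in combinations(dist, 2)}
    merges = []
    while len(size) > 1:
        pair = min(d, key=d.get)             # closest pair of clusters
        a, b = sorted(pair)
        new_height = d[pair] / 2             # new node sits halfway between them
        label = f"({a}:{new_height - height[a]},{b}:{new_height - height[b]})"
        merges.append((a, b, d[pair]))
        # Distance from the merged cluster to each remaining cluster:
        # average of the old distances, weighted by cluster size.
        for other in size:
            if other in pair:
                continue
            d[frozenset((label, other))] = (
                size[a] * d[frozenset((a, other))]
                + size[b] * d[frozenset((b, other))]
            ) / (size[a] + size[b])
        # Drop every entry that still refers to the two merged clusters.
        d = {key: value for key, value in d.items() if a not in key and b not in key}
        size[label] = size.pop(a) + size.pop(b)
        height[label] = new_height
    return next(iter(size)) + ";", merges

matrix = {
    "Bonobo": {"Bonobo": 0, "Chimp": 2, "Human": 6},
    "Chimp":  {"Bonobo": 2, "Chimp": 0, "Human": 4},
    "Human":  {"Bonobo": 6, "Chimp": 4, "Human": 0},
}
newick, merges = upgma(matrix)
print(newick)
```

Bonobo and chimp merge first at distance 2; the merged cluster is then (6 + 4) / 2 = 5 away from human, so the root sits at height 2.5 and every tip ends up the same distance from it.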

Dendrogram

Diagram representing a tree

Used in

  • Hierarchical Clustering
  • Computational Biology
  • Phylogenetics

Clade – monophyletic group

Group of organisms made up of a common ancestor and all of its descendants

Clades can be stacked, i.e. clades can be as broad or as shallow as we want. A grouping is a clade as long as it contains all descendants of the common ancestor

A clade is different from a taxon: a taxon is not necessarily monophyletic

Cladograms – phylogenetic trees of a single clade

Terminology

Nested clade– clade within a clade

Sisters – clades are sisters if they have an immediate common ancestor

Symmetric Matrix

Refer to the distance matrix: the entry for row i, column j equals the entry for row j, column i

Similarity

Similarity measure … quantifies similarity between two objects

Inverse of distance metrics

Euclidean Distance

Straight-line distance between two points in Euclidean space
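For reference, a minimal Python version of the straight-line distance, plus one common way to turn a distance into a similarity score. The conversion 1 / (1 + d) is an assumption picked for illustration; other conventions exist.

```python
import math

def euclidean_distance(p, q) -> float:
    """Straight-line distance between two points with the same number of coordinates."""
    if len(p) != len(q):
        raise ValueError("points must have the same number of dimensions")
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def similarity_from_distance(d: float) -> float:
    """Assumed conversion: identical objects score 1, distant objects approach 0."""
    return 1 / (1 + d)

d = euclidean_distance((0, 0), (3, 4))  # the classic 3-4-5 right triangle
print(d, similarity_from_distance(d))
```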

Outgroup

Reference group for phylogeny… more distantly related group of organisms

Point of comparison for ingroups

Allows for phylogeny to be rooted

Vocab to know

Basic Phylogenetics

  • Identifying Clades
  • Formulating Phylogenetic Hypotheses from maps

Bioinformatics

  • Accession Numbers

BLAST via NCBI website

Sequence Alignment

  • MSA - interpretations, indels
  • Pairwise comparison - PID
  • Basic Sequence Sorting

Similarity and Dissimilarity Matrices

  • Creating a matrix from MSA
  • Diagonal Matrix

Phylogenetic Tree Interpretation

  • Clade
  • Branch Length
  • Interpreting different styles

Phylogenetics vocab

A review worksheet is available here (might have some terms we didn’t cover). A more detailed set of slides is available here (has some info we didn’t cover; lecture video available upon request).

  • taxa
  • sister taxa
  • sister species
  • clade
  • Outgroup
  • tips
  • branch lengths
  • convergent evolution

## Taxonomy vocab

A slide deck that reviews basic info on how we name species is here; it contains some info beyond what we covered in computational biology. Lecture video available upon request.

  • Species
  • subspecies
  • Pan = genus
  • Pan troglodytes = full species name

## Key Concepts / Ideas

You can build a phylogenetic tree of anything

  • Can use DNA, physical features, behaviors
  • Can build trees for things other than species
  • Could do programming languages. For the curious: a blog post on this topic https://www.i-programmer.info/news/98-languages/8809-the-evolution-of-programming-languages.html

How are species names written? When do you know you are looking at a species name? Species name vs. genus.

Phylogenetic hypotheses from a map

  • Things that are close together geographically tend to be related due to migration

What do the lengths of branches on a phylogenetic tree mean? It depends!

  • Nothing, time, degree of change/difference (often correlated with time)
  • If branch lengths mean something they need to be labeled!

The primary interest of PNAS paper - culture; genetics only used a little bit

  • Compare behavioral traits with DNA
  • Draw map/flow chart of the analyses
  • Do chimps have culture? Do they pass behaviors on via culture or only genetically? This is contentious in some circles.
  • PNAS study - only uses 2 subspecies - I find this problematic
  • how is their behavioral trait matrix interpreted (color code matrix of behaviors)
  • how is the cultural hypothesis compared to the genetics hypothesis using phylogenetics?
  • tree in PNAS paper - no genetics; has a contradiction based on what you’d expect from genetics
  • cultural convergence of chimp populations despite different genetic histories

Review Lycett paper

Mitochondrial DNA frequently used for phylogenetics studies

Convergence

The contradiction between behavioral and genetic phylogenies

BLAST - vocab and concepts

Accession numbers

GenBank

Use Ctrl+F to find accession numbers

NCBI website

Papers linked to PubMed

PubMed vs. Google Scholar

popsets

how do you interpret a GenBank entry? - organism, taxonomic information, source (citation), geographic origin

mitochondria - circular DNA; chloroplast - circular genome

when I BLAST a sequence - what am I comparing against

setting specific species for comparison

what is a subspecies really?

why aren’t there human sub-species?

set number of search results

Key output of BLAST - E-value - the number of matches this good expected by chance; the lower the E-value, the more significant the match between your sequence and another sequence

Percent identity (PID)

BLAST alignment tab

BLAST alignment is split across multiple lines

what does “-” mean in BLAST alignment (a gap: an insertion or deletion)

“|” vertical line = identical bases at that position (a match), consistent with homology

genetic lineage.

BLAST can make phylogenetic trees (distance trees), but not one you’d use in a paper

simplifying trees - collapsing tips in a clade down a single tip

clades, common ancestors

BLAST can build trees in different ways: minimum evolution vs. neighbor-joining. Usually similar but can have differences.

MSA

BLAST MSA viewer (can turn letters on/off, color coding)

BLAST MSA - what do dots represent? = matching base pairs (homology)

UPGMA readings - read them!

## Vocab / functions

  • Bioconductor
  • Dependencies
  • BiocManager ::
  • MSA
  • Sequence logo
  • dots in MSA
  • Indels, insertions, deletions
  • sequencing error
  • Conserved bases
  • PID
  • Evolutionarily conserved
  • Evolutionarily similar
  • identical sequences
  • information content
  • consensus sequence
  • consensus sequence vs. sequence logo
  • “N” in sequence
  • Accession number
  • meta data
  • FASTA file
  • nchar()
  • Down triangle in RStudio
  • global alignment
  • PID()
  • str()

Vocab / key concepts / ideas / code:

MSA

Distance matrix

Sequence logo

Consensus sequence

Major indel in the MSA

Indels due to sequencing errors

Indels are only identifiable in reference to other sequences - looking at a single sequence you don’t know where the indels are; need to compare to consensus sequence, reference sequence etc

Ns in MSA

Sequence errors

Trimming ends of sequences/MSA

Highly polymorphic sites

Basic excel tricks

Excel ruins accession numbers / gene names

COVID 19 problem - not enough lines!

conditional formatting

Pairwise comparison

Pairwise similarities

Pairwise differences

How many unique pairwise comparisons are possible with x sequences?

BLAST returns PID

Scoring an alignment

=count() in Excel

Automating the scoring of an alignment in Excel using =if()

Symmetrical matrix

SNP, alleles

matrix()

nrow =

byrow = T

Converting matrix to distance matrix in R

Data reduction

Drawing unrooted tree as a regular phylogenetic tree

Rotation around nodes on a phylogenetic tree

Topology

Multiple methods for computational tasks and implications for results

Interpretation of vertical axes on phylogenetic trees

Updating of matrix during clustering/phylogeny creation