Lycett et al. found a contradiction between behavioral data and genetic data
Accession #: unique ID for a sequence. Can search for the accession number on the NCBI website
PopSet (through NCBI): lists all of the accession numbers for a given paper
NCBI can tell you where the organism is from, the organelle that the genome is from, the name, etc.
NCBI links to the BLAST page. Can BLAST in reference to a query sequence, e.g. bonobo. Can also limit the number of query returns
BLAST compares amino acid (protein) sequences or nucleotide (DNA / RNA) sequences to other known sequences
Looks for sequences similar to the query
Looking for similarities may prove helpful for identifying the function of a gene if it is known in another organism
Multiple sequence alignment (MSA): aligns 3+ sequences of genes to show homologous nucleotides
Required for molecular phylogenetics
By identifying homologous regions, we can isolate important coding sequences that don't really change
Also useful in designing PCR primers
Bioinformatics: combination of fields to analyze biological data
Used to identify genes and SNPs, and is useful in understanding diseases and unique adaptations
In silico: performed on a computer or via computer simulation
Often used in biology
Model real-life scenarios on biological organisms
Compares sequence motifs and structural motifs
Protein Domain - conserved part of protein sequence and tertiary structure
Can evolve, function, and exist independently of the rest of the protein chain
Domains may appear in multiple different proteins - fold independently of the rest of the protein
Domains can be rearranged
Domains typically have a function, which is why they are conserved
Sequence motif: nucleotide / amino acid sequence that has / is believed to have biological significance
A sequence motif differs from a structural motif because it doesn't have a three-dimensional arrangement
Determining location and function of genes
Annotation is a note that explains the findings
Annotation includes
* genomic position
* intron and exon boundaries
* regulatory sequences
* repeats
Steps
* Identify introns (non-coding)
* Gene prediction: elements of the genome
* Attach biological info to elements
Structural Annotation - looks for genomic elements
* ORFs (open reading frames)
* gene structure
* coding regions
* location of regulatory motifs
Functional Annotation - attaches biological information to genomic elements
* Biochemical function
* Biological function
* Involved regulation and interactions
* Expression
May use biological and in silico analysis
Homology: similarity due to shared ancestry between structures or genes in different taxa
Homologous Structures - different purposes due to descent with modification from a common ancestor
Sequence Homology - similar protein or DNA sequence as a result of shared ancestry
Shared ancestry either through a speciation event or a gene duplication event
Homology is determined based on sequence similarity from a common ancestor
Conserved Sequence – similar in nucleic acids / proteins across species or within genome
Highly Conserved - a sequence that has remained relatively unchanged going far up the phylogenetic tree
Change can occur as a result of
* Insertions and deletions
* Recombination / deletion due to chromosomal rearrangements
Conserved sequences persist even if the above occur because if an organism had the above in an important region, it likely wouldn't be able to survive
Selection pressures determine the extent of conservation: how strongly the environment selects for the specific functionality of that gene
Mutations in amino acid and nucleic acid sequences differ because of silent mutations: a silent mutation can affect the nucleic acid sequence but not the amino acid sequence
May be replacements with amino acids that maintain the 3D structure
Usually less conserved than coding RNAs
Structure and Function (Coding) usually conserved more than Non - Coding
Conserved sequences are identified using bioinformatics based on sequence alignment
Can be used in high - throughput DNA sequencing and protein mass spectrometry
Homology uses inputs from one or multiple sequence alignments and looks for similar amino acid sequences and genes between them
Alignment scored based on # of matching amino acids. Takes into account gaps and deletions
Substitution matrices help find acceptable substitutions
High scores = homologous sequences
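A minimal sketch of the scoring idea above, assuming two sequences that are already aligned (same length, `-` for a gap). The flat `match` / `mismatch` / `gap` values are illustrative assumptions; real tools look up each pair in a substitution matrix instead of using one mismatch penalty.

```python
def score_alignment(seq_a, seq_b, match=1, mismatch=-1, gap=-2):
    """Score two pre-aligned sequences of equal length; '-' marks a gap."""
    if len(seq_a) != len(seq_b):
        raise ValueError("aligned sequences must have the same length")
    score = 0
    for a, b in zip(seq_a, seq_b):
        if a == "-" or b == "-":
            score += gap        # penalize an insertion / deletion
        elif a == b:
            score += match      # reward identical characters
        else:
            score += mismatch   # stand-in for a substitution-matrix lookup
    return score


# Hypothetical example: 6 matches and 1 gap
print(score_alignment("GATTACA", "GA-TACA"))
```

Higher totals mean the two rows agree at more positions, which is the basis for calling sequences homologous.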
This form of analysis shows letters at increasing sizes as a visual representation of the level of conservation
WGA - whole genome alignments
Difficult to do with larger organisms due to the computational complexity
Can be done with multiple smaller organisms
Highly conserved sequences – important biological functions – useful in identifying genetic diseases
Congenital metabolic disorders and lysosomal storage diseases are a result of changes to conserved genes - missing / faulty enzymes
Identifying conserved sequences can allow for functional prediction
Sequence logo: graphical representation of sequence conservation of nucleotides
Uses aligned sequences
Shows consensus and diversity of sequence
Stack of letters at each position … size represents frequency
Height = information content of position in bits
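The bit height at each logo position can be sketched as below. This assumes DNA (4 possible letters), ignores gaps, and leaves out the small-sample correction that real logo tools apply; the aligned sequences are made up for illustration.

```python
import math
from collections import Counter


def column_information(column, alphabet_size=4):
    """Information content in bits: log2(alphabet size) minus Shannon entropy."""
    counts = Counter(column)
    total = len(column)
    entropy = -sum((n / total) * math.log2(n / total) for n in counts.values())
    return math.log2(alphabet_size) - entropy


# Hypothetical alignment: each string is one aligned sequence
aligned = ["ACGT", "ACGA", "ACTA", "ACTC"]
for position, column in enumerate(zip(*aligned), start=1):
    print(position, round(column_information(column), 2))
```

A fully conserved column reaches the maximum of 2 bits; a column where all four bases are equally common scores 0 bits.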
Simplified version of a sequence logo
Doesn't display frequency like in sequence logo
Only shows conservation information
Arranging sequences of a gene that is shared between species and compares the amino acids to look for conserved and non - conserved domains
If sequences have a common ancestor, the mismatches can be shown as point mutations and gaps as indels
Global alignments: force alignment over a larger space
Local alignments: find alignments in smaller regions and align them ... better matches ... e.g. 50% similar overall, but the parts that are similar have a 100% conservation rate
Used on only two query sequences at a time and is used to find the best matching piecewise alignment
Efficient … used when extreme precision is not required
3 main methods
* dot-matrix methods
* dynamic programming
* word methods
These methods have difficulty with highly repetitive sequences
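A rough sketch of the dynamic programming method, contrasting global and local alignment scores. The global version follows the Needleman-Wunsch idea and the local version the Smith-Waterman idea; the scoring values and the linear gap penalty are assumptions chosen for illustration, and only the score (not the alignment itself) is returned.

```python
def global_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score when both sequences must be aligned end to end."""
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def local_score(a, b, match=1, mismatch=-1, gap=-1):
    """Best score over any pair of sub-regions; cells never drop below 0."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            cell = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)
        prev = curr
    return best


# Hypothetical sequences that share only a middle region
x, y = "TTTGATTACATTT", "CCCGATTACACCC"
print(global_score(x, y), local_score(x, y))
```

The local score picks out the shared middle region and ignores the unrelated flanks, which is why local alignment gives "better matches" when only part of the sequences is similar.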
MUM - maximum unique match - longest subsequence that occurs in both query sequences
Longer MUM - closer relatedness
3+ queries that are assumed to come from a common ancestor
MSA can lead to sequence homology and allows for phylogenetic analysis to see shared evolutionary origins
point mutations - different characters
indels - hyphens
Often used to assess 3d structure of protein by looking at changes in base amino acid structure
Heuristics used to maximize scores – heuristics give insight into evolutionary process
Heuristics means that there is a higher likelihood for errors
More sequences introduce more error because there are more indels that could mess up the algorithm
MSA used for phylogenetics: compares important, highly conserved regions between species to see how closely related they are, and can be used in an evolutionary setting
Indel: mutation that involves an insertion or a deletion of a base in the genome. Can result in a frameshift
Point Mutation - replaces a single nucleotide without changing the quantity in the gene
If indel is a multiple of 3, there will be no frameshift
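The multiple-of-3 rule can be checked with a small sketch. The sequence is made up, and `codons` is a hypothetical helper that simply cuts the sequence into triplets from the first base.

```python
def causes_frameshift(indel_length):
    """An indel shifts the reading frame unless its length is a multiple of 3."""
    return indel_length % 3 != 0


def codons(seq):
    """Split a sequence into complete triplets, dropping leftover bases."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


original = "ATGGCCAAATGA"                    # hypothetical coding sequence
one_deleted = original[:3] + original[4:]    # delete 1 base after the first codon
three_deleted = original[:3] + original[6:]  # delete a whole codon

print(codons(original))
print(codons(one_deleted), causes_frameshift(1))
print(codons(three_deleted), causes_frameshift(3))
```

After the 1-base deletion every downstream codon changes, while the 3-base deletion removes one codon and leaves the rest intact.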
Shows evolutionary relationship between species – gives phylogeny based on sim. and diff. in physical and genetic characteristics
Rooted Phylogenetic Tree - node is most recent common ancestor; branch lengths are time estimates since evolution from common ancestor. Often use an outgroup to root
Unrooted trees - relatedness of leaf nodes, don't require knowledge of ancestral root. Relatedness without ancestry. Can convert rooted to unrooted by omitting the root
Distance Matrix Methods - neighbor joining / UPGMA. Genetic distance from multiple sequence alignments. Simplest. Not an evolutionary model
Maximum Parsimony – implies evolution
Optimality criterion of maximum likelihood - Bayesian framework - explicit model of evolution for tree estimation
Measure of genetic divergence between species
Pop. with similar alleles = smaller genetic distance -- closely related with small distance
Applying algorithms to phylogenetic analysis with the goal of assembling a phylogenetic tree showing the hypothesized relationship between genes, species, or taxa
Can be morphological, molecular, or genetic. MSA used for molecular and genetic
Parsimony - used for morphological data but not really for genetic data ... tries to minimize the number of evolutionary steps
Must start with MSA
Converts MSA to distance matrix; simplest is to just count the differences in the MSA
Computationally cheap ... take MSA, convert to base numbers of sim. and diff.
Model how species evolve and use this to build a tree
Computationally expensive
BLAST uses a hybrid called minimum evolution
Distance-based: converts MSA to distance matrix. Simplest: convert MSA to counts of similarities and differences and convert this to distance
Distance-based makes pairwise differences to give pairwise distances, i.e. distance / difference between each two
This is the one with a box comparing species to each other ... the diagonals get 0's because they are the same as each other
| | Bonobo | Chimp | Human |
|---|---|---|---|
| Bonobo | 0 | | |
| Chimp | | 0 | |
| Human | | | 0 |

Compare Chimp and Bonobo ... note the differences
Bonobo versus Human: 6
Chimp versus Human: 4

| | Bonobo | Chimp | Human |
|---|---|---|---|
| Bonobo | 0 | | |
| Chimp | | 0 | |
| Human | 6 | 4 | 0 |
The above is a symmetrical matrix ... mirror image across the diagonal
We can get rid of info on one side and use it to fill in some other information, e.g. similarities or percentage difference rather than count
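A minimal sketch of building such a matrix by counting differences between aligned sequences. The three sequences are invented for illustration (they are not the real bonobo / chimp / human data), and gaps would simply count as differences here, which is an assumption.

```python
from itertools import combinations


def distance_matrix(aligned):
    """Count pairwise differences between aligned sequences (name -> sequence)."""
    names = list(aligned)
    dist = {a: {b: 0 for b in names} for a in names}  # diagonal stays 0
    for a, b in combinations(names, 2):
        diffs = sum(x != y for x, y in zip(aligned[a], aligned[b]))
        dist[a][b] = dist[b][a] = diffs               # symmetrical fill
    return dist


# Hypothetical aligned sequences, all the same length
msa = {
    "Bonobo": "ACGTACGTAC",
    "Chimp":  "ACGTACGTAA",
    "Human":  "ACGAACTTAA",
}
dist = distance_matrix(msa)
for name in msa:
    print(name, [dist[name][other] for other in msa])
```

Dividing each count by the alignment length would give the percentage-difference version mentioned above.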
Phylogeny uses a distance matrix made of pairwise distances based on differences
Use genetic distance between sequences in question ... use MSA as input
Distance matrix can be used to construct rooted or unrooted trees
Neighbor joining: uses data clustering of sequences, with genetic distance as the clustering metric
Makes unrooted trees
Doesn't assume constant rate of evolution
UPGMA - Unweighted Pair Group Method with Arithmetic Mean. WPGMA - Weighted Pair Group Method with Arithmetic Mean
Rooted trees
Require constant rate assumption: assumes distances from root to every branch tip are equal
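A compact sketch of UPGMA clustering on a tiny distance matrix. The Bonobo-Human (6) and Chimp-Human (4) values come from the matrix in these notes; the Bonobo-Chimp value of 2 is an assumption added only so the example runs. Cluster labels are built as Newick-style strings, and each merge is reported with its node height (half the distance between the merged clusters, which is where the constant-rate assumption enters).

```python
def upgma(names, pairwise):
    """pairwise maps (a, b) tuples to distances; returns (cluster, height) merges."""
    size = {n: 1 for n in names}                   # tips per cluster
    d = {frozenset(k): v for k, v in pairwise.items()}
    merges = []
    while len(size) > 1:
        closest = min(d, key=d.get)                # closest pair of clusters
        a, b = sorted(closest)
        merged = f"({a},{b})"
        merges.append((merged, d[closest] / 2))
        for other in size:
            if other in (a, b):
                continue
            # size-weighted mean so every original tip counts equally
            d[frozenset((merged, other))] = (
                size[a] * d[frozenset((a, other))]
                + size[b] * d[frozenset((b, other))]
            ) / (size[a] + size[b])
        d = {k: v for k, v in d.items() if a not in k and b not in k}
        size[merged] = size.pop(a) + size.pop(b)
    return merges


distances = {
    ("Bonobo", "Chimp"): 2,   # assumed value, not from the notes
    ("Bonobo", "Human"): 6,
    ("Chimp", "Human"): 4,
}
for cluster, height in upgma(["Bonobo", "Chimp", "Human"], distances):
    print(cluster, height)
```

Bonobo and Chimp join first; the distance from that cluster to Human is then updated to the average of 6 and 4 before the final merge.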
Weighted least squares method based on genetic difference. Closely related sequences are given greater weight. This compensates for the increased inaccuracy in measuring distances between distantly related sequences
Grouping objects so objects in same cluster are more similar
Form of data mining
Used in ecology, transcriptomics, and sequence analysis
## Hierarchical Clustering
Builds a hierarchy of clusters in terms of greatest similarity between them
A certain level of dissimilarity or similarity allows for combining of clusters
Steps
* Sequence Alignment
* Multiple Sequence Alignment
* Distance matrix
* UPGMA
Shows inferred evolutionary relationship between a set of organisms
Shows descent with modification
Sequences separated by shorter evolutionary distances are expected to be more similar
Start with multiple alignment, construct a pairwise difference matrix, and construct a tree from the pairwise distances
Branch length indicates the number of mutations that have occurred on that branch
Patristic distance ... this is the one that kind of looks like an X with the space between ... here you add the distance from one branch end to the middle, the distance across the middle, and the distance from the middle to the end of the other branch
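A small sketch of the patristic distance as a sum of branch lengths along the path between two tips. The tree, node names, and branch lengths are all hypothetical; the tree is stored as child -> (parent, branch length), with `mid2` arbitrarily treated as the top node.

```python
def patristic_distance(parent, tip_a, tip_b):
    """Sum branch lengths on the path between two tips."""
    def distances_to_ancestors(node):
        total, seen = 0.0, {node: 0.0}
        while node in parent:
            up, length = parent[node]
            total += length
            seen[up] = total
            node = up
        return seen

    up_a = distances_to_ancestors(tip_a)
    up_b = distances_to_ancestors(tip_b)
    shared = [n for n in up_a if n in up_b]
    # the closest shared ancestor gives the shortest (true) path
    return min(up_a[n] + up_b[n] for n in shared)


# Hypothetical X-shaped tree: A and B join at mid1, C and D join at mid2
tree = {
    "A": ("mid1", 1.0),
    "B": ("mid1", 2.0),
    "mid1": ("mid2", 3.0),   # the middle branch
    "C": ("mid2", 1.5),
    "D": ("mid2", 0.5),
}
print(patristic_distance(tree, "A", "C"))  # end to mid + middle + mid to end
print(patristic_distance(tree, "A", "B"))
```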
Simple, can be done by hand, distance based
Diagram representing a tree
Used in
* Hierarchical Clustering
* Computational Biology
* Phylogenetics
Group of organisms that come from a common ancestor
Clades can be stacked, i.e. clades can be as broad or shallow as we want. It is a clade as long as the grouping contains all descendants of the common ancestor
Clade is different from taxon: taxa are not necessarily monophyletic
Cladograms - phylogenetic trees of a single clade
Nested clade - clade within a clade
Sisters - clades are sisters if they have an immediate common ancestor
Reference the distance matrix
Similarity measure … quantifies similarity between two objects
Inverse of distance metrics
Euclidean distance: straight-line distance between two points in Euclidean space
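A short sketch of the straight-line distance and one way to turn a distance into a similarity. The `1 / (1 + d)` conversion is just one common choice and is an assumption here, not something these notes specify.

```python
import math


def euclidean(p, q):
    """Straight-line distance between two points of the same dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def similarity_from_distance(d):
    """Assumed conversion: identical objects score 1, distant objects approach 0."""
    return 1.0 / (1.0 + d)


d = euclidean((0, 0), (3, 4))
print(d, similarity_from_distance(d))
```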
Reference group for phylogeny ... more distantly related group of organisms
Point of comparison for ingroups
Allows for phylogeny to be rooted
A review worksheet is available here (might have some terms we didn't cover)
A more detailed set of slides is available here (has some info we didn't cover; lecture video available upon request)
taxa
sister taxa
sister species
clade
Outgroup
tips
branch lengths
convergent evolution
## Taxonomy vocab
A slide deck that reviews basic info on how we name species is here; it contains some info beyond what we covered in computational biology. Lecture video available upon request.
Species
subspecies
Pan = genus
Pan troglodytes = full species name
## Key Concepts / Ideas
You can build a phylogenetic tree of anything
Can use DNA, physical features, behaviors
Can build trees for things other than species
Could do programming languages
For the curious: a blog post on this topic https://www.i-programmer.info/news/98-languages/8809-the-evolution-of-programming-languages.html
How are species names written? When do you know you are looking at a species name? Species name vs. genus.
Phylogenetic hypotheses from a map
Things that are close together geographically tend to be related due to migration
What do the lengths of branches on a phylogenetic tree mean? It depends! Nothing, time, degree of change/difference, often correlated with time
If branch lengths mean something they need to be labeled!
The primary interest of PNAS paper - culture; genetics only used a little bit
Compare behavioral traits with DNA
Draw map/flow chart of the analyses
Do chimps have culture? Do they pass behaviors on via culture or only genetically? This is contentious in some circles.
PNAS study - only uses 2 subspecies - I find this problematic
How is their behavioral trait matrix interpreted (color code matrix of behaviors)?
How is the cultural hypothesis compared to the genetics hypothesis using phylogenetics?
Tree in PNAS paper - no genetics; has a contradiction based on what you'd expect from genetics
Cultural convergence of chimp populations despite different genetic histories
Mitochondrial DNA frequently used for phylogenetics studies
convergence
The contradiction between behavioral and genetic phylogenies
Accession numbers
GenBank
Use Ctrl+F to find accession numbers
NCBI website
Papers linked to PubMed
PubMed vs. Google Scholar
PopSets
How do you interpret a GenBank entry? - organism, taxonomic information, source (citation), geographic origin
Mitochondria - circular DNA; chloroplast - circular genome
when I BLAST a sequence - what am I comparing against
setting specific species for comparison
what is a subspecies really?
why aren’t there human sub-species?
set number of search results
Key output of BLAST - E-value - the number of matches this good expected by chance; smaller values mean a more significant match to another sequence
Percent identity (PID)
BLAST alignment tab
BLAST alignment is split across multiple lines
what does “-” mean in BLAST alignment (insertion or deletion, i.e. an indel)
“|” vertical line = homology, homologous
genetic lineage.
BLAST can make phylogenetic trees (distance trees), but not one you’d use in a paper
simplifying trees - collapsing tips in a clade down a single tip
clades, common ancestors
BLAST can build trees in different ways: minimum evolution vs. neighbor-joining. Usually similar but can have differences.
MSA
blast MSA viewer (can turn letters on/off, color coding, )
BLAST MSA - what do dots represent? = matching base pairs (homology)
UPGMA readings - read them!
## Vocab / functions
Bioconductor
Dependencies
BiocManager ::
MSA
Sequence logo
dots in MSA
Indels, insertions, deletions
sequencing error
Conserved bases
PID
Evolutionarily conserved
Evolutionarily similar
identical sequences
information content
consensus sequence
consensus sequence vs. sequence logo
“N” in sequence
Accession number
meta data
FASTA file
nchar()
Down triangle in RStudio
global alignment
PID()
str()
MSA
Distance matrix
Sequence logo
Consensus sequence
Major indel in the MSA
Indels due to sequencing errors
Indels are only identifiable in reference to other sequences - looking at a single sequence you don’t know where the indels are; need to compare to consensus sequence, reference sequence etc
Ns in MSA
Sequence errors
Trimming ends of sequences/MSA
Highly polymorphic sites
Basic excel tricks
Excel ruins accession numbers / gene names
COVID 19 problem - not enough lines!
conditional formatting
Pairwise comparison
Pairwise similarities
Pairwise differences
How many unique pairwise comparisons are possible with x sequences?
BLAST returns PID
Scoring an alignment
=count() in Excel
Automating the scoring of an alignment in Excel using =if()
Symmetrical matrix
SNP, alleles
matrix()
nrow =
byrow = T
Converting matrix to distance matrix in R
Data reduction
Drawing unrooted tree as a regular phylogenetic tree
Rotation around nodes on a phylogenetic tree
Topology
Multiple methods for computational tasks and implications for results
Interpretation of vertical axes on phylogenetic trees
Updating of matrix during clustering/phylogeny creation
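For the review question above on how many unique pairwise comparisons are possible with x sequences: the count is x(x - 1) / 2, i.e. one triangle of the symmetrical matrix without the diagonal. A quick sketch that checks the formula against an explicit listing of pairs (the sequence names are placeholders):

```python
from itertools import combinations


def n_pairs(x):
    """Number of unique unordered pairs among x sequences."""
    return x * (x - 1) // 2


names = ["Bonobo", "Chimp", "Human"]
pairs = list(combinations(names, 2))   # every pair listed exactly once
print(pairs)
print(n_pairs(len(names)), n_pairs(10))
```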