In todays workshop we will be looking to:
We will also see how to:
Sequence motifs are conserved sequences of nucleotides in DNA/RNA or amino acids in proteins that relate to some function. These functions can be regulatory or structural in the case of DNA/RNA or they may be key elements of biochemical function of a protein, just to name a few. The point is that motifs exist because there has been selective pressure to maintain them in DNA, RNA, and proteins. Therefore, detecting them provides us the ability to make predictions about possible functions.
There are a wide variety of methods to detect motifs. The simplest involves generating position specific scoring matrices (PSSM) or sometimes called positional weight matrices (PWM).
A general workflow for finding a motif involves:
This workflow is very general and often motif detection algorithms add or remove various pieces. To get started, let’s start with the following small set of sequences provided in FASTA format. These have already been aligned and regions not aligning have been removed (i.e. we can start at step 4).
>aldB -18->4
attcgtgatagctgtcgtaaag
>ansB 103->125
ttttgttacctgcctctaactt
>araB1 109->131
aagtgtgacgccgtgcaaataa
>araB2 147->169
tgccgtgattatagacactttt
>cdd 1 107->129
atttgcgatgcgtcgcgcattt
>crp 1 115->137
taatgtgacgtcctttgcatac
>cya 151->173
aggtgttaaattgatcacgttt
>cytR 1 125->147
cgatgcgaggcggatcgaaaaa
>dadAX 1 95->117
agatgtgagccagctcaccata
>deoP2 1 75->97
aattgtgatgtgtatcgaagtg
>fur 136->158
aaatgtaagctgtgccacgttt
We’ll start by assessing the information at each position of our alignment. Visit: https://weblogo.berkeley.edu/logo.cgi
Input our sequences above, click “Create Logo”, and let’s see what we get. You should see a graphical representation of the information (or weighted bias) of nucleotides at each position. Notably, we see the classic inverted repeat elements we would expect from a dimeric transcription factor.
Calculating PSSM is a bit more tedious and can be done by hand, but
there is some built in functionality to do this in the R package
Biostrings.
library(Biostrings)
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, sd, var, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, basename, cbind, colnames,
## dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
## order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
## rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
## union, unique, unsplit, which.max, which.min
## Loading required package: S4Vectors
## Warning: package 'S4Vectors' was built under R version 4.1.1
## Loading required package: stats4
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## expand.grid, I, unname
## Loading required package: IRanges
## Loading required package: XVector
## Loading required package: GenomeInfoDb
##
## Attaching package: 'Biostrings'
## The following object is masked from 'package:base':
##
## strsplit
x <- c(
'attcgtgatagctgtcgtaaag',
'ttttgttacctgcctctaactt',
'aagtgtgacgccgtgcaaataa',
'tgccgtgattatagacactttt',
'tgccgtgattatagacactttt',
'atttgcgatgcgtcgcgcattt',
'taatgtgacgtcctttgcatac',
'aggtgttaaattgatcacgttt',
'cgatgcgaggcggatcgaaaaa',
'agatgtgagccagctcaccata',
'aattgtgatgtgtatcgaagtg',
'aaatgtaagctgtgccacgttt'
)
pwm <- PWM(x, type = c("log2probratio"),
prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25))
print(pwm) # a PWM
## [,1] [,2] [,3] [,4] [,5] [,6]
## A 0.04636581 0.03628188 0.03628188 -0.01721157 -0.01721157 -0.01721157
## C 0.01317597 -0.01721157 0.02427388 0.03121683 -0.01721157 0.02427388
## G -0.01721157 0.04027158 0.02427388 -0.01721157 0.05626927 -0.01721157
## T 0.03628188 0.03121683 0.03628188 0.05096560 -0.01721157 0.05290380
## [,7] [,8] [,9] [,10] [,11] [,12]
## A 0.01317597 0.05626927 0.01317597 0.02427388 0.02427388 0.01317597
## C -0.01721157 -0.01721157 0.03121683 0.03121683 0.03628188 0.03121683
## G 0.05096560 -0.01721157 0.03121683 0.04027158 0.01317597 0.04027158
## T 0.02427388 -0.01721157 0.04027158 0.02427388 0.04027158 0.03121683
## [,13] [,14] [,15] [,16] [,17] [,18]
## A 0.02427388 0.03121683 0.02427388 -0.01721157 0.04356351 0.03628188
## C 0.02427388 0.03121683 0.01317597 0.05466142 -0.01721157 0.04636581
## G 0.03628188 0.03628188 0.02427388 -0.01721157 0.04027158 -0.01721157
## T 0.03628188 0.02427388 0.04636581 0.01317597 0.01317597 0.01317597
## [,19] [,20] [,21] [,22]
## A 0.04636581 0.03121683 0.03628188 0.03121683
## C 0.01317597 0.01317597 -0.01721157 0.01317597
## G 0.02427388 0.01317597 -0.01721157 0.02427388
## T 0.02427388 0.04636581 0.04880544 0.04356351
Now, with our PSSM (or PWM in this case) you can begin to search for this same motif in new sequences. Let’s have R do this for us. There is simple functionality in the Biostring package to do this, as well.
subject_string = 'CCGCTTGCTGCAACTCTCTCAGGGCCAGGCGGTGAAGGGCAATCAGCTGTTGCCCGTCTCACTGGTGAAAAGAAAAACCACCCTGGCGCCCAATACGCAAACCGCCTCTCCCCGCGCGTTGGCCGATTCATTAATGCAGCTGGCACGACAGGTTTCCCGACTGGAAAGCGGGCAGTGAGCGCAACGCAATTAATGTGAGTTAGCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCT'
matchPWM(pwm,subject_string)
## Views on a 300-letter DNAString subject
## subject: CCGCTTGCTGCAACTCTCTCAGGGCCAGGCGGTG...GTGAGCGGATAACAATTTCACACAGGAAACAGCT
## views:
## start end width
## [1] 191 212 22 [TAATGTGAGTTAGCTCACTCAT]
Question 1: Compare the sequence logo and your pattern matching output. Does the identified sequence look like the sequence logo? What is similar/different?
There are wide variety of online tools for motif generation and searching. One useful online toolkit is the MEME-Suite (https://meme-suite.org/meme/). MEME-Suite uses a probability-based framework (Bayesian) to identify motifs and use them in subsequent searches. There are a variety of tools for various purposes. Let’s explore!
>lacZ
CCGCTTGCTG CAACTCTCTC AGGGCCAGGC GGTGAAGGGC AATCAGCTGT TGCCCGTCTC
ACTGGTGAAA AGAAAAACCA CCCTGGCGCC CAATACGCAA ACCGCCTCTC CCCGCGCGTT
GGCCGATTCA TTAATGCAGC TGGCACGACA GGTTTCCCGA CTGGAAAGCG GGCAGTGAGC
GCAACGCAAT TAATGTGAGT TAGCTCACTC ATTAGGCACC CCAGGCTTTA CACTTTATGC
TTCCGGCTCG TATGTTGTGT GGAATTGTGA GCGGATAACA ATTTCACACA GGAAACAGCT
>araB
GTCCATATTG CATCAGACAT TGCCGTCACT GCGTCTTTTA CTGGCTCTTC TCGCTAACCC
AACCGGTAAC CCCGCTTATT AAAAGCATTC TGTAACAAAG CGGGACCAAA GCCATGACAA
AAACGCGTAA CAAAAGTGTC TATAATCACG GCAGAAAAGT CCACATTGAT TATTTGCACG
GCGTCACACT TTGCTATGCC ATAGCATTTT TATCCATAAG ATTAGCGGAT CCTACCTGAC
GCTTTTTATC GCAACTCTCT ACTGTTTCTC CATACCCGTT TTTTTGGATG GAGTGAAACG
>xylA
CAATTTTTAG CAACTAAACA GGGGAAAACA ATTACAGATT TTTATCTTTC GATTACGATT
TTTGGTTTAT TTCTTGATTT ATGACCGAGA TCTTACTTTT GTTGCGCAAT TGTACTTATT
GCATTTTTCT CTTCGAGGAA TTACCCAGTT TCATCATTCC ATTTTATTTT GCGAGCGAGC
GCACACTTGT GAATTATCTC AATAGCAGTG TGAAATAACA TAATTGAGCA ACTGAAAGGG
AGTGCCCAAT ATTACGACAT CATCCATCAC CCGCGGCATT ACCTGATTAT GGAGTTCAAT
>marR
CCAAGAAATA ACGCGACAGT TGTTAATGGG TTAGCTAACG GCAGCAACAC CACCAGCCCC
AGGCCAATTG CTTTAAACAA ATCTAACATT GGTGGTTGTT ATCCTGTGTA TCTGGGTTAT
CAGCGAAAAG TATAAGGGGT AAACAAGGAT AAAGTGTCAC TCTTTAGCTA GCCTTGCATC
GCATTGAACA AAACTTGAAC CGATTTAGCA AAACGTGGCA TCGGTCAATT CATTCATTTG
ACTTATACTT GCCTGGGCAA TATTATCCCC TGCAACTAAT TACTTGCCAG GGCAACTAAT
Question 2: Why do you suspect the motifs found by creating a PWM and searching were different than those found using MEME suite?
The answer to this basic question is pretty straightforward: genome browsers aggregate a lot of useful information about the genetic information of an organism into one place. By analogy, we might think of genome browsers as the Amazon of genetic information - there is a lot of information available that serves a wide variety of interests.
Assembling a genome browser is a painstaking effort. All of the information within a genome browser is a reflection if that data being held within a database. In fact, not just one database but a variety of different distinct databases. That said these projects are not perfect and are under constant refinement.
Genome browsers are often tailored for specific needs. We will take a look at a few examples today. Each has similar, yet different, focuses. The UCSC Genome Browser, for example, is very focused on providing users with a highly flexible interface whereas the NCBI Genome Viewer (which is newer) is focused on providing a streamlined user experience and integration into other NCBI tools (e.g. BLAST). Both share nearly identical data since both draw information from the same sources.
Genome browsers provide a visual representation of complex genetic information. In simple terms, we get to look at the physical map of an organisms genome. Ask any geneticist 50 years ago what they would think about this and most would describe this as absolutely revolutionary. This visual representation is the gateway to myriad data we can explore through the click of a link. The best way to get aquainted with genome browsers is to use them!
The UCSC Genome Browser is one of the oldest (modern) genome browser interfaces. It is highly extensible and even allows users to upload their own genomic data for visualization. Let’s check it out.
Select the hg19 assembly for our exercises below. Once hg19 is selected, perform a search for ACE2.
In the UCSC browser, information is organized into tracks. These tracks a distinct components of information related to that region of DNA. There a lot of tracks that can be added or removed based on user preferences in the menus at the bottom. The default settings provide the most general use for novice users.
Let’s look at the gene expression information for a minute.
Question 3: What tissue has the highest levels of ACE2 expression?
The NCBI Genome Data Viewer is a newer platform that allows extensive integration into the NCBI toolkit. Chief among these is the BLAST framework. As we’ll see, for many purposes the UCSC and NCBI interfaces differ slightly with respect to information. For many users, use really comes down to personal and/or historical preference. Notably, there are no well-vetted R packages available to interact with the NCBI Genome Viewer at this time.
Go to: https://www.ncbi.nlm.nih.gov/genome/gdv/
Select the Homo sapiens link and search for ACE2.
Question 4: What is one similarity and one difference you note between the UCSC and NCBI interfaces?
Question 5: What chromosome is ACE2 located on?
If we zoom into the gene and hover our cursor over the protein or nucleotide sequence, we will see a series of links to NCBI record entries as well as various BLAST tools. Go ahead and hover over the protein sequence (red) of isoform 2. Navigate to the NCBI BLASTP interface via the link in the pop-up menu. You’ll note the NCBI RefSeq number has been pasted into the search field. Now you can adjust your BLAST search as you see fit for your purposes.
Note, the homework question will require you to BLAST search with the peptide sequence of an exon and not the entire protein.
The default viewer settings for NCBI are minimal but don’t let that fool you. If you click on the tracks menu, you will see you can add and remove many other tracks to modify your viewing experience.
This is a fairly new tool in the NCBI toolkit. Given our ever-growing amount of genome information, one thing we often wish to do is compare genomes. Doing this, we can learn a lot about how genomes are structured (ie. genome architecture) and how genomes have changed through time in organismal lineages (ie. genome evolution).
Navigate to: https://www.ncbi.nlm.nih.gov/genome/cgv/
Select Homo sapiens and Canis lupis familiaris to compare. You may need to select an Assembly. For the dog, choose the Boxer.
You should see a beautiful ribbon diagram. What is represented in this graphic is the synteny, or preserved structural correlation, of genes between these species. As we might guess this is a little bit of a spaghetti diagram and not easy to follow, broadly. However, we can glean a few things.
Question 6: How many chromosomes do humans have and how many do dogs have?
Question 7: By visual comparison, which chromosome shares the least gene content between these two species? Why is this likely true? (Hint: Consider the biological gender of the sequenced individuals.)