In todays workshop we will be looking to:
We will also see how to:
Sequence motifs are conserved sequences of nucleotides in DNA/RNA or amino acids in proteins that relate to some function. These functions can be regulatory or structural in the case of DNA/RNA or they may be key elements of biochemical function of a protein, just to name a few. The point is that motifs exist because there has been selective pressure to maintain them in DNA, RNA, and proteins. Therefore, detecting them provides us the ability to make predictions about possible functions.
There are a wide variety of methods to detect motifs. The simplest involves generating position specific scoring matrices (PSSM) or sometimes called positional weight matrices (PWM).
A general workflow for finding a motif involves:
This workflow is very general and often motif detection algorithms add or remove various pieces. To get started, let’s start with the following small set of sequences provided in FASTA format. These have already been aligned and regions not aligning have been removed (i.e. we can start at step 4).
>aldB -18->4
attcgtgatagctgtcgtaaag
>ansB 103->125
ttttgttacctgcctctaactt
>araB1 109->131
aagtgtgacgccgtgcaaataa
>araB2 147->169
tgccgtgattatagacactttt
>cdd 1 107->129
atttgcgatgcgtcgcgcattt
>crp 1 115->137
taatgtgacgtcctttgcatac
>cya 151->173
aggtgttaaattgatcacgttt
>cytR 1 125->147
cgatgcgaggcggatcgaaaaa
>dadAX 1 95->117
agatgtgagccagctcaccata
>deoP2 1 75->97
aattgtgatgtgtatcgaagtg
>fur 136->158
aaatgtaagctgtgccacgttt
We’ll start by assessing the information at each position of our alignment. Visit: https://weblogo.berkeley.edu/logo.cgi
Input our sequences above, click “Create Logo”, and let’s see what we get. You should see a graphical representation of the information (or weighted bias) of nucleotides at each position. Notably, we see the classic inverted repeat elements we would expect from a dimeric transcription factor.
Calculating PSSM is a bit more tedious and can be done by hand, but there is some built in functionality to do this in the R package Biostrings.
library(Biostrings)
x <- c(
'attcgtgatagctgtcgtaaag',
'ttttgttacctgcctctaactt',
'aagtgtgacgccgtgcaaataa',
'tgccgtgattatagacactttt',
'tgccgtgattatagacactttt',
'atttgcgatgcgtcgcgcattt',
'taatgtgacgtcctttgcatac',
'aggtgttaaattgatcacgttt',
'cgatgcgaggcggatcgaaaaa',
'agatgtgagccagctcaccata',
'aattgtgatgtgtatcgaagtg',
'aaatgtaagctgtgccacgttt'
)
pwm <- PWM(x, type = c("log2probratio"),
prior.params = c(A=0.25, C=0.25, G=0.25, T=0.25))
print(pwm) # a PWM
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9]
A 0.04636581 0.03628188 0.03628188 -0.01721157 -0.01721157 -0.01721157 0.01317597 0.05626927 0.01317597
C 0.01317597 -0.01721157 0.02427388 0.03121683 -0.01721157 0.02427388 -0.01721157 -0.01721157 0.03121683
G -0.01721157 0.04027158 0.02427388 -0.01721157 0.05626927 -0.01721157 0.05096560 -0.01721157 0.03121683
T 0.03628188 0.03121683 0.03628188 0.05096560 -0.01721157 0.05290380 0.02427388 -0.01721157 0.04027158
[,10] [,11] [,12] [,13] [,14] [,15] [,16] [,17] [,18]
A 0.02427388 0.02427388 0.01317597 0.02427388 0.03121683 0.02427388 -0.01721157 0.04356351 0.03628188
C 0.03121683 0.03628188 0.03121683 0.02427388 0.03121683 0.01317597 0.05466142 -0.01721157 0.04636581
G 0.04027158 0.01317597 0.04027158 0.03628188 0.03628188 0.02427388 -0.01721157 0.04027158 -0.01721157
T 0.02427388 0.04027158 0.03121683 0.03628188 0.02427388 0.04636581 0.01317597 0.01317597 0.01317597
[,19] [,20] [,21] [,22]
A 0.04636581 0.03121683 0.03628188 0.03121683
C 0.01317597 0.01317597 -0.01721157 0.01317597
G 0.02427388 0.01317597 -0.01721157 0.02427388
T 0.02427388 0.04636581 0.04880544 0.04356351
print(-log10(pwm)) # a PSSM
Warning in print(-log10(pwm)) : NaNs produced
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
A 1.333802 1.440310 1.440310 NaN NaN NaN 1.880217 1.249729 1.880217 1.614861 1.614861
C 1.880217 NaN 1.614861 1.505611 NaN 1.614861 NaN NaN 1.505611 1.505611 1.440310
G NaN 1.395001 1.614861 NaN 1.249729 NaN 1.292723 NaN 1.505611 1.395001 1.880217
T 1.440310 1.505611 1.440310 1.292723 NaN 1.276513 1.614861 NaN 1.395001 1.614861 1.395001
[,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22]
A 1.880217 1.614861 1.505611 1.614861 NaN 1.360877 1.440310 1.333802 1.505611 1.440310 1.505611
C 1.505611 1.614861 1.505611 1.880217 1.262319 NaN 1.333802 1.880217 1.880217 NaN 1.880217
G 1.395001 1.440310 1.440310 1.614861 NaN 1.395001 NaN 1.614861 1.880217 NaN 1.614861
T 1.505611 1.440310 1.614861 1.333802 1.880217 1.880217 1.880217 1.614861 1.333802 1.311532 1.360877
Now, with our PSSM (or PWM in this case) you can begin to search for this same motif in new sequences. Let’s have R do this for us. There is simple functionality in the Biostring package to do this, as well.
subject_string = 'CCGCTTGCTGCAACTCTCTCAGGGCCAGGCGGTGAAGGGCAATCAGCTGTTGCCCGTCTCACTGGTGAAAAGAAAAACCACCCTGGCGCCCAATACGCAAACCGCCTCTCCCCGCGCGTTGGCCGATTCATTAATGCAGCTGGCACGACAGGTTTCCCGACTGGAAAGCGGGCAGTGAGCGCAACGCAATTAATGTGAGTTAGCTCACTCATTAGGCACCCCAGGCTTTACACTTTATGCTTCCGGCTCGTATGTTGTGTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCT'
matchPWM(pwm,subject_string)
Views on a 300-letter DNAString subject
subject: CCGCTTGCTGCAACTCTCTCAGGGCCAGGCGGTGAAGGGCAATCAGCT...TGTTGTGTGGAATTGTGAGCGGATAACAATTTCACACAGGAAACAGCT
views:
start end width
[1] 191 212 22 [TAATGTGAGTTAGCTCACTCAT]
There are wide variety of online tools for motif generation and searching. One useful online toolkit is the MEME-Suite (https://meme-suite.org/meme/). MEME-Suite uses a probability-based framework (Bayesian) to identify motifs and use them in subsequent searches. There are a variety of tools for various purposes. Let’s explore!
>lacZ
CCGCTTGCTG CAACTCTCTC AGGGCCAGGC GGTGAAGGGC AATCAGCTGT TGCCCGTCTC
ACTGGTGAAA AGAAAAACCA CCCTGGCGCC CAATACGCAA ACCGCCTCTC CCCGCGCGTT
GGCCGATTCA TTAATGCAGC TGGCACGACA GGTTTCCCGA CTGGAAAGCG GGCAGTGAGC
GCAACGCAAT TAATGTGAGT TAGCTCACTC ATTAGGCACC CCAGGCTTTA CACTTTATGC
TTCCGGCTCG TATGTTGTGT GGAATTGTGA GCGGATAACA ATTTCACACA GGAAACAGCT
>araB
GTCCATATTG CATCAGACAT TGCCGTCACT GCGTCTTTTA CTGGCTCTTC TCGCTAACCC
AACCGGTAAC CCCGCTTATT AAAAGCATTC TGTAACAAAG CGGGACCAAA GCCATGACAA
AAACGCGTAA CAAAAGTGTC TATAATCACG GCAGAAAAGT CCACATTGAT TATTTGCACG
GCGTCACACT TTGCTATGCC ATAGCATTTT TATCCATAAG ATTAGCGGAT CCTACCTGAC
GCTTTTTATC GCAACTCTCT ACTGTTTCTC CATACCCGTT TTTTTGGATG GAGTGAAACG
>xylA
CAATTTTTAG CAACTAAACA GGGGAAAACA ATTACAGATT TTTATCTTTC GATTACGATT
TTTGGTTTAT TTCTTGATTT ATGACCGAGA TCTTACTTTT GTTGCGCAAT TGTACTTATT
GCATTTTTCT CTTCGAGGAA TTACCCAGTT TCATCATTCC ATTTTATTTT GCGAGCGAGC
GCACACTTGT GAATTATCTC AATAGCAGTG TGAAATAACA TAATTGAGCA ACTGAAAGGG
AGTGCCCAAT ATTACGACAT CATCCATCAC CCGCGGCATT ACCTGATTAT GGAGTTCAAT
>marR
CCAAGAAATA ACGCGACAGT TGTTAATGGG TTAGCTAACG GCAGCAACAC CACCAGCCCC
AGGCCAATTG CTTTAAACAA ATCTAACATT GGTGGTTGTT ATCCTGTGTA TCTGGGTTAT
CAGCGAAAAG TATAAGGGGT AAACAAGGAT AAAGTGTCAC TCTTTAGCTA GCCTTGCATC
GCATTGAACA AAACTTGAAC CGATTTAGCA AAACGTGGCA TCGGTCAATT CATTCATTTG
ACTTATACTT GCCTGGGCAA TATTATCCCC TGCAACTAAT TACTTGCCAG GGCAACTAAT
Question
Why do you suspect the motifs found by creating a PWM and searching were different than those found using MEME suite?
The answer to this basic question is pretty straightforward: genome browsers aggregate a lot of useful information about the genetic information of an organism into one place. By analogy, we might think of genome browsers as the Amazon of genetic information - there is a lot of information available that serves a wide variety of interests.
Assembling a genome browser is a painstaking effort. All of the information within a genome browser is a reflection if that data being held within a database. In fact, not just one database but a variety of different distinct databases. That said these projects are not perfect and are under constant refinement.
Genome browsers are often tailored for specific needs. We will take a look at a few examples today. Each has similar, yet different, focuses. The UCSC Genome Browser, for example, is very focused on providing users with a highly flexible interface whereas the NCBI Genome Viewer (which is newer) is focused on providing a streamlined user experience and integration into other NCBI tools (e.g. BLAST). Both share nearly identical data since both draw information from the same sources.
Genome browsers provide a visual representation of complex genetic information. In simple terms, we get to look at the physical map of an organisms genome. Ask any geneticist 50 years ago what they would think about this and most would describe this as absolutely revolutionary. This visual representation is the gateway to myriad data we can explore through the click of a link. The best way to get aquainted with genome browsers is to use them!
The UCSC Genome Browser is one of the oldest (modern) genome browser interfaces. It is highly extensible and even allows users to upload their own genomic data for visualization. Let’s check it out.
Select the hg19 assembly for our exercises below. Once hg19 is selected, perform a search for ACE2.
In the UCSC browser, information is organized into tracks. These tracks a distinct components of information related to that region of DNA. There a lot of tracks that can be added or removed based on user preferences in the menus at the bottom. The default settings provide the most general use for novice users.
Let’s look at the gene expression information for a minute.
Question
What tissue has the highest levels of ACE2 expression?
There are extensive R packages allowing R to interact with or use data in the UCSC Genome Browser.
The following is a short example of how the package rtracklayer can be used to
###################################################
### chunk number 1: rtl-init
###################################################
library(rtracklayer)
data(targets)
###################################################
### chunk number 2: rtl-miRNA-track
###################################################
targetTrack <- makeGRangesFromDataFrame(targets,
keep.extra.columns=TRUE)
###################################################
### chunk number 3: rtl-export eval=FALSE
###################################################
## export(targetTrack, "targets.wig")
###################################################
### chunk number 4: rtl-ucsc-start
###################################################
session <- browserSession()
genome(session) <- "hg18"
###################################################
### chunk number 5: rtl-ucsc-lay
###################################################
session$targets <- targetTrack
###################################################
### chunk number 6: rtl-ucsc-view eval=FALSE
###################################################
top <- targetTrack$target == targets$target[1]
range <- targetTrack[top,] * -10
view <- browserView(session, range,
hide = c("refGene", "mgcFullMrna", "intronEst"),
dense = "knownGene", squish = "cons44way")
The NCBI Genome Data Viewer is a newer platform that allows extensive integration into the NCBI toolkit. Chief among these is the BLAST framework. As we’ll see, for many purposes the UCSC and NCBI interfaces differ slightly with respect to information. For many users, use really comes down to personal and/or historical preference. Notably, there are no well-vetted R packages available to interact with the NCBI Genome Viewer at this time.
Go to: https://www.ncbi.nlm.nih.gov/genome/gdv/
Select the Homo sapiens link and search for ACE2.
Question
What is one similarity and one difference you note between the UCSC and NCBI interfaces?
If we zoom into the gene and hover our cursor over the protein or nucleotide sequence, we will see a series of links to NCBI record entries as well as various BLAST tools. Go ahead and hover over the protein sequence (red) of isoform 2. Navigate to the NCBI BLASTP interface via the link in the pop-up menu. You’ll note the NCBI RefSeq number has been pasted into the search field. Now you can adjust your BLAST search as you see fit for your purposes.
Note, the homework question will require you to BLAST search with the peptide sequence of an exon and not the entire protein.
The default viewer settings for NCBI are minimal but don’t let that fool you. If you click on the tracks menu, you will see you can add and remove many other tracks to modify your viewing experience.