Practical II. Identifying and Searching Sequence Patterns in DNA

Introduction

In this exercise, we will deal with the identification of sequence patterns. Initially, we will try to identify small cis-elements in pre-aligned sequences, and then we will search for them in unknown sequences. The example will be a set of binding sites for a transcription factor involved in the differentiation of hematopoietic system cells, GATA, as identified in the human genome. The data consist of 50 hexanucleotides, which you can download from the gata.fa file here. (The data are taken from the book by Deonier, Tavare, and Waterman “Computational Genome Analysis”).

Exercise Objectives

The goals of this exercise are:
1. To read the aligned sequences into a data structure that we can manipulate.
2. To extract a consensus for the entire set.
3. To extract a PWM for the sequences at each position.
4. To graphically represent the PWM in two different ways.

A. Identifying Sequence Patterns from sequence collections

1. Reading Sequences

Declare the folder where you will download the data as the working directory with the command. In my computer this is

setwd("~/Dropbox/github/MolBioMedClass/Rfunctions/")

and load the function you will find here as readfastafile.R as well as seqMotif.R which you will find at the respective links

source("~/Dropbox/github/MolBioMedClass/Rfunctions/seqMotif.R")
source("~/Dropbox/github/MolBioMedClass/Rfunctions/readfastafile.R")

At this point, we will read the GATA sequences file using the readfastafile function and store them in a sequence array called gata. You can download the file from here. In my computer, this is already stored in a different folder so I run:

gata<-readfastafile("../Datasets/gata.fa")

Examine the result to see the contents of the collection. It consists of 50 six-nucleotide sequences each. The fact that all sequences have the same length will be useful later for extracting pattern characteristics, as we will assume we can align the sequences one under the other and compare their residues by position.

gata

## $Site1
## [1] "TTATAG"
## 
## $Site2
## [1] "AGATAT"
## 
## $Site3
## [1] "AGATAG"
## 
## $Site4
## [1] "ATATCT"
## 
## $Site5
## [1] "AGATAG"
## 
## $Site6
## [1] "AGATAG"
## 
## $Site7
## [1] "AGATAG"
## 
## $Site8
## [1] "TGATAA"
## 
## $Site9
## [1] "AGATAA"
## 
## $Site10
## [1] "AGATAA"
## 
## $Site11
## [1] "CGATAG"
## 
## $Site12
## [1] "AGAGTT"
## 
## $Site13
## [1] "TGATAA"
## 
## $Site14
## [1] "TGATAA"
## 
## $Site15
## [1] "AGATGG"
## 
## $Site16
## [1] "AGATAG"
## 
## $Site17
## [1] "AGATTG"
## 
## $Site18
## [1] "AGATAA"
## 
## $Site19
## [1] "TGATAA"
## 
## $Site20
## [1] "AGATAA"
## 
## $Site21
## [1] "AGATAG"
## 
## $Site22
## [1] "TGATAG"
## 
## $Site23
## [1] "TGATCA"
## 
## $Site24
## [1] "TTATCA"
## 
## $Site25
## [1] "AGATGG"
## 
## $Site26
## [1] "TGATAT"
## 
## $Site27
## [1] "AGATAG"
## 
## $Site28
## [1] "TGATAA"
## 
## $Site29
## [1] "GGATAC"
## 
## $Site30
## [1] "AGATAA"
## 
## $Site31
## [1] "CGATAA"
## 
## $Site32
## [1] "TGATAG"
## 
## $Site33
## [1] "AGATAA"
## 
## $Site34
## [1] "TGATTA"
## 
## $Site35
## [1] "AGATAA"
## 
## $Site36
## [1] "AGATAG"
## 
## $Site37
## [1] "TGATAT"
## 
## $Site38
## [1] "AGATAA"
## 
## $Site39
## [1] "TCAGAG"
## 
## $Site40
## [1] "AAGTAG"
## 
## $Site41
## [1] "AGATTA"
## 
## $Site42
## [1] "TGATAG"
## 
## $Site43
## [1] "TGATAG"
## 
## $Site44
## [1] "AGATAC"
## 
## $Site45
## [1] "TGATTG"
## 
## $Site46
## [1] "AGATTA"
## 
## $Site47
## [1] "AGAATA"
## 
## $Site48
## [1] "AGATAA"
## 
## $Site49
## [1] "AGATTA"
## 
## $Site50
## [1] "AGCTTC"

2. Extracting Consensus

We now have the sequences stored in an R data structure. Our next goal is to extract a consensus sequence from the set of sequences in our collection. Remember that the consensus sequence corresponds to the most common residue extracted by position. You can extract the consensus sequence using the namesake function consensus.R as shown below:

source("~/Dropbox/github/MolBioMedClass/Rfunctions/consensus.R")
consensus(gata, threshold=0.3)

## [1] "AGATAA"

You get a sequence corresponding to the consensus. As discussed in class, the probability with which the most prevalent residue appears can vary greatly for each position, affecting the reliability of the consensus. In the above example, we set the threshold to 0.3, meaning that if the maximum value was greater than 0.3, it returns that specific residue. What would the results be if we wanted to be stricter, e.g., requiring the most prevalent residue to have a probability of at least 0.6?

consensus(gata, threshold=0.6)

## [1] "AGATA[AG]"

You see how the consensus changes because for the last position there is relative ambiguity (no residue has p>=0.6). In this case, the function returns all residues with a probability of at least 1-threshold (in this case [AG]). Try increasing the threshold further and see how the consensus becomes increasingly “probabilistic”. This does not mean you can no longer say what happens at each position, just that the initial estimate for a clear consensus is not so reliable. For example:

consensus(gata, threshold=0.9)

## [1] "[AT]GAT[AT][AGT]"

Nevertheless, as we already discussed in class, the best way to estimate a pattern is to look at the Probability Weight Matrix (PWM) and represent it as a logo.

3. Creating a PWM

Beyond a simple consensus sequence, a sequence pattern can be quantified and analyzed better with a PWM, i.e., a position-specific probability matrix. Using the seqMotif.R function, we can calculate the PWM of a sequence collection as follows:

gata_motif<-seqMotif(gata, drawmotif=F)
gata_motif

##           P1         P2         P3         P4         P5         P6
## A 0.57407407 0.03703704 0.90740741 0.03703704 0.68518519 0.42592593
## C 0.05555556 0.03703704 0.03703704 0.01851852 0.07407407 0.07407407
## G 0.03703704 0.85185185 0.03703704 0.05555556 0.05555556 0.38888889
## T 0.33333333 0.07407407 0.01851852 0.88888889 0.18518519 0.11111111

The result is the probability matrix where the preference for the “GATA” pattern in the four middle positions of the sequence is clear, but we see that the trend is not equivalent for all residues. For example, the fifth position of the sequence is A with probability 0.68, while the third (also A) has a corresponding probability of 0.90.

In this case, we chose not to represent the motif with the option (drawmotif=F), but if we want we can see a first visualization of it in the form of a heatmap with the following command:

seqMotif(gata, drawmotif=T)

##           P1         P2         P3         P4         P5         P6
## A 0.57407407 0.03703704 0.90740741 0.03703704 0.68518519 0.42592593
## C 0.05555556 0.03703704 0.03703704 0.01851852 0.07407407 0.07407407
## G 0.03703704 0.85185185 0.03703704 0.05555556 0.05555556 0.38888889
## T 0.33333333 0.07407407 0.01851852 0.88888889 0.18518519 0.11111111

Observe the shape that emerges. You first see on the x-axis the nucleotide positions we are examining (from P1 to P6) and on the y-axis the four possible outcomes (nucleotide). The corresponding elements (base by position) are colored based on the PWM, with red values for high and white for low. Can you see the correspondence with the consensus?

4. Graphical Representation of PWM. Sequence Logo

An even better way to represent a pattern is the one derived from aligned sequences through Shannon information analysis and its representation as a sequence logo. This can be done very easily using a function called ggseqlogo, which you can find here:

#install.packages("ggseqlogo")
library("ggseqlogo")

Now use this function on a slightly differently formatted version of the GATA collection:

gata_seqs <- unlist(gata)
logo <- ggseqlogo(gata_seqs)

## Warning: The `<scale>` argument of `guides()` cannot be `FALSE`. Use "none" instead as
## of ggplot2 3.3.4.
## ℹ The deprecated feature was likely used in the ggseqlogo package.
##   Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

## Warning: `aes_string()` was deprecated in ggplot2 3.0.0.
## ℹ Please use tidy evaluation idioms with `aes()`.
## ℹ See also `vignette("ggplot2-in-packages")` for more information.
## ℹ The deprecated feature was likely used in the ggseqlogo package.
##   Please report the issue at <https://github.com/omarwagih/ggseqlogo/issues>.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.

logo

Observe the logo that was created. To what extent does it contain information you got from the PWM via the heatmap?

B. Searching for sequence patterns in an unknown sequence

In this part we will see how we can use an existing sequence pattern to identify hits, that is sequences in a longer stretch of DNA (e.g. an entire chromosome or a genome) that match the pattern. This is also done probabilistically.

5. Creation of PSSM

To better understand the strength of a probabilistic representation of a DNA sequence, i.e., how characteristic the pattern it contains is, it is good to compare it with a set of random sequences. You can do all the above steps on a set of random sequences of the same length (6 nucleotides) which you will find in the random.fa file.
Now we can take this random sequence collection into account by creating a PSSM matrix.

random<-readfastafile("../Datasets/random.fa")
random_motif<-seqMotif(random, draw=F)
random_motif

##          P1        P2        P3        P4        P5        P6
## A 0.2830189 0.3396226 0.3207547 0.3018868 0.3207547 0.4150943
## C 0.1509434 0.1698113 0.1698113 0.1132075 0.2075472 0.1509434
## G 0.2075472 0.2452830 0.2264151 0.2264151 0.2641509 0.1509434
## T 0.3584906 0.2452830 0.2830189 0.3584906 0.2075472 0.2830189

We can now create a PSSM matrix by comparing the gata.fa and random.fa collections. We will calculate the base2-logarithm of the ratio of the two motifs with gata in the numerator. A necessary addition is to add pseudocounts to both motifs to avoid indeterminacies like N/0 or log(0). An indicative command adds a small amount (0.01) to both parts of the ratio:

pssm<-log2((gata_motif+0.01)/(random_motif+0.01))

where gata_motif and random_motif have been obtained by applying seqMotif() to the respective sequence collections. Take a look at the PSSM you have created

pssm

##           P1        P2        P3        P4         P5          P6
## A  0.9951578 -2.893930  1.471801 -2.729153  1.0716356  0.03629996
## C -1.2957634 -1.934615 -1.934615 -2.111120 -1.3715954 -0.93682049
## G -2.2094592  1.755342 -2.329453 -1.850532 -2.0641805  1.30943356
## T -0.1020178 -1.602365 -3.361023  1.286515 -0.1564847 -1.27466234

6. Searching for matches in a longer sequence

Having created the PSSM, you are now in the position to search for this specific motif in a DNA sequence. You can use this sequence test1.fa as an example. Read it according to what we have done already.

targetseq<-readfastafile("../Datasets/test1.fa")

And now also download the PSSM-Searching function from pssmSearch.R. Once loaded

source("../Rfunctions/pssmSearch.R")

You can now execute the search by placing the pre-calculated PSSM against the target sequence like so:

pssmSearch(pssm, targetseq, threshold=0.8)

##           pssmscore
## [1,] 1128  6.792710
## [2,] 1264  6.792710
## [3,] 2089  6.661765
## [4,] 3152  6.616752
## [5,] 3908  6.616752
## [6,] 4597  6.792710

Compare this with a search using the PWM

pssmSearch(gata_motif, targetseq, threshold=0.8)

##            pssmscore
##  [1,]  343  3.518519
##  [2,]  475  3.481481
##  [3,]  609  3.777778
##  [4,]  729  3.481481
##  [5,]  756  3.518519
##  [6,]  787  3.518519
##  [7,]  998  3.740741
##  [8,] 1128  4.055556
##  [9,] 1168  3.518519
## [10,] 1184  3.555556
## [11,] 1222  3.500000
## [12,] 1264  4.055556
## [13,] 1336  3.777778
## [14,] 1664  3.481481
## [15,] 1669  3.481481
## [16,] 1698  3.555556
## [17,] 1815  3.777778
## [18,] 1824  3.981481
## [19,] 1837  3.759259
## [20,] 1851  3.666667
## [21,] 1893  3.481481
## [22,] 2040  3.777778
## [23,] 2089  3.796296
## [24,] 2194  3.777778
## [25,] 2196  3.555556
## [26,] 2244  3.518519
## [27,] 2572  3.518519
## [28,] 2795  3.777778
## [29,] 2806  3.981481
## [30,] 2862  3.777778
## [31,] 2892  3.481481
## [32,] 3116  3.555556
## [33,] 3120  4.092593
## [34,] 3152  4.333333
## [35,] 3207  4.018519
## [36,] 3209  3.555556
## [37,] 3214  3.981481
## [38,] 3221  3.481481
## [39,] 3231  4.092593
## [40,] 3255  3.722222
## [41,] 3487  3.500000
## [42,] 3540  4.092593
## [43,] 3687  3.481481
## [44,] 3740  3.500000
## [45,] 3908  4.333333
## [46,] 3970  4.092593
## [47,] 4358  3.759259
## [48,] 4368  4.092593
## [49,] 4597  4.055556
## [50,] 4696  3.796296
## [51,] 4704  3.685185
## [52,] 4718  4.092593
## [53,] 4791  3.518519
## [54,] 4873  3.592593
## [55,] 4922  4.092593
## [56,] 4951  3.481481
## [57,] 5047  4.018519
## [58,] 5092  3.518519

Which shows you exactly how less specific a PWM-search is.

7. Comparing PWM and PSSM Motif Strengths

As a last step we will compare the searches with PSSM and PWM at the level of motif clarity and information content. First we will write the code to extract the hits from the target sequence. We will use this simple function we will create:

extract_substrings <- function(sequence, starts, ends) {
  sequence <- as.character(sequence)[1]  # Single string
  starts <- as.numeric(starts)
  ends <- as.numeric(ends)
  
  if(length(starts) != length(ends)) stop("Length mismatch")
  
  result <- mapply(function(s, e) substr(sequence, s, e), starts, ends)
  return(result)
}

Then we will identify the starts of the hits in the target sequence like so:

pssm_output<-seqstarts<-pssmSearch(pssm, targetseq, threshold=0.5)
pssm_output

##            pssmscore
##  [1,]  336  4.349479
##  [2,]  475  4.415511
##  [3,]  609  4.208614
##  [4,]  998  4.546456
##  [5,] 1128  6.792710
##  [6,] 1264  6.792710
##  [7,] 1336  4.208614
##  [8,] 1815  5.598964
##  [9,] 1824  5.643631
## [10,] 1837  4.685268
## [11,] 1851  4.754069
## [12,] 2040  4.208614
## [13,] 2089  6.661765
## [14,] 2095  4.483470
## [15,] 2148  4.370844
## [16,] 2194  4.208614
## [17,] 2244  4.532178
## [18,] 2572  4.077669
## [19,] 2795  4.208614
## [20,] 2806  5.643631
## [21,] 2862  4.208614
## [22,] 3116  5.564589
## [23,] 3120  5.519576
## [24,] 3152  6.616752
## [25,] 3207  5.305789
## [26,] 3214  5.643631
## [27,] 3231  5.519576
## [28,] 3255  4.173521
## [29,] 3540  5.519576
## [30,] 3908  6.616752
## [31,] 3970  5.519576
## [32,] 4140  4.752838
## [33,] 4358  4.685268
## [34,] 4368  5.519576
## [35,] 4597  6.792710
## [36,] 4704  5.446654
## [37,] 4718  5.519576
## [38,] 4791  4.077669
## [39,] 4873  4.291456
## [40,] 4922  5.519576
## [41,] 5047  5.305789
## [42,] 5092  4.077669

and now just get the starts

starts<-pssm_output[,1]
starts

##  [1]  336  475  609  998 1128 1264 1336 1815 1824 1837 1851 2040 2089 2095 2148
## [16] 2194 2244 2572 2795 2806 2862 3116 3120 3152 3207 3214 3231 3255 3540 3908
## [31] 3970 4140 4358 4368 4597 4704 4718 4791 4873 4922 5047 5092

Since we know the patterns are 6 nucleotides long we can simply create the end coordinates by adding 5 to all starts and extract the hits like this:

ends<-starts+5
substringsPSSM <- extract_substrings(targetseq, starts, ends)
substringsPSSM

##  [1] "TGATCG" "AGATTC" "TGATAT" "TGATAC" "TGATAG" "TGATAG" "TGATAT" "CGATAG"
##  [9] "AGATAC" "GGATAG" "AGATGG" "TGATAT" "AGATTG" "AGCTAG" "CGATTG" "TGATAT"
## [17] "ATATAG" "AGATTT" "TGATAT" "AGATAC" "TGATAT" "TGATTG" "TGATAA" "AGATAA"
## [25] "AGATAT" "AGATAC" "TGATAA" "AGATCA" "TGATAA" "AGATAA" "TGATAA" "AGAGAG"
## [33] "GGATAG" "TGATAA" "TGATAG" "AGATCG" "TGATAA" "AGATTT" "TGATTA" "TGATAA"
## [41] "AGATAT" "AGATTT"

We can now apply the logo function to this list

ggseqlogo(substringsPSSM)

and do the same for the PWM hits

pwm_output<-seqstarts<-pssmSearch(gata_motif, targetseq, threshold=0.8)
starts<-pwm_output[,1]
ends<-starts+5
substringsPWM <- extract_substrings(targetseq, starts, ends)
ggseqlogo(substringsPWM)

8. Comparing PSSM and PWM searches at the level of Information Content

As a last step we will calculate the Information Content of the resulting searches. We will need first to create new PWMs from the hits. For example:

pwmtest<-seqMotif(as.list(substringsPSSM))

Then, it is (fairly) simple to calculate the mean Information Content directly from the matrix. We will use our own function which you see here:

shannon_ic <- function(pwm) {
  # pwm: matrix/data.frame with rows A,C,G,T and columns P1,P6...
  bases <- c("A", "C", "G", "T")
  
  # For each position (column)
  ic <- apply(pwm, 2, function(col) {
    # Normalize probabilities (in case they don't sum exactly to 1)
    p <- as.numeric(col)
    p <- p / sum(p)
    p[p == 0] <- NA  # Avoid log(0)
    
    # Shannon entropy: H = -sum(p * log2(p))
    entropy <- -sum(p * log2(p), na.rm = TRUE)
    
    # Information content: IC = 2 - H (max entropy for 4 bases = log2(4) = 2)
    ic_pos <- 2 - entropy
    return(ic_pos)
  })
  
  return(ic)
}

And then run it on our test PWM to get the IC for every position

shannon_ic(pwmtest)

##         P1         P2         P3         P4         P5         P6 
## 0.44293943 1.44333656 1.44333656 1.44333656 0.65272958 0.07532663

Or directly the mean over all positions

mean(shannon_ic(pwmtest))

## [1] 0.9168342

Compare this to the same type of output, now for the PWM search.

pwmtest<-seqMotif(as.list(substringsPWM))
mean(shannon_ic(pwmtest))

## [1] 0.8856323