Mapping PWMs to fasta sequences

Fasta sequences

Biological sequences (DNA, RNA or proteins) are stored usually in simple text formats like a FASTA sequence format.

>SeqName
ACCCATATA

where SeqName is the name of the sequence that follows. For example, if the sequence that follows refers to the BDNF gene then it might be BDNF. The important element is the ‘>’ symbol that specifies that this line contains information about the name of the sequence.

The sequence (e.g. ACTT..) might be expanded into several lines
The sequence ends when there is an empty line, or EOF or a line with a ‘>’ follows.

What kind of sequences you may find in a FASTA file

In fact, any kind of biological seqneces. It may be sequences of the same gene from multiple organisms (e.g. homologue sequences), coding or non-coding sequences, sequences that are located upstream of a gene (e.g., regulatory sequences etc).

Where can you find FASTA sequences and download them?

USCS is a good place to start. e.g. download. Για παράδειγμα: http://hgdownload.soe.ucsc.edu/goldenPath/hg38/multiz100way/

or for example

http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/upstream1000.fa.gz

The first set contains aligned homologous genes for 100 organisms (human is included). The second set contains only sequences from human.

** We will donwload regulatory sequences from human, i.e. sequences located 1kb upstream the transcriptions start sites.

http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/upstream1000.fa.gz

What kind of letters you can find in a FASTA file?

If the file contains DNA, then

The DNA alphabet

How do we scan for motifs in these sequences?

The idea is that for every consecutive base in the fasta file, we calculate a score using the PWM. We can score as follows

Sum of partial scores (i.e. scores per position)
Average score of consecutive bases
As a proportion to the maximum possible score

How can we decide if a score is high enough?

The idea is simple: if we had a file with sequences that we know they shouldn’t contain the motif, then we would scan it and compute the scores. Then we would look for scores that are statistically higher than those we find in this “negative” file.

How do we read fasta files

FASTA files are quite simple. In their simplest form they contain just sequence names and sequences. Thus, we could read the file using the read.table command and then get only the even lines, or the lines that do not contain sequence names. For example,

a = read.table("5utr.fa")
linesWithSeq = grep(pattern="^>", x = a[,1], invert = TRUE)
a[linesWithSeq,]

However, for large file we may encounter the following error. If the file is too large it might not fit in computer’s memory. Then, the idea is to read the file line by line. The following code might be useful for this:

processFile = function(filepath) {
  con = file(filepath, "r")
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      break
    }
    print(line)
  }

  close(con)
}

With this function we define as filepath the address of the file we want to read from the hard drive. Then, we open it with the function file, which is similar to the function `open’ used in other computer languages. Then, using a combination of while and break we read the file line by line.

create a file with random sequences

Often, we need to define a threshold for the PWM scores. Then, it’s helpful to have a `negative’ file, i.e. a file that contains no motifs. We can scan this file and then define a threshold

seqLength=1000
alphabet = c("A", "C", "G", "T")
fastaOut = ""
for(i in 1:1000){
  seqName = paste(">seq", i, "\n", sep="")
  seq = paste(sample(x = alphabet, size=seqLength, prob = rep(0.25, 4), replace=TRUE), collapse="")
  fastaOut = paste(fastaOut, seqName,seq,"\n", sep="")
}
write.table(fastaOut, "randomSeqFile.FA", quote=F, row.names=F, col.names=F)

Scanning a fasta sequence for a motif

processFile = function(filepath, pwm) {
  con = file(filepath, "r")
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      break
    }
   isSequence = grep(pattern="^>", x=line, invert = TRUE)
   
   if(length(isSequence) > 0){
      ## we split the line and we create a vector 
      seq.vec = unlist(strsplit(line, split=""))
      ## code for analyzing sequence
      ### 
      ## It is necessary to go up to the "length_vec - length motif + 1"
      for(i in 1:(length(seq.vec)-ncol(pwm)+1)){
        # This function evaluates the score for a substring of the fasta
        ## seq.vec
        ## at position i
        ## for the position weight matrix pwm
        ## note that pwm consists of some columns. For this number of columns the score will be evaluated.
        getScore(seq.vec, i, pwm)
      }
   }
  }

  close(con)
}



processFile("5utr.fa", pwm=pwm)

Of course, we should implement the function getScore first,which takes as an argument a vector representing the FASTA sequence, the position i and the pwm.

alphabet=c("A", "C", "G", "T")


getPwm = function(motifFile){
  a = read.table(motifFile, colClasses="character")
  for(i in 1:nrow(a)){
    v = unlist(strsplit(a[i,], ""))
    ## initialize the pwm at first iteration only
    if(i == 1){
      pwm = matrix(0, nrow=length(alphabet), ncol=length(v))
      rownames(pwm) = alphabet
    }
    for(j in 1:ncol(pwm)){
      letter = v[j] 
      pwm[letter, j] = pwm[letter, j] + 1
    }
  }
  pwm = t(t(pwm)/colSums(pwm))
  pwm = log2((pwm+1e-6)/0.25)
  return(pwm)
}

pwm = getPwm("pwm.txt")
pwm

##         [,1]       [,2]      [,3]       [,4]       [,5]       [,6]
## A  0.2630392  1.2630368 -1.321914 -17.931569 -17.931569  1.2630368
## C -0.3219209 -0.3219209 -1.321914 -17.931569 -17.931569 -0.3219209
## G -1.3219137 -1.3219137  1.485429   2.000001 -17.931569 -1.3219137
## T  0.6780755 -1.3219137 -1.321914 -17.931569   2.000001 -1.3219137
##        [,7]       [,8]       [,9]
## A  1.485429 -0.3219209 -1.3219137
## C -1.321914 -1.3219137 -0.3219209
## G -1.321914  1.0000029 -1.3219137
## T -1.321914 -0.3219209  1.2630368

getScore  = function(v, i, pwm){
  score = 0
  maxScore = 0
  minScore = 0
  for(j in 1:ncol(pwm)){
    MAX = max(pwm[,j])
    MIN = min(pwm[,j])
    maxScore = maxScore + MAX
    minScore = minScore + MIN
  }
  for(j in 1:ncol(pwm)){
    letter = v[i+j-1]
    score = score + pwm[letter, j] 
  }
  score = (score - minScore)/(maxScore - minScore)
  return(score)
}


processFile = function(filepath, pwm) {
  seq = 1
   seq.scores = list()
  con = file(filepath, "r")
  while ( TRUE ) {
    line = readLines(con, n = 1)
    if ( length(line) == 0 ) {
      break
    }
   isSequence = grep(pattern="^>", x=line, invert = TRUE)
   
  
   if(length(isSequence) > 0){
     
     ## initialize Scores for each line
     lineScores = c()
      ## we split the line and we create a vector 
      seq.vec = unlist(strsplit(line, split=""))
      ## code for analyzing sequence
      ### 
      ## It is necessary to go up to the "length_vec - length motif + 1"
      for(i in 1:(length(seq.vec)-ncol(pwm)+1)){
        # This function evaluates the score for a substring of the fasta
        ## seq.vec
        ## at position i
        ## for the position weight matrix pwm
        ## note that pwm consists of some columns. For this number of columns the score will be evaluated.
        score = getScore(seq.vec, i, pwm)
        lineScores = c(lineScores, score)
      }
     ## print(lineScores)
      max.score = max(lineScores)
     ## print(max.score)
      seq.scores[[seq]] = max.score
      ## increment the counter
      seq = seq + 1
   }
  }
  close(con)
  return(seq.scores)
}

seqScores = processFile("5utr.fa", pwm=pwm)
randomScores = processFile("randomSeqFile.FA", pwm=pwm)

#plot(1:length(seqScores), unlist(seqScores))
#plot(1:length(randomScores), unlist(randomScores))
#unlist(randomScores)

{plot(1:length(seqScores), unlist(seqScores), ylim=c(0.8, 1))
points(1:length(randomScores), unlist(randomScores), col="red")
abline(h=mean(unlist(seqScores)), col="black")
abline(h=mean(unlist(randomScores)), col="red")
legend("topleft", legend=c("sequences", "random sequences"), col=c("black", "red"), pch=19)}