Word Prediction Using N-Grams

Cathy Wyss
9/9/2017

Word Prediction

[image: iphone]

  • Given an input phrase (or word), predict the next word
  • Useful for messaging, searching, …
  • Depends on input corpus (bank of documents)
  • Reduces need for typing on tiny keyboards
  • Can correct errors (spelling, for example)

[image: google]

N-Grams for Word Prediction

[image: ngrams]

  • The model consists of “n-grams”, which are sequences of consecutive words from the corpus
  • Example 2-grams: “keeps people”, “always talking”
  • Store these in a data frame
    • first column is the n-gram
    • subsequent columns are the words that most frequently follow it (see the sketch after this list)
  • Project model consists of 2,660,110 n-grams of length 4
  • Project model is 77 MB
    • small enough to deploy on shinyapps.io
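
A minimal sketch of that layout, using toy rows taken from the results slide and assumed column names (w1–w3), not the actual project data:

# Toy model data frame; rows and column names are illustrative assumptions
model <- data.frame(
  ngram = c("oh my", "he is", ""),  # "" row backs the empty-string fallback
  w1    = c("god",      "a",   "the"),
  w2    = c("gosh",     "the", "a"),
  w3    = c("goodness", "not", "i"),
  stringsAsFactors = FALSE
)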

The Backoff Algorithm

predictBackoff <- function(s, m, k=3) {
  backoff_phrase <- cleanInputText(s)
  while (TRUE) {
    # look up the current phrase in the model's n-gram column
    W <- m[m$ngram == backoff_phrase,]
    if (nrow(W) > 0) {
      # match found: return the top k following words
      return(as.character(W[1, 2:(k + 1)]))
    }
    # no match: drop the first word and try the shorter phrase
    wordVec <- strsplit(backoff_phrase, " ")[[1]]
    lwv <- length(wordVec)
    if (lwv > 1) { backoff_phrase <- paste(wordVec[2:lwv], collapse=" ") }
    else { backoff_phrase <- "" }  # "" row holds the overall most frequent words
  }
}
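
With the toy model sketched earlier (and cleanInputText as sketched after the list below), a hypothetical call would back off once before matching:

# "said oh my" has no row in the toy model, so the loop drops "said"
# and matches the 2-gram "oh my"
predictBackoff("Said, oh my!", model, k = 3)
# [1] "god"      "gosh"     "goodness"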

  • cleanInputText (sketched after this list):
    • remove punctuation, numbers, foreign characters
    • translate to lower case and strip whitespace
  • compare the input string to the model
    • if a match is found, return the words that follow it
    • if no match is found, remove the first word and try again
    • repeat until a match is found
    • “” (the empty string) matches the most frequently occurring words
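
A minimal sketch of cleanInputText, assuming simple regex rules; the author's exact implementation may differ:

cleanInputText <- function(s) {
  s <- tolower(s)               # translate to lower case
  s <- gsub("[^a-z ]", " ", s)  # drop punctuation, numbers, foreign characters
  s <- gsub("\\s+", " ", s)     # collapse runs of whitespace
  trimws(s)                     # strip leading/trailing whitespace
}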

The Application

[image: app]

  • Please note: the first phrase takes a while because the model must load
    • subsequent phrases are fast
  • You can select how many words to return
    • use the slider (this sets k in predictBackoff)
  • App URL: https://datacathy.shinyapps.io/word_prediction/
  • Enter a word or phrase and click the “Predict” submit button

Results

Phrase             iMessage           my app
“i want a”         new, job, little   new, guy, relationship
“he is”            a, the, my         a, the, not
“how now brown”    she, is, I         said, cow
“and a case of”    course, the, a     beer
“oh my”            god, gosh, I       god, gosh, goodness
  • overlap with iMessage was about 80%
  • performance on the capstone quizzes was about 50%
  • returned words are intuitively reasonable
    • results vary greatly with the input corpus
  • the capstone corpus is old
    • new topics are not represented
    • a production model would refresh the corpus periodically