Word Prediction Using N-Grams

Cathy Wyss
9/9/2017

Word Prediction

[image: iphone]

  • Given an input phrase (or word), predict the next word
  • Useful for messaging, searching, …
  • Depends on input corpus (bank of documents)
  • Reduces need for typing on tiny keyboards
  • Can correct errors (spelling, for example)

[image: google]

N-Grams for Word Prediction

[image: ngrams]

  • The model consists of “n-grams”, which are sequences of consecutive words from the corpus
  • Example 2-grams: “keeps people”, “always talking”
  • Store these in a data frame
    • first column is the n-gram
    • subsequent columns are the words that most frequently follow it (see the sketch after this list)
  • Project model consists of 2,660,110 n-grams of length 4
  • Project model is 77 MB
    • small enough to deploy on shinyapps.io
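
A minimal sketch of that layout, using toy rows taken from the results slide and assumed column names (w1–w3), not the actual project data:

# Toy model data frame; rows and column names are illustrative assumptions
model <- data.frame(
  ngram = c("oh my", "he is", ""),  # "" row backs the empty-string fallback
  w1    = c("god",      "a",   "the"),
  w2    = c("gosh",     "the", "a"),
  w3    = c("goodness", "not", "i"),
  stringsAsFactors = FALSE
)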

The Backoff Algorithm

predictBackoff <- function(s, m, k=3) {
  backoff_phrase <- cleanInputText(s)
  while (TRUE) {
    # look up the current phrase in the model's n-gram column
    W <- m[m$ngram == backoff_phrase,]
    if (nrow(W) > 0) {
      # match found: return the top k following words
      return(as.character(W[1, 2:(k + 1)]))
    }
    # no match: drop the first word and try the shorter phrase
    wordVec <- strsplit(backoff_phrase, " ")[[1]]
    lwv <- length(wordVec)
    if (lwv > 1) { backoff_phrase <- paste(wordVec[2:lwv], collapse=" ") }
    else { backoff_phrase <- "" }  # "" row holds the overall most frequent words
  }
}
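
With the toy model sketched earlier (and cleanInputText as sketched after the list below), a hypothetical call would back off once before matching:

# "said oh my" has no row in the toy model, so the loop drops "said"
# and matches the 2-gram "oh my"
predictBackoff("Said, oh my!", model, k = 3)
# [1] "god"      "gosh"     "goodness"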

  • cleanInputText (sketched after this list):
    • remove punctuation, numbers, foreign characters
    • translate to lower case and strip whitespace
  • compare the input string to the model
    • if a match is found, return the words that follow it
    • if no match is found, remove the first word and try again
    • repeat until a match is found
    • “” (the empty string) matches the most frequently occurring words
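
A minimal sketch of cleanInputText, assuming simple regex rules; the author's exact implementation may differ:

cleanInputText <- function(s) {
  s <- tolower(s)               # translate to lower case
  s <- gsub("[^a-z ]", " ", s)  # drop punctuation, numbers, foreign characters
  s <- gsub("\\s+", " ", s)     # collapse runs of whitespace
  trimws(s)                     # strip leading/trailing whitespace
}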

The Application

[image: app]

  • Please note: the first phrase takes a while because the model must load
    • subsequent phrases are fast
  • You can select how many words to return
    • use the slider (this sets k in predictBackoff)
  • App URL: https://datacathy.shinyapps.io/word_prediction/
  • Enter a word or phrase and click the “Predict” submit button

Results

Phrase             iMessage           my app
“i want a”         new, job, little   new, guy, relationship
“he is”            a, the, my         a, the, not
“how now brown”    she, is, I         said, cow
“and a case of”    course, the, a     beer
“oh my”            god, gosh, I       god, gosh, goodness
  • overlap with iMessage was about 80%
  • performance on the capstone quizzes was about 50%
  • returned words are intuitively reasonable
    • results vary greatly with the input corpus
  • the capstone corpus is old
    • new topics are not represented
    • a production model would refresh the corpus periodically