A Shiny App
by Richard Ambler
December 2014

How do we use the app?

The application is pretty straightforward to use: we simply input a partial phrase in the text box and let the app try to predict the next word.

For example, we input the following text:

The app responds by providing us with a set of up to 5 prediction buttons containing the words that the app thinks are most likely to come up next (arranged from left to right in order of descending estimated probability).

The app also produces an estimated probability distribution for our further scrutiny and presents it as a word-cloud.

We can continue our input by either clicking on one of the buttons or by typing more words.
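
To make this interaction concrete, here is a minimal sketch (not the app's actual code) of how a row of up to 5 prediction buttons can be generated dynamically in Shiny; predict.next.word is a hypothetical stand-in for the app's predictor.

library(shiny)

# Hypothetical stand-in for the app's predictor.
predict.next.word <- function(phrase) c("coffee", "tea", "water", "milk", "juice")

ui <- fluidPage(
  textInput("phrase", "Enter a partial phrase:"),
  uiOutput("predictions")
)

server <- function(input, output, session) {
  output$predictions <- renderUI({
    words <- head(predict.next.word(input$phrase), 5)
    # One actionButton per predicted word, most likely word first.
    lapply(seq_along(words), function(i)
      actionButton(paste0("pred", i), label = words[i]))
  })
}

# shinyApp(ui, server)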

Some features

The app is able to take partial input into account when making predictions. For example, consider the fairly open-ended input:

The app responds with a distribution containing a broad range of tokens:

If we input the first letter of the next word we are looking for, however:

The app responds with a much more focused distribution:

Because of this feature, the input should end with a space when we want the app to predict a new word; otherwise the app assumes it is completing a partial word.
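
A small illustration of how this can be detected from the raw input (the variable names here are assumptions for the example, not necessarily the app's own):

# If the input ends in a space, the user wants a brand-new word;
# otherwise the last (partial) token is used as a prefix filter.
input.text <- "I would like a cup of "      # trailing space
new.word <- grepl(" $", input.text)         # TRUE -> predict a fresh word

input.text <- "I would like a cup of c"     # partial word "c"
new.word <- grepl(" $", input.text)         # FALSE -> match words starting with "c"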

How does the app work?

The short answer is: linear interpolation of n-gram models.

The slightly longer answer is: n-grams (sequences of n words), for \( 1\le n\le 4 \), were collected and counted from a large corpus containing text from news, blogs and Twitter sources (provided by the instructors). As the user inputs text, the app searches through its database of n-grams for matches and suggests the most likely among the matches it finds. Higher order n-grams are given more weight.
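
As a rough sketch of the kind of counting involved (the toy corpus and helper below are illustrative only; the app's actual pre-processing was more involved):

# Toy corpus; the real corpus combined news, blog and Twitter text.
corpus <- c("the cat sat on the mat", "the cat ate")

# Count all n-grams of a given order across the corpus.
count.ngrams <- function(corpus, n) {
  grams <- unlist(lapply(strsplit(tolower(corpus), "\\s+"), function(tokens) {
    if (length(tokens) < n) return(character(0))
    sapply(seq_len(length(tokens) - n + 1), function(i)
      paste(tokens[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(grams), decreasing = TRUE)
}

count.ngrams(corpus, 2)    # "the cat" occurs twice, every other 2-gram once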

Specifically, the model used by the app is

\[ P(T) = \sum_{k=1}^{4}{\lambda_k\cdot P_k(T)} \]

where \( P(T) \) is the estimated probability of token (word) \( T \), \( P_k(T) \) is the estimated probability of token \( T \) based on data collected from \( k \)-grams and \( \lambda_k \) is the weight assigned to each respective model. The \( \lambda \)’s are, of course, constrained to sum to 1 in order to preserve the probability distribution.

The final weights selected were 0.0025, 0.0285, 0.2181 and 0.7509 for the 1-gram to 4-gram models respectively.
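
As a toy illustration of the interpolation with these weights (the per-model probabilities below are made up for the example):

lambda <- c(0.0025, 0.0285, 0.2181, 0.7509)    # weights for the 1- to 4-gram models

# Made-up per-model probability estimates P_k(T) for one candidate token T.
p.model <- c(0.01, 0.05, 0.12, 0.30)

# Interpolated estimate: P(T) = sum_k lambda_k * P_k(T)
sum(lambda * p.model)                          # about 0.253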

Training the \( \lambda \)’s

The specific values of the weights were determined by a reward-punish learning algorithm that ran through hold-out text from the various sources. Specifically, each of the four weights was initially set to 0.25 and then, as text was read in, each model was assessed according to how well it predicted the word that actually came up next, relative to the overall model. Models that performed relatively well were rewarded with a greater weight at the expense of models that performed relatively poorly.
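
A rough sketch of what a single update step of such a scheme might look like (the update rule and learning rate here are illustrative assumptions, not necessarily the app's exact algorithm):

# Hypothetical single update step of the reward-punish scheme.
lambda <- rep(0.25, 4)                   # initial weights for the 1- to 4-gram models
p.model <- c(0.02, 0.06, 0.15, 0.35)     # made-up per-model probabilities of the word
                                         # that actually came next
rate <- 0.01                             # illustrative learning rate

p.overall <- sum(lambda * p.model)       # probability under the combined model

# Reward models that beat the combined estimate, punish those that fall short,
# then renormalise so the weights still sum to 1.
lambda <- lambda + rate * (p.model - p.overall)
lambda <- pmax(lambda, 0)
lambda <- lambda / sum(lambda)
lambda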

The following graph shows how the weights changed as the learning progressed.

Some final details:

  • A vocabulary of 15 421 words was incorporated.
  • The numbers of 2-, 3- and 4-grams stored were 582 775, 543 227 and 559 038 respectively.
  • The n-grams data took up 5.7 MB on disk and 25.7 MB when loaded into an R session.
  • The n-grams were stored as integers in a list of data.frames (see the sketch after this list).
  • The lookup algorithm is essentially, for each model:
    • Mark every stored n-gram as a potential match.
    • Iteratively go through the input tokens and un-mark the non-matches.
    • Return the tokens of the n-grams that are still marked, together with their estimated probabilities.
  • The code fragment below sits inside a lapply call that creates a list of data.frames, and represents the heart of the search mechanism. In the code, x is the model number (e.g., a value of 4 denotes the 4-gram model) and the logical new.word is TRUE if the last character in the input box is a space (i.e., if the user is seeking a prediction for a new word).
  • Visit the Ye Old Sluggardly Key Repository hosted on GitHub for more info!
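
For illustration, the storage layout might be sketched as follows (toy values; the assumption that column V1 holds the n-gram's count is mine, while V2, V3, … holding word indices follows the search code below):

# Hypothetical toy example of the layout: ngrams[[n]] holds the n-gram table,
# with each word replaced by its integer index into `vocabulary`.
vocabulary <- c("a", "cup", "of", "tea", "coffee")

ngrams <- list(
  NULL,                            # 1-gram table omitted in this toy example
  data.frame(V1 = c(50L, 12L),     # assumed: count of each 2-gram
             V2 = c(3L, 3L),       # index of the first word ("of")
             V3 = c(4L, 5L))       # index of the second word ("tea", "coffee")
)

# Recover the words of the first stored 2-gram ("of tea"):
vocabulary[c(ngrams[[2]]$V2[1], ngrams[[2]]$V3[1])]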


  
# Start by marking every row (n-gram) of this model as a potential match.
seek.condition <- rep(TRUE, nrow(ngrams[[x]]))

if (length(tokens.to.take) > 0)
  for (i in 1:tokens.to.take) {
    # Column V2 holds the first word of the n-gram, V3 the second, and so on.
    col.key <- sprintf("V%d", i + 1)
    seek.condition <- seek.condition &
      if (i == tokens.to.take && (!new.word))
        # Last input token with no trailing space: treat it as a prefix and keep
        # n-grams whose word at this position starts with that prefix.
        # (str_sub and str_length come from the stringr package.)
        (str_sub(vocabulary[ngrams[[x]][[col.key]]], 1,
                 str_length(lc.tokens[i])) == lc.tokens[i])
      else
        # Otherwise require an exact match on the word at this position.
        vocabulary[ngrams[[x]][[col.key]]] == lc.tokens[i]
  }

# Keep the rows that survived all of the filters.
head(ngrams[[x]][seek.condition, ])



Thanks for your time!