The application is pretty straightforward to use: we simply input a partial phrase in the text box and let the app try to predict the next word.
For example, we input the following text:
The app responds by providing us with a set of up to 5 prediction buttons containing the words that the app thinks are most likely to come up next (arranged from left to right in order of descending estimated probability).
The app also produces an estimated probability distribution for our further scrutiny and presents it as a word-cloud.
We can continue our input by either clicking on one of the buttons or by typing more words.
The app is able to take partial input into account when making predictions. For example, consider the fairly open-ended input:
The app responds with a distribution containing a broad range of tokens:
If we input the first letter of the next word we are looking for, however:
The app responds with a much more focused distribution:
Because of this feature, we should leave a space after the last word when we want the app to predict a new word; otherwise, the app will assume it is completing a partial word.
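As a rough illustration, the trailing-space check might look like this in R (the function name here is ours, not necessarily the app's, though it mirrors the new.word flag that appears in the search code later on):

library(stringr)

# Illustrative helper: TRUE if the user wants a brand-new word
# (input ends in a space), FALSE if a partial word should be completed.
is.new.word <- function(input) {
  str_sub(input, -1) == " "
}

is.new.word("to be or ")   # TRUE  -> predict the next word
is.new.word("to be or n")  # FALSE -> complete the partial word "n..."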
The short answer is: linear interpolation of n-gram models.
The slightly longer answer is: n-grams (sequences of n words), for \( 1\le n\le 4 \), were collected and counted from a large corpus of news, blog, and Twitter text (provided by the instructors). As the user inputs text, the app searches its database of n-grams for matches and suggests the most likely words among the matches it finds. Higher-order n-grams are given more weight.
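The counting step itself is conceptually simple. Here is a minimal sketch of tabulating n-grams from raw text (the app's real preprocessing pipeline surely includes more cleaning than this):

library(stringr)

# Illustrative n-gram counter: lowercases and tokenizes the text, then
# tabulates every run of n consecutive tokens.
count.ngrams <- function(text, n) {
  tokens <- unlist(str_split(str_to_lower(text), "\\s+"))
  tokens <- tokens[tokens != ""]
  if (length(tokens) < n) return(table(character(0)))
  grams <- sapply(seq_len(length(tokens) - n + 1),
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "))
  sort(table(grams), decreasing = TRUE)
}

count.ngrams("the cat sat on the mat", 2)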
Specifically, the model used by the app is
\[ P(T) = \sum_{k=1}^{4}{\lambda_k\cdot P_k(T)} \]where \( P(T) \) is the estimated probability of token (word) \( T \), \( P_k(T) \) is the estimated probability of token \( T \) based on data collected from \( k \)-grams and \( \lambda_k \) is the weight assigned to each respective model. The \( \lambda \)’s are, of course, constrained to sum to 1 in order to preserve the probability distribution.
The final weights selected were 0.0025, 0.0285, 0.2181, and 0.7509 for the 1-gram through 4-gram models, respectively.
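To make the formula concrete, here is the interpolation as a weighted sum in R; the per-model probabilities below are made-up numbers for a single hypothetical candidate token, not values from the app's tables:

# Final interpolation weights for the 1-gram through 4-gram models.
lambda <- c(0.0025, 0.0285, 0.2181, 0.7509)

# Hypothetical per-model estimates P_k(T) for one candidate token T.
p.k <- c(0.001, 0.004, 0.030, 0.080)

# Interpolated estimate: P(T) = sum over k of lambda_k * P_k(T)
sum(lambda * p.k)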
The specific values of the weights were determined by a reward-punish learning algorithm run over held-out text from the various sources. Each of the four weights was initially set to 0.25; then, as text was read in, each model was assessed by how well it predicted the word that actually came up next, relative to the overall interpolated model. Models that performed relatively well were rewarded with greater weight at the expense of models that performed less well.
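The description above doesn't pin down the exact update rule, but one plausible multiplicative reward-punish scheme matching it (grow the weights of models that beat the interpolated estimate, shrink the rest, renormalize) would be:

# Hypothetical sketch of a reward-punish update. For each observed word,
# p[k] is the probability model k assigned to it; lambda is the current
# weight vector; eta is an assumed learning-rate parameter.
update.weights <- function(lambda, p, eta = 0.05) {
  overall <- sum(lambda * p)            # the interpolated model's estimate
  if (overall == 0) return(lambda)      # nothing to learn from this word
  lambda <- lambda * (1 + eta * (p / overall - 1))
  lambda / sum(lambda)                  # renormalize so the weights sum to 1
}

lambda <- rep(0.25, 4)                  # the initial equal weights
p <- c(0.001, 0.004, 0.030, 0.080)      # per-model probabilities (made up)
update.weights(lambda, p)

Repeated over enough held-out text, a rule of this shape gradually favors the models that predict best, consistent with the 4-gram model ending up dominant.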
The following graph shows how the weights changed as the learning progressed.
The heart of the search mechanism is an lapply call that creates a list of data.frames, one per model. In the code below, x represents the model number (e.g., a value of 4 selects the 4-gram model) and the boolean new.word is TRUE if the last character in the input box is a space (i.e., if the user is seeking a prediction for a new word).
library(stringr)  # for str_sub() and str_length()

# Start by matching every row of the x-gram table, then narrow down
# column by column against the user's lowercased tokens.
seek.condition <- rep(TRUE, nrow(ngrams[[x]]))
if (tokens.to.take > 0)   # only filter when there are tokens to match
  for (i in 1:tokens.to.take) {
    col.key <- sprintf("V%d", i + 1)
    seek.condition <- seek.condition &
      if (i == tokens.to.take && !new.word)
        # Last token with no trailing space: treat it as a prefix match.
        (str_sub(vocabulary[ngrams[[x]][[col.key]]],
                 1, str_length(lc.tokens[i])) == lc.tokens[i])
      else
        # Otherwise require an exact token match.
        vocabulary[ngrams[[x]][[col.key]]] == lc.tokens[i]
  }
head(ngrams[[x]][seek.condition, ])
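Two details are worth noting in this snippet. First, the n-gram tables evidently store integer word IDs rather than strings, so vocabulary[...] translates the stored IDs back into words before comparing them against the user's lowercased tokens (lc.tokens), which keeps the tables compact. Second, the str_sub branch on the final token is what implements the partial-match behavior described earlier: when the input does not end in a space, only the typed prefix has to match.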