Predict the Next word - Shiny App

Bo Suzow
March 23, 2018

The Shiny App predicts the likely next word following a phrase you enter in the “Enter your phrase” text box.
The predicted words are displayed in the main pane labeled “Next Word Prediction.”
You may change the number of predicted words with the slider labeled “Number of predicted words.” It is initially set to 10, but you may choose to see up to 50 of them.

The source textual data consists of 3 large corpora – blogs (900K), news (77K) and tweets (2.4M).
20% of each corpus is sampled to form the training data of 700K observations.
Using functions from the quanteda package, the training data is tokenized, and a document-feature matrix (DFM) is generated from the tokens. The DFM is then transformed into four n-gram tables (unigram through quadgram).
To keep the app responsive, terms that occur only once are removed from the n-gram tables.
Because common words (stop words) dominate the counts, they are removed along with punctuation marks, digits, and single-letter words.
Twitter-specific characters (@ and #) and URLs are removed as well.
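
The preprocessing and n-gram counting described above can be sketched in a few lines. The app itself relies on the quanteda package in R; the Python below is only an illustration of the idea. The tiny stop-word list, the helper names `clean_tokens` and `ngram_table`, and the sample sentences are all assumptions made for this example.

```python
import re
from collections import Counter

# Assumption: a tiny illustrative stop-word list; the app's actual list is not given.
STOP_WORDS = {"is", "a", "the", "and", "of", "to", "in"}

def clean_tokens(text):
    """Lowercase the text, strip URLs, Twitter markers, punctuation and digits,
    then drop stop words and single-letter words."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)  # URLs
    text = re.sub(r"[@#]", " ", text)          # Twitter-specific characters
    text = re.sub(r"[^a-z\s]", " ", text)      # punctuation and digits
    return [t for t in text.split() if len(t) > 1 and t not in STOP_WORDS]

def ngram_table(docs, n, min_count=2):
    """Count n-grams (tokens joined with "_") and drop terms seen fewer than
    min_count times, mirroring the removal of single-occurrence terms."""
    counts = Counter()
    for doc in docs:
        toks = clean_tokens(doc)
        for i in range(len(toks) - n + 1):
            counts["_".join(toks[i:i + n])] += 1
    return {term: c for term, c in counts.items() if c >= min_count}

# Hypothetical sample documents.
docs = [
    "World hunger is a serious problem! http://example.org",
    "#World hunger is a serious problem, 2 years on.",
    "World hunger is a serious question.",
]
trigrams = ngram_table(docs, 3)
print(trigrams)
```

With these three sample sentences, only the trigrams that occur at least twice survive the pruning step.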

The language model (LM) is built on Katz's backoff (KBO), a “smoothing” mechanism for handling unobserved phrases when a likely next word is predicted.
The KBO model applies a discount to the observed n-gram counts, which frees up a probability mass.
That probability mass is then distributed among the terms observed in the lower-order n-gram table.

Assumptions:
- Discount: .75
- N-gram to start with: 4-gram
- Phrase for which the likely next words are predicted: world hunger is serious
Search the 4-gram table for terms starting with the phrase. The word “is” is removed from the phrase before the search because it is a common word.
Suppose three terms in the 4-gram table start with “world hunger serious”: they end in question, problem and issues, with counts of 5, 10 and 2 respectively. The probability of each term (based on the maximum likelihood estimate, with the discount subtracted from each count) is as follows:

term	count	prob	prob_formula
world_hunger_serious_question	5	0.250	(5-.75)/17
world_hunger_serious_problem	10	0.544	(10-.75)/17
world_hunger_serious_issues	2	0.074	(2-.75)/17

The probabilities of the 4-gram terms sum to 0.87, leaving 0.13 as the probability mass.
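
The arithmetic of the 4-gram step can be checked with a short sketch. It uses the fictitious counts from the table above and the assumed discount of 0.75; the function name `discounted_probs` is hypothetical and does not come from the app.

```python
DISCOUNT = 0.75  # discount assumed in the example

# Fictitious 4-gram counts taken from the table above.
four_gram_counts = {
    "world_hunger_serious_question": 5,
    "world_hunger_serious_problem": 10,
    "world_hunger_serious_issues": 2,
}

def discounted_probs(counts, discount=DISCOUNT):
    """Subtract the discount from each count and divide by the total count."""
    total = sum(counts.values())
    return {term: (c - discount) / total for term, c in counts.items()}

probs4 = discounted_probs(four_gram_counts)

# Whatever the discounted probabilities do not cover is the leftover mass.
leftover_mass = 1.0 - sum(probs4.values())

for term, p in probs4.items():
    print(f"{term}\t{p:.3f}")
print(f"leftover mass\t{leftover_mass:.2f}")
```

Rounded to the precision used in the table, this gives 0.250, 0.544 and 0.074 for the three terms and a leftover mass of 0.13.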

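Move on to the lower-order n-gram (3-gram) and search for terms starting with “hunger serious”.
Suppose five terms are found, of which only two are new, that is, not already seen in the 4-gram table: issue and note. The probability mass is shared between them in proportion to their counts (15 in total), so their probabilities are computed as follows: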

term	count	prob	prob_formula
hunger_serious_issue	10	0.087	10*.13/15
hunger_serious_note	5	0.043	5*.13/15
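
The backoff step can be sketched the same way. Following the prose, the two new 3-gram terms are taken to end in `issue` and `note`. The counts given here to the three already-seen 3-gram terms are invented for illustration, since the example does not state them, and the function name `backoff_probs` is hypothetical.

```python
PROB_MASS = 0.13  # leftover mass from the 4-gram step, rounded as in the example

# Words that already completed the phrase in the 4-gram table.
seen_in_4gram = {"question", "problem", "issues"}

# Fictitious 3-gram counts. Assumption: the counts of the three already-seen
# terms are made up; only the counts for "issue" and "note" come from the example.
three_gram_counts = {
    "hunger_serious_question": 7,
    "hunger_serious_problem": 12,
    "hunger_serious_issues": 4,
    "hunger_serious_issue": 10,
    "hunger_serious_note": 5,
}

def backoff_probs(lower_counts, seen_words, mass):
    """Share the leftover mass among lower-order terms whose last word was not
    seen at the higher order, in proportion to their counts."""
    new_terms = {t: c for t, c in lower_counts.items()
                 if t.rsplit("_", 1)[-1] not in seen_words}
    total = sum(new_terms.values())
    return {t: mass * c / total for t, c in new_terms.items()}

probs3 = backoff_probs(three_gram_counts, seen_in_4gram, PROB_MASS)
for term, p in probs3.items():
    print(f"{term}\t{p:.3f}")
```

Only the two new terms receive a share of the mass: 10 × 0.13 / 15 ≈ 0.087 and 5 × 0.13 / 15 ≈ 0.043.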

Combine the two tables (the 4-gram terms followed by the 3-gram terms). The final result is as follows (again, discount = 0.75, probability mass = 0.13):

term	count	prob	prob_formula
world_hunger_serious_question	5	0.250	(5-.75)/17
world_hunger_serious_problem	10	0.544	(10-.75)/17
world_hunger_serious_issues	2	0.074	(2-.75)/17
hunger_serious_issue	10	0.087	10*.13/15
hunger_serious_note	5	0.043	5*.13/15

The app sorts the results in descending order of probability and displays the top words, up to the number selected on the slider.
If no terms matching the phrase are observed in the 4-gram table, the above process starts from the next lower n-gram (3-gram) instead.
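
Putting the pieces together, the whole lookup (match, discount, one level of backoff, sort, truncate, and fall back to a lower n-gram when nothing matches) might look roughly like the sketch below. This is not the app's code: `predict_next`, `_matches` and the layout of `tables` are assumptions, and the 3-gram counts for already-seen words are invented. Unlike the worked example, the sketch uses the unrounded leftover mass, so the backed-off probabilities differ slightly in the third decimal.

```python
def _matches(table, context):
    """Return {next_word: count} for the terms in `table` that start with `context`."""
    prefix = "_".join(context) + "_" if context else ""
    return {t[len(prefix):]: c for t, c in table.items() if t.startswith(prefix)}

def predict_next(tokens, tables, discount=0.75, top_n=10):
    """tokens: the cleaned phrase; tables: {n-gram order: {term: count}}."""
    for order in sorted(tables, reverse=True):
        if order < 2 or len(tokens) < order - 1:
            continue
        seen = _matches(tables[order], tokens[-(order - 1):])
        if not seen:
            continue  # nothing observed here: start from the next lower n-gram
        total = sum(seen.values())
        probs = {w: (c - discount) / total for w, c in seen.items()}
        mass = 1.0 - sum(probs.values())
        # Back off one level: share the leftover mass among words that were
        # not seen at the higher order, in proportion to their counts.
        lower_context = tokens[-(order - 2):] if order > 2 else []
        lower = _matches(tables.get(order - 1, {}), lower_context)
        unseen = {w: c for w, c in lower.items() if w not in seen}
        unseen_total = sum(unseen.values())
        for w, c in unseen.items():
            probs[w] = mass * c / unseen_total
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:top_n]
    return []

# Fictitious tables; the 3-gram counts for already-seen words are assumptions.
tables = {
    4: {"world_hunger_serious_question": 5,
        "world_hunger_serious_problem": 10,
        "world_hunger_serious_issues": 2},
    3: {"hunger_serious_question": 7,
        "hunger_serious_problem": 12,
        "hunger_serious_issues": 4,
        "hunger_serious_issue": 10,
        "hunger_serious_note": 5},
}

top3 = predict_next(["world", "hunger", "serious"], tables, top_n=3)
fallback = predict_next(["global", "hunger", "serious"], tables, top_n=1)
print(top3)
print(fallback)
```

The second call has no match in the 4-gram table, so the search starts directly from the 3-gram table, as described above.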
Disclaimer: The term counts and probabilities shown in the tables above do not reflect actual search results; they are fictitious values chosen to keep the example simple.