Bo Suzow
March 23, 2018
The Shiny App predicts the likely next word following a phrase you enter in the “Enter your phrase” text box.
The predicted words are displayed in the main pane labeled “Next Word Prediction.”
You may change the number of words predicted with the slider labeled “Number of predicted words.” It is initially set to 10, but you may choose to see up to 50.
The source textual data consists of 3 large corpora – blogs (900K), news (77K) and tweets (2.4M).
20% of each corpus is sampled to form the training data of about 700K observations.
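The sampling step can be sketched in a few lines. This is a Python illustration, not the app's R code; the function name `sample_corpus`, the seed, and the toy corpora are assumptions made up for the example.

```python
import random

def sample_corpus(lines, fraction=0.2, seed=2018):
    """Return a random subset holding `fraction` of the lines (hypothetical helper)."""
    rng = random.Random(seed)          # fixed seed so the subset is reproducible
    k = int(len(lines) * fraction)     # number of lines to keep
    return rng.sample(lines, k)

# Toy stand-ins for the blogs, news and tweets corpora (sizes are made up).
corpora = {
    "blogs": [f"blog line {i}" for i in range(50)],
    "news": [f"news line {i}" for i in range(20)],
    "tweets": [f"tweet {i}" for i in range(100)],
}

# Sample 20% of each corpus separately, then pool the samples.
training = [line for lines in corpora.values() for line in sample_corpus(lines)]
```

Sampling each corpus separately keeps the three sources in the same proportions as in the full data.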
Using functions from the quanteda package, the training data is broken down into tokens, from which a document-feature matrix (DFM) is generated. The DFM is then transformed into 4 n-gram tables (unigram through quadgram).
To keep the app responsive, terms with only one occurrence are removed from the n-gram tables.
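A minimal sketch of building and pruning the n-gram tables is shown here in Python. It does not use quanteda; the helper names `ngram_counts` and `prune_singletons` and the three toy documents are assumptions for illustration only.

```python
from collections import Counter

def ngram_counts(documents, n):
    """Count every run of n consecutive tokens, joined with '_'."""
    counts = Counter()
    for tokens in documents:
        for i in range(len(tokens) - n + 1):
            counts["_".join(tokens[i:i + n])] += 1
    return counts

def prune_singletons(counts):
    """Drop terms that occur only once."""
    return {term: c for term, c in counts.items() if c > 1}

# Toy, already tokenized documents (made up for this example).
docs = [
    ["world", "hunger", "serious", "problem"],
    ["world", "hunger", "serious", "problem"],
    ["world", "hunger", "serious", "question"],
]

# One table per n-gram order, unigram through quadgram.
tables = {n: prune_singletons(ngram_counts(docs, n)) for n in range(1, 5)}
```

In this toy data, "question" occurs once, so every term containing it disappears from all four tables.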
Because common words (stop words) dominate the counts, they are removed along with punctuation marks, digits, and single-letter words.
Twitter-specific characters (e.g. @ and #) and URLs are removed as well.
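The cleaning rules above can be approximated as follows. This is a Python sketch under stated assumptions: the stop word list is a tiny made-up sample, the regular expressions are one possible reading of the rules, and the @ and # characters are stripped while the word after them is kept.

```python
import re

# Tiny illustrative stop word list; a real list is much longer.
STOP_WORDS = {"the", "a", "an", "is", "of", "and", "to", "in"}

URL_RE = re.compile(r"(?:https?://|www\.)\S+")   # assumed URL pattern
TOKEN_RE = re.compile(r"[a-z']+")                # letters and apostrophes only

def clean_tokens(text):
    """Lowercase, strip URLs and @/#, drop digits, punctuation,
    single-letter words and stop words."""
    text = text.lower()
    text = URL_RE.sub(" ", text)          # remove URLs first
    text = re.sub(r"[@#]", " ", text)     # remove Twitter-specific characters
    tokens = [t.strip("'") for t in TOKEN_RE.findall(text)]
    return [t for t in tokens if len(t) > 1 and t not in STOP_WORDS]

tokens = clean_tokens(
    "World hunger is a serious #problem in 2018! See https://example.com @UN"
)
```

Digits and punctuation never match the token pattern, so they drop out without a separate step.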
The language model (LM) is built on Katz's backoff (KBO), a “smoothing” technique for handling phrases that were not observed in the training data, so that a likely next word can still be predicted.
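The KBO model applies a discount to observed n-gram counts, which frees up some leftover probability mass.
That probability mass is then distributed among the terms observed in the lower n-gram table.
For example, suppose the phrase “world hunger serious” matches three terms in the 4-gram table, with a total count of 17, and the discount is .75: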
| term | count | prob | prob_formula |
|---|---|---|---|
| world_hunger_serious_question | 5 | 0.250 | (5-.75)/17 |
| world_hunger_serious_problem | 10 | 0.544 | (10-.75)/17 |
| world_hunger_serious_issues | 2 | 0.074 | (2-.75)/17 |
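The arithmetic in this table can be checked with a short Python sketch. The function name `discounted_probs` is hypothetical; the counts and the .75 discount are the fictitious values from the table.

```python
def discounted_probs(counts, discount=0.75):
    """Subtract a fixed discount from each count and divide by the total count.
    Returns the discounted probabilities and the leftover probability mass."""
    total = sum(counts.values())
    probs = {term: (c - discount) / total for term, c in counts.items()}
    leftover = 1.0 - sum(probs.values())
    return probs, leftover

four_gram = {
    "world_hunger_serious_question": 5,
    "world_hunger_serious_problem": 10,
    "world_hunger_serious_issues": 2,
}
probs, mass = discounted_probs(four_gram)
```

The leftover mass equals 3 × .75 / 17, which is about .13; this is the amount available for terms found only in the lower n-gram table.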
Now suppose 5 terms are found in the 3-gram table, two of which, issue and note, are new (not seen among the 4-gram matches). The probability mass of .13 is split between them in proportion to their counts:
| term | count | prob | prob_formula |
|---|---|---|---|
| hunger_serious_issue | 10 | 0.087 | 10*.13/15 |
| hunger_serious_note | 5 | 0.043 | 5*.13/15 |
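A possible Python sketch of this backoff step follows. The function name `backoff_probs` is hypothetical, and the counts for the three 3-gram terms already seen in the 4-gram table are made up, since the example only gives counts for the two new terms.

```python
def backoff_probs(lower_counts, seen_words, mass):
    """Share `mass` among lower n-gram terms whose last word was not
    already seen in the higher n-gram table, in proportion to their counts."""
    new_terms = {
        term: c for term, c in lower_counts.items()
        if term.rsplit("_", 1)[-1] not in seen_words
    }
    total = sum(new_terms.values())
    return {term: c * mass / total for term, c in new_terms.items()}

three_gram = {
    "hunger_serious_question": 8,   # made-up count, word already seen
    "hunger_serious_problem": 20,   # made-up count, word already seen
    "hunger_serious_issues": 4,     # made-up count, word already seen
    "hunger_serious_issue": 10,     # new word
    "hunger_serious_note": 5,       # new word
}
seen = {"question", "problem", "issues"}   # final words of the 4-gram matches
backed_off = backoff_probs(three_gram, seen, mass=0.13)
```

Only the two new terms receive a share, so the denominator is 10 + 5 = 15 and the shares add up to the .13 mass.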
Combine the two tables (the 4-gram terms followed by the 3-gram terms), and the final result is (again, discount = .75, prob mass = .13):
| term | count | prob | prob_formula |
|---|---|---|---|
| world_hunger_serious_question | 5 | 0.250 | (5-.75)/17 |
| world_hunger_serious_problem | 10 | 0.544 | (10-.75)/17 |
| world_hunger_serious_issues | 2 | 0.074 | (2-.75)/17 |
| hunger_serious_issue | 10 | 0.087 | 10*.13/15 |
| hunger_serious_note | 5 | 0.043 | 5*.13/15 |
The app sorts the results in descending order of probability and presents the top words, up to the number the user selects on the slider.
If no 4-gram terms match the phrase, the same process starts from the next lower n-gram table (3-gram).
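The whole lookup, including the fall-through to a lower n-gram table, could look roughly like the Python sketch below. The function `predict`, the layout of `tables`, and the final unigram fallback are assumptions for illustration, not the app's actual code; the counts reuse the fictitious example plus made-up 3-gram counts.

```python
def predict(phrase_tokens, tables, top_n=10, discount=0.75):
    """tables maps n -> {term: count}. Returns [(word, prob), ...], best first."""
    for n in range(min(4, len(phrase_tokens) + 1), 1, -1):
        context = phrase_tokens[-(n - 1):]
        prefix = "".join(w + "_" for w in context)
        matches = {t[len(prefix):]: c for t, c in tables[n].items()
                   if t.startswith(prefix)}
        if not matches:
            continue  # nothing observed: restart with the next lower n-gram
        # Discounted probabilities for the observed terms.
        total = sum(matches.values())
        probs = {w: (c - discount) / total for w, c in matches.items()}
        mass = 1.0 - sum(probs.values())
        # Give the leftover mass to new words from the next lower table.
        lower_prefix = "".join(w + "_" for w in context[1:])
        new = {t[len(lower_prefix):]: c for t, c in tables[n - 1].items()
               if t.startswith(lower_prefix)
               and t[len(lower_prefix):] not in matches}
        new_total = sum(new.values())
        for w, c in new.items():
            probs[w] = c * mass / new_total
        ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
        return ranked[:top_n]
    # Assumed last resort: most frequent unigrams by relative frequency.
    total = sum(tables[1].values())
    ranked = sorted(tables[1].items(), key=lambda kv: kv[1], reverse=True)
    return [(w, c / total) for w, c in ranked[:top_n]]

tables = {
    4: {"world_hunger_serious_question": 5, "world_hunger_serious_problem": 10,
        "world_hunger_serious_issues": 2},
    3: {"hunger_serious_question": 8, "hunger_serious_problem": 20,
        "hunger_serious_issues": 4, "hunger_serious_issue": 10,
        "hunger_serious_note": 5},
    2: {},
    1: {},
}
top3 = predict(["world", "hunger", "serious"], tables, top_n=3)
```

Here `top_n` plays the role of the slider value. A phrase such as "global hunger serious" finds no 4-gram match, so the loop moves on and answers from the 3-gram table instead.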
Disclaimer: The term counts and probabilities shown in the tables above do not reflect actual search results; they are fictitious values chosen to keep the example simple.