Additional funding request based on promising application performance.
Kenneth Bury, Coursera Johns Hopkins Data Science Capstone, April 2015
The MIT Natural Language Processing lecture notes provided practical insight into the use of n-gram language models. A combined interpolation and back-off model was used to rank the trigrams and bigrams appearing in the sample texts. Because the application would be tested using sample phrases from Twitter and news articles, only those texts were used to train the model. The model parameter lambda was tuned on a hold-out sample: the lambda value that maximized the next-word prediction match rate on hold-out phrases was selected and used to produce the trigram and bigram rank tables.
MIT 6.863J/9.611J Natural Language Processing - Home. (2012). Retrieved April 25, 2015, from http://web.mit.edu/6.863/www/fall2012/lectures/lecture2&3-notes12.pdf
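As a rough illustration of the ranking and tuning described above, the following Python sketch mixes trigram and bigram maximum-likelihood estimates with a single weight and picks that weight from a small grid by first-word match rate on hold-out trigrams. The single-lambda formula, the grid values, and the function names (`train_counts`, `interpolated_score`, `predict`, `tune_lambda`) are assumptions made for this example; the exact formulation used in the application is not given here.

```python
from collections import Counter


def train_counts(sentences):
    """Count unigrams, bigrams and trigrams from tokenized sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for tokens in sentences:
        for i, w in enumerate(tokens):
            uni[w] += 1
            if i >= 1:
                bi[(tokens[i - 1], w)] += 1
            if i >= 2:
                tri[(tokens[i - 2], tokens[i - 1], w)] += 1
    return uni, bi, tri


def interpolated_score(u, v, w, counts, lam):
    """Assumed form: lam * P(w | u, v) + (1 - lam) * P(w | v)."""
    uni, bi, tri = counts
    p_tri = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
    p_bi = bi[(v, w)] / uni[v] if uni[v] else 0.0
    return lam * p_tri + (1 - lam) * p_bi


def predict(u, v, counts, lam, k=4):
    """Rank every word seen after v and return the top k."""
    _, bi, _ = counts
    candidates = {w for (prev, w) in bi if prev == v}
    ranked = sorted(
        candidates,
        key=lambda w: (-interpolated_score(u, v, w, counts, lam), w),
    )
    return ranked[:k]


def tune_lambda(holdout, counts, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Return the grid value with the best first-word match rate on hold-out trigrams."""
    def hit_rate(lam):
        hits = sum(
            1 for (u, v, w) in holdout
            if predict(u, v, counts, lam, k=1) == [w]
        )
        return hits / len(holdout)
    return max(grid, key=hit_rate)
```

In this sketch a larger lambda puts more trust in the trigram evidence, while a smaller one leans on the bigram counts. Sorting by score and then alphabetically keeps the ranking deterministic when scores tie.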
Different model sizes were tested, and the results showed that increasing the model size only marginally increased the next-word prediction rate. Text prediction applications usually present multiple words to choose from, in this case four. If the actual next word was not in that set, it was counted as a miss.
A model built from a 10% sample of the text was used in the application. We expect the first predicted word to match about 15% of the time and the correct word to appear within the four-word set about 30% of the time.
| Sample % | Phrases tested | First match % | Miss % | Within 4 words % |
|---|---|---|---|---|
| 0.1 | 381 | 11.5 | 76.6 | 23.4 |
| 1 | 3931 | 13.7 | 71.6 | 28.4 |
| 10 | 38982 | 14.8 | 69.3 | 30.7 |
| 50 | 194314 | 15.7 | 68.4 | 31.6 |
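The columns in the table can be computed from a list of test phrases along the lines of the sketch below. It assumes each test case is a (context, actual next word) pair and that `predict_fn` is any function returning a ranked list of candidate words; both names, and the `evaluate` helper itself, are hypothetical and only illustrate the definitions of a first match, a hit within the four-word set, and a miss.

```python
def evaluate(test_cases, predict_fn, k=4):
    """Return first-match, within-k and miss percentages for a predictor.

    test_cases: list of (context, actual_next_word) pairs.
    predict_fn: callable mapping a context to a ranked list of words.
    """
    total = len(test_cases)
    first = within = 0
    for context, actual in test_cases:
        preds = predict_fn(context)[:k]
        if preds and preds[0] == actual:
            first += 1
        if actual in preds:
            within += 1

    def pct(n):
        return round(100.0 * n / total, 1)

    return {
        "first_match_pct": pct(first),
        "within_k_pct": pct(within),
        # A miss is any case where the actual word is not among the k shown.
        "miss_pct": pct(total - within),
    }
```

Because a miss is defined as the complement of a hit within the displayed set, the miss and within-set percentages should sum to 100 in every row.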
An input text box is presented for text entry. The application waits for a word followed by a space character before starting the next-word prediction. The input text is cleaned and tokenized using the same process used to clean the Twitter and news article texts. The last two tokens are passed to the prediction function, which uses the trigram and bigram ranking tables as models of the text. If the two tokens return four next words from the trigram model, those words are returned. If fewer than four words are returned, the last token alone is used to find additional next words from the bigram model. A unique list of up to four predicted next words is returned, and the words are used as labels on buttons. When a button is pressed, that predicted word is added to the input text box. The prediction process starts again when another space character is entered.
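The lookup logic described above can be sketched in Python as follows. The table layout (a dict from a two-token tuple to a ranked word list for trigrams, and from a single token to a ranked word list for bigrams), the simplified regex tokenizer, and the names `tokenize` and `predict_next` are assumptions for illustration; the application's actual cleaning steps and data structures are not shown here.

```python
import re


def tokenize(text):
    """Simplified stand-in for the cleaning step: lowercase, keep letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())


def predict_next(text, trigram_table, bigram_table, k=4):
    """Return up to k unique next-word candidates, trigram matches first."""
    # Prediction only starts once a word is followed by a space.
    if not text.endswith(" "):
        return []
    tokens = tokenize(text)
    if not tokens:
        return []

    words = []
    if len(tokens) >= 2:
        # Look up the last two tokens in the trigram rank table.
        words.extend(trigram_table.get((tokens[-2], tokens[-1]), [])[:k])

    if len(words) < k:
        # Back off: fill the remaining slots from the bigram rank table.
        for w in bigram_table.get(tokens[-1], []):
            if w not in words:
                words.append(w)
            if len(words) == k:
                break
    return words
```

For example, with a trigram entry `("thanks", "for") -> ["the", "your"]` and a bigram entry `"for" -> ["the", "a", "you", "me", "your"]`, the input `"Thanks for "` yields the two trigram words first, followed by the two highest-ranked bigram words not already in the list.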
Try the application and see how well it works with your test phrases.
Link to Shiny application
1) Enter a phrase in the empty text input box, followed by a space after the last word.
2) The prediction buttons will display only after the space character is entered.