nextWord prediction app

John Walker - Data Science Capstone
18 April 2016

Creating the Corpus

Using the concept of Markov chains, the general approach is to build a set of n-grams and estimate their probabilities of use. From the initial datasets provided for blog, news and Twitter text, a corpus was made by removing numbers and punctuation, except apostrophes (“don't”) and hyphens that touch a word (so “one-sided” was kept but not “hey - do you”). The probability of the bigram phrase \( W_{n-1} W_n \) is \( P(W_n|W_{n-1}) = \frac{Count(W_{n-1}W_n)}{Count(W_{n-1})} \)
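
A minimal sketch in R of estimating bigram probabilities from token counts in this way; the function and variable names are illustrative, not the app's actual code:

```r
# Sketch of maximum-likelihood bigram estimation:
# P(w_n | w_{n-1}) = count(w_{n-1} w_n) / count(w_{n-1})
bigram_prob <- function(tokens) {
  uni <- table(tokens)                              # unigram counts
  pairs <- paste(head(tokens, -1), tail(tokens, -1))
  bi <- table(pairs)                                # bigram counts from consecutive pairs
  first_word <- sapply(strsplit(names(bi), " "), `[`, 1)
  data.frame(bigram = names(bi),
             prob   = as.numeric(bi) / as.numeric(uni[first_word]),
             row.names = NULL)
}

tokens <- c("i", "like", "green", "eggs", "i", "like", "ham")
bigram_prob(tokens)   # "i like" gets probability 1.0 (2/2), "like green" 0.5, etc.
```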

The trigram probability for \( W_{n-2} W_{n-1} W_n \), \( P(W_n|W_{n-2}W_{n-1}) \), is approximated as a chain of bigram probabilities: \( P(W_n|W_{n-1}) * P(W_{n-1}|W_{n-2}) \)

For trigrams, accuracy is improved somewhat with an interpolation technique where \( P(trigram)_{interpolated} = \lambda_1 * P(trigram) + \lambda_2 * P(bigram) + \lambda_3 * P(unigram) \) with \( \lambda_1 + \lambda_2 + \lambda_3 = 1 \). During testing with a reserved portion of the text, the interpolation weights giving the best results were \( \lambda_1 = 0.6 \), \( \lambda_2 = 0.3 \), \( \lambda_3 = 0.1 \).
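
A small sketch of that interpolation step, assuming the individual trigram, bigram and unigram probabilities have already been looked up; the helper name and example values are hypothetical:

```r
# Weighted combination of trigram, bigram and unigram probabilities
interpolate_trigram <- function(p_trigram, p_bigram, p_unigram,
                                lambdas = c(0.6, 0.3, 0.1)) {
  # weights must sum to 1 so the result stays a valid probability
  stopifnot(abs(sum(lambdas) - 1) < 1e-9)
  lambdas[1] * p_trigram + lambdas[2] * p_bigram + lambdas[3] * p_unigram
}

# A rarely seen trigram still gets support from its bigram and unigram:
interpolate_trigram(p_trigram = 0.05, p_bigram = 0.20, p_unigram = 0.02)
# 0.6*0.05 + 0.3*0.20 + 0.1*0.02 = 0.092
```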

Quadgram (4-gram) probabilities are likewise approximated as a chain of bigram probabilities: \( P(W_n|W_{n-3}W_{n-2}W_{n-1}) \) = \( P(W_n|W_{n-1}) \) * \( P(W_{n-1}|W_{n-2}) \) * \( P(W_{n-2}|W_{n-3}) \)
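
As a toy illustration of the chain-of-bigrams idea for a 4-gram, with made-up bigram probabilities (not values from the corpus):

```r
# Stand-in for a bigram lookup P(w | prev); values are invented for illustration
p <- function(w, prev) {
  lookup <- c("york|new" = 0.4, "new|in" = 0.1, "in|live" = 0.5)
  unname(lookup[paste(w, prev, sep = "|")])
}

# P(york | live in new) approximated as P(york|new) * P(new|in) * P(in|live)
p("york", "new") * p("new", "in") * p("in", "live")   # 0.4 * 0.1 * 0.5 = 0.02
```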

After the probabilities are built, profanity is removed from the predicted words (not generally from the corpus) using this list.
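
A minimal sketch of filtering predictions against such a list, assuming it is a plain text file with one word per line; the file name and example data are assumptions:

```r
bad_words   <- readLines("profanity_list.txt")           # one profane word per line
predictions <- data.frame(word = c("great", "damn", "good"),
                          prob = c(0.30, 0.20, 0.15))
# keep only predictions whose word is not on the profanity list
predictions[!(predictions$word %in% bad_words), ]
```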

App prediction approach

  • First, if 3 or more words are entered, the nextWord app searches for a known quadgram using the last 3 words of input
  • If no result then it tries to find trigrams using the final 2 words of input
  • If no result then it tries to find bigrams using the final word of input
  • If still no result, it displays the 3 most common words (a code sketch of this backoff follows the list)
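
A minimal sketch of that backoff order, assuming the n-gram tables are named lists keyed by the preceding words; all names here are hypothetical, not the app's source:

```r
predict_next <- function(words, quad_tbl, tri_tbl, bi_tbl, top_words) {
  n <- length(words)
  hit <- NULL
  # try quadgrams (last 3 words), then trigrams (last 2), then bigrams (last 1)
  if (n >= 3) hit <- quad_tbl[[paste(tail(words, 3), collapse = " ")]]
  if (is.null(hit) && n >= 2) hit <- tri_tbl[[paste(tail(words, 2), collapse = " ")]]
  if (is.null(hit) && n >= 1) hit <- bi_tbl[[tail(words, 1)]]
  if (is.null(hit)) hit <- top_words        # fall back to the 3 most common words
  hit
}

# e.g. predict_next(c("thanks", "for", "the"), quad_tbl, tri_tbl, bi_tbl,
#                   top_words = c("the", "to", "and"))
```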

The most likely words are displayed (up to 10 words). The image below shows the testing accuracy (prediction in the top 3 words) used to determine the lambda values used for interpolation. Final testing showed that more data could be used on shinyapps.io than expected, and accuracy improved to 23.1%.

Using the App

The nextWord app has a sidebar column on the left and three tabs in the main panel on the right. At the prompt “Enter text for prediction”, the user types or pastes a word or phrase to be used to predict the next word. The “Prediction” tab displays a table with up to ten predicted words, the estimated probability, the frequency of the word or phrase, and the “tactic” used to make the prediction.

The tab “Probability plot” shows the same list of predicted next words in a graphical form showing the estimated probability for each word. Words further to the right are more probable. The middle image on the left shows the plot for input “hello”.

The “About the app” tab explains a bit about the approach used in the app. This tab does not change.
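
A minimal Shiny layout sketch matching that description; the widget ids, output names and tab contents are assumptions, not the app's actual source:

```r
library(shiny)

ui <- fluidPage(
  sidebarLayout(
    sidebarPanel(
      textInput("phrase", "Enter text for prediction")
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Prediction",       tableOutput("pred_table")),
        tabPanel("Probability plot", plotOutput("prob_plot")),
        tabPanel("About the app",    helpText("Static description of the approach"))
      )
    )
  )
)
# server logic omitted; shinyApp(ui, server) would launch the app
```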

The app is at http://jrwalker.shinyapps.io/nextWord/

References and Acknowledgements

Daniel Jurafsky & James H. Martin, “Speech and Language Processing: An Introduction to Natural Language Processing”, Chapter 4

Daniel Jurafsky & Christopher Manning, “Natural Language Processing” video lectures, Coursera with Stanford University

Thorsten Brants et al., “Large Language Models in Machine Translation”, aclweb.org

I'd also like to thank people for their contributions in the discussion forum for the Data Science Capstone, in particular our mentor Ray Jones and classmate Mario Melchiori.