John Walker - Data Science Capstone
18 April 2016
The trigram probability for \( W_{n-2}\,W_{n-1}\,W_n \) is estimated as a chain of bigram probabilities: \( P(W_n|W_{n-2}W_{n-1}) \approx P(W_n|W_{n-1}) \cdot P(W_{n-1}|W_{n-2}) \).
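As a rough sketch of this chain-of-bigrams estimate, the R snippet below computes maximum-likelihood bigram probabilities from toy count tables and multiplies them along a phrase; the tables, counts, and function names are illustrative only, not the app's actual data.

```r
# Minimal sketch of the chain-of-bigrams estimate, assuming unigram and bigram
# counts have already been tabulated from the corpus (toy values shown here).
unigrams <- data.frame(w = c("i", "love", "new", "york"),
                       count = c(50, 10, 20, 15), stringsAsFactors = FALSE)
bigrams  <- data.frame(w1 = c("i", "love", "new"),
                       w2 = c("love", "new", "york"),
                       count = c(6, 3, 8), stringsAsFactors = FALSE)

# Maximum-likelihood bigram probability P(w2 | w1) = count(w1 w2) / count(w1)
bigram_prob <- function(a, b) {
  num <- bigrams$count[bigrams$w1 == a & bigrams$w2 == b]
  den <- unigrams$count[unigrams$w == a]
  if (length(num) == 0 || length(den) == 0) return(0)
  num / den
}

# Chain the bigram probabilities along a phrase of any length
chain_prob <- function(words) {
  prod(mapply(bigram_prob, head(words, -1), tail(words, -1)))
}

chain_prob(c("i", "love", "new"))   # trigram estimate: P(love|i) * P(new|love)
```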
For trigrams, accuracy is improved somewhat with interpolation, where \( P(trigram)_{interpolated} = \lambda_1 \cdot P(trigram) + \lambda_2 \cdot P(bigram) + \lambda_3 \cdot P(unigram) \) and \( \lambda_1 + \lambda_2 + \lambda_3 = 1 \). During testing with a reserved portion of the text, the interpolation values that gave the best results were \( \lambda_1 = 0.6 \), \( \lambda_2 = 0.3 \), \( \lambda_3 = 0.1 \).
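A small sketch of this interpolation follows, with the lambda values found in testing as defaults; the probability inputs are hypothetical placeholders for the separately estimated trigram, bigram, and unigram probabilities of a candidate word.

```r
# Interpolated trigram probability with the lambdas chosen during testing.
interpolate <- function(p_tri, p_bi, p_uni, lambda = c(0.6, 0.3, 0.1)) {
  stopifnot(isTRUE(all.equal(sum(lambda), 1)))   # lambdas must sum to 1
  lambda[1] * p_tri + lambda[2] * p_bi + lambda[3] * p_uni
}

interpolate(p_tri = 0.05, p_bi = 0.02, p_uni = 0.001)   # example candidate word
```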
Quadgram (4-gram) probabilities are also estimated as a chain of bigram probabilities: \( P(W_n|W_{n-3}W_{n-2}W_{n-1}) \approx P(W_n|W_{n-1}) \cdot P(W_{n-1}|W_{n-2}) \cdot P(W_{n-2}|W_{n-3}) \).
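The chain_prob sketch above extends directly to 4-grams; the phrase here is again illustrative.

```r
# A 4-gram estimate is just a longer chain of bigram probabilities
chain_prob(c("i", "love", "new", "york"))   # P(love|i) * P(new|love) * P(york|new)
```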
After the probabilities are built, profanity is removed from the predicted words (not from the corpus itself) using this list.
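The filtering step might look roughly like the sketch below; the banned-word vector is only a placeholder for the actual list the app uses.

```r
# Remove profanity from the predicted words only; the corpus is left intact.
banned <- c("badword1", "badword2")   # placeholder entries, not the real list

filter_predictions <- function(predictions) {
  predictions[!(tolower(predictions) %in% banned)]
}

filter_predictions(c("hello", "badword1", "world"))
# [1] "hello" "world"
```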
The most likely words are displayed (up to 10 words). The image below shows the testing accuracy (prediction in the top 3 words) used to determine the lambda values for interpolation. Final testing showed that more data could be loaded on shinyapps.io than expected, and accuracy improved to 23.1%.
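The top-3 accuracy measure can be sketched as below; predict_next() is a stub standing in for the app's real predictor, and the held-out phrases are made up.

```r
# "Prediction in top 3 words" accuracy on a reserved portion of the text.
predict_next <- function(phrase, n = 3) {
  # stub: the app would return the n highest-probability next words here
  head(c("the", "a", "to"), n)
}

test_set <- data.frame(phrase = c("thanks for", "one of the"),
                       truth  = c("the", "best"), stringsAsFactors = FALSE)

hits <- mapply(function(p, t) t %in% predict_next(p, n = 3),
               test_set$phrase, test_set$truth)
mean(hits)   # share of held-out cases with the true word in the top 3
```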
The nextWord app has a sidebar column on the left and three tabs in the main panel on the right. At the prompt “Enter text for prediction” the user types or pastes a word or phrase to be used to predict the next word. The “Prediction” tab displays a table with up to ten predicted words, the estimated probability, the frequency of the word or phrase, and the “tactic” used to make the prediction.
The “Probability plot” tab shows the same list of predicted next words in graphical form, with the estimated probability for each word; words further to the right are more probable. The middle image on the left shows the plot for the input “hello”.
The “About the app” tab explains a bit about the approach used in the app. This tab does not change.
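A minimal Shiny skeleton of the layout described above is sketched here, with the sidebar prompt and the three tabs; widget names are hypothetical and the prediction logic is omitted.

```r
library(shiny)

ui <- fluidPage(
  titlePanel("nextWord"),
  sidebarLayout(
    sidebarPanel(
      textInput("phrase", "Enter text for prediction")
    ),
    mainPanel(
      tabsetPanel(
        tabPanel("Prediction",       tableOutput("predTable")),
        tabPanel("Probability plot", plotOutput("predPlot")),
        tabPanel("About the app",    p("Static description of the approach"))
      )
    )
  )
)

server <- function(input, output) {
  # output$predTable and output$predPlot would be filled with the n-gram
  # prediction results for input$phrase; omitted in this sketch
}

# shinyApp(ui, server)
```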
The app is at http://jrwalker.shinyapps.io/nextWord/
Daniel Jurafsky & James H. Martin, “Speech and Language Processing: An Introduction to Natural Language Processing”, Chapter 4
Daniel Jurafsky & Christopher Manning, “Natural Language Processing” video lectures, Coursera with Stanford University
Thorsten Brants et al., “Large Language Models in Machine Translation”, aclweb.org
I'd also like to thank people for their contributions in the discussion forum for the Data Science Capstone, in particular our mentor Ray Jones and classmate Mario Melchiori.