Word Prediction

Vinh Hang
Oct 28, 2017

Introduction

The goal of this application is to predict the next word of any sentence. For now, the app only supports English. This is a familiar feature on almost any platform, such as Google Search and smartphones.

SwiftKey has provided the data, which contains over 3 million documents in English.

The app is a proof of concept, and thus we make a conscious choice of favoring accuracy over speed. However, one can decrease the sample size and get a faster model.

For more details on the code please visit https://github.com/vh42720/NLP_Coursera_Capstone

Cleaning Data

To prepare our data for training the model, we first need to clean it thoroughly. The principles are as follows:

  • Documents will be lowercased, since case is not important for prediction.
  • Similarly, stopwords will not be removed. However, bad words will be.
  • Word forms will stay the same ("go" and "goes" remain different).
  • Contractions will be expanded and punctuation removed ("I'm" becomes "I am").
  • Numbers will be removed.
  • Sparse words will be retained.

The packages tm and qdap help us quite nicely. However, one must consider the order of cleaning carefully (you cannot replace "I'm" after removing the apostrophe; "Im" would be left behind!).
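Below is a minimal sketch of such a cleaning order using tm and qdap; the function and variable names (clean_text, docs, bad_words) are illustrative, not the app's actual code.

library(tm)
library(qdap)

# Clean a character vector of documents in an order that keeps contractions safe.
clean_text <- function(docs, bad_words = character(0)) {
  docs <- replace_contraction(docs)    # expand "I'm" -> "I am" BEFORE touching punctuation
  docs <- tolower(docs)                # case carries no predictive value
  docs <- removePunctuation(docs)      # safe now that contractions are expanded
  docs <- removeNumbers(docs)          # drop digits
  docs <- removeWords(docs, bad_words) # drop profanity, keep stopwords
  stripWhitespace(docs)                # collapse the gaps left behind
}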

N-grams model

In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. This application will use words as the smallest unit.

An n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram"; size 3 is a "trigram".

The underlying intuition is that the next word can mostly be predicted from the 3, 2, or 1 words before it. Thus, prediction is no more than assigning probabilities to candidate words.

[1] "'I love this application so much' in unigrams will be:"
[1] "so much"          "application so"   "love this"       
[4] "this application" "I love"          

Algorithm

The algorithm works as follows. If the input is longer than three words, we only consider the last three. We then match these last three words against our 4-gram dictionary and extract the predicted word based on probability (frequency).

If there is no match in the 4-gram dictionary, the model considers the last two words of the input and matches them against the 3-gram dictionary, and so on. This is the Katz back-off method. It lets the model predict most phrases it has never seen in our sample sets.

[1] "4-grams Dictionary contains 'Sam reads a book'"
[1] "Without smoothing, this phrase cannot predict the next word for 'Linda reads a' since it never happens before."
[1] "Thus, we back off 'reads a' which will match with 'read a book' in 3 grams dictionary"

Speed or Power

As one might predict, an n-gram model trades speed for predictive power. The bigger the dictionary, the more it can predict, but the slower the lookup becomes. The files themselves are insignificant with proper cleaning (the 2-, 3-, and 4-gram dictionaries with 210K observations take up less than 100 MB).

Nevertheless, without more sophisticated methods, we cannot keep increasing our sample size. Searching for context using cluster analysis would decrease lookup time significantly. However, that will be for another time!

Final Notes

  • Using the ngram package cuts the processing time massively.
  • stringr/stringi are also much faster than base grep/grepl.

Finally, give the app a try: https://vh42720.shinyapps.io/Predict_Word/