Word prediction in R

B. Schwenk
2-1-2019

Introduction

For the Data Science Capstone project I created a word prediction app. In general, the following steps were taken to reach the end result:

  • Research into Natural Language Processing (NLP) in R
  • Data exploration of the Twitter, Blogs and News corpora
  • Extraction and data preparation to build a model
  • Creation of a prediction algorithm to predict the next word in a sentence
  • Checking the quality of the results based on a test set and algorithm statistics

The app is easy to use. Just enter a sentence and click the “predict” button. The algorithm automatically predicts the most likely next word. It also shows a few alternative words and some quality metrics.

Data Exploration

The first step was data exploration of the very large Twitter, Blogs and News corpora. A few results:

  • The corpora are too large to load and process as a whole. A smaller random sample was taken to get around this (see the sampling sketch after this list). By the law of large numbers, this sample represents the full corpora well enough.
  • The Twitter and Blogs corpora were much larger than the News corpus. I chose to sample a larger portion of the News corpus to even things out.
  • The most common words are stop words (the, and, etc.). These were left in, because they are often exactly the next word to predict.
  • Abusive words were removed from the dataset to prevent predicting them. This was done by deleting all N-grams containing an abusive word.
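
Roughly, the sampling step can be sketched as follows. This is a minimal sketch, assuming the raw corpora live in files named en_US.twitter.txt, en_US.blogs.txt and en_US.news.txt; the sampling fractions shown are illustrative, not the exact values used.

    set.seed(1234)

    sample_corpus <- function(path, fraction) {
      con <- file(path, open = "r", encoding = "UTF-8")
      lines <- readLines(con, skipNul = TRUE)
      close(con)
      # keep a random fraction of the lines instead of the full corpus
      lines[rbinom(length(lines), size = 1, prob = fraction) == 1]
    }

    twitter <- sample_corpus("en_US.twitter.txt", 0.05)
    blogs   <- sample_corpus("en_US.blogs.txt",   0.05)
    news    <- sample_corpus("en_US.news.txt",    0.15)  # larger share for the smaller News corpus
    sample_lines <- c(twitter, blogs, news)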

Extraction and data preparation

The main step in data extraction is the creation of N-grams with the R tm package. After basic text cleaning and preparation, I focused on extracting 2-, 3-, 4- and 5-grams. Note that unigrams are less useful for prediction; they are only used to break ties in probability or to handle words that do not appear in the higher-order tables.
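
The report only names the tm package; the sketch below also uses RWeka's NGramTokenizer for the actual N-gram tokenization, which is one common way to do this with tm. It reuses the sample_lines object from the sampling sketch above.

    library(tm)
    library(RWeka)

    corpus <- VCorpus(VectorSource(sample_lines))
    corpus <- tm_map(corpus, content_transformer(tolower))
    corpus <- tm_map(corpus, removePunctuation)
    corpus <- tm_map(corpus, removeNumbers)
    corpus <- tm_map(corpus, stripWhitespace)

    # tokenize into trigrams; the same pattern works for 2-, 4- and 5-grams
    trigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
    tdm <- TermDocumentMatrix(corpus, control = list(tokenize = trigram_tokenizer))
    trigram_counts <- sort(slam::row_sums(tdm), decreasing = TRUE)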

The 3-, 4- and 5-grams proved most useful for finding next words that really fit the sentence. However, the set of stored N-grams is limited to strike a good balance between speed and prediction performance, so not every sentence has a 3-, 4- or 5-gram match. The bigrams are needed to get around this and also have a higher hit rate on rare words. The stored counts are roughly 300k 3-, 4- and 5-grams, 111k bigrams and 28k unigrams.

I chose to gather all N-grams, keep only the most important ones and save these to a file. For each N-gram the file also contains, for example, the next word to predict, counts and probabilities. This saves considerable execution time when running the model. Downloading the dataset when the app loads takes about 20 seconds, which I consider acceptable given the resulting prediction performance.
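
A minimal sketch of how such a lookup table could be built with data.table, continuing from the trigram counts above; the column names, the cut-off of five continuations per prefix and the file name trigram_table.rds are illustrative choices, not necessarily the ones used in the app.

    library(data.table)

    trigrams <- data.table(ngram = names(trigram_counts),
                           count = as.integer(trigram_counts))
    trigrams[, prefix    := sub("\\s+\\S+$", "", ngram)]      # everything but the last word
    trigrams[, next_word := sub("^.*\\s+",  "", ngram)]       # the last word
    trigrams[, prob      := count / sum(count), by = prefix]  # probability within each prefix

    # keep only the most likely continuations per prefix and save once,
    # so the app only has to read this file at start-up
    top_trigrams <- trigrams[order(-prob), head(.SD, 5), by = prefix]
    saveRDS(top_trigrams, "trigram_table.rds")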

A random sample (about 10%) was set aside to serve as a test set.
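
For completeness, the hold-out split can be sketched as follows, again on the sample_lines object from above; the 10% fraction is from the text, the seed is illustrative.

    set.seed(42)
    test_idx  <- sample(length(sample_lines), size = round(0.1 * length(sample_lines)))
    test_set  <- sample_lines[test_idx]
    train_set <- sample_lines[-test_idx]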

Algorithm and quality

The app uses the N-grams for predicting the next word.

  • I adjusted the probabilities so that higher-order N-grams are more likely to be used for the next-word prediction. This gave better results, with more natural-sounding sentences.
  • A back-off model was used to get even better results: the bigram model is only used if the higher-order N-grams do not give a good enough result (a sketch of this back-off lookup follows this list). This prevents very frequent stop words in bigrams from getting too high a chance of being predicted.
  • In case of equal probability, the algorithm uses a mix of bigram and unigram information to make the best decision.
  • Add-k smoothing was used to account for unseen words, and unigrams were used to get predictions for unseen words. Checks with the test set revealed, for example, that words and N-grams that occur in the test set but not in the training set are rare. This strengthens the validity of the training set used for modelling the algorithm.
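
Below is a minimal sketch of the back-off lookup and the add-k formula, assuming per-order lookup tables with columns prefix, next_word and prob (as in the trigram table built earlier) stored in a list keyed by prefix length. The function names, the value of k and the unigram fallback word are illustrative, not the exact choices in the app.

    # last n words of a sentence, used as the lookup key
    last_words <- function(sentence, n) {
      words <- strsplit(tolower(sentence), "\\s+")[[1]]
      paste(tail(words, n), collapse = " ")
    }

    # add-k smoothed probability of a candidate continuation,
    # with k and the vocabulary size V as illustrative parameters
    add_k_prob <- function(count, prefix_total, V, k = 0.5) {
      (count + k) / (prefix_total + k * V)
    }

    # try the longest prefix first (the 5-gram table has prefix length 4)
    # and only back off to shorter prefixes, ending with the bigram table
    predict_next <- function(sentence, tables, orders = c(4, 3, 2, 1)) {
      for (n in orders) {
        key     <- last_words(sentence, n)
        tbl     <- tables[[as.character(n)]]
        matches <- tbl[tbl$prefix == key, ]
        if (nrow(matches) > 0) {
          return(matches$next_word[which.max(matches$prob)])
        }
      }
      "the"  # last-resort unigram fallback for completely unseen contexts
    }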