WORD PREDICTOR

Sonia Sharma
June 16, 2016

Overview of the project

Aim: Use data from the HC Corpora corpus, which consists of news, blog and Twitter text, to build a language model that predicts the next word in a sentence.

Here is an overview of the steps involved in building and implementing the model:

  • Clean the data by removing profane words
  • Sample the data into training and held-out sets of varying sizes (1%, 2%, 5%)
  • Tokenize the training set into unigrams, bigrams, trigrams and quadgrams
  • Use n-gram models with various backoff and interpolation techniques to build the prediction algorithm
  • Evaluate the model by comparing the accuracy rate on the held-out data
  • Make the model efficient and fast by pruning the n-grams
  • Build a Shiny app to showcase the model
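
The tokenization step above can be illustrated with a minimal sketch. It is written in Python purely for illustration and is not the code used in this project; the function names `tokenize`, `ngrams` and `count_ngrams`, the regular expression, and the two sample sentences are all assumptions made for the example.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens (letters and apostrophes only)."""
    return re.findall(r"[a-z']+", text.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple, in order."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def count_ngrams(sentences, max_n=4):
    """Count n-grams of order 1..max_n, processing one sentence at a time."""
    counts = {n: Counter() for n in range(1, max_n + 1)}
    for sentence in sentences:
        tokens = tokenize(sentence)
        for n in counts:
            counts[n].update(ngrams(tokens, n))
    return counts

# Hypothetical two-sentence "corpus", standing in for a training sample.
sample = ["I love New York", "I love ice cream"]
counts = count_ngrams(sample)
print(counts[2][("i", "love")])  # 2
print(len(counts[4]))            # 2 distinct quadgrams
```

Counting per sentence keeps n-grams from spanning sentence boundaries, which is one common choice; the source does not say how boundaries were handled.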

Our Prediction Model

Our prediction model is given below, where \( V \) is the vocabulary size, count() or \( N \) is the number of tokens, and \( \lambda \) is the weight.

How the algorithm works and its main features

  • The algorithm uses trigram, bigram and unigram models together with a combination of interpolation and backoff
  • Backoff means it uses a lower-order (n-1)-gram model when no information is available for an n-gram
  • Interpolation means use the information about the ‘lower order’ models all the time – not only when the counts of, say, a trigram are 0, but even when the trigram count is nonzero
  • \( \lambda = 0.0005 \) was chosen as the value that gives the smallest error rate on the held-out data
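
To make the idea of combining trigram, bigram and unigram evidence concrete, here is a minimal Python sketch of simple linear interpolation. It is not this project's exact formula: the weights `(0.6, 0.3, 0.1)`, the function names `build_counts` and `predict`, and the toy corpus are assumptions chosen for illustration, and the weights are unrelated to the \( \lambda = 0.0005 \) reported above. When a higher-order context has never been seen, its term contributes nothing and the ranking rests on the lower-order models, which is the backoff behaviour in its simplest form.

```python
from collections import Counter

def build_counts(sentences):
    """Count unigrams, bigrams and trigrams from whitespace-tokenized sentences."""
    uni, bi, tri = Counter(), Counter(), Counter()
    for s in sentences:
        t = s.lower().split()
        uni.update(t)
        bi.update(zip(t, t[1:]))
        tri.update(zip(t, t[1:], t[2:]))
    return uni, bi, tri

def predict(context, uni, bi, tri, weights=(0.6, 0.3, 0.1), top=3):
    """Rank candidate next words by a weighted sum of trigram, bigram and unigram estimates."""
    words = context.lower().split()
    u = words[-2] if len(words) >= 2 else None
    v = words[-1] if words else None
    l3, l2, l1 = weights  # hypothetical weights, not tuned on any data
    total = sum(uni.values())
    scores = {}
    for w in uni:
        p1 = uni[w] / total
        # An unseen context leaves the higher-order estimate at zero.
        p2 = bi[(v, w)] / uni[v] if uni[v] else 0.0
        p3 = tri[(u, v, w)] / bi[(u, v)] if bi[(u, v)] else 0.0
        scores[w] = l3 * p3 + l2 * p2 + l1 * p1
    # Highest score first; ties are broken alphabetically.
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]

corpus = ["i love new york", "i love ice cream", "you love ice cream"]
uni, bi, tri = build_counts(corpus)
print(predict("i love", uni, bi, tri, top=2))   # ['ice', 'new']
print(predict("zzz qqq", uni, bi, tri, top=1))  # ['love'] (unigram evidence only)
```

In the first call both "ice" and "new" follow "i love" once, so the trigram term ties and the bigram and unigram terms decide the order. This is the sense in which interpolation uses lower-order information even when the trigram count is nonzero.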

Evaluation

  • Our training set was \( 5\% \) of the data, comprising all three sources: news, blogs, and Twitter (due to a lack of computational power we could not use a bigger sample)
  • We used a \( 1 \% \) held-out data set from our training set to evaluate the accuracy of our model.
  • We got an accuracy of approximately \( 47 \% \) with our model.
  • The accuracy of this model was slightly higher than that of other methods such as the Katz backoff and Stupid Backoff models, which were both close to \( 46 \% \).
  • We pruned the n-grams with frequency less than 2 to reduce the size of the n-gram tables and improve the processing speed of the model.

Our model seems to perform fairly well even though it is built on only a \( 5\% \) sample of the data. Using a bigger sample, say \( 10\% \), \( 20\% \) or \( 30\% \), would likely improve the accuracy of the model further.
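
The pruning and accuracy steps described in this section can be sketched as follows. This is an illustrative Python example, not the evaluation code behind the figures above: it assumes accuracy means the fraction of held-out positions where the single top prediction equals the actual next word, and the helper names (`prune`, `make_bigram_predictor`, `accuracy`), the bigram-only predictor and the toy data are all hypothetical.

```python
from collections import Counter

def prune(counts, min_count=2):
    """Keep only the n-grams seen at least min_count times."""
    return Counter({g: c for g, c in counts.items() if c >= min_count})

def make_bigram_predictor(bigram_counts):
    """Build a predictor that returns the most frequent follower of the last word, or None."""
    best = {}
    for (prev, nxt), c in bigram_counts.items():
        if prev not in best or c > best[prev][1]:
            best[prev] = (nxt, c)
    return lambda context: best.get(context[-1], (None, 0))[0]

def accuracy(predict_fn, sentences):
    """Fraction of positions where predict_fn(preceding words) equals the actual next word."""
    hits = total = 0
    for s in sentences:
        t = s.lower().split()
        for i in range(1, len(t)):
            total += 1
            if predict_fn(t[:i]) == t[i]:
                hits += 1
    return hits / total if total else 0.0

train = ["i love ice cream", "i love ice cream", "i love new york"]
bigrams = Counter()
for s in train:
    t = s.split()
    bigrams.update(zip(t, t[1:]))

pruned = prune(bigrams, min_count=2)
held_out = ["i love ice cream", "you love new york"]

print(len(bigrams), len(pruned))                            # 5 3
print(accuracy(make_bigram_predictor(pruned), held_out))    # 0.5
print(accuracy(make_bigram_predictor(bigrams), held_out))   # 4 of 6 correct
```

On this toy data, pruning removes two of the five bigrams and costs one correct prediction, which illustrates the trade-off between table size and accuracy that pruning involves.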

Getting to know the app

The Shiny application is simple to use, with instructions provided on the webpage. The user types a string of English words and, as soon as they stop typing, the next-word predictions appear in the tab below it, as can be seen in the snapshot below.

THANK YOU!