Next word prediction

Veronika Nuretdinova
26.04.2015

Prediction of the word following a phrase

Word prediction: program explanation

  • The aim of the program is to predict the most probable word following the input phrase
  • To find the most probable word, I estimated which words most frequently follow the same preceding words in the training texts. Thus, I built the algorithm based on n-grams
  • The lexicon of the program is built from 30K random lines of the blog, Twitter and news files. These have provided a dictionary of ~25K words
  • These 30K lines provided ~400K 2-grams, ~850K 3-grams, ~950K 4-grams and ~1,000K 5-grams. The number of distinct n-grams nearly plateaus after 4-grams, so the quality of prediction improves up to 4-grams but doesn't improve significantly when using 5-grams

Building n-gram dictionaries

The following operations were performed on the source files to build the dictionaries:

  • profanity has been removed
  • punctuation and other symbols have been removed; ends of sentences are tagged STOP, since for certain phrases the end of the sentence might be the most probable continuation
  • words which appear only once in the text are tagged RARE
  • I have tagged numbers as NUM, so e.g. the phrase “I bought 5 cakes” turns into “I bought NUM cakes”
  • I have tagged words whose part of speech (POS) could be identified by their suffix (e.g. “ly” identifies an ADVERB) and added the tagged variants as additional n-grams; a minimal sketch of these steps follows
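
A minimal sketch of this preprocessing in R, assuming each source line is a plain character string; the function name and the exact regular expressions are illustrative, not the program's actual code:

clean_line <- function(line) {
  line <- tolower(line)
  line <- gsub("[.!?]+", " STOP ", line)   # tag ends of sentences
  line <- gsub("[0-9]+", " NUM ", line)    # tag numbers
  line <- gsub("[^a-zA-Z ]", " ", line)    # remove remaining symbols
  gsub(" +", " ", trimws(line))            # collapse whitespace
}

clean_line("I bought 5 cakes.")
[1] "i bought NUM cakes STOP"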

Building n-gram dictionaries (part 2)

  • Using the tm and RWeka packages I have built 2-gram, 3-gram and 4-gram tables and kept only the n-grams providing the most probable next word. E.g. the 3-gram “are you going” is followed by 3 possible words, of which “to” is the most frequent; therefore, the 4-gram “are you going to” is taken into the dictionary:
      term4 term3 term2    term1 count
197     are   you going       to    10
1408    are   you going     STOP     4
83507   are   you going barefoot     1
  • I do the same operations for 3-grams and 2-grams. In the end I have a dictionary of ~700K lines; a sketch of the pruning step follows
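
A sketch of this pruning step in R, assuming the cleaned text is in a character vector lines; the NGramTokenizer call comes from RWeka, while the column names follow the table above and the rest of the plumbing is illustrative:

library(RWeka)

grams4 <- NGramTokenizer(lines, Weka_control(min = 4, max = 4))
tab    <- as.data.frame(table(grams4), stringsAsFactors = FALSE)
parts  <- do.call(rbind, strsplit(tab$grams4, " "))
df     <- data.frame(term4 = parts[, 1], term3 = parts[, 2],
                     term2 = parts[, 3], term1 = parts[, 4],
                     count = tab$Freq, stringsAsFactors = FALSE)

# for every 3-word prefix keep only the most frequent next word
df     <- df[order(-df$count), ]
pruned <- df[!duplicated(df[, c("term4", "term3", "term2")]), ]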

Program algorithm

  • once the user enters a phrase, it is split into words. Since I have stopped at 4-grams, only the last 3 words are taken into account; if the phrase is 3 words or shorter, it is taken as it is
  • the corresponding 3-word prefix is searched in the dictionary and the most probable next word is returned
getword("Tell me when you are going")
[1] "to"
  • in case the exact 3-word match is not found, the program tries to find a match with the first word tagged RARE or with its corresponding part-of-speech tag
  • if the match is still not found, the program searches for the corresponding 2-gram, taking only the last 2 words, and the algorithm is repeated; a sketch of this back-off lookup follows
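
A sketch of this back-off lookup in R, assuming the n-gram tables have been flattened into one data frame dict with a prefix column (e.g. "are you going") and a term1 column holding the predicted word; dict and the reuse of clean_line are assumptions, while getword matches the call shown above:

getword <- function(phrase) {
  w <- tail(strsplit(clean_line(phrase), " ")[[1]], 3)  # at most the last 3 words
  while (length(w) > 0) {
    # try the exact prefix, then the prefix with its first word tagged RARE
    # (a POS-tagged variant of the first word could be tried the same way)
    for (key in c(paste(w, collapse = " "),
                  paste(c("RARE", w[-1]), collapse = " "))) {
      hit <- dict$term1[dict$prefix == key]
      if (length(hit) > 0) return(hit[1])
    }
    w <- w[-1]   # no match: back off to the shorter n-gram
  }
  NA_character_
}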

Estimation of the program performance

This is how the program performs on 20 quiz questions:

  • share of correctly predicted words: 15% (3 out of 20)
  • average calculation time: 0.245 msec
  • one way to improve prediction accuracy is to increase the volume of training data above 30K lines, so that the most probable words are estimated from a larger sample. This, however, would require substantial time for dictionary calculation
  • another way is POS tagging with the openNLP package, so that the next word is predicted from POS tags in addition to n-grams. However, this significantly slows the application