Word prediction by n-gram model

Parikshit Sanyal
16 Nov 18

Introduction

This prediction model takes a text string, a decay factor f (between 0 and 1) and a number top_n as input, and prints the top_n most likely words to follow the string.

  • Example input: predict(“i think”,f=1,top_n=5)
  • Output: “i” “its” “so” “that” “about”

Training data

  • The n-grams used for prediction were collected from a dataset of blogs, tweets and news. A set of 1-, 2-, 3-, 4- and 5-grams was generated from the training set.
  • A subset containing 25% of the data was used for training
  • All characters were converted to lowercase; numbers and punctuation were removed (a preprocessing sketch follows this list)
  • Stop words (such as 'that', 'the' and 'so') were preserved, because they are usually the commonest predictions
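
A minimal sketch of this preprocessing in base R is given below. It is illustrative only; the helper names clean_text and make_ngrams are assumptions, not the actual capstone code.

# Illustrative preprocessing: lowercase, strip numbers and punctuation,
# then build n-grams as space-separated strings (base R only)
clean_text <- function(lines) {
  lines <- tolower(lines)                 # lowercase everything
  lines <- gsub("[0-9]", "", lines)       # remove numbers
  lines <- gsub("[[:punct:]]", "", lines) # remove punctuation
  gsub("\\s+", " ", trimws(lines))        # collapse whitespace
}

make_ngrams <- function(line, n) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}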

Algorithm

Given a string, the function (sketched in the code after this list)

  • takes the last four words of the string (if there are fewer than four words, the last 3, 2 or 1 words respectively)
  • finds the commonest 5-grams beginning with those last four words, along with their frequencies
  • applies itself recursively to the last 3, 2 and 1 words of the string, in sequence
  • accumulates the frequencies of the commonest predicted words in a data frame
  • the parameter f (between 0 and 1) is the decay factor; for every recursive call, the word counts are multiplied by f × f, i.e. the relative importance of the predicted words diminishes as the function backs off to shorter and shorter portions of the string
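
The back-off scheme can be sketched as follows. This is an illustration under stated assumptions, not the actual capstone code: it presumes the n-gram tables have been collapsed into a single data frame ngrams with columns prefix (the leading words), word (the predicted word) and count (its frequency in the corpus).

# Illustrative decayed back-off over a data frame of n-gram counts
predict_sketch <- function(string, ngrams, f = 1, top_n = 5) {
  words  <- strsplit(tolower(string), "\\s+")[[1]]
  acc    <- data.frame(word = character(0), score = numeric(0),
                       stringsAsFactors = FALSE)
  weight <- 1
  # start with the last four words, then back off to 3, 2 and 1
  for (n in rev(seq_len(min(4, length(words))))) {
    prefix <- paste(tail(words, n), collapse = " ")
    hits   <- ngrams[ngrams$prefix == prefix, c("word", "count")]
    if (nrow(hits) > 0) {
      acc <- rbind(acc, data.frame(word  = hits$word,
                                   score = hits$count * weight,
                                   stringsAsFactors = FALSE))
    }
    weight <- weight * f * f   # decay contributions from shorter prefixes
  }
  if (nrow(acc) == 0) return(character(0))
  totals <- aggregate(score ~ word, data = acc, FUN = sum)
  head(totals$word[order(totals$score, decreasing = TRUE)], top_n)
}

# e.g. predict_sketch("i think", ngrams, f = 1, top_n = 5)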

Plotting probable words

Probable words following 'i think' with top_n = 5 and f = 1 (left); note that the relative frequencies of the predictions change as f changes (right, f = 0.5).

[Figure: predictions with f = 1 (left) and f = 0.5 (right)]

Profiling and conclusion

The following are the results of profiling the call predict(“i think”, top_n=1, f=1) on an i3 laptop with no GPU:

# read back the saved profile and list time spent in each function
x <- summaryRprof("~/core/code/R/coursera_R/capstone/profile")
x$by.self
          self.time self.pct total.time total.pct
"grepl"        0.14    53.85       0.14     53.85
"ls"           0.06    23.08       0.06     23.08
"tolower"      0.04    15.38       0.04     15.38
"c"            0.02     7.69       0.02      7.69

The program runs with only moderate memory usage (19.7 MB of saved environment data), which makes it suitable for smartphones. However, its predictive accuracy is yet to be tested in the real world.