Word prediction by n-gram model

Parikshit Sanyal
16 Nov 18

Introduction

This prediction model takes a text string, a decay factor f (between 0 and 1) and a number top_n as input, and prints the top_n most likely words to follow the string.

  • Example input: predict(“i think”,f=1,top_n=5)
  • Output: “i” “its” “so” “that” “about”

Training data

  • The n-grams used for prediction were collected from a dataset of blogs, tweets and news. A set of 1-, 2-, 3-, 4- and 5-grams was generated from the training set.
  • A subset containing 25% of the data was used for training
  • All characters were converted to lowercase; numbers and punctuation were removed (a preprocessing sketch follows this list)
  • Stop words (such as 'that', 'the' and 'so') were preserved, because they are usually the commonest predictions
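
A minimal sketch of this preprocessing in base R is given below. It is illustrative only; the helper names clean_text and make_ngrams are assumptions, not the actual capstone code.

# Illustrative preprocessing: lowercase, strip numbers and punctuation,
# then build n-grams as space-separated strings (base R only)
clean_text <- function(lines) {
  lines <- tolower(lines)                 # lowercase everything
  lines <- gsub("[0-9]", "", lines)       # remove numbers
  lines <- gsub("[[:punct:]]", "", lines) # remove punctuation
  gsub("\\s+", " ", trimws(lines))        # collapse whitespace
}

make_ngrams <- function(line, n) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}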

Algorithm

Given a string, the function (sketched in the code after this list)

  • takes the last four words of the string (if there are fewer than four words, the last 3, 2 or 1 words respectively)
  • finds the commonest 5-grams beginning with those last four words, along with their frequencies
  • applies itself recursively to the last 3, 2 and 1 words of the string, in sequence
  • accumulates the frequencies of the commonest predicted words in a data frame
  • the parameter f (between 0 and 1) is the decay factor; for every recursive call, the word counts are multiplied by f × f, i.e. the relative importance of the predicted words diminishes as the function backs off to shorter and shorter portions of the string
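
The back-off scheme can be sketched as follows. This is an illustration under stated assumptions, not the actual capstone code: it presumes the n-gram tables have been collapsed into a single data frame ngrams with columns prefix (the leading words), word (the predicted word) and count (its frequency in the corpus).

# Illustrative decayed back-off over a data frame of n-gram counts
predict_sketch <- function(string, ngrams, f = 1, top_n = 5) {
  words  <- strsplit(tolower(string), "\\s+")[[1]]
  acc    <- data.frame(word = character(0), score = numeric(0),
                       stringsAsFactors = FALSE)
  weight <- 1
  # start with the last four words, then back off to 3, 2 and 1
  for (n in rev(seq_len(min(4, length(words))))) {
    prefix <- paste(tail(words, n), collapse = " ")
    hits   <- ngrams[ngrams$prefix == prefix, c("word", "count")]
    if (nrow(hits) > 0) {
      acc <- rbind(acc, data.frame(word  = hits$word,
                                   score = hits$count * weight,
                                   stringsAsFactors = FALSE))
    }
    weight <- weight * f * f   # decay contributions from shorter prefixes
  }
  if (nrow(acc) == 0) return(character(0))
  totals <- aggregate(score ~ word, data = acc, FUN = sum)
  head(totals$word[order(totals$score, decreasing = TRUE)], top_n)
}

# e.g. predict_sketch("i think", ngrams, f = 1, top_n = 5)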

Plotting probable words

Probable words following 'i think' with top_n = 5 and f = 1 (left); note that the relative frequencies of the predictions change as f changes (right, f = 0.5).

[Figure: predictions with f = 1 (left) and f = 0.5 (right)]

Profiling and conclusion

The following are the results of profiling the call predict(“i think”, top_n=1, f=1) on an i3 laptop with no GPU:

# read back the saved profile and list time spent in each function
x <- summaryRprof("~/core/code/R/coursera_R/capstone/profile")
x$by.self
          self.time self.pct total.time total.pct
"grepl"        0.14    53.85       0.14     53.85
"ls"           0.06    23.08       0.06     23.08
"tolower"      0.04    15.38       0.04     15.38
"c"            0.02     7.69       0.02      7.69

The program runs with only moderate memory usage (19.7 MB of saved environment data), which makes it suitable for smartphones. However, its predictive accuracy is yet to be tested in the real world.