Word prediction with an n-gram model

Parikshit Sanyal
16 Nov 18

Introduction

This prediction model takes a text string, a decay factor f (between 0 and 1) and an integer top_n as input, and prints the top_n most probable words following the string.

  • Example input: predict("i think", f = 1, top_n = 5)
  • Output: "i" "its" "so" "that" "about"

Training data

  • The n-grams used for prediction were collected from a dataset of blogs, tweets and news articles; 1-, 2-, 3-, 4- and 5-grams were generated from the training set.
  • A subset containing 25% of the data was used for training.
  • All characters were converted to lowercase; numbers and punctuation were removed (see the sketch after this list).
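A rough sketch of this preprocessing in R is given below. The file names, the sampling scheme and the n-gram builder are illustrative assumptions, not the exact code used for the project.

    # Assumed corpus files; the original file names are not specified here
    files <- c("en_US.blogs.txt", "en_US.twitter.txt", "en_US.news.txt")

    set.seed(1234)                                 # reproducible 25% sample
    corpus <- unlist(lapply(files, readLines, warn = FALSE))
    corpus <- sample(corpus, length(corpus) %/% 4)

    corpus <- tolower(corpus)                      # lowercase everything
    corpus <- gsub("[0-9]", "", corpus)            # remove numbers
    corpus <- gsub("[[:punct:]]", " ", corpus)     # remove punctuation
    corpus <- gsub("\\s+", " ", trimws(corpus))    # collapse stray whitespace

    # Build n-gram frequency tables freq1 .. freq5 from the cleaned lines
    words <- strsplit(corpus, " ")
    ngrams <- function(w, n) {
      if (length(w) < n) return(character(0))
      sapply(seq_len(length(w) - n + 1),
             function(i) paste(w[i:(i + n - 1)], collapse = " "))
    }
    for (n in 1:5) {
      assign(paste0("freq", n),
             sort(table(unlist(lapply(words, ngrams, n = n))), decreasing = TRUE))
    }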

Algorithm

Given a string, the function

  • takes the last four words of the string
  • finds the commonest 5-grams beginning with those words, together with their frequencies
  • applies itself recursively to the last 3, 2 and 1 words of the string, in sequence
  • accumulates the frequencies of the commonest predicted words in a dataframe
  • multiplies the word counts by f × f at every recursive call, where f (between 0 and 1) is the decay factor; the relative importance of words predicted from shorter contexts thus diminishes (a sketch of this back-off scheme follows the list)
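A minimal sketch of this recursive back-off scheme, assuming the frequency tables freq2 .. freq5 built above (n-gram strings mapped to counts); the function is named predict_words here so as not to mask R's built-in predict() generic, and the details are illustrative rather than the exact implementation.

    predict_words <- function(string, f = 1, top_n = 5) {
      words  <- strsplit(tolower(string), "\\s+")[[1]]
      scores <- numeric(0)             # accumulated scores, named by predicted word
      weight <- 1                      # current decay weight
      for (n in 4:1) {                 # back off: 4-word context, then 3, 2, 1
        if (length(words) >= n) {
          context <- paste(tail(words, n), collapse = " ")
          tbl  <- get(paste0("freq", n + 1))   # the (n+1)-gram frequency table
          hits <- tbl[startsWith(names(tbl), paste0(context, " "))]
          for (i in seq_along(hits)) {
            w   <- sub(".* ", "", names(hits)[i])  # last word of the matching n-gram
            old <- if (w %in% names(scores)) scores[[w]] else 0
            scores[w] <- old + weight * hits[[i]]
          }
        }
        weight <- weight * f * f       # counts decay by f x f at each back-off step
      }
      names(head(sort(scores, decreasing = TRUE), top_n))
    }

    predict_words("i think", f = 1, top_n = 5)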

Plotting probable words

[Figure: bar plot of probable words following 'i think'; axis label: Predictions]
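The predictions can be visualised with a simple bar plot. The sketch below uses illustrative scores (not actual model output) for the five words predicted after "i think".

    # Illustrative scores only; real values would come from the model's
    # accumulated word counts for the context "i think"
    top <- c(i = 310, its = 140, so = 120, that = 95, about = 80)
    barplot(top,
            main = "Probable words following 'i think'",
            xlab = "Predictions", ylab = "Weighted frequency")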

Conclusion

The program runs with only moderate memory usage (19.7 MB of saved environment data), which makes it potentially suitable for smartphones. However, its predictive accuracy has yet to be tested in the real world.