Text predictor

Gouthami Senthamaraikkannan

July 19, 2016

Introduction

A text prediction app is presented here. It was built by exploring blogs, news articles, and tweets written in American English. The following is a snapshot of the “Results” and “Documentation” tabs in the app.

[Screenshots: the app's “Results” and “Documentation” tabs]

Summary of data sizes

The statistics of the data sets are summarized below to convey the scale of the model.

From all the blogs, news articles, and tweets published in English, a small subset was selected for model building.

From this subset, 2-grams and 3-grams totaling about 2 GB were extracted.

The n-grams occurring more than twice were then saved into look-up tables totaling about 50 MB; a sketch of this pipeline appears below.
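The build pipeline can be sketched in a few lines of Python. This is only an illustration under assumed details (the corpus path, the tokenizer, and the pruning threshold are hypothetical), not the app's actual build code.

```python
from collections import Counter
import re

CORPUS_PATH = "corpus_sample.txt"  # hypothetical path to the sampled corpus
MIN_COUNT = 3                      # keep only n-grams occurring more than twice

def tokenize(line):
    """Lowercase a line and split it into word tokens."""
    return re.findall(r"[a-z']+", line.lower())

def count_ngrams(path, n):
    """Count all n-grams of length n in the corpus."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            words = tokenize(line)
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

unigram_counts = count_ngrams(CORPUS_PATH, 1)
bigram_counts = count_ngrams(CORPUS_PATH, 2)
trigram_counts = count_ngrams(CORPUS_PATH, 3)

def build_lookup(ngram_counts, prior_counts):
    """Map each (n-1)-gram prior to candidate next words, scored by
    count(n-gram) / count(prior), pruning rare n-grams and sorting
    candidates by descending score."""
    lookup = {}
    for gram, count in ngram_counts.items():
        if count < MIN_COUNT:
            continue
        prior, word = gram[:-1], gram[-1]
        lookup.setdefault(prior, []).append((word, count / prior_counts[prior]))
    for candidates in lookup.values():
        candidates.sort(key=lambda wp: wp[1], reverse=True)
    return lookup

bigram_lookup = build_lookup(bigram_counts, unigram_counts)
trigram_lookup = build_lookup(trigram_counts, bigram_counts)
```

Note that the priors are scored against the unpruned lower-order counts, so every surviving n-gram's prior is guaranteed to be present.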

Prediction algorithm

Each n-gram was split into its leading (n-1)-gram, which is the prior, and its \(n^{th}\) word, which is the posterior. The probability of the posterior was computed as

\[ \text{Probability} = \frac{\text{frequency of occurrence of the } n\text{-gram}}{\text{frequency of occurrence of the } (n-1)\text{-gram}} \]
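For instance, with purely hypothetical counts: if the 3-gram “one of the” occurs 300 times and its prior 2-gram “one of” occurs 500 times, the candidate word “the” is scored \(300/500 = 0.6\).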

The given text is matched against the stored (n-1)-grams (the priors), yielding the possible \(n^{th}\) words. These are returned to the user in decreasing order of computed probability.

NOTE: When the input matches none of the available 2-grams or 3-grams, the predictor falls back to the most commonly occurring words; a sketch of this lookup-with-backoff appears below.
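The lookup-with-backoff step can be illustrated with a minimal, self-contained sketch. The function name, the tiny hand-made tables, and the scores here are assumptions for illustration; the real app would consult its 50 MB look-up tables instead.

```python
def predict(words, trigram_lookup, bigram_lookup, top_words, k=3):
    """Return up to k candidate next words for the given input words,
    backing off from 3-grams to 2-grams to the overall most common words."""
    if len(words) >= 2:
        prior = (words[-2], words[-1])
        if prior in trigram_lookup:
            return trigram_lookup[prior][:k]
    if len(words) >= 1:
        prior = (words[-1],)
        if prior in bigram_lookup:
            return bigram_lookup[prior][:k]
    return top_words[:k]

# Toy tables: prior -> candidates sorted by descending probability.
trigram_lookup = {("one", "of"): [("the", 0.62), ("my", 0.08), ("them", 0.05)]}
bigram_lookup = {("of",): [("the", 0.31), ("a", 0.06), ("course", 0.04)]}
top_words = [("the", 0.05), ("to", 0.03), ("and", 0.03)]

print(predict(["one", "of"], trigram_lookup, bigram_lookup, top_words))
# -> [('the', 0.62), ('my', 0.08), ('them', 0.05)]
```

Checking the 3-gram table before the 2-gram table means the longest matching context always wins, which is what makes the predictions context-sensitive.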

Improvements to model

Conclusion

Thus a fairly effective model was built. Type in common phrases and check out the predictions!

NOTE: The prediction model takes a moment to run. Wait for the results until the “loading…” message disappears.

The following are a few results from the app:

[Screenshots: example predictions from the app]