Text predictor

Gouthami Senthamaraikkannan

July 19, 2016

Introduction

A text prediction app is presented here. It was built by exploring blogs, news articles, and tweets written in American English. The following is a snapshot of the “Results” and “Documentation” tabs in the app.

[Screenshots: the app's “Results” and “Documentation” tabs]

Summary of data sizes

The statistics of the data sets are summarized below to convey the scale of the model.

From all the blogs, news articles, and tweets published in English, a small subset was selected for model building.

From this subset, 2-grams and 3-grams totaling about 2 GB were extracted.

The n-grams occurring more than twice were then saved into look-up tables totaling about 50 MB; a sketch of this pipeline appears below.
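The build pipeline can be sketched in a few lines of Python. This is only an illustration under assumed details (the corpus path, the tokenizer, and the pruning threshold are hypothetical), not the app's actual build code.

```python
from collections import Counter
import re

CORPUS_PATH = "corpus_sample.txt"  # hypothetical path to the sampled corpus
MIN_COUNT = 3                      # keep only n-grams occurring more than twice

def tokenize(line):
    """Lowercase a line and split it into word tokens."""
    return re.findall(r"[a-z']+", line.lower())

def count_ngrams(path, n):
    """Count all n-grams of length n in the corpus."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            words = tokenize(line)
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    return counts

unigram_counts = count_ngrams(CORPUS_PATH, 1)
bigram_counts = count_ngrams(CORPUS_PATH, 2)
trigram_counts = count_ngrams(CORPUS_PATH, 3)

def build_lookup(ngram_counts, prior_counts):
    """Map each (n-1)-gram prior to candidate next words, scored by
    count(n-gram) / count(prior), pruning rare n-grams and sorting
    candidates by descending score."""
    lookup = {}
    for gram, count in ngram_counts.items():
        if count < MIN_COUNT:
            continue
        prior, word = gram[:-1], gram[-1]
        lookup.setdefault(prior, []).append((word, count / prior_counts[prior]))
    for candidates in lookup.values():
        candidates.sort(key=lambda wp: wp[1], reverse=True)
    return lookup

bigram_lookup = build_lookup(bigram_counts, unigram_counts)
trigram_lookup = build_lookup(trigram_counts, bigram_counts)
```

Note that the priors are scored against the unpruned lower-order counts, so every surviving n-gram's prior is guaranteed to be present.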

Prediction algorithm

Each n-gram was split into its leading (n-1)-gram, which is the prior, and its \(n^{th}\) word, which is the posterior. The probability of the posterior was computed as

\[ \text{Probability} = \frac{\text{frequency of occurrence of the } n\text{-gram}}{\text{frequency of occurrence of the } (n-1)\text{-gram}} \]
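For instance, with purely hypothetical counts: if the 3-gram “one of the” occurs 300 times and its prior 2-gram “one of” occurs 500 times, the candidate word “the” is scored \(300/500 = 0.6\).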

The given text is matched against the stored (n-1)-grams (the priors), yielding the possible \(n^{th}\) words. These are returned to the user in decreasing order of computed probability.

NOTE: When the input matches none of the available 2-grams or 3-grams, the predictor falls back to the most commonly occurring words; a sketch of this lookup-with-backoff appears below.
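The lookup-with-backoff step can be illustrated with a minimal, self-contained sketch. The function name, the tiny hand-made tables, and the scores here are assumptions for illustration; the real app would consult its 50 MB look-up tables instead.

```python
def predict(words, trigram_lookup, bigram_lookup, top_words, k=3):
    """Return up to k candidate next words for the given input words,
    backing off from 3-grams to 2-grams to the overall most common words."""
    if len(words) >= 2:
        prior = (words[-2], words[-1])
        if prior in trigram_lookup:
            return trigram_lookup[prior][:k]
    if len(words) >= 1:
        prior = (words[-1],)
        if prior in bigram_lookup:
            return bigram_lookup[prior][:k]
    return top_words[:k]

# Toy tables: prior -> candidates sorted by descending probability.
trigram_lookup = {("one", "of"): [("the", 0.62), ("my", 0.08), ("them", 0.05)]}
bigram_lookup = {("of",): [("the", 0.31), ("a", 0.06), ("course", 0.04)]}
top_words = [("the", 0.05), ("to", 0.03), ("and", 0.03)]

print(predict(["one", "of"], trigram_lookup, bigram_lookup, top_words))
# -> [('the', 0.62), ('my', 0.08), ('them', 0.05)]
```

Checking the 3-gram table before the 2-gram table means the longest matching context always wins, which is what makes the predictions context-sensitive.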

Improvements to model

Conclusion

Thus a fairly effective model was built. Type in common phrases and check out the predictions!

NOTE: The prediction model takes a moment to run. Wait for the results until the “loading…” message disappears.

The following are a few results from the app:

[Screenshots: example predictions from the app]