Word Prediction

Patrick J
February 26, 2017

Word Prediction System

This application takes a phrase as user input and attempts to predict the next word in the sequence. To accomplish this, the algorithm builds language models based on corpuses from Twitter, blogs, and news sites.

In designing this machine learning model, we faced 3 key performance challenges:

  • 1) how to read/analyze a large data corpus with limited time and processing power,
  • 2) how to maximize the accuracy of the predictions, and
  • 3) how to optimize performance for hosting on a public web server.

Screenshot

Performance Challenges

1) Analyze Large Data Corpus

It would take significant computing resources to extract 100% of the n-grams from every corpus. However, using regression and statistical inference, we can approximate the minimum number of randomly sampled texts required to obtain 95% of all n-grams. This allows us to achieve a nearly complete analysis with only a small fraction of the work.

n-gram Regression
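
As a rough illustration of this idea, the sketch below (in R, the language the application is assumed to be written in) measures bigram coverage empirically by sampling increasing fractions of a corpus. The function and variable names are illustrative rather than taken from the actual code, and the regression step described above is omitted for brevity.

    # Illustrative sketch: how many randomly sampled lines are needed to
    # capture ~95% of the unique 2-grams found in the full set of lines?
    # 'lines' is assumed to be a character vector of texts from one corpus.
    get_bigrams <- function(texts) {
      words <- strsplit(gsub("[^a-z' ]", " ", tolower(texts)), "\\s+")
      unlist(lapply(words, function(w) {
        w <- w[w != ""]
        if (length(w) < 2) return(character(0))
        paste(head(w, -1), tail(w, -1))      # adjacent word pairs
      }))
    }

    coverage_curve <- function(lines, fractions = seq(0.1, 1, by = 0.1)) {
      full <- unique(get_bigrams(lines))     # reference set of bigrams
      sapply(fractions, function(f) {
        samp <- sample(lines, size = ceiling(f * length(lines)))
        length(intersect(unique(get_bigrams(samp)), full)) / length(full)
      })
    }

    # Example usage:
    # fracs <- seq(0.1, 1, by = 0.1)
    # cov   <- coverage_curve(lines, fracs)
    # fracs[min(which(cov >= 0.95))]   # smallest fraction with >= 95% coverage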


2) Maximize Accuracy of Predictions

The algorithm uses an n-gram step-down model to find the best prediction: if it can't find a 4-gram match, it looks for a 3-gram, then a 2-gram, and finally a 1-gram match. When several matches are found, it selects the one with the highest occurrence frequency.
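
A minimal sketch of this step-down lookup, assuming the n-grams are stored in a single data frame with illustrative columns prefix, prediction, and freq (the actual data structures may differ):

    # Illustrative step-down lookup (R). 'ngrams' is assumed to be a data frame
    # with columns: prefix (the first n-1 words), prediction (the next word),
    # and freq (occurrence count), covering 4-grams down to 1-grams.
    predict_next <- function(phrase, ngrams) {
      words <- unlist(strsplit(tolower(phrase), "\\s+"))
      for (n in 3:1) {                       # 3-word, 2-word, then 1-word prefix
        if (length(words) < n) next
        prefix <- paste(tail(words, n), collapse = " ")
        hits <- ngrams[ngrams$prefix == prefix, ]
        if (nrow(hits) > 0) {
          # several matches: return the highest-frequency prediction
          return(as.character(hits$prediction[which.max(hits$freq)]))
        }
      }
      # Fall back to the most frequent single word (1-gram, empty prefix).
      unigrams <- ngrams[ngrams$prefix == "", ]
      if (nrow(unigrams) > 0)
        as.character(unigrams$prediction[which.max(unigrams$freq)])
      else NA
    }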

Once we have collected n-gram samples from the 3 corpuses, we cannot simply combine them without introducing sampling errors. Instead, we compile a set of random selections from all 3 corpuses and test each against all 3 n-gram sets. Based on these results, we build a probabilistic model that decides which of the 3 answers to return for any given input.

Probabilistic Choices
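
One simple way such a choice could be implemented is sketched below. It assumes each corpus has its own n-gram table plus a pre-estimated accuracy weight, and it reuses the predict_next sketch from above; the names and weighting scheme are illustrative, not the application's actual model.

    # Illustrative combination of the three corpus-specific models.
    # 'models' is a named list of n-gram tables; 'weights' holds each model's
    # estimated accuracy on held-out test texts (values are placeholders).
    predict_combined <- function(phrase, models, weights) {
      candidates <- sapply(names(models), function(src) {
        predict_next(phrase, models[[src]])  # step-down lookup per corpus
      })
      candidates <- candidates[!is.na(candidates)]
      if (length(candidates) == 0) return(NA)
      # Score each distinct candidate word by the summed weights of the
      # models that proposed it, then return the highest-scoring word.
      scores <- tapply(weights[names(candidates)], candidates, sum)
      names(which.max(scores))
    }

    # Example usage (illustrative weights):
    # models  <- list(twitter = twitter_ngrams, blogs = blog_ngrams, news = news_ngrams)
    # weights <- c(twitter = 0.68, blogs = 0.72, news = 0.74)
    # predict_combined("thank you for", models, weights)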

3) Optimize Performance

We run the algorithm in advance for every unique n-gram across all 3 corpuses. The results serve as a pre-compiled cache that can be saved as a text file and loaded at runtime. Using cached results provides significantly faster search times, since the heavy lifting has already been done.

The cache columns are also stored as factors rather than character strings, which further improves memory use and lookup efficiency.
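
A sketch of what this caching step might look like, with made-up placeholder rows and an illustrative file name:

    # Illustrative cache workflow: pre-compute the lookup table once, save it
    # as a text file, then reload it when the app starts.
    build_cache <- data.frame(
      prefix     = c("thank you for", "happy new", "at the end of"),
      prediction = c("the", "year", "the"),
      freq       = c(412, 389, 350)          # placeholder counts
    )
    write.csv(build_cache, "ngram_cache.csv", row.names = FALSE)

    # At runtime (e.g. when the Shiny app starts), reload the cache.
    # stringsAsFactors = TRUE stores the text columns as factors, which
    # uses less memory than character strings.
    cache <- read.csv("ngram_cache.csv", stringsAsFactors = TRUE)
    str(cache)   # prefix and prediction are now factors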

Benchmarks

We tested the algorithm against a sample of 750 random texts evenly distributed across all 3 corpuses. Based on this test, the current algorithm has an accuracy of 71% (+/- 3.3%, at a 95% confidence level).
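
For reference, the reported margin is roughly what a standard binomial confidence interval gives for 750 test cases:

    # Margin of error for an observed accuracy of 71% over 750 test texts,
    # using the normal approximation to the binomial at 95% confidence.
    p <- 0.71
    n <- 750
    moe <- qnorm(0.975) * sqrt(p * (1 - p) / n)
    round(moe, 3)   # ~0.032, in line with the reported +/- 3.3 percentage points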

Interestingly, a small portion of the error can be attributed to the fact that, while profanities are included in the underlying language model, the algorithm is blocked from suggesting them as predictions.

How To View

Please click the following link to view the application in action:

https://numbap.shinyapps.io/word_prediction_system/

Key Advantages

Compared to other approaches, this algorithm offers fast and efficient corpus analysis and n-gram indexing on the back-end. The cache also makes for fast and accurate searches.

The accuracy of this algorithm could easily be improved if the context of the writing were known (e.g., a Facebook post, personal email, business document, text message, news story, blog article, or Twitter post). If the preloaded cache file were replaced with a relational database such as MySQL, search performance would be faster and significantly more scalable.

In fact, in the live example linked above, the data visualization accounts for most of the processing overhead, not the search algorithm itself.

This algorithm is a very practical and efficient way to perform text prediction.