5th February 2018

Introduction

The idea behind this project is to develop an algorithm that predicts the next word based on the previous words entered by the user. This work is the result of the Capstone Project of the Data Science Specialization, run by Johns Hopkins University in collaboration with SwiftKey.

The following presentation briefly explains how the model works, describes its predictive performance and showcases the developed Shiny app and how to use it.

Three different English corpora have been used to feed the model: one of blog posts, one of news articles and one of tweets. The data has been split into a training set and a test set in an 80/20 proportion.
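
A minimal sketch of such a split, assuming the three corpora have already been read into a single character vector `all_lines` (the object name and seed are illustrative, not the project's actual code):

    set.seed(1234)                                        # illustrative seed
    train_idx <- sample(seq_along(all_lines),
                        size = floor(0.8 * length(all_lines)))
    training  <- all_lines[train_idx]                     # 80% for model building
    testing   <- all_lines[-train_idx]                    # 20% held out for evaluation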

The first part of the work, consisting of data cleaning and exploratory analysis, is covered in a separate report that can be found here

Natural language predictive model

The prediction model itself is based on the N-gram model, where the probability of the next word is approximated using only the N-1 preceding words. Specifically, a 4-gram probabilistic language model with the Stupid Backoff method has been implemented. Maximum likelihood estimation (MLE), i.e. the relative frequency of each candidate, is used to rank the next-word candidates.
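
For reference, the general Stupid Backoff score (Brants et al., 2007) ranks a candidate by its relative frequency (the MLE estimate) at the highest order whose context has been observed, applying a fixed penalty at each backoff step; the weight 0.4 comes from the original paper, and the exact value used in this project is not stated here:

$$
S(w_i \mid w_{i-3}\,w_{i-2}\,w_{i-1}) =
\begin{cases}
\dfrac{\mathrm{count}(w_{i-3}\,w_{i-2}\,w_{i-1}\,w_i)}{\mathrm{count}(w_{i-3}\,w_{i-2}\,w_{i-1})} & \text{if the 4-gram has been observed}\\[1ex]
\lambda \, S(w_i \mid w_{i-2}\,w_{i-1}) & \text{otherwise, with } \lambda \approx 0.4
\end{cases}
$$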

Basically, the model works as follows: first, it cleans the text string entered by the user. Once cleaned, it compares the last three words of the string against the stored 4-grams and returns the three most frequent words observed after that three-word prefix. If no match is obtained, the model backs off and repeats the lookup with a two-word and then a one-word prefix. If there is still no match, the model returns NA.
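
As an illustration, a minimal R sketch of this cleaning-and-backoff lookup could look like the following; the table structure (a list of frequency tables with `prefix`, `next_word` and `freq` columns) and all function names are assumptions, not the app's actual code.

    # Hypothetical lookup tables: tables[[3]] holds 4-grams keyed by a 3-word
    # prefix, tables[[2]] trigrams keyed by 2 words, tables[[1]] bigrams keyed
    # by 1 word. Columns: prefix, next_word, freq.

    clean_input <- function(text) {
      text <- tolower(text)
      text <- gsub("[^a-z' ]", " ", text)            # strip numbers and punctuation
      strsplit(trimws(gsub(" +", " ", text)), " ")[[1]]
    }

    predict_next <- function(text, tables, top_n = 3) {
      words <- clean_input(text)
      for (n in 3:1) {                               # back off from a 3-word to a 1-word prefix
        if (length(words) < n) next
        prefix <- paste(tail(words, n), collapse = " ")
        hits <- tables[[n]][tables[[n]]$prefix == prefix, ]
        if (nrow(hits) > 0) {
          hits <- hits[order(-hits$freq), ]          # rank by observed frequency (MLE)
          return(head(hits$next_word, top_n))
        }
      }
      NA_character_                                  # no match at any order
    }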

The entire training set (80% of the total data) is used to build the predictive model. However, very low-frequency N-grams (those appearing fewer than 3 times) have been removed, since they consume a significant amount of memory while adding little value to the predictions.
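
A tiny illustration of that pruning step (the table and column names below are hypothetical):

    # Hypothetical n-gram frequency table built from the training set
    ngram_counts <- data.frame(
      ngram = c("thanks for the follow", "at the end of", "one of a kind"),
      freq  = c(12L, 8L, 2L)
    )
    pruned <- ngram_counts[ngram_counts$freq >= 3, ]   # drops n-grams seen fewer than 3 times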

Accuracy of the model

In order to measure the accuracy of the model, the test set has been used as input to the predictive model to predict the upcoming word. Three thousand random phrases were selected from the test set and used as input text strings, and this procedure was repeated four times.

For each of these four runs, the accuracy has been measured: if one of the three words predicted by the model matches the actual upcoming word, the individual prediction is counted as correct. Considering all the simulations per run, the results are the following:

Run     Number of simulations   Accuracy
Run 1   3000                    31.8%
Run 2   3000                    32.1%
Run 3   3000                    32.7%
Run 4   3000                    30.4%


From these results we can conclude that the top-3 accuracy of the model is around 31-32%.
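
A minimal sketch of this top-3 accuracy check, reusing the hypothetical `clean_input()` and `predict_next()` helpers sketched above together with the held-out `testing` vector of phrases:

    top3_accuracy <- function(testing, tables, n_phrases = 3000) {
      phrases <- sample(testing, n_phrases)
      hits <- vapply(phrases, function(p) {
        words <- clean_input(p)
        if (length(words) < 2) return(NA)             # need some context plus a target word
        target  <- tail(words, 1)
        context <- paste(head(words, -1), collapse = " ")
        target %in% predict_next(context, tables)     # TRUE if the real word is in the top 3
      }, logical(1))
      mean(hits, na.rm = TRUE)                        # share of correct predictions
    }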

Shiny next word prediction tool

The developed app predicts the most probable words to follow the phrase entered by the user.

In the left pane of the app, the user enters the text string and chooses the model settings: basically, whether to enable profanity filtering and the spell checker.

By clicking the Predict! button, the app returns the three most probable upcoming words, ranked in decreasing order, together with the frequency of each match and the N-gram order it comes from (4, 3, 2 or NA).
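
A minimal sketch of how such an interface could be wired up in Shiny; the widget IDs, labels and the `predict_next()`/`tables` objects are illustrative, not the app's actual code:

    library(shiny)

    ui <- fluidPage(
      sidebarLayout(
        sidebarPanel(
          textInput("phrase", "Enter a phrase:"),
          checkboxInput("profanity", "Profanity filtering", value = TRUE),
          checkboxInput("spellcheck", "Spell checker", value = TRUE),
          actionButton("predict", "Predict!")
        ),
        mainPanel(tableOutput("predictions"))
      )
    )

    server <- function(input, output) {
      output$predictions <- renderTable({
        input$predict                                  # re-run when the button is clicked
        isolate(data.frame(word = predict_next(input$phrase, tables)))
      })
    }

    shinyApp(ui, server)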

The Shiny app can be accessed through the following link