Next Word Predictor Coursera Capstone Project

Beatriz Ortiz
Tue Aug 18, 2015

Overview

This project aims to develop a predictive text model that could be used, for example, with a keyboard on a mobile device to speed up writing, reduce the effort needed to type and suggest the correct word.

We will use a corpus, a large collection of text from HC Corpora gathered from publicly available web sources. Given this corpus, we’d like to estimate the parameters of a predictive language model.

Statistical NLP methods can be useful for capturing the human knowledge needed to make predictions and assess the likelihood of various hypotheses.

Predictive algorithms are based on N-gram language models, which provide a natural approach to the construction of sentence completion systems.

Preprocessing and Cleaning the Data

To build our predictive model we need to preprocess and clean the data.

A preliminary analysis allows us to become familiar with the Blog, News and Twitter files, to get their size and the number of lines and words, and to look into the content of the rows, identifying appropriate tokens.

I don't need to load the entire dataset, so I build two different corpora. Probabilities are extracted from a training corpus containing 60% of the data; the remaining 40% forms a test corpus used to run trials and evaluate the model.
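
A minimal sketch of this 60/40 split in R, assuming the three HC Corpora files have been downloaded to the working directory (the file names below are illustrative):

```r
set.seed(1234)

blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

all_lines <- c(blogs, news, twitter)

# Randomly assign 60% of the lines to the training corpus, 40% to the test corpus
in_train <- sample(length(all_lines), size = 0.6 * length(all_lines))
training <- all_lines[in_train]
testing  <- all_lines[-in_train]
```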

Using regular expressions and the tm package, I clean up and structure the input text for further analysis: removing strange characters, extra whitespace and punctuation, converting to lowercase, and optionally stemming the data. I don't remove English stopwords.
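
A sketch of this cleaning step with the tm package, assuming `training` holds the sampled lines; the exact transformations applied in the report may differ slightly:

```r
library(tm)

# Build a corpus from the training lines
corpus <- VCorpus(VectorSource(training))

# Strip strange (non-ASCII) characters, then lowercase, drop numbers and
# punctuation, and squeeze extra whitespace.  English stopwords are kept.
corpus <- tm_map(corpus, content_transformer(function(x) iconv(x, to = "ASCII", sub = " ")))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)

# Back to a plain character vector for the n-gram counting later on
training_clean <- sapply(corpus, as.character)
```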

Build a Next Word Predictive Model

The proposed solution consists of generalizing an N-gram model:

We build 4-, 3-, 2- and 1-gram models to predict the conditional probability of the next word, making a Markov chain approximation P(Wn | W1, …, Wn−1) ≈ P(Wn | Wn−N+1, …, Wn−1), where only the most recent N−1 tokens are relevant when predicting the next word.
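
To make this concrete, here is one way the four frequency tables could be built from the cleaned training lines (`training_clean` from the cleaning step); the simple whitespace tokenizer is an assumption, since the report does not name one:

```r
# Count n-grams of order n from a character vector of cleaned lines
count_ngrams <- function(lines, n) {
  tokens <- strsplit(lines, "\\s+")            # whitespace tokenizer (assumption)
  ngrams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    sapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "))
  }))
  sort(table(ngrams), decreasing = TRUE)       # named table: n-gram -> frequency
}

unigrams  <- count_ngrams(training_clean, 1)
bigrams   <- count_ngrams(training_clean, 2)
trigrams  <- count_ngrams(training_clean, 3)
quadgrams <- count_ngrams(training_clean, 4)
```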

We estimate these N-gram probabilities using maximum likelihood estimation (MLE), where the probabilities are given by relative frequencies: P(Wn | Wn−1) = C(Wn−1 Wn) / C(Wn−1).
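
For instance, with the unigram and bigram tables above, the MLE bigram probability is just a ratio of counts (a minimal sketch):

```r
# MLE bigram probability: P(Wn | Wn-1) = C(Wn-1 Wn) / C(Wn-1)
mle_bigram_prob <- function(prev, word) {
  num <- bigrams[paste(prev, word)]         # count of the bigram "prev word"
  den <- unigrams[prev]                     # count of the preceding word
  if (is.na(num) || is.na(den)) return(0)   # unseen bigram: zero under pure MLE
  as.numeric(num) / as.numeric(den)
}

mle_bigram_prob("of", "the")   # relative frequency of "the" following "of"
```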

It is vital that the probability of unseen objects not be estimated as zero. We use backoff combined with Good-Turing smoothing to estimate a probability for an N-gram that has not been seen before.
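
As an illustrative sketch of the backoff idea (not the report's exact implementation), a lookup can try the longest available context first and fall back to shorter histories when that context has not been seen, using the frequency tables built above:

```r
# Simple backoff lookup; no redistribution of probability mass is shown here.
predict_next <- function(phrase) {
  words  <- strsplit(tolower(phrase), "\\s+")[[1]]
  tables <- list(quadgrams, trigrams, bigrams)   # need 3, 2 and 1 words of context
  for (i in seq_along(tables)) {
    n_ctx <- 4 - i                               # context length for this table
    if (length(words) < n_ctx) next              # not enough history, back off
    context <- paste(tail(words, n_ctx), collapse = " ")
    hits <- tables[[i]][startsWith(names(tables[[i]]), paste0(context, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(tail(strsplit(best, " ")[[1]], 1))  # the word following the context
    }
  }
  names(unigrams)[1]                             # nothing matched: most frequent word
}

predict_next("one of the")
```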

Probabilities are best calculated by considering the Good-Turing estimator: r* = (r+1) Nr+1 / Nr, where Nr is the number of N-grams seen exactly r times.
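
A direct transcription of the estimator, applied here to the raw bigram counts as a sketch (in practice the adjusted counts are typically only used for small r):

```r
# Good-Turing adjusted count: r* = (r + 1) * N_{r+1} / N_r,
# where N_r is the number of distinct n-grams observed exactly r times.
good_turing <- function(r, counts) {
  freq_of_freq <- table(counts)            # N_r, indexed by r as a character name
  nr  <- freq_of_freq[as.character(r)]
  nr1 <- freq_of_freq[as.character(r + 1)]
  if (is.na(nr) || is.na(nr1)) return(r)   # keep the raw count when N_{r+1} is 0
  (r + 1) * as.numeric(nr1) / as.numeric(nr)
}

bigram_counts <- as.numeric(bigrams)
good_turing(1, bigram_counts)   # smoothed count for bigrams seen once
good_turing(2, bigram_counts)   # smoothed count for bigrams seen twice
```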

The Shiny App

I may have to sacrifice some accuracy to have a model fast enough to load into Shiny. We reduce the size of the files to a 5% sample (approx. 8 MB).

The Shiny app works in this way:

You can enter a word or a phrase in the input box.

Then, you get a prediction of a single word after a suitable delay.

You can also see the complete phrase in the last box.
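
A minimal sketch of how such an app can be wired up in Shiny, assuming a prediction function like the `predict_next()` sketched above; the actual app's layout and reactivity may differ:

```r
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Predictor"),
  textInput("phrase", "Enter a word or a phrase:"),
  h4("Predicted next word:"),
  textOutput("prediction"),
  h4("Complete phrase:"),
  textOutput("complete")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)
    predict_next(input$phrase)          # backoff prediction from the model section
  })
  output$complete <- renderText({
    req(input$phrase)
    paste(input$phrase, predict_next(input$phrase))
  })
}

shinyApp(ui, server)
```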