Developing a Predictive Text Model

Author: Brad Allen

Understanding language a bit more deeply.

This report summarizes the SwiftKey / JHU Coursera Capstone project. The goal of the project is to develop a working predictive text application hosted on RStudio's ShinyApps servers.

Having never approached a Natural Language Processing (NLP) problem before, I was very curious about the process of modeling language.

I used the HC Corpora (hosted on Heliohost) as the background text, and I referred frequently to a Stanford NLP smoothing tutorial.

My Approach

The next few slides walk through my approach to the problem. For my solution, I will cover:

  • What does it do?
  • How does it work?
  • How might it be improved?

You can find an interactive companion at my shinyapps page (bradaallen).

What does it do?

The model works by first matching the supplied text against a database of 'n-grams' of various lengths, and then scoring candidate next words, with matches on longer n-grams weighted more strongly.

For example, take the phrase, “The quick brown fox jumps over the lazy dog.” If I have typed “The quick brown fox jumps…”, my backoff model first looks for a match on “brown fox jumps” (a 3-gram); failing that, it backs off to “fox jumps” (a 2-gram), and then to “jumps” (a 1-gram). The first of those three lookups to produce a match supplies the prediction.
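To make the lookup order concrete, here is a minimal sketch of a backoff lookup in R. The tables ngram3, ngram2, and ngram1 are hypothetical stand-ins for the app's actual n-gram database, and predict_next is an illustrative name, not the application's real function.

    # Hypothetical lookup tables: named character vectors mapping
    # an n-gram key to its most frequent next word.
    ngram3 <- c("brown fox jumps" = "over")
    ngram2 <- c("fox jumps"       = "over")
    ngram1 <- c("jumps"           = "over")

    predict_next <- function(text) {
      words <- strsplit(tolower(text), "\\s+")[[1]]
      # Try the longest usable suffix first, then back off.
      for (k in min(3, length(words)):1) {
        key <- paste(tail(words, k), collapse = " ")
        tbl <- switch(k, ngram1, ngram2, ngram3)  # pick the k-gram table
        if (!is.na(tbl[key])) return(unname(tbl[key]))
      }
      "the"  # fall back to a common word when nothing matches
    }

    predict_next("The quick brown fox jumps")  # returns "over"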

How does it work?

When you visit the site, enter text into the provided box; the app looks up the best match and displays its prediction.
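As an illustration only, the input-and-predict pattern behind a page like this can be sketched in a few lines of Shiny. It assumes a predict_next() function like the one sketched earlier; it is not the app's actual source.

    library(shiny)

    ui <- fluidPage(
      textInput("phrase", "Enter text:"),
      textOutput("prediction")
    )

    server <- function(input, output) {
      output$prediction <- renderText({
        if (nchar(trimws(input$phrase)) == 0) return("")
        paste("Predicted next word:", predict_next(input$phrase))
      })
    }

    shinyApp(ui, server)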

[Screenshot of the application]

How might it be improved?

  • Creating triggers for when to use 'stopwords' and when to refrain (a sketch follows this list).
  • Using more of the HC Corpora when building the underlying database.
  • With a larger database, exploring options for faster recall, such as Redis or MongoDB.
  • Deeper exploration of how to measure predictive accuracy.
  • As an extension, I am also curious how to develop a model that improves dynamically as it is used.
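On the first point, a stopword trigger might look something like the sketch below, using the English stopword list from the tm package. The remove_stops flag and clean_input helper are hypothetical, meant only to show the idea of toggling stopword removal rather than applying it unconditionally.

    library(tm)  # provides stopwords("english")

    # Hypothetical trigger: strip stopwords from the lookup key only
    # when the flag is set (e.g. for long, content-heavy phrases).
    clean_input <- function(text, remove_stops = FALSE) {
      words <- strsplit(tolower(text), "\\s+")[[1]]
      if (remove_stops) {
        words <- words[!words %in% stopwords("english")]
      }
      paste(words, collapse = " ")
    }

    clean_input("over the lazy dog", remove_stops = TRUE)  # "lazy dog"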

More detail on these thoughts can be found in the application itself. Thank you!