Capstone Project: Predictive Text Input

Ali Baghshomali
August 20th, 2015

Overview

The process of going from a large corpus to a text prediction app is filled with unforeseen obstacles and nuances. The diagram below provides a general outline of the overall process.

We start with a corpus of data. First, the data is cleaned and prepared (numbers, extra white space, and punctuation are removed) in preparation for building the lookup tables. To create the lookup tables, we used R functions to generate N-grams (sequences of N consecutive words) and sorted them by frequency and N-value.
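As an illustration, here is a minimal sketch of the cleaning and N-gram counting steps in base R. The file name and function name are hypothetical, not the project's actual code:

    # Read the raw corpus (file name is illustrative)
    lines <- readLines("corpus.txt", encoding = "UTF-8")

    # Clean: lowercase, strip numbers and punctuation, collapse white space
    clean <- tolower(lines)
    clean <- gsub("[[:digit:]]+", " ", clean)
    clean <- gsub("[[:punct:]]+", " ", clean)
    clean <- gsub("\\s+", " ", trimws(clean))

    # Count N-grams of a given order and sort by frequency
    ngram_table <- function(texts, n) {
      grams <- unlist(lapply(strsplit(texts, " ", fixed = TRUE), function(w) {
        if (length(w) < n) return(character(0))
        vapply(seq_len(length(w) - n + 1),
               function(i) paste(w[i:(i + n - 1)], collapse = " "),
               character(1))
      }))
      sort(table(grams), decreasing = TRUE)
    }

    bigrams  <- ngram_table(clean, 2)
    trigrams <- ngram_table(clean, 3)

Sorting each table once up front means the app only has to scan for matching prefixes at prediction time rather than re-ranking on every keystroke.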

Creating The Model

It’s very likely that a phrase looked up in the tables will have numerous matches, so we need a model that can pick the most likely outcome based on assigned probability values. This is where we employ our “back-off” strategy.

In this case, we assign a probability to each term based on the following factors (a rough code sketch follows the list):

  • Frequency of occurrence of the N-gram
  • Total number of N-grams (for each N-value)
  • Value of N (higher N-values correspond to higher probability)
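As a hypothetical illustration of how these three factors combine (the table layout and function name are assumptions, not the project's actual code), a back-off lookup in R might look like this. Here, higher N-values win simply because longer contexts are tried first:

    # `tables` is a list indexed by N, each a named vector of N-gram counts,
    # e.g. tables[[3]]["one of the"] = 42. This is an assumed layout.
    predict_word <- function(tables, input_words) {
      for (n in rev(2:length(tables))) {           # value of N: longest context first
        if (length(input_words) < n - 1) next
        prefix <- paste(tail(input_words, n - 1), collapse = " ")
        counts <- tables[[n]]
        hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
        if (length(hits) > 0) {
          score <- hits / sum(counts)              # frequency over total N-grams for this N
          best  <- names(score)[which.max(score)]
          return(sub(".* ", "", best))             # last word of the best N-gram
        }
      }
      names(tables[[1]])[which.max(tables[[1]])]   # fall back to the top unigram
    }

For example, given the input "one of", the function first tries the trigram table for entries beginning with "one of "; only if nothing matches does it back off to the bigram table, and finally to the single most frequent word.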

The Final Product

The final text prediction app is placed in the “Prediction” tab. It's very simple: you input text and it predicts the next most likely word.
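For reference, a minimal sketch of what the “Prediction” tab interface might look like, assuming the app is built with Shiny (the writeup does not name the framework) and reusing the hypothetical predict_word() from the previous sketch:

    library(shiny)

    ui <- fluidPage(
      tabsetPanel(
        tabPanel("Prediction",
          textInput("phrase", "Enter some text:"),
          textOutput("next_word")
        )
      )
    )

    server <- function(input, output) {
      output$next_word <- renderText({
        # Tokenize the input the same way the corpus was cleaned
        words <- strsplit(tolower(input$phrase), "\\s+")[[1]]
        if (length(words) == 0) return("")
        predict_word(tables, words)   # `tables` = the precomputed lookup tables
      })
    }

    shinyApp(ui, server)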

To demonstrate more of the app, I included a second part in the “Play” tab. This portion uses the prediction model to automatically generate words after a set of famous phrases. The user clicks the “Try!” button next to each snippet to see the prediction. Unlike the main model, this portion of the app doesn't always select the most likely next word; instead it takes a random pick from some of the top choices, so clicking the button generates a new term every time. Less accurate, but more fun!
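The randomized pick could be done along these lines (a hypothetical sketch; `candidates` would be the named vector of counts for the N-grams matching the phrase's prefix):

    # Sample one of the top k candidates, weighted by frequency,
    # instead of always returning the single most likely word.
    random_next <- function(candidates, k = 5) {
      top <- head(sort(candidates, decreasing = TRUE), k)
      sample(names(top), 1, prob = as.numeric(top))
    }

Weighting the sample by frequency keeps the output plausible while still varying from click to click.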

Notes / Conclusion

While the app fulfills its objective and can suggest the next word based on the initial corpus, it has its shortcomings. Some of these include:

  • Memory constraints forced us to sample the data and remove many sparse terms from the document-term matrix, which left us with smaller lookup tables
  • Including stopwords in the corpus makes the predictions more accurate, but also fills the tables with a lot of stopword combinations
  • When the app finds no N-gram matches, it picks a high-frequency word from the corpus. Adding context to this selection could help the app make better suggestions

Overall, there's definitely a lot of room for improvement, but the app is successful in implementing the core objective of the project.