Ali Baghshomali
August 20th, 2015
The process of going from a large corpus to a text prediction app is filled with unforeseen obstacles and nuances. The diagram below provides a general outline of the overall process.
We start with a corpus of data. First, the data is cleaned and prepared for building lookup tables: numbers, white space, and punctuation are removed. To create the lookup tables, we used R functions to extract N-grams (sequences of N consecutive words) and sorted them by frequency and N-value.
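To make the cleaning and N-gram steps concrete, here is a minimal sketch. It is written in Python for illustration and is not the R code used in the project. The function names `clean` and `ngram_table` are hypothetical, and lowercasing the text and keeping apostrophes are assumptions that the description above does not state.

```python
import re
from collections import Counter

def clean(text):
    """Drop digits and punctuation, then collapse runs of white space."""
    text = text.lower()                      # assumption: case is ignored
    text = re.sub(r"[^a-z' ]+", " ", text)   # assumption: apostrophes are kept
    return re.sub(r"\s+", " ", text).strip()

def ngram_table(tokens, n):
    """Count every run of n consecutive tokens, most frequent first."""
    grams = zip(*(tokens[i:] for i in range(n)))
    return Counter(grams).most_common()

tokens = clean("The cat sat. The cat ran, 2 times!").split()
bigrams = ngram_table(tokens, 2)   # list of ((word1, word2), count) pairs
```

Calling `ngram_table` once per N-value (for example 1, 2, and 3) would give one frequency-sorted table per N, which is roughly the shape of lookup table described above.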
A value looked up in the tables is very likely to have numerous matches, so we need a model that picks the most likely outcome based on assigned probability values. This is where we employ our “back-off strategy”.
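A simple form of back-off can be sketched as follows, again in Python rather than R. This is one common variant and not necessarily the one used in the app: it looks up the longest available context first and falls back to shorter contexts when there is no match. The names `build_tables` and `predict`, the maximum N of 3, and the scoring rule (relative frequency within the matching context) are all assumptions made for illustration.

```python
from collections import Counter, defaultdict

def build_tables(tokens, max_n=3):
    """Map each context tuple (0 to max_n - 1 words) to counts of the next word."""
    tables = defaultdict(Counter)
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            *context, nxt = tokens[i:i + n]
            tables[tuple(context)][nxt] += 1
    return tables

def predict(tables, words, max_n=3):
    """Try the longest context first; back off to shorter ones on a miss."""
    for k in range(max_n - 1, -1, -1):
        context = tuple(words[-k:]) if k else ()
        if len(context) == k and context in tables:
            counts = tables[context]
            word, count = counts.most_common(1)[0]
            # assumption: score is the relative frequency within this context
            return word, count / sum(counts.values())
    return None, 0.0

tables = build_tables("the cat sat on the mat and the cat ran".split())
hit = predict(tables, ["on", "the"])        # trigram context is found
backed_off = predict(tables, ["dog", "the"])  # falls back to the bigram context
```

In the second call the two-word context never occurs in the toy corpus, so the lookup backs off to the single word "the" and picks its most frequent follower.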
In this case, we assign a probability to each term based on:
The final text prediction app is placed in the “Prediction” tab. It's very simple: you input text and it predicts the most likely next word.
To demonstrate more of the app, I added a second part in the “Play” tab. In this portion, we use our prediction model to automatically generate some words after a set of famous phrases. The user clicks the “Try!” button next to each snippet to see the prediction. Unlike the main model, this portion of the app doesn't always select the most likely next word; instead, it takes a random pick from some of the top choices. This way, clicking the button again can generate a different term. Less accurate but more fun!
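The random pick could look something like the sketch below (Python, for illustration only). The function name `sample_next`, the cutoff of three candidates, and the uniform choice among them are assumptions; the app may weight the candidates differently.

```python
import random
from collections import Counter

def sample_next(counts, top_k=3, rng=random):
    """Pick uniformly at random among the top_k most frequent candidates."""
    top = [word for word, _ in counts.most_common(top_k)]
    if not top:
        return None
    return rng.choice(top)

# Hypothetical counts of words seen after some phrase
candidates = Counter({"be": 5, "have": 3, "go": 2, "xyz": 1})
rng = random.Random(0)   # seeded only to make the demo repeatable
picks = {sample_next(candidates, 3, rng) for _ in range(50)}
```

Every pick comes from the three most frequent candidates, so the output stays plausible while still varying between clicks.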
While the app fulfills its objective and can suggest the next word based on the initial corpus, it has its shortcomings. Some of these include:
Overall, there's definitely a lot of room for improvement, but the app is successful in implementing the core objective of the project.