Johns Hopkins Coursera Capstone

December 7, 2014

The purpose of this project is to create an algorithm capable of predicting the next word a user might type. For example, if the user has typed “how are you”, the predicted word might be “today”. There is a wide range of applications for such an algorithm; one is to speed up data entry on mobile devices such as cell phones.

Using the Application

The application is very easy to use.

  • Start typing into the text field.
  • You should see up to 3 predictions appear as buttons below what you type.
  • You can click one of the buttons if it shows the word you want; a minimal sketch of this interface follows below.
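A minimal sketch of what such an interface might look like in Shiny is shown below. The widget IDs and the predict_next() helper (which would return up to three candidate words) are illustrative assumptions, not the application's actual code:

    library(shiny)

    # Sketch of a text field with up to three prediction buttons underneath.
    # The predict_next() helper and all widget IDs are illustrative assumptions.
    ui <- fluidPage(
      textInput("phrase", "Start typing:", value = ""),
      uiOutput("predictions")
    )

    server <- function(input, output, session) {
      output$predictions <- renderUI({
        words <- predict_next(input$phrase)   # up to 3 candidate words
        lapply(seq_along(words), function(i) {
          actionButton(paste0("pred", i), label = words[i])
        })
      })
    }

    shinyApp(ui, server)

Handling the button clicks themselves (inserting the chosen word back into the text field) could be done with observeEvent() and updateTextInput(); that part is omitted from the sketch.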

Algorithm Details

This algorithm uses an n-gram model with back-off. I began by scanning the Twitter, blog, and news data for unique words. I removed the 7 Dirty Words and built a lexicon, assigning each word a unique index; I found a total of 5452 unique words. The indexes were ordered by frequency, which allowed me to use individual word frequency to break ties between n-grams. I then built a tree of all observed 3-grams and tracked the frequency of each.
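A minimal sketch of this preprocessing step is shown below, assuming the corpus has already been read into a character vector called lines and that a vector called profanity holds the words to exclude (both names are illustrative, not the actual project code):

    # Sketch of the lexicon and 3-gram counting step.
    # Assumes `lines` is a character vector of corpus text and `profanity`
    # is a character vector of words to exclude (both are illustrative names).
    tokenize <- function(x) {
      x <- tolower(x)
      x <- gsub("[^a-z' ]", " ", x)   # keep only letters and apostrophes
      unlist(strsplit(x, "\\s+"))
    }

    words <- tokenize(lines)
    words <- words[nchar(words) > 0 & !(words %in% profanity)]

    # Lexicon: unique words indexed in order of decreasing frequency,
    # so the index itself can be used to break ties between n-grams
    word_freq <- sort(table(words), decreasing = TRUE)
    lexicon <- data.frame(word  = names(word_freq),
                          index = seq_along(word_freq),
                          freq  = as.integer(word_freq),
                          stringsAsFactors = FALSE)

    # 3-gram counts: every consecutive word triple and how often it occurs
    n <- length(words)
    trigram_freq <- sort(table(paste(words[1:(n - 2)],
                                     words[2:(n - 1)],
                                     words[3:n])),
                         decreasing = TRUE)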

I ran into considerable memory issues with Shiny and had to prune my tree substantially, keeping only the top 25% of my n-grams by frequency. This decreased accuracy but allowed the Shiny app to run.
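Under the 25% cutoff described above, the pruning itself reduces to keeping the head of the frequency-sorted table (continuing with the hypothetical trigram_freq object from the sketch above):

    # Keep only the most frequent 25% of 3-grams so the model fits in memory
    cutoff       <- ceiling(0.25 * length(trigram_freq))
    trigram_freq <- trigram_freq[1:cutoff]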

Handling Unknown Words & Phrases

It can be difficult to handle a word that is not in the known lexicon. To handle this situation, I used Katz's back-off model:

\[ P_{bo} (w_i | w_{i-n+1} \cdots w_{i-1}) = \begin{cases} d_{w_{i-n+1} \cdots w_{i}} \frac{C(w_{i-n+1} \cdots w_{i-1} w_{i})}{C(w_{i-n+1} \cdots w_{i-1})} & \mbox{ if } C(w_{i-n+1} \cdots w_i) > k \\ \alpha_{w_{i-n+1} \cdots w_{i-1}} P_{bo}(w_i | w_{i-n+2} \cdots w_{i-1}) & \mbox{ otherwise} \end{cases} \]

The formula above handles the case where an n-gram of size n is unavailable (or its count falls below the threshold k): it backs off and calculates the probability of the (n-1)-gram formed from the same words, scaled by the back-off weight α. We keep reducing n until suitable n-grams are found and choose the word with the highest probability.
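A simplified sketch of that back-off lookup is below, continuing with the hypothetical trigram_freq and word_freq tables from the earlier sketch plus an analogous bigram_freq table. The discount d and the α weights from the formula are omitted here, so this is a plain frequency back-off rather than full Katz smoothing:

    # Simplified back-off lookup: try the 3-gram table, then an analogous
    # 2-gram table, then plain unigram frequency. The discount d and the
    # alpha weights from Katz's formula are omitted for brevity.
    predict_next <- function(phrase, k = 0, top = 3) {
      w <- tail(strsplit(tolower(phrase), "\\s+")[[1]], 2)

      if (length(w) == 2) {
        hits <- trigram_freq[startsWith(names(trigram_freq),
                                        paste(w[1], w[2], ""))]
        hits <- hits[hits > k]
        if (length(hits) > 0)
          return(head(sapply(strsplit(names(hits), " "), `[`, 3), top))
      }

      if (length(w) >= 1) {
        hits <- bigram_freq[startsWith(names(bigram_freq),
                                       paste(tail(w, 1), ""))]
        hits <- hits[hits > k]
        if (length(hits) > 0)
          return(head(sapply(strsplit(names(hits), " "), `[`, 2), top))
      }

      head(names(word_freq), top)   # fall back to the most frequent words
    }

For the example in the introduction, predict_next("how are you") would return up to three candidate completions ranked by observed frequency.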

Conclusions and Next Steps

Using a basic n-gram model, I did not achieve a particularly high accuracy rate against the entire data set, and I ran into considerably high error rates. Clearly, NLP is a complex area of research. To achieve better results I would attempt the following:

  • If I am constrained to the Shiny server, I would investigate how to compress the model and use as little memory as possible.
  • I would investigate models beyond the basic n-gram, perhaps more general Markov models.
  • I would also investigate lexical analysis of the words using tools such as OpenNLP.