December 7, 2014
The purpose of this project is to create an algorithm capable of predicting the next word a user is likely to type. For example, if the user has typed “how are you”, the algorithm might predict “today”. There is a wide range of applications for such an algorithm; one is to speed up data entry on mobile devices, such as cell phones.
The resulting application is simple to use: the user enters a phrase and the app returns the predicted next word.
This algorithm combines an n-gram model with back-off. I began by scanning the Twitter, blog, and news data for unique words. I removed the 7 Dirty Words and built a lexicon, assigning each word a unique index; I found a total of 5452 unique words. The indices were ordered by frequency, which let me use individual word frequency to break ties between n-grams. Using 3-grams, I built a tree of the 3-grams observed in the data and tracked the frequency of each.
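A minimal sketch of these two steps in R, assuming the corpora have already been read into a character vector `lines` and that `profanity` holds the words to filter out (both names, and the helper functions themselves, are illustrative rather than the app's actual code):

```r
# Build a frequency-ordered lexicon, dropping the filtered words (illustrative sketch)
build_lexicon <- function(lines, profanity) {
  words <- unlist(strsplit(tolower(lines), "\\s+"))
  words <- words[words != "" & !(words %in% profanity)]
  freq  <- sort(table(words), decreasing = TRUE)
  data.frame(word  = names(freq),
             count = as.integer(freq),
             index = seq_along(freq),        # index 1 = most frequent word
             stringsAsFactors = FALSE)
}

# Count every 3-gram observed in the text, keyed by "w1 w2 w3"
count_trigrams <- function(lines) {
  grams <- unlist(lapply(lines, function(line) {
    tokens <- unlist(strsplit(tolower(line), "\\s+"))
    tokens <- tokens[tokens != ""]
    if (length(tokens) < 3) return(character(0))
    vapply(seq_len(length(tokens) - 2), function(i)
      paste(tokens[i], tokens[i + 1], tokens[i + 2]), character(1))
  }))
  sort(table(grams), decreasing = TRUE)       # most frequent 3-grams first
}
```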
I ran into considerable memory issues with Shiny and had to prune my tree aggressively, ultimately keeping only the top 25% of my n-grams. This decreased accuracy but allowed the Shiny app to run.
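Keeping only the top quarter of the sorted counts can be sketched as follows (the 25% cutoff comes from the paragraph above; the function name is illustrative):

```r
# Keep only the most frequent fraction of n-grams so the Shiny app fits in memory
prune_ngrams <- function(ngram_counts, keep = 0.25) {
  n_keep <- ceiling(length(ngram_counts) * keep)
  ngram_counts[seq_len(n_keep)]   # assumes counts are already sorted by frequency
}
```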
It can be difficult to handle a word or phrase whose full n-gram does not appear in the training data. To handle this situation, I used Katz's back-off model:
\[ P_{bo}(w_i \mid w_{i-n+1} \cdots w_{i-1}) = \begin{cases} d_{w_{i-n+1} \cdots w_{i}} \frac{C(w_{i-n+1} \cdots w_{i-1} w_{i})}{C(w_{i-n+1} \cdots w_{i-1})} & \mbox{ if } C(w_{i-n+1} \cdots w_i) > k \\ \alpha_{w_{i-n+1} \cdots w_{i-1}} P_{bo}(w_i \mid w_{i-n+2} \cdots w_{i-1}) & \mbox{ otherwise} \end{cases} \]
The formula above handles the case where an n-gram of size n is not available (or its count does not exceed the threshold k): it backs off and computes the probability of the (n-1)-gram formed from the same words, scaled by the back-off weight α. We keep reducing n until suitable n-grams are found and then choose the word with the highest probability.
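The lookup loop can be sketched roughly as below. Note that this is a simplified stand-in for the full Katz model: it omits the discount d and the normalising weight α and simply falls back to shorter histories, and the table layout and function name are assumptions rather than the app's actual code.

```r
# ngram_tables: a list where element n is a count table of n-grams keyed by "w1 ... wn",
# each table sorted with the most frequent n-gram first
predict_next <- function(history, ngram_tables) {
  tokens <- unlist(strsplit(tolower(history), "\\s+"))
  for (n in rev(seq_along(ngram_tables))) {          # try the largest n first
    if (n == 1) break                                # unigram case handled after the loop
    if (length(tokens) < n - 1) next                 # history too short for this order
    prefix <- paste(tail(tokens, n - 1), collapse = " ")
    counts <- ngram_tables[[n]]
    hits   <- counts[startsWith(names(counts), paste0(prefix, " "))]
    if (length(hits) > 0) {                          # matching n-grams found: stop backing off
      best <- names(hits)[which.max(hits)]
      return(tail(unlist(strsplit(best, " ")), 1))   # last word of the best n-gram
    }
  }
  names(ngram_tables[[1]])[1]                        # fall back to the most frequent unigram
}
```

With 3-gram tables built from the counts above, `predict_next("how are you", ngram_tables)` would first look for 3-grams beginning with "are you" and only back off to 2-grams (and finally unigrams) if none survive pruning.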
Using this basic n-gram model, I did not achieve a very high accuracy rate against the full data set, and the error rate was considerable. NLP is a complex area of research. To achieve better results I would attempt the following: