Cell Phone Texting Word Prediction

Coursera Data Science Capstone Project

Derek Sollberger
code monkey

The Task

We are going to produce R code that mimics word-prediction algorithms for mobile text messaging. The joint venture between Johns Hopkins University and SwiftKey provided training data composed of Twitter posts, blogs, and news feeds. The files come in four languages: American English, Finnish, German, and Russian. The algorithm presented below emphasizes speed, yet should still yield results similar to those of the current, memory-intensive methods; that is, it should still predict what the user wants to type next.

           English                       Finnish
           blogs    news     Twitter     blogs    news     Twitter
Lines      898384   77258    2302307     439715   485758   278943
Size (MB)  205      200      163         105      92       24

           German                        Russian
           blogs    news     Twitter     blogs    news     Twitter
Lines      181909   244739   929660      337075   196360   875002
Size (MB)  83       93       73          114      116      102

The Plan

As a user types into the cell phone, the first letter and the word length of each word are computed. The same is done for a sample of the given data.
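The feature step can be sketched as follows. This is a minimal illustration in Python rather than the project's R code. The function names and the tokenization rule (lowercase, letters with optional internal apostrophes) are assumptions, not details taken from the project.

```python
import re

def tokenize(text):
    """Lowercase the text and pull out words made of letters,
    allowing internal apostrophes (an assumed tokenization rule)."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())

def word_features(text):
    """Return one (first letter, word length) pair per word."""
    return [(word[0], len(word)) for word in tokenize(text)]

print(word_features("See you at the game"))
```

The same function would be applied both to what the user has typed and to each sampled line of the training data, so the two sets of features are directly comparable.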

The application looks up the places in the sample where a word has the same first letter or word length as the word just typed, and then finds the 5 words that most frequently came next. The user is given this list of 5 candidates from which to choose the next word.
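A rough sketch of this lookup, written in Python for illustration rather than in the project's R, might look like the following. It assumes the match is on the pair (first letter, word length); matching on either feature alone would need one table per feature. The names build_table and predict and the toy sample lines are hypothetical.

```python
from collections import Counter, defaultdict
import re

def tokenize(text):
    """Lowercase the text and extract words (assumed tokenization rule)."""
    return re.findall(r"[^\W\d_]+(?:'[^\W\d_]+)*", text.lower())

def build_table(lines):
    """For each (first letter, length) key, count the words that followed
    a word with that key in the sampled lines."""
    table = defaultdict(Counter)
    for line in lines:
        words = tokenize(line)
        for current, following in zip(words, words[1:]):
            table[(current[0], len(current))][following] += 1
    return table

def predict(table, typed, k=5):
    """Return up to k candidate next words for the last word typed."""
    words = tokenize(typed)
    if not words:
        return []
    last = words[-1]
    followers = table.get((last[0], len(last)), Counter())
    return [word for word, _ in followers.most_common(k)]

# Hypothetical toy sample standing in for the Twitter/blog/news data.
sample = ["see you soon", "see you later", "saw you there", "sat the exam"]
table = build_table(sample)
print(predict(table, "I saw"))
```

Because the key is so coarse, "I saw", "I see", and "I sat" all receive the same candidate list here. That coarseness is the price of keeping the lookup table small and fast.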

The Future

This algorithm emphasizes speed and user comfort. Furthermore, the program can be adapted to calculate the word ranks from the first letter of the word currently being typed, and to recalculate them as the word length changes while the user types. Much of the preprocessing could be done in advance, and in parallel, for each possible combination of first letter and word length.
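One way to picture that preprocessing is a lookup table of ranked candidates built once, ahead of time. The sketch below is in Python for illustration, not the project's R, and runs serially. The function names and the toy word pairs are hypothetical. Each key is ranked independently of the others, which is why the work could in principle be split across workers.

```python
from collections import Counter, defaultdict

def follower_counts(pairs):
    """Group (current word, next word) pairs by the current word's
    (first letter, length) key and count the next words."""
    counts = defaultdict(Counter)
    for current, following in pairs:
        counts[(current[0], len(current))][following] += 1
    return counts

def precompute_ranks(counts, k=5):
    """Freeze the top-k next words for every key; no key depends on another."""
    return {key: [word for word, _ in counter.most_common(k)]
            for key, counter in counts.items()}

# Hypothetical toy word pairs standing in for the sampled corpus.
pairs = [("see", "you"), ("see", "you"), ("saw", "you"), ("sat", "the"),
         ("s", "curve"), ("so", "far"), ("so", "far"), ("so", "much")]
ranks = precompute_ranks(follower_counts(pairs))

# As the user types "s", "sa", "saw", the word length changes,
# so the key changes and a different precomputed list is shown.
for partial in ["s", "sa", "saw"]:
    key = (partial[0], len(partial))
    print(partial, ranks.get(key, []))
```

At typing time the application would then only perform a dictionary lookup per keystroke, with no counting or sorting.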

Finally, it would be interesting to implement Christian Rudder's 2D visualization for word discrimination and to run these calculations for several typed words at once.