The Task

We are going to produce R code that mimics word-prediction algorithms for mobile text messaging. The joint venture between Johns Hopkins University and SwiftKey provided training data composed of Twitter posts, blogs, and news feeds. The files come in four languages: American English, Finnish, German, and Russian. The algorithm presented below emphasizes speed, yet should still yield results similar to those of the current, memory-intensive methods; that is, it should still predict what the user wants to type next.

Language   Source     Lines      Size (MB)
English    blogs       898384    205
English    news         77258    200
English    Twitter    2302307    163
Finnish    blogs       439715    105
Finnish    news        485758     92
Finnish    Twitter     278943     24
German     blogs       181909     83
German     news        244739     93
German     Twitter     929660     73
Russian    blogs       337075    114
Russian    news        196360    116
Russian    Twitter     875002    102

The Plan

As a user types into the cell phone, the first letter and the word length of each word are computed. The same is done for a sample of the given data.
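
As a rough illustration of this step, the sketch below splits a piece of text into words and records the first letter and length of each one. The function name `word_features` and the tokenization rule (lowercase, split on anything that is not a letter or apostrophe) are assumptions made for this example, not necessarily what the application does.

```r
# Hypothetical helper: describe every word in a text by its first letter
# and its length. The tokenization rule is an assumption for this sketch.
word_features <- function(text) {
  # Lowercase, then split on runs of characters that are not letters or apostrophes
  words <- unlist(strsplit(tolower(text), "[^[:alpha:]']+"))
  words <- words[nzchar(words)]  # drop empty strings left by leading separators
  data.frame(
    word         = words,
    first_letter = substr(words, 1, 1),
    word_length  = nchar(words),
    stringsAsFactors = FALSE
  )
}

feats <- word_features("See you at the game")
print(feats)
```

The same function could be applied to each line of the sampled training data, so that typed text and training text are described by the same two features.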

The application looks up the situations in the sample where a word had the same first letter or the same word length, and then finds the 5 most frequent words that came next in those situations. The user is given this list of 5 candidates from which to choose the next word.
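
The lookup described above can be sketched in a few lines of R. This is a minimal sketch under stated assumptions: the names `build_pairs` and `predict_next`, the toy sample lines, and the tokenization rule are all made up for illustration, and the real application may store and match its data differently.

```r
# Hypothetical helper: for every word in the sample that is followed by another
# word, record its first letter, its length, and the word that came next.
build_pairs <- function(lines) {
  pairs <- lapply(lines, function(line) {
    w <- unlist(strsplit(tolower(line), "[^[:alpha:]']+"))
    w <- w[nzchar(w)]
    if (length(w) < 2) return(NULL)  # no "next word" in a one-word line
    data.frame(
      first_letter = substr(head(w, -1), 1, 1),
      word_length  = nchar(head(w, -1)),
      next_word    = tail(w, -1),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, Filter(Negate(is.null), pairs))
}

# Hypothetical helper: return up to n most frequent next words among the
# situations that share the first letter OR the length of the current word.
predict_next <- function(pairs, current_word, n = 5) {
  current_word <- tolower(current_word)
  hits <- pairs$first_letter == substr(current_word, 1, 1) |
    pairs$word_length == nchar(current_word)
  if (!any(hits)) return(character(0))
  counts <- sort(table(pairs$next_word[hits]), decreasing = TRUE)
  head(names(counts), n)
}

# Toy sample standing in for the sampled training data
sample_lines <- c(
  "see you at the game",
  "see you at the park",
  "meet me at the game",
  "are you at home"
)
pairs <- build_pairs(sample_lines)
predictions <- predict_next(pairs, "at")
print(predictions)
```

With this toy sample, "the" is the top candidate for the typed word "at", because it follows a matching word three times while every other candidate appears only once.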


The Future

This algorithm emphasizes speed and user comfort. Furthermore, the program can be adapted to calculate the word ranks from the first letter of the current word, and also from the changing word length as the user types. Much of the preprocessing could be done in advance, and in parallel, for each possible first letter and word length.
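
One way such preprocessing might look is a pair of lookup tables, one keyed by first letter and one keyed by word length, each holding a ready-made list of top candidates. The sketch below uses a small made-up table `pairs` and a hypothetical helper `top_words`; it runs sequentially with `lapply`, but because each key is processed independently, the same calls could in principle be handed to a parallel routine such as `parallel::mclapply`.

```r
# Toy table of observations (made up for this sketch): features of a word
# and the word that followed it in the sample.
pairs <- data.frame(
  first_letter = c("a", "a", "a", "t", "t", "s"),
  word_length  = c(2L, 2L, 3L, 3L, 3L, 3L),
  next_word    = c("the", "the", "you", "game", "park", "you"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: up to n most frequent words in a character vector
top_words <- function(words, n = 5) {
  counts <- sort(table(words), decreasing = TRUE)
  head(names(counts), n)
}

# One independent job per key, so each lapply is a candidate for parallel execution
by_letter <- lapply(split(pairs$next_word, pairs$first_letter), top_words)
by_length <- lapply(split(pairs$next_word, pairs$word_length), top_words)

# At typing time, a prediction becomes a simple list lookup
print(by_letter[["a"]])
print(by_length[["3"]])
```

Precomputing the tables trades a little memory for a lookup that no longer has to scan the sample while the user is typing.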

Finally, it would be interesting to implement Christian Rudder’s 2D visualization for word discrimination and to run these calculations for several typed words at once.



The App

Use the application found at http://freexstate.shinyapps.io/CapstoneShiny/

Try out a few phrases and see what the app predicts for you!