Cell Phone Texting Word Prediction

Coursera Data Science Capstone Project

Derek Sollberger
code monkey

The Task

We are going to produce R code that mimics the word-prediction algorithms used in mobile text messaging. The joint venture between Johns Hopkins University and SwiftKey provided training data composed of Twitter posts, blogs, and news feeds. The files are available in four languages: American English, Finnish, German, and Russian. The algorithm presented below emphasizes speed, yet will hopefully yield results similar to those of the current, memory-intensive methods; that is, it should still predict what the user wants to type next.

	English			Finnish
	blogs	news	Twitter	blogs	news	Twitter
Lines	898384	77258	2302307	439715	485758	278943
Size (MB)	205	200	163	105	92	24

	German			Russian
	blogs	news	Twitter	blogs	news	Twitter
Lines	181909	244739	929660	337075	196360	875002
Size (MB)	83	93	73	114	116	102

The Plan

As a user types into the cell phone, the first letter and word length of each word are computed. The same is done for a sample of the given data.
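The feature step described above might look like the following minimal sketch. The function name `word_features` and the simple tokenizing rule (lowercase, split on anything that is not a letter or apostrophe) are assumptions made for illustration, not the app's actual code.

```r
# Hypothetical helper: split text into lowercase words and describe each
# word by its first letter and its length.
word_features <- function(text) {
  # Assumed tokenizer: anything other than a-z or an apostrophe separates words
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words <- words[nzchar(words)]  # drop empty strings left by leading separators
  data.frame(
    word         = words,
    first_letter = substr(words, 1, 1),
    word_length  = nchar(words),
    stringsAsFactors = FALSE
  )
}

# The same function can be applied to the user's input and to sampled lines
word_features("See you at the game")
```

Because the features are so cheap to compute, the same helper can be run on every keystroke as well as on the sampled training lines.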

The application looks up situations in the sample where a word had the same first letter or word length, and then finds the 5 most frequent words that came next. The user is given this list of 5 candidates as choices for the next word.
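One way to read this plan is sketched below, under stated assumptions: the function names `build_pairs` and `predict_next`, the tokenizer, and the alphabetical tie-breaking rule are all hypothetical and are not taken from the app itself. The sketch stores one row per adjacent word pair in the sample, then ranks the next words of every row whose current word shares the first letter or the length of the last typed word.

```r
# Assumed tokenizer: lowercase, split on anything but letters and apostrophes
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# Hypothetical: one row per adjacent (current word, next word) pair
build_pairs <- function(lines) {
  pairs <- lapply(lines, function(line) {
    words <- tokenize(line)
    if (length(words) < 2) return(NULL)
    current <- head(words, -1)
    data.frame(
      first_letter = substr(current, 1, 1),
      word_length  = nchar(current),
      next_word    = tail(words, -1),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, Filter(Negate(is.null), pairs))
}

# Hypothetical: the n most frequent next words for the last typed word
predict_next <- function(pairs, typed, n = 5) {
  words <- tokenize(typed)
  if (length(words) == 0) return(character(0))
  last <- words[length(words)]
  # Match on the same first letter OR the same word length, as in the plan
  hits <- pairs$first_letter == substr(last, 1, 1) |
          pairs$word_length == nchar(last)
  if (!any(hits)) return(character(0))
  counts <- table(pairs$next_word[hits])
  # Most frequent first; ties broken alphabetically (an assumption)
  ranked <- names(counts)[order(-as.integer(counts), names(counts))]
  head(ranked, n)
}

sample_lines <- c("see you at the game", "see you at the park",
                  "so you are the one", "the game is on")
pairs <- build_pairs(sample_lines)
predict_next(pairs, "I will see")
```

Matching on such coarse features keeps the lookup table small, which is the speed trade-off the slides describe; the price is that many unrelated words share a first letter or length.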

The Future

This algorithm emphasizes speed and user comfort. Furthermore, the program can be adapted to calculate the word ranks from the first letter of the current word, and to recalculate them from the changing word length as the user types. Much of the preprocessing could be done in advance, and in parallel, for the different possible first letters and word lengths.
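The preprocessing idea could be sketched as follows. The toy `pairs` table and the helper `top_next` are hypothetical, and plain `lapply` is used so the sketch runs anywhere; since each first letter and each word length is an independent job, the `lapply` calls could be swapped for a parallel equivalent such as `parallel::mclapply` on Unix-like systems.

```r
# Hypothetical toy table of (current word, next word) pairs; in practice this
# would be built from a sample of the training data.
pairs <- data.frame(
  current   = c("see", "see", "so", "the", "the", "the", "at", "at"),
  next_word = c("you", "you", "you", "game", "game", "park", "the", "the"),
  stringsAsFactors = FALSE
)

# Rank next words by frequency; ties broken alphabetically (an assumption)
top_next <- function(next_words, n = 5) {
  counts <- table(next_words)
  head(names(counts)[order(-as.integer(counts), names(counts))], n)
}

# One independent job per first letter, and one per word length
by_letter <- lapply(split(pairs$next_word, substr(pairs$current, 1, 1)), top_next)
by_length <- lapply(split(pairs$next_word, nchar(pairs$current)), top_next)

# At typing time, prediction reduces to a list lookup
by_letter[["s"]]
by_length[["3"]]
```

With the rankings stored ahead of time, the work done per keystroke is a pair of list lookups rather than a scan of the sample.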

Finally, it would be interesting to implement Christian Rudder's 2D visualization for word discrimination and run these calculations for several typed words at once.


The App

Use the application found at http://freexstate.shinyapps.io/CapstoneShiny/

Try out a few phrases and see what the app predicts for you!