Word Prediction for Smart Keyboards - Milestone Report

Brian Francis
04-September-2016

Motivation

Typing text on a smartphone can be quite slow. If the phone could guess the next word, typing could go much faster.

The goal of this project is to create a web application that presents three candidate next words based on the text typed so far. The prediction model will be built from blog, news, and Twitter text gathered from the internet.

What follows are exploratory summaries and charts describing the data set, along with a proposal for how the prediction model will be developed.

Summary Statistics

Below are the number of lines, total number of words, number of unique words, and number of unique two- and three-word phrases in a 1% random sample of each of the three data sources (blog posts, news articles, and Twitter messages).

                       Blog   News Twitter
Line Count             9003  10012   23676
Total Word Count     188050 189276  159158
Unique Word Count     20033  21122   19563
Unique Word Pairs    158052 157724  118507
Unique Word Triplets 173292 171093  120956
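
Counts like those in the table can be produced with a few lines of code. The following is a minimal Python sketch, not the code used for this report: the function names, the regular-expression tokenizer, and the fixed seed are all assumptions made for illustration, and a different tokenizer would give somewhat different counts.

```python
import random
import re

def sample_lines(lines, fraction=0.01, seed=42):
    """Keep each line independently with probability `fraction` (assumed sampling scheme)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]

def tokenize(line):
    """Lowercase a line and split it into runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", line.lower())

def ngrams(tokens, n):
    """Return every run of n consecutive tokens in one line, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def summarize(lines):
    """Compute the five summary rows for one data source."""
    total_words = 0
    unique = {1: set(), 2: set(), 3: set()}
    for line in lines:
        tokens = tokenize(line)
        total_words += len(tokens)
        for n in unique:
            # Phrases never span two lines, because n-grams are built per line.
            unique[n].update(ngrams(tokens, n))
    return {
        "Line Count": len(lines),
        "Total Word Count": total_words,
        "Unique Word Count": len(unique[1]),
        "Unique Word Pairs": len(unique[2]),
        "Unique Word Triplets": len(unique[3]),
    }

# Tiny made-up example, not data from the report.
stats = summarize(["the cat sat on the mat", "the cat ran"])
```

Building n-grams per line is a design choice: it avoids counting phrases that straddle two unrelated tweets or articles.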

Frequency Plots

[Figure: word frequency histograms]

Frequency Plots (Continued)

The preceding histograms show the frequency of unique words. The x-axis is the number of times a word appears in the raw text (on a log10 scale: 1, 10, 100, 1000). The y-axis is the number of unique words with that count.

As you can see, most words appear very infrequently (just once), but a small number of words are seen very often. The large number of infrequent words may point to some inconsistency in the data cleaning.

Note that very common words that provide little information for prediction (e.g., “and”, “the”) were removed from the data set during pre-processing.
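
The quantity plotted in the histograms is a "count of counts": for each occurrence count, how many unique words have it. Below is a minimal Python sketch of that calculation, including the stop-word removal step. The stop-word set shown is a short illustrative stand-in, not the list actually used in pre-processing, and the function name is hypothetical.

```python
from collections import Counter

# Illustrative stand-in only; the report's actual stop-word list is not shown.
STOPWORDS = {"and", "the", "a", "of", "to"}

def count_of_counts(tokens, stopwords=STOPWORDS):
    """Map each occurrence count to the number of unique words with that count."""
    word_counts = Counter(t for t in tokens if t not in stopwords)
    return Counter(word_counts.values())

# Made-up example: after stop-word removal, "cat" appears twice,
# while "dog", "saw", and "bird" appear once each.
tokens = "the cat and the dog and the cat saw a bird".split()
histogram = count_of_counts(tokens)
```

Plotting the keys of `histogram` on a log10 x-axis against its values would give a chart of the kind described above.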

Word Coverage

[Figure: word coverage plots]

Word Coverage (Continued)

The preceding plots show how many unique words are needed to cover a given percentage of all the words in the raw text. Horizontal lines indicate 50% and 90% coverage. A relatively small number of words covers half the text, and half of the unique words give more than 90% coverage.

This indicates that I gain relatively little information by including the low-frequency words, so I may drop them if that helps performance.
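
The coverage calculation can be sketched as follows: sort the unique words from most to least frequent and count how many are needed before their cumulative count reaches the target share of all words. This is a minimal Python illustration under that assumption; the function name is hypothetical and the example tokens are made up.

```python
from collections import Counter

def words_needed_for_coverage(tokens, target):
    """Smallest number of most-frequent unique words whose counts sum to
    at least `target` (a fraction between 0 and 1) of all tokens."""
    counts = Counter(tokens)
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return rank
    return len(counts)

# Made-up example: 7 tokens, 3 unique words.
tokens = "a a a a b b c".split()
half = words_needed_for_coverage(tokens, 0.5)
most = words_needed_for_coverage(tokens, 0.9)
```

Running the same function over a range of targets gives the curve in the coverage plots, and the value at 0.9 suggests how large a vocabulary is worth keeping.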

Most Frequent Two Word Phrases

[Figure: most frequent two-word phrases]

Performing the Prediction

To do the final prediction, I will find probabilities of a new word given the previous words typed and present the top three possibilities. I will use up to 3 of the previously entered words as a basis for that prediction.

Assuming 3 words are available, the algorithm will calculate four probabilities and take their weighted average. The four probabilities will be based on knowing all 3 words typed, knowing just the last 2 words, knowing just the last word, and not knowing any of the words. If fewer words are available, correspondingly fewer probabilities will be used.

This approach will allow us to suggest likely words even when the typed words do not match any phrase seen in the training data.
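
A weighted average of this kind is often called linear interpolation of n-gram estimates. The sketch below shows one way it could look in Python. It is an illustration under stated assumptions, not the final implementation: the class name, the weight values (0.1, 0.2, 0.3, 0.4), and the choice to renormalize the weights when fewer context words are available are all placeholders.

```python
from collections import Counter, defaultdict

class InterpolatedNgramModel:
    """Weighted average of next-word estimates from 0 to 3 words of context."""

    def __init__(self, weights=(0.1, 0.2, 0.3, 0.4)):
        # weights[k] is applied to the estimate that uses k context words.
        # These values are placeholders, not tuned.
        self.weights = weights
        self.max_context = len(weights) - 1
        # Maps a context tuple (possibly empty) to counts of the words that followed it.
        self.followers = defaultdict(Counter)

    def train(self, sentences):
        for tokens in sentences:
            for i, word in enumerate(tokens):
                for k in range(min(i, self.max_context) + 1):
                    self.followers[tuple(tokens[i - k:i])][word] += 1

    def predict(self, context, top=3):
        context = tuple(context)
        usable = min(len(context), self.max_context)
        # Renormalize over the levels that the available context allows.
        weight_sum = sum(self.weights[:usable + 1])
        scores = Counter()
        for k in range(usable + 1):
            counts = self.followers.get(context[len(context) - k:])
            if not counts:
                continue  # context never seen: this level contributes nothing
            total = sum(counts.values())
            for word, count in counts.items():
                scores[word] += (self.weights[k] / weight_sum) * count / total
        return [word for word, _ in scores.most_common(top)]

# Made-up training sentences, already tokenized.
corpus = [
    ["i", "love", "new", "york"],
    ["i", "love", "new", "shoes"],
    ["we", "love", "new", "york"],
]
model = InterpolatedNgramModel()
model.train(corpus)
suggestions = model.predict(["i", "love", "new"])
fallback = model.predict(["zebra"])  # unseen word: only overall word frequency applies
```

In this toy example "york" ranks first for the context "i love new" because it is favored at every context length, while the unseen word "zebra" falls back to the most frequent words overall.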

Evaluating the Model

To evaluate the accuracy of the model, I will randomly sample records from the original data source that were not used to create the model. I will then run the model against this new data set and compare its predictions to the actual next word in the data set. This will need to be evaluated for single words as well as two-, three-, and four-word phrases.
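
One way to score this comparison is top-3 accuracy: the share of positions in the held-out text where the actual next word is among the suggestions, reported separately by phrase length. The Python sketch below assumes that metric. The function names are hypothetical, and `toy_predict` is a stand-in predictor included only so the example runs on its own.

```python
from collections import Counter

def top_k_accuracy(predict, sentences, max_context=3):
    """Return accuracy keyed by phrase length (context words plus the predicted word).

    `predict` is any callable that takes a list of context words and
    returns a list of suggested next words.
    """
    hits, trials = Counter(), Counter()
    for tokens in sentences:
        for i in range(len(tokens)):
            context = tokens[max(0, i - max_context):i]
            phrase_length = len(context) + 1  # 1 = single word, up to 4
            trials[phrase_length] += 1
            if tokens[i] in predict(context):
                hits[phrase_length] += 1
    return {n: hits[n] / trials[n] for n in sorted(trials)}

def toy_predict(context):
    """Stand-in predictor for illustration only."""
    table = {"good": ["morning", "night", "luck"]}
    default = ["the", "i", "good"]
    return table.get(context[-1], default) if context else default

# Made-up held-out sentences, already tokenized.
held_out = [["good", "morning"], ["good", "grief"]]
accuracy = top_k_accuracy(toy_predict, held_out)
```

Here `accuracy` maps phrase length 1 to 1.0 (both first words were suggested) and phrase length 2 to 0.5 ("morning" was suggested, "grief" was not).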