Brian Francis
04-September-2016
Typing text into a smartphone can be quite slow. If the phone could guess the next word, it could go much faster.
The goal of this project is to create a web application that presents three possible next words based on the text already typed. The prediction model will be built from blog, news, and Twitter text gathered from the internet.
What follows are some exploratory summaries and charts describing the data set, along with a proposal for how the prediction model will be developed.
The table below gives the number of lines, total words, unique words, and unique two- and three-word phrases in a 1% random sample of each of the three data sources (blog posts, news articles, and Twitter messages).
Measure                    Blog      News   Twitter
Line Count                 9003     10012     23676
Total Word Count         188050    189276    159158
Unique Word Count         20033     21122     19563
Unique Word Pairs        158052    157724    118507
Unique Word Triplets     173292    171093    120956
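As a rough illustration of how counts like these could be produced, here is a minimal Python sketch. It is not the code used for this report: the tokenizer (lowercase, letters and apostrophes only), the function names, and the choice not to let phrases cross line boundaries are all assumptions, so its output would not necessarily match the table.

```python
import random
import re
from collections import Counter


def tokenize(line):
    """Lowercase a line and split it into word tokens (assumed tokenizer)."""
    return re.findall(r"[a-z']+", line.lower())


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def sample_lines(lines, fraction=0.01, seed=42):
    """Keep each line with the given probability (about 1% by default)."""
    rng = random.Random(seed)
    return [line for line in lines if rng.random() < fraction]


def summarize(lines):
    """Count lines, words, unique words, and unique 2- and 3-word phrases."""
    words = Counter()
    pairs = set()
    triplets = set()
    for line in lines:
        tokens = tokenize(line)
        words.update(tokens)
        # Phrases are built within a line only (an assumption of this sketch).
        pairs.update(ngrams(tokens, 2))
        triplets.update(ngrams(tokens, 3))
    return {
        "line_count": len(lines),
        "total_words": sum(words.values()),
        "unique_words": len(words),
        "unique_pairs": len(pairs),
        "unique_triplets": len(triplets),
    }
```

Calling summarize(sample_lines(all_lines)) once per source would give one column of the table, where all_lines is a hypothetical list holding that source's raw text lines.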
The preceding histograms show the frequency of unique words. The x-axis is the number of times a word appears in the raw text (on a log10 scale: 1, 10, 100, 1000). The y-axis is the number of unique words with that count.
As you can see, most words appear very infrequently (just once), but a small number of words are seen very often. The number of infrequent words may point to some inconsistency in the data cleaning.
Note that very common words that provide little information for prediction (e.g., “and”, “the”) were removed from the data set during pre-processing.
The preceding plots show how many unique words are needed to cover a given percentage of all the words in the raw text. Horizontal lines indicate 50% and 90% coverage. A relatively small number of words covers half the text, and half of the unique words give more than 90% coverage.
This indicates that I gain relatively little information by including the low frequency words, so I may consider dropping them if that will help performance.
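The coverage calculation can be sketched as follows: sort words from most to least frequent and count how many are needed before their cumulative count reaches the target share of all words. The function name and the use of a Counter of word frequencies are assumptions made for illustration.

```python
from collections import Counter


def words_needed_for_coverage(word_counts, target):
    """Return how many of the most frequent words cover `target` of all words.

    word_counts: Counter mapping each word to its frequency.
    target: fraction of total word occurrences to cover, e.g. 0.5 or 0.9.
    """
    total = sum(word_counts.values())
    if total == 0:
        return 0
    covered = 0
    # most_common() yields words from most to least frequent.
    for rank, (_, count) in enumerate(word_counts.most_common(), start=1):
        covered += count
        if covered >= target * total:
            return rank
    return len(word_counts)
```

Words ranked below the returned position are the low-frequency candidates that could be dropped if performance requires it.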
To do the final prediction, I will find probabilities of a new word given the previous words typed and present the top three possibilities. I will use up to 3 of the previously entered words as a basis for that prediction.
Assuming 3 words are available, the algorithm will compute four probabilities and take a weighted average of them. The four probabilities are conditioned on the last 3 words typed, on just the last 2 words, on just the last word, and on none of the words. If fewer words are available, correspondingly fewer probabilities will be used.
This approach will allow us to provide likely words when the information given doesn't match things we've seen previously.
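A minimal Python sketch of this weighted-average idea is shown below. The class name, the weight values, and the decision to rescale the weights when fewer words are available are all assumptions chosen for illustration; the actual weights would need to be tuned. A context that was never seen simply contributes nothing, so the scores are suitable for ranking candidates but are not guaranteed to sum to one.

```python
from collections import Counter, defaultdict


class InterpolatedPredictor:
    """Next-word predictor that averages estimates from 0 to 3 previous words."""

    def __init__(self, weights=(0.1, 0.2, 0.3, 0.4)):
        # weights[k] applies to the estimate conditioned on k previous words.
        # These values are placeholders, not tuned weights.
        self.weights = weights
        self.max_context = len(weights) - 1
        # counts[k][context] counts the words seen after a k-word context.
        self.counts = [defaultdict(Counter) for _ in weights]

    def train(self, sentences):
        """sentences: iterable of token lists."""
        for tokens in sentences:
            for i, word in enumerate(tokens):
                for k in range(min(i, self.max_context) + 1):
                    context = tuple(tokens[i - k:i])
                    self.counts[k][context][word] += 1

    def predict(self, previous_words, top_n=3):
        """Return up to top_n candidate next words, best first."""
        history = tuple(previous_words)
        usable = min(len(history), self.max_context)
        # Rescale the weights over the context lengths that are available.
        weight_sum = sum(self.weights[:usable + 1])
        scores = Counter()
        for k in range(usable + 1):
            context = history[len(history) - k:]
            followers = self.counts[k].get(context)
            if not followers:
                continue  # unseen context: contributes probability 0
            total = sum(followers.values())
            for word, count in followers.items():
                scores[word] += self.weights[k] / weight_sum * count / total
        return [word for word, _ in scores.most_common(top_n)]
```

Because the estimate with no context is always available after training, predict still returns the most common words overall when none of the typed words match anything seen before.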
To evaluate the accuracy of the model, I will randomly sample records from the original data source that were not used to create the model. I will then run the model against this new data set and compare its predictions to the actual next word in the data set. This will need to be evaluated for single words as well as two-, three-, and four-word phrases.
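One way this evaluation could be set up is sketched below in Python. The function names and the 20% holdout fraction are assumptions. Treating single words and two-, three-, and four-word phrases as contexts of zero to three previous words is my reading of the plan above, not something fixed yet. The predict argument is any function that takes a list of previous words and returns a list of candidate next words.

```python
import random


def split_lines(lines, holdout_fraction=0.2, seed=42):
    """Shuffle the lines and split them into (training, held-out) lists."""
    rng = random.Random(seed)
    shuffled = list(lines)
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout_fraction))
    return shuffled[:cut], shuffled[cut:]


def top_n_accuracy(predict, test_sentences, context_size):
    """Share of positions where the true next word is among the predictions.

    predict: function mapping a list of previous words to candidate words.
    test_sentences: iterable of token lists not used to build the model.
    context_size: number of previous words given to the model (0 to 3).
    """
    hits = 0
    trials = 0
    for tokens in test_sentences:
        for i in range(context_size, len(tokens)):
            context = tokens[i - context_size:i]
            if tokens[i] in predict(context):
                hits += 1
            trials += 1
    return hits / trials if trials else 0.0
```

Running top_n_accuracy once for each context size from 0 to 3 would give one accuracy figure per phrase length.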