The Coursera Data Science Capstone project involves building a word prediction model that would be useful to users of virtual keyboards on smartphones or tablets. A number of files that can be used to produce such a model were provided, containing examples of Twitter messages (tweets), blog entries, and news stories in several languages.
For the purposes of this project, we will focus on the three English-language files provided.
The table below summarizes the three files.
| File Name | Size (MB) | Lines | Sentences | Words | Unique Words | Unique Words Covering 90% of Occurrences | Single-Occurrence Words | Words per Sentence |
|---|---|---|---|---|---|---|---|---|
| en_US.twitter.txt | 301 | 2,360,148 | 3,765,661 | 29,679,912 | 426,962 | 5,795 | 262,103 | 7.9 |
| en_US.blogs.txt | 248 | 899,288 | 2,346,724 | 37,536,159 | 415,802 | 7,156 | 227,562 | 16.0 |
| en_US.news.txt | 19 | 77,259 | 151,935 | 2,594,946 | 95,277 | 8,883 | 47,500 | 17.1 |
| Totals | 569 | 3,336,695 | 6,264,320 | 69,811,017 | 732,978 | 7,291 | 439,269 | 11.1 |
The Twitter file has the largest number of lines and sentences, but the blog file has the largest number of words. Tweets have the greatest number of unique words, but many of those are contrived letter strings rather than English-language words; for example, there are 75,001 unique hashtags, with values such as #neversaynever, #oscars, and #obama. Both blog posts and news articles average more than twice as many words per sentence as tweets.
Notably, of the 732,978 unique words in the combined corpora, 439,269 (60%) occur only once.
The chart below plots the cumulative percentage of unique words against the cumulative percentage of total word occurrences across all three corpora. Fewer than 1% of the unique words account for 90% of all word occurrences, so any prediction model is likely to make use of only a small fraction of the available words.
The chart below shows the frequency of occurrence for the top 50 unique words in the combined corpus.
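For reference, the frequency and coverage figures behind these charts could be derived along the lines of the following sketch. This is not the exact processing used to build the table above; the simple lowercase regex tokenizer is an assumption, and only the corpus file names come from the project.

```python
import re
from collections import Counter

def word_counts(path):
    """Count word occurrences in one corpus file using a naive lowercase tokenizer."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts

# Combine the three English corpora.
combined = Counter()
for name in ["en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"]:
    combined.update(word_counts(name))

# How many of the most frequent unique words are needed to cover 90% of all occurrences?
total = sum(combined.values())
covered = needed = 0
for word, count in combined.most_common():
    covered += count
    needed += 1
    if covered >= 0.9 * total:
        break

print(f"{needed:,} of {len(combined):,} unique words cover 90% of all occurrences")
print("Top 50 words:", [w for w, _ in combined.most_common(50)])
```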
The goal of the project is to develop a tool that suggests possible next words when someone is typing on a smartphone or tablet virtual keyboard. Such tools typically offer multiple alternatives, usually four or five possible words. The basic strategy will be to treat typed words as a Markov chain, which means that predictions about the next word can be made based only on the most recently typed words. The most recently typed words form an “n-gram”, where n is the number of words in the sequence: a unigram (1-gram) is the most recently typed word, a bigram is the two most recently typed words, a trigram is three words, and so on. The approach we will use to make our predictions is Katz’s back-off model: the next word is predicted from the longest n-gram in the training data that matches the words just typed, backing off to progressively shorter n-grams (with some probability mass discounted and redistributed to the lower-order estimates) when no longer match is found.
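To make the back-off idea concrete, here is a minimal Python sketch, not the project implementation, that builds n-gram tables and predicts by backing off from the longest matching prefix. For simplicity it ranks candidates by raw counts and omits the Good-Turing discounting that distinguishes full Katz back-off; the function names and the toy sentence are purely illustrative.

```python
from collections import Counter, defaultdict

def build_ngram_tables(tokens, max_n=3):
    """Count n-grams up to max_n; each (n-1)-word prefix maps to a Counter of next words."""
    unigrams = Counter(tokens)
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            tables[n][prefix][tokens[i + n - 1]] += 1
    return unigrams, tables

def predict(history, unigrams, tables, k=4, max_n=3):
    """Back off from the longest matching prefix to shorter ones, then to unigrams."""
    for n in range(max_n, 1, -1):
        prefix = tuple(history[-(n - 1):])
        if prefix in tables[n]:
            return [w for w, _ in tables[n][prefix].most_common(k)]
    return [w for w, _ in unigrams.most_common(k)]

tokens = "the cat sat on the mat and the cat ran".split()
unigrams, tables = build_ngram_tables(tokens)
print(predict(["on", "the"], unigrams, tables))  # ['mat']
```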
The word prediction app will present a text box into which the user can type. Below the text box will be a horizontal list of possible next words; the user can click on any one of them and it will be added as the next word in the text box. A separate small window, which can be minimized, will hold a running accuracy score (the fraction of typed words for which the user accepted one of the suggested words) along with a reset button to restart the accuracy measurement. Words in the text box that were chosen from the prediction list will be shown in a different font or color to make the prediction effectiveness visible.
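The running accuracy score amounts to a simple counter, sketched below; the class and method names are illustrative and not part of the actual app.

```python
class AccuracyTracker:
    """Tracks how many typed words were accepted from the suggestion list."""

    def __init__(self):
        self.reset()

    def reset(self):
        """The app's reset button would call this to restart the measurement."""
        self.accepted = 0
        self.total = 0

    def record(self, chosen_from_suggestions):
        """Call once per word; True if the user clicked a suggested word."""
        self.total += 1
        if chosen_from_suggestions:
            self.accepted += 1

    @property
    def score(self):
        return self.accepted / self.total if self.total else 0.0

tracker = AccuracyTracker()
for accepted in [True, False, True, True]:
    tracker.record(accepted)
print(f"accuracy: {tracker.score:.0%}")  # 75%
```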
There are a number of aspects of the model that we will evaluate during development:
Prediction accuracy is likely to be quite low. Even though just 7,291 unique words account for 90% of all word occurrences, the probability that a handful of suggestions will match a given next word seems relatively low. A key part of the implementation process will be to evaluate accuracy and look for ways to improve it.
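One way such an evaluation might look is to measure, on held-out text, how often the actual next word appears among the top few suggestions. The sketch below reuses the `predict()` function from the back-off sketch above; the train/test split and tokenization are assumptions.

```python
def top_k_accuracy(test_tokens, unigrams, tables, k=4, max_n=3):
    """Fraction of held-out words that appear among the top-k suggestions."""
    hits = trials = 0
    for i in range(max_n - 1, len(test_tokens)):
        history = test_tokens[i - (max_n - 1):i]
        suggestions = predict(history, unigrams, tables, k=k, max_n=max_n)
        trials += 1
        if test_tokens[i] in suggestions:
            hits += 1
    return hits / trials if trials else 0.0

# e.g. held_out = "the cat sat on the mat".split()
# print(top_k_accuracy(held_out, unigrams, tables))
```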
We assume for this project that building the lookup table used by the model is an offline process and does not need to be done dynamically (although it should be possible to extend the model to start with an a priori lookup table and then modify it dynamically based on the typing habits of an individual user). The prediction tool will use that table to do the lookups on the fly. A key consideration will be making sure that the lookup table can be searched and options presented to the user in real time (i.e., it must have a response time on the order of tens of milliseconds).
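One possible shape for this offline/online split, building on the n-gram tables from the earlier sketch, is to precompute only the top few continuations for each observed prefix and persist them, so the runtime work per keystroke is a handful of dictionary lookups. The file name and the choice of pickle for storage are assumptions for illustration only.

```python
import pickle

def build_lookup_table(unigrams, tables, k=5):
    """Offline step: keep only the top-k continuations for each observed prefix."""
    lookup = {(): [w for w, _ in unigrams.most_common(k)]}  # fallback when nothing matches
    for prefix_table in tables.values():
        for prefix, continuations in prefix_table.items():
            lookup[prefix] = [w for w, _ in continuations.most_common(k)]
    return lookup

def suggest(history, lookup, max_n=3):
    """Runtime step: a few dictionary lookups, fast enough for per-keystroke use."""
    for n in range(max_n, 1, -1):
        words = lookup.get(tuple(history[-(n - 1):]))
        if words:
            return words
    return lookup[()]

# Offline: build once and save to disk; online: the app loads the table at startup.
lookup = build_lookup_table(unigrams, tables)
with open("lookup_table.pkl", "wb") as f:
    pickle.dump(lookup, f)
```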
One goal of our next word predictor is to avoid suggesting profane words. Before building our lookup table, we will filter out any n-grams that contain words appearing on a list of profane terms released by Google in 2013.
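A sketch of how that filtering step might be applied to the n-gram counts before the lookup table is built is shown below; the word-list file name is illustrative, standing in for wherever the Google 2013 list is stored.

```python
from collections import Counter

def load_profanity_list(path):
    """One word per line; lowercased for matching."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def drop_profane_ngrams(ngram_counts, profane_words):
    """Remove any n-gram (prefix plus predicted word) containing a word on the list."""
    clean = {}
    for prefix, continuations in ngram_counts.items():
        if any(word in profane_words for word in prefix):
            continue
        kept = Counter({w: c for w, c in continuations.items() if w not in profane_words})
        if kept:
            clean[prefix] = kept
    return clean
```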