Overview

The Coursera Data Science Capstone project involves building a word prediction model that would be useful to users of virtual keyboards on smartphones or tablets. A number of files that can be used to produce such a model were provided, containing examples of Twitter messages (tweets), blog entries, and news stories in a number of different languages.

For the purposes of this project, we will focus on the three English-language files provided.

Sample Corpora

The table below summarizes the three files.

| File Name | Size (MB) | Lines | Sentences | Words | Unique Words | Unique Words for 90% Coverage | Single-Occurrence Words | Words per Sentence |
|-----------|-----------|-------|-----------|-------|--------------|-------------------------------|-------------------------|--------------------|
| en_US.twitter.txt | 301 | 2,360,148 | 3,765,661 | 29,679,912 | 426,962 | 5,795 | 262,103 | 7.9 |
| en_US.blogs.txt | 248 | 899,288 | 2,346,724 | 37,536,159 | 415,802 | 7,156 | 227,562 | 16.0 |
| en_US.news.txt | 19 | 77,259 | 151,935 | 2,594,946 | 95,277 | 8,883 | 47,500 | 17.1 |
| Totals | 569 | 3,336,695 | 6,264,320 | 69,811,017 | 732,978 | 7,291 | 439,269 | 11.1 |

The Twitter file has the largest number of lines and sentences, but the blog file has the largest number of words. Tweets have the greatest number of unique words, but many of those are contrived letter strings rather than English-language words. For example, there are 75,001 unique hashtags represented, with values such as #neversaynever, #oscars, and #obama. Both blog posts and news articles have more than twice as many words per sentence as tweets.

Notably, of the 732,978 unique words in the combined corpora, 439,269 (60%) occur only once.
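The counts above can be derived along the lines of the following sketch, which assumes the files sit in the working directory and uses the stringi package (the file path and function name are illustrative):

```r
# Sketch: derive basic counts for one corpus file.
library(stringi)

summarize_corpus <- function(path) {
  lines     <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  sentences <- sum(lengths(stri_split_boundaries(lines, type = "sentence")))
  words     <- unlist(stri_extract_all_words(stri_trans_tolower(lines)))
  words     <- words[!is.na(words)]
  freq      <- table(words)              # word frequency table, reused in later sketches
  data.frame(
    file         = basename(path),
    size_mb      = round(file.size(path) / 1024^2),
    lines        = length(lines),
    sentences    = sentences,
    words        = length(words),
    unique_words = length(freq),
    singletons   = sum(freq == 1)        # words that occur exactly once
  )
}

# Example (path is a placeholder):
# summarize_corpus("en_US.twitter.txt")
```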

Distribution of Words

The chart below plots the cumulative percentage of unique words against the cumulative percentage of total words across all three corpora. Fewer than 1% of the unique words account for 90% of all word occurrences, so any prediction model is likely to use only a small fraction of the available words.
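This coverage curve can be computed from a table of word counts such as the freq object built inside the earlier sketch; the following is a sketch of the calculation, not the exact code used for the chart:

```r
# Sketch: cumulative share of total word occurrences covered by the
# most frequent unique words (freq is a table of word counts, as above).
coverage <- function(freq) {
  counts <- sort(as.integer(freq), decreasing = TRUE)
  data.frame(
    pct_unique_words = seq_along(counts) / length(counts) * 100,
    pct_occurrences  = cumsum(counts) / sum(counts) * 100
  )
}

# Unique words needed to cover 90% of occurrences:
# cov <- coverage(freq); which(cov$pct_occurrences >= 90)[1]
```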

The chart below shows the frequency of occurrence for the top 50 unique words in the combined corpus.
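A sketch of how such a top-50 chart might be produced with ggplot2, again assuming the freq frequency table from the earlier sketch:

```r
# Sketch: bar chart of the 50 most frequent words (assumes freq from above).
library(ggplot2)

top50 <- sort(freq, decreasing = TRUE)[1:50]
df    <- data.frame(word = names(top50), count = as.integer(top50))

ggplot(df, aes(x = reorder(word, count), y = count)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word", y = "Occurrences",
       title = "Top 50 words in the combined corpus")
```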

Prediction Model Strategy

The goal of the project is to develop a tool that suggests possible next words when someone is typing on a smartphone or tablet virtual keyboard. Note that such tools typically offer multiple alternatives, usually four or even five possible words. The basic strategy will be to treat typed words as a Markov chain, which means that predictions about future words can be made based only on the most recently typed words. A sequence of the most recently typed words is referred to as an “n-gram”, where n is the number of words in the sequence. For example, a 1-gram is the most recently typed word, a bi-gram is the two most recently typed words, a tri-gram is three words, etc. The approach we will use to make our predictions is called Katz’s Backoff Model, which can be described as follows (a code sketch of the procedure appears after the list):

  1. Create a table of all n-grams observed in the corpora, where n is between 1 and x (a number to be determined as part of the model development process). For each such n-gram, there will be an entry for each potential next word and the probability of that next word occurring, sorted in descending order of probability within each n-gram.
  2. Establish the number of words that will be offered as alternative predictions. Call that number predcount.
  3. To predict the next word, look up the n-gram formed by the most recent x typed words in the table, and add the first predcount next words listed there to the list of alternatives.
  4. If the count of alternative next words is less than predcount, subtract 1 from the n-gram size and repeat step 3.
  5. If there are still fewer than predcount alternatives after exhausting all n-grams, fill the remainder of the list from a table of the most frequently occurring words in the corpora.
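The lookup portion of this procedure (steps 3-5) might look like the sketch below. It assumes a pre-built ngrams data frame with columns prefix (the preceding words, space-separated), nextword, and prob, already sorted as described in step 1, plus a character vector top_words of the most frequent unigrams; the names are illustrative, and the sketch omits the discounting that a full Katz implementation would apply:

```r
# Sketch of steps 3-5: back off from the longest prefix to shorter ones,
# then fall back to the overall most frequent words.
predict_next <- function(typed, ngrams, top_words, predcount = 4, x = 3) {
  words <- tail(unlist(strsplit(tolower(typed), "\\s+")), x)
  preds <- character(0)
  for (n in rev(seq_len(length(words)))) {                  # step 4: shrink the n-gram
    prefix  <- paste(tail(words, n), collapse = " ")
    matches <- ngrams$nextword[ngrams$prefix == prefix]     # step 3: table lookup
    preds   <- unique(c(preds, matches))
    if (length(preds) >= predcount) break
  }
  if (length(preds) < predcount) {                          # step 5: fill from top words
    preds <- unique(c(preds, top_words))
  }
  head(preds, predcount)
}
```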

The Shiny App

The word prediction app will present a text box into which the user can type. Below the text box will be a horizontal list of possible next words; the user can click on any one of them and it will be added as the next word in the text box. A separate small window, which can be minimized, will hold the running accuracy score (the number of times the user chose a predicted word for inclusion in their message relative to the total number of words typed) as well as a reset button to restart the accuracy measurement. Words in the text box that were chosen from the prediction list will be shown in a different font or color to enable visualization of the prediction effectiveness.
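A rough sketch of how this interface might be laid out in Shiny follows; the widget names are placeholders, and the click-handling and accuracy bookkeeping are omitted, so this illustrates the layout rather than the final design:

```r
# Sketch of a possible UI layout (input/output names are placeholders).
library(shiny)

ui <- fluidPage(
  textAreaInput("typed", "Type your message:", rows = 3),
  uiOutput("suggestions"),          # horizontal row of clickable next-word buttons
  wellPanel(
    textOutput("accuracy"),         # running accuracy score
    actionButton("reset", "Reset accuracy")
  )
)

server <- function(input, output, session) {
  output$suggestions <- renderUI({
    preds <- predict_next(input$typed, ngrams, top_words)   # from the earlier sketch
    do.call(fluidRow, lapply(preds, function(w) actionButton(paste0("pick_", w), w)))
  })
  output$accuracy <- renderText("Accuracy: 0 / 0")          # placeholder
}

# shinyApp(ui, server)
```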

Implementation Considerations

There are a number of aspects of the model that we will evaluate during development:

Expected Accuracy

Prediction accuracy is likely to be quite low. Even though only 7,291 unique words account for 90% of all word occurrences, the probability that a handful of choices will match a given next word is still relatively low. A key part of the implementation process will be to evaluate accuracy and look for ways to improve it.
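One way this evaluation could be done is a top-k hit rate on held-out sentences: the fraction of words for which the actual next word appears among the predcount suggestions. The following is a sketch under that assumption, reusing the hypothetical predict_next function from above:

```r
# Sketch: top-k hit rate over a character vector of held-out test sentences.
topk_accuracy <- function(heldout, ngrams, top_words, predcount = 4) {
  hits <- 0; total <- 0
  for (sent in heldout) {
    words <- unlist(strsplit(tolower(sent), "\\s+"))
    if (length(words) < 2) next
    for (i in 2:length(words)) {
      context <- paste(words[1:(i - 1)], collapse = " ")
      preds   <- predict_next(context, ngrams, top_words, predcount)
      hits    <- hits + (words[i] %in% preds)
      total   <- total + 1
    }
  }
  hits / total
}
```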

Performance

We assume for this project that the process of building the lookup table used by the model is an offline process and does not need to be done dynamically (although it should be possible to extend the model to start with an a priori lookup table and then dynamically modify it based on the typing habits of an individual user). The prediction tool will use that table to do the lookups “on the fly”. A key consideration will be making sure that the lookup table can be searched and options presented to the user in real time (i.e., it must have a response time on the order of tens of milliseconds).
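One option for meeting that response-time target (an assumption, not a committed design) is to store the table as a data.table keyed on the prefix, so each lookup is a binary search on an index rather than a scan of the whole table:

```r
# Sketch: keyed data.table lookup; each prefix search uses a binary search
# on the key rather than a full scan (column names as in the earlier sketch).
library(data.table)

ngrams_dt <- as.data.table(ngrams)        # columns: prefix, nextword, prob
setkey(ngrams_dt, prefix)                 # sort and index by prefix

lookup <- function(p, predcount = 4) {
  hits <- ngrams_dt[.(p), nomatch = 0L]   # keyed join on the prefix column
  head(hits[order(-prob), nextword], predcount)
}
```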

Profanity

One goal of our next-word predictor is to avoid suggesting profane words. Before building our lookup table, we will filter out any n-grams containing words that appear in a list released by Google in 2013.
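The filtering step might look like the sketch below, assuming the list has been downloaded to a local text file and the n-grams sit in a data frame with one n-gram per row (the file path and column name are placeholders):

```r
# Sketch: drop any n-gram whose words include a term from the profanity list.
profane <- tolower(readLines("profanity_list.txt", encoding = "UTF-8"))

is_clean <- function(ngram) {
  !any(unlist(strsplit(tolower(ngram), "\\s+")) %in% profane)
}

ngram_table <- ngram_table[vapply(ngram_table$ngram, is_clean, logical(1)), ]
```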