The Coursera Data Science Capstone project involves building a word prediction model that would be useful to users of virtual keyboards on smartphones or tablets. A number of files that can be used to produce such a model were provided, containing examples of Twitter messages (tweets), blog entries, and news stories in several languages.
For the purposes of this project, we will focus on the three English-language files provided.
The table below summarizes the three files.
| File Name | Size (MB) | Lines | Sentences | Words | Unique Words | Unique Words Covering 90% of Occurrences | Single-Occurrence Words | Words per Sentence |
|---|---|---|---|---|---|---|---|---|
| en_US.twitter.txt | 301 | 2,360,148 | 3,765,661 | 29,679,912 | 426,962 | 5,795 | 262,103 | 7.9 |
| en_US.blogs.txt | 248 | 899,288 | 2,346,724 | 37,536,159 | 415,802 | 7,156 | 227,562 | 16.0 |
| en_US.news.txt | 19 | 77,259 | 151,935 | 2,594,946 | 95,277 | 8,883 | 47,500 | 17.1 |
| Totals | 569 | 3,336,695 | 6,264,320 | 69,811,017 | 732,978 | 7,291 | 439,269 | 11.1 |
The Twitter file has the largest number of lines and sentences, but the blog file has the largest number of words. Tweets have the greatest number of unique words, but many of those are contrived letter strings rather than English-language words; for example, there are 75,001 unique hashtags, with values such as #neversaynever, #oscars, and #obama. Both blog posts and news articles average more than twice as many words per sentence as tweets.
Notably, of the 732,978 unique words in the combined corpora, 439,269 (60%) occur only once.
The chart below plots the cumulative percentage of unique words against the cumulative percentage of total word occurrences across all three corpora. Fewer than 1% of the unique words account for 90% of all word occurrences, so any prediction model is likely to make use of only a small fraction of the available words.
The chart below shows the frequency of occurrence for the top 50 unique words in the combined corpus.
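For reference, the frequency and coverage figures behind these charts could be derived along the lines of the following sketch. This is not the exact processing used to build the table above; the simple lowercase regex tokenizer is an assumption, and only the corpus file names come from the project.

```python
import re
from collections import Counter

def word_counts(path):
    """Count word occurrences in one corpus file using a naive lowercase tokenizer."""
    counts = Counter()
    with open(path, encoding="utf-8", errors="ignore") as f:
        for line in f:
            counts.update(re.findall(r"[a-z']+", line.lower()))
    return counts

# Combine the three English corpora.
combined = Counter()
for name in ["en_US.twitter.txt", "en_US.blogs.txt", "en_US.news.txt"]:
    combined.update(word_counts(name))

# How many of the most frequent unique words are needed to cover 90% of all occurrences?
total = sum(combined.values())
covered = needed = 0
for word, count in combined.most_common():
    covered += count
    needed += 1
    if covered >= 0.9 * total:
        break

print(f"{needed:,} of {len(combined):,} unique words cover 90% of all occurrences")
print("Top 50 words:", [w for w, _ in combined.most_common(50)])
```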
The goal of the project is to develop a tool that suggests possible next words when someone is typing on a smartphone or tablet virtual keyboard. Such tools typically offer multiple alternatives, usually four or five possible words. The basic strategy will be to treat typed words as a Markov chain, which means that predictions about the next word can be made based only on the most recently typed words. The most recently typed words form an “n-gram”, where n is the number of words in the sequence: a unigram (1-gram) is the most recently typed word, a bigram is the two most recently typed words, a trigram is three words, and so on. The approach we will use to make our predictions is Katz’s back-off model: the next word is predicted from the longest n-gram in the training data that matches the words just typed, backing off to progressively shorter n-grams (with some probability mass discounted and redistributed to the lower-order estimates) when no longer match is found.
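To make the back-off idea concrete, here is a minimal Python sketch, not the project implementation, that builds n-gram tables and predicts by backing off from the longest matching prefix. For simplicity it ranks candidates by raw counts and omits the Good-Turing discounting that distinguishes full Katz back-off; the function names and the toy sentence are purely illustrative.

```python
from collections import Counter, defaultdict

def build_ngram_tables(tokens, max_n=3):
    """Count n-grams up to max_n; each (n-1)-word prefix maps to a Counter of next words."""
    unigrams = Counter(tokens)
    tables = {n: defaultdict(Counter) for n in range(2, max_n + 1)}
    for n in range(2, max_n + 1):
        for i in range(len(tokens) - n + 1):
            prefix = tuple(tokens[i:i + n - 1])
            tables[n][prefix][tokens[i + n - 1]] += 1
    return unigrams, tables

def predict(history, unigrams, tables, k=4, max_n=3):
    """Back off from the longest matching prefix to shorter ones, then to unigrams."""
    for n in range(max_n, 1, -1):
        prefix = tuple(history[-(n - 1):])
        if prefix in tables[n]:
            return [w for w, _ in tables[n][prefix].most_common(k)]
    return [w for w, _ in unigrams.most_common(k)]

tokens = "the cat sat on the mat and the cat ran".split()
unigrams, tables = build_ngram_tables(tokens)
print(predict(["on", "the"], unigrams, tables))  # ['mat']
```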
The word prediction app will present a text box into which the user can type. Below the text box will be a horizontal list of possible next words; the user can click on any one of them and it will be added as the next word in the text box. A separate small window, which can be minimized, will hold a running accuracy score (the fraction of typed words for which the user accepted one of the suggested words) along with a reset button to restart the accuracy measurement. Words in the text box that were chosen from the prediction list will be shown in a different font or color to make the prediction effectiveness visible.
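The running accuracy score amounts to a simple counter, sketched below; the class and method names are illustrative and not part of the actual app.

```python
class AccuracyTracker:
    """Tracks how many typed words were accepted from the suggestion list."""

    def __init__(self):
        self.reset()

    def reset(self):
        """The app's reset button would call this to restart the measurement."""
        self.accepted = 0
        self.total = 0

    def record(self, chosen_from_suggestions):
        """Call once per word; True if the user clicked a suggested word."""
        self.total += 1
        if chosen_from_suggestions:
            self.accepted += 1

    @property
    def score(self):
        return self.accepted / self.total if self.total else 0.0

tracker = AccuracyTracker()
for accepted in [True, False, True, True]:
    tracker.record(accepted)
print(f"accuracy: {tracker.score:.0%}")  # 75%
```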
There are a number of aspects of the model that we will evaluate during development:
Prediction accuracy is likely to be quite low. Even though just 7,291 unique words account for 90% of all word occurrences, the probability that a handful of suggestions will match a given next word seems relatively low. A key part of the implementation process will be to evaluate accuracy and look for ways to improve it.
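One way such an evaluation might look is to measure, on held-out text, how often the actual next word appears among the top few suggestions. The sketch below reuses the `predict()` function from the back-off sketch above; the train/test split and tokenization are assumptions.

```python
def top_k_accuracy(test_tokens, unigrams, tables, k=4, max_n=3):
    """Fraction of held-out words that appear among the top-k suggestions."""
    hits = trials = 0
    for i in range(max_n - 1, len(test_tokens)):
        history = test_tokens[i - (max_n - 1):i]
        suggestions = predict(history, unigrams, tables, k=k, max_n=max_n)
        trials += 1
        if test_tokens[i] in suggestions:
            hits += 1
    return hits / trials if trials else 0.0

# e.g. held_out = "the cat sat on the mat".split()
# print(top_k_accuracy(held_out, unigrams, tables))
```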
We assume for this project that building the lookup table used by the model is an offline process and does not need to be done dynamically (although it should be possible to extend the model to start with an a priori lookup table and then modify it dynamically based on the typing habits of an individual user). The prediction tool will use that table to do the lookups on the fly. A key consideration will be making sure that the lookup table can be searched and options presented to the user in real time (i.e., it must have a response time on the order of tens of milliseconds).
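One possible shape for this offline/online split, building on the n-gram tables from the earlier sketch, is to precompute only the top few continuations for each observed prefix and persist them, so the runtime work per keystroke is a handful of dictionary lookups. The file name and the choice of pickle for storage are assumptions for illustration only.

```python
import pickle

def build_lookup_table(unigrams, tables, k=5):
    """Offline step: keep only the top-k continuations for each observed prefix."""
    lookup = {(): [w for w, _ in unigrams.most_common(k)]}  # fallback when nothing matches
    for prefix_table in tables.values():
        for prefix, continuations in prefix_table.items():
            lookup[prefix] = [w for w, _ in continuations.most_common(k)]
    return lookup

def suggest(history, lookup, max_n=3):
    """Runtime step: a few dictionary lookups, fast enough for per-keystroke use."""
    for n in range(max_n, 1, -1):
        words = lookup.get(tuple(history[-(n - 1):]))
        if words:
            return words
    return lookup[()]

# Offline: build once and save to disk; online: the app loads the table at startup.
lookup = build_lookup_table(unigrams, tables)
with open("lookup_table.pkl", "wb") as f:
    pickle.dump(lookup, f)
```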
One goal of our next word predictor is to avoid suggesting profane words. Before building our lookup table, we will filter out any n-grams that contain words appearing on a list of profane terms released by Google in 2013.
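A sketch of how that filtering step might be applied to the n-gram counts before the lookup table is built is shown below; the word-list file name is illustrative, standing in for wherever the Google 2013 list is stored.

```python
from collections import Counter

def load_profanity_list(path):
    """One word per line; lowercased for matching."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

def drop_profane_ngrams(ngram_counts, profane_words):
    """Remove any n-gram (prefix plus predicted word) containing a word on the list."""
    clean = {}
    for prefix, continuations in ngram_counts.items():
        if any(word in profane_words for word in prefix):
            continue
        kept = Counter({w: c for w, c in continuations.items() if w not in profane_words})
        if kept:
            clean[prefix] = kept
    return clean
```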