Milestone Report

for Capstone course, Data Science specialization, Coursera

Executive Summary

Exploratory data analysis of a modest dataset combining tweets, blog posts and news articles suggests how to build a machine learning model to predict the next word typed on a small-screen device like a smartphone. Since these texts are not a scrambled “bag of words”, the model must be capable of predicting sequences of words. To keep the model both accurate and frugal with limited computational resources, the data should be whittled down, transformed into short strings of words, re-coded for punctuation, and filled out with some missing words.

Introduction

While a huge corpus like the 440-million-word Corpus of Contemporary American English would not work on a small-screen device like a smartphone, the roughly half-million sample of tweets, blog posts and news articles in the Helio Host English dataset might well support training a predictive model accurate enough to satisfy most users. So, this exploratory analysis describes the dimensions of that more manageable corpus of everyday (natural) language and transforms it into short, consecutive word strings (N-grams), non-consecutive word strings (Skip-grams), and full sentences (Sentence-grams). It demonstrates that, pruned down to the top 50% or 90% of words (or of one “gram” or another), this corpus requires little purging of either foreign or profane words. And, since such a modest corpus is, by definition, incomplete, the analysis ends by augmenting it with several hundred common “missing” words.

A Balance of Tweets, Blogs and News

140-character tweets require more inventive abbreviation, and free-form blog posts allow more creative word sequences, than news stories edited for spelling and grammar. So, the Helio Host collection of all three types of everyday written English is a promising starting point. And, as Table 1 shows, it is well balanced by number of texts (“lines”), “characters” and “words”, even as the shorter tweets add up to a few more characters and a few more words overall than the blogs or news articles.

Table 1. A Balance of Tweets, Blogs and News

Corpus      Lines        Chars          Words
combined    2,360,148    152,773,402    29,344,757
twitter       899,288    199,211,946    36,816,390
blogs       1,010,242    193,824,076    33,464,870
news        1,423,225    182,031,137    33,224,696

This report analyses a one-third random “combined” sample drawn from those full “twitter”, “blogs” and “news” datasets. It ended up with a few more texts (lines) than any one of the three originals, but slightly fewer characters and words.
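As a rough sketch of how such counts and the random sample could be produced in base R (the file names are placeholders for the Helio Host datasets, not the report’s actual paths):

```r
# Sketch only: count lines, characters and words for one corpus file, then
# draw a reproducible random sample of its lines. File names are placeholders.
count_corpus <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  data.frame(Lines = length(lines),
             Chars = sum(nchar(lines)),
             Words = sum(lengths(strsplit(lines, "\\s+"))))
}

sample_corpus <- function(path, fraction = 1/3, seed = 1234) {
  set.seed(seed)                               # reproducible sample
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[sample.int(length(lines), floor(fraction * length(lines)))]
}

# combined <- c(sample_corpus("en_US.twitter.txt"),
#               sample_corpus("en_US.blogs.txt"),
#               sample_corpus("en_US.news.txt"))
```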

The most frequent words in English are a few function, or “stop”, words like “the”, “is”, “at”, “and” and “which”, and that is true for all three original datasets as well as the “combined” sample. So, to get a feel for the slight differences in the vocabularies of each type of English, Fig. 1 looks just below the topmost layer of most frequent words. It lists words equally spaced across each collection’s top decile, or 10%. (Specifically, they are the words sitting on the .01, .02, .03, .04, .05, .06, .07, .08 and .09 break points of each dataset’s word frequency distribution.)

Those representative examples of most-frequent words suggest that news articles tend to repeat words more (the abbreviation “mph” appears some 1,000 times), blogs use more long words (“malnourished”, “blemishes”), and the long words in tweets are simpler (“secondary”, “gentleness”). While the vocabulary differences may be subtle, they support the expectation that a combined sample covers a broader range of frequent words (“mph” less frequently, some simple words like “mounted” and some sophisticated ones like “filibuster”).
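A minimal sketch of how those representative words can be pulled out, assuming `combined` is the character vector of sampled lines and using a deliberately simple tokenizer:

```r
# Sketch: build a word-frequency table from the sampled lines and pull the
# words sitting on the .01-.09 break points of the distribution (all of which
# fall inside the top decile), as in Fig. 1.
words <- unlist(strsplit(tolower(combined), "[^a-z']+"))   # simple tokenizer
words <- words[nzchar(words)]
freq  <- sort(table(words), decreasing = TRUE)

breaks <- round(seq(0.01, 0.09, by = 0.01) * length(freq))
names(freq)[breaks]          # one representative word per break point
```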

Whittling down the combined sample

An overwhelming majority of the nearly 450,000 unique words in the combined sample appear very infrequently, most only once. So, to handle the much larger datasets of N-grams (below) and keep the model computationally efficient enough for a smartphone web app, the sample must be shrunk further. As Fig. 2 shows, although 138 words suffice to cover 50% of the current combined sample, they are much too vague (say, “being”) for predicting word sequences.

Fig. 2: Words from the bottom 10% of the lists that cover 50% & 90% of all words

Instead, it might make more sense to use the 45,954 words that cover fully 90% (“tedium”, the possessive proper name “kanes”), which would pack greater predictive punch while still pruning the unique-words list down significantly from about 450,000. (The representative words in Fig. 2 lie on the .91, .92, .93, .94, .95, .96, .97, .98 and .99 break points in the top 10% of the frequency distribution.)
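The 50% and 90% coverage counts come from a cumulative-frequency calculation along these lines (a sketch reusing the `freq` table from the previous snippet):

```r
# Sketch, reusing `freq` from above: how many of the most frequent words are
# needed to cover 50% and 90% of all word occurrences in the sample.
coverage <- cumsum(as.numeric(freq)) / sum(freq)
n_50 <- which(coverage >= 0.50)[1]        # 138 words in the current sample
n_90 <- which(coverage >= 0.90)[1]        # almost 46,000 words
vocab_90 <- names(freq)[seq_len(n_90)]    # the whittled-down 90% coverage list
```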

Note: Only one of the 140 unique “Quiz 1” words, “pointbreak”, is missing from the large combined sample of about 450,000 unique words, and only three more are missing from the whittled-down 90%-coverage list of almost 46,000 words (“bluest”, “smelliest” and “ohhhhh”).

Fig. 3: N-grams across Top 5% Most Frequent by Type

One reason to trim down the sample used to build the model is that the larger, more complex word sequences, or “grams” (which also pack more of a predictive punch), demand significantly more computational power. A glance at typical N-, Skip- and Sentence-grams from the current sample suggests how much more information the longer sequences contain. (Compare the short sentences “oh i know” and “big time rush” to 2-grams like “across from” and “is somewhat”, or 4-word Skip-grams like “lets sox go lets” and “shoot moon you land” to a 4-gram like “i appreciate all of” and the 3-gram “one i just”.)

If feasible after whittling down the combined sample, the model algorithm should include the few, short Sentence-grams that are repeated, supplementing them with whichever combination of N- and Skip-grams experimentation shows is most accurate. (The representative “grams” above sit on the .001, .002, .003, .004 and .005 break points in the top 5% for each type.)
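A minimal sketch of how the three “gram” types can be built in base R; the skip pattern shown (keeping every other word) is just one illustrative choice among many, and cleaned whole lines stand in for Sentence-grams:

```r
# Sketch of building N-grams and Skip-grams from a vector of cleaned words.
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  sapply(seq_len(length(words) - n + 1),
         function(i) paste(words[i:(i + n - 1)], collapse = " "))
}

skip_grams <- function(words, n, skip = 1) {
  step <- skip + 1                             # distance between kept words
  last <- length(words) - (n - 1) * step
  if (last < 1) return(character(0))
  sapply(seq_len(last),
         function(i) paste(words[seq(i, by = step, length.out = n)],
                           collapse = " "))
}

w <- c("i", "appreciate", "all", "of", "your", "support")
ngrams(w, 4)       # "i appreciate all of", "appreciate all of your", ...
skip_grams(w, 3)   # "i all your", "appreciate of support"
```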

Very few foreign or profane words

Ironically, perhaps, the 11,457 top-10% most-frequent 3-grams contain only one foreign-sounding phrase, “cinco de mayo”, and that is the well-known Spanish name of a holiday invented not in Latin America or Spain but in the English-speaking US. That is not counting the even better-known British-English spellings “centre” and “favourite”. (The custom cleaning function, written to clean out everything outside the alphanumeric character “class”, removed not only words in non-western scripts like Chinese, Greek and Cyrillic, but also words with accented characters in western languages like German, French and Portuguese.)

And, only two of the most common profane words remain to be cleaned up later.
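The cleaning just described might look roughly like this sketch: lower-case everything, keep only the alphanumeric character class plus spaces, and drop lines that match a user-supplied profanity list (the file name is hypothetical):

```r
# Sketch of the cleaning step and the later profanity clean-up.
clean_lines <- function(lines, profanity = character(0)) {
  lines <- tolower(lines)
  lines <- gsub("[^a-z0-9 ]", " ", lines)      # strip everything outside the class
  lines <- gsub("\\s+", " ", trimws(lines))    # tidy up the whitespace left behind
  if (length(profanity) > 0) {
    pattern <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
    lines   <- lines[!grepl(pattern, lines)]
  }
  lines
}

# cleaned <- clean_lines(combined, profanity = readLines("profanity_list.txt"))
```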

Remaining Issues: “missing words”, apostrophes & dummy markers

Only two words on the freely available COCA Top-5000 list (“constraint” and “mmhmm”) are missing from this medium-sized, whittled-down sub-sample covering 90% of the initial combined sample. So, to identify more of the common words the sample lacks, some larger open-source word list still must be found.
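Checking any such reference list against the 90%-coverage vocabulary is a one-line set operation; in this sketch the COCA file name is hypothetical and `vocab_90` is the coverage list built earlier:

```r
# Sketch: which common words does the 90%-coverage vocabulary lack?
coca_5000 <- tolower(readLines("coca_top5000.txt"))
setdiff(coca_5000, vocab_90)     # candidates to add back into the vocabulary
```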

One piece of punctuation cleaned away that should be restored is the lowly apostrophe, since the contractions it forms are so common in “grams” of all types. There are 26 different contractions, from “im” to “companys” (their apostrophes stripped), scattered throughout those 11,457 top-10% most-frequent 3-grams, for instance. So, having to predict one is only to be expected.
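One possible way to restore the most common contractions after the fact is a small lookup table; the table below is a hand-made illustration, not a complete list:

```r
# Sketch: restore the most common contractions from their stripped forms.
contractions <- c(im = "i'm", dont = "don't", cant = "can't",
                  ive = "i've", youre = "you're", theyre = "they're")
restore_apostrophes <- function(words) {
  hit        <- words %in% names(contractions)
  words[hit] <- contractions[words[hit]]
  words
}

restore_apostrophes(c("im", "sure", "you", "dont", "mind"))
# "i'm"  "sure"  "you"  "don't"  "mind"
```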

Finally, the common practice of restoring not every single cleaned-out punctuation mark, but one global, dummy marker to represent all of them (except, perhaps, the apostrophe) probably makes sense, too.
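A sketch of that dummy-marker idea, using a made-up token (“zzpunct”) that survives the alphanumeric cleaning step:

```r
# Sketch: replace sentence-ending punctuation with a single made-up token
# before cleaning, so the model can learn where sentences break.
mark_punctuation <- function(lines, marker = " zzpunct ") {
  gsub("[.!?;:]+", marker, lines)    # extra whitespace is collapsed later
}

mark_punctuation("oh i know. big time rush!")
# "oh i know zzpunct  big time rush zzpunct "
```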

A State-of-the-Art Predictive Model

Since any word sequence is not static but dynamic and temporal, a Hidden Markov model should be able to capture the relationship between the words already typed and the next word, and certainly much better than even sophisticated but static classification models like Random Forests or Support Vector Machines. So, the recommended benchmark should be a Hidden Markov model with a sliding window to capture short sequences of words, Good-Turing smoothing to impute the frequency of “missing” words, and a “back off” process to accommodate more than one size of N-gram.
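The “back off” part of that recommendation can be sketched on its own; this is not the full Hidden Markov model with Good-Turing smoothing, just the fallback from longer to shorter grams, and it assumes `tri`, `bi` and `uni` are frequency tables named by their space-separated grams and sorted by frequency:

```r
# Sketch of the "back off" step only: try the 3-gram table, then the 2-gram
# table, then fall back to the single most frequent word.
predict_next <- function(typed, tri, bi, uni) {
  words <- tail(unlist(strsplit(tolower(typed), "\\s+")), 2)
  if (length(words) == 2) {                     # try the 3-gram table first
    hits <- tri[startsWith(names(tri), paste(c(words, ""), collapse = " "))]
    if (length(hits) > 0)
      return(sub(".* ", "", names(hits)[which.max(hits)]))
  }
  hits <- bi[startsWith(names(bi), paste0(tail(words, 1), " "))]
  if (length(hits) > 0)                         # back off to the 2-gram table
    return(sub(".* ", "", names(hits)[which.max(hits)]))
  names(uni)[1]                                 # last resort: most frequent word
}
```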

Then, time and computational resources permitting, that benchmark Hidden Markov model might usefully be tested against a single-layer Recurrent Neural Network, since the recurrent feedback loops among its neurons can also capture the temporal aspect of word sequences. Unless its accuracy suffers from the whittling down of the everyday-English sample, there is good reason to suspect it might outperform the Hidden Markov benchmark.

A Prototype Web App

To demonstrate that the chosen model is accurate enough for most users even while running efficiently on a modest corpus, a prototype web app will be built with R’s Shiny package (which interfaces with JavaScript, HTML & CSS) and optimized for small screens like smartphones and tablets. Hosting it on the free shinyapps.io service will impose a strict 1 GB hard limit on overall data and computation.
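A minimal sketch of such a Shiny prototype, assuming a `predict_next()` function like the one sketched above and pre-built frequency tables saved to a placeholder .RData file:

```r
# Sketch of the prototype app: one text box in, one predicted word out.
library(shiny)

# load("gram_tables.RData")    # would provide tri, bi and uni

ui <- fluidPage(
  titlePanel("Next-word prediction"),
  textInput("typed", "Type a phrase:", value = ""),
  textOutput("next_word")
)

server <- function(input, output) {
  output$next_word <- renderText({
    if (nzchar(input$typed)) predict_next(input$typed, tri, bi, uni) else ""
  })
}

shinyApp(ui = ui, server = server)
```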