Synopsis

This Milestone Report details the current progress of our text prediction algorithm. The report describes our exploratory analysis of the data and our plans for our eventual app and algorithm.

The data

Our prediction algorithm is based on data from three text corpora: one of tweets, one of blog posts, and one of news stories, all in U.S. English. We loaded these corpora via the read_lines function from the readr package and then combined them all into a single corpus. These corpora have the following general properties:

Corpus   Lines      Characters   Avg. line length (chars)   Longest line (chars)
Blogs    899,288    206,824,505  230                         40,833
News     1,010,242  203,223,159  201                         11,384
Tweets   2,360,148  162,096,022  69                          140
All      4,269,678  572,143,686  134                         40,833
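
For reference, here is a minimal sketch of the loading step described above. The file names and the all_lines object are illustrative assumptions rather than details taken from our actual scripts.

library(readr)

# Load each corpus as a character vector of lines (file names assumed)
blogs  <- read_lines("en_US.blogs.txt")
news   <- read_lines("en_US.news.txt")
tweets <- read_lines("en_US.twitter.txt")

# Combine everything into a single vector of lines
all_lines <- c(blogs, news, tweets)

# Line and character counts like those tabulated above
c(lines = length(all_lines), chars = sum(nchar(all_lines)))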

The full corpus is 831.5 MB. Due to limitations in computer memory and processing power, our prediction algorithm will be based on a random sample of 30% of this data.
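
A sketch of that sampling step, assuming the all_lines vector from above; the seed is arbitrary and the train name is illustrative.

set.seed(1234)  # arbitrary seed, for reproducibility
train <- sample(all_lines, size = floor(0.30 * length(all_lines)))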

Cleaning the data

Our training data is cleaned in a simple, three-step process:

  1. Transform/remove non-ASCII characters (using the stringi package)
  2. Reshape from lines to sentences (using the corpus_reshape function from the quanteda package)
  3. Remove punctuation, symbols, numbers, & URLs (using the tokens function from the quanteda package)

This cleaning is done in tandem with our other data processing steps, which keeps our overall processing time down.
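
Here is a sketch of those three steps, assuming the sampled train vector from above; object names and exact arguments are illustrative rather than lifted from our production code.

library(stringi)
library(quanteda)

# Step 1: transliterate non-ASCII characters to ASCII and drop anything left over
train <- stri_trans_general(train, "Latin-ASCII")
train <- iconv(train, "UTF-8", "ASCII", sub = "")

# Step 2: reshape the corpus from lines to sentences
corp <- corpus_reshape(corpus(train), to = "sentences")

# Step 3: tokenize, removing punctuation, symbols, numbers, and URLs
toks <- tokens(corp,
               remove_punct   = TRUE,
               remove_symbols = TRUE,
               remove_numbers = TRUE,
               remove_url     = TRUE)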

Note that we have chosen not to filter out profanity from the data, because (a) it is difficult to do this without also filtering out non-profane words (such as “analysis” and “pass”); (b) this would have significantly increased our processing time; and (c) profanity is a natural part of any human language.

Exploring the data

Our plan for our eventual app and algorithm is to make our predictions using an n-gram model. Using the quanteda package, we have built 2- to 6-grams from our training data and compiled feature frequency tables for each n-gram size. Below we present barplots of the ten most frequent n-grams of each size in our training data.
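
As an illustration, here is one way to build such a table for the 2-grams, assuming the toks object from the cleaning sketch above. Note that textstat_frequency lives in the quanteda.textstats companion package in current quanteda releases, and storing the result as a data.table is our own choice rather than a requirement.

library(quanteda)
library(quanteda.textstats)  # provides textstat_frequency
library(data.table)

# Build 2-grams (tokens joined with "_") and tabulate their frequencies;
# the same pattern, with n = 3 through 6, yields the other tables
# (e.g. the sixgrams table used in the final section)
toks_2  <- tokens_ngrams(toks, n = 2)
bigrams <- as.data.table(textstat_frequency(dfm(toks_2)))

head(bigrams, 10)  # the ten most frequent 2-grams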

The most frequent 2-grams are not particularly interesting, consisting mostly of common prepositional phrases that pair a preposition with a definite or indefinite article.

The most frequent 3-grams are slightly more interesting, including some common phrases like “thanks for the” and “I want to”. But still, mostly what we’d expect.

The most frequent 4-grams are a little less obvious. Amusingly, “thanks for the follow” ranks fourth, presumably a result of a third of our data consisting of tweets.

The most frequent 5-grams continue the pattern seen in the 4-grams, consisting of common 5-word connecting phrases.

The most frequent 6-grams are more interesting, consisting of several full phrases. Amusingly, “could not be reached for comment” ranks fourth, presumably a result of a third of our data consisting of news stories.

Looking ahead

The next step in designing our algorithm and deploying our app is to take our full 2- to 6-gram frequency tables and use them to generate next-word predictions. For example, to predict the next word after the phrase ‘at the end of the’, we could subset our table of 6-grams to just those rows that start with ‘at the end of the’ and then compare their frequencies, as follows:

# data.table syntax: %like% performs a regular-expression match on the feature column
attheendofthe <- sixgrams[feature %like% "^at_the_end_of_the_"]
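
Continuing that sketch, and assuming sixgrams was built like the bigram table above (so it has feature and frequency columns), the predicted next word would simply be the final token of the most frequent match. The string handling below is purely illustrative.

# Take the most frequent matching 6-gram and strip off the five-word prefix
best_match     <- attheendofthe[which.max(frequency)]
predicted_word <- sub("^at_the_end_of_the_", "", best_match$feature)
predicted_word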

Of course, the actual math that will be used to generate our prediction will be much more sophisticated than this simple example. But that’s the essence of our approach. Onward and upward!