Executive Summary

The capstone project of the JHU Data Science specialization involves building a text prediction algorithm from a set of English-language texts taken from Twitter, blogs, and news sources. In this milestone report, the three source files are accessed for the first time, basic features such as line and word counts are compared, and, for one source file, the most common words and two-word combinations (2-grams) are described. Future plans for the final Shiny app are outlined.

Exploratory Analysis

The basis for the text prediction app is the English-language part of the HC Corpora, which can be accessed at the following address: http://corpora.epizy.com/index.html.

In the first step, the three txt files are read into R, and the number of lines (i.e. entries), the total number of words, and the maximum number of words per line are summarized in the following table (a sketch of one way to compute these counts follows the table).

##     Source   Lines    Words MaxWords
## 1: Twitter 2360148 30373792       47
## 2:   Blogs  899288 37334441     6630
## 3:    News 1010243 34372625     1792
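
The code below is a minimal sketch of how such a summary could be produced; the file paths under data/ and the use of stringi::stri_count_words() for word counting are assumptions, not necessarily the exact method used to generate the table above.

```r
# Sketch: read each file and summarize lines, words, and maximum words per line.
# File names/paths are assumed; adjust to the local location of the corpus files.
library(data.table)
library(stringi)

summarize_file <- function(path, source_name) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words_per_line <- stri_count_words(lines)
  data.table(Source   = source_name,
             Lines    = length(lines),
             Words    = sum(words_per_line),
             MaxWords = max(words_per_line))
}

rbindlist(list(
  summarize_file("data/en_US.twitter.txt", "Twitter"),
  summarize_file("data/en_US.blogs.txt",   "Blogs"),
  summarize_file("data/en_US.news.txt",    "News")
))
```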

To get an idea of the distribution of words per line for all three sources, one can look at the following histograms. Note that bins with high word counts were cut off for both the Blogs and the News sources; a sketch of how such a histogram can be produced follows.
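
As an illustration, the snippet below shows one way such a histogram could be drawn for the Blogs source with ggplot2; the cutoff of 200 words per line and the bin width are arbitrary choices made here, not values taken from the figures.

```r
# Sketch: histogram of words per line for the Blogs source, with the long tail
# truncated at an (assumed) 200 words per line.
library(ggplot2)
library(stringi)

blog_lines  <- readLines("data/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
blog_counts <- data.frame(words = stri_count_words(blog_lines))

ggplot(subset(blog_counts, words <= 200), aes(x = words)) +
  geom_histogram(binwidth = 5) +
  labs(title = "Words per line: Blogs",
       x = "Words per line", y = "Number of lines")
```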

In the final version of the app, it will be important to tabulate many different word sequences (called n-grams, where n refers to the number of words in the sequence), so that the next word can be predicted from, say, the previous 2, 3, or 4 words. As an example for this exploratory analysis, here are the ten most common single words and two-word sequences (2-grams) from the Blogs source alone (one possible way to compute such counts is sketched below).
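
One possible way to obtain such counts is sketched below using the tidytext package; this is an illustration of the general approach, not necessarily the tooling used for the tables in this report.

```r
# Sketch: ten most frequent words and 2-grams in the Blogs source.
library(dplyr)
library(tidytext)

blogs_df <- tibble(text = readLines("data/en_US.blogs.txt",
                                    encoding = "UTF-8", skipNul = TRUE))

# Most common single words.
top_words <- blogs_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  head(10)

# Most common two-word sequences (2-grams).
top_bigrams <- blogs_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE) %>%
  head(10)
```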

Further Plans

For the final version of the app, it will be important to have a system that runs fast enough to respond to user input in real time and that does not use too much memory. As such, there will have to be limits on how many word combinations can be stored. The plan is to compute at least all 4-grams, remove those that appear only one to three times, and save the rest in a data table that can be queried quickly by the interface. There also needs to be a scoring system, probably based on a simple backoff approach, that will, for example, weigh very common 3-grams against extremely rare 4-grams in order to make the most likely prediction. Other features will include profanity filtering and a preliminary benchmark test to assess the quality of the algorithm. A rough sketch of how such a backoff lookup might work is shown below.
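
The sketch that follows illustrates, with made-up counts and a made-up discount factor, how a pruned n-gram table and a simple backoff lookup could fit together; it is a toy illustration of the idea rather than the planned implementation.

```r
# Toy sketch of a backoff lookup: keyed data.tables of pruned n-gram counts,
# queried from the longest available context down to shorter ones.
# The rows and the 0.4 discount factor are illustrative assumptions.
library(data.table)

ngram4 <- data.table(prefix = "the end of", next_word = "the",
                     count = 120, key = "prefix")   # 4-grams: 3-word prefix
ngram3 <- data.table(prefix = "end of", next_word = "the",
                     count = 450, key = "prefix")   # 3-grams: 2-word prefix

predict_next <- function(context) {
  words <- tail(strsplit(tolower(context), "\\s+")[[1]], 3)

  # Try the 4-gram table first (three preceding words).
  hit   <- ngram4[.(paste(words, collapse = " ")), nomatch = 0L]
  score <- hit$count

  # Back off to the 3-gram table (two preceding words) with a fixed discount.
  if (nrow(hit) == 0 && length(words) >= 2) {
    hit   <- ngram3[.(paste(tail(words, 2), collapse = " ")), nomatch = 0L]
    score <- hit$count * 0.4
  }

  if (nrow(hit) == 0) return(NA_character_)
  hit$next_word[which.max(score)]
}

predict_next("We arrived at the end of")   # returns "the" in this toy example
```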