Introduction

In the Data Science Capstone we have to build a Shiny application that predicts the next word based on the previous words entered by the user. This report describes, in broad terms, the acquisition, cleaning, tokenization and exploratory analysis of the base data. It also describes the strategies that will be employed to develop the predictive model powering the final data product.

Data Acquisition

The capstone data was downloaded from https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip. A US English word list was downloaded from http://www.math.sjsu.edu/~foster/dictionary.txt.

From each of the three relevant data files (en_US.twitter.txt, en_US.blogs.txt, en_US.news.txt), the first 10000 lines of text were read and extracted to smaller, more manageable text files (twiter.txt, blogs.txt, news.txt). These files were the basis for all subsequent work and were loaded into a corpus named docs.
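A rough sketch of this step (assuming the raw files sit in the working directory; the helper name sampleFile is illustrative):

    library(tm)

    # Read the first 10000 lines of each raw file and save them to a smaller file
    sampleFile <- function(inFile, outFile, n = 10000) {
      con <- file(inFile, "r")
      lines <- readLines(con, n = n, skipNul = TRUE)
      close(con)
      writeLines(lines, outFile)
    }

    sampleFile("en_US.twitter.txt", "twiter.txt")
    sampleFile("en_US.blogs.txt",   "blogs.txt")
    sampleFile("en_US.news.txt",    "news.txt")

    # Load the three reduced files into a single corpus named docs
    docs <- VCorpus(DirSource(".", pattern = "(twiter|blogs|news)\\.txt$"))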

Data Cleaning

The corpus was cleaned using the “tm” package functions: numbers, punctuation, extra whitespace and special characters were removed, and uppercase letters were converted to lowercase. A document term matrix (dtm) was created using the US English word list referred to above as a control filter. Sparse terms of the document term matrix were removed, resulting in a 0% sparsity dtm.
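A sketch of this cleaning pipeline (the dictionary file name and the removeSparseTerms threshold are assumptions of this example):

    library(tm)

    # Standard tm transformations: drop numbers, punctuation and special
    # characters, collapse extra whitespace, convert to lowercase
    docs <- tm_map(docs, removeNumbers)
    docs <- tm_map(docs, removePunctuation)
    docs <- tm_map(docs, content_transformer(function(x) gsub("[^[:alnum:][:space:]]", " ", x)))
    docs <- tm_map(docs, stripWhitespace)
    docs <- tm_map(docs, content_transformer(tolower))

    # Document term matrix filtered by the US English word list
    dict <- readLines("dictionary.txt")
    dtm  <- DocumentTermMatrix(docs, control = list(dictionary = dict))

    # Remove sparse terms; with three documents this keeps only terms
    # appearing in all of them, i.e. a 0% sparsity dtm
    dtm <- removeSparseTerms(dtm, 0.1)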

Tokenization

Words, bigrams, trigrams and tetragrams were tokenized. For words, the dtm described above did the job. For bigrams, trigrams and tetragrams, special tokenization functions were written and applied when building the respective document term matrices. This whole process resulted in four non-sparse dtms: dtm (words), dtmBi (bigrams), dtmTri (trigrams) and dtmTet (tetragrams).
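The exact tokenization functions are not reproduced here; one possible base-R version, passed to DocumentTermMatrix through its tokenize control option, could look like this:

    # Generic n-gram tokenizer (illustrative): split text on whitespace and
    # paste together every run of n consecutive words
    ngramTokenizer <- function(n) {
      function(x) {
        words <- unlist(strsplit(as.character(x), "\\s+"))
        words <- words[words != ""]
        if (length(words) < n) return(character(0))
        vapply(seq_len(length(words) - n + 1),
               function(i) paste(words[i:(i + n - 1)], collapse = " "),
               character(1))
      }
    }

    dtmBi  <- DocumentTermMatrix(docs, control = list(tokenize = ngramTokenizer(2)))
    dtmTri <- DocumentTermMatrix(docs, control = list(tokenize = ngramTokenizer(3)))
    dtmTet <- DocumentTermMatrix(docs, control = list(tokenize = ngramTokenizer(4)))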

For each of the four groups of tokens, terms were sorted in increasing order of frequency and converted into data frames: wds (words), bif (bigrams), trif (trigrams) and tetf (tetragrams).
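For instance, a small helper along these lines (the column names are an assumption of this sketch) would produce the four data frames:

    # Convert a document term matrix into a data frame of term frequencies,
    # sorted in increasing order of frequency
    dtmToFreq <- function(m) {
      freq <- sort(colSums(as.matrix(m)))
      data.frame(term = names(freq), freq = freq,
                 row.names = NULL, stringsAsFactors = FALSE)
    }

    wds  <- dtmToFreq(dtm)
    bif  <- dtmToFreq(dtmBi)
    trif <- dtmToFreq(dtmTri)
    tetf <- dtmToFreq(dtmTet)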

Exploratory Analysis

Each group of tokens was analysed with respect to its term frequencies:

Words:

[Figure: word frequency distribution (wdFreqDistri)]

We can see that there are numerous terms that appear a small number of times and that above 400 there are lots of outliers. Words that appear at least 306 times constitute the 90% quantile, i.e. the upper 10% of the most frequent words, which amounts to 2703 terms out of a total of 26987 words.
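The 90% quantile threshold can be read directly off the frequency data frame; a sketch, assuming the wds data frame built above:

    # Frequency cut-off marking the upper 10% of the most frequent words
    cutoff   <- quantile(wds$freq, 0.9)   # about 306 in the sampled data
    topWords <- wds[wds$freq >= cutoff, ]
    nrow(topWords)                        # 2703 terms in the sampled data
    nrow(wds)                             # 26987 terms in total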

[Figure: word cloud (wCloud)]

Looking at this word cloud, we can see that “the”, “and”, “you”, “was”, “that”, “for” and “with” are more frequent than all other words.
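The cloud itself can be generated with the wordcloud package; a minimal sketch (colour palette and term limit are illustrative):

    library(wordcloud)
    library(RColorBrewer)

    # Word cloud of the most frequent terms in the wds data frame
    set.seed(1234)
    wordcloud(words = wds$term, freq = wds$freq, max.words = 100,
              random.order = FALSE, colors = brewer.pal(8, "Dark2"))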

Bigrams:

[Figure: bigram frequency distribution (biFreqDistri)]

We can see that there are numerous terms that appear a small number of times and that above 100 there are lots of outliers. Bigrams that appear at least 71 times constitute the 90% quantile, i.e. the upper 10% of the most frequent bigrams, which amounts to 11230 terms out of a total of 111132 bigrams.

[Figure: bigram cloud (bigramCloud)]

Looking at this bigram cloud, we can see that bigrams that end in “the”, like “in the”, are much more frequent than other bigrams.

Trigrams:

[Figure: trigram frequency distribution (triFreqDistri)]

We can see that there are numerous terms that appear a small number of times and that above 40 there are lots of outliers. Trigrams that appear at least 36 times constitute the 90% quantile, i.e. the upper 10% of the most frequent trigrams, which amounts to 7387 terms out of a total of 72003 trigrams.

[Figure: trigram cloud (trigramCloud)]

Looking at this trigram cloud, we can see that trigrams like “one of the”, “a lot of”, “as well as”, “going to be”, “out of the”, “to be a”, “the end of” and “some of the” are the most frequent trigrams.

Tetragrams:

[Figure: tetragram frequency distribution (tetFreqDistri)]

We can see that there are numerous terms that appear a small number of times and that above 50 there are lots of outliers. Tetragrams that appear at least 22 times constitute the 90% quantile, i.e. the upper 10% of the most frequent tetragrams, which amounts to 1831 terms out of a total of 17944 tetragrams.

[Figure: tetragram cloud (tetragramCloud)]

Looking at this tetragram cloud, we can see that tetragrams like “the end of the”, “at the end of”, “the rest of the”, “for the first time”, “is one of the”, “one of the most” and “at the same time” are the most frequent tetragrams.

Strategies for Predictive Model

My predictive model will be structured around four functions:

The first one, named stripTease, will acquire the user input, eliminate punctuation, special characters and extra spaces, and convert words to lowercase.
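A possible sketch of stripTease in base R (the exact regular expressions are an assumption):

    # Normalise the user input: lowercase, drop punctuation and special
    # characters, collapse extra whitespace
    stripTease <- function(input) {
      x <- tolower(input)
      x <- gsub("[^a-z ]", " ", x)   # remove punctuation, digits and special characters
      x <- gsub("\\s+", " ", x)      # collapse extra spaces
      trimws(x)
    }

    stripTease("Hello, World!!  How's it going?")
    # [1] "hello world how s it going"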

The second function, named stringToWords, will take as its argument the string generated by stripTease; its duty will be to capture the last three, two or one words of the string and convert them to, respectively, a trigram, a bigram or a single word.
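A sketch of stringToWords, which simply keeps the last (up to three) words of the cleaned string:

    # Capture the last three, two or one words of the cleaned string
    stringToWords <- function(x) {
      words <- unlist(strsplit(x, " "))
      words <- words[words != ""]
      if (length(words) == 0) return("")
      paste(tail(words, min(length(words), 3)), collapse = " ")
    }

    stringToWords("thanks for all the")   # trigram "for all the"
    stringToWords("thank you")            # bigram "thank you"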

The third function, named lookUp, will accept the output of stringToWords. If the input is a trigram, it will find the three most frequent tetragrams that begin with the input trigram. If the input is a bigram, it will find the three most frequent trigrams that begin with the input bigram. If the input is a single word, it will find the three most frequent bigrams that begin with the input word.
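A sketch of lookUp, assuming the frequency data frames (bif, trif, tetf) built during the exploratory analysis and the term/freq column layout used above:

    # Return up to three likely next words for a one-, two- or three-word input
    lookUp <- function(ngram) {
      n   <- length(unlist(strsplit(ngram, " ")))
      tbl <- switch(as.character(n), "1" = bif, "2" = trif, "3" = tetf)
      hits <- tbl[grepl(paste0("^", ngram, " "), tbl$term), ]
      hits <- hits[order(hits$freq, decreasing = TRUE), ]
      # Keep only the final word of the top three matching (n+1)-grams
      vapply(head(hits$term, 3),
             function(t) tail(unlist(strsplit(t, " ")), 1),
             character(1), USE.NAMES = FALSE)
    }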

The fourth function, inOut, will wrap up the three previous functions, trying to deliver to the user the three most probable words following the input sentence. If it can’t find a suitable tetragram, it will try the trigrams; if no suitable trigram is found, it will try the bigrams; if no suitable bigram is available, it will deliver the three most frequent words. It will always give the client a guess!
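Putting it together, inOut could back off from tetragrams to trigrams to bigrams and finally to the most frequent single words; a sketch, assuming wds is sorted in increasing order of frequency:

    # Wrap the previous functions, always returning three candidate next words
    inOut <- function(input) {
      clean <- stripTease(input)
      words <- unlist(strsplit(clean, " "))
      words <- words[words != ""]
      if (length(words) == 0) return(rev(tail(wds$term, 3)))
      for (n in seq(min(length(words), 3), 1)) {
        guess <- lookUp(paste(tail(words, n), collapse = " "))
        if (length(guess) > 0) return(guess)
      }
      rev(tail(wds$term, 3))   # fall back to the three most frequent words
    }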