Introduction

This is the milestone report for the Capstone project of the Data Science Specialization on Coursera, taught by Jeff Leek, Roger D. Peng, and Brian Caffo. The Capstone project uses data science to build a predictive text model like those used by SwiftKey.

For this milestone report, I would like to show the dataset, a basic summary of the data, and some other features of the data (word and n-gram frequencies).

The dataset

The data are provided on the Coursera website.

I will be working only on the en_US dataset, which includes three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

Basic summary of the data

The output below shows the number of lines and the number of words for each file:

##    Lines    Words Filename
##   899288 37334690 en_US.blogs.txt
##  1010242 34372720 en_US.news.txt
##  2360148 30374206 en_US.twitter.txt
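These counts can be reproduced along the following lines (a minimal sketch; the report does not show its own code, so the stringi word counter and the file locations are my assumptions):

```r
# Count lines and words per file (assumes the files are in the working directory)
library(stringi)

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
counts <- t(sapply(files, function(f) {
  lines <- readLines(f, encoding = "UTF-8", skipNul = TRUE)
  c(Lines = length(lines), Words = sum(stri_count_words(lines)))
}))
counts
```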

Other features of the data

Single word frequency

Due to the size of the data, for all the downstream analysis I used a random sample of each data file, consisting of 1/40 of its lines.
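A sketch of the sampling step (the seed value and the helper name are illustrative assumptions; any fixed seed makes the sample reproducible):

```r
# Draw a 1/40 random sample of lines from a file
set.seed(42)
sample_lines <- function(path, fraction = 1/40) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[sample.int(length(lines), ceiling(length(lines) * fraction))]
}
blogs_sample   <- sample_lines("en_US.blogs.txt")
news_sample    <- sample_lines("en_US.news.txt")
twitter_sample <- sample_lines("en_US.twitter.txt")
```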

I cleaned up the data by removing numbers, removing punctuation, stripping extra white space, and converting all text to lower case.
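One plausible cleaning pipeline, using the tm package mentioned later in this report (the exact calls used originally are not shown, so this is an assumption):

```r
# Cleaning: numbers, punctuation, extra whitespace, lower case
library(tm)

clean_corpus <- function(lines) {
  corpus <- VCorpus(VectorSource(lines))
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, stripWhitespace)
  tm_map(corpus, content_transformer(tolower))
}
blogs_corpus <- clean_corpus(blogs_sample)
```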

Next, I count the number of occurrences of each word in each data file and calculate the probability of finding a word in that file (count of the word / count of total words). The graph below shows the probabilities of the top words found in each data file.

[Figure: probability of the top words in each data file]
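These probabilities can be computed along the following lines (a sketch using tm's term-document matrix; variable names such as blogs_corpus continue the assumptions above, and slam is a dependency of tm):

```r
# Word probabilities: count of each word divided by the total word count
dtm    <- TermDocumentMatrix(blogs_corpus)
counts <- sort(slam::row_sums(dtm), decreasing = TRUE)
probs  <- counts / sum(counts)
head(probs, 10)  # the most probable words in the sample
```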

We can see that the word “the” is found most frequently in all three files; in second place is “and” in blogs and news, and “you” in tweets.

This nicely illustrates the concept of “stopwords”, which are words that are so common in a language that their information value is almost zero. The R package tm includes a list of 174 stopwords in the English language.
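The list can be inspected directly:

```r
library(tm)
length(stopwords("en"))  # 174
head(stopwords("en"))    # "i" "me" "my" "myself" "we" "our"
```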

To gain more insight into the content of the files, I decided to remove the stopwords and look at the top words again, as shown in the plot below.
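Removing them is a single extra step in the tm pipeline sketched above:

```r
# Drop the 174 English stopwords before recounting
blogs_corpus <- tm_map(blogs_corpus, removeWords, stopwords("en"))
```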

[Figure: top words in each data file after stopword removal]

Now we can see that “said” is the most frequently found word in news, which can be explained by the grammatical style of reported speech in news articles. “Just” is found most frequently in tweets, followed by “like”; “lol” and “thanks” are found much more frequently in tweets than in blogs or news.

n-gram frequency

Next, I look at the frequency of groups of words (n-grams), using the combined sample from all three files. The graph below shows the ten most frequent groups of two words (bigrams), three words (trigrams), and four words (4-grams).

[Figure: top 10 bigrams, trigrams, and 4-grams in the combined sample]
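A counter along these lines could produce those frequencies (a base-R sketch; the report does not name its tokenizer, so this naive letter-based splitter is an assumption and ignores sentence boundaries):

```r
# Count n-grams by sliding a window of n words over the tokenized text
ngram_freq <- function(lines, n) {
  words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  words <- words[words != ""]
  idx   <- seq_len(length(words) - n + 1)
  grams <- vapply(idx, function(i) paste(words[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)
}
combined <- c(blogs_sample, news_sample, twitter_sample)
head(ngram_freq(combined, 2), 10)  # top 10 bigrams
```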

Future plans

To build a predictive text application, I will need to take the following steps: