This is the milestone report for the Capstone project of the Data Science Specialization on Coursera taught by Jeff Leek, Roger D. Peng, and Brian Caffo. The goal of the Capstone project is to use data science to build predictive text models like those used by SwiftKey.
In this milestone report, I present a basic summary of the data, my cleaning and exploratory analysis of word and n-gram frequencies, and my plans for building the prediction application.
The data are provided on the Coursera website.
I will be working only on the en_US dataset, which includes three files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
The output below shows the number of lines and the number of words for each file:
##    Lines    Words  Filename
##   899288 37334690  en_US.blogs.txt
##  1010242 34372720  en_US.news.txt
##  2360148 30374206  en_US.twitter.txt
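For reference, counts like these can be obtained in base R along the following lines (a minimal sketch, not necessarily the exact code used to produce the output above; the file paths are assumed to point at the unzipped en_US files):

```r
# Count lines and (whitespace-separated) words in a text file.
count_file <- function(path) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  words <- sum(sapply(strsplit(lines, "\\s+"), length))
  c(Lines = length(lines), Words = words)
}

files <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
sapply(files, count_file)
```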
Due to the size of the data, for all downstream analysis I used a random sample of each data file, consisting of roughly 1/40 of the lines in each file.
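One way this sampling can be done is to flip a biased coin for each line (a sketch under that assumption; the seed and helper name are illustrative):

```r
# Keep roughly 1/40 of the lines of a file; set a seed for reproducibility.
set.seed(1234)
sample_lines <- function(path, fraction = 1 / 40) {
  lines <- readLines(path, encoding = "UTF-8", skipNul = TRUE)
  lines[rbinom(length(lines), 1, fraction) == 1]
}

blogs_sample <- sample_lines("en_US.blogs.txt")
```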
I cleaned up the data by removing numbers, removing punctuation, stripping extra white space, and converting all text to lower case.
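These cleaning steps map directly onto the standard transformations in the tm package; a possible implementation (shown on a toy text vector rather than the real sample) looks like this:

```r
library(tm)

# Toy input standing in for the sampled lines.
text_sample <- c("Sample line 1, with numbers 123 and punctuation!",
                 "  Another   LINE with   extra   spaces.  ")

corpus <- VCorpus(VectorSource(text_sample))
corpus <- tm_map(corpus, removeNumbers)                  # drop digits
corpus <- tm_map(corpus, removePunctuation)              # drop punctuation
corpus <- tm_map(corpus, stripWhitespace)                # collapse extra spaces
corpus <- tm_map(corpus, content_transformer(tolower))   # lower-case everything
```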
Next, I counted the number of occurrences of each word in each data file and calculated the probability of finding a word in a data file (count of the word divided by the total word count). The graph below shows the probabilities of the top words found in each data file.
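The counts and probabilities can be computed from a term-document matrix; the following sketch (on a toy corpus) illustrates the idea:

```r
library(tm)

corpus <- VCorpus(VectorSource(c("the cat sat on the mat", "the dog sat")))
tdm <- TermDocumentMatrix(corpus)

# Word counts across all documents, then counts / total = probabilities.
counts <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
probs  <- counts / sum(counts)
head(probs, 10)
```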
We can see that the word “the” is found most frequently in all three files; the second most frequent word is “and” in blogs and news, and “you” in tweets.
This nicely illustrates the concept of “stopwords”, which are words that are so common in a language that their information value is almost zero. The R package tm includes a list of 174 stopwords in the English language.
To gain more insight into the content of the files, I removed the stopwords and looked at the top words again, as shown in the plot below.
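Removing the stopwords is a one-line step with tm; a sketch on a toy corpus:

```r
library(tm)

length(stopwords("english"))   # the 174 English stopwords shipped with tm

corpus <- VCorpus(VectorSource(c("this is just a sample of the text")))
corpus <- tm_map(corpus, removeWords, stopwords("english"))
```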
Now we can see that “said” is the word most frequently found in news, which can be explained by the grammatical style used for reported speech in news articles. “Just” is found most frequently in tweets, followed by “like”. “lol” and “thanks” are found much more frequently in tweets than in blogs or news.
Next, I looked at the frequency of groups of words (n-grams), using the combination of all three files. In the graph below, I show the top 10 most frequent groups of 2 words (bigrams), 3 words (trigrams), and 4 words (4-grams).
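One common way to build these n-gram counts is with an n-gram tokenizer plugged into tm's term-document matrix; the sketch below assumes the RWeka package is available (other tokenizers work equally well) and shows bigrams on a toy corpus:

```r
library(tm)
library(RWeka)

corpus <- VCorpus(VectorSource(c("thanks for the follow",
                                 "thank you for the follow")))

# Tokenizer that splits text into overlapping groups of 2 words.
bigram_tokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))

tdm_bigrams <- TermDocumentMatrix(corpus,
                                  control = list(tokenize = bigram_tokenizer))
bigram_counts <- sort(rowSums(as.matrix(tdm_bigrams)), decreasing = TRUE)
head(bigram_counts, 10)
```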
To build a predictive text application, I will need to do the following steps: