Synopsis

This milestone report is part of the Coursera Data Science Capstone project. The purpose of the overall project is to build an application that predicts the next word in a user-defined sentence. We are provided with a text corpus on which we perform exploratory analysis and which we use to build the word prediction model.

The text contains special characters, unneeded spaces, and profanity that must first be removed. It comes from three sources: Tweets, news articles, and blog posts. Base R functions and regular expressions are used in this analysis to clean the data and prepare it for splitting into n-grams. An n-gram is a sequence of n consecutive words taken from a sentence. For example, the sentence “How are you today?” can be split into the bigrams “How are”, “are you”, and “you today”. These n-grams will be used to predict the next word in a sentence based upon the user’s input.
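
As a minimal illustration (not the exact code used in this analysis), the example sentence can be split into bigrams with base R as follows:

sentence <- "How are you today?"
# Lower-case, strip punctuation, and split on whitespace
words <- strsplit(tolower(gsub("[[:punct:]]", "", sentence)), "\\s+")[[1]]
# Pair each word with the word that follows it
bigrams <- paste(head(words, -1), tail(words, -1))
bigrams
## [1] "how are"   "are you"   "you today"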

Note: Due to system memory limits, only 10,000 lines from each file will be read into this analysis.

The data is available at the link below. The files used in the analysis are en_US.news.txt, en_US.blogs.txt, and en_US.twitter.txt.

Capstone Data
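
As noted above, only the first 10,000 lines of each file are used. A sketch of how such a sample can be read with base R (assuming the three files sit in the working directory):

# Read only the first 10,000 lines of each file to limit memory use
blogs   <- readLines("en_US.blogs.txt",   n = 10000, encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    n = 10000, encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", n = 10000, encoding = "UTF-8", skipNul = TRUE)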

File Statistics

##      File Size (MB)   Lines
## 1    News    196.28 1010242
## 2   Blogs    200.42  899288
## 3 Twitter    159.36 2360148
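
The sizes above are in megabytes and the line counts refer to the full files. One way such a table could be produced with base R (a sketch, not necessarily the code used for this report; note that counting lines this way reads each full file):

files <- c(News = "en_US.news.txt", Blogs = "en_US.blogs.txt", Twitter = "en_US.twitter.txt")
data.frame(File  = names(files),
           Size  = round(file.size(files) / 1024^2, 2),  # file size in MB
           Lines = sapply(files, function(f) length(readLines(f, skipNul = TRUE))))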

Exploratory Analysis

As seen from the plots and tables below, the most common unigrams are English stop words, and for the most part the bi-, tri-, and quartgrams contain these same stop words. Stop words are words such as “the”, “that”, and “it”. Rather than trying to predict these words, it may be better to remove them in order to make the model more accurate. Additionally, the current profanity filter replaces profane words with “expletive” using the base R function gsub().
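
A minimal sketch of that substitution is shown below; profanity_list here is a two-word placeholder rather than the actual word list used in this report:

text <- c("this badword1 tweet", "a clean sentence")
profanity_list <- c("badword1", "badword2")  # placeholder list for illustration only
pattern <- paste0("\\b(", paste(profanity_list, collapse = "|"), ")\\b")
gsub(pattern, "expletive", text, ignore.case = TRUE, perl = TRUE)
## [1] "this expletive tweet" "a clean sentence"

Anchoring the pattern on word boundaries (\b) keeps the filter from matching inside longer words; a filter without them is likely what produces the “expletive ociated press” entry in the quartgram table below.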

Number of Words in the Sample Data

length(n1)
## [1] 654818

Number of Profane Words Replaced in the Text

length(n1[n1 %in% "expletive"])
## [1] 2108

Unigram Table

##      n1  Freq
## 1   the 27132
## 2   and 18029
## 3  that  8948
## 4   for  8331
## 5  with  6062
## 6   you  5906
## 7   was  5284
## 8  this  4556
## 9   but  4339
## 10 have  4134
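
For reference, a frequency table like the one above can be built from the word vector with base R (assuming n1 is the vector of individual cleaned words; a sketch, not necessarily the exact code used here):

# Count word frequencies and keep the ten most common
unigram_freq <- sort(table(n1), decreasing = TRUE)
head(data.frame(n1 = names(unigram_freq), Freq = as.integer(unigram_freq)), 10)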

Unigram Plot

Bigram Table

##           n2 Freq
## 1    for the 1660
## 2    and the 1310
## 3   with the  971
## 4   from the  870
## 5   that the  727
## 6  the first  513
## 7    all the  496
## 8    you can  469
## 9  have been  449
## 10  has been  446

Bigram Analysis

Trigram Table

##                   n3 Freq
## 1     thanks for the  105
## 2     the first time  100
## 3      the fact that   90
## 4      for the first   79
## 5        the end the   76
## 6  the united states   68
## 7      thank you for   59
## 8       the rest the   59
## 9       one the most   47
## 10     the same time   47

Trigram Analysis

Quartgram Table

##                             n4 Freq
## 1           for the first time   60
## 2            thank you for the   17
## 3         the first time since   17
## 4           the new york times   16
## 5  the expletive ociated press   15
## 6            for the most part   13
## 7        thanks for the follow   13
## 8              you can see the   13
## 9           all over the world   12
## 10           the fact that the   11

Quartgram Analysis

Modeling & Next Steps

My current working model takes the sampled corpus as a character vector, searches it for every occurrence of the previous one, two, three, or four words, and returns the next word ranked by its frequency in the vector. This approach is extraordinarily inefficient and does not produce the most accurate results. Additionally, cleaning takes longer than desired with the regular expressions I am using. I need to explore the tm package and others in order to find a more efficient way to clean the text and filter out profanity.
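
A simplified sketch of that lookup, using a bigram frequency table of the shape shown earlier (predict_next is a hypothetical helper written for illustration, not this report’s actual function):

# bigram_freq: data frame with columns n2 (a two-word string) and Freq, as in the bigram table above
predict_next <- function(last_word, bigram_freq) {
  matches <- bigram_freq[grepl(paste0("^", last_word, " "), bigram_freq$n2), ]
  if (nrow(matches) == 0) return(NA_character_)
  best <- matches$n2[which.max(matches$Freq)]
  sub("^\\S+\\s+", "", best)  # drop the leading word, keep the predicted word
}

bigram_freq <- data.frame(n2 = c("for the", "and the", "you can"),
                          Freq = c(1660, 1310, 469),
                          stringsAsFactors = FALSE)
predict_next("for", bigram_freq)
## [1] "the"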

My ideas for the new model are as follows: