This milestone report is for the Coursera Data Science Capstone project. The goal of the overall project is to build an application that predicts the next word in a user-defined sentence. We are provided with a text corpus on which to perform exploratory analysis and which will be used to build the word prediction model.
The text comes from files of tweets, news articles, and blog posts, and it contains special characters, unneeded whitespace, and profanity that must first be removed. Base R functions and regular expressions are used in this analysis to clean the data and prepare it for splitting into n-grams. An n-gram is a group of n consecutive words taken from a sentence. For example, the sentence “How are you today?” can be split into the bigrams “How are”, “are you”, and “you today”. These n-grams will be used to predict the next word in a sentence based on the user’s input.
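As a rough illustration of this splitting step, the sketch below (illustrative only, not the exact code used in this report) cleans a sentence with base R regular expressions and builds its bigrams:

```r
# Illustrative sketch: clean a sentence with base R regex and split it into bigrams.
sentence <- "How are you today?"
cleaned  <- tolower(gsub("[^a-zA-Z ]", "", sentence))   # drop punctuation and special characters
cleaned  <- gsub("\\s+", " ", trimws(cleaned))          # collapse extra whitespace
words    <- strsplit(cleaned, " ")[[1]]
bigrams  <- paste(head(words, -1), tail(words, -1))     # "how are" "are you" "you today"
bigrams
```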
Note: Due to system memory limits, only 10,000 lines from each file are read into this analysis.
The data is available at the link below. The files used in the analysis are en_US.news.txt, en_US.blogs.txt, and en_US.twitter.txt.
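A minimal sketch of how such a sample might be read in, assuming the three files sit in the working directory (the helper name `read_sample` is an assumption, not part of the original code):

```r
# Sketch: read the first 10,000 lines of each file.
read_sample <- function(path, n = 10000) {
  con <- file(path, open = "r")
  on.exit(close(con))
  readLines(con, n = n, skipNul = TRUE)
}

news    <- read_sample("en_US.news.txt")
blogs   <- read_sample("en_US.blogs.txt")
twitter <- read_sample("en_US.twitter.txt")
```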
##      File Size (MB)   Lines
## 1    News    196.28 1010242
## 2   Blogs    200.42  899288
## 3 Twitter    159.36 2360148
As seen in the plots and tables below, the most common unigrams are English stop words, and for the most part the bi-, tri-, and quadgrams contain these same stop words. Stop words are words such as “the”, “that”, and “it”. Rather than trying to predict these words, it may be necessary to remove them in order to make the model more accurate. Additionally, the current profanity filter replaces profane words with “expletive” using the base R function gsub().
length(n1)
## [1] 654818
length(n1[n1 %in% "expletive"])
## [1] 2108
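A minimal sketch of the kind of gsub()-based replacement described above, assuming a hypothetical `profanity` vector standing in for the actual word list (which is not shown here):

```r
# Sketch: mask profane terms with "expletive" using base R.
# `profanity` stands in for the real word list, which is not shown here.
profanity <- c("badword", "otherbadword")
text      <- c("this badword here", "a clean line")
pattern   <- paste0("\\b(", paste(profanity, collapse = "|"), ")\\b")
gsub(pattern, "expletive", text, ignore.case = TRUE)
## [1] "this expletive here" "a clean line"
```

Wrapping the pattern in word boundaries (`\\b`) would keep the filter from matching inside longer words; the current filter appears not to do this, which is why “expletive ociated press” shows up in the quadgram table below.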
## n1 Freq
## 1 the 27132
## 2 and 18029
## 3 that 8948
## 4 for 8331
## 5 with 6062
## 6 you 5906
## 7 was 5284
## 8 this 4556
## 9 but 4339
## 10 have 4134
## n2 Freq
## 1 for the 1660
## 2 and the 1310
## 3 with the 971
## 4 from the 870
## 5 that the 727
## 6 the first 513
## 7 all the 496
## 8 you can 469
## 9 have been 449
## 10 has been 446
## n3 Freq
## 1 thanks for the 105
## 2 the first time 100
## 3 the fact that 90
## 4 for the first 79
## 5 the end the 76
## 6 the united states 68
## 7 thank you for 59
## 8 the rest the 59
## 9 one the most 47
## 10 the same time 47
## n4 Freq
## 1 for the first time 60
## 2 thank you for the 17
## 3 the first time since 17
## 4 the new york times 16
## 5 the expletive ociated press 15
## 6 for the most part 13
## 7 thanks for the follow 13
## 8 you can see the 13
## 9 all over the world 12
## 10 the fact that the 11
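The frequency tables above can be produced with base R along these lines (a sketch; the tiny `n2` vector here stands in for the full bigram vector used in the report):

```r
# Sketch: count bigram frequencies and list the most common.
n2 <- c("for the", "and the", "for the")   # stands in for the full bigram vector
freq2 <- as.data.frame(table(n2 = n2), stringsAsFactors = FALSE)
freq2 <- freq2[order(freq2$Freq, decreasing = TRUE), ]
head(freq2, 10)
```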
My current working model takes the sampled corpus as a character vector and matches the previous one to four words of the user’s input against it. It then returns the most likely next word based on the frequency of those matches in the vector. This approach is extraordinarily inefficient and does not produce the most accurate results. Additionally, cleaning takes longer than desired with the regular expressions I am using. I need to explore the tm package and others in order to find a more efficient way to clean the text and remove profanity.
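A rough outline of that lookup, under assumed names (`corpus` is taken to be the cleaned character vector; this is a sketch, not the exact implementation):

```r
# Sketch: return the most frequent word that follows `phrase` in `corpus`.
# `corpus` is assumed to be a character vector of cleaned text lines.
predict_next <- function(corpus, phrase) {
  pattern <- paste0("\\b", phrase, "\\s+(\\w+)")
  hits    <- regmatches(corpus, regexpr(pattern, corpus, ignore.case = TRUE, perl = TRUE))
  if (length(hits) == 0) return(NA_character_)
  nxt <- sub(pattern, "\\1", hits, ignore.case = TRUE, perl = TRUE)
  names(sort(table(tolower(nxt)), decreasing = TRUE))[1]
}

# e.g. predict_next(corpus, "for the first") might return "time"
```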
My ideas for the new model are as follows: