This report summarizes the exploratory analysis conducted on the data for the Coursera Data Science Capstone. The final task is to develop a model that predicts the next word based on the previous words typed. The three datasets are a Twitter file, a news file, and a blog file. These files will be combined and sampled from to create the eventual model.
The following libraries were loaded for the exploratory analysis.
library(rJava)
library(ggplot2)
library(stringr)
library(tm)
library(R.utils)
library(RTextTools)
library(wordcloud)
library(RWeka)
library(NLP)
library(dplyr)
One of the largest challenges in working with these datasets is their size and the amount of memory they require. The news and blog files are each over 200 megabytes, and the Twitter file is 163 megabytes. When these three files are loaded in their entirety, they take a significant amount of time to analyze on this PC, which has an Intel i3 processor and 4 GB of RAM. The strategy taken is to load a subset of each file and then sample from each to form a data set for exploratory analysis.
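The file sizes quoted above can be verified directly in R before deciding how many lines to read. The quick check below is a minimal sketch added for illustration; it assumes the same file paths used in the chunk that follows.
# check the size of each source file in megabytes
dataDir <- "~/Data_Science_Specialization/CapstoneProject/final/en_US"
dataFiles <- file.path(dataDir, c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt"))
round(file.info(dataFiles)$size / 1024^2, 1)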
#read in samples from the data files. Then combine into a Corpus.
blog <- readLines("~/Data_Science_Specialization/CapstoneProject/final/en_US/en_US.blogs.txt", 100000)
news <- readLines("~/Data_Science_Specialization/CapstoneProject/final/en_US/en_US.news.txt", 100000)
## Warning in readLines("~/Data_Science_Specialization/CapstoneProject/
## final/en_US/en_US.news.txt", : incomplete final line found on '~/
## Data_Science_Specialization/CapstoneProject/final/en_US/en_US.news.txt'
twit <- readLines("~/Data_Science_Specialization/CapstoneProject/final/en_US/en_US.twitter.txt", 100000)
sampleSize <- 1000
set.seed(1234)
combined <- sample(paste(blog, news, twit), size = sampleSize, replace = TRUE)
rm(blog, news, twit)
combined <- VectorSource(combined)
combined <- Corpus(combined)
Once we have a Corpus built from samples of the three files, we need to prepare it for effective analysis. This involves standardizing capitalization and removing punctuation, numbers, and extra whitespace. All of this can be accomplished with the tm_map function from the tm package. To remove vulgar words from the corpus, a vector of profane words is defined and passed to the removeWords function via tm_map.
naughtyWords <- c("vectorOfNaughtyWordsThatGoHere")
combined <- tm_map(combined, content_transformer(tolower))
combined <- tm_map(combined, removePunctuation)
combined <- tm_map(combined, removeNumbers)
combined <- tm_map(combined, removeWords, naughtyWords)
combined <- tm_map(combined, stripWhitespace)
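As an optional sanity check (not part of the original analysis), the first couple of cleaned documents can be inspected to confirm the transformations behaved as expected.
# quick look at the first two cleaned documents
inspect(combined[1:2])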
This section details the exploratory analysis conducted on the sample data. The key tasks included determining which words and word groupings occur most frequently in the data, including 1-grams, 2-grams, and 3-grams. Additionally, we would like to determine the subset of words needed to cover 50% and 90% of the dictionary. This may help later during model building, as it will save memory and computing time, which is key for smaller systems such as mobile devices with limited processing power.
In this section, we would like to know which words or word groupings (n-grams) occur most frequently. To find the 1-, 2-, and 3-grams, the NGramTokenizer function from the RWeka package is used. This allows us to build a term-document matrix of n-gram combinations that we can plot to see the most common combinations of words.
oneGramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
twoGramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
threeGramTokens <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
oneGramMatrix <- TermDocumentMatrix(combined, control = list(tokenize = oneGramTokens))
twoGramMatrix <- TermDocumentMatrix(combined, control = list(tokenize = twoGramTokens))
threeGramMatrix <- TermDocumentMatrix(combined, control = list(tokenize = threeGramTokens))
## One-grams
commonTerms <- findFreqTerms(oneGramMatrix, lowfreq = 200)
commonTerms.sum <- rowSums(as.matrix(oneGramMatrix[commonTerms,]))
commonTerms.df <- data.frame(onegram=names(commonTerms.sum), occurs=commonTerms.sum)
commonTerms.df <- arrange(commonTerms.df, occurs)
old.par <- par(no.readonly=T)
par(las= 1, mar=c(4.1,7.5,4.1,2.1))
barplot(height = commonTerms.df$occurs, names.arg = commonTerms.df$onegram,
horiz = TRUE, col = "light blue", axisnames = TRUE)
title(main = paste("Distribution of One-Grams in the Sample Data \n With ",
sampleSize, " Lines of Sample Data", sep = ""))
## Two-grams
commonTerms <- findFreqTerms(twoGramMatrix, lowfreq = 60)
commonTerms.sum <- rowSums(as.matrix(twoGramMatrix[commonTerms,]))
commonTerms.df <- data.frame(twogram=names(commonTerms.sum), occurs=commonTerms.sum)
commonTerms.df <- arrange(commonTerms.df, occurs)
old.par <- par(no.readonly=T)
par(las= 1, mar=c(4.1,7.5,4.1,2.1))
barplot(height = commonTerms.df$occurs, names.arg = commonTerms.df$twogram,
horiz = TRUE, col = "light blue", axisnames = TRUE)
title(main = paste("Distribution of Two-Grams in the Sample Data \n With ",
sampleSize, " Lines of Sample Data", sep = ""))
## Three-grams
commonTerms <- findFreqTerms(threeGramMatrix, lowfreq = 10)
commonTerms.sum <- rowSums(as.matrix(threeGramMatrix[commonTerms,]))
commonTerms.df <- data.frame(threegram=names(commonTerms.sum), occurs=commonTerms.sum)
commonTerms.df <- arrange(commonTerms.df, occurs)
par(las= 1, mar=c(4.1,7.5,4.1,2.1))
barplot(height = commonTerms.df$occurs, names.arg = commonTerms.df$threegram,
horiz = TRUE, col = "light blue", axisnames = TRUE)
title(main = paste("Distribution of Three-Grams in the Sample Data \n With ",
sampleSize, " Lines of Sample Data", sep = ""))
In this section, we would like to determine how many unique words are needed to cover 50 and 90 percent of the total words in the sample data. First we build a sorted data frame of all the words in the sample and their frequency of use. Then the frequencies of occurrence are summed in a loop until we reach 50 or 90 percent of the total words. To cover 50 percent, we need just over 300 unique words; for 90 percent coverage, we need over 7,000 words. This is important, as it should save computing time and memory if we only include the words needed for 90-95 percent coverage in our final model.
## Find the words that cover 50% and 90% of the data
commonTerms <- findFreqTerms(oneGramMatrix, lowfreq = 1)
commonTerms.sum <- rowSums(as.matrix(oneGramMatrix[commonTerms,]))
commonTerms.df <- data.frame(onegram=names(commonTerms.sum), occurs=commonTerms.sum)
# now we have a data frame of all the words with occurrences in descending order
commonTerms.df <- arrange(commonTerms.df, desc(occurs))
total <- sum(commonTerms.df$occurs)
#fiftyPercent
wordSum <- 0
for(i in 1:length(commonTerms.df$occurs)) {
wordSum <- wordSum + commonTerms.df$occurs[i]
if(wordSum >= 0.5* total) {break}
}
print(paste(i, " unique words cover approximately 50% of the sample", sep=""))
## [1] "347 unique words cover approximately 50% of the sample"
#ninetyPercent
wordSum <- 0
for(i in 1:length(commonTerms.df$occurs)) {
wordSum <- wordSum + commonTerms.df$occurs[i]
if(wordSum >= 0.9* total) {break}
}
print(paste(i, " unique words cover approximately 90% of the sample", sep=""))
## [1] "7294 unique words cover approximately 90% of the sample"
For the final model, I am considering a random forest-style prediction algorithm that uses the top 90-95 percent of words to predict the next word based on the previous one, two, or three words typed. Only including the most frequently used words should save memory and time. One drawback to this approach is that it may still be computationally intensive. A simple naive Bayes model may also be a choice, as it should be less computationally intensive. To estimate the probability of unobserved n-grams, I am considering a backoff model.
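As a rough illustration of the backoff idea, the sketch below shows how a lookup might fall from three-grams to two-grams to the most frequent single word. The predictNext function and the prefix, nextWord, and occurs column names are hypothetical; they assume the n-gram counts have been reshaped into lookup tables, which has not been done in this report.
# illustrative backoff lookup over assumed n-gram count tables
predictNext <- function(input, threeGram.df, twoGram.df, oneGram.df) {
    words <- unlist(strsplit(tolower(input), "\\s+"))
    n <- length(words)
    # try the trigram table first, keyed on the last two words typed
    if (n >= 2) {
        hits <- threeGram.df[threeGram.df$prefix == paste(words[n-1], words[n]), ]
        if (nrow(hits) > 0) return(as.character(hits$nextWord[which.max(hits$occurs)]))
    }
    # back off to the bigram table, keyed on the last word only
    if (n >= 1) {
        hits <- twoGram.df[twoGram.df$prefix == words[n], ]
        if (nrow(hits) > 0) return(as.character(hits$nextWord[which.max(hits$occurs)]))
    }
    # final fallback: the single most frequent one-gram
    as.character(oneGram.df$onegram[which.max(oneGram.df$occurs)])
}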