This is the week 2 Milestone Report for the Data Science specialization. The purpose of this report is to conduct exploratory data analysis of the training data, which consists of three data sets collected from internet blogs, news sites, and Twitter feeds. The work performed at this stage will support the final project, which is to use natural language processing techniques to build a text prediction model. The model will then be incorporated into a Shiny web application.
The three training data sets from SwiftKey have previously been downloaded from the Coursera website to the working directory. The three text files will be loaded into R, and basic statistics will be calculated for each file: file size, number of lines, and number of words.
# the data files en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt must be in the working directory
# load the packages used in this report: stringi (word counts), tm (text mining), ggplot2 (plots)
library(stringi)
library(tm)
library(ggplot2)
blogs <- readLines('en_US.blogs.txt', skipNul = TRUE)
news <- readLines('en_US.news.txt', skipNul = TRUE)
twitter <- readLines('en_US.twitter.txt', skipNul = TRUE)
# calculate file size in MB, number of lines, and number of words:
# (assumes 1 MB = 1024 KB and 1 KB = 1024 bytes)
US.blogs <- c(round(file.size('en_US.blogs.txt') / 1024^2, 2),
              length(blogs),
              sum(stri_count_words(blogs)))
US.news <- c(round(file.size('en_US.news.txt') / 1024^2, 2),
             length(news),
             sum(stri_count_words(news)))
US.twitter <- c(round(file.size('en_US.twitter.txt') / 1024^2, 2),
                length(twitter),
                sum(stri_count_words(twitter)))
info <- data.frame(rbind(US.blogs, US.news, US.twitter))
colnames(info) <- c('File_Size', 'Lines', 'Words')
row.names(info) <- c('blogs', 'news', 'twitter')
# quick look at the data
head(blogs,3); head(news,3); head(twitter,3)
## [1] "In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”."
## [2] "We love you Mr. Brown."
## [3] "Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him."
## [1] "He wasn't home alone, apparently."
## [2] "The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s."
## [3] "WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building."
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
## [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
## [3] "they've decided its more fun if I don't."
info
## File_Size Lines Words
## blogs 200.42 899288 37546246
## news 196.28 1010242 34762395
## twitter 159.36 2360148 30093410
The table above lists the file size (MB), number of lines, and number of words for each of the three text files: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.
Given the size of the full data set, it is reasonable to use a sample of the data for the text mining analysis. A sample of roughly one percent of the data will be created by using the rbinom function to keep each line with probability 1/100.
# set seed for reproducibility
set.seed(1999)
# create function to randomly sample lines of text from the files using the rbinom function
sampleText <- function(fileName) {
    p <- 1/100
    textFile <- character()
    for (i in 1:length(fileName)) {
        if (rbinom(1, 1, p)) {
            textFile <- c(textFile, fileName[i])
        }
    }
    textFile
}
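# note: an equivalent vectorized version (a sketch only, not run here so that the seed
# and the sampled output below stay unchanged) could draw all Bernoulli trials at once:
#   sampleText <- function(fileName, p = 1/100) fileName[rbinom(length(fileName), 1, p) == 1]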
# create a new character vector 'dat' containing approximately 1 out of every 100 lines (1%)
# from the blogs, news, and twitter text files, combined into one object:
dat <- sampleText(blogs)
dat <- c(dat, sampleText(news))
dat <- c(dat, sampleText(twitter))
# save reduced text file
write.table(dat, 'en_US.dat.txt', row.names = FALSE, col.names = FALSE)
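# note: writeLines(dat, 'en_US.dat.txt') is an alternative that writes the lines without
# surrounding quotes, which may be preferable if the file is later re-read with readLines()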
# remove the 3 large text files to save memory space
rm(list = c('blogs','news', 'twitter'))
# see stats on new 'dat' text file
str(dat)
## chr [1:42509] "Nicholas Ridley agreed with the Act of Supremacy, supporting Henry’s action. He became the king’s chaplain and later, in the ti"| __truncated__ ...
US.dat <- c(round(file.size('en_US.dat.txt') / 1024^2, 2),
            length(dat),
            sum(stri_count_words(dat)))
info <- data.frame(rbind(info, US.dat))
row.names(info) <- c('blogs', 'news', 'twitter', 'dat')
info
## File_Size Lines Words
## blogs 200.42 899288 37546246
## news 196.28 1010242 34762395
## twitter 159.36 2360148 30093410
## dat 5.64 42509 1030057
# I initially ran this keeping 1 in 20 lines (5%) and ended up with a dat file of over five million
# words, so I am now keeping just 1 in 100 lines (1%), resulting in a dat file of approximately one million words.
The updated table shows that the sample data file dat is about 5.6 MB in size and contains roughly one million words. Here is a quick look at four lines from the data file before pre-processing:
# clear workspace and (re)load the dat file if necessary
# rm(list=ls())
# dat <- readLines('en_US.dat.txt', skipNul = TRUE)
dat[1]
## [1] "Nicholas Ridley agreed with the Act of Supremacy, supporting Henry’s action. He became the king’s chaplain and later, in the time of Edward VI, he was Bishop of Rochester. He helped to write the Book of Common Prayer. He became Bishop of London and worked to improve the conditions of the poor."
dat[9111]
## [1] "Plus, he's seeing Katherine (Anna Kendrick), a hospital-suggested therapist with almost no experience -- Adam is her third-ever patient. She's fumbling her way toward her doctorate, but beneath her insecurities and Adam's fear it's clear there is real feeling."
dat[23456]
## [1] "Unless commitment is made, there are only promises and hopes"
dat[35000]
## [1] "LET'S GO"
I will use the tm package to preprocess the data. The following commands create a corpus, remove punctuation, numbers, stopwords, profanity, and extra white space, convert all letters to lowercase, and then create plain text documents.
#from the `tm` package, use the tm_map function to process text file
dat <- VCorpus(VectorSource(dat)) # creates a `tm` Source object
dat <- tm_map(dat, removePunctuation) # remove punctuation characters
dat <- tm_map(dat, removeNumbers) # remove numbers
dat <- tm_map(dat, tolower) # convert all characters to lowercase
dat <- tm_map(dat, removeWords, stopwords('english')) # remove stopwords
swearWords <- readLines('swearWords.txt') # read in profanity word file; the swearWords.txt file is already in the working directory
dat <- tm_map(dat, removeWords, swearWords) # remove profane words
dat <- tm_map(dat, stripWhitespace) # remove extra spaces
dat <- tm_map(dat, PlainTextDocument) # create plain text document
# Note: at this time I have chosen not to stem the words (the tm stemDocument function)
Now, a look at the same four lines after text processing:
# look at the same four lines of the dat file:
dat[[1]]$content
## [1] "nicholas ridley agreed act supremacy supporting henrys action became kings chaplain later time edward vi bishop rochester helped write book common prayer became bishop london worked improve conditions poor"
dat[[9111]]$content
## [1] "plus hes seeing katherine anna kendrick hospitalsuggested therapist almost experience adam thirdever patient shes fumbling way toward doctorate beneath insecurities adams fear clear real feeling"
dat[[23456]]$content
## [1] "unless commitment made promises hopes"
dat[[35000]]$content
## [1] "lets go"
One thing I noticed here is that hyphens have been deleted as part of the punctuation removal, but instead of inserting a space (" ") between the two words, the words are pasted together (for example, "hospital-suggested" becomes "hospitalsuggested"). I would like to find a way to correct this before the final model is built.
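One possible fix (a sketch only, not applied in this report) is to replace hyphens and dashes with a space before the removePunctuation step, using tm's content_transformer around gsub; the helper name toSpace is my own choice, not part of tm:
# sketch (not run): substitute a space for hyphens/dashes so hyphenated words stay separate
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
# dat <- tm_map(dat, toSpace, "[-–—]")   # would go before the removePunctuation call above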
This next step will create the n-gram frequency tables that we need for the word prediction application. Each table lists an n-gram and its associated frequency. For now, I will calculate unigram, bigram, and trigram tables.
# this is where I gave up on RWeka and, after searching online, found a method to build
# tokenizers using the ngrams function from the NLP package together with tm;
# I had a really hard time getting the n-gram tokenizer to work and RWeka to install on OS X 10.11;
# these posts helped solve the installation problem, but I still got a Java error when running:
# http://stackoverflow.com/questions/34971966/how-does-one-configure-rjava-on-osx-to-select-the-right-jvm-jinit-failing
# http://stackoverflow.com/questions/37817975/error-in-rweka-in-r-package
# the graphing and frequency functions are mostly based on these references:
# http://blogging2.humanities.manchester.ac.uk/R/wp-content/uploads/2016/12/tm101.pdf
# http://blogging2.humanities.manchester.ac.uk/R/author/radmin/
# http://tm.r-forge.r-project.org/faq.html
BigramTokenizer <- function(x)
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
TrigramTokenizer <- function(x)
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
# function to create viewable data frames with frequency data:
freq_df <- function(x) {
    freq <- sort(rowSums(as.matrix(x)), decreasing = TRUE)
    data.frame(word = names(freq), freq = freq)
}
# function to plot the top 30 n-grams:
freq_plot <- function(data, title) {
    ggplot(data[1:30, ], aes(reorder(word, -freq), freq)) +
        labs(x = "Words/Phrases", y = "Frequency") +
        ggtitle(title) +
        theme(axis.text.x = element_text(angle = 90, size = 12, hjust = 1)) +
        geom_bar(stat = "identity")
}
Bar charts of the top 30 n-grams will be created from each of the three n-gram frequency tables.
# the tm function TermDocumentMatrix creates the matrix we want; removeSparseTerms was set at
# 0.9999 initially, then gradually raised to 0.999975 for the bigrams and trigrams because
# values of 0.99998 and higher would crash R
unigram <- removeSparseTerms(TermDocumentMatrix(dat), 0.9999)
unigram_freq <- freq_df(unigram)
freq_plot(unigram_freq, "Unigrams")
save(unigram_freq, file = "unigrams.RData")
bigram <- removeSparseTerms(TermDocumentMatrix(dat, control = list(tokenize = BigramTokenizer)), 0.999975)
bigram_freq <- freq_df(bigram)
freq_plot(bigram_freq, "Bigrams")
save(bigram_freq, file = "bigrams.RData")
trigram <- removeSparseTerms(TermDocumentMatrix(dat, control = list(tokenize = TrigramTokenizer)), 0.999975)
trigram_freq <- freq_df(trigram)
freq_plot(trigram_freq, "Trigrams")
save(trigram_freq, file = "trigrams.RData")
From the bar charts created from the n-gram tables above, the most frequent word (unigram) is “will”, the most frequent bigram is “right now”, and the most frequent trigram is “happy mothers day”.
# also save plain-text copies of the frequency tables for later use
write.table(unigram_freq, file = "unigram_freq.txt")
write.table(bigram_freq, file = "bigram_freq.txt")
write.table(trigram_freq, file = "trigram_freq.txt")
One lesson learned is that working with large data sets is not always easy. I had trouble creating frequency tables large enough to be useful for a prediction model, and the n-gram frequency tables created here will need further processing before final use. I intend to use what is generally known as a “back-off” algorithm. For example, the user may enter two words of a sentence (a bigram), and the model will use the trigram table to find the most frequent word that follows the given bigram. If no match is found, the second word of the user’s bigram (now a unigram) will be used to look up a word in the bigram frequency table. If that look-up also fails, the model will fall back to the unigram table. Once the model is built, it will be used in a Shiny app for the final capstone project.
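As a rough illustration of that back-off logic, here is a minimal sketch; the function predict_word is hypothetical (not part of the pipeline above), it assumes the word and freq columns of the frequency data frames built earlier, it expects the input phrase to have been cleaned the same way as the corpus (lowercase, no punctuation or stopwords), and it ignores smoothing:
# minimal back-off sketch: try the trigram table on the last two words,
# then the bigram table on the last word, then fall back to the most frequent unigram
predict_word <- function(phrase, trigram_freq, bigram_freq, unigram_freq) {
    tokens <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
    last_word <- function(ngram) tail(strsplit(as.character(ngram), " ")[[1]], 1)
    if (length(tokens) == 2) {
        hits <- trigram_freq[startsWith(as.character(trigram_freq$word),
                                        paste(tokens[1], tokens[2], "")), ]
        if (nrow(hits) > 0) return(last_word(hits$word[1]))  # rows are already sorted by frequency
    }
    hits <- bigram_freq[startsWith(as.character(bigram_freq$word),
                                   paste(tail(tokens, 1), "")), ]
    if (nrow(hits) > 0) return(last_word(hits$word[1]))
    as.character(unigram_freq$word[1])  # most frequent single word
}
# hypothetical usage: predict_word("right now", trigram_freq, bigram_freq, unigram_freq)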