I downloaded the Capstone Dataset from the Coursera site. The data includes sample text in English, Russian, German, and Finnish; however, for this project we are only using the English text. The English text consists of three files: twitter, blogs, and news.
The data is rather large, so to reduce its size I used a sampling technique leveraging the rbinom function. By setting the probability of “success” in the function I can control how much data is retained. I set the probability to 0.01 to keep roughly 1% of each original file. For the actual project I may increase this, but I kept it small for now to help with performance.
## make smaller versions of our sample files
read_files <- c('en_US.blogs.txt', 'en_US.news.txt', 'en_US.twitter.txt')
write_files <- c('en_US.blogs_samp.txt', 'en_US.news_samp.txt', 'en_US.twitter_samp.txt')
for (i in seq_along(read_files)) {
  my_lines <- readLines(read_files[i])
  # keep each line with probability 0.01 (about 1% of the file)
  sample <- rbinom(length(my_lines), 1, 0.01) == 1
  samp_lines <- my_lines[sample]
  con <- file(write_files[i])
  writeLines(samp_lines, con = con)
  close(con)
}
rm(my_lines)
rm(samp_lines)
rm(sample)
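Note that because the rbinom draws are random, the sampled files (and therefore the counts reported below) will vary slightly from run to run. If reproducibility matters, a seed could be set before the loop; this is only a suggestion, not something the script above does.
set.seed(1234)  # hypothetical seed value; any fixed integer makes the 1% sample repeatable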
I leveraged the tm package to help explore, analyze, and process the text. The first step in this process is to create a Corpus, which is the main structure for managing documents in tm. Part of the cleanup process involves removing offensive and profane words; for this I obtained a list of words supposedly banned by Google. One thing I noticed is that the tolower transformation runs extremely slowly, as does removeWords. From the Corpus we can also create a document term matrix.
library(tm)
pathname <- "C:/Users/beecher/Documents/Coursera/DataScienceCapstone/Coursera-SwiftKey/final/en_US_small"
# create corpus from the sampled English files
corpus <- Corpus(DirSource(pathname), readerControl = list(reader = readPlain, language = "eng"))
# read in list of bad words
bad_words <- read.csv("./Coursera-SwiftKey/final/bad_word_list.csv", strip.white = TRUE, header = FALSE, stringsAsFactors = FALSE)
bad_words <- as.vector(bad_words$V1)
# clean up corpus: whitespace, punctuation, numbers, case, profanity
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, content_transformer(tolower))  # content_transformer keeps the documents as PlainTextDocuments in newer tm versions
corpus <- tm_map(corpus, removeWords, bad_words)
# create document term matrix
dtm <- DocumentTermMatrix(corpus)
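Since tolower and removeWords were the slow transformations for me, one possible workaround (just a sketch, assuming the corpus is built from the *_samp.txt files written above, and not something I benchmarked for this report) is to lowercase the sampled files with vectorized base R before the corpus is created, so the tm case-folding pass can be dropped:
# sketch: lowercase the sampled files up front; base R tolower on a character vector is fast
for (f in write_files) {
  writeLines(tolower(readLines(f)), f)
}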
Now that we have a corpus and document term matrix (dtm), we can perform some operations to examine characteristics of the text. We look at which files comprise the corpus, how much memory the corpus and dtm occupy, the number of words in each document, the overall number of unique words, the words which occur at least 5,000 times, the top 10 words by frequency, and a word cloud of the words which occur at least 1,000 times.
# show files in the corpus
summary(corpus)
## Length Class Mode
## en_US.blogs_samp.txt 2 PlainTextDocument list
## en_US.news_samp.txt 2 PlainTextDocument list
## en_US.twitter_samp.txt 2 PlainTextDocument list
# see how large the corpus and dtm are
object.size(corpus)
## 5588280 bytes
object.size(dtm)
## 4893880 bytes
# count the number of words in each document
rowSums(as.matrix(dtm))
## en_US.blogs_samp.txt en_US.news_samp.txt en_US.twitter_samp.txt
## 288374 275599 225408
# find the number of unique words
freq <- colSums(as.matrix(dtm))
length(freq)
## [1] 58213
# find words which occur at least 5000 times
findFreqTerms(dtm, 5000)
## [1] "and" "for" "have" "that" "the" "this" "was" "with" "you"
# show the top 10 words and their frequency
ord <- order(freq,decreasing=TRUE)
freq[head(ord,10)]
## the and for that you with was have this are
## 47717 23957 10991 10240 9388 7182 6146 5490 5317 4940
# plot the top 10 words and their frequency
library(ggplot2)
wf <- data.frame(term = names(freq), occurrences = freq)
wf <- wf[order(-wf$occurrences), ]
p <- ggplot(wf[1:10,], aes(term, occurrences))
p <- p + geom_bar(stat="identity")
p <- p + theme(axis.text.x=element_text(angle=90, hjust=1))
p <- p + scale_x_discrete(limits = wf$term[1:10])
p
# display a wordcloud for words occurring at least 1000 times
library(wordcloud)
wordcloud(names(freq), freq, min.freq = 1000)
I’ve done some preliminary investigation into using the ngram package. The ngram package identifies n-grams, which are ordered sequences of n words, and it also identifies which words come after a given n-gram and how frequently. Using the ngram package it should be possible to generate a predictive model of possible follow-on words, given one or more already-entered words. I’m planning to generate the n-grams for n = 1, 2, and 3, so it will be possible to predict the next word from anywhere from one to three previously entered words. As a person enters the fourth word, my plan is to look back at just the previous three words. Given the suggestion in the course notes I’ve also done some research on Markov chains as a possible way to store the model. A Markov chain represents a set of possible states and the probabilities for the next state, based solely on the current state; this is a so-called memoryless model. Key things to watch out for will be performance and handling the case where entered words do not appear in the model.
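To make the idea concrete, here is a minimal sketch of the kind of lookup I have in mind, assuming the ngram package is installed and using the sampled blog file from earlier; predict_next and the example prefix "one of" are hypothetical names of my own, not part of the ngram API.
# sketch: build a trigram table and look up frequent continuations of a two-word prefix
library(ngram)
txt <- readLines("en_US.blogs_samp.txt")               # sampled file created earlier
txt <- preprocess(paste(txt, collapse = " "),
                  case = "lower", remove.punct = TRUE) # simple normalization
ng3 <- ngram(txt, n = 3)
pt  <- get.phrasetable(ng3)                            # data frame with columns ngrams, freq, prop
# split each trigram into a two-word prefix and the word that follows it
parts  <- strsplit(trimws(pt$ngrams), " ")
prefix <- sapply(parts, function(w) paste(w[1:2], collapse = " "))
nextw  <- sapply(parts, function(w) w[3])
# hypothetical helper: the k most frequent words seen after a given two-word prefix
predict_next <- function(p, k = 3) {
  hits <- which(prefix == p)
  head(nextw[hits][order(-pt$freq[hits])], k)
}
predict_next("one of")   # returns up to 3 candidate next words, or character(0) if the prefix is unseen
A table like this maps naturally onto the Markov-chain representation: each prefix is a state and the relative frequencies of its continuations are the transition probabilities, while an unseen prefix returning nothing is exactly the back-off case I will need to handle.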