This is the milestone report for Week 2 of the Coursera JHU Data Science Capstone Project, whose goal is to develop an algorithm that predicts the most likely next word in a sequence of words in a sentence.
This report loads and cleans the data, provides exploratory analysis to investigate some features of the data, and uses Natural Language Processing tools in R to tokenize n-grams as the first step toward building a predictive model.
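For reference, the code in this report assumes the following packages are attached. This is a minimal setup sketch; in particular, wordcount() is assumed here to come from the ngram package.
library(ngram)    # wordcount()
library(tm)       # VCorpus, tm_map, TermDocumentMatrix, removeSparseTerms
library(RWeka)    # NGramTokenizer, Weka_control
library(ggplot2)  # bar charts of n-gram frequencies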
blogs <- readLines("/home/dugi/coursera/Course10/data/final/en_US/en_US.blogs.txt", warn=FALSE, encoding="UTF-8")
news <- readLines("/home/dugi/coursera/Course10/data/final/en_US/en_US.news.txt", warn=FALSE, encoding="UTF-8")
twitter <- readLines("/home/dugi/coursera/Course10/data/final/en_US/en_US.twitter.txt", warn=FALSE, encoding="UTF-8")
SummaryOfFiles <- data.frame("File Name" = c("Blogs","News","Twitter"),
"File Size" = sapply(list(blogs, news, twitter), function(x){format(object.size(x),"MB")}),
"Row Count" = sapply(list(blogs, news, twitter), function(x){length(x)}),
"Word Count" = sapply(list(blogs, news, twitter), function(x){wordcount(x)})
)
SummaryOfFiles
## File.Name File.Size Row.Count Word.Count
## 1 Blogs 255.4 Mb 899288 37334131
## 2 News 257.3 Mb 1010242 34372530
## 3 Twitter 319 Mb 2360148 30373543
The files are very large, so I will work with a random sample of 5% of each file. With this subset of the data, I will clean the text by removing all non-ASCII (non-English) characters.
# set seed for reproducibility
set.seed(1234)
blogsSamp <- sample(blogs, length(blogs)*0.05)
newsSamp <- sample(news, length(news)*0.05)
twitterSamp <- sample(twitter, length(twitter)*0.05)
sampData <- c(blogsSamp, newsSamp, twitterSamp)
sampData <- iconv(sampData, "latin1", "ASCII", sub="")
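As an optional sanity check, the combined sample should contain about 5% of the roughly 4.27 million lines counted above.
# The combined sample should be roughly 5% of the total line count
length(sampData)  # about 213,483 lines, matching the document count in the TDMs below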
I use the tm package to build and clean the corpus that will be analyzed: convert everything to lower case, then remove numbers, punctuation, unnecessary white space, and English stopwords.
corpus <- VCorpus(VectorSource(sampData))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
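To confirm the transformations behaved as expected, one cleaned document can be inspected directly; the index below is chosen arbitrarily for illustration.
# Inspect a single cleaned document (illustrative index)
writeLines(as.character(corpus[[1]]))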
Using the RWeka package, I tokenize the sample data and construct three term-document matrices (TDMs): unigrams, bigrams, and trigrams.
unigram <- function(x) NGramTokenizer(x, Weka_control(min=1, max=1))
bigram <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
unigramTDM <- TermDocumentMatrix(corpus, control=list(tokenize=unigram))
bigramTDM <- TermDocumentMatrix(corpus, control=list(tokenize=bigram))
trigramTDM <- TermDocumentMatrix(corpus, control=list(tokenize=trigram))
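As an optional spot check before examining sparsity, tm's findFreqTerms() lists the terms whose total frequency clears a threshold; the 1,000 cutoff below is purely illustrative.
# Terms appearing at least 1,000 times in the sampled corpus (threshold is illustrative)
findFreqTerms(unigramTDM, lowfreq = 1000)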
We need to look at the sparsity of each of the three TDMs: unigramTDM, bigramTDM, and trigramTDM.
unigramTDM
## <<TermDocumentMatrix (terms: 142580, documents: 213483)>>
## Non-/sparse entries: 2571033/30435835107
## Sparsity : 100%
## Maximal term length: 114
## Weighting : term frequency (tf)
bigramTDM
## <<TermDocumentMatrix (terms: 1814759, documents: 213483)>>
## Non-/sparse entries: 2589014/387417606583
## Sparsity : 100%
## Maximal term length: 119
## Weighting : term frequency (tf)
trigramTDM
## <<TermDocumentMatrix (terms: 2327168, documents: 213483)>>
## Non-/sparse entries: 2396677/496808409467
## Sparsity : 100%
## Maximal term length: 142
## Weighting : term frequency (tf)
All three matrices are extremely sparse, meaning nearly all of their term-document cells are zero. I therefore remove sparse terms from each matrix and build ordered frequency data frames to plot.
unigramDense <- removeSparseTerms(unigramTDM, 0.99)
bigramDense <- removeSparseTerms(bigramTDM, 0.999)
trigramDense <- removeSparseTerms(trigramTDM, 0.9999)
freqUnigram <- rowSums(as.matrix(unigramDense))
freqBigram <- rowSums(as.matrix(bigramDense))
freqTrigram <- rowSums(as.matrix(trigramDense))
orderUnigram <- order(freqUnigram, decreasing=TRUE)
orderBigram <- order(freqBigram, decreasing=TRUE)
orderTrigram <- order(freqTrigram, decreasing=TRUE)
unigramDF <- data.frame("unigram"=names(freqUnigram[orderUnigram]), "freq"=freqUnigram[orderUnigram])
bigramDF <- data.frame("bigram"=names(freqBigram[orderBigram]), "freq"=freqBigram[orderBigram])
trigramDF <- data.frame("trigram"=names(freqTrigram[orderTrigram]), "freq"=freqTrigram[orderTrigram])
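Before plotting, an optional look at the head of each frequency table confirms the decreasing ordering.
# Most frequent entries in each table
head(unigramDF, 5)
head(bigramDF, 5)
head(trigramDF, 5)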
ggplot(unigramDF[1:30,], aes(factor(unigram, levels=unique(unigram)), freq)) +
geom_bar(stat="identity", fill="blue", colour="black", width=0.9) +
theme(axis.text.x=element_text(angle=90)) +
labs(x="Unigram", y="Frequency", title="30 Most Common Single Words")
ggplot(bigramDF[1:30,], aes(factor(bigram, levels=unique(bigram)), freq)) +
geom_bar(stat="identity", fill="green", colour="black", width=0.9) +
theme(axis.text.x=element_text(angle=90)) +
labs(x="Bigram", y="Frequency", title="30 Most Common Word Pairs")
ggplot(trigramDF[1:30,], aes(factor(trigram, levels=unique(trigram)), freq)) +
geom_bar(stat="identity", fill="purple", colour="black", width=0.9) +
theme(axis.text.x=element_text(angle=90)) +
labs(x="Trigram", y="Frequency", title="30 Most Common Word Triples")
Evaluating a 5% sample of the three data files saved time while still giving a good sense of the data and allowing sufficient exploratory analysis. The longer the n-gram, the lower the frequency. The most frequent single word was “will”, occurring 15,911 times; the most frequent word pair was “right now”, occurring 1,211 times; and the most frequent word triple was “cant wait see”, occurring 173 times.
The next step of this project will be to build a predictive algorithm that uses the n-grams to estimate the probability of the next word given the words typed before it. Because the n-gram frequencies are already available as data frames built from the TDMs, this format should work well for predicting the next word in a sequence. The final step will be to develop a Shiny app that uses this algorithm to suggest the next word.
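To illustrate the idea, the sketch below shows the kind of frequency-based lookup the predictor could perform over the data frames built above. predictNextWord() and its simple trigram-then-bigram backoff are hypothetical and only a starting point, not the final algorithm.
# Illustrative sketch only: a naive frequency lookup with a simple
# trigram-then-bigram backoff over the ordered data frames built above.
predictNextWord <- function(phrase, trigrams = trigramDF, bigrams = bigramDF) {
  words <- tail(unlist(strsplit(tolower(phrase), "\\s+")), 2)
  if (length(words) == 2) {
    # rows are already sorted by decreasing frequency, so the first match wins
    hits <- trigrams[grepl(paste0("^", words[1], " ", words[2], " "),
                           as.character(trigrams$trigram)), ]
    if (nrow(hits) > 0) {
      return(tail(unlist(strsplit(as.character(hits$trigram[1]), " ")), 1))
    }
  }
  hits <- bigrams[grepl(paste0("^", tail(words, 1), " "),
                        as.character(bigrams$bigram)), ]
  if (nrow(hits) > 0) {
    return(tail(unlist(strsplit(as.character(hits$bigram[1]), " ")), 1))
  }
  NA_character_
}
predictNextWord("right")  # expected to return "now", the most frequent pair above
In the final model this naive lookup would be replaced with smoothed probabilities and proper backoff weights, but the underlying data-frame format stays the same.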