This is the milestone report for week 2 of the Coursera Data Science Capstone project. The goal of this report is to show that I am on track to create my prediction algorithm. The motivation for this report is to:
This report covers the following aspects:
library(RWeka)
library(dplyr)
library(stringi)
library(tm)
library(ggplot2)
library(RColorBrewer)
library(wordcloud)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
To get a feel for the data, determine the number of lines, characters and words in each of the 3 data sets. Also calculate some basic statistics (minimum, mean and maximum) on the number of words per line.
wpl = sapply(list(blogs, news, twitter), function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(wpl) = c('wplMin','wplMean','wplMax')
stats = data.frame(Dataset = c("blogs","news","twitter"),
t(rbind(sapply(list(blogs, news, twitter), stri_stats_general)[c('Lines', 'Chars'),],
Words = sapply(list(blogs, news, twitter), stri_stats_latex)['Words',], wpl)))
stats
##   Dataset   Lines     Chars    Words wplMin  wplMean wplMax
## 1   blogs  899288 206824382 37570839      0 41.75107   6726
## 2    news   77259  15639408  2651432      1 34.61779   1123
## 3 twitter 2360148 162096241 30451170      1 12.75065     47
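To make the words-per-line statistic concrete, here is a minimal base R sketch on three invented lines. It approximates word counts by splitting on whitespace, which is an assumption on my part: `stri_count_words()` uses locale-aware word boundaries, so its counts can differ on text with punctuation or symbols.

```r
# Toy input, invented for illustration only.
toy_lines <- c("the quick brown fox", "hello world", "")

# Approximate the number of words per line by splitting on runs of whitespace.
# An empty line yields zero words, which is why a minimum of 0 can appear.
words_per_line <- lengths(strsplit(trimws(toy_lines), "\\s+"))

words_per_line
summary(words_per_line)[c("Min.", "Mean", "Max.")]
```

The empty third line shows how a data set can report a minimum of 0 words per line.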
First, remove all non-ASCII characters (accented letters, curly quotes, emoji and the like). Then, compile a sample data set composed of 1% of each of the 3 original data sets.
blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")
set.seed(520)
sampleData <- c(sample(blogs, length(blogs) * 0.01),
sample(news, length(news) * 0.01),
sample(twitter, length(twitter) * 0.01))
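The two steps above can be tried on a small scale. The sketch below uses invented strings and assumes `"UTF-8"` as the source encoding, since that is how the files are read; the report's own code passes `"latin1"`, and in both cases bytes without an ASCII equivalent should be dropped. The `floor()` call is my addition to make the integer sample size explicit.

```r
# Toy lines, invented for illustration; "\u00e9" is e-acute and
# "\u2019" is a curly apostrophe.
toy <- c("caf\u00e9 au lait", "it\u2019s fine", "plain ascii")

# Drop every byte that has no ASCII equivalent.
ascii_only <- iconv(toy, from = "UTF-8", to = "ASCII", sub = "")
ascii_only

# Draw a reproducible 1% sample from a hypothetical 1000-line data set.
set.seed(520)
population <- sprintf("line %d", 1:1000)
one_percent <- sample(population, floor(length(population) * 0.01))
length(one_percent)
```

Because of `set.seed()`, rerunning the sampling step returns the same lines each time.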
Next, use the functions from the tm package to build and clean the corpus that’ll be analyzed. After building the corpus, convert everything to lower case, remove all punctuation and numbers, strip white spaces and then convert it to plain text.
corpus <- VCorpus(VectorSource(sampleData))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, content_transformer(removePunctuation))
corpus <- tm_map(corpus, content_transformer(removeNumbers))
corpus <- tm_map(corpus, content_transformer(stripWhitespace))
corpus <- tm_map(corpus, content_transformer(PlainTextDocument))
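To see what these cleaning steps do to a single line, here is a rough base R equivalent applied to an invented sentence. The `gsub()` patterns are my approximation of `removePunctuation`, `removeNumbers` and `stripWhitespace`, not the tm implementations themselves.

```r
# Toy sentence, invented for illustration only.
toy <- "Hello,   World! It's 2 of 3 tests."

cleaned <- tolower(toy)                        # lower case
cleaned <- gsub("[[:punct:]]+", "", cleaned)   # drop punctuation
cleaned <- gsub("[[:digit:]]+", "", cleaned)   # drop numbers
cleaned <- gsub("[[:space:]]+", " ", cleaned)  # collapse runs of whitespace

cleaned
# "hello world its of tests"
```

Note that removing punctuation also merges contractions ("it's" becomes "its"), which affects the n-grams built later.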
Use the RWeka package to create functions that tokenize the sample and construct matrices of unigrams, bigrams and trigrams.
uniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))
uniMat <- TermDocumentMatrix(corpus, control = list(tokenize = uniTokenizer))
biMat <- TermDocumentMatrix(corpus, control = list(tokenize = biTokenizer))
triMat <- TermDocumentMatrix(corpus, control = list(tokenize = triTokenizer))
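To illustrate what the tokenizers produce, the following base R sketch builds n-grams from one already-cleaned string. `ngrams_base` is a hypothetical helper written for this example, not part of RWeka, and it splits on whitespace only.

```r
# Hypothetical helper: return all n-grams of size n from a single string.
ngrams_base <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  # Too few words to form even one n-gram.
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  # Slide a window of n words over the sentence.
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

ngrams_base("thanks for the follow", 2)
# "thanks for" "for the" "the follow"
ngrams_base("thanks for the follow", 3)
# "thanks for the" "for the follow"
```

A sentence of k words yields k - n + 1 n-grams, so short tweets contribute few or no trigrams.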
Then, find the terms that occur at least 50 times in each of these 3 matrices and construct data frames of their frequencies.
uniCorpus <- findFreqTerms(uniMat,lowfreq = 50)
biCorpus <- findFreqTerms(biMat,lowfreq=50)
triCorpus <- findFreqTerms(triMat,lowfreq=50)
uniCorpusF <- rowSums(as.matrix(uniMat[uniCorpus,]))
uniCorpusF <- data.frame(word=names(uniCorpusF), frequency=uniCorpusF)
biCorpusF <- rowSums(as.matrix(biMat[biCorpus,]))
biCorpusF <- data.frame(word=names(biCorpusF), frequency=biCorpusF)
triCorpusF <- rowSums(as.matrix(triMat[triCorpus,]))
triCorpusF <- data.frame(word=names(triCorpusF), frequency=triCorpusF)
head(uniCorpusF)
##                  word frequency
## able             able       195
## about           about      2168
## above           above        87
## absolutely absolutely        80
## according   according        85
## account       account        82
Write a function to plot the 20 most frequent n-grams.
plotNgrams <- function(data, title) {
  # Keep the 20 most frequent n-grams (fewer if the table is shorter)
  plotData <- head(data[order(-data$frequency), ], 20)
  # Put the n-grams themselves on the x axis, ordered by decreasing frequency
  ggplot(plotData, aes(x = reorder(word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = "pink", width = 0.80) +
    labs(title = title) +
    xlab("N-grams") +
    ylab("Count") +
    # Rotate the labels so that longer n-grams do not overlap
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}
plotNgrams(uniCorpusF, "Top 20 Unigrams")
plotNgrams(biCorpusF, "Top 20 Bigrams")
plotNgrams(triCorpusF, "Top 20 Trigrams")
These results can also be presented in the form of wordclouds as follows:
wordcloud(uniCorpusF$word, uniCorpusF$frequency, max.words=20, random.order=FALSE, rot.per=.15, colors=colorRampPalette(brewer.pal(3,"Blues"))(32))
wordcloud(biCorpusF$word, biCorpusF$frequency, max.words=20, random.order=FALSE, rot.per=.15, colors=colorRampPalette(brewer.pal(9,"Oranges"))(32), scale=c(2, .2))
wordcloud(triCorpusF$word, triCorpusF$frequency, max.words=20, random.order=FALSE, rot.per=.15, colors=colorRampPalette(brewer.pal(9,"Purples"))(32), scale=c(2, .2))
The wordcloud for unigrams looks faded because of the huge difference between the frequencies of the various unigrams. The parameters of wordcloud() can be tweaked to get appropriate results, as shown above.
This concludes the exploratory analysis. The next step will be to build a predictive algorithm that uses an n-gram model with a frequency lookup similar to the one shown above. This algorithm will then be deployed in a Shiny app to suggest the most likely next word after a phrase is typed.
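As a rough idea of what such a frequency lookup could look like, here is a minimal base R sketch that uses bigrams only. The counts in `bigramFreq` are invented for illustration and are not results from the corpus, and `predictNext` is a hypothetical helper, not the final algorithm.

```r
# Hypothetical bigram counts, invented for illustration only.
bigramFreq <- data.frame(
  word = c("of the", "of course", "of a", "in the", "in a"),
  frequency = c(120, 45, 60, 150, 70),
  stringsAsFactors = FALSE
)

# Hypothetical helper: suggest the k most frequent words that follow
# the last word of `phrase`, according to the bigram table.
predictNext <- function(phrase, freqTable, k = 3) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  if (length(words) == 0) return(character(0))
  lastWord <- words[length(words)]
  # Split each bigram into its first and second word.
  parts <- strsplit(freqTable$word, " ", fixed = TRUE)
  first <- vapply(parts, `[`, character(1), 1)
  second <- vapply(parts, `[`, character(1), 2)
  # Keep bigrams that start with the last typed word, most frequent first.
  hits <- first == lastWord
  candidates <- second[hits][order(-freqTable$frequency[hits])]
  head(candidates, k)
}

predictNext("One of", bigramFreq)
# "the" "a" "course"
predictNext("Once in", bigramFreq, k = 1)
# "the"
```

A fuller model would presumably try the trigram table first and fall back to bigrams and unigrams when no match is found, but that design is still open at this stage.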