This is the milestone report for week 2 of the Coursera Data Science Capstone project. Its goal is to show that I am on track to build my prediction algorithm. Specifically, the report aims to:

  1. Demonstrate that I’ve downloaded the data and successfully loaded it
  2. Create a basic report of summary statistics about the data sets
  3. Report any interesting findings that I have amassed so far
  4. Get feedback on my plans for creating a prediction algorithm and Shiny app

This report covers the following aspects:

Loading required packages and data

library(RWeka)
library(dplyr)
library(stringi)
library(tm)
library(ggplot2)
library(RColorBrewer)
library(wordcloud)

blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
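
If one of the line counts looks lower than expected, it may be worth reading the file through a binary-mode connection, since on some platforms a text-mode read can stop early at an embedded control character. The following is a minimal sketch: `readCorpusFile` is a hypothetical helper name that is not part of the original analysis, and the temporary file merely stands in for the real data files.

```r
# Hypothetical helper: read a text file through a binary-mode connection
readCorpusFile <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # close the connection even if reading fails
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demonstration on a small temporary file standing in for the real data
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demoLines <- readCorpusFile(tmp)
length(demoLines)
```

Comparing `length()` of the result with the line count from the plain `readLines()` call is a quick way to check whether anything was cut off.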

Summarising the data

To get a sense of what the data looks like, compute the number of lines, characters and words in each of the 3 data sets, along with some basic statistics on the number of words per line.

datasets <- list(blogs, news, twitter)
# Minimum, mean and maximum number of words per line for each data set
wpl <- sapply(datasets, function(x) summary(stri_count_words(x))[c('Min.', 'Mean', 'Max.')])
rownames(wpl) <- c('wplMin', 'wplMean', 'wplMax')
# Combine line and character counts, word counts and words-per-line statistics
stats <- data.frame(Dataset = c("blogs", "news", "twitter"),
                    t(rbind(sapply(datasets, stri_stats_general)[c('Lines', 'Chars'), ],
                            Words = sapply(datasets, stri_stats_latex)['Words', ],
                            wpl)))
stats
##   Dataset   Lines     Chars    Words wplMin  wplMean wplMax
## 1   blogs  899288 206824382 37570839      0 41.75107   6726
## 2    news   77259  15639408  2651432      1 34.61779   1123
## 3 twitter 2360148 162096241 30451170      1 12.75065     47
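
To make the columns of this table concrete, here is a small sketch that computes the same kind of statistics on a toy vector. It uses base R only: `countWords` is a hypothetical helper that splits on whitespace, which is only an approximation of the word-boundary rules used by `stri_count_words()`, so counts on real text may differ slightly.

```r
# Hypothetical base-R stand-in for stri_count_words(): split lines on whitespace
countWords <- function(x) lengths(strsplit(trimws(x), "\\s+"))

# Toy data: three "lines", one of them empty
toy <- c("the quick brown fox", "hello world", "")
wordsPerLine <- countWords(toy)

# One row of summary statistics, mirroring the columns of the table above
toyStats <- data.frame(
  Lines   = length(toy),
  Chars   = sum(nchar(toy)),
  Words   = sum(wordsPerLine),
  wplMin  = min(wordsPerLine),
  wplMean = mean(wordsPerLine),
  wplMax  = max(wordsPerLine)
)
toyStats
```

The empty line is what produces a minimum of 0 words per line, as seen for the blogs data set.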

Cleaning and sampling data

First, remove all non-ASCII characters (a rough proxy for non-English text). Then, build a sample data set from a random 1% of each of the 3 original data sets.

blogs <- iconv(blogs, "latin1", "ASCII", sub="")
news <- iconv(news, "latin1", "ASCII", sub="")
twitter <- iconv(twitter, "latin1", "ASCII", sub="")
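
As a small illustration of what this step does, the sketch below runs `iconv()` on a few toy strings. The strings are made up for the example, and because they are written with `\u` escapes (which R stores as UTF-8) the sketch converts from "UTF-8" rather than "latin1".

```r
# Toy strings; "\u00e9" (e with acute accent) and "\u2665" (heart) are non-ASCII
toy <- c("caf\u00e9 au lait", "I \u2665 R", "plain ascii")

# Convert to ASCII; sub = "" replaces anything that cannot be represented
# with an empty string instead of returning NA for the whole element
cleaned <- iconv(toy, "UTF-8", "ASCII", sub = "")
cleaned
```

Strings that are already pure ASCII pass through unchanged.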

set.seed(520)
sampleData <- c(sample(blogs, length(blogs) * 0.01),
                sample(news, length(news) * 0.01),
                sample(twitter, length(twitter) * 0.01))
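
The sampling step can be sketched on toy data as follows. `sampleFraction` is a hypothetical helper (not used in the analysis above) that makes the whole-number sample size explicit, and the repeated `set.seed()` call shows why fixing the seed makes the sample reproducible.

```r
# Hypothetical helper: draw a random fraction of a vector without replacement
sampleFraction <- function(x, fraction) {
  n <- floor(length(x) * fraction)  # explicit whole-number sample size
  sample(x, n)
}

# Toy data standing in for one of the data sets
toyLines <- paste("line", 1:1000)

set.seed(520)
firstDraw <- sampleFraction(toyLines, 0.01)
set.seed(520)
secondDraw <- sampleFraction(toyLines, 0.01)

length(firstDraw)                 # size of the 1% sample
identical(firstDraw, secondDraw)  # same seed, same sample
```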

Building the corpus

Next, use functions from the tm package to build and clean the corpus to be analyzed. After building the corpus, convert everything to lower case, remove all punctuation and numbers, collapse extra whitespace and then convert each document to plain text.

corpus <- VCorpus(VectorSource(sampleData))
# tolower is a base R function, so it needs the content_transformer() wrapper
corpus <- tm_map(corpus, content_transformer(tolower))
# tm's own transformations can be passed to tm_map() directly
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, PlainTextDocument)
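
To see what these cleaning steps do to a single line of text, here is a rough base-R equivalent. `cleanText` is a hypothetical helper that only approximates the tm transformations with regular expressions; it is meant as an illustration, not a replacement for the corpus pipeline.

```r
# Hypothetical helper approximating the tm cleaning steps on a character vector
cleanText <- function(x) {
  x <- tolower(x)                   # lower case
  x <- gsub("[[:punct:]]", "", x)   # drop punctuation
  x <- gsub("[[:digit:]]", "", x)   # drop numbers
  x <- gsub("\\s+", " ", x)         # collapse runs of whitespace
  trimws(x)                         # extra step: trim leading/trailing spaces
}

cleanText("Hello, World! 123  times")
```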

Tokenizing n-grams

Use the RWeka package to create functions that tokenize the sample and construct matrices of unigrams, bigrams and trigrams.

uniTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
biTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
triTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

uniMat <- TermDocumentMatrix(corpus, control = list(tokenize = uniTokenizer))
biMat <- TermDocumentMatrix(corpus, control = list(tokenize = biTokenizer))
triMat <- TermDocumentMatrix(corpus, control = list(tokenize = triTokenizer))
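
For readers unfamiliar with n-grams, the idea behind the tokenizers can be sketched in a few lines of base R. `makeNgrams` is a hypothetical helper written for this illustration; it is not how RWeka works internally.

```r
# Hypothetical helper: build all n-grams of size n from a vector of words
makeNgrams <- function(words, n) {
  if (length(words) < n) return(character(0))  # too few words for any n-gram
  starts <- seq_len(length(words) - n + 1)     # start index of each n-gram
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

words <- strsplit("thanks for the follow", " ")[[1]]
makeNgrams(words, 2)  # bigrams
makeNgrams(words, 3)  # trigrams
```

A sentence of k words yields k - n + 1 n-grams, so longer n-grams are both fewer and rarer, which matters when choosing frequency cut-offs.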

Then, keep the terms that occur at least 50 times in each of these 3 matrices and build data frames of their frequencies.

uniCorpus <- findFreqTerms(uniMat,lowfreq = 50)
biCorpus <- findFreqTerms(biMat,lowfreq=50)
triCorpus <- findFreqTerms(triMat,lowfreq=50)

uniCorpusF <- rowSums(as.matrix(uniMat[uniCorpus,]))
uniCorpusF <- data.frame(word=names(uniCorpusF), frequency=uniCorpusF)
biCorpusF <- rowSums(as.matrix(biMat[biCorpus,]))
biCorpusF <- data.frame(word=names(biCorpusF), frequency=biCorpusF)
triCorpusF <- rowSums(as.matrix(triMat[triCorpus,]))
triCorpusF <- data.frame(word=names(triCorpusF), frequency=triCorpusF)
head(uniCorpusF)
##                  word frequency
## able             able       195
## about           about      2168
## above           above        87
## absolutely absolutely        80
## according   according        85
## account       account        82
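
The same kind of frequency table can be built by hand on a toy example, which may help clarify what the `lowfreq` cut-off does. The tokens and the threshold `minFreq` below are made up for the illustration.

```r
# Toy tokens standing in for the terms of a term-document matrix
tokens <- c("the", "cat", "the", "dog", "the", "cat", "bird")

# Count occurrences of each distinct token
counts <- table(tokens)
freqTable <- data.frame(word = names(counts),
                        frequency = as.vector(counts),
                        stringsAsFactors = FALSE)

# Keep tokens seen at least minFreq times (the role lowfreq plays above),
# then sort by decreasing frequency
minFreq <- 2
freqTable <- freqTable[freqTable$frequency >= minFreq, ]
freqTable <- freqTable[order(-freqTable$frequency), ]
freqTable
```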

Making plots and wordclouds

Write a function to plot the 20 most frequent n-grams.

plotNgrams <- function(data, title) {
  # Keep the 20 most frequent n-grams
  plotData <- head(data[order(-data$frequency), ], 20)
  # reorder() puts the n-grams on the x axis in order of decreasing frequency
  ggplot(plotData, aes(x = reorder(word, -frequency), y = frequency)) +
    geom_bar(stat = "identity", fill = "pink", width = 0.80) +
    labs(title = title) +
    xlab("Words") +
    ylab("Count") +
    theme(axis.text.x = element_text(angle = 90, hjust = 1))
}

plotNgrams(uniCorpusF, "Top 20 Unigrams")

plotNgrams(biCorpusF, "Top 20 Bigrams")

plotNgrams(triCorpusF, "Top 20 Trigrams")

These results can also be presented in the form of wordclouds as follows:

wordcloud(uniCorpusF$word, uniCorpusF$frequency, max.words=20, random.order=FALSE, rot.per=.15, colors=colorRampPalette(brewer.pal(3,"Blues"))(32))

wordcloud(biCorpusF$word, biCorpusF$frequency, max.words=20, random.order=FALSE, rot.per=.15, colors=colorRampPalette(brewer.pal(9,"Oranges"))(32), scale=c(2, .2))

wordcloud(triCorpusF$word, triCorpusF$frequency, max.words=20, random.order=FALSE, rot.per=.15, colors=colorRampPalette(brewer.pal(9,"Purples"))(32), scale=c(2, .2))

The unigram wordcloud looks faded because of the huge differences between the frequencies of the individual unigrams. The parameters of wordcloud() can be tweaked, as in the calls above, to get a more legible result.

Conclusion

This concludes my exploratory analysis. The next step is to build a prediction algorithm based on an n-gram model with a frequency lookup similar to the tables shown above. The algorithm will then be deployed in a Shiny app that suggests the most likely next word after a phrase is typed.
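
To give an idea of what such a frequency lookup could look like, here is a minimal sketch of a bigram-based next-word predictor in base R. Everything in it is an assumption made for illustration: the toy training text, the names `bigramCounts` and `predictNext`, and the choice to use bigrams only. The final algorithm will likely combine several n-gram orders and handle unseen words more gracefully.

```r
# Toy training text; a real model would be built from the sampled corpus
trainingText <- c("i love data science", "i love r", "i love data")

# Split each line into (first word, second word) pairs
bigramList <- lapply(strsplit(trainingText, " "), function(w) {
  if (length(w) < 2) return(NULL)
  data.frame(first = head(w, -1), second = tail(w, -1),
             stringsAsFactors = FALSE)
})
bigrams <- do.call(rbind, bigramList)

# Count how often each pair occurs
bigramCounts <- aggregate(count ~ first + second,
                          data = cbind(bigrams, count = 1), FUN = sum)

# Hypothetical lookup: most frequent word following `word`, or NA if unseen
predictNext <- function(word, counts) {
  candidates <- counts[counts$first == word, ]
  if (nrow(candidates) == 0) return(NA_character_)
  as.character(candidates$second[which.max(candidates$count)])
}

predictNext("love", bigramCounts)
```

Returning `NA` for an unseen word is a placeholder; in practice one would fall back to a shorter n-gram or to the most frequent unigrams.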