Introduction

This is the milestone report for the second week of the Coursera Data Science Capstone project. The goal of this report is to clean the data and to use Natural Language Processing tools in R (tm and RWeka) to tokenize n-grams as a basis for building a predictive model.

Read the Datasets

Read the datasets with the readLines function.

blogs   <-  readLines("~/Desktop/Coursera DS Specialization Capstone Project/01 - Data/final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <-  readLines("~/Desktop/Coursera DS Specialization Capstone Project/01 - Data/final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <-  readLines("~/Desktop/Coursera DS Specialization Capstone Project/01 - Data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
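
Note: on some systems readLines() stops early on en_US.news.txt because the file contains an embedded control character. If the reported line count looks suspiciously low, a common workaround is to read the file through a binary connection (a sketch, reusing the path above):

con  <- file("~/Desktop/Coursera DS Specialization Capstone Project/01 - Data/final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)  # binary mode avoids the early stop
close(con)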

Summary statistics

The following information is summarised for each file:
* Size (in MB)
* Number of lines and words
* Average word count per line
* Maximum number of characters per line

# File sizes in MB (same paths as used for readLines above)
fsize_blogs   <- file.info("~/Desktop/Coursera DS Specialization Capstone Project/01 - Data/final/en_US/en_US.blogs.txt")$size / 1024^2
fsize_news    <- file.info("~/Desktop/Coursera DS Specialization Capstone Project/01 - Data/final/en_US/en_US.news.txt")$size / 1024^2
fsize_twitter <- file.info("~/Desktop/Coursera DS Specialization Capstone Project/01 - Data/final/en_US/en_US.twitter.txt")$size / 1024^2
line_count_blogs   <- length(blogs)    # count of lines in the file
line_count_news    <- length(news)     #
line_count_twitter <- length(twitter)  #
word_count_blogs   <- sum(sapply(gregexpr("\\W+", blogs),   length)) + 1    # approximate count of words in the file
word_count_news    <- sum(sapply(gregexpr("\\W+", news),    length)) + 1
word_count_twitter <- sum(sapply(gregexpr("\\W+", twitter), length)) + 1
maxchar_blogs    <- max(nchar(blogs))    # max count of char per line.
maxchar_news     <- max(nchar(news))     # 
maxchar_twitter  <- max(nchar(twitter))  # 

knitr::kable(data.frame(
  DataSources     = c("Blogs", "News", "Twitter"),
  FileSizeMB    = format(c(fsize_blogs, fsize_news, fsize_twitter), digits = 5),
  totLines     = format(c(line_count_blogs, line_count_news, line_count_twitter), big.mark = ","),
  totWord   = format(c(word_count_blogs, word_count_news, word_count_twitter), big.mark = ","),
  AvgWordsLine = format(c(word_count_blogs/line_count_blogs, 
                                word_count_news/line_count_news, 
                                word_count_twitter/line_count_twitter),
                              digits = 5),
  max.char.per.line  = format(c(maxchar_blogs, maxchar_news, maxchar_twitter), big.mark = ",")))
DataSources   FileSizeMB    totLines    totWord      AvgWordsLine   max.char.per.line
Blogs         NA              899,288   38,221,262         42.502              40,833
News          NA            1,010,242   35,710,846         35.349              11,384
Twitter       NA            2,360,148   30,433,296         12.895                 140
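
The word counts above are only a regular-expression approximation. As a cross-check, they can be recomputed with the stringi package (a minimal sketch, assuming stringi is installed; the totals may differ slightly from the gregexpr() figures):

library(stringi)
# Count the words on each line with stri_count_words() and sum over the file
word_count_blogs_alt   <- sum(stri_count_words(blogs))
word_count_news_alt    <- sum(stri_count_words(news))
word_count_twitter_alt <- sum(stri_count_words(twitter))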

Data Exploration

In order to create the corpus and perform data exploration on it, the tm package has been used.

library(tm)
library(RWeka)
library(RColorBrewer)
library(ggplot2)
library(wordcloud)

Furthermore, to make handling these datasets easier, each dataset has been sampled so that the calculations run on a smaller amount of data and therefore faster. The sample size has been set to 1% of each file; depending on the prediction results, this proportion may be adjusted later.

blogs   <- iconv(blogs,   "UTF-8", "ASCII", "byte")
news    <- iconv(news,    "UTF-8", "ASCII", "byte")
twitter <- iconv(twitter, "UTF-8", "ASCII", "byte")

set.seed(1332)
data.sample <- c(sample(blogs,   length(blogs)   * 0.01),
                 sample(news,    length(news)    * 0.01),
                 sample(twitter, length(twitter) * 0.01))
#data.sample <- sample(c(blogs, news, twitter), 50000 )
rm(blogs)
rm(news)
rm(twitter)

Subsequently, the required manipulations are performed:

# Create the corpus
corpus <- VCorpus(VectorSource(data.sample))
# Transform all letters into lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation and numbers
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# Remove unnecessary white spaces
corpus <- tm_map(corpus, stripWhitespace)
# Conversion into a plain text document
corpus <- tm_map(corpus, PlainTextDocument)
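
A quick spot-check of a few cleaned documents helps confirm that the transformations behaved as expected (a minimal sketch; the indices are arbitrary):

# Print the content of the first three cleaned documents
lapply(corpus[1:3], as.character)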

N-Gram creation and frequency histograms

options(mc.cores=1)  # use a single core to avoid issues between the RWeka tokenizers and parallel tm
# Tokenizer functions for uni-, bi- and trigrams
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

getFreq <- function(tdm) {
  # Sum term counts across documents and sort in decreasing order of frequency
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}

generatePlot <- function(data, label) {
  # Bar chart of the 20 most frequent n-grams
  ggplot(data[1:20,], aes(reorder(word, -freq), freq)) +
         labs(x = label, y = "Frequency") +
         theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
         geom_bar(stat = "identity", fill = I("grey50"))
}

The histograms of the 20 most frequent unigrams, bigrams and trigrams are displayed below:

freq_uni <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = unigram)), 0.99))
generatePlot(freq_uni, "20 Most frequent unigrams")

freq_bi <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.99))
generatePlot(freq_bi, "20 Most frequent bigrams")

freq_tri <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.99))
generatePlot(freq_tri, "20 Most frequent trigrams")
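
Since these frequency tables will be the starting point for the prediction model, it can be convenient to save them to disk so that the tokenization does not have to be repeated (a minimal sketch; the file names are only placeholders):

# Persist the n-gram frequency tables for later use in the prediction model
saveRDS(freq_uni, "freq_uni.rds")
saveRDS(freq_bi,  "freq_bi.rds")
saveRDS(freq_tri, "freq_tri.rds")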

Considerations

It can be seen that many of the most frequent words are stopwords. A better picture of the most frequent meaningful words can be obtained by removing them:

wo_stopw_corpus<-tm_map(corpus, removeWords, stopwords("english"))

Let’s plot them in a word cloud:

wordcloud(wo_stopw_corpus, max.words=50, random.order=FALSE, colors=brewer.pal(8,"Blues"))

Next steps