Introduction

This is the milestone report for the second week of the Coursera Data Science Capstone project. The goal of this report is to clean the data and to use Natural Language Processing tools in R (tm and RWeka) to tokenize n-grams as a basis for building a predictive model.

Read the Datasets

Read the datasets with the readLines function.

blogs   <-  readLines("~/Desktop/Coursera DS Specialization Capstone Project/01 - Data/final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <-  readLines("~/Desktop/Coursera DS Specialization Capstone Project/01 - Data/final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <-  readLines("~/Desktop/Coursera DS Specialization Capstone Project/01 - Data/final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
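
Note: on some systems readLines() stops early on en_US.news.txt because the file contains an embedded control character. If the reported line count looks suspiciously low, a common workaround is to read the file through a binary connection (a sketch, reusing the path above):

con  <- file("~/Desktop/Coursera DS Specialization Capstone Project/01 - Data/final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)  # binary mode avoids the early stop
close(con)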

Summary statistics

The following information is summarised for each file:
* Size (in MB)
* Number of lines and words
* Average word count per line
* Maximum number of characters per line

# File sizes in MB (same paths as used for readLines above)
fsize_blogs   <- file.info("~/Desktop/Coursera DS Specialization Capstone Project/01 - Data/final/en_US/en_US.blogs.txt")$size / 1024^2
fsize_news    <- file.info("~/Desktop/Coursera DS Specialization Capstone Project/01 - Data/final/en_US/en_US.news.txt")$size / 1024^2
fsize_twitter <- file.info("~/Desktop/Coursera DS Specialization Capstone Project/01 - Data/final/en_US/en_US.twitter.txt")$size / 1024^2
line_count_blogs   <- length(blogs)    # count of lines in the file
line_count_news    <- length(news)     #
line_count_twitter <- length(twitter)  #
word_count_blogs   <- sum(sapply(gregexpr("\\W+", blogs),   length)) + 1    # approximate count of words in the file
word_count_news    <- sum(sapply(gregexpr("\\W+", news),    length)) + 1
word_count_twitter <- sum(sapply(gregexpr("\\W+", twitter), length)) + 1
maxchar_blogs    <- max(nchar(blogs))    # max count of char per line.
maxchar_news     <- max(nchar(news))     # 
maxchar_twitter  <- max(nchar(twitter))  # 

knitr::kable(data.frame(
  DataSources     = c("Blogs", "News", "Twitter"),
  FileSizeMB    = format(c(fsize_blogs, fsize_news, fsize_twitter), digits = 5),
  totLines     = format(c(line_count_blogs, line_count_news, line_count_twitter), big.mark = ","),
  totWord   = format(c(word_count_blogs, word_count_news, word_count_twitter), big.mark = ","),
  AvgWordsLine = format(c(word_count_blogs/line_count_blogs, 
                                word_count_news/line_count_news, 
                                word_count_twitter/line_count_twitter),
                              digits = 5),
  max.char.per.line  = format(c(maxchar_blogs, maxchar_news, maxchar_twitter), big.mark = ",")))
DataSources   FileSizeMB    totLines    totWord      AvgWordsLine   max.char.per.line
Blogs         NA              899,288   38,221,262         42.502              40,833
News          NA            1,010,242   35,710,846         35.349              11,384
Twitter       NA            2,360,148   30,433,296         12.895                 140
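
The word counts above are only a regular-expression approximation. As a cross-check, they can be recomputed with the stringi package (a minimal sketch, assuming stringi is installed; the totals may differ slightly from the gregexpr() figures):

library(stringi)
# Count the words on each line with stri_count_words() and sum over the file
word_count_blogs_alt   <- sum(stri_count_words(blogs))
word_count_news_alt    <- sum(stri_count_words(news))
word_count_twitter_alt <- sum(stri_count_words(twitter))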

Data Exploration

In order to create the corpus and perform data exploration on it, the tm package has been used.

library(tm)
library(RWeka)
library(RColorBrewer)
library(ggplot2)
library(wordcloud)

Furthermore, to make handling these datasets easier, each dataset has been sampled so that the calculations run on a smaller amount of data and therefore faster. The sample size has been set to 1% of each file; depending on the prediction results, this proportion may be adjusted later.

blogs   <- iconv(blogs,   "UTF-8", "ASCII", "byte")
news    <- iconv(news,    "UTF-8", "ASCII", "byte")
twitter <- iconv(twitter, "UTF-8", "ASCII", "byte")

set.seed(1332)
data.sample <- c(sample(blogs,   length(blogs)   * 0.01),
                 sample(news,    length(news)    * 0.01),
                 sample(twitter, length(twitter) * 0.01))
#data.sample <- sample(c(blogs, news, twitter), 50000 )
rm(blogs)
rm(news)
rm(twitter)

Subsequently, the required manipulations are performed:

# Create the corpus
corpus <- VCorpus(VectorSource(data.sample))
# Transform all letters into lower case
corpus <- tm_map(corpus, content_transformer(tolower))
# Remove punctuation and numbers
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
# Remove unnecessary white spaces
corpus <- tm_map(corpus, stripWhitespace)
# Conversion into a plain text document
corpus <- tm_map(corpus, PlainTextDocument)
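
A quick spot-check of a few cleaned documents helps confirm that the transformations behaved as expected (a minimal sketch; the indices are arbitrary):

# Print the content of the first three cleaned documents
lapply(corpus[1:3], as.character)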

N-Gram creation and frequency histograms

options(mc.cores=1)  # use a single core to avoid issues between the RWeka tokenizers and parallel tm
# Tokenizer functions for uni-, bi- and trigrams
unigram <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 1))
bigram  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

getFreq <- function(tdm) {
  # Sum term counts across documents and sort in decreasing order of frequency
  freq <- sort(rowSums(as.matrix(tdm)), decreasing = TRUE)
  return(data.frame(word = names(freq), freq = freq))
}

generatePlot <- function(data, label) {
  # Bar chart of the 20 most frequent n-grams
  ggplot(data[1:20,], aes(reorder(word, -freq), freq)) +
         labs(x = label, y = "Frequency") +
         theme(axis.text.x = element_text(angle = 60, size = 12, hjust = 1)) +
         geom_bar(stat = "identity", fill = I("grey50"))
}

The histograms of the 20 most frequent unigrams, bigrams and trigrams are displayed below:

freq_uni <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = unigram)), 0.99))
generatePlot(freq_uni, "20 Most frequent unigrams")

freq_bi <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = bigram)), 0.99))
generatePlot(freq_bi, "20 Most frequent bigrams")

freq_tri <- getFreq(removeSparseTerms(TermDocumentMatrix(corpus, control = list(tokenize = trigram)), 0.99))
generatePlot(freq_tri, "20 Most frequent trigrams")
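
Since these frequency tables will be the starting point for the prediction model, it can be convenient to save them to disk so that the tokenization does not have to be repeated (a minimal sketch; the file names are only placeholders):

# Persist the n-gram frequency tables for later use in the prediction model
saveRDS(freq_uni, "freq_uni.rds")
saveRDS(freq_bi,  "freq_bi.rds")
saveRDS(freq_tri, "freq_tri.rds")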

Considerations

It can be seen that many of the most frequent words are stopwords. A better picture of the most frequent meaningful words can be obtained by removing them:

wo_stopw_corpus<-tm_map(corpus, removeWords, stopwords("english"))

Let’s plot them in a word cloud:

wordcloud(wo_stopw_corpus, max.words=50, random.order=FALSE, colors=brewer.pal(8,"Blues"))

Next steps