Introduction

The goal of this project is to demonstrate that the data has been downloaded and explored, and that work is on track towards a prediction algorithm. This report explains the exploratory analysis and the goals for the eventual app and algorithm. It covers only the major features of the data identified so far and briefly summarizes the plans for creating the prediction algorithm and Shiny app, in a way that is understandable to a non-data-scientist manager. Tables and plots are used to illustrate important summaries of the data set.

The motivation for this project is to:

  1. Demonstrate that the data has been successfully downloaded & loaded.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that have been amassed so far.
  4. Get feedback on the plans for creating a prediction algorithm and Shiny app.

Prework and loading libraries

A series of libraries and toolsets is loaded:

library(NLP); library(tm); library(RWeka); library(ggplot2);
library(dplyr); library(wordcloud); library(knitr); library(kableExtra);
library(stringi)

In order to begin the exploratory data analysis, the dataset is downloaded:

Capstone Dataset

Once the archive has been extracted and the working directory defined, the dataset is loaded into a “corpus” structure.
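
A minimal sketch of this step is shown below; the download URL and the corpus.location path are assumptions based on the standard layout of the Coursera SwiftKey archive.

## Sketch of the download/extract step (URL and paths are assumptions)
dataset.url <- "https://d396qusza40orc.cloudfront.net/dscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(dataset.url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
    unzip("Coursera-SwiftKey.zip")
}
corpus.location <- "final/en_US"   # folder containing the English text files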

Loading the data

Each document is read into R.

Blogs <- readLines(paste0(corpus.location, "/en_US.blogs.txt"), encoding="UTF-8", warn = FALSE)
News <- readLines(paste0(corpus.location, "/en_US.news.txt"), encoding="UTF-8", warn = FALSE)
Twitter <- readLines(paste0(corpus.location, "/en_US.twitter.txt"), encoding="UTF-8", warn = FALSE)

The summary for the documents read is shown:

File      File Size   Lines       Total Characters   Words
Blogs     255.4 Mb    899,288     206,824,505        37,570,839
News      19.8 Mb     77,259      15,639,408         2,651,432
Twitter   319 Mb      2,360,148   162,096,241        30,451,170
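
The summary above can be reproduced along the lines sketched below, using stringi for the word counts; the exact table formatting with kable is omitted.

## Sketch of how the summary statistics can be computed
files <- list(Blogs = Blogs, News = News, Twitter = Twitter)
FileSummary <- data.frame(
    File            = names(files),
    File.Size       = sapply(files, function(x) format(object.size(x), units = "Mb")),
    Lines           = sapply(files, length),
    TotalCharacters = sapply(files, function(x) sum(nchar(x))),
    Words           = sapply(files, function(x) sum(stri_count_words(x)))
)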

Considering the size of each data set, and since we are interested in generating useful information without compromising memory usage, we will sample 10% of each file.

set.seed(849775)
sBlogs <- sample(Blogs, length(Blogs)*0.1)
sNews <- sample(News, length(News)*0.1)
sTwitter <- sample(Twitter, length(Twitter)*0.1)
# Free the memory used by the full data sets
rm(Blogs, News, Twitter)

Cleaning and preprocessing the data

The sampled data will be structured as a Corpus:

sData <- c(sBlogs, sNews, sTwitter)
corpus <- VCorpus(VectorSource(sData))

Whitespace is stripped, the content is converted to lowercase, and numbers and punctuation are removed, since these characters provide no useful information about the data. Stopwords could also be removed from the dataset (just uncomment the marked line).

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
## corpus <- tm_map(corpus, removeWords, stopwords("english"))

Tokenization & N-Gram Frequencies

We will build Term Document Matrices to represent the dataset as N-gram structures, using custom RWeka tokenizers.

UniTokenizer <- function(x){NGramTokenizer(x, Weka_control(min = 1, max = 1))}
BiTokenizer <- function(x) {NGramTokenizer(x, Weka_control(min = 2, max = 2))}
TriTokenizer <- function(x){NGramTokenizer(x, Weka_control(min = 3, max = 3))}

The Term Document Matrices (TDMs) are built and the sparse terms are removed:

UnigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = UniTokenizer))
BigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = BiTokenizer))
TrigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = TriTokenizer))
UnigramMatrix <- removeSparseTerms(UnigramMatrix, 0.99)
BigramMatrix <- removeSparseTerms(BigramMatrix, 0.99)
TrigramMatrix <- removeSparseTerms(TrigramMatrix, 0.999)

Basic metadata is shown for each TDM:
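
These summaries can be reproduced simply by printing each matrix object, for example:

print(UnigramMatrix)   # likewise for BigramMatrix and TrigramMatrix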

The “Unigram-Matrix”

## <<TermDocumentMatrix (terms: 182, documents: 333667)>>
## Non-/sparse entries: 1963071/58764323
## Sparsity           : 97%
## Maximal term length: 9
## Weighting          : term frequency (tf)

The “Bigram-Matrix”

## <<TermDocumentMatrix (terms: 55, documents: 333667)>>
## Non-/sparse entries: 329466/18022219
## Sparsity           : 98%
## Maximal term length: 10
## Weighting          : term frequency (tf)

The “Trigram-Matrix”

## <<TermDocumentMatrix (terms: 144, documents: 333667)>>
## Non-/sparse entries: 82604/47965444
## Sparsity           : 100%
## Maximal term length: 20
## Weighting          : term frequency (tf)

Exploratory Analysis

Using the previous n-gram matrices, the most common terms are extracted using the findFreqTerms function:

Freq1 <- findFreqTerms(UnigramMatrix, lowfreq = 50)
Freq2 <- findFreqTerms(BigramMatrix, lowfreq = 50)
Freq3 <- findFreqTerms(TrigramMatrix, lowfreq = 50)

The function shown below is used to calculate the frequency count for every term in the Term Document Matrices.

NGramDF <- function (termdocmat, freqMat){
    # Sum the counts of each frequent term across all documents
    s1 <- rowSums(as.matrix(termdocmat[freqMat, ]))
    # Return a data frame of terms and frequencies, sorted in decreasing order
    s1 <- data.frame(NGram=names(s1), frequency=s1)
    s1 <- s1[order(s1$frequency, decreasing = TRUE),]
    return(s1)
}
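
Applying this function to each matrix yields one frequency data frame per n-gram group. The name N3GF is used later in the word cloud code; N1GF and N2GF are assumed to follow the same convention:

N1GF <- NGramDF(UnigramMatrix, Freq1)
N2GF <- NGramDF(BigramMatrix, Freq2)
N3GF <- NGramDF(TrigramMatrix, Freq3)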

Using each data frame containing the n-gram terms and their associated frequencies, the following tables are generated:

The most common unigrams (with associated frequency):

NGram   frequency
the     293,499
and     158,187
you     84,657
for     76,984
that    71,407
with    48,197

The most common bigrams (with associated frequency):

NGram     frequency
of the    25,709
in the    24,579
for the   13,750
to the    13,380
on the    12,958
to be     12,031

The most common trigrams (with associated frequency):

NGram            frequency
thanks for the   2,383
one of the       2,170
a lot of         1,890
i want to        1,334
to be a          1,323
going to be      1,286

Finally, a plot is generated for every n-gram group to visualize the frequency of its most common terms:
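
One way to produce such a plot with ggplot2 is sketched below; it assumes the N1GF data frame built earlier and shows the 20 most frequent unigrams (the same pattern applies to N2GF and N3GF).

## Bar plot of the 20 most frequent unigrams (sketch)
ggplot(head(N1GF, 20), aes(x = reorder(NGram, frequency), y = frequency)) +
    geom_col() +
    coord_flip() +
    labs(x = "Unigram", y = "Frequency", title = "Most frequent unigrams")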

A word cloud is plotted to further visualize the data for the most frequent trigrams:

library(RColorBrewer)
wordcloud(words=N3GF$NGram, freq=N3GF$frequency, max.words = 140, random.order = FALSE, rot.per=0.35, colors = brewer.pal(8, "Set1"))

Findings

The most interesting findings about the data are the statistics and metadata generated and shown in this report. It is also interesting to notice that the most common unigrams are in fact “stopwords”. Since these words could be useful for building a prediction model, they are not initially removed.
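
This observation can be verified with a quick check, again assuming the N1GF data frame built above:

## How many of the ten most frequent unigrams are English stopwords?
sum(head(as.character(N1GF$NGram), 10) %in% stopwords("english"))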

Next Steps and Further Development