Introduction

The goal of this project is to demonstrate that the data has been downloaded and explored, and that work is on track towards a prediction algorithm. This report explains the exploratory analysis and the goals for the eventual app and algorithm. It covers only the major features of the data identified so far and briefly summarizes the plans for creating the prediction algorithm and Shiny app, in a way that is understandable to a non-data-scientist manager. Tables and plots are used to illustrate important summaries of the data set.

The motivation for this project is to:

  1. Demonstrate that the data has been successfully downloaded & loaded.
  2. Create a basic report of summary statistics about the data sets.
  3. Report any interesting findings that have been amassed so far.
  4. Get feedback on the plans for creating a prediction algorithm and Shiny app.

Prework and loading libraries

A series of libraries and toolsets is loaded:

library(NLP); library(tm); library(RWeka); library(ggplot2);
library(dplyr); library(wordcloud); library(knitr); library(kableExtra);
library(stringi)

In order to begin the exploratory data analysis, the dataset is downloaded:

Capstone Dataset

Once the archive has been extracted and the working directory defined, the dataset is loaded into a “corpus” structure.
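
A minimal sketch of this step is shown below; the download URL and the corpus.location path are assumptions based on the standard layout of the Coursera SwiftKey archive.

## Sketch of the download/extract step (URL and paths are assumptions)
dataset.url <- "https://d396qusza40orc.cloudfront.net/dscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(dataset.url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
    unzip("Coursera-SwiftKey.zip")
}
corpus.location <- "final/en_US"   # folder containing the English text files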

Loading the data

Each document is read into R.

Blogs <- readLines(paste0(corpus.location, "/en_US.blogs.txt"), encoding="UTF-8", warn = FALSE)
News <- readLines(paste0(corpus.location, "/en_US.news.txt"), encoding="UTF-8", warn = FALSE)
Twitter <- readLines(paste0(corpus.location, "/en_US.twitter.txt"), encoding="UTF-8", warn = FALSE)

The summary for the documents read is shown:

File      File Size   Lines       Total Characters   Words
Blogs     255.4 Mb    899,288     206,824,505        37,570,839
News      19.8 Mb     77,259      15,639,408         2,651,432
Twitter   319 Mb      2,360,148   162,096,241        30,451,170
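
The summary above can be reproduced along the lines sketched below, using stringi for the word counts; the exact table formatting with kable is omitted.

## Sketch of how the summary statistics can be computed
files <- list(Blogs = Blogs, News = News, Twitter = Twitter)
FileSummary <- data.frame(
    File            = names(files),
    File.Size       = sapply(files, function(x) format(object.size(x), units = "Mb")),
    Lines           = sapply(files, length),
    TotalCharacters = sapply(files, function(x) sum(nchar(x))),
    Words           = sapply(files, function(x) sum(stri_count_words(x)))
)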

Considering the size of each data set, and since we are interested in generating useful information without compromising memory usage, we will sample 10% of each file.

set.seed(849775)
sBlogs <- sample(Blogs, length(Blogs)*0.1)
sNews <- sample(News, length(News)*0.1)
sTwitter <- sample(Twitter, length(Twitter)*0.1)
# Free the memory used by the full data sets
rm(Blogs, News, Twitter)

Cleaning and preprocessing the data

The sampled data will be structured as a Corpus:

sData <- c(sBlogs, sNews, sTwitter)
corpus <- VCorpus(VectorSource(sData))

Whitespace is stripped, the content is converted to lowercase, and numbers and punctuation are removed, since these characters provide no useful information about the data. Stopwords could also be removed from the dataset (just uncomment the marked line).

corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
## corpus <- tm_map(corpus, removeWords, stopwords("english"))

Tokenization & N-Gram Frequencies

We will build Term Document Matrices to represent the dataset as N-gram structures, using custom RWeka tokenizers.

UniTokenizer <- function(x){NGramTokenizer(x, Weka_control(min = 1, max = 1))}
BiTokenizer <- function(x) {NGramTokenizer(x, Weka_control(min = 2, max = 2))}
TriTokenizer <- function(x){NGramTokenizer(x, Weka_control(min = 3, max = 3))}

The Term Document Matrices (TDMs) are built and the sparse terms are removed:

UnigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = UniTokenizer))
BigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = BiTokenizer))
TrigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = TriTokenizer))
UnigramMatrix <- removeSparseTerms(UnigramMatrix, 0.99)
BigramMatrix <- removeSparseTerms(BigramMatrix, 0.99)
TrigramMatrix <- removeSparseTerms(TrigramMatrix, 0.999)

Basic metadata is shown for each TDM:
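
These summaries can be reproduced simply by printing each matrix object, for example:

print(UnigramMatrix)   # likewise for BigramMatrix and TrigramMatrix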

The “Unigram-Matrix”

## <<TermDocumentMatrix (terms: 182, documents: 333667)>>
## Non-/sparse entries: 1963071/58764323
## Sparsity           : 97%
## Maximal term length: 9
## Weighting          : term frequency (tf)

The “Bigram-Matrix”

## <<TermDocumentMatrix (terms: 55, documents: 333667)>>
## Non-/sparse entries: 329466/18022219
## Sparsity           : 98%
## Maximal term length: 10
## Weighting          : term frequency (tf)

The “Trigram-Matrix”

## <<TermDocumentMatrix (terms: 144, documents: 333667)>>
## Non-/sparse entries: 82604/47965444
## Sparsity           : 100%
## Maximal term length: 20
## Weighting          : term frequency (tf)

Exploratory Analysis

Using the previous n-gram matrices, the most common terms are extracted using the findFreqTerms function:

Freq1 <- findFreqTerms(UnigramMatrix, lowfreq = 50)
Freq2 <- findFreqTerms(BigramMatrix, lowfreq = 50)
Freq3 <- findFreqTerms(TrigramMatrix, lowfreq = 50)

The function shown below is used to calculate the frequency count for every term in the Term Document Matrices.

NGramDF <- function (termdocmat, freqMat){
    # Sum the counts of each frequent term across all documents
    s1 <- rowSums(as.matrix(termdocmat[freqMat, ]))
    # Return a data frame of terms and frequencies, sorted in decreasing order
    s1 <- data.frame(NGram=names(s1), frequency=s1)
    s1 <- s1[order(s1$frequency, decreasing = TRUE),]
    return(s1)
}
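
Applying this function to each matrix yields one frequency data frame per n-gram group. The name N3GF is used later in the word cloud code; N1GF and N2GF are assumed to follow the same convention:

N1GF <- NGramDF(UnigramMatrix, Freq1)
N2GF <- NGramDF(BigramMatrix, Freq2)
N3GF <- NGramDF(TrigramMatrix, Freq3)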

Using each data frame containing the n-gram terms and their associated frequencies, the following tables are generated:

The most common unigrams (with associated frequency):

NGram   frequency
the     293,499
and     158,187
you     84,657
for     76,984
that    71,407
with    48,197

The most common bigrams (with associated frequency):

NGram     frequency
of the    25,709
in the    24,579
for the   13,750
to the    13,380
on the    12,958
to be     12,031

The most common trigrams (with associated frequency):

NGram            frequency
thanks for the   2,383
one of the       2,170
a lot of         1,890
i want to        1,334
to be a          1,323
going to be      1,286

Finally, a plot is generated for every n-gram group to visualize the frequency of its most common terms:
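
One way to produce such a plot with ggplot2 is sketched below; it assumes the N1GF data frame built earlier and shows the 20 most frequent unigrams (the same pattern applies to N2GF and N3GF).

## Bar plot of the 20 most frequent unigrams (sketch)
ggplot(head(N1GF, 20), aes(x = reorder(NGram, frequency), y = frequency)) +
    geom_col() +
    coord_flip() +
    labs(x = "Unigram", y = "Frequency", title = "Most frequent unigrams")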

A word cloud is plotted to further visualize the data for the most frequent trigrams:

library(RColorBrewer)
wordcloud(words=N3GF$NGram, freq=N3GF$frequency, max.words = 140, random.order = FALSE, rot.per=0.35, colors = brewer.pal(8, "Set1"))

Findings

The most interesting findings about the data are the statistics and metadata generated and shown in this report. It is also interesting to notice that the most common unigrams are in fact “stopwords”. Since these words could be useful for building a prediction model, they are not initially removed.
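
This observation can be verified with a quick check, again assuming the N1GF data frame built above:

## How many of the ten most frequent unigrams are English stopwords?
sum(head(as.character(N1GF$NGram), 10) %in% stopwords("english"))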

Next Steps and Further Development