The goal of this project is to demonstrate familiarity with the data and to show that the work is on track towards a prediction algorithm. This report presents the exploratory analysis and the goals for the eventual app and algorithm: it describes the major features of the data identified so far and briefly summarizes the plans for the prediction algorithm and Shiny app in a way that is understandable to a non-data-scientist manager, using tables and plots to illustrate important summaries of the data set.
The motivation for this project is to show that the data has been downloaded and successfully loaded, to report summary statistics and interesting findings about the data, and to outline the plans for the prediction algorithm and Shiny app.
A series of libraries and toolsets is loaded:
library(NLP); library(tm); library(RWeka)    # text mining and n-gram tokenization
library(ggplot2); library(wordcloud)         # visualization
library(dplyr); library(stringi)             # data manipulation and string statistics
library(knitr); library(kableExtra)          # table rendering
To begin the exploratory analysis, the dataset is downloaded. Once the archive has been extracted and the working directory defined, the dataset is loaded into a “corpus” structure.
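A minimal sketch of this step is shown below, assuming the standard Coursera-SwiftKey archive URL and the folder layout of the extracted archive (adjust both for your own setup):
# Download and extract the dataset (URL and local paths are assumptions)
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
if (!file.exists("Coursera-SwiftKey.zip")) {
    download.file(url, destfile = "Coursera-SwiftKey.zip", mode = "wb")
    unzip("Coursera-SwiftKey.zip")
}
corpus.location <- "final/en_US"   # location of the English files after extraction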
Each document is read into R.
Blogs <- readLines(paste0(corpus.location, "/en_US.blogs.txt"), encoding="UTF-8", warn = FALSE)
News <- readLines(paste0(corpus.location, "/en_US.news.txt"), encoding="UTF-8", warn = FALSE)
Twitter <- readLines(paste0(corpus.location, "/en_US.twitter.txt"), encoding="UTF-8", warn = FALSE)
A summary of the files read is shown below:
File | File Size | Lines | Total Characters | Words |
---|---|---|---|---|
Blogs | 255.4 Mb | 899,288 | 206,824,505 | 37,570,839 |
News | 19.8 Mb | 77,259 | 15,639,408 | 2,651,432 |
Twitter | 319 Mb | 2,360,148 | 162,096,241 | 30,451,170 |
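The table above can be reproduced along the following lines with stringi (loaded earlier); this is a sketch rather than the exact code used for the report:
# Summary statistics per file: in-memory size, line count, character count, word count
FileSummary <- data.frame(
    File            = c("Blogs", "News", "Twitter"),
    File.Size       = sapply(list(Blogs, News, Twitter),
                             function(x) format(object.size(x), units = "Mb")),
    Lines           = sapply(list(Blogs, News, Twitter), length),
    TotalCharacters = sapply(list(Blogs, News, Twitter), function(x) sum(nchar(x))),
    Words           = sapply(list(Blogs, News, Twitter), function(x) sum(stri_count_words(x)))
)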
Given the size of each data set, and to extract useful information without compromising memory usage, a random sample of 10% of each file is taken.
set.seed(849775)                                 # for reproducibility
sBlogs <- sample(Blogs, length(Blogs)*0.1)       # 10% sample of each source
sNews <- sample(News, length(News)*0.1)
sTwitter <- sample(Twitter, length(Twitter)*0.1)
rm(Blogs); rm(News); rm(Twitter)                 # free the memory used by the full files
The sampled data will be structured as a Corpus:
sData <- c(sBlogs, sNews, sTwitter)
corpus <- VCorpus(VectorSource(sData))
The content is converted to lowercase, and punctuation, numbers, and extra whitespace are removed, since these characters provide no useful information about the data. Stopwords could also be removed from the dataset by uncommenting the marked line.
corpus <- tm_map(corpus, content_transformer(tolower))  # lowercase; content_transformer keeps documents as PlainTextDocuments
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeNumbers)
## corpus <- tm_map(corpus, removeWords, stopwords("english"))  # uncomment to remove stopwords
Term-document matrices are built to represent the dataset as n-gram structures, using custom tokenizers based on RWeka's NGramTokenizer:
UniTokenizer <- function(x) {NGramTokenizer(x, Weka_control(min = 1, max = 1))}
BiTokenizer  <- function(x) {NGramTokenizer(x, Weka_control(min = 2, max = 2))}
TriTokenizer <- function(x) {NGramTokenizer(x, Weka_control(min = 3, max = 3))}
The term-document matrices (TDMs) are built and the sparse terms are removed; removeSparseTerms keeps only terms that appear in at least 1% of the documents (0.1% for trigrams):
UnigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = UniTokenizer))
BigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = BiTokenizer))
TrigramMatrix <- TermDocumentMatrix(corpus, control = list(tokenize = TriTokenizer))
UnigramMatrix <- removeSparseTerms(UnigramMatrix, 0.99)
BigramMatrix <- removeSparseTerms(BigramMatrix, 0.99)
TrigramMatrix <- removeSparseTerms(TrigramMatrix, 0.999)
Basic metadata about each TDM is shown below:
The “Unigram-Matrix”
## <<TermDocumentMatrix (terms: 182, documents: 333667)>>
## Non-/sparse entries: 1963071/58764323
## Sparsity : 97%
## Maximal term length: 9
## Weighting : term frequency (tf)
The “Bigram-Matrix”
## <<TermDocumentMatrix (terms: 55, documents: 333667)>>
## Non-/sparse entries: 329466/18022219
## Sparsity : 98%
## Maximal term length: 10
## Weighting : term frequency (tf)
The “Trigram-Matrix”
## <<TermDocumentMatrix (terms: 144, documents: 333667)>>
## Non-/sparse entries: 82604/47965444
## Sparsity : 100%
## Maximal term length: 20
## Weighting : term frequency (tf)
Using these n-gram matrices, the most frequent terms (appearing at least 50 times) are extracted with the findFreqTerms function:
Freq1 <- findFreqTerms(UnigramMatrix, lowfreq = 50)
Freq2 <- findFreqTerms(BigramMatrix, lowfreq = 50)
Freq3 <- findFreqTerms(TrigramMatrix, lowfreq = 50)
The function below computes the total frequency of each of these frequent terms across all documents and returns a data frame sorted by frequency:
NGramDF <- function(termdocmat, freqTerms) {
    # Total frequency of each frequent term across all documents
    s1 <- rowSums(as.matrix(termdocmat[freqTerms, ]))
    s1 <- data.frame(NGram = names(s1), frequency = s1)
    # Sort from most to least frequent
    s1 <- s1[order(s1$frequency, decreasing = TRUE), ]
    return(s1)
}
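The helper is applied to each matrix together with its list of frequent terms; the data frame names below are assumed (N3GF is the name used later by the word cloud code):
# Frequency data frames for the unigram, bigram, and trigram matrices
N1GF <- NGramDF(UnigramMatrix, Freq1)
N2GF <- NGramDF(BigramMatrix, Freq2)
N3GF <- NGramDF(TrigramMatrix, Freq3)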
Using each data frame containing the n-gram terms and their associated frequencies, the following tables are generated.
The most common unigrams (with associated frequencies):

NGram | frequency |
---|---|
the | 293,499 |
and | 158,187 |
you | 84,657 |
for | 76,984 |
that | 71,407 |
with | 48,197 |
The most common bigrams (with associated frequencies):

NGram | frequency |
---|---|
of the | 25,709 |
in the | 24,579 |
for the | 13,750 |
to the | 13,380 |
on the | 12,958 |
to be | 12,031 |
The most common trigrams (with associated frequencies):

NGram | frequency |
---|---|
thanks for the | 2,383 |
one of the | 2,170 |
a lot of | 1,890 |
i want to | 1,334 |
to be a | 1,323 |
going to be | 1,286 |
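The tables above can be rendered with knitr::kable and kableExtra, both loaded at the start; a minimal sketch for the unigram table, assuming the N1GF data frame from the previous step:
# Render the top unigrams as a formatted table
kable(head(N1GF), row.names = FALSE, format.args = list(big.mark = ",")) %>%
    kable_styling(full_width = FALSE)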
Finally, a bar chart is generated for each n-gram group to visualize the frequencies of its most common terms:
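A sketch of one such plot with ggplot2 (loaded earlier) is shown for the trigram group; the same pattern applies to the assumed N1GF and N2GF data frames:
# Bar chart of the 20 most frequent trigrams
ggplot(head(N3GF, 20), aes(x = reorder(NGram, frequency), y = frequency)) +
    geom_col(fill = "steelblue") +
    coord_flip() +
    labs(x = "Trigram", y = "Frequency", title = "Most frequent trigrams")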
A word cloud is plotted to further visualize the data for the most frequent trigrams:
library(RColorBrewer)
# Word cloud of the most frequent trigrams, sized by frequency
wordcloud(words = N3GF$NGram, freq = N3GF$frequency, max.words = 140,
          random.order = FALSE, rot.per = 0.35, colors = brewer.pal(8, "Set1"))
The most interesting findings about the data are the statistics and metadata generated and shown in this report. It is also interesting to note that the most common unigrams are in fact stopwords. Since these words could be useful for building a prediction model, they are not removed initially.
The next steps for the project are to:

* Build a prediction model that takes an n-gram term as input and uses the associated frequencies to predict what the user might want to type next (a minimal sketch of this idea follows below).
* Optionally load a list of banned words to further filter the dataset.
* Build a data product that wraps the prediction model and is easy to use for non-data scientists.
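A minimal sketch of the frequency-lookup idea behind the planned model, assuming the N2GF and N3GF frequency tables built above; the eventual model will need proper smoothing and backoff, which are not shown here:
# Hypothetical next-word lookup: given the last two words typed, find the most
# frequent trigrams that start with them and return their final words
PredictNextWord <- function(lastTwoWords, n = 3) {
    pattern <- paste0("^", tolower(lastTwoWords), " ")
    matches <- N3GF[grepl(pattern, N3GF$NGram), ]
    if (nrow(matches) == 0) {
        # Back off to bigrams that start with the last word only
        lastWord <- tail(strsplit(tolower(lastTwoWords), " ")[[1]], 1)
        matches <- N2GF[grepl(paste0("^", lastWord, " "), N2GF$NGram), ]
    }
    # N2GF/N3GF are already sorted by frequency, so the top matches come first
    head(sub(".* ", "", as.character(matches$NGram)), n)
}

PredictNextWord("thanks for")   # most likely suggestion: "the"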