This report demonstrates exploratory analysis of the project's training data as a first step in applying data science to natural language processing (NLP). The final goal of the course is to build a Shiny application that takes text input from the user and predicts the next word. The course provides a set of files containing text extracted from blogs, news/media sites and Twitter, to be used as input for building the prediction algorithm; this report analyses a subset of that data.
library(dplyr)
## Warning: package 'dplyr' was built under R version 3.2.4
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(doParallel)
## Warning: package 'doParallel' was built under R version 3.2.4
## Loading required package: foreach
## Warning: package 'foreach' was built under R version 3.2.4
## Loading required package: iterators
## Loading required package: parallel
library(stringi)
library(tm)
## Loading required package: NLP
library(slam)
library(ggplot2)
## Warning: package 'ggplot2' was built under R version 3.2.4
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(wordcloud)
## Loading required package: RColorBrewer
# Start one worker per available core, then load the required packages on each worker
jobcluster <- makeCluster(detectCores())
invisible(clusterEvalQ(jobcluster, library(tm)))
invisible(clusterEvalQ(jobcluster, library(slam)))
invisible(clusterEvalQ(jobcluster, library(stringi)))
invisible(clusterEvalQ(jobcluster, library(wordcloud)))
The data provided on the course site comprises four sets of files (de_DE - German, en_US - English, fi_FI - Finnish and ru_RU - Russian), each containing three text files with text from blogs, news/media sites and Twitter. This analysis focuses on the English set: en_US.blogs.txt, en_US.news.txt and en_US.twitter.txt.
Load the three data files through binary-mode connections so that all characters are preserved.
conn <- file("final/en_US/en_US.blogs.txt", open = "rb")
blogs <- readLines(conn, encoding = "UTF-8")
close(conn)
conn <- file("final/en_US/en_US.news.txt", open = "rb")
news <- readLines(conn, encoding = "UTF-8")
close(conn)
conn <- file("final/en_US/en_US.twitter.txt", open = "rb")
twits <- readLines(conn, encoding = "UTF-8")
## Warning in readLines(conn, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul
## Warning in readLines(conn, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul
## Warning in readLines(conn, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines(conn, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul
close(conn)
rm(conn)
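The three loading steps above repeat the same open/read/close pattern, and the Twitter file triggers embedded-nul warnings. A minimal sketch of a reusable helper follows; the function name `read_text_file` is hypothetical (it is not part of the original report), and passing `skipNul = TRUE` to `readLines()` is an assumption about how one might silence those warnings, not what the report actually ran.

```r
# Hypothetical helper (not in the original report): read a text file as
# UTF-8 lines through a binary-mode connection, skipping embedded nuls.
read_text_file <- function(path) {
  conn <- file(path, open = "rb")
  on.exit(close(conn))  # close the connection even if readLines() fails
  readLines(conn, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-contained demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_text_file(tmp)
unlink(tmp)
```

With such a helper, each data file could be loaded in one call, for example `blogs <- read_text_file("final/en_US/en_US.blogs.txt")`.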
Basic statistics are computed for the three data files, including line, character and word counts and words-per-line (WPL) summaries. Histograms are plotted to show the distribution of WPL.
The statistics show that WPL is highest for blogs (mean 41.75), followed by news (mean 34.41) and Twitter (mean 12.75). This may reflect the expected attention span of each medium's readers and, for Twitter, the character limit imposed on each message.
The histograms also show that WPL is right-skewed (i.e. has a longer right tail) for all three data types. This may indicate a general preference for short, concise communication.
# Count words per line for each data type
rawWPL <- lapply(list(blogs, news, twits), function(x) stri_count_words(x))
# Compute statistics and summary info for each data type
rawstats <- data.frame(
  File = c("blogs", "news", "twitter"),
  t(rbind(sapply(list(blogs, news, twits), stri_stats_general),
          TotalWords = sapply(list(blogs, news, twits), stri_stats_latex)[4, ])),
  # Compute words per line summary
  WPL = rbind(summary(rawWPL[[1]]), summary(rawWPL[[2]]), summary(rawWPL[[3]]))
)
print(rawstats)
## File Lines LinesNEmpty Chars CharsNWhite TotalWords WPL.Min.
## 1 blogs 899288 899288 206824382 170389539 37570839 0
## 2 news 1010242 1010242 203223154 169860866 34494539 1
## 3 twitter 2360148 2360148 162096031 134082634 30451128 1
## WPL.1st.Qu. WPL.Median WPL.Mean WPL.3rd.Qu. WPL.Max.
## 1 9 28 41.75 60 6726
## 2 19 32 34.41 46 1796
## 3 7 12 12.75 18 47
qplot(rawWPL[[1]],geom="histogram",main="Histogram for US Blogs",
xlab="No. of Words",ylab="Frequency",binwidth=10)
qplot(rawWPL[[2]],geom="histogram",main="Histogram for US News",
xlab="No. of Words",ylab="Frequency",binwidth=10)
qplot(rawWPL[[3]],geom="histogram",main="Histogram for US Twits",
xlab="No. of Words",ylab="Frequency",binwidth=1)
rm(rawWPL);rm(rawstats)
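To make the words-per-line reasoning concrete, here is a small sketch on toy data. It uses base R whitespace splitting as a stand-in for `stri_count_words()`, so the counts are only an approximation of what stringi's word-boundary rules would give; the toy lines and the names `count_words`, `toy_wpl` and `is_right_skewed` are assumptions made for illustration.

```r
# Toy lines standing in for the real data (hypothetical example text)
toy_lines <- c("a short line",
               "another slightly longer line of text",
               "one",
               "this line is by far the longest of all the lines in the toy sample")

# Approximate words per line by splitting on runs of whitespace
count_words <- function(x) {
  lengths(strsplit(trimws(x), "\\s+"))
}

toy_wpl <- count_words(toy_lines)
wpl_summary <- summary(toy_wpl)

# A mean above the median is a quick indication of right skew
is_right_skewed <- mean(toy_wpl) > median(toy_wpl)
```

The same check applied to the table above (for example, blogs with a mean of 41.75 against a median of 28) is consistent with the long right tails seen in the histograms.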
Because the raw data is sizeable, we sample 30,000 lines from each data type before cleaning and performing exploratory analysis.
samplesize <- 30000 # Assign sample size
set.seed(2703) # Ensure reproducibility
# Create raw data and sample vectors
data <- list(blogs, news, twits)
sample <- list()
# Iterate over each raw data set to create a clean sample
for (i in seq_along(data)) {
  # Draw line indices without replacement; base::sample() is called
  # explicitly because a list named `sample` now exists as well
  Filter <- base::sample(seq_along(data[[i]]), samplesize, replace = FALSE)
  # Remove unconventional (non-ASCII) characters; iconv() is vectorised,
  # so the whole sample is cleaned in one call
  sample[[i]] <- iconv(data[[i]][Filter], "latin1", "ASCII", sub = "")
}
rm(blogs)
rm(news)
rm(twits)
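The sampling and cleaning step can be tried on a few lines before running it on the full data. The sketch below is illustrative only: the toy text and the names `toy_text`, `toy_sample` and `toy_clean` are hypothetical, and it converts from "UTF-8" rather than the "latin1" used in the report, on the assumption that the input strings are UTF-8.

```r
set.seed(2703)  # same seed value as the report, for reproducibility

# Hypothetical toy data standing in for one of the raw text vectors
toy_text <- c("caf\u00e9 culture", "plain ascii", "na\u00efve approach",
              "more text", "last line")

# Draw 3 line indices without replacement
toy_index <- sample(seq_along(toy_text), 3, replace = FALSE)
toy_sample <- toy_text[toy_index]

# iconv() is vectorised, so the whole sample is cleaned in one call;
# characters with no ASCII equivalent are dropped (sub = "")
toy_clean <- iconv(toy_sample, from = "UTF-8", to = "ASCII", sub = "")
```

Dropping characters this way is lossy (accented letters vanish instead of being transliterated), which is acceptable for a quick exploratory pass but worth revisiting for the final model.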
A corpus is created for each data type and then cleaned: text is converted to lowercase, and numbers, stopwords, punctuation and extra white space are removed. Stemming is performed so that inflected forms of the same word are counted together. Finally, a document-term matrix is created to record term occurrences in each document (line).
corpus <- list()
dtMatrix <- list()
# Iterate over each sample to create a corpus and document-term matrix
for (i in seq_along(sample)) {
  # Create corpus dataset
  corpus[[i]] <- Corpus(VectorSource(sample[[i]]))
  # Cleaning/stemming the data; content_transformer() keeps each document
  # a text document, so no final PlainTextDocument conversion is needed
  corpus[[i]] <- tm_map(corpus[[i]], content_transformer(tolower))
  corpus[[i]] <- tm_map(corpus[[i]], removeNumbers)
  corpus[[i]] <- tm_map(corpus[[i]], removeWords, stopwords("english"))
  corpus[[i]] <- tm_map(corpus[[i]], removePunctuation)
  corpus[[i]] <- tm_map(corpus[[i]], stemDocument)
  corpus[[i]] <- tm_map(corpus[[i]], stripWhitespace)
  # Calculate document term frequency for corpus
  dtMatrix[[i]] <- DocumentTermMatrix(corpus[[i]], control = list(wordLengths = c(0, Inf)))
}
rm(data)
rm(sample)
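For readers unfamiliar with the structure, a document-term matrix has one row per document and one column per term, with each cell holding a count. The following base R sketch builds one by hand for three toy documents; it is not how `tm::DocumentTermMatrix()` is implemented, and all `toy_*` names and the example text are hypothetical.

```r
# Hypothetical toy documents, already lower-cased and stripped of punctuation
toy_docs <- c("data science is fun", "text data is messy", "science needs data")

# Tokenise each document on whitespace
toy_tokens <- strsplit(toy_docs, "\\s+")

# Vocabulary: sorted unique terms across all documents
toy_terms <- sort(unique(unlist(toy_tokens)))

# One row per document, one column per term, each cell a term count
toy_dtm <- t(vapply(toy_tokens, function(tokens) {
  tabulate(match(tokens, toy_terms), nbins = length(toy_terms))
}, integer(length(toy_terms))))
colnames(toy_dtm) <- toy_terms

# Column totals give the overall frequency of each term,
# which is the role col_sums() from slam plays for sparse matrices
toy_term_totals <- colSums(toy_dtm)
```

The real matrices are far too large and sparse for a dense representation like this, which is why the report relies on tm and slam.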
The corpus data is used to draw word clouds, which illustrate word frequencies effectively: the most frequent words are drawn larger and closer to the centre. One word cloud is plotted for each data type.
set.seed(2803) # Ensure reproducibility
par(mfrow = c(1, 3)) # Establish Plotting Panel
headings <- c("US Blogs Word Cloud", "US News Word Cloud", "US Twits Word Cloud")
# Iterate over each corpus/document-term matrix and plot the word cloud
for (i in seq_along(corpus)) {
  wordcloud(words = colnames(dtMatrix[[i]]), freq = col_sums(dtMatrix[[i]]),
            scale = c(3, 1), max.words = 100, random.order = FALSE, rot.per = 0.35,
            use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
  title(headings[i])
}
## Warning in wordcloud(words = colnames(dtMatrix[[i]]), freq =
## col_sums(dtMatrix[[i]]), : something could not be fit on page. It will not
## be plotted.
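Since some words could not be fitted into the word cloud panels, a plain ranking of the most frequent terms can serve as a complementary view. The sketch below shows the idea on made-up numbers; the frequencies and the names `toy_freq`, `top_terms` and `top_table` are hypothetical and do not come from the report's data.

```r
# Hypothetical term frequencies, standing in for col_sums(dtMatrix[[i]])
toy_freq <- c(said = 120, will = 95, one = 80, year = 60, time = 150, new = 40)

# Rank terms from most to least frequent and keep the top three,
# similar in spirit to what max.words does for the word cloud
top_terms <- head(sort(toy_freq, decreasing = TRUE), 3)

# Tabulate the result for inspection alongside the plot
top_table <- data.frame(term = names(top_terms),
                        freq = unname(top_terms),
                        stringsAsFactors = FALSE)
```

A table like this loses the visual appeal of the cloud but never drops a term because of layout constraints.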
With the exploratory analysis complete, the next step is to build the predictive model and the data product. The high-level plan to achieve this goal is as follows: