1. Introduction

This report demonstrates exploratory analysis of the project’s training data as a first step in applying data science to natural language processing (NLP). The final goal of the course is to build a Shiny application that takes input from the user and tries to predict the next word. The course provides a set of files containing text extracted from blogs, news/media sites and Twitter, to be used as input for building the prediction algorithm. This report analyzes a subset of that data.

Preload Necessary R Libraries

library(dplyr)

## Warning: package 'dplyr' was built under R version 3.2.4

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(doParallel)

## Warning: package 'doParallel' was built under R version 3.2.4

## Loading required package: foreach

## Warning: package 'foreach' was built under R version 3.2.4

## Loading required package: iterators

## Loading required package: parallel

library(stringi)
library(tm)

## Loading required package: NLP

library(slam)
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.2.4

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

library(wordcloud)

## Loading required package: RColorBrewer

Set Up Parallel Clusters To Accelerate Execution Time

jobcluster <- makeCluster(detectCores())
invisible(clusterEvalQ(jobcluster, library(tm)))
invisible(clusterEvalQ(jobcluster, library(slam)))
invisible(clusterEvalQ(jobcluster, library(stringi)))
invisible(clusterEvalQ(jobcluster, library(wordcloud)))
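A cluster only speeds things up if work is dispatched to it, and its workers should be released once they are no longer needed. The following is a minimal sketch rather than the report's own code: the two-worker cluster `cl`, the toy `lines` vector and the `count_tokens` helper are all assumptions made for illustration.

```r
library(parallel)

# Assumption: a small two-worker cluster; the report itself uses detectCores()
cl <- makeCluster(2)

# Hypothetical toy input standing in for lines of text
lines <- c("a b c", "d e", "f")

# Hypothetical helper: count whitespace-separated tokens in one line
count_tokens <- function(x) length(strsplit(x, " ", fixed = TRUE)[[1]])

# Distribute the work across the workers and collect the results
counts <- unlist(parLapply(cl, lines, count_tokens))

# Release the workers when finished
stopCluster(cl)

print(counts)
```

In the report's session, the equivalent cleanup call would be `stopCluster(jobcluster)` once the analysis has finished.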

2. Data

The data provided on the course site comprises four sets of training files, one per locale (de_DE - German, en_US - English, fi_FI - Finnish and ru_RU - Russian), with each set containing three text files with text from blogs, news/media sites and Twitter. This analysis focuses on the English set of files: en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt.

3. Loading Raw Data

Load data from the three (3) data files in binary mode to preserve all characters.

Read Blogs Data In Binary Mode

conn <- file("final/en_US/en_US.blogs.txt", open = "rb")
blogs <- readLines(conn, encoding = "UTF-8")
close(conn)

Read News Data In Binary Mode

conn <- file("final/en_US/en_US.news.txt", open = "rb")
news <- readLines(conn, encoding = "UTF-8")
close(conn)

Read Twitter Data In Binary Mode

conn <- file("final/en_US/en_US.twitter.txt", open = "rb")
twits <- readLines(conn, encoding = "UTF-8")

## Warning in readLines(conn, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul

## Warning in readLines(conn, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul

## Warning in readLines(conn, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul

## Warning in readLines(conn, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul

close(conn)

rm(conn)
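The "embedded nul" warnings above come from nul bytes in the Twitter file. One possible way to handle them, shown here as a sketch on a hypothetical temporary file rather than on the course data, is the `skipNul` argument of `readLines()`, which drops nul bytes while reading.

```r
# Write a small temporary file containing an embedded nul byte (hypothetical data)
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("hello"), as.raw(0), charToRaw(" world\n"),
           charToRaw("second line\n")), tmp)

# Read in binary mode, skipping nul bytes instead of warning about them
conn <- file(tmp, open = "rb")
clean <- readLines(conn, encoding = "UTF-8", skipNul = TRUE)
close(conn)
unlink(tmp)

print(clean)
```

Whether skipping nul bytes is appropriate depends on the data. Here it is assumed that they carry no meaning.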

4. Analyzing Raw Data (Exploratory Analysis)

Basic statistics are computed for the three (3) data files, including line, character and word counts and Words Per Line (WPL) summaries. Basic histograms are plotted to show the distribution of WPL.

From the statistics, we observe that WPL is highest for blogs (mean 41.75), followed by news (mean 34.41) and Twitter (mean 12.75). This may reflect the expected attention span of readers of each type of content.

From the histograms, we also notice that WPL for all data types is right-skewed (i.e. has a longer right tail). This may indicate a general tendency towards short, concise communication.

Compute Words Per Line Info On Each Line For Each Data Type

rawWPL<-lapply(list(blogs,news,twits),function(x) stri_count_words(x))

#Compute Statistics And Summary Info For Each Data Type
rawstats<-data.frame(
            File=c("blogs","news","twitter"), 
            t(rbind(sapply(list(blogs,news,twits),stri_stats_general),
                    TotalWords=sapply(list(blogs,news,twits),stri_stats_latex)[4,])),
            # Compute words per line summary
            WPL=rbind(summary(rawWPL[[1]]),summary(rawWPL[[2]]),summary(rawWPL[[3]]))
            )
print(rawstats)

##      File   Lines LinesNEmpty     Chars CharsNWhite TotalWords WPL.Min.
## 1   blogs  899288      899288 206824382   170389539   37570839        0
## 2    news 1010242     1010242 203223154   169860866   34494539        1
## 3 twitter 2360148     2360148 162096031   134082634   30451128        1
##   WPL.1st.Qu. WPL.Median WPL.Mean WPL.3rd.Qu. WPL.Max.
## 1           9         28    41.75          60     6726
## 2          19         32    34.41          46     1796
## 3           7         12    12.75          18       47
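The right skew can also be read from the summary table: for every file the mean WPL exceeds the median (for example 41.75 versus 28 for blogs). The sketch below shows the same check on a hypothetical toy vector `toy`. It is an illustration under that assumption and does not use the report's data.

```r
library(stringi)

# Hypothetical toy lines: three short lines and one long one
toy <- c("a b", "a b c", "a b c d", paste(rep("word", 20), collapse = " "))

# Words per line, computed the same way as in the report
wpl <- stri_count_words(toy)

# For a right-skewed distribution the mean exceeds the median
print(c(mean = mean(wpl), median = median(wpl)))
right_skewed <- mean(wpl) > median(wpl)
print(right_skewed)
```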

Plot Histogram For Each Data Type

qplot(rawWPL[[1]],geom="histogram",main="Histogram for US Blogs",
      xlab="No. of Words",ylab="Frequency",binwidth=10)

qplot(rawWPL[[2]],geom="histogram",main="Histogram for US News",
      xlab="No. of Words",ylab="Frequency",binwidth=10)

qplot(rawWPL[[3]],geom="histogram",main="Histogram for US Twits",
      xlab="No. of Words",ylab="Frequency",binwidth=1)

rm(rawWPL);rm(rawstats)

5. Sampling Raw Data

Because the raw data is sizeable, we sample 30000 lines from each data type before cleaning the data and performing further exploratory analysis.

samplesize <- 30000  # Assign sample size
set.seed(2703)  # Ensure reproducibility 

#Create Raw Data And Sample Vectors
data <- list(blogs, news, twits)
sample <- list()

# Iterate over each raw data set to create a clean sample
for (i in seq_along(data)) {
    # Draw a random sample of line indices without replacement
    Filter <- sample(seq_along(data[[i]]), samplesize, replace = FALSE)
    sample[[i]] <- data[[i]][Filter]
    # Remove non-ASCII characters; iconv() is vectorized, so no inner loop is needed
    sample[[i]] <- iconv(sample[[i]], "latin1", "ASCII", sub = "")
}
rm(blogs)
rm(news)
rm(twits)
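The sampling step can also be wrapped in a small reusable function. This is a sketch only: `sample_lines`, its `seed` argument and the `toy` vector are hypothetical names introduced for illustration. The cap on `n` is an added safeguard for inputs shorter than the requested sample size.

```r
# Hypothetical helper: draw up to n lines at random, without replacement
sample_lines <- function(x, n, seed) {
  set.seed(seed)            # the same seed gives the same sample
  n <- min(n, length(x))    # guard against inputs shorter than n
  x[sample.int(length(x), n)]
}

toy <- paste("line", 1:100)   # stand-in for a raw text vector
s1 <- sample_lines(toy, 10, seed = 2703)
s2 <- sample_lines(toy, 10, seed = 2703)

print(identical(s1, s2))                          # reproducible
print(length(sample_lines(toy, 500, seed = 1)))   # capped at length(toy)
```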

6. Creating Corpus and Cleaning Data

A corpus is created for each data type and then cleaned (e.g. removing punctuation, white space, numbers and stopwords, and converting text to lowercase). Stemming is performed to reduce inflected forms of a word to a common stem. Finally, a document-term matrix is created to record term occurrences in documents (lines).

Create Corpus and Document Term Matrix Vectors

corpus <- list()
dtMatrix <- list()

# Iterate Each Sample Data To Create Corpus and Document Term Matrix
for (i in 1:length(sample)) {
    # Create corpus dataset
    corpus[[i]] <- Corpus(VectorSource(sample[[i]]))
    # Cleaning/stemming the data
    # Wrap base functions in content_transformer() so documents keep their class
    corpus[[i]] <- tm_map(corpus[[i]], content_transformer(tolower))
    corpus[[i]] <- tm_map(corpus[[i]], removeNumbers)
    corpus[[i]] <- tm_map(corpus[[i]], removeWords, stopwords("english"))
    corpus[[i]] <- tm_map(corpus[[i]], removePunctuation)
    corpus[[i]] <- tm_map(corpus[[i]], stemDocument)
    corpus[[i]] <- tm_map(corpus[[i]], stripWhitespace)
    corpus[[i]] <- tm_map(corpus[[i]], PlainTextDocument)
    # calculate document term frequency for corpus
    dtMatrix[[i]] <- DocumentTermMatrix(corpus[[i]], control = list(wordLengths = c(0, Inf)))
}
rm(data)
rm(sample)
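To make the effect of the cleaning steps concrete, the sketch below applies a base R approximation of them to a single line. It is not the `tm` pipeline used above: the `clean_line` helper is hypothetical, and stopword removal and stemming are left out because they depend on `tm` and its stemmer.

```r
# Hypothetical helper approximating part of the cleaning pipeline in base R
clean_line <- function(x) {
  x <- tolower(x)                     # convert to lowercase
  x <- gsub("[[:digit:]]+", "", x)    # remove numbers
  x <- gsub("[[:punct:]]+", "", x)    # remove punctuation
  x <- gsub("[[:space:]]+", " ", x)   # collapse runs of white space
  trimws(x)                           # drop leading/trailing white space
}

cleaned <- clean_line("  The 3 Quick, brown FOXES!  ")
print(cleaned)
```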

7. Plotting Data in Word Cloud (Exploratory Analysis)

The corpus data is visualised as word clouds, which illustrate word frequencies effectively. The more frequent a word, the larger and more central it is drawn. One word cloud is plotted for each data type.

set.seed(2803)  # Ensure reproducibility
par(mfrow = c(1, 3))  # Establish Plotting Panel
headings = c("US Blogs Word Cloud", "US News Word Cloud", "US Twits Word Cloud")

# Iterate over each corpus/document-term matrix and plot the word cloud
for (i in 1:length(corpus)) {
    wordcloud(words = colnames(dtMatrix[[i]]), freq = col_sums(dtMatrix[[i]]), 
        scale = c(3, 1), max.words = 100, random.order = FALSE, rot.per = 0.35, 
        use.r.layout = FALSE, colors = brewer.pal(8, "Dark2"))
    title(headings[i])
}

The wordcloud() calls produced repeated warnings of the form "<word> could not be fit on page. It will not be plotted." The affected words, in order of appearance across the three clouds (some appear more than once), were: something, may, never, book, another, every, find, always, week, long, home, come, feel, since, use, used, god, read, sure, thing, today, best, place, old, though, lot, part, away, put, blog, thought, days, night, end, give, big, better, post, went, might, actually, ever, without, found, family, help, according, president, company, states, officials, come, night, former, great, never, place, never, show, week, hope, year, better, tomorrow, always, take, yes, getting, man, wait, hey, looking, game, twitter, weekend, even, look, haha, world, miss, home, yeah, everyone, help, big, keep, school, tweet, nice, gonna, sure, bad, give, little, check, morning, live, things, someone.
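A word cloud is a picture of a term frequency table, so the same information can be inspected numerically. The following sketch uses base R on a hypothetical two-document vector `docs`. It illustrates the idea and is not a replacement for the document-term matrices built earlier.

```r
# Hypothetical cleaned text standing in for one corpus
docs <- c("data science data", "science data nlp")

# Split into tokens and tabulate frequencies
tokens <- unlist(strsplit(docs, " ", fixed = TRUE))
freq <- sort(table(tokens), decreasing = TRUE)

# The most frequent terms are the ones a word cloud draws largest
print(head(freq, 3))
top_term <- names(freq)[1]
print(top_term)
```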

8. Plans Ahead For Project

With the exploratory analysis complete, the next steps are to build the predictive models and the data product. The high-level plan is as follows:

Use N-grams to generate tokens of one to four words.
Summarize token frequencies and find associations between tokens.
Build predictive models using the tokens.
Develop a data product (i.e. a Shiny app) that predicts the next word based on user input.
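As a rough illustration of the first step in the plan, the sketch below generates N-grams of one to four words from a token vector in base R. The `make_ngrams` helper and the toy `tokens` vector are hypothetical. A real implementation would more likely rely on a tokenizer package and the full cleaned corpus.

```r
# Hypothetical helper: build all n-grams of size n from a token vector
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))   # too few tokens for this n
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens <- c("i", "love", "data", "science")   # toy input

# N-grams of one to four words, as outlined in the plan
ngrams <- lapply(1:4, function(n) make_ngrams(tokens, n))
bigrams <- ngrams[[2]]
print(bigrams)
```

Tabulating these N-grams with `table()` over a full corpus would give the token frequencies mentioned in the second step.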

Data Science Capstone - Milestone Report

Anita Hassan

March 18, 2016