Capstone: Milestone Report

The goal of this milestone report is to show that the data can be loaded and explored, and to outline how the prediction algorithm will be built. The report is published on RPubs (http://rpubs.com/) and describes the exploratory analysis and the goals for the eventual app and algorithm. It is meant to be concise and understandable to a non-data-scientist manager: it explains the major features of the data identified so far and briefly summarizes the plans for creating the prediction algorithm and Shiny app. Tables and plots are used to illustrate the important summaries of the data set. The motivation for this project is to:

Demonstrate that the data has been loaded successfully.
Create a basic report of summary statistics about the data sets.
Report any interesting findings amassed so far.
Get feedback on the plans for creating a prediction algorithm and Shiny app.

Environment

The packages used for the analysis are: NLP, tm, RWeka, stringi, stringr, ggplot2, knitr, dplyr, RColorBrewer, and wordcloud.

library(NLP)
library(tm)
library(RWeka)
library(stringi)
library(stringr)
library(ggplot2)

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

library(knitr)
library(dplyr)

## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union

library(RColorBrewer)
library(wordcloud)

Data Summary

The project dataset was downloaded from the Coursera website (Project Dataset). The data contains several languages (DE, US, FI, and RU). This project uses the English subset, which consists of blogs, news, and tweets originally derived from the HC Corpora collection. The three English files were read from the final/en_US directory. The original ZIP file is about 560 MB. The table below summarizes each file: size, number of lines, number of words, and mean number of words per line.

doc1 <- file("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "rb")
blogs <- readLines(doc1, encoding="UTF-8")
close(doc1)

doc2 <- file("Coursera-SwiftKey/final/en_US/en_US.news.txt", "rb")
news <- readLines(doc2, encoding="UTF-8")
close(doc2)

doc3 <- file("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "rb")
twitter <- readLines(doc3, encoding="UTF-8")

## Warning in readLines(doc3, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul

## Warning in readLines(doc3, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul

## Warning in readLines(doc3, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul

## Warning in readLines(doc3, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul

close(doc3)
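
The embedded-nul warnings above come from NUL bytes in the Twitter file. One possible way to avoid them is the skipNul argument of readLines(). The sketch below demonstrates it on a small temporary file whose contents are made up for illustration; it is not the code that produced this report.

```r
# Write a tiny file that contains an embedded NUL byte (illustrative data only)
tmp <- tempfile(fileext = ".txt")
writeBin(c(charToRaw("first line\n"),
           charToRaw("bad"), as.raw(0), charToRaw("line\n")), tmp)

# Read it back in binary mode, asking readLines() to skip NUL bytes
con <- file(tmp, "rb")
txt <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)

length(txt)
```

Applied to the Twitter file, the same argument should let the affected lines be read without the warnings.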

words_blogs <- stri_count_words(blogs)
words_news <- stri_count_words(news)
words_twitter <- stri_count_words(twitter)

size_blogs <- file.info("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size/1024^2
size_news <- file.info("Coursera-SwiftKey/final/en_US/en_US.news.txt")$size/1024^2
size_twitter <- file.info("Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size/1024^2
DataSummary <- data.frame(filename = c("blogs","news","twitter"),
                            file_size_MB = c(size_blogs, size_news, size_twitter),
                            num_lines = c(length(blogs),length(news),length(twitter)),
                            num_words = c(sum(words_blogs),sum(words_news),sum(words_twitter)),
                            mean_num_words = c(mean(words_blogs),mean(words_news),mean(words_twitter)))

kable(DataSummary)

filename	file_size_MB	num_lines	num_words	mean_num_words
blogs	200.4242	899288	37546246	41.75108
news	196.2775	1010242	34762395	34.40997
twitter	159.3641	2360148	30093369	12.75063

Sampling Data

We will randomly choose 1% of each data set to demonstrate data preprocessing and exploratory data analysis. The full dataset will be used later in creating the prediction algorithm.

set.seed(1)
blogsSample <- sample(blogs, length(blogs)*0.01)
newsSample <- sample(news, length(news)*0.01)
twitterSample <- sample(twitter, length(twitter)*0.01)
twitterSample <- sapply(twitterSample, 
                        function(row) iconv(row, "latin1", "ASCII", sub=""))

Combine the three samples. The number of lines and total number of words are as follows:

text_sample  <- c(blogsSample,newsSample,twitterSample)
length(text_sample)

## [1] 42695
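
As a quick sanity check, the combined sample size can be reproduced from the line counts in the summary table. This sketch assumes the fractional sizes passed to sample() are truncated to whole numbers, which is consistent with the 42695 reported above.

```r
# Line counts taken from the data summary table
n_lines <- c(blogs = 899288, news = 1010242, twitter = 2360148)

# 1% of each file, truncated to a whole number of lines (assumption)
sample_sizes <- floor(n_lines * 0.01)
sample_sizes

# Total number of sampled lines across the three sources
sum(sample_sizes)
```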

sum(stri_count_words(text_sample))

## [1] 1020011

Exploratory Analysis

The basic procedure for data preprocessing consists of the following key steps:

1. Construct a corpus from the document file.
2. Clean up the corpus by removing special characters, punctuation, numbers, etc. Also remove profanity that we do not want to predict.
3. Build a basic n-gram model.

1. Processing and Cleaning Data

The following helper function is used to prepare the corpus.

toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
preprocessCorpus <- function(corpus){
  # Helper function to preprocess corpus
  # Replace URLs and Twitter handles with a space
  corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
  corpus <- tm_map(corpus, toSpace, "@[^[:space:]]+")
  # Wrap tolower in content_transformer() so documents keep their class
  corpus <- tm_map(corpus, content_transformer(tolower))
  # Drop English stop words, punctuation, digits, and extra whitespace
  corpus <- tm_map(corpus, removeWords, stopwords("en"))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  return(corpus)
}

text_sample <- VCorpus(VectorSource(text_sample))
text_sample <- preprocessCorpus(text_sample)
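
The cleaning steps mention profanity removal, which the helper above does not perform. A minimal base R sketch of one way it could be done is shown below. The word list, the function name remove_banned, and the example sentence are hypothetical placeholders; a real filter would load a curated list from a file.

```r
# Hypothetical placeholder list; not an actual profanity list
banned_words <- c("darn", "heck")

remove_banned <- function(lines, banned) {
  # Build one whole-word pattern from the list
  # (assumes the words contain no regex metacharacters)
  pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
  cleaned <- gsub(pattern, " ", lines, ignore.case = TRUE, perl = TRUE)
  # Collapse the extra spaces left behind by removed words
  trimws(gsub("\\s+", " ", cleaned, perl = TRUE))
}

remove_banned("well heck that was a darn good game", banned_words)
```

Within the tm pipeline, the same list could probably be passed to removeWords in the same way as stopwords("en").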

2. Document Matrix with Tokenization and Profanity Removal

BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))

tdm1a <- TermDocumentMatrix(text_sample)
tdm1 <- removeSparseTerms(tdm1a, 0.99)

tdm2a <- TermDocumentMatrix(text_sample, control=list(tokenize=BigramTokenizer))
tdm2 <- removeSparseTerms(tdm2a, 0.999)

tdm3a <- TermDocumentMatrix(text_sample, control=list(tokenize=TrigramTokenizer))
tdm3 <- removeSparseTerms(tdm3a, 0.9999)

tdm4a <- TermDocumentMatrix(text_sample, control=list(tokenize=QuadgramTokenizer))
tdm4 <- removeSparseTerms(tdm4a, 0.9999)
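
To make the sparsity thresholds concrete, the sketch below converts each one into a minimum document count. It assumes that removeSparseTerms keeps a term only when it appears in more than (1 - sparse) times the number of documents, and it uses the 42695 sampled lines as the document count. The variable names are illustrative.

```r
# Number of documents in the sampled corpus (one document per line)
n_docs <- 42695

# Sparsity thresholds used for each n-gram matrix
sparse <- c(unigram = 0.99, bigram = 0.999, trigram = 0.9999, quadgram = 0.9999)

# Smallest whole number of documents that exceeds n_docs * (1 - sparse)
min_docs <- floor(n_docs * (1 - sparse)) + 1
min_docs
```

Under this assumption a unigram must occur in at least 427 documents, a bigram in at least 43, and a trigram or quadgram in at least 5.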

Data Analysis

1. Word Frequency Analysis

Helper function to tabulate frequency

freq_frame <- function(tdm){
  freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
  freq_frame <- data.frame(word=names(freq), freq=freq)
  return(freq_frame)
}

freq1_frame <- freq_frame(tdm1)
freq1_frame

##                word freq
## will           will 3346
## just           just 3023
## one             one 2982
## said           said 2962
## like           like 2605
## can             can 2512
## get             get 2205
## time           time 2196
## new             new 1831
## good           good 1807
## now             now 1717
## know           know 1611
## day             day 1597
## people       people 1562
## love           love 1555
## first         first 1445
## see             see 1433
## back           back 1383
## make           make 1366
## going         going 1303
## think         think 1271
## great         great 1225
## two             two 1218
## much           much 1191
## also           also 1181
## last           last 1174
## really       really 1166
## year           year 1137
## even           even 1105
## way             way 1096
## well           well 1095
## work           work 1067
## got             got 1054
## today         today 1032
## want           want 1015
## right         right 1012
## thanks       thanks  999
## need           need  992
## years         years  976
## still         still  959
## many           many  906
## say             say  892
## life           life  857
## take           take  854
## made           made  852
## come           come  840
## never         never  840
## home           home  839
## little       little  834
## best           best  790
## may             may  785
## night         night  745
## school       school  745
## game           game  737
## week           week  734
## next           next  726
## things       things  714
## lol             lol  713
## always       always  704
## happy         happy  695
## something something  691
## better       better  685
## state         state  680
## around       around  678
## look           look  669
## every         every  659
## another     another  622
## since         since  613
## show           show  612
## long           long  605
## world         world  604
## big             big  588
## three         three  585
## find           find  570
## city           city  567
## hope           hope  567
## man             man  560
## follow       follow  555
## sure           sure  551
## thing         thing  551
## tonight     tonight  544
## getting     getting  538
## keep           keep  538
## days           days  532
## help           help  530
## team           team  525
## says           says  522
## feel           feel  520
## use             use  512
## house         house  510
## lot             lot  510
## give           give  504
## family       family  491
## looking     looking  491
## ever           ever  488
## thank         thank  486
## play           play  483
## high           high  477
## everyone   everyone  476
## part           part  476
## place         place  476
## though       though  476
## done           done  465
## let             let  464
## might         might  464
## away           away  461
## morning     morning  461
## without     without  459
## end             end  456
## put             put  456
## thought     thought  456

freq2_frame <- freq_frame(tdm2)
freq2_frame

##                              word freq
## right now               right now  275
## new york                 new york  194
## last year               last year  178
## high school           high school  170
## first time             first time  149
## last night             last night  140
## years ago               years ago  140
## feel like               feel like  122
## looking forward   looking forward  121
## last week               last week  118
## make sure               make sure  112
## can get                   can get  110
## happy birthday     happy birthday  101
## st louis                 st louis   97
## good morning         good morning   94
## even though           even though   90
## just got                 just got   88
## looks like             looks like   88
## two years               two years   87
## can see                   can see   85
## united states       united states   84
## one day                   one day   83
## next week               next week   80
## let know                 let know   76
## new jersey             new jersey   74
## every day               every day   72
## look like               look like   72
## los angeles           los angeles   72
## said “                     said “   72
## next year               next year   71
## social media         social media   70
## good luck               good luck   69
## last month             last month   66
## thanks follow       thanks follow   62
## just want               just want   60
## san francisco       san francisco   60
## mothers day           mothers day   59
## long time               long time   57
## will never             will never   57
## get back                 get back   56
## one thing               one thing   56
## sounds like           sounds like   56
## san diego               san diego   55
## will take               will take   55
## will get                 will get   54
## follow back           follow back   53
## just like               just like   53
## can make                 can make   49
## come back               come back   49
## little bit             little bit   48
## many people           many people   48
## two weeks               two weeks   48
## going get               going get   47
## can help                 can help   46
## five years             five years   46
## go back                   go back   46
## pretty much           pretty much   46
## wait see                 wait see   46
## last years             last years   45
## one best                 one best   45
## thanks following thanks following   45
## will make               will make   45
## don’t know             don’t know   44
## last time               last time   44
## seems like             seems like   44
## much better           much better   43
## will go                   will go   43

freq3_frame <- freq_frame(tdm3)
freq3_frame

##                                                          word freq
## happy mothers day                           happy mothers day   35
## let us know                                       let us know   24
## new york city                                   new york city   23
## two years ago                                   two years ago   23
## happy new year                                 happy new year   21
## first time since                             first time since   15
## cinco de mayo                                   cinco de mayo   12
## four years ago                                 four years ago   12
## president barack obama                 president barack obama   12
## st patricks day                               st patricks day   12
## will take place                               will take place   12
## looking forward seeing                 looking forward seeing   11
## new york times                                 new york times   11
## gov chris christie                         gov chris christie   10
## ha ha ha                                             ha ha ha   10
## happy valentines day                     happy valentines day   10
## just got done                                   just got done   10
## really looking forward                 really looking forward   10
## world war ii                                     world war ii   10
## couple weeks ago                             couple weeks ago    9
## make sure get                                   make sure get    9
## dream come true                               dream come true    8
## just let know                                   just let know    8
## last two years                                 last two years    8
## let know think                                 let know think    8
## preheat oven degrees                     preheat oven degrees    8
## st louis county                               st louis county    8
## thanks following us                       thanks following us    8
## coach ken hitchcock                       coach ken hitchcock    7
## come see us                                       come see us    7
## couple years ago                             couple years ago    7
## don’t get wrong                               don’t get wrong    7
## first two games                               first two games    7
## good morning everyone                   good morning everyone    7
## just make sure                                 just make sure    7
## life right now                                 life right now    7
## new years eve                                   new years eve    7
## osama bin laden                               osama bin laden    7
## past two years                                 past two years    7
## rock n roll                                       rock n roll    7
## two half men                                     two half men    7
## business network international business network international    6
## case western reserve                     case western reserve    6
## chief executive officer               chief executive officer    6
## come join us                                     come join us    6
## feel better soon                             feel better soon    6
## five years ago                                 five years ago    6
## help us get                                       help us get    6
## hope everyone great                       hope everyone great    6
## hundreds millions dollars           hundreds millions dollars    6
## just got back                                   just got back    6
## just got home                                   just got home    6
## just need get                                   just need get    6
## let know can                                     let know can    6
## love love love                                 love love love    6
## major league baseball                   major league baseball    6
## make feel better                             make feel better    6
## nearly two years                             nearly two years    6
## now just need                                   now just need    6
## please let know                               please let know    6
## right now just                                 right now just    6
## season salt pepper                         season salt pepper    6
## show last night                               show last night    6
## two weeks ago                                   two weeks ago    6
## western reserve university         western reserve university    6
## will never forget                           will never forget    6
## “ don’t know                                     “ don’t know    5
## blues coach ken                               blues coach ken    5
## can please follow                           can please follow    5
## centers disease control               centers disease control    5
## chicago chicago illinois             chicago chicago illinois    5
## every day week                                 every day week    5
## executive vice president             executive vice president    5
## follow follow back                         follow follow back    5
## g protein g                                       g protein g    5
## high blood pressure                       high blood pressure    5
## high school senior                         high school senior    5
## hope feel better                             hope feel better    5
## hope great day                                 hope great day    5
## just little bit                               just little bit    5
## make dream come                               make dream come    5
## martin luther king                         martin luther king    5
## memorial day weekend                     memorial day weekend    5
## next couple weeks                           next couple weeks    5
## next two years                                 next two years    5
## past two decades                             past two decades    5
## president barack obamas               president barack obamas    5
## respond request comment               respond request comment    5
## rock roll hall                                 rock roll hall    5
## said one thing                                 said one thing    5
## salt pepper taste                           salt pepper taste    5
## seems like good                               seems like good    5
## seen anything like                         seen anything like    5
## social networking site                 social networking site    5
## standard poors index                     standard poors index    5
## stop say hi                                       stop say hi    5
## thank everyone came                       thank everyone came    5
## thanks following back                   thanks following back    5
## thanks new followers                     thanks new followers    5
## told associated press                   told associated press    5
## two days later                                 two days later    5
## u know u                                             u know u    5
## uc san diego                                     uc san diego    5
## university chicago chicago         university chicago chicago    5
## us district judge                           us district judge    5
## us supreme court                             us supreme court    5
## wall street journal                       wall street journal    5
## want make sure                                 want make sure    5
## will get back                                   will get back    5
## will let know                                   will let know    5
## yearold resident block                 yearold resident block    5

freq4_frame <- freq_frame(tdm4)
freq4_frame

##                                                                    word
## case western reserve university         case western reserve university
## blues coach ken hitchcock                     blues coach ken hitchcock
## make dream come true                               make dream come true
## university chicago chicago illinois university chicago chicago illinois
##                                     freq
## case western reserve university        6
## blues coach ken hitchcock              5
## make dream come true                   5
## university chicago chicago illinois    5

2. Word Correlation Analysis

Bar chart of bigrams that appear more than 100 times

freqBigram <- rowSums(as.matrix(tdm2))
wordFrameBigram <- data.frame(word=names(freqBigram),count=freqBigram,stringsAsFactors=FALSE)
bigramPlot <- ggplot(subset(wordFrameBigram, count > 100), aes(word,count))
bigramPlot <- bigramPlot + geom_bar(stat="identity")
bigramPlot <- bigramPlot + theme(axis.text.x=element_text(angle=45, hjust=1))
bigramPlot

Word cloud of trigrams, drawn with min.freq = 1000

freqTrigram <- rowSums(as.matrix(tdm3))
wordFrameTrigram <- data.frame(word=names(freqTrigram),count=freqTrigram,stringsAsFactors=FALSE)
wordcloud(wordFrameTrigram$word, wordFrameTrigram$count, min.freq=1000, colors=brewer.pal(8,"Dark2"))

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : happy mothers day could not be fit on page. It will not
## be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : thanks following us could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : university chicago chicago could not be fit on page. It
## will not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : president barack obama could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : will take place could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : new york city could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : yearold resident block could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : case western reserve could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : business network international could not be fit on page.
## It will not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : social networking site could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : cinco de mayo could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : please let know could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : let us know could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : preheat oven degrees could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : really looking forward could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : two weeks ago could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : chicago chicago illinois could not be fit on page. It
## will not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : couple years ago could not be fit on page. It will not
## be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : chief executive officer could not be fit on page. It
## will not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : just got done could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : thanks new followers could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : past two years could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : let know think could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : centers disease control could not be fit on page. It
## will not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : st louis county could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : major league baseball could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : want make sure could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : good morning everyone could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : dont get wrong could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : told associated press could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : last two years could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : happy valentines day could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : show last night could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : will get back could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : us supreme court could not be fit on page. It will not
## be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : salt pepper taste could not be fit on page. It will not
## be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : uc san diego could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : wall street journal could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : season salt pepper could not be fit on page. It will not
## be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : right now just could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : st patricks day could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : hope everyone great could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : thanks following back could not be fit on page. It will
## not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : respond request comment could not be fit on page. It
## will not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : seen anything like could not be fit on page. It will not
## be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : executive vice president could not be fit on page. It
## will not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : stop say hi could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : hundreds millions dollars could not be fit on page. It
## will not be plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : said one thing could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : g protein g could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : seems like good could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : world war ii could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : just need get could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : dream come true could not be fit on page. It will not be
## plotted.

## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : western reserve university could not be fit on page. It
## will not be plotted.

Next Steps For Prediction Algorithm Strategies And Shiny App

This concludes the exploratory analysis. The next steps of this capstone project are to finalize the predictive algorithm and deploy it as a Shiny app. I plan to use an n-gram model with frequency lookup, similar to the tables above. A trigram model would be tried first to predict the next word. If no matching trigram is found, the algorithm would back off to the bigram model, and then to the unigram model if needed. The Shiny app will consist of a text input where a user can enter a phrase. The algorithm will then try to predict the most likely next word after a short delay. As an advanced feature, the app may let the user configure the number of suggested words.
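
The back-off idea can be sketched in a few lines of base R. The n-gram tables below hold made-up counts for illustration only, and the function names complete and predict_next are hypothetical. This is a sketch of the planned lookup, not the final algorithm.

```r
# Toy n-gram tables (made-up counts, for illustration only)
unigrams <- c("the" = 10, "day" = 6, "know" = 4)
bigrams  <- c("right now" = 5, "mothers day" = 4, "let us" = 3)
trigrams <- c("happy mothers day" = 6, "let us know" = 4, "let us go" = 2)

# Last words of the n-grams that start with `prefix`, most frequent first
complete <- function(tbl, prefix, k) {
  hits <- tbl[startsWith(names(tbl), paste0(prefix, " "))]
  hits <- sort(hits, decreasing = TRUE)
  head(sub(".* ", "", names(hits)), k)
}

# Try trigrams first, then back off to bigrams, then to unigrams
predict_next <- function(phrase, k = 3) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  n <- length(words)
  if (n >= 2) {
    out <- complete(trigrams, paste(words[n - 1], words[n]), k)
    if (length(out) > 0) return(out)
  }
  if (n >= 1) {
    out <- complete(bigrams, words[n], k)
    if (length(out) > 0) return(out)
  }
  # Last resort: the most frequent single words
  head(names(sort(unigrams, decreasing = TRUE)), k)
}

predict_next("please let us")   # matched at the trigram level
predict_next("its mothers")     # backs off to the bigram level
predict_next("hello")           # backs off to the unigram level
```

The argument k plays the role of the configurable number of suggestions mentioned above.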