The goal of this project is to show how to work with the data and to lay the groundwork for the prediction algorithm. The report is submitted on R Pubs (http://rpubs.com/) and explains the exploratory analysis and the goals for the eventual app and algorithm. The documentation should be concise, explain the major features of the data that have been identified, and briefly summarize the plans for creating the prediction algorithm and Shiny app in a way that is understandable to a non-data-scientist manager. Tables and plots are used to illustrate the important summaries of the data set. The motivation for this project is to:
The packages used for analysis are NLP, tm, RWeka, stringi, stringr, ggplot2, knitr, dplyr, RColorBrewer, and wordcloud.
library(NLP)
library(tm)
library(RWeka)
library(stringi)
library(stringr)
library(ggplot2)
##
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
##
## annotate
library(knitr)
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(RColorBrewer)
library(wordcloud)
The project dataset was downloaded from the Coursera website (Project Dataset). The data contains text in several languages (DE, US, FI, and RU). This project uses the English-language subset, which consists of blogs, news, and tweets originally derived from the HC Corpora corpus. The three English files were processed from the final/en_US directory. The original ZIP file, about 560 MB, is used to develop the predictive algorithm. Below is summary information for the data, including file size, number of lines, number of words, and mean number of words per line.
doc1 <- file("Coursera-SwiftKey/final/en_US/en_US.blogs.txt", "rb")
blogs <- readLines(doc1, encoding="UTF-8")
close(doc1)
doc2 <- file("Coursera-SwiftKey/final/en_US/en_US.news.txt", "rb")
news <- readLines(doc2, encoding="UTF-8")
close(doc2)
doc3 <- file("Coursera-SwiftKey/final/en_US/en_US.twitter.txt", "rb")
twitter <- readLines(doc3, encoding="UTF-8")
## Warning in readLines(doc3, encoding = "UTF-8"): line 167155 appears to
## contain an embedded nul
## Warning in readLines(doc3, encoding = "UTF-8"): line 268547 appears to
## contain an embedded nul
## Warning in readLines(doc3, encoding = "UTF-8"): line 1274086 appears to
## contain an embedded nul
## Warning in readLines(doc3, encoding = "UTF-8"): line 1759032 appears to
## contain an embedded nul
close(doc3)
words_blogs <- stri_count_words(blogs)
words_news <- stri_count_words(news)
words_twitter <- stri_count_words(twitter)
size_blogs <- file.info("Coursera-SwiftKey/final/en_US/en_US.blogs.txt")$size/1024^2
size_news <- file.info("Coursera-SwiftKey/final/en_US/en_US.news.txt")$size/1024^2
size_twitter <- file.info("Coursera-SwiftKey/final/en_US/en_US.twitter.txt")$size/1024^2
DataSummary <- data.frame(filename = c("blogs","news","twitter"),
file_size_MB = c(size_blogs, size_news, size_twitter),
num_lines = c(length(blogs),length(news),length(twitter)),
num_words = c(sum(words_blogs),sum(words_news),sum(words_twitter)),
mean_num_words = c(mean(words_blogs),mean(words_news),mean(words_twitter)))
kable(DataSummary)
filename | file_size_MB | num_lines | num_words | mean_num_words |
---|---|---|---|---|
blogs | 200.4242 | 899288 | 37546246 | 41.75108 |
news | 196.2775 | 1010242 | 34762395 | 34.40997 |
twitter | 159.3641 | 2360148 | 30093369 | 12.75063 |
We will randomly choose 1% of each data set to demonstrate data preprocessing and exploratory data analysis. The full dataset will be used later in creating the prediction algorithm.
set.seed(1)
blogsSample <- sample(blogs, length(blogs)*0.01)
newsSample <- sample(news, length(news)*0.01)
twitterSample <- sample(twitter, length(twitter)*0.01)
twitterSample <- sapply(twitterSample,
function(row) iconv(row, "latin1", "ASCII", sub=""))
Combine the three samples. The number of lines and total number of words are as follows:
text_sample <- c(blogsSample,newsSample,twitterSample)
length(text_sample)
## [1] 42695
sum(stri_count_words(text_sample))
## [1] 1020011
The basic procedure for data preprocessing consists of the following key steps:
1. Construct a corpus from the document file.
2. Clean up the corpus by removing special characters, punctuation, numbers, etc. Also remove profanity that we do not want to predict.
3. Build a basic n-gram model.
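Step 2 mentions removing profanity. As a minimal sketch of how that step could look, the base R function below removes whole words from a banned list. The list `banned_words` and the function name `remove_banned` are hypothetical placeholders for illustration, not part of the original analysis; a real list would be loaded from a file.

```r
# Hypothetical placeholder list; a real profanity list would be read from a file.
banned_words <- c("darn", "heck")

remove_banned <- function(text, banned) {
  # Build one alternation pattern with word boundaries so that only whole
  # words are removed ("heck" is removed, "check" is kept).
  pattern <- paste0("\\b(", paste(banned, collapse = "|"), ")\\b")
  cleaned <- gsub(pattern, "", text, ignore.case = TRUE, perl = TRUE)
  # Collapse the whitespace left behind by removed words.
  trimws(gsub("\\s+", " ", cleaned, perl = TRUE))
}

cleaned_example <- remove_banned(c("Well darn it", "check the HECK out"),
                                 banned_words)
print(cleaned_example)
```

Inside a tm pipeline, the same effect could likely be obtained with `tm_map(corpus, removeWords, banned_words)`, the same form of call this report uses for stop words.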
To prepare the corpus, the following helper function is constructed.
toSpace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))
preprocessCorpus <- function(corpus){
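# Helper function to preprocess corpus
# Replace URLs with a space
corpus <- tm_map(corpus, toSpace, "(f|ht)tp(s?)://(.*)[.][a-z]+")
# Replace Twitter mentions with a space; the POSIX class [:space:] is
# unambiguous inside brackets in R's default regex engine
corpus <- tm_map(corpus, toSpace, "@[^[:space:]]+")
# Wrap tolower in content_transformer so every element stays a text
# document; no conversion back with PlainTextDocument is then needed
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removeWords, stopwords("en"))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)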
return(corpus)
}
text_sample <- VCorpus(VectorSource(text_sample))
text_sample <- preprocessCorpus(text_sample)
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=2, max=2))
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=3, max=3))
QuadgramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min=4, max=4))
tdm1a <- TermDocumentMatrix(text_sample)
tdm1 <- removeSparseTerms(tdm1a, 0.99)
tdm2a <- TermDocumentMatrix(text_sample, control=list(tokenize=BigramTokenizer))
tdm2 <- removeSparseTerms(tdm2a, 0.999)
tdm3a <- TermDocumentMatrix(text_sample, control=list(tokenize=TrigramTokenizer))
tdm3 <- removeSparseTerms(tdm3a, 0.9999)
tdm4a <- TermDocumentMatrix(text_sample, control=list(tokenize=QuadgramTokenizer))
tdm4 <- removeSparseTerms(tdm4a, 0.9999)
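To make concrete what an n-gram tokenizer produces, here is a small base R sketch. It is only an illustration under simplifying assumptions (text already cleaned and split on whitespace); the report itself relies on `NGramTokenizer` from RWeka, and the helper name `make_ngrams` and the two toy documents are hypothetical.

```r
# Split a cleaned string into all n-grams of size n (whitespace tokenization).
make_ngrams <- function(text, n) {
  words <- strsplit(trimws(text), "\\s+")[[1]]
  words <- words[nzchar(words)]
  # A text shorter than n words contains no n-gram.
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

example_bigrams <- make_ngrams("happy mothers day everyone", 2)
print(example_bigrams)

# Counting n-grams across several toy documents gives a frequency table.
docs <- c("new york city", "new york times")
bigram_counts <- sort(table(unlist(lapply(docs, make_ngrams, n = 2))),
                      decreasing = TRUE)
print(bigram_counts)
```

The term-document matrices above hold the same kind of counts, broken down per document, which is why summing each row yields the overall frequency of a term.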
Helper function to tabulate frequency
freq_frame <- function(tdm){
freq <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
freq_frame <- data.frame(word=names(freq), freq=freq)
return(freq_frame)
}
freq1_frame <- freq_frame(tdm1)
freq1_frame
## word freq
## will will 3346
## just just 3023
## one one 2982
## said said 2962
## like like 2605
## can can 2512
## get get 2205
## time time 2196
## new new 1831
## good good 1807
## now now 1717
## know know 1611
## day day 1597
## people people 1562
## love love 1555
## first first 1445
## see see 1433
## back back 1383
## make make 1366
## going going 1303
## think think 1271
## great great 1225
## two two 1218
## much much 1191
## also also 1181
## last last 1174
## really really 1166
## year year 1137
## even even 1105
## way way 1096
## well well 1095
## work work 1067
## got got 1054
## today today 1032
## want want 1015
## right right 1012
## thanks thanks 999
## need need 992
## years years 976
## still still 959
## many many 906
## say say 892
## life life 857
## take take 854
## made made 852
## come come 840
## never never 840
## home home 839
## little little 834
## best best 790
## may may 785
## night night 745
## school school 745
## game game 737
## week week 734
## next next 726
## things things 714
## lol lol 713
## always always 704
## happy happy 695
## something something 691
## better better 685
## state state 680
## around around 678
## look look 669
## every every 659
## another another 622
## since since 613
## show show 612
## long long 605
## world world 604
## big big 588
## three three 585
## find find 570
## city city 567
## hope hope 567
## man man 560
## follow follow 555
## sure sure 551
## thing thing 551
## tonight tonight 544
## getting getting 538
## keep keep 538
## days days 532
## help help 530
## team team 525
## says says 522
## feel feel 520
## use use 512
## house house 510
## lot lot 510
## give give 504
## family family 491
## looking looking 491
## ever ever 488
## thank thank 486
## play play 483
## high high 477
## everyone everyone 476
## part part 476
## place place 476
## though though 476
## done done 465
## let let 464
## might might 464
## away away 461
## morning morning 461
## without without 459
## end end 456
## put put 456
## thought thought 456
freq2_frame <- freq_frame(tdm2)
freq2_frame
## word freq
## right now right now 275
## new york new york 194
## last year last year 178
## high school high school 170
## first time first time 149
## last night last night 140
## years ago years ago 140
## feel like feel like 122
## looking forward looking forward 121
## last week last week 118
## make sure make sure 112
## can get can get 110
## happy birthday happy birthday 101
## st louis st louis 97
## good morning good morning 94
## even though even though 90
## just got just got 88
## looks like looks like 88
## two years two years 87
## can see can see 85
## united states united states 84
## one day one day 83
## next week next week 80
## let know let know 76
## new jersey new jersey 74
## every day every day 72
## look like look like 72
## los angeles los angeles 72
## said <U+0093> said <U+0093> 72
## next year next year 71
## social media social media 70
## good luck good luck 69
## last month last month 66
## thanks follow thanks follow 62
## just want just want 60
## san francisco san francisco 60
## mothers day mothers day 59
## long time long time 57
## will never will never 57
## get back get back 56
## one thing one thing 56
## sounds like sounds like 56
## san diego san diego 55
## will take will take 55
## will get will get 54
## follow back follow back 53
## just like just like 53
## can make can make 49
## come back come back 49
## little bit little bit 48
## many people many people 48
## two weeks two weeks 48
## going get going get 47
## can help can help 46
## five years five years 46
## go back go back 46
## pretty much pretty much 46
## wait see wait see 46
## last years last years 45
## one best one best 45
## thanks following thanks following 45
## will make will make 45
## don<U+0092>t know don<U+0092>t know 44
## last time last time 44
## seems like seems like 44
## much better much better 43
## will go will go 43
freq3_frame <- freq_frame(tdm3)
freq3_frame
## word freq
## happy mothers day happy mothers day 35
## let us know let us know 24
## new york city new york city 23
## two years ago two years ago 23
## happy new year happy new year 21
## first time since first time since 15
## cinco de mayo cinco de mayo 12
## four years ago four years ago 12
## president barack obama president barack obama 12
## st patricks day st patricks day 12
## will take place will take place 12
## looking forward seeing looking forward seeing 11
## new york times new york times 11
## gov chris christie gov chris christie 10
## ha ha ha ha ha ha 10
## happy valentines day happy valentines day 10
## just got done just got done 10
## really looking forward really looking forward 10
## world war ii world war ii 10
## couple weeks ago couple weeks ago 9
## make sure get make sure get 9
## dream come true dream come true 8
## just let know just let know 8
## last two years last two years 8
## let know think let know think 8
## preheat oven degrees preheat oven degrees 8
## st louis county st louis county 8
## thanks following us thanks following us 8
## coach ken hitchcock coach ken hitchcock 7
## come see us come see us 7
## couple years ago couple years ago 7
## don<U+0092>t get wrong don<U+0092>t get wrong 7
## first two games first two games 7
## good morning everyone good morning everyone 7
## just make sure just make sure 7
## life right now life right now 7
## new years eve new years eve 7
## osama bin laden osama bin laden 7
## past two years past two years 7
## rock n roll rock n roll 7
## two half men two half men 7
## business network international business network international 6
## case western reserve case western reserve 6
## chief executive officer chief executive officer 6
## come join us come join us 6
## feel better soon feel better soon 6
## five years ago five years ago 6
## help us get help us get 6
## hope everyone great hope everyone great 6
## hundreds millions dollars hundreds millions dollars 6
## just got back just got back 6
## just got home just got home 6
## just need get just need get 6
## let know can let know can 6
## love love love love love love 6
## major league baseball major league baseball 6
## make feel better make feel better 6
## nearly two years nearly two years 6
## now just need now just need 6
## please let know please let know 6
## right now just right now just 6
## season salt pepper season salt pepper 6
## show last night show last night 6
## two weeks ago two weeks ago 6
## western reserve university western reserve university 6
## will never forget will never forget 6
## <U+0093> don<U+0092>t know <U+0093> don<U+0092>t know 5
## blues coach ken blues coach ken 5
## can please follow can please follow 5
## centers disease control centers disease control 5
## chicago chicago illinois chicago chicago illinois 5
## every day week every day week 5
## executive vice president executive vice president 5
## follow follow back follow follow back 5
## g protein g g protein g 5
## high blood pressure high blood pressure 5
## high school senior high school senior 5
## hope feel better hope feel better 5
## hope great day hope great day 5
## just little bit just little bit 5
## make dream come make dream come 5
## martin luther king martin luther king 5
## memorial day weekend memorial day weekend 5
## next couple weeks next couple weeks 5
## next two years next two years 5
## past two decades past two decades 5
## president barack obamas president barack obamas 5
## respond request comment respond request comment 5
## rock roll hall rock roll hall 5
## said one thing said one thing 5
## salt pepper taste salt pepper taste 5
## seems like good seems like good 5
## seen anything like seen anything like 5
## social networking site social networking site 5
## standard poors index standard poors index 5
## stop say hi stop say hi 5
## thank everyone came thank everyone came 5
## thanks following back thanks following back 5
## thanks new followers thanks new followers 5
## told associated press told associated press 5
## two days later two days later 5
## u know u u know u 5
## uc san diego uc san diego 5
## university chicago chicago university chicago chicago 5
## us district judge us district judge 5
## us supreme court us supreme court 5
## wall street journal wall street journal 5
## want make sure want make sure 5
## will get back will get back 5
## will let know will let know 5
## yearold resident block yearold resident block 5
freq4_frame <- freq_frame(tdm4)
freq4_frame
## word
## case western reserve university case western reserve university
## blues coach ken hitchcock blues coach ken hitchcock
## make dream come true make dream come true
## university chicago chicago illinois university chicago chicago illinois
## freq
## case western reserve university 6
## blues coach ken hitchcock 5
## make dream come true 5
## university chicago chicago illinois 5
Bar chart of bigrams that appear more than 100 times
freqBigram <- rowSums(as.matrix(tdm2))
wordFrameBigram <- data.frame(word=names(freqBigram),count=freqBigram,stringsAsFactors=FALSE)
bigramPlot <- ggplot(subset(wordFrameBigram, count > 100), aes(word,count))
bigramPlot <- bigramPlot + geom_bar(stat="identity")
bigramPlot <- bigramPlot + theme(axis.text.x=element_text(angle=45, hjust=1))
bigramPlot
Word cloud of trigrams (the call uses min.freq=1000, although the most frequent trigram in this sample occurs only 35 times)
freqTrigram <- rowSums(as.matrix(tdm3))
wordFrameTrigram <- data.frame(word=names(freqTrigram),count=freqTrigram,stringsAsFactors=FALSE)
wordcloud(wordFrameTrigram$word, wordFrameTrigram$count, min.freq=1000, colors=brewer.pal(8,"Dark2"))
## Warning in wordcloud(wordFrameTrigram$word, wordFrameTrigram$count,
## min.freq = 1000, : happy mothers day could not be fit on page. It will not
## be plotted.
This concludes the exploratory analysis. The next steps of this capstone project are to finalize the predictive algorithm and deploy it as a Shiny app. I plan to use an n-gram model with frequency lookup, similar to the tables above. A trigram model would be used first to predict the next word. If no matching trigram can be found, the algorithm would back off to the bigram model, and then to the unigram model if needed. The Shiny app will consist of a text input that allows a user to enter a phrase. The algorithm will then try to predict the most likely next word after a short delay. As a further enhancement, the app will allow the user to configure the number of suggested words.
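The back-off idea described above can be sketched in a few lines of base R. This is a rough illustration under stated assumptions, not the final algorithm: the three count vectors are tiny stand-ins that reuse a handful of counts from the frequency tables in this report, and the function names `complete` and `predict_next` as well as the parameter `n_suggestions` are hypothetical.

```r
# Toy n-gram counts standing in for the full frequency tables.
unigrams <- c(will = 3346, just = 3023, one = 2982)
bigrams  <- c("new york" = 194, "last year" = 178, "last night" = 140)
trigrams <- c("new york city" = 23, "new york times" = 11)

# Return the final word of up to n_suggestions n-grams in `counts` that
# start with `prefix`, most frequent first. An empty prefix matches all.
complete <- function(prefix, counts, n_suggestions) {
  if (nzchar(prefix)) {
    hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
  } else {
    hits <- counts
  }
  hits <- sort(hits, decreasing = TRUE)
  last_word <- sub(".* ", "", names(hits))
  head(last_word, n_suggestions)
}

predict_next <- function(phrase, n_suggestions = 3) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  k <- length(words)
  # Try the trigram table first, using the last two words as context.
  if (k >= 2) {
    out <- complete(paste(words[(k - 1):k], collapse = " "),
                    trigrams, n_suggestions)
    if (length(out) > 0) return(out)
  }
  # Back off to the bigram table, using only the last word.
  if (k >= 1) {
    out <- complete(words[k], bigrams, n_suggestions)
    if (length(out) > 0) return(out)
  }
  # Finally fall back to the most frequent single words.
  complete("", unigrams, n_suggestions)
}

print(predict_next("I love new york"))
print(predict_next("see you last"))
print(predict_next("hello there"))
```

The `n_suggestions` argument corresponds to the configurable number of suggestions planned for the app. Because stop words were removed before building the n-grams in this report, a production version would probably need tables built without that step so that common function words can also be predicted.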