This report describes the goals for a next-word prediction algorithm and app. Its purpose is to gather feedback before development continues.
The goal is to develop an algorithm that predicts the next word and to deploy it in a Shiny app. Based on the data exploration, we plan to train the model on bigrams, trigrams and quadrigrams. We want to continue the development process with the following steps:
If you are interested in more details on the data exploration, see the ‘Data exploration’ chapter below.
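To make the planned n-gram approach more concrete, here is a minimal base-R sketch of next-word prediction with a simple back-off: it tries the quadrigram context first, then the trigram, then the bigram. This illustrates the general idea only and is not the model described in this report. The function names (`tokenize`, `ngram_counts`, `predict_next`) and the toy corpus are hypothetical.

```r
# Split text into lowercase word tokens (a deliberate simplification).
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# Count every n-gram of size n in a token vector.
ngram_counts <- function(tokens, n) {
  if (length(tokens) < n) return(integer(0))
  starts <- seq_len(length(tokens) - n + 1)
  grams <- vapply(
    starts,
    function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
    character(1)
  )
  table(grams)
}

# Predict the next word: longest context first, then back off to shorter ones.
predict_next <- function(phrase, models) {
  tokens <- tokenize(phrase)
  for (n in sort(as.integer(names(models)), decreasing = TRUE)) {
    k <- n - 1                      # number of context words for this model
    counts <- models[[as.character(n)]]
    if (length(tokens) < k || length(counts) == 0) next
    context <- paste(tail(tokens, k), collapse = " ")
    hits <- counts[startsWith(names(counts), paste0(context, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]   # most frequent continuation
      return(tail(strsplit(best, " ")[[1]], 1))
    }
  }
  NA_character_                     # no known context matched
}

# Toy corpus (hypothetical); n-grams here may span sentence boundaries.
toy_corpus <- "i want to go home . i want to eat now . i want to go out"
toks <- tokenize(toy_corpus)
models <- list(
  "2" = ngram_counts(toks, 2),
  "3" = ngram_counts(toks, 3),
  "4" = ngram_counts(toks, 4)
)

predict_next("they want to", models)  # backs off to the trigram model: "go"
```

On the full corpus, a production version would probably need pruning of rare n-grams and a smoothing scheme. Both are left out here for brevity.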
Below, the data is loaded, a basic report of summary statistics is shown, and the first interesting findings are reported for the three Coursera-SwiftKey en_US data sets (blogs, twitter and news, in that order).
## Warning in readLines(con <- file("./en_US.news.txt"), encoding = "UTF-8", :
## incomplete final line found on './en_US.news.txt'
## [1] "word counts:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    0.00    9.00   28.00   41.75   60.00 6726.00
## [1] "line counts:"
##       Lines LinesNEmpty       Chars CharsNWhite
##      899288      899288   206824382   170389539
## [1] "basic data table of summary statistics:"
##    Length     Class      Mode
##    899288 character character
## [1] "histogram of word counts:"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] "word counts:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00    7.00   12.00   12.75   18.00   47.00
## [1] "line counts:"
##       Lines LinesNEmpty       Chars CharsNWhite
##     2360148     2360148   162096241   134082806
## [1] "basic data table of summary statistics:"
##    Length     Class      Mode
##   2360148 character character
## [1] "histogram of word counts:"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## [1] "word counts:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##    1.00   19.00   32.00   34.62   46.00 1123.00
## [1] "line counts:"
##       Lines LinesNEmpty       Chars CharsNWhite
##       77259       77259    15639408    13072698
## [1] "basic data table of summary statistics:"
##    Length     Class      Mode
##     77259 character character
## [1] "histogram of word counts:"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
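For readers who want to reproduce this kind of summary, the sketch below computes per-line word counts and the four line statistics in base R. The column names in the output above (Lines, LinesNEmpty, Chars, CharsNWhite) match those returned by `stringi::stri_stats_general()`, so the report was most likely produced with stringi. The code here is only a base-R approximation on a hypothetical three-line sample, and its word-splitting rule is an assumption.

```r
# Hypothetical sample standing in for the lines of one data set.
sample_lines <- c("The quick brown fox", "", "jumps over the lazy dog today")

# Words per line: count runs of letters, digits and apostrophes.
word_counts <- lengths(
  regmatches(sample_lines, gregexpr("[[:alnum:]']+", sample_lines))
)
print(summary(word_counts))

# Line-level statistics, named after the columns in the report output.
line_stats <- c(
  Lines       = length(sample_lines),
  LinesNEmpty = sum(nzchar(trimws(sample_lines))),   # lines with visible text
  Chars       = sum(nchar(sample_lines)),
  CharsNWhite = sum(nchar(gsub("[[:space:]]", "", sample_lines)))
)
print(line_stats)
```

Reading the real files in binary mode (for example via `file(path, "rb")`) may be worth trying if `readLines()` stops early on a file, but whether that applies here is an assumption.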
## Warning in tm_map.SimpleCorpus(all_corpus, whitespace, "\"|/|@|\\|"):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(all_corpus,
## content_transformer(stringi::stri_trans_tolower)): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(all_corpus, removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(all_corpus, stripWhitespace): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(all_corpus, removeWords, stopwords("english")):
## transformation drops documents
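The warnings above suggest this sequence of cleaning steps: replace a few special characters with spaces, lowercase, remove numbers, collapse whitespace, and remove English stopwords. The sketch below mimics that sequence in base R as an illustration, not as the `tm` pipeline used in this report. The function name `clean_text`, the tiny stopword list, and the final trim are assumptions.

```r
# Base-R stand-in for the cleaning steps; stop_words is a tiny hypothetical
# list, far shorter than a real English stopword list.
clean_text <- function(x,
                       stop_words = c("the", "a", "an", "and", "of", "to", "is")) {
  x <- gsub("\"|/|@|\\|", " ", x)           # replace ", /, @ and | with a space
  x <- tolower(x)                           # lowercase everything
  x <- gsub("[[:digit:]]+", "", x)          # remove numbers
  x <- gsub("[[:space:]]+", " ", x)         # collapse runs of whitespace
  pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(pattern, "", x, perl = TRUE)    # remove stopwords as whole words
  # Extra tidy-up (an addition, not implied by the warnings): removing words
  # leaves gaps, so collapse whitespace again and trim the ends.
  trimws(gsub("[[:space:]]+", " ", x))
}

clean_text("Email me @ 10/12 AND the \"Dog\" is 2 big")
```

Whether stopwords should be removed at all for next-word prediction is a design choice worth revisiting, since stopwords are often the words a user wants predicted.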