Goals for algorithm / app

The goals for the prediction algorithm and app are described below. The aim is to gather feedback so that the development process can continue.

The goal is to develop an algorithm that predicts the next word and to deploy it in a Shiny app. Based on the data exploration, we want to train the model on bigrams, trigrams and quadgrams. We want to continue the development process with the following steps:

If you are interested in more details on the data exploration, see the ‘Data exploration’ chapter below.
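
To make the stated goal more concrete, here is a minimal sketch of next-word prediction with bigrams, trigrams and quadgrams. It is not the final algorithm: the function names (`tokenize`, `build_ngrams`, `predict_next`), the toy corpus and the simple "longest context first" fallback are all assumptions made for illustration, and only base R is used.

```r
# Minimal sketch of next-word prediction with n-gram fallback (base R only).
# All names and the toy corpus below are hypothetical, for illustration.

# Split a line into lowercase word tokens.
tokenize <- function(line) {
  words <- strsplit(tolower(line), "[^a-z']+")[[1]]
  words[nzchar(words)]
}

# Count all n-grams of order n across a character vector of lines.
build_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  table(grams)
}

# Predict the next word: try the quadgram model first (3 words of context),
# then the trigram model (2 words), then the bigram model (1 word).
predict_next <- function(input, models) {
  w <- tokenize(input)
  for (n in c(4, 3, 2)) {
    k <- n - 1
    counts <- models[[as.character(n)]]
    if (length(w) < k || length(counts) == 0) next
    prefix <- paste0(paste(tail(w, k), collapse = " "), " ")
    hits <- counts[startsWith(names(counts), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]   # most frequent continuation
      return(substring(best, nchar(prefix) + 1))
    }
  }
  NA_character_                              # no matching context found
}

toy_corpus <- c("i want to go home",
                "i want to go to the store",
                "we want to eat")
models <- list("2" = build_ngrams(toy_corpus, 2),
               "3" = build_ngrams(toy_corpus, 3),
               "4" = build_ngrams(toy_corpus, 4))

predict_next("they want to", models)   # "go"
```

In this toy example no quadgram starts with "they want to", so the sketch falls back to the trigram context "want to", where "go" (seen twice) beats "eat" (seen once).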

Data exploration

Below, the data is loaded, a basic report of summary statistics is shown, and the first findings are reported. The analysis is based on the three Coursera-SwiftKey en_US data sets: blogs, twitter and news.
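
As a rough sketch of how such summary statistics can be computed, the example below uses base R on a small stand-in vector. The vector `lines` and the names `word_counts` and `line_stats` are assumptions for illustration. The labels in the output further down resemble what `stringi::stri_stats_general()` returns, but the sketch does not depend on that package.

```r
# Hypothetical stand-in for the lines of one data set. With the real data
# this would be something like:
# lines <- readLines("./en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
lines <- c("The quick brown fox.",
           "Hello world",
           "One two  three four five")

# Words per line: split on runs of whitespace and count non-empty pieces.
word_counts <- vapply(strsplit(trimws(lines), "\\s+"),
                      function(w) sum(nzchar(w)),
                      integer(1))
print(summary(word_counts))

# Line and character counts, using the same labels as the report output.
line_stats <- c(
  Lines       = length(lines),                      # number of lines
  LinesNEmpty = sum(nzchar(trimws(lines))),         # non-empty lines
  Chars       = sum(nchar(lines)),                  # all characters
  CharsNWhite = sum(nchar(gsub("\\s", "", lines)))  # non-whitespace characters
)
print(line_stats)
```

Splitting on whitespace is a simplification: a dedicated word-boundary tokenizer will count words slightly differently for text with punctuation or emoticons.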

## Warning: package 'tm' was built under R version 3.6.3
## Loading required package: NLP
## Loading required package: stringi
## 
## Attaching package: 'ggplot2'
## The following object is masked from 'package:NLP':
## 
##     annotate
## Warning: package 'rJava' was built under R version 3.6.3
## Warning: package 'RWeka' was built under R version 3.6.3
## Warning in readLines(con <- file("./en_US.news.txt"), encoding = "UTF-8", :
## incomplete final line found on './en_US.news.txt'

Summary statistics

blogs

## [1] "word counts:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00
## [1] "line counts:"
##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539
## [1] "basic data table of summary statistics:"
##    Length     Class      Mode 
##    899288 character character
## [1] "histogram of word counts:"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

twitter

## [1] "word counts:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00
## [1] "line counts:"
##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096241   134082806
## [1] "basic data table of summary statistics:"
##    Length     Class      Mode 
##   2360148 character character
## [1] "histogram of word counts:"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

news

## [1] "word counts:"
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.62   46.00 1123.00
## [1] "line counts:"
##       Lines LinesNEmpty       Chars CharsNWhite 
##       77259       77259    15639408    13072698
## [1] "basic data table of summary statistics:"
##    Length     Class      Mode 
##     77259 character character
## [1] "histogram of word counts:"
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

First findings

## Warning in tm_map.SimpleCorpus(all_corpus, whitespace, "\"|/|@|\\|"):
## transformation drops documents
## Warning in tm_map.SimpleCorpus(all_corpus,
## content_transformer(stringi::stri_trans_tolower)): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(all_corpus, removeNumbers): transformation drops
## documents
## Warning in tm_map.SimpleCorpus(all_corpus, stripWhitespace): transformation
## drops documents
## Warning in tm_map.SimpleCorpus(all_corpus, removeWords, stopwords("english")):
## transformation drops documents
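
The warnings above name the cleaning steps applied to the corpus: replacing some special characters with a space, lowercasing, removing numbers, stripping extra whitespace and removing English stop words. The sketch below reproduces those steps on plain character vectors in base R. The function `clean_text` and the short `stop_words` list are assumptions for illustration (the list returned by `stopwords("english")` in `tm` is much longer), and whitespace is collapsed last here so that removed words leave no double spaces.

```r
# Small hypothetical stop-word list, for illustration only.
stop_words <- c("the", "a", "an", "and", "of", "to", "is")

# Apply the cleaning steps named in the warnings to a character vector.
clean_text <- function(x) {
  x <- gsub("\"|/|@|\\|", " ", x)   # replace ", /, @ and | with a space
  x <- tolower(x)                   # lowercase everything
  x <- gsub("[0-9]+", "", x)        # remove numbers
  stop_pattern <- paste0("\\b(", paste(stop_words, collapse = "|"), ")\\b")
  x <- gsub(stop_pattern, "", x, perl = TRUE)  # remove stop words
  x <- gsub("\\s+", " ", x)         # collapse runs of whitespace
  trimws(x)                         # drop leading and trailing spaces
}

clean_text("The price of \"gold\" is 1200 and/or more @home")
# "price gold or more home"
```

Working on plain character vectors keeps the example self-contained. In the report itself, the same steps are applied through `tm_map()` on a corpus object.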