Data exploration

Below, the data is loaded, a basic report of summary statistics is shown, and the first findings are reported for the three Coursera-SwiftKey en_US data sets: blogs, twitter and news.

## Warning: package 'tm' was built under R version 3.6.3

## Loading required package: NLP

## Loading required package: stringi

## 
## Attaching package: 'ggplot2'

## The following object is masked from 'package:NLP':
## 
##     annotate

## Warning: package 'rJava' was built under R version 3.6.3

## Warning: package 'RWeka' was built under R version 3.6.3

## Warning in readLines(con <- file("./en_US.news.txt"), encoding = "UTF-8", :
## incomplete final line found on './en_US.news.txt'
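The warning above comes from reading the news file with `readLines()` through a connection created inline. A minimal sketch of an alternative is shown here, assuming it is acceptable to open the file in binary mode; the helper name `read_lines_binary` and the temporary demo file are hypothetical and not part of the original analysis. Binary mode is a common precaution against text-mode reads stopping early on some platforms, and `warn = FALSE` suppresses the incomplete-final-line warning.

```r
# Hypothetical helper (not from the original report): read a text file
# line by line through a binary connection that is closed on exit.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  # skipNul drops embedded NUL bytes; warn = FALSE silences the
  # "incomplete final line" warning for files without a trailing newline.
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Self-contained demonstration on a temporary file that, like the news
# file, has no newline after its last line.
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("first line\nsecond line"), tmp)
demo_lines <- read_lines_binary(tmp)
print(demo_lines)
```

In the report itself the same helper would be called with a path such as `"./en_US.news.txt"`.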

Summary statistics

blogs

## [1] "word counts:"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.00    9.00   28.00   41.75   60.00 6726.00

## [1] "line counts:"

##       Lines LinesNEmpty       Chars CharsNWhite 
##      899288      899288   206824382   170389539
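The column names `Lines`, `LinesNEmpty`, `Chars` and `CharsNWhite` match the output of `stringi::stri_stats_general()`, so the two reports above can plausibly be reproduced as sketched below. The use of `stri_count_words()` for the word counts is an assumption, and `demo_lines` is a small made-up vector standing in for the blogs data.

```r
library(stringi)

# Made-up stand-in for one of the data sets (one element per line).
demo_lines <- c("The quick brown fox", "jumps over", "", "the lazy dog")

# Words per line, summarised as min, quartiles, mean and max.
word_counts <- stri_count_words(demo_lines)
print(summary(word_counts))

# Line and character totals, including non-empty lines and
# non-whitespace characters.
line_stats <- stri_stats_general(demo_lines)
print(line_stats)
```

A minimum word count of 0 alongside equal `Lines` and `LinesNEmpty` values, as in the blogs output, would be consistent with lines that contain characters but nothing the word counter treats as a word.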

## [1] "basic data table of summary statistics:"

##    Length     Class      Mode 
##    899288 character character

## [1] "histogram of word counts:"

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
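The `stat_bin()` message appears whenever `geom_histogram()` is called without an explicit bin specification. A minimal sketch of setting `binwidth` follows; the sample values, the column name `words` and the bin width of 5 are illustrative assumptions, not the settings used in the report.

```r
library(ggplot2)

# Made-up word counts standing in for the per-line counts of a data set.
word_df <- data.frame(words = c(4, 2, 0, 3, 12, 28, 41, 60, 9, 15))

# An explicit binwidth replaces the default of 30 bins and
# silences the stat_bin() message.
hist_plot <- ggplot(word_df, aes(x = words)) +
  geom_histogram(binwidth = 5, boundary = 0) +
  labs(x = "Words per line", y = "Number of lines")

print(hist_plot)
```

For strongly right-skewed counts such as the blogs data (median 28, maximum 6726), a suitable bin width depends on how much of the tail the plot is meant to show.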

twitter

## [1] "word counts:"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00    7.00   12.00   12.75   18.00   47.00

## [1] "line counts:"

##       Lines LinesNEmpty       Chars CharsNWhite 
##     2360148     2360148   162096241   134082806

## [1] "basic data table of summary statistics:"

##    Length     Class      Mode 
##   2360148 character character

## [1] "histogram of word counts:"

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

news

## [1] "word counts:"

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    1.00   19.00   32.00   34.62   46.00 1123.00

## [1] "line counts:"

##       Lines LinesNEmpty       Chars CharsNWhite 
##       77259       77259    15639408    13072698

## [1] "basic data table of summary statistics:"

##    Length     Class      Mode 
##     77259 character character

## [1] "histogram of word counts:"

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

First findings

## Warning in tm_map.SimpleCorpus(all_corpus, whitespace, "\"|/|@|\\|"):
## transformation drops documents

## Warning in tm_map.SimpleCorpus(all_corpus,
## content_transformer(stringi::stri_trans_tolower)): transformation drops
## documents

## Warning in tm_map.SimpleCorpus(all_corpus, removeNumbers): transformation drops
## documents

## Warning in tm_map.SimpleCorpus(all_corpus, stripWhitespace): transformation
## drops documents

## Warning in tm_map.SimpleCorpus(all_corpus, removeWords, stopwords("english")):
## transformation drops documents
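The warnings above list the cleaning steps applied to `all_corpus` in order. A minimal sketch of that pipeline follows, under two assumptions: `whitespace` is taken to be a `content_transformer` that replaces a pattern with a space, and the corpus is built with `VCorpus()` rather than as a `SimpleCorpus`, since the "transformation drops documents" warning is raised by `tm_map.SimpleCorpus`. The sample text is made up.

```r
library(tm)

# Made-up sample text standing in for the combined data sets.
sample_text <- c("Hello @world / 123 the quick brown fox",
                 "Another \"quoted\" line | with 42 numbers")

# VCorpus instead of SimpleCorpus (assumption, see lead-in).
all_corpus <- VCorpus(VectorSource(sample_text))

# Assumed definition of the `whitespace` transformer named in the warnings:
# replace every match of `pattern` with a single space.
whitespace <- content_transformer(function(x, pattern) gsub(pattern, " ", x))

# Same steps, in the same order, as reported in the warnings.
all_corpus <- tm_map(all_corpus, whitespace, "\"|/|@|\\|")
all_corpus <- tm_map(all_corpus, content_transformer(stringi::stri_trans_tolower))
all_corpus <- tm_map(all_corpus, removeNumbers)
all_corpus <- tm_map(all_corpus, stripWhitespace)
all_corpus <- tm_map(all_corpus, removeWords, stopwords("english"))

# Extract the cleaned text for inspection.
cleaned <- unlist(lapply(all_corpus, as.character))
print(cleaned)
```

Because `removeWords` leaves gaps where stop words were, running `stripWhitespace` once more at the end may give tidier text; the order shown here simply mirrors the warnings.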

Capstone_week2

Isadora