The goal of this project is to build a tool that suggests words based on the last word the user typed. This helps users write text faster: instead of typing a word out in full, they can pick one of the suggested words. The suggestions are derived from publicly available text data (here: US blogs, Twitter posts, and news articles) and are determined mainly by the most recently written word. Below is the exploratory analysis I did to wrap my head around the data.
## [1] "Total words in the US blogs data set: 37546806"
## [1] "Total words in the US twitter data set: 30096649"
## [1] "Total words in the US news data set: 2674561"
## [1] "Total lines in the US blogs data set: 899288"
## [1] "Total lines in the US twitter data set: 2360148"
## [1] "Total lines in the US news data set: 77259"
## [1] "Average words per entry in the US blogs data set: 41.7518428529332"
## [1] "Average words per entry in the US twitter data set: 12.7520176700783"
## [1] "Average words per entry in the US news data set: 34.6181156887871"
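Summary figures like the ones above (total words, total lines, average words per entry) can be computed roughly as follows. This is only a minimal sketch in base R, not necessarily the code behind this report: the helper name `summarise_corpus`, the sample data, and the choice to count a "word" as any whitespace-separated token are all assumptions.

```r
# Hypothetical helper: summarise a character vector with one entry per line.
summarise_corpus <- function(lines) {
  # Assumption: a "word" is any run of non-whitespace characters.
  # trimws() avoids an empty leading token when a line starts with spaces.
  tokens <- strsplit(trimws(lines), "\\s+")
  words_per_line <- lengths(tokens)  # empty lines count as 0 words
  c(lines = length(lines),
    words = sum(words_per_line),
    avg_words_per_entry = mean(words_per_line))
}

# Tiny in-memory sample so the sketch runs without the real data files.
# For a real file, something like readLines(path, encoding = "UTF-8",
# skipNul = TRUE) could supply `lines`; `path` is a placeholder here.
sample_entries <- c("the quick brown fox", "jumps over", "the lazy dog")
stats <- summarise_corpus(sample_entries)
print(stats)
```

A different tokenizer (for example one that strips punctuation or numbers) would give somewhat different totals, so counts are only comparable when the same word definition is used throughout.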
## Most frequently used words in the US blogs data set:
## # A tibble: 15 × 3
## word n freq
## <chr> <int> <dbl>
## 1 the 1860184 0.0495
## 2 and 1094404 0.0291
## 3 to 1069442 0.0285
## 4 a 900374 0.0240
## 5 of 876799 0.0234
## 6 i 775057 0.0206
## 7 in 598541 0.0159
## 8 that 460783 0.0123
## 9 is 432715 0.0115
## 10 it 403905 0.0108
## 11 for 363840 0.00969
## 12 you 298709 0.00796
## 13 with 286734 0.00764
## 14 was 278347 0.00741
## 15 on 276514 0.00736
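A frequency table with the columns `word`, `n`, and `freq` can be built along these lines. This is a hedged sketch in base R rather than the tidyverse pipeline the tibble output suggests. The function name `word_frequencies`, the sample sentences, and the tokenization rule (lower-cased runs of letters and apostrophes) are assumptions made for illustration.

```r
# Hypothetical helper: top-n word frequencies for a character vector of lines.
word_frequencies <- function(lines, top_n = 15) {
  lower <- tolower(lines)
  # Assumption: words are runs of ASCII letters and apostrophes.
  tokens <- unlist(regmatches(lower, gregexpr("[a-z']+", lower)))
  counts <- sort(table(tokens), decreasing = TRUE)
  result <- data.frame(
    word = names(counts),
    n    = as.integer(counts),
    # Relative frequency: share of all tokens in the data set.
    freq = as.numeric(counts) / length(tokens),
    stringsAsFactors = FALSE
  )
  head(result, top_n)
}

# Small in-memory sample instead of the real data set.
sample_entries <- c("the cat and the dog", "the dog and I")
top_words <- word_frequencies(sample_entries, top_n = 3)
print(top_words)
```

Here `freq` is `n` divided by the total number of tokens, which matches the tables in this report: for example, 1860184 / 37546806 is about 0.0495 for "the" in the blogs data.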
## Most frequently used words in the US twitter data set:
## # A tibble: 15 × 3
## word n freq
## <chr> <int> <dbl>
## 1 the 937467 0.0311
## 2 to 788663 0.0262
## 3 i 723548 0.0240
## 4 a 611407 0.0203
## 5 you 548164 0.0182
## 6 and 438541 0.0146
## 7 for 385357 0.0128
## 8 in 380383 0.0126
## 9 of 359636 0.0119
## 10 is 358787 0.0119
## 11 it 295125 0.00981
## 12 my 291924 0.00970
## 13 on 278038 0.00924
## 14 that 234679 0.00780
## 15 me 202713 0.00674
## Most frequently used words in the US news data set:
## # A tibble: 15 × 3
## word n freq
## <chr> <int> <dbl>
## 1 the 151717 0.0567
## 2 to 69757 0.0261
## 3 and 68604 0.0257
## 4 a 67346 0.0252
## 5 of 59315 0.0222
## 6 in 51894 0.0194
## 7 for 27166 0.0102
## 8 that 26384 0.00986
## 9 is 21969 0.00821
## 10 on 20814 0.00778
## 11 with 19758 0.00739
## 12 said 19176 0.00717
## 13 was 17627 0.00659
## 14 he 17587 0.00658
## 15 it 16768 0.00627
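To connect these counts to the goal stated in the introduction, suggesting a word from the last word typed, one simple approach is a bigram lookup: count how often each word follows another, then offer the most frequent followers. The sketch below is one possible illustration in base R, not the project's actual model. `build_bigram_table`, `suggest_next`, and the sample sentences are hypothetical names and data.

```r
# Hypothetical helper: count adjacent word pairs (bigrams) within each line.
build_bigram_table <- function(lines) {
  lower <- tolower(lines)
  # Assumption: words are runs of ASCII letters and apostrophes.
  tokens_per_line <- regmatches(lower, gregexpr("[a-z']+", lower))
  pairs <- lapply(tokens_per_line, function(tokens) {
    if (length(tokens) < 2) return(NULL)  # no bigram in a one-word line
    data.frame(first = head(tokens, -1), second = tail(tokens, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, Filter(Negate(is.null), pairs))
  # Sum a constant 1 per pair to get the count n for each (first, second).
  aggregate(n ~ first + second, data = cbind(pairs, n = 1L), FUN = sum)
}

# Hypothetical helper: the k most frequent followers of `last_word`.
suggest_next <- function(bigrams, last_word, k = 3) {
  candidates <- bigrams[bigrams$first == tolower(last_word), ]
  candidates <- candidates[order(-candidates$n), ]
  head(as.character(candidates$second), k)
}

sample_entries <- c("i love the beach", "i love the sun",
                    "i love you", "the beach is nice")
bigrams <- build_bigram_table(sample_entries)
print(suggest_next(bigrams, "love"))  # "the" "you"
```

Because the most frequent words in all three data sets are function words such as "the", "to", and "and", a pure frequency lookup like this will often suggest them; whether and how to handle that is a design decision left open here.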