
File Summary

f_names  f_lines  n_char     n_words   pct_chars  pct_lines  pct_words
blogs     899288  208361438  37334131  0.54       0.27       0.53
news       77259   15683765   2643969  0.04       0.02       0.04
twitter  2360148  162385035  30373583  0.42       0.71       0.43
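
The pct_* columns appear to be each file's share of the corresponding column total, expressed as a proportion rounded to two decimals rather than as a percentage. The sketch below recomputes them from the counts in the table under that assumption. It uses base R only, and the data frame name summary_tbl is a placeholder, not a name taken from the report.

```r
# Counts copied from the File Summary table above.
summary_tbl <- data.frame(
  f_names = c("blogs", "news", "twitter"),
  f_lines = c(899288, 77259, 2360148),
  n_char  = c(208361438, 15683765, 162385035),
  n_words = c(37334131, 2643969, 30373583)
)

# Assumption: each pct_* value is the file's share of the column total,
# rounded to two decimal places.
summary_tbl$pct_chars <- round(summary_tbl$n_char  / sum(summary_tbl$n_char),  2)
summary_tbl$pct_lines <- round(summary_tbl$f_lines / sum(summary_tbl$f_lines), 2)
summary_tbl$pct_words <- round(summary_tbl$n_words / sum(summary_tbl$n_words), 2)

print(summary_tbl)
```

Under this reading, twitter accounts for most of the lines while blogs account for most of the characters and words, so the two sources differ sharply in average line length.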

Uni-gram Distributions by Source

Bi-gram Distribution

Prediction Model

I will use the bi-gram table as the basis for prediction. The user inputs a word, and the model finds the bi-gram with the greatest relative frequency among all bi-grams that begin with that word. The second word of that bi-gram is the model's prediction for the next word, given the user's input word.
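
The lookup described above can be sketched as follows. This is a minimal illustration, not the report's actual code: the toy corpus, the table name bigrams, its columns (word1, word2, n, rel_freq), and the function predict_next are all hypothetical, and the sketch uses base R so that it runs without extra packages.

```r
# Toy corpus standing in for the real text data (assumption for illustration).
corpus <- c(
  "i love new york",
  "i love you",
  "new york is big",
  "i love new ideas"
)

# Split each line into lowercase words.
tokens <- strsplit(tolower(corpus), "[^a-z']+")

# Build bi-grams within each line by pairing every word with its successor.
pairs <- do.call(rbind, lapply(tokens, function(w) {
  w <- w[nzchar(w)]
  if (length(w) < 2) return(NULL)
  data.frame(word1 = head(w, -1), word2 = tail(w, -1),
             stringsAsFactors = FALSE)
}))

# Count each bi-gram, then divide by the total count for its first word
# to get the relative frequency of word2 given word1.
bigrams <- aggregate(n ~ word1 + word2, data = cbind(pairs, n = 1L), FUN = sum)
bigrams$rel_freq <- bigrams$n / ave(bigrams$n, bigrams$word1, FUN = sum)

# Return the second word of the most frequent bi-gram starting with `word`,
# or NA when the word never appears as the first word of a bi-gram.
predict_next <- function(word, table = bigrams) {
  candidates <- table[table$word1 == tolower(word), ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$word2[which.max(candidates$rel_freq)]
}

predict_next("love")
predict_next("zebra")
```

Two design points are worth noting. which.max() returns the first maximum, so ties are broken by row order rather than by any linguistic criterion. A word that never starts a bi-gram yields NA here; how the actual model handles unseen words is not stated in the text.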