File Summary

Source	Lines	Characters	Words	Share of characters	Share of lines	Share of words
blogs	899288	208361438	37334131	0.54	0.27	0.53
news	77259	15683765	2643969	0.04	0.02	0.04
twitter	2360148	162385035	30373583	0.42	0.71	0.43
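
The report does not show how this summary was computed, so the following is only a minimal sketch of one way to build such a table in base R. The toy `texts` list, the helper `count_words`, and the column names are illustrative assumptions, and a "word" is assumed to be any whitespace-separated token. Each share is a fraction of the column total (between 0 and 1), not a percentage.

```r
# Toy stand-ins for the three sources (NOT the real corpus). In practice each
# element would be the character vector read from the corresponding file.
texts <- list(
  blogs   = c("a long blog post about cats", "another post"),
  news    = c("short headline"),
  twitter = c("hi", "ok then", "see you soon")
)

# Count whitespace-separated tokens in a character vector.
# Assumption: this simple definition of a "word" is good enough for a summary.
count_words <- function(x) {
  sum(lengths(strsplit(trimws(x), "\\s+")))
}

file_summary <- data.frame(
  source = names(texts),
  lines  = vapply(texts, length, integer(1)),
  chars  = vapply(texts, function(x) sum(nchar(x)), numeric(1)),
  words  = vapply(texts, count_words, numeric(1)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# Each share is the source's count divided by the total over all sources,
# rounded to two decimals.
file_summary$share_chars <- round(file_summary$chars / sum(file_summary$chars), 2)
file_summary$share_lines <- round(file_summary$lines / sum(file_summary$lines), 2)
file_summary$share_words <- round(file_summary$words / sum(file_summary$words), 2)

print(file_summary)
```

Because the shares are rounded independently, a column may sum to slightly more or less than 1.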

Uni-gram Distributions by Source

Bi-gram Distribution

Prediction Model

I will use the bi-gram table as the basis for prediction. The user enters a word, and the model looks up every bi-gram that begins with that word and selects the one with the greatest relative frequency. The second word of that bi-gram is the model’s prediction for the next word.
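
The lookup described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the report's actual implementation: the toy `corpus`, the helpers `make_bigrams` and `predict_next`, the simple letters-and-apostrophes tokenizer, and the choice to return `NA` for an unseen word are all hypothetical.

```r
# Toy corpus (illustrative only, not the report's data).
corpus <- c("i love new york", "i love you", "new york is big", "i love new shoes")

# Split each line into lowercase words and pair every word with its successor.
make_bigrams <- function(lines) {
  tokens <- strsplit(tolower(lines), "[^a-z']+")
  pairs <- lapply(tokens, function(w) {
    w <- w[nzchar(w)]                 # drop empty tokens
    if (length(w) < 2) return(NULL)   # a single word forms no bi-gram
    data.frame(word1 = head(w, -1), word2 = tail(w, -1),
               stringsAsFactors = FALSE)
  })
  do.call(rbind, Filter(Negate(is.null), pairs))
}

bigrams <- make_bigrams(corpus)
bigrams$n <- 1L

# Count each (word1, word2) pair, then divide by the total count for word1
# to get the relative frequency of word2 given word1.
bigram_table <- aggregate(bigrams["n"], by = bigrams[c("word1", "word2")], FUN = sum)
bigram_table$rel_freq <- bigram_table$n /
  ave(bigram_table$n, bigram_table$word1, FUN = sum)

# Return the most frequent follower of `word`, or NA if the word was never
# seen as the first word of a bi-gram. Ties go to the first matching row.
predict_next <- function(word, tbl = bigram_table) {
  candidates <- tbl[tbl$word1 == tolower(word), ]
  if (nrow(candidates) == 0) return(NA_character_)
  as.character(candidates$word2[which.max(candidates$rel_freq)])
}

predict_next("love")
predict_next("zebra")
```

On this toy corpus, "love" is followed by "new" twice and by "you" once, so `predict_next("love")` returns "new", while `predict_next("zebra")` returns `NA` because no bi-gram starts with that word. A fuller model would need some fallback for unseen input words, for example the most common uni-gram.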

Capstone Project

Jeff C

5/31/2020
