Setup

library(tidyverse)
## ── Attaching packages ──────────────────────────── tidyverse 1.3.0 ──
## ✔ ggplot2 3.3.0     ✔ purrr   0.3.3
## ✔ tibble  2.1.3     ✔ dplyr   0.8.3
## ✔ tidyr   1.0.2     ✔ stringr 1.4.0
## ✔ readr   1.3.1     ✔ forcats 0.4.0
## ── Conflicts ─────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
library(ngram)

Read in datasets
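
The three sources (blogs, news, and tweets) are read in line by line, one entry per line. A minimal sketch, assuming the corpus is stored as three plain-text files in the working directory; the file names below are placeholders, not confirmed paths:

# Placeholder file names; adjust to wherever the corpus files live.
blogs  <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news   <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
tweets <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)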

Examine the amount of data in each dataset

entries <- data.frame(Source = c("Blogs", "News", "Tweets"),
                      Entries = c(length(blogs), length(news), length(tweets)))
entries %>% 
  ggplot() +
  geom_col(aes(Source, Entries))

As one can see, there are more tweets than news or blog entries. Next, let's look at the average entry length, in words, for each source.

# Mean words per entry for each source
wordcounts <- map_dbl(list(blogs, news, tweets),
                      ~ mean(sapply(.x, wordcount)))
wc_frame <- data.frame(Source = c("Blogs", "News", "Tweets"),
                       Wordcount = wordcounts)
wc_frame %>% 
  ggplot() +
  geom_col(aes(Source, Wordcount))

As one might expect, blogs offer the longest entries on average, followed by news, with character-limited tweets far behind.

The goal of this project is to build a predictive text algorithm. In future reports, we will analyze how often words appear next to one another (n-gram frequencies) and build an algorithm, plus an app to demonstrate it, that chains words together based on the frequencies learned from these samples.
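
As a preview of that approach, here is a minimal sketch of how bigram frequencies from the ngram package could drive a naive next-word prediction. The sample size and the predict_next() helper are hypothetical choices for illustration, not the final model:

# Build a bigram table from a sample of blog entries
# (the sample size is an arbitrary illustration).
sample_text <- paste(sample(blogs, 1000), collapse = " ")
bigrams <- ngram(sample_text, n = 2)
head(get.phrasetable(bigrams))   # most frequent word pairs

# Hypothetical helper: predict the next word as the most frequent
# follower of the given word in the bigram phrase table.
predict_next <- function(current_word, phrasetable) {
  phrasetable %>%
    filter(str_detect(ngrams, paste0("^", current_word, " "))) %>%
    arrange(desc(freq)) %>%
    slice(1) %>%
    pull(ngrams) %>%
    word(2)
}
predict_next("the", get.phrasetable(bigrams))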