Hello! To explore the datasets that we will be using to make our predictions, let’s first look at the line counts for samples of the Blogs, Twitter, and News datasets. I’ll also do some preliminary cleaning here, like removing non-English characters.
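To give a feel for that cleaning step, here is a minimal sketch in base R. It assumes the cleaning means dropping anything that is not plain ASCII, and it uses `iconv()` for that; the actual cleaning code may do this differently. The `raw_lines` vector is a made-up stand-in for a few lines of the real samples.

```r
# Hypothetical stand-in for a few raw lines; the real samples are much larger
raw_lines <- c("I love coffee \u2615 so much",
               "na\u00efve caf\u00e9 talk",
               "plain ascii line")

# Drop every character that cannot be represented in ASCII
# (assumption: "non-English characters" means non-ASCII characters)
clean_lines <- iconv(raw_lines, from = "UTF-8", to = "ASCII", sub = "")

# Collapse the extra whitespace left behind and trim both ends
clean_lines <- trimws(gsub(" +", " ", clean_lines))

print(clean_lines)

# Same kind of line count as reported for the samples
length(clean_lines)
```

Note that this also strips accented letters from otherwise English words, so it is a blunt tool; it is only meant to show the idea.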
Length of Twitter sample
length(sampleTwitter)
## [1] 23601
Length of Blogs sample
length(sampleBlogs)
## [1] 8992
Length of News sample
length(sampleNews)
## [1] 10102
Length of combined sample
## [1] 42695
That’s a big dataset! Good thing we are using a sample. Next, I will use graphs to show the words that appear most often in the datasets.
Below is a graph of the most frequent words in all the datasets:
library(dplyr)    # filter(), mutate(), %>%
library(ggplot2)  # plotting functions

# Keep words that appear more than 2000 times and order the bars by count
wordcounts %>%
  filter(n > 2000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +  # horizontal bars keep the word labels readable
  labs(x = "Word \n", y = "\n Count", title = "Frequent Words in Sample Dataset \n") +
  geom_text(aes(label = n), hjust = 1.2, color = "green", fontface = "bold") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.title.x = element_text(face = "bold", size = 12),
        axis.title.y = element_text(face = "bold", size = 12))
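The plot code expects a `wordcounts` table with a `word` column and an `n` column. Its construction isn't shown here, so the following is only a sketch of how such a table could be built in base R. The `toy_text` lines and the `wordcounts_toy` name are made up for the example.

```r
# Hypothetical toy text standing in for the combined sample
toy_text <- c("the cat sat on the mat", "the dog sat on the log")

# Split each line into lowercase words and drop empty strings
words <- unlist(strsplit(tolower(toy_text), "[^a-z']+"))
words <- words[words != ""]

# One row per word with its count in column n, most frequent first
counts <- table(words)
wordcounts_toy <- data.frame(word = names(counts),
                             n = as.integer(counts),
                             stringsAsFactors = FALSE)
wordcounts_toy <- wordcounts_toy[order(-wordcounts_toy$n), ]

print(wordcounts_toy)
```

A table in this shape can be passed to the same `filter()` and `ggplot()` pipeline, with a smaller threshold than 2000 for toy data.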
The final model will use a constructed dictionary of unigrams, bigrams, and trigrams to predict which word will come next based on user input. Below is an example of the most common bigrams in the bigram dictionary!
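As a rough illustration of how a bigram dictionary can drive a prediction, here is a minimal sketch in base R. It is not the final model: the toy sentences and the helpers `make_ngrams()` and `predict_next()` are made up for this example, and the lookup simply picks the most frequent bigram that starts with the given word.

```r
# Hypothetical toy corpus; the real dictionary is built from the samples
toy_text <- c("i love new york", "i love new shoes", "new york is big")

# Build all n-grams of size n from one tokenized line
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

tokens_by_line <- strsplit(tolower(toy_text), " +")
bigrams <- unlist(lapply(tokens_by_line, make_ngrams, n = 2))

# Bigram dictionary: counts sorted from most to least frequent
bigram_counts <- sort(table(bigrams), decreasing = TRUE)

# Predict the next word: among bigrams that start with `word`,
# return the second word of the most frequent one (NA if none match)
predict_next <- function(word, counts) {
  hits <- counts[startsWith(names(counts), paste0(word, " "))]
  if (length(hits) == 0) return(NA_character_)
  sub("^[^ ]+ ", "", names(hits)[which.max(hits)])
}

print(head(bigram_counts, 3))
print(predict_next("new", bigram_counts))
```

Calling `make_ngrams()` with `n = 1` or `n = 3` gives unigrams or trigrams in the same way, so one helper could feed all three dictionaries.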