Text Prediction Milestone

Hello! To explore the datasets we will use to make our predictions, let’s first look at the line counts of the samples drawn from the Blogs, Twitter, and News datasets. I’ll also do some preliminary cleaning here, such as removing non-English characters.
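The sampling and cleaning code is not shown in this report. As a minimal sketch, assuming "non-English characters" means anything outside ASCII, the step might look like the following. The helper names sample_lines and clean_lines, the seed, and the toy input are hypothetical and are not the ones used to produce the counts in this report.

```r
# Hypothetical helpers; the report's own sampling and cleaning code is not shown.

# Keep a random fraction of the lines.
sample_lines <- function(lines, fraction) {
  keep <- sample(length(lines), size = ceiling(length(lines) * fraction))
  lines[keep]
}

# Drop non-ASCII characters (assumed meaning of "non-English"),
# collapse repeated whitespace, and discard lines left empty.
clean_lines <- function(lines) {
  ascii <- iconv(lines, from = "UTF-8", to = "ASCII", sub = "")
  ascii <- trimws(gsub("\\s+", " ", ascii))
  ascii[!is.na(ascii) & ascii != ""]
}

set.seed(1)  # arbitrary seed so the sample is reproducible
raw <- c("caf\u00e9 ok", "\u4f60\u597d", "  plain   text ")
cleaned <- clean_lines(raw)            # toy input, not the real data
sampled <- sample_lines(cleaned, 0.5)  # keep half of the cleaned lines
```

Cleaning before sampling keeps the sample size predictable, since lines that become empty are dropped first. Sampling first would be cheaper on the full files, so either order could be reasonable.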

Length of Twitter sample

length(sampleTwitter)
## [1] 23601

Length of Blogs sample

length(sampleBlogs)
## [1] 8992

Length of News sample

length(sampleNews)
## [1] 10102

Length of combined sample

length(c(sampleTwitter, sampleBlogs, sampleNews))
## [1] 42695

Data Processing

That’s a lot of text, so it’s a good thing we are working with a sample. Next, I will graph the words that appear most often in the sample.

Below is a graph of the words that appear more than 2,000 times across all the datasets in the sample:

wordcounts %>%
  filter(n > 2000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Word \n", y = "\n Count", title = "Frequent Words in Sample Dataset \n") +
  geom_text(aes(label = n), hjust = 1.2, color = "green", fontface = "bold") +
  theme(plot.title = element_text(hjust = 0.5),
        axis.title.x = element_text(face="bold", size = 12),
        axis.title.y = element_text(face="bold", size = 12))
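The wordcounts table plotted above is not built anywhere in this report. One way such a table might be produced is sketched below in base R, assuming it has one row per word with the columns word and n that the plotting code expects. The tokenizing rule and the toy input are assumptions, not the author's preprocessing.

```r
# Hypothetical toy input; the report would use the combined sample instead.
lines <- c("The cat sat", "the dog sat", "The end")

# Lowercase the text and split on anything that is not a letter or apostrophe.
words <- unlist(strsplit(tolower(lines), "[^a-z']+"))
words <- words[words != ""]

# Tabulate the words and sort from most to least frequent.
counts <- sort(table(words), decreasing = TRUE)

# Shape the result like the table used in the plot: columns `word` and `n`.
wordcounts <- data.frame(
  word = names(counts),
  n = as.integer(counts),
  stringsAsFactors = FALSE
)
```

With the real sample, this data frame could be piped into the plotting code above unchanged, since filter(n > 2000) and reorder(word, n) only rely on those two columns.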

Plans for the Prediction

The final model will use a constructed dictionary of unigrams, bigrams, and trigrams to predict which word will come next based on user input. Below is an example of the most common bigrams in the bigram dictionary!
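To make the dictionary idea concrete, here is a rough sketch of how n-gram counts could drive a prediction. It assumes a simple backoff rule: try the trigram dictionary first, then bigrams, then fall back to the most common unigram. The final model may work differently. The function names tokenize, ngrams, count_ngrams, and predict_next and the toy corpus are hypothetical.

```r
# Hypothetical toy corpus; the real dictionaries would come from the sample.
corpus <- c("i love new york", "i love new shoes", "new york is big")

# Split a line into lowercase word tokens.
tokenize <- function(line) {
  tokens <- unlist(strsplit(tolower(line), "[^a-z']+"))
  tokens[tokens != ""]
}

# Build every run of n consecutive tokens, joined by single spaces.
ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts, function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams over all lines, most frequent first.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(l) ngrams(tokenize(l), n)))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(ngram = names(counts), n = as.integer(counts),
             stringsAsFactors = FALSE)
}

# dicts[[1]] = unigrams, dicts[[2]] = bigrams, dicts[[3]] = trigrams.
dicts <- lapply(1:3, function(n) count_ngrams(corpus, n))

# Simple backoff: longest matching context wins, else the top unigram.
predict_next <- function(input, dicts) {
  tokens <- tokenize(input)
  for (n in 3:2) {
    k <- n - 1  # number of context words needed for an n-gram
    if (length(tokens) < k) next
    prefix <- paste0(paste(tail(tokens, k), collapse = " "), " ")
    hits <- dicts[[n]][startsWith(dicts[[n]]$ngram, prefix), ]
    # Return the last word of the most frequent matching n-gram.
    if (nrow(hits) > 0) return(sub(".* ", "", hits$ngram[1]))
  }
  dicts[[1]]$ngram[1]
}

predict_next("I love", dicts)     # "new", from the trigram "i love new"
predict_next("visit new", dicts)  # "york", backing off to the bigram "new york"
```

Storing each dictionary sorted by count means a prediction is just the first matching row. On the full sample a keyed lookup would likely be needed for speed.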
