This report presents an exploratory data analysis of three English corpora (blogs, news, and Twitter) and outlines the plan for building a next-word prediction model and a Shiny app.
# Read each corpus as UTF-8, skipping embedded NUL characters
blogs <- readLines("data/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("data/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
library(stringi)  # stri_count_words()
library(knitr)    # kable()
# Line and word counts for each corpus
data_summary <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  LineCount = c(length(blogs), length(news), length(twitter)),
  WordCount = c(sum(stri_count_words(blogs)),
                sum(stri_count_words(news)),
                sum(stri_count_words(twitter)))
)
kable(data_summary)
| Dataset | LineCount | WordCount |
|---|---|---|
| Blogs | 899288 | 37546806 |
| News | 1010206 | 34761151 |
| Twitter | 2360148 | 30096690 |
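library(dplyr)     # %>%, anti_join(), count()
library(tidytext)  # unnest_tokens(), stop_words
# Tokenize the blogs corpus into words, drop stop words, and count word frequencies
blogs_df <- data.frame(text = blogs)
blogs_tokens <- blogs_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words) %>%
  count(word, sort = TRUE)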
## Joining with `by = join_by(word)`
library(ggplot2)
# Bar chart of the 20 most frequent non-stop words in the blogs corpus
blogs_tokens %>%
  top_n(20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 20 Words in Blogs Dataset", x = "Word", y = "Count")
## Selecting by n
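We will build unigram, bigram, and trigram models and combine them with a backoff strategy: the predictor first looks for the longest matching context and falls back to a shorter one when that context was not seen in the training data. The final app will be built with Shiny.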
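To make the backoff idea concrete, here is a minimal base-R sketch. It is an illustration under stated assumptions, not the final model: the toy corpus and the helper names `tokenize`, `count_ngrams`, and `predict_next` are hypothetical, and the predictor simply returns the most frequent continuation at the longest matching order, with no smoothing or discounting.

```r
# Toy corpus standing in for a cleaned sample of the real data (hypothetical).
corpus <- c(
  "i love new york",
  "i love new york city",
  "i love new shoes",
  "you love me"
)

# Lowercase each line and split it into word tokens.
tokenize <- function(lines) strsplit(tolower(lines), "[^a-z']+")

# Count all n-grams of order n; result is sorted by decreasing frequency.
count_ngrams <- function(token_list, n) {
  grams <- unlist(lapply(token_list, function(tok) {
    tok <- tok[nzchar(tok)]
    if (length(tok) < n) return(character(0))
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts,
           function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

tokens <- tokenize(corpus)
# models[[1]] = unigrams, models[[2]] = bigrams, models[[3]] = trigrams
models <- lapply(1:3, function(n) count_ngrams(tokens, n))

# Predict the next word: try a two-word context (trigram table), then a
# one-word context (bigram table), then the most frequent unigram.
predict_next <- function(context, models) {
  words <- tokenize(context)[[1]]
  words <- words[nzchar(words)]
  for (k in seq(min(length(words), 2), 0)) {
    counts <- models[[k + 1]]
    if (k == 0) return(names(counts)[1])
    prefix <- paste0(paste(tail(words, k), collapse = " "), " ")
    hits <- counts[startsWith(names(counts), prefix)]
    if (length(hits) > 0) {
      # Tables are sorted, so the first hit is the most frequent n-gram;
      # its last word is the prediction.
      return(sub(".* ", "", names(hits)[1]))
    }
  }
}

predict_next("i love", models)       # trigram match: "new"
predict_next("really new", models)   # backs off to bigrams: "york"
predict_next("hello world", models)  # backs off to unigrams: "love"
```

On the full corpora, the same lookup would likely need a sampled training set and pruning of rare n-grams to keep the tables small enough for a responsive Shiny app.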
The exploratory analysis is complete, and the next step is to build the predictive model and the app.