Introduction

This report presents an exploratory analysis of three English-language corpora (blogs, news, and Twitter) and outlines the plan for building a next-word prediction model and a Shiny app.

Data Summary

library(stringi)  # stri_count_words()
library(knitr)    # kable()

# Read each corpus as UTF-8, skipping embedded NUL characters
blogs <- readLines("data/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("data/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Line and word counts for each corpus
data_summary <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  LineCount = c(length(blogs), length(news), length(twitter)),
  WordCount = c(sum(stri_count_words(blogs)),
                sum(stri_count_words(news)),
                sum(stri_count_words(twitter)))
)

kable(data_summary)
Dataset   LineCount   WordCount
Blogs        899288    37546806
News        1010206    34761151
Twitter     2360148    30096690

Basic Plots

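library(dplyr)     # %>%, anti_join(), count(), slice_max()
library(tidytext)  # unnest_tokens(), stop_words
library(ggplot2)

# Tokenize the blog text into words, drop stop words, and count occurrences
blogs_df <- data.frame(text = blogs)
blogs_tokens <- blogs_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)

# Plot the 20 most frequent words (explicit arguments avoid the join/selection messages)
blogs_tokens %>%
  slice_max(n, n = 20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 20 Words in Blogs Dataset", x = "Word", y = "Count")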

Next Steps

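We will build unigram, bigram, and trigram models and use a backoff strategy to predict the next word: when the longest available context has no match, the model falls back to a shorter one. The final app will be built with Shiny.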
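
To make the plan concrete, here is a minimal sketch of n-gram prediction with simple backoff, written in base R. It is an illustration under stated assumptions, not the final model: the three-line toy corpus and the helper names (tokenize, count_ngrams, predict_next) are hypothetical, and the backoff is the plainest variant (use the longest matching context, with no discounting or smoothing).

```r
# Toy corpus standing in for the real blogs/news/Twitter text (assumption).
corpus <- c(
  "the cat sat on the mat",
  "the cat ate the fish",
  "the dog sat on the rug"
)

# Split each line into lowercase word tokens (hypothetical helper).
tokenize <- function(lines) {
  lapply(strsplit(tolower(lines), "[^a-z']+"), function(tok) tok[nzchar(tok)])
}

# Count every n-gram of order n; returns a named frequency table.
count_ngrams <- function(token_list, n) {
  grams <- unlist(lapply(token_list, function(tok) {
    if (length(tok) < n) return(character(0))
    starts <- seq_len(length(tok) - n + 1)
    vapply(starts, function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  table(grams)
}

tokens <- tokenize(corpus)
models <- lapply(1:3, function(n) count_ngrams(tokens, n))  # uni, bi, tri

# Predict the next word: try the trigram model, then bigram, then unigram.
predict_next <- function(phrase, models) {
  words <- tokenize(phrase)[[1]]
  for (n in 3:2) {
    k <- n - 1                      # number of context words this order needs
    if (length(words) < k) next
    prefix <- paste0(paste(utils::tail(words, k), collapse = " "), " ")
    counts <- models[[n]]
    hits <- counts[startsWith(names(counts), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(substring(best, nchar(prefix) + 1))  # keep only the final word
    }
  }
  # No context matched: back off to the most frequent single word.
  names(models[[1]])[which.max(models[[1]])]
}

predict_next("he sat on", models)    # trigram match: "the"
predict_next("purple dog", models)   # backs off to bigram: "sat"
predict_next("hello world", models)  # backs off to unigram: "the"
```

On the full corpora, the same counts could come from the tidytext tokenizer used above rather than from these base R helpers, and ties between equally frequent candidates would need an explicit rule.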

Conclusion

The exploratory analysis is complete, and the next stage is to build the predictive model and the app.