Data Science Capstone - Milestone Report

Introduction

This report presents an exploratory data analysis of three English-language corpora (blogs, news, and Twitter) and outlines the strategy for building a next-word prediction model and an accompanying Shiny app.

Data Summary

library(stringi)  # stri_count_words()
library(knitr)    # kable()

# Read each corpus as UTF-8, skipping embedded NUL characters.
blogs <- readLines("data/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("data/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("data/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)

# Line and word counts per corpus.
data_summary <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  LineCount = c(length(blogs), length(news), length(twitter)),
  WordCount = c(sum(stri_count_words(blogs)),
                sum(stri_count_words(news)),
                sum(stri_count_words(twitter)))
)

kable(data_summary)

Dataset	LineCount	WordCount
Blogs	899288	37546806
News	1010206	34761151
Twitter	2360148	30096690

Basic Plots

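library(dplyr)     # %>%, anti_join(), count()
library(tidytext)  # unnest_tokens(), stop_words

# Tokenize the blogs corpus into words, drop stop words, and count each word.
blogs_df <- data.frame(text = blogs)
blogs_tokens <- blogs_df %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(word, sort = TRUE)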

library(ggplot2)

# Plot the 20 most frequent non-stop words in the blogs corpus.
blogs_tokens %>%
  slice_max(n, n = 20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col() +
  coord_flip() +
  labs(title = "Top 20 Words in Blogs Dataset", x = "Word", y = "Count")

Next Steps

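We will build unigram, bigram, and trigram models and combine them with a backoff strategy to predict the next word: when the longest available context has not been seen in the training data, the model falls back to a shorter one. The final app will be built with Shiny.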
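
The backoff idea can be sketched in base R on a toy corpus. This is only an illustration under stated assumptions, not the final model: the helper names (`tokenize`, `count_ngrams`, `predict_next`) and the three sample sentences are hypothetical, and the rule used here simply picks the most frequent continuation at the longest matching context, without the discounting or scoring that a Katz or "stupid backoff" model would add.

```r
# Hypothetical helpers for illustration; base R only, toy corpus.
tokenize <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z']+"))
  words[nzchar(words)]
}

# Count every n-gram in a character vector of lines, never crossing line boundaries.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(lines, function(line) {
    w <- tokenize(line)
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  table(grams)
}

# Predict the next word: try the trigram model, then the bigram model,
# and finally fall back to the most frequent single word.
predict_next <- function(models, phrase) {
  words <- tokenize(phrase)
  for (n in c(3, 2)) {
    if (length(words) < n - 1) next
    # Context of n - 1 words followed by a trailing space, e.g. "sat on ".
    prefix <- paste(c(tail(words, n - 1), ""), collapse = " ")
    counts <- models[[n]]
    hits <- counts[startsWith(as.character(names(counts)), prefix)]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(substring(best, nchar(prefix) + 1))
    }
  }
  names(models[[1]])[which.max(models[[1]])]
}

# Toy corpus (assumed sample sentences, not taken from the datasets).
toy_lines <- c("the cat sat on the mat",
               "the cat ate the fish",
               "a dog sat on the rug")
models <- lapply(1:3, function(n) count_ngrams(toy_lines, n))

predict_next(models, "sat on")     # trigram match: "the"
predict_next(models, "a big dog")  # backs off to the bigram model: "sat"
predict_next(models, "purple")     # unseen context, unigram fallback: "the"
```

On the full corpora, the same lookup would more likely be done against precomputed n-gram frequency tables built from a sample of the data, since scanning all n-gram names on every prediction would be too slow for an interactive app.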

Conclusion

The exploratory analysis is complete, and we are ready to move on to building the predictive model and the app.