Introduction

This report explores the SwiftKey dataset, which contains English-language text from blogs, news articles, and Twitter. The goal is to build a next-word prediction model based on n-grams.

Loading the Data

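library(tidyverse)  # stringr, tibble, dplyr (%>%), ggplot2
library(tidytext)   # unnest_tokens()
library(knitr)      # kable()
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)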

Basic Summary

data_summary <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(str_count(blogs, "\\w+")),
            sum(str_count(news, "\\w+")),
            sum(str_count(twitter, "\\w+")))
)
kable(data_summary)

File        Lines      Words
Blogs      899288   38309620
News      1010206   35622913
Twitter   2360148   31003544

Exploratory Analysis

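# Sample 5,000 lines per source; the seed makes the sample reproducible
set.seed(123)
sample_blogs <- sample(blogs, 5000)
sample_news <- sample(news, 5000)
sample_twitter <- sample(twitter, 5000)

# Lowercase, then replace everything except letters, apostrophes, and whitespace
# with a space, so contractions such as "don't" are not split into "don" and "t"
sample_data <- tolower(c(sample_blogs, sample_news, sample_twitter))
sample_data <- str_replace_all(sample_data, "[^a-z'\\s]", " ")
sample_data <- str_squish(sample_data)
text_df <- tibble(text = sample_data)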

Word Frequency Plot

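# Count word occurrences and keep words that appear more than 200 times
word_counts <- text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  filter(n > 200)

# Horizontal bar chart with the most frequent word at the top
ggplot(word_counts, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Most Frequent Words", x = "Word", y = "Frequency")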
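
Because the stated goal is an n-gram model, the same tokenization step can be extended from single words to word pairs. The following is a minimal sketch on a toy data frame, not part of the analysis above; `toy_df` and `bigram_counts` are hypothetical names, and in the report the input would presumably be `text_df`.

```r
library(dplyr)
library(tibble)
library(tidytext)

# Hypothetical toy input standing in for the cleaned sample
toy_df <- tibble(text = c("thanks for the follow",
                          "thanks for the help",
                          "see you soon"))

# token = "ngrams" with n = 2 splits each line into overlapping word pairs
bigram_counts <- toy_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%   # lines shorter than n words can yield NA
  count(bigram, sort = TRUE)

print(bigram_counts)
```

Setting `n = 3` in the same call would give trigram counts.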

Plan for Prediction Model

We will build a trigram model (sequences of three words): given the last two words the user has typed, the model predicts the third. If no matching trigram is found, the model falls back to bigrams and predicts from the last word alone. The model will be deployed as a Shiny web application.
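
One way the trigram lookup with bigram fallback could work is sketched below in base R on a toy corpus. This is an illustration under stated assumptions, not the final model: `corpus`, `build_ngrams()`, and `predict_next()` are hypothetical names, and the sketch simply returns the most frequent continuation without any smoothing.

```r
# Hypothetical toy corpus standing in for the cleaned sample
corpus <- c("i love new york", "i love new shoes",
            "i love new york pizza", "we love old books")

# Build an n-gram frequency table, sorted by decreasing count.
# `history` holds the first n - 1 words, `nxt` the word that follows them.
build_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, "\\s+"), function(w) {
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  counts <- sort(table(grams), decreasing = TRUE)
  data.frame(history = sub("\\s*\\S+$", "", names(counts)),
             nxt = sub("^.*\\s", "", names(counts)),
             n = as.integer(counts),
             stringsAsFactors = FALSE)
}

trigrams <- build_ngrams(corpus, 3)
bigrams <- build_ngrams(corpus, 2)

# Try the last two words against the trigram table first;
# if nothing matches, fall back to the last word and the bigram table.
predict_next <- function(input, trigrams, bigrams) {
  words <- strsplit(tolower(trimws(input)), "\\s+")[[1]]
  k <- length(words)
  if (k >= 2) {
    hit <- trigrams[trigrams$history == paste(words[k - 1], words[k]), ]
    if (nrow(hit) > 0) return(hit$nxt[1])
  }
  if (k >= 1) {
    hit <- bigrams[bigrams$history == words[k], ]
    if (nrow(hit) > 0) return(hit$nxt[1])
  }
  NA_character_  # no trigram or bigram match
}

predict_next("love new", trigrams, bigrams)     # trigram match: "york"
predict_next("brand new", trigrams, bigrams)    # bigram fallback: "york"
predict_next("hello there", trigrams, bigrams)  # no match: NA
```

Sorting each table by count once means a prediction is just the first matching row, which keeps the lookup cheap enough for an interactive application.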

Conclusion

This report demonstrates successful data loading, cleaning, and exploration. The next step is to build the predictive text model and deploy it as a Shiny web application.