This report explores the SwiftKey English corpus, which comprises blog, news, and Twitter text. The goal is to build a predictive text model using n-gram techniques.
# Packages used throughout this report
library(stringr)   # str_count, str_replace_all, str_squish
library(knitr)     # kable
library(dplyr)     # count, filter, %>%
library(tidytext)  # unnest_tokens
library(ggplot2)   # plotting

# Read the three English corpora, skipping embedded nulls
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
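One caveat: on some platforms `readLines()` stops partway through `en_US.news.txt` because the file contains embedded control characters. A common workaround is to open the file in binary mode (a sketch; behavior varies by OS):

```r
# If `news` comes up short, re-read the file in binary mode so embedded
# control characters do not truncate it on some platforms
con  <- file("en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
```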
data_summary <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(str_count(blogs, "\\w+")),
            sum(str_count(news, "\\w+")),
            sum(str_count(twitter, "\\w+")))
)
kable(data_summary)
| File | Lines | Words |
|---|---|---|
| Blogs | 899288 | 38309620 |
| News | 1010206 | 35622913 |
| Twitter | 2360148 | 31003544 |
set.seed(123)  # for reproducible sampling
sample_blogs   <- sample(blogs, 5000)
sample_news    <- sample(news, 5000)
sample_twitter <- sample(twitter, 5000)

# Lowercase, drop everything except letters and spaces, collapse whitespace
sample_data <- tolower(c(sample_blogs, sample_news, sample_twitter))
sample_data <- str_replace_all(sample_data, "[^a-z\\s]", " ")
sample_data <- str_squish(sample_data)
# Tokenize into single words and keep only the most frequent ones
text_df <- tibble(text = sample_data)
word_counts <- text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  filter(n > 200)
ggplot(word_counts, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Most Frequent Words", x = "Word", y = "Frequency")
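The same tidytext pipeline extends naturally from single words to n-grams, which previews the modeling step described next. A brief sketch (the `bigram_counts` name is illustrative; `token = "ngrams"` is tidytext's built-in n-gram tokenizer):

```r
# Count adjacent word pairs in the sample; n = 3 gives trigrams analogously
bigram_counts <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%   # texts shorter than n tokenize to NA
  count(bigram, sort = TRUE)

head(bigram_counts, 10)  # ten most frequent word pairs
```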
We will build a trigram model (sequences of three words): given the last two words of the user's input, the model predicts the most likely third word. When no matching trigram is found, the model backs off to a bigram model. The final model will be deployed as a Shiny web application; sketches of both pieces follow below.
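As a minimal sketch of the planned backoff logic (the `trigrams`/`bigrams` count tables, their `w1`/`w2`/`w3` column layout, and the `predict_next()` helper are illustrative assumptions, not the final implementation):

```r
# Sketch: trigram lookup with bigram fallback.
# Assumes `trigrams` has columns w1, w2, w3, n and `bigrams` has w1, w2, n,
# built with the tidytext pipeline above plus tidyr::separate(), e.g.:
#   trigrams <- text_df %>%
#     unnest_tokens(ngram, text, token = "ngrams", n = 3) %>%
#     tidyr::separate(ngram, c("w1", "w2", "w3"), sep = " ") %>%
#     count(w1, w2, w3, sort = TRUE)

predict_next <- function(input, trigrams, bigrams) {
  words <- str_split(str_squish(tolower(input)), " ")[[1]]
  len <- length(words)

  # Trigram lookup: condition on the last two words of the input
  if (len >= 2) {
    hit <- trigrams %>%
      filter(w1 == words[len - 1], w2 == words[len]) %>%
      slice_max(n, n = 1, with_ties = FALSE)
    if (nrow(hit) > 0) return(hit$w3)
  }

  # Bigram fallback: condition on the last word only
  hit <- bigrams %>%
    filter(w1 == words[len]) %>%
    slice_max(n, n = 1, with_ties = FALSE)
  if (nrow(hit) > 0) return(hit$w2)

  NA_character_  # no n-gram match at any level
}
```

For example, `predict_next("thanks for the", trigrams, bigrams)` would return the word most often observed after "for the" in the sample.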
This report demonstrates successful data loading, cleaning, and exploration. The next step is to build the predictive text model and deploy it as a Shiny web application.
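As a rough sketch of what that deployment could look like (the UI layout and the reuse of the hypothetical `predict_next()` helper above are assumptions, not the final app):

```r
library(shiny)

# Sketch of the planned app: a text box wired to the backoff predictor.
# Assumes `trigrams`, `bigrams`, and predict_next() are loaded with the app.
ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)  # wait until the user has typed something
    predict_next(input$phrase, trigrams, bigrams)
  })
}

shinyApp(ui, server)
```

Because the n-gram tables are precomputed offline, prediction at request time reduces to simple table filters, which should keep the app responsive.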