This report explores the SwiftKey English corpus, which comprises blog, news, and Twitter text. The goal is to build a predictive text model using n-gram techniques.
# Packages used throughout this report
library(stringr)   # str_count, str_replace_all, str_squish
library(knitr)     # kable
library(dplyr)     # count, filter, %>%
library(tidytext)  # unnest_tokens
library(ggplot2)   # plotting

# Read the three English corpora, skipping embedded nulls
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
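One caveat: on some platforms `readLines()` stops partway through `en_US.news.txt` because the file contains embedded control characters. A common workaround is to open the file in binary mode (a sketch; behavior varies by OS):

```r
# If `news` comes up short, re-read the file in binary mode so embedded
# control characters do not truncate it on some platforms
con  <- file("en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
```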
data_summary <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(str_count(blogs, "\\w+")),
            sum(str_count(news, "\\w+")),
            sum(str_count(twitter, "\\w+")))
)
kable(data_summary)
| File | Lines | Words |
|---|---|---|
| Blogs | 899288 | 38309620 |
| News | 1010206 | 35622913 |
| Twitter | 2360148 | 31003544 |
set.seed(123)  # for reproducible sampling
sample_blogs   <- sample(blogs, 5000)
sample_news    <- sample(news, 5000)
sample_twitter <- sample(twitter, 5000)

# Lowercase, drop everything except letters and spaces, collapse whitespace
sample_data <- tolower(c(sample_blogs, sample_news, sample_twitter))
sample_data <- str_replace_all(sample_data, "[^a-z\\s]", " ")
sample_data <- str_squish(sample_data)
# Tokenize into single words and keep only the most frequent ones
text_df <- tibble(text = sample_data)
word_counts <- text_df %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE) %>%
  filter(n > 200)
ggplot(word_counts, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Most Frequent Words", x = "Word", y = "Frequency")
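The same tidytext pipeline extends naturally from single words to n-grams, which previews the modeling step described next. A brief sketch (the `bigram_counts` name is illustrative; `token = "ngrams"` is tidytext's built-in n-gram tokenizer):

```r
# Count adjacent word pairs in the sample; n = 3 gives trigrams analogously
bigram_counts <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%   # texts shorter than n tokenize to NA
  count(bigram, sort = TRUE)

head(bigram_counts, 10)  # ten most frequent word pairs
```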
We will build a trigram model (sequences of three words): given the last two words of the user's input, the model predicts the most likely third word. When no matching trigram is found, the model backs off to a bigram model. The final model will be deployed as a Shiny web application; sketches of both pieces follow below.
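As a minimal sketch of the planned backoff logic (the `trigrams`/`bigrams` count tables, their `w1`/`w2`/`w3` column layout, and the `predict_next()` helper are illustrative assumptions, not the final implementation):

```r
# Sketch: trigram lookup with bigram fallback.
# Assumes `trigrams` has columns w1, w2, w3, n and `bigrams` has w1, w2, n,
# built with the tidytext pipeline above plus tidyr::separate(), e.g.:
#   trigrams <- text_df %>%
#     unnest_tokens(ngram, text, token = "ngrams", n = 3) %>%
#     tidyr::separate(ngram, c("w1", "w2", "w3"), sep = " ") %>%
#     count(w1, w2, w3, sort = TRUE)

predict_next <- function(input, trigrams, bigrams) {
  words <- str_split(str_squish(tolower(input)), " ")[[1]]
  len <- length(words)

  # Trigram lookup: condition on the last two words of the input
  if (len >= 2) {
    hit <- trigrams %>%
      filter(w1 == words[len - 1], w2 == words[len]) %>%
      slice_max(n, n = 1, with_ties = FALSE)
    if (nrow(hit) > 0) return(hit$w3)
  }

  # Bigram fallback: condition on the last word only
  hit <- bigrams %>%
    filter(w1 == words[len]) %>%
    slice_max(n, n = 1, with_ties = FALSE)
  if (nrow(hit) > 0) return(hit$w2)

  NA_character_  # no n-gram match at any level
}
```

For example, `predict_next("thanks for the", trigrams, bigrams)` would return the word most often observed after "for the" in the sample.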
This report demonstrates successful data loading, cleaning, and exploration. The next step is to build the predictive text model and deploy it as a Shiny web application.
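As a rough sketch of what that deployment could look like (the UI layout and the reuse of the hypothetical `predict_next()` helper above are assumptions, not the final app):

```r
library(shiny)

# Sketch of the planned app: a text box wired to the backoff predictor.
# Assumes `trigrams`, `bigrams`, and predict_next() are loaded with the app.
ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    req(input$phrase)  # wait until the user has typed something
    predict_next(input$phrase, trigrams, bigrams)
  })
}

shinyApp(ui, server)
```

Because the n-gram tables are precomputed offline, prediction at request time reduces to simple table filters, which should keep the app responsive.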