This report provides an exploratory analysis of the SwiftKey dataset, which contains text from blogs, news articles, and Twitter messages. The goal is to understand the basic structure of the data before building a next-word prediction model and Shiny app.
# Load the three text files from the data folder
blogs <- readLines("data/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("data/en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("data/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")
length(blogs)
## [1] 899288
length(news)
## [1] 1010206
length(twitter)
## [1] 2360148
library(stringi)
blogs_words <- sum(stri_count_words(blogs))
news_words <- sum(stri_count_words(news))
twitter_words <- sum(stri_count_words(twitter))
library(knitr)
summary_table <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(blogs_words, news_words, twitter_words)
)
kable(summary_table)
| File | Lines | Words |
|---|---|---|
| Blogs | 899288 | 37546806 |
| News | 1010206 | 34761151 |
| Twitter | 2360148 | 30096649 |
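The line and word counts above also give a quick sense of how long a typical entry is in each source. The following is a minimal sketch that recomputes this from the numbers in the table; the `WordsPerLine` column name is introduced here only for illustration.

```r
# Counts copied from the summary table above, so the block runs
# without loading the full text files.
summary_table <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010206, 2360148),
  Words = c(37546806, 34761151, 30096649)
)

# Average number of words per line, rounded to one decimal place.
# WordsPerLine is an illustrative column name, not part of the original table.
summary_table$WordsPerLine <- round(summary_table$Words / summary_table$Lines, 1)

print(summary_table)
```

With these counts, blog lines average about 41.8 words, news lines about 34.4, and tweets about 12.8, which fits the short-message nature of Twitter.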
hist(nchar(blogs),
     main = "Blog Line Length Distribution",
     xlab = "Characters per Line",
     col = "skyblue",
     border = "black")
hist(stri_count_words(twitter),
     main = "Twitter Word Count Distribution",
     xlab = "Words per Tweet",
     col = "orange",
     border = "black")
set.seed(123)
twitter_sample <- sample(twitter, 20000)
hist(stri_count_words(twitter_sample),
     main = "Sample Twitter Word Count Distribution",
     xlab = "Words per Tweet",
     col = "green",
     border = "black")
I will build an n-gram model using bigrams and trigrams derived from the cleaned text data.
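As a rough illustration of the bigram and trigram idea, the sketch below counts n-grams in a tiny made-up corpus and predicts the next word by trying trigrams first and falling back to bigrams. It uses base R only. The corpus, the function names (`tokenize`, `make_ngrams`, `count_ngrams`, `predict_next`), and the simple fallback rule are all assumptions made for this example, not the final design.

```r
# Toy corpus standing in for the cleaned text (hypothetical data).
corpus <- c("i love new york", "i love new shoes", "i love new york pizza")

# Lowercase each line and split it into word tokens.
tokenize <- function(lines) strsplit(tolower(lines), "[^a-z']+")

# Build every n-gram of size n within each tokenized line.
make_ngrams <- function(tokens, n) {
  unlist(lapply(tokens, function(w) {
    w <- w[nzchar(w)]
    if (length(w) < n) return(character(0))
    vapply(seq_len(length(w) - n + 1),
           function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
}

# Turn a vector of n-grams into a named integer vector of counts.
count_ngrams <- function(ngrams) {
  tb <- table(ngrams)
  setNames(as.integer(tb), names(tb))
}

tokens <- tokenize(corpus)
bigram_counts  <- count_ngrams(make_ngrams(tokens, 2))
trigram_counts <- count_ngrams(make_ngrams(tokens, 3))

# Predict the next word: try the last two words against the trigrams,
# then fall back to the last word against the bigrams.
predict_next <- function(phrase, bigrams, trigrams) {
  w <- tokenize(phrase)[[1]]
  w <- w[nzchar(w)]
  lookup <- function(counts, prefix) {
    hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
    if (length(hits) == 0) return(NA_character_)
    # Keep only the final word of the most frequent matching n-gram.
    sub(".* ", "", names(hits)[which.max(hits)])
  }
  if (length(w) >= 2) {
    guess <- lookup(trigrams, paste(tail(w, 2), collapse = " "))
    if (!is.na(guess)) return(guess)
  }
  if (length(w) >= 1) return(lookup(bigrams, tail(w, 1)))
  NA_character_
}

predict_next("I love new", bigram_counts, trigram_counts)
predict_next("brand new", bigram_counts, trigram_counts)
```

In this toy corpus, "I love new" is resolved by a trigram match, while "brand new" has no matching trigram and falls back to the bigrams that start with "new". A phrase whose last word never appears as a bigram prefix returns `NA`.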
The final Shiny app will allow the user to type a phrase and receive real-time next-word suggestions, similar to mobile keyboard prediction.
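A minimal sketch of how such an app might be wired together is shown below. It assumes the `shiny` package is installed. The predictor `suggest_next` is a hypothetical placeholder with a hard-coded lookup, standing in for the real n-gram model, and `make_app` and the input and output IDs are likewise illustrative names.

```r
# Hypothetical placeholder predictor: looks up the last typed word in a
# tiny hard-coded table. The real app would call the n-gram model here.
suggest_next <- function(phrase) {
  words <- strsplit(trimws(tolower(phrase)), "\\s+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) == 0) return("")
  lookup <- c(new = "york", love = "new")
  guess <- lookup[tail(words, 1)]
  if (is.na(guess)) "" else unname(guess)
}

# Build the app object. shiny is referenced with `::` so that defining
# this function does not require the package to be attached.
make_app <- function() {
  ui <- shiny::fluidPage(
    shiny::textInput("phrase", "Type a phrase:"),
    shiny::textOutput("suggestion")
  )
  server <- function(input, output, session) {
    # Re-runs whenever the text input changes.
    output$suggestion <- shiny::renderText(suggest_next(input$phrase))
  }
  shiny::shinyApp(ui = ui, server = server)
}

# To launch in an interactive session: shiny::runApp(make_app())
```

Keeping the predictor as a plain function, separate from the UI code, means it can be tested on its own before being connected to the app.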
This exploratory analysis confirms the dataset is successfully loaded and summarized. The findings provide a baseline understanding for developing the prediction algorithm and Shiny application.