Introduction

This report explores the English-language (en_US) SwiftKey dataset, which contains text from blogs, news articles, and Twitter messages. The goal is to understand the basic structure of the data (line counts, word counts, and the distribution of line lengths) before building a next-word prediction model and a Shiny app.

Load the Data

# Load the three text files from the data folder
blogs <- readLines("data/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("data/en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("data/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")

Summary Statistics

Line Counts

length(blogs)
## [1] 899288
length(news)
## [1] 1010206
length(twitter)
## [1] 2360148

Word Counts

library(stringi)

blogs_words <- sum(stri_count_words(blogs))
news_words <- sum(stri_count_words(news))
twitter_words <- sum(stri_count_words(twitter))

Summary Table

library(knitr)

summary_table <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(blogs_words, news_words, twitter_words)
)

kable(summary_table)
File        Lines      Words
Blogs      899288   37546806
News      1010206   34761151
Twitter   2360148   30096649
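
The counts in the table above can also be turned into an average number of words per line, which makes the three sources easier to compare. The sketch below re-enters the reported counts by hand so that it runs without the data files; the `summary_counts` and `WordsPerLine` names are illustrative and not part of the original analysis.

```r
# Line and word counts copied from the summary table above
summary_counts <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010206, 2360148),
  Words = c(37546806, 34761151, 30096649)
)

# Average words per line, rounded to one decimal place
summary_counts$WordsPerLine <- round(summary_counts$Words / summary_counts$Lines, 1)

print(summary_counts)
```

On these numbers, blog lines average roughly 42 words, news lines roughly 34, and tweets roughly 13, so Twitter contributes the most lines but the fewest words.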

Exploratory Plots

Histogram of Blog Line Lengths

hist(nchar(blogs),
     main = "Blog Line Length Distribution",
     xlab = "Characters per Line",
     col = "skyblue",
     border = "black")

Histogram of Twitter Word Counts

hist(stri_count_words(twitter),
     main = "Twitter Word Count Distribution",
     xlab = "Words per Tweet",
     col = "orange",
     border = "black")

Histogram from Sampled Twitter Data (20,000 entries)

set.seed(123)
twitter_sample <- sample(twitter, 20000)

hist(stri_count_words(twitter_sample),
     main = "Sample Twitter Word Count Distribution",
     xlab = "Words per Tweet",
     col = "green",
     border = "black")

Interesting Findings

Plan for Prediction Algorithm

I will build an n-gram model using bigrams and trigrams derived from the cleaned text data. The workflow will include:

The final Shiny app will allow the user to type a phrase and receive real-time next-word suggestions, similar to mobile keyboard prediction.
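
As a rough illustration of the bigram and trigram idea described above, the sketch below counts n-grams in a tiny made-up corpus and predicts the next word by looking up the last two words among the trigrams, falling back to the last word among the bigrams. It uses base R only. The toy `corpus` and the helper names (`tokenize`, `count_ngrams`, `predict_next`) are assumptions made for this example and not the final implementation, which would run on the cleaned SwiftKey text and would likely need smoothing and a more compact lookup structure.

```r
# Toy corpus standing in for the cleaned text (hypothetical data)
corpus <- c("i love new york",
            "i love new shoes",
            "new york is big",
            "i love new york city")

# Lowercase each line and split it into words on whitespace
tokenize <- function(lines) {
  lapply(strsplit(tolower(lines), "\\s+"), function(w) w[nzchar(w)])
}

# Count every n-gram in a list of token vectors, most frequent first
count_ngrams <- function(tokens, n) {
  grams <- unlist(lapply(tokens, function(w) {
    if (length(w) < n) return(character(0))
    idx <- seq_len(length(w) - n + 1)
    vapply(idx, function(i) paste(w[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Predict the next word: try the trigram table first, then the bigram table
predict_next <- function(phrase, bigrams, trigrams) {
  words <- tokenize(phrase)[[1]]
  for (k in c(2, 1)) {
    if (length(words) < k) next
    prefix <- paste(tail(words, k), collapse = " ")
    counts <- if (k == 2) trigrams else bigrams
    hits <- counts[startsWith(names(counts), paste0(prefix, " "))]
    if (length(hits) > 0) {
      best <- names(hits)[which.max(hits)]
      return(sub(".* ", "", best))  # keep only the last word of the n-gram
    }
  }
  NA_character_  # no matching context found
}

tokens   <- tokenize(corpus)
bigrams  <- count_ngrams(tokens, 2)
trigrams <- count_ngrams(tokens, 3)

predict_next("i love", bigrams, trigrams)      # "new"
predict_next("i love new", bigrams, trigrams)  # "york"
```

A Shiny app could call a function like `predict_next` on the current input each time the text changes, though the n-gram tables would have to be precomputed and trimmed to keep the lookup fast.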

Conclusion

This exploratory analysis confirms that all three files load successfully and summarizes their line counts, word counts, and length distributions. These results provide a baseline for developing the prediction algorithm and the Shiny application.