Introduction

This report provides an exploratory analysis of the SwiftKey dataset, which contains text from blogs, news articles, and Twitter messages. The goal is to understand the basic structure of the data before building a next-word prediction model and Shiny app.

Load the Data

# Load the three text files from the data folder
blogs <- readLines("data/en_US.blogs.txt", warn = FALSE, encoding = "UTF-8")
news <- readLines("data/en_US.news.txt", warn = FALSE, encoding = "UTF-8")
twitter <- readLines("data/en_US.twitter.txt", warn = FALSE, encoding = "UTF-8")

Summary Statistics

Line Counts

length(blogs)

## [1] 899288

length(news)

## [1] 1010206

length(twitter)

## [1] 2360148

Word Counts

library(stringi)

blogs_words <- sum(stri_count_words(blogs))
news_words <- sum(stri_count_words(news))
twitter_words <- sum(stri_count_words(twitter))

Summary Table

library(knitr)

summary_table <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(blogs_words, news_words, twitter_words)
)

kable(summary_table)

File	Lines	Words
Blogs	899288	37546806
News	1010206	34761151
Twitter	2360148	30096649
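
The counts above are easier to compare once they are turned into an average number of words per line. The following is a minimal sketch that re-enters the figures from the table by hand, so it runs without the data files; the `WordsPerLine` column is an addition made here for illustration and is not part of the original table.

```r
# Line and word counts copied from the summary table in this report
summary_table <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010206, 2360148),
  Words = c(37546806, 34761151, 30096649)
)

# Derived column: mean number of words per line for each source
summary_table$WordsPerLine <- round(summary_table$Words / summary_table$Lines, 2)

print(summary_table)
```

With these figures, blogs come out at roughly 42 words per line, news at roughly 34, and Twitter at roughly 13.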

Exploratory Plots

Histogram of Blog Line Lengths

hist(nchar(blogs),
     main = "Blog Line Length Distribution",
     xlab = "Characters per Line",
     col = "skyblue",
     border = "black")

Histogram of Twitter Word Counts

hist(stri_count_words(twitter),
     main = "Twitter Word Count Distribution",
     xlab = "Words per Tweet",
     col = "orange",
     border = "black")

Histogram from Sampled Twitter Data (20,000 entries)

set.seed(123)
twitter_sample <- sample(twitter, 20000)

hist(stri_count_words(twitter_sample),
     main = "Sample Twitter Word Count Distribution",
     xlab = "Words per Tweet",
     col = "green",
     border = "black")
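
Sampling keeps plotting fast. One quick way to check that a random sample stands in for the full data is to compare a few summary statistics on both. The sketch below uses simulated word counts (`full_counts`, a hypothetical stand-in drawn from a Poisson distribution) so that it runs without the data files; with the real data, `stri_count_words(twitter)` would take its place.

```r
set.seed(123)

# Hypothetical stand-in for per-tweet word counts (not the real data)
full_counts <- rpois(100000, lambda = 12.75)

# Draw 20,000 entries without replacement, as in the sampled histogram
sample_counts <- sample(full_counts, 20000)

# Compare the mean and quartiles of the sample against the full vector
comparison <- rbind(
  full   = c(mean = mean(full_counts),
             quantile(full_counts, c(0.25, 0.5, 0.75))),
  sample = c(mean = mean(sample_counts),
             quantile(sample_counts, c(0.25, 0.5, 0.75)))
)

print(round(comparison, 2))
```

If the two rows are close, the sampled histogram can reasonably be read as a picture of the full distribution.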

Interesting Findings

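Twitter has the highest number of lines (2,360,148), but each entry is short: about 13 words per line on average.
Blog text contains the longest lines (about 42 words per line on average, versus about 34 for news) and more variation.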
News data is more structured and formal.
The three sources vary heavily in style, so n-gram frequencies learned from one source may not transfer well to another.
Twitter irregularities (abbreviations, slang) will require cleaning.

Plan for Prediction Algorithm

I will build an n-gram model using bigrams and trigrams, with unigrams as a fallback, all derived from the cleaned text data. The workflow will include:

Cleaning text (lowercasing, removing numbers, punctuation, and profanity)
Tokenizing into unigrams, bigrams, and trigrams
Creating frequency tables for n-grams
Applying a smoothing technique (such as Katz back-off) to handle n-grams not seen in the training data
Using the model to predict the next word
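
The steps above can be sketched on a toy corpus. This is a minimal illustration in base R, not the final implementation: the corpus and the helper names (`clean_text`, `tokenize`, `make_ngrams`, `ngram_freq`, `predict_next`) are hypothetical, profanity filtering is left out, and the prediction step is a plain back-off that picks the most frequent continuation without the discounting that Katz back-off applies.

```r
# Hypothetical toy corpus standing in for the cleaned SwiftKey sample
corpus <- c("I love data science.",
            "I love R, and I love plots!",
            "Data science is fun in 2024.")

# Lowercase, then replace everything except letters, apostrophes
# and spaces (this drops numbers and punctuation)
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("[^a-z' ]", " ", x)
  trimws(gsub(" +", " ", x))
}

# Split each line into words; lines stay separate so that
# n-grams never span two entries
tokenize <- function(x) strsplit(clean_text(x), " ", fixed = TRUE)

# Build all n-grams of order n from one token vector
make_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Frequency table of n-grams over all lines, most frequent first
ngram_freq <- function(lines, n) {
  grams <- unlist(lapply(tokenize(lines), make_ngrams, n = n))
  sort(table(grams), decreasing = TRUE)
}

# freqs[[n]] holds the n-gram table for n = 1, 2, 3
freqs <- lapply(1:3, function(n) ngram_freq(corpus, n))

# Simple back-off: try the trigram table, then the bigram table,
# then fall back to the most frequent unigram.
# Assumes freqs contains at least the unigram and bigram tables.
predict_next <- function(phrase, freqs) {
  words <- tokenize(phrase)[[1]]
  for (n in seq(length(freqs), 2)) {
    if (length(words) < n - 1) next
    context <- paste(tail(words, n - 1), collapse = " ")
    grams <- names(freqs[[n]])
    hits <- grams[startsWith(grams, paste0(context, " "))]
    # Tables are sorted, so the first hit is the most frequent n-gram
    if (length(hits) > 0) return(sub(".* ", "", hits[1]))
  }
  names(freqs[[1]])[1]
}

predict_next("I love data", freqs)
```

On this toy corpus the call finds the trigram "love data science" and so suggests "science". On the real data the frequency tables would be built once from a sample and stored, since rebuilding them for every request would be too slow for an interactive app.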

The final Shiny app will allow the user to type a phrase and receive real-time next-word suggestions, similar to mobile keyboard prediction.

Conclusion

This exploratory analysis confirms that the three files load successfully and summarizes their line counts, word counts, and length distributions. The findings provide a baseline understanding for developing the prediction algorithm and Shiny application.

Exploratory Data Analysis – Coursera Capstone

Rohan