Introduction

This report summarizes the exploratory analysis of the SwiftKey text corpus, consisting of English text from blogs, news, and Twitter. The goal is to demonstrate understanding of the data and outline a plan to build a text prediction model and a Shiny app.

Data Loading

The data were downloaded from the Capstone Project site and read into R with readLines(). skipNul = TRUE is used because some of the files contain embedded nul characters that can otherwise truncate or garble the read.

library(stringi)
library(ggplot2)
library(tm)
library(RWeka)

# skipNul = TRUE prevents embedded nul characters from truncating the read.
blogs   <- readLines("en_US.blogs.txt",   encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
news    <- readLines("en_US.news.txt",    encoding = "UTF-8", warn = FALSE, skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", warn = FALSE, skipNul = TRUE)

Basic Statistics

Here’s a summary of line and word counts for each file:

summary_df <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter)))
)
knitr::kable(summary_df)

Line Length Distribution

The histograms below compare the number of characters per line across the three sources.

blogs_len <- nchar(blogs)
twitter_len <- nchar(twitter)
news_len <- nchar(news)

df <- data.frame(
  Length = c(blogs_len, twitter_len, news_len),
  Source = factor(c(rep("Blogs", length(blogs_len)),
                    rep("Twitter", length(twitter_len)),
                    rep("News", length(news_len))))
)

ggplot(df, aes(x = Length, fill = Source)) +
  geom_histogram(bins = 50) +
  facet_wrap(~Source, scales = "free_y") +
  theme_minimal() +
  labs(title = "Line Length Distribution by Source", x = "Characters per Line")

Most Common Words

To keep the term-document matrix tractable, the corpus is built from a random sample of lines rather than the full data set.

# 10,000 lines per source is an arbitrary but workable sample size;
# set.seed makes the sample reproducible.
set.seed(123)
sample_data <- c(sample(blogs, 10000),
                 sample(news, 10000),
                 sample(twitter, 10000))
sample_corpus <- Corpus(VectorSource(sample_data))
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))
sample_corpus <- tm_map(sample_corpus, removePunctuation)
sample_corpus <- tm_map(sample_corpus, removeNumbers)
sample_corpus <- tm_map(sample_corpus, removeWords, stopwords("en"))
sample_corpus <- tm_map(sample_corpus, stripWhitespace)

dtm <- DocumentTermMatrix(sample_corpus)
# slam::col_sums keeps the matrix sparse (slam is a dependency of tm),
# avoiding the memory cost of as.matrix() on a large DTM.
word_freq <- sort(slam::col_sums(dtm), decreasing = TRUE)
top_words <- head(word_freq, 10)

barplot(top_words, las = 2, col = "steelblue", main = "Top 10 Most Frequent Words")

Plans for the Prediction Algorithm

The final prediction model will be built with n-gram modeling (bigrams and trigrams), using techniques such as:

- frequency tables of bigrams and trigrams built from the cleaned corpus (see the tokenization sketch below)
- backoff from trigrams to bigrams to unigrams when a longer context is unseen
- smoothing (for example, Kneser-Ney) to assign non-zero probability to unseen word sequences
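
As a rough illustration of the first item, the RWeka package loaded earlier provides an n-gram tokenizer. This is a sketch on toy input only; the real tables will be built from the sampled, cleaned corpus.

# Tokenizers for bigrams and trigrams via RWeka (loaded above).
bigram_tok  <- function(x) NGramTokenizer(x, Weka_control(min = 2, max = 2))
trigram_tok <- function(x) NGramTokenizer(x, Weka_control(min = 3, max = 3))

# Toy input; in practice x would be the cleaned sample corpus text.
example_text <- "this is a test this is only a test"
head(sort(table(bigram_tok(example_text)), decreasing = TRUE))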

Because the full corpus is too large to model in memory, we will preprocess and train on a random sample of the data, and we will evaluate prediction accuracy on held-out text via cross-validation.
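
To make the backoff idea concrete, here is a minimal sketch of a lookup-based predictor. The tables are toy objects built from the example text above; the final model will use tables computed from the sampled corpus, and this is an illustration rather than the eventual implementation.

# Toy frequency tables built with the tokenizers defined above.
tri_freq <- table(trigram_tok(example_text))
bi_freq  <- table(bigram_tok(example_text))

# Predict the next word given the two preceding words, backing off
# from trigrams to bigrams when the longer context was never seen.
predict_next <- function(w1, w2, tri_freq, bi_freq) {
  hits <- tri_freq[startsWith(names(tri_freq), paste0(w1, " ", w2, " "))]
  if (length(hits) == 0) {
    hits <- bi_freq[startsWith(names(bi_freq), paste0(w2, " "))]
  }
  if (length(hits) == 0) return(NA_character_)
  best <- names(which.max(hits))
  tail(strsplit(best, " ")[[1]], 1)  # keep only the predicted word
}

predict_next("this", "is", tri_freq, bi_freq)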

Conclusion

This exploratory analysis confirms that the data can be loaded and summarized, and it provides a foundation for the next steps: building the n-gram prediction model and deploying it as a Shiny app.


Note: heavy data cleaning and model fitting will happen offline, ahead of time, so that the Shiny app only needs to load compact precomputed lookup tables and can respond quickly.
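
For example (a sketch; the object reuses word_freq from above and the file name is a placeholder), frequency tables can be serialized once and loaded by the app at startup:

# Illustrative only: preprocessing happens offline, and the app loads
# small precomputed objects instead of reprocessing the raw corpus.
saveRDS(word_freq, "word_freq.rds")     # run once, offline
word_freq <- readRDS("word_freq.rds")   # run once at app startup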