This report summarizes the exploratory analysis of the SwiftKey text corpus, consisting of English text from blogs, news, and Twitter. The goal is to demonstrate understanding of the data and outline a plan to build a text prediction model and a Shiny app.
The data were downloaded from the Capstone Project site and successfully read into R.
library(stringi)
library(ggplot2)
library(tm)
library(RWeka)
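# Read each file as UTF-8; skipNul = TRUE drops any embedded NUL bytes instead of warning
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)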
Here’s a summary of line and word counts for each file:
summary_df <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(sum(stri_count_words(blogs)),
            sum(stri_count_words(news)),
            sum(stri_count_words(twitter)))
)
knitr::kable(summary_df)
blogs_len <- nchar(blogs)
twitter_len <- nchar(twitter)
news_len <- nchar(news)
df <- data.frame(
  Length = c(blogs_len, twitter_len, news_len),
  Source = factor(c(rep("Blogs", length(blogs_len)),
                    rep("Twitter", length(twitter_len)),
                    rep("News", length(news_len))))
)
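# Free both axes: tweets are far shorter than blog or news lines, so a shared
# x-axis would squeeze the whole Twitter panel into the first bin or two
ggplot(df, aes(x = Length, fill = Source)) +
  geom_histogram(bins = 50) +
  facet_wrap(~Source, scales = "free") +
  theme_minimal() +
  labs(title = "Line Length Distribution by Source",
       x = "Characters per Line", y = "Number of Lines")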
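# Draw a reproducible random sample; the full corpus is too large to process in memory
# (the 1% fraction and the seed are illustrative choices)
set.seed(1234)
all_lines <- c(blogs, news, twitter)
sample_data <- sample(all_lines, size = round(0.01 * length(all_lines)))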
sample_corpus <- Corpus(VectorSource(sample_data))
sample_corpus <- tm_map(sample_corpus, content_transformer(tolower))
sample_corpus <- tm_map(sample_corpus, removePunctuation)
sample_corpus <- tm_map(sample_corpus, removeNumbers)
sample_corpus <- tm_map(sample_corpus, removeWords, stopwords("en"))
sample_corpus <- tm_map(sample_corpus, stripWhitespace)
dtm <- DocumentTermMatrix(sample_corpus)
# Sum term counts on the sparse matrix; as.matrix(dtm) would build a huge dense matrix
word_freq <- sort(slam::col_sums(dtm), decreasing = TRUE)
top_words <- head(word_freq, 10)
barplot(top_words, las = 2, col = "steelblue", main = "Top 10 Most Frequent Words")
The final prediction model will be built using n-gram modeling (bigrams and trigrams), with tools such as RWeka.
We will preprocess and sample the data due to memory constraints, and evaluate model accuracy using cross-validation.
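The n-gram idea can be illustrated with a minimal base-R sketch. It is not the final model: the helper names (`tokenize`, `ngrams`, `build_counts`, `predict_next`), the toy sentences, and the simple "try trigrams, then fall back to bigrams" rule are all assumptions made for illustration, and a real implementation would likely tokenize with a dedicated package and run on the sampled corpus.

```r
# Illustrative sketch only: all function names and the toy data are hypothetical.

# Lower-case a line, keep letters and apostrophes, and split it into words.
tokenize <- function(line) {
  clean <- gsub("[^a-z' ]", " ", tolower(line))
  words <- strsplit(clean, "\\s+")[[1]]
  words[nzchar(words)]
}

# Return all n-grams of a word vector as space-separated strings.
ngrams <- function(words, n) {
  if (length(words) < n) return(character(0))
  idx <- seq_len(length(words) - n + 1)
  vapply(idx, function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Count n-grams over a set of lines, most frequent first.
build_counts <- function(lines, n) {
  grams <- unlist(lapply(lines, function(l) ngrams(tokenize(l), n)))
  sort(table(grams), decreasing = TRUE)
}

# Return the most frequent continuation of `prefix` in a count table, or NA.
best_match <- function(prefix, counts) {
  nm <- as.character(names(counts))
  hits <- nm[startsWith(nm, prefix)]
  if (length(hits) == 0) return(NA_character_)
  substring(hits[1], nchar(prefix) + 1)
}

# Predict the next word: use the last two words with the trigram table,
# and back off to the last word with the bigram table if nothing matches.
predict_next <- function(phrase, tri, bi) {
  words <- tokenize(phrase)
  k <- length(words)
  if (k >= 2) {
    guess <- best_match(paste0(paste(words[(k - 1):k], collapse = " "), " "), tri)
    if (!is.na(guess)) return(guess)
  }
  if (k >= 1) {
    guess <- best_match(paste0(words[k], " "), bi)
    if (!is.na(guess)) return(guess)
  }
  NA_character_
}

# Toy usage
toy <- c("I love to read books", "I love to write code",
         "I love to read news", "We love to read")
bi <- build_counts(toy, 2)
tri <- build_counts(toy, 3)
predict_next("they love to", tri, bi)   # matched in the trigram table
predict_next("please write", tri, bi)   # no trigram match, backs off to bigrams
```

Because the count tables are sorted by frequency, the first matching entry is the most frequent continuation. Scanning all names on every lookup is slow for a large table; an indexed structure keyed by prefix would be the natural next step.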
This report provides a foundation for developing the full product.
Note: Data cleaning and modeling will be kept lightweight so that the Shiny app stays responsive and usable.
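One possible way to keep the model small enough for an app is to prune the n-gram table before deployment. The sketch below assumes a hypothetical table with one row per prefix and next-word pair; the column names, the `prune_ngrams` helper, and the thresholds are illustrative choices, not part of the analysis above.

```r
# Hypothetical n-gram table: one row per (prefix, next word) pair with its count.
ngram_table <- data.frame(
  prefix   = c("love to", "love to", "love to", "love to", "to read", "to read"),
  nextword = c("read", "write", "sing", "nap", "books", "news"),
  count    = c(30L, 12L, 2L, 1L, 9L, 1L),
  stringsAsFactors = FALSE
)

prune_ngrams <- function(tbl, min_count = 2L, top_k = 3L) {
  # Drop rare n-grams: they dominate the table size but are rarely predicted.
  tbl <- tbl[tbl$count >= min_count, , drop = FALSE]
  # Order by prefix, then by descending count within each prefix.
  tbl <- tbl[order(tbl$prefix, -tbl$count), , drop = FALSE]
  # Rank rows within each prefix and keep only the top_k continuations.
  rank_in_prefix <- ave(seq_along(tbl$prefix), tbl$prefix, FUN = seq_along)
  out <- tbl[rank_in_prefix <= top_k, , drop = FALSE]
  rownames(out) <- NULL
  out
}

pruned <- prune_ngrams(ngram_table, min_count = 2L, top_k = 2L)
print(pruned)
```

The trade-off is that a stricter `min_count` or a smaller `top_k` shrinks the table and speeds up lookups, but may lower accuracy on less common phrases, so both values would need to be checked against held-out data.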