This milestone report is part of the Johns Hopkins University Data Science Specialization Capstone Project on Coursera, in collaboration with SwiftKey.
The main objective of this capstone project is to develop a predictive text model similar to those used by SwiftKey for mobile keyboard text prediction. The goal is to transform raw text data into a usable data product that can predict the next word in a sentence.
The data for this analysis comes from three different corpora containing English text data:
| Dataset | Description |
|---|---|
| blogs | Blog posts from various websites |
| news | News articles from multiple sources |
| twitter | Twitter posts/tweets |
The data is available in three text files:

- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt
# Set a CRAN mirror so install.packages() works in non-interactive sessions
options(repos = c(CRAN = "https://cloud.r-project.org"))
# stringi is listed because stri_count_words() is used for the word counts
required_packages <- c("ggplot2", "dplyr", "tm", "SnowballC", "wordcloud", "RColorBrewer", "stringr", "tidyr", "stringi")
for (pkg in required_packages) {
  # Install the package only if it cannot be loaded, then load it
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE, repos = "https://cloud.r-project.org")
    library(pkg, character.only = TRUE, quietly = TRUE)
  }
}
cat("All required packages loaded successfully.\n")
## All required packages loaded successfully.
set.seed(123)
# Read the text files
blogs <- readLines("data/en_US.blogs.txt", warn = FALSE)
news <- readLines("data/en_US.news.txt", warn = FALSE)
twitter <- readLines("data/en_US.twitter.txt", warn = FALSE)
cat("Data loading complete. Ready for analysis.\n")
## Data loading complete. Ready for analysis.
cat("Blogs:", length(blogs), "documents\n")
## Blogs: 899288 documents
cat("News:", length(news), "documents\n")
## News: 1010206 documents
cat("Twitter:", length(twitter), "documents\n")
## Twitter: 2360148 documents
cat("=== Dataset Information ===\n\n")
## === Dataset Information ===
cat("Blogs dataset:", length(blogs), "documents\n")
## Blogs dataset: 899288 documents
cat("News dataset:", length(news), "documents\n")
## News dataset: 1010206 documents
cat("Twitter dataset:", length(twitter), "documents\n")
## Twitter dataset: 2360148 documents
cat("\nTotal documents:", length(blogs) + length(news) + length(twitter), "\n")
##
## Total documents: 4269642
set.seed(123)
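# Sample a subset for analysis.
# rbind() stacks the three samples as the rows of a 3 x 10000 character matrix,
# so length(Sample_Text) is 30000 and Sample_Text[i] cycles through
# blogs, news, twitter in turn (column-major order).
sample_size <- 10000
Sample_Text <- rbind(
  sample(blogs, min(sample_size, length(blogs))),
  sample(news, min(sample_size, length(news))),
  sample(twitter, min(sample_size, length(twitter)))
)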
# Create corpus from the text
corpus <- Corpus(VectorSource(Sample_Text))
# Text cleaning pipeline (defined here; the document-term matrix further down
# is built from the raw corpus, so stopwords still appear in its counts)
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument)
  return(corpus)
}
cat("Text preprocessing functions defined.\n")
## Text preprocessing functions defined.
cat("Sample size:", length(Sample_Text), "documents\n")
## Sample size: 30000 documents
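Because `rbind()` stacks the three samples as rows of a character matrix, the source of each line is only implicit in its row position. A minimal sketch of an alternative layout, assuming one row per sampled line with an explicit source label, is shown on toy data below. `sample_lines()` and the toy vectors are hypothetical names and are not part of the report's code.

```r
# Hypothetical helper: draw up to `n` lines from each source and keep the label.
sample_lines <- function(sources, n, seed = 123) {
  set.seed(seed)
  parts <- lapply(names(sources), function(src) {
    lines <- sources[[src]]
    data.frame(
      source = src,
      text = sample(lines, min(n, length(lines))),
      stringsAsFactors = FALSE
    )
  })
  do.call(rbind, parts)  # stack the per-source data frames row-wise
}

# Toy stand-ins for the blogs, news and twitter vectors
toy <- list(
  blogs   = c("a blog line", "another blog line", "third blog line"),
  news    = c("a news line", "another news line"),
  twitter = c("a tweet")
)

sampled <- sample_lines(toy, n = 2)
nrow(sampled)           # one row per sampled line
table(sampled$source)   # number of lines drawn from each source
```

With this layout `sampled$text` can be passed to `VectorSource()` as a plain character vector, and per-source summaries stay available through the `source` column.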
# Get file sizes
blogs_size <- file.info("data/en_US.blogs.txt")$size
news_size <- file.info("data/en_US.news.txt")$size
twitter_size <- file.info("data/en_US.twitter.txt")$size
# Convert to MB
size_mb <- c(Blogs = blogs_size/1024/1024,
             News = news_size/1024/1024,
             Twitter = twitter_size/1024/1024)
# Visualize file sizes
barplot(size_mb,
        main = "Dataset File Sizes (MB)",
        xlab = "Dataset",
        ylab = "Size (MB)",
        col = c("#2E86AB", "#A23B72", "#F18F01"),
        border = NA)
# Calculate the number of words per document
library(stringi)
blogs_length <- stri_count_words(blogs)
news_length <- stri_count_words(news)
twitter_length <- stri_count_words(twitter)
# Create summary statistics
doc_stats <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Mean_Length = c(mean(blogs_length), mean(news_length), mean(twitter_length)),
  Median_Length = c(median(blogs_length), median(news_length), median(twitter_length)),
  SD_Length = c(sd(blogs_length), sd(news_length), sd(twitter_length)),
  Min_Length = c(min(blogs_length), min(news_length), min(twitter_length)),
  Max_Length = c(max(blogs_length), max(news_length), max(twitter_length))
)
knitr::kable(doc_stats,
             caption = "Document Length Statistics by Dataset",
             digits = 2)
| Dataset | Mean_Length | Median_Length | SD_Length | Min_Length | Max_Length |
|---|---|---|---|---|---|
| Blogs | 41.75 | 28 | 46.59 | 0 | 6726 |
| News | 34.41 | 32 | 22.83 | 1 | 1796 |
| Twitter | 12.75 | 12 | 6.91 | 1 | 47 |
# Plot document length distributions
par(mfrow = c(1, 3))
boxplot(blogs_length, names = "Blogs", main = "Blogs Document Length", col = "#2E86AB", ylab = "Word Count")
boxplot(news_length, names = "News", main = "News Document Length", col = "#A23B72", ylab = "Word Count")
boxplot(twitter_length, names = "Twitter", main = "Twitter Document Length", col = "#F18F01", ylab = "Word Count")
par(mfrow = c(1, 1))
# Create Document Term Matrix
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.99)
# Calculate word frequencies
word_freq <- colSums(as.matrix(dtm))
word_freq <- sort(word_freq, decreasing = TRUE)
# Top 50 most common words
top_50_words <- head(word_freq, 50)
# Display top 20 words
cat("Top 20 Most Common Words:\n")
## Top 20 Most Common Words:
print(head(word_freq, 20))
## the and for that with was you have this but are not from
## 43574 22462 9090 9047 6441 5852 5742 4479 4335 4183 3980 3461 3341
## his they will all has about one
## 2924 2910 2709 2438 2421 2409 2313
# Plot distribution of documents across sources
source_data <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Count = c(length(blogs), length(news), length(twitter))
)
ggplot(source_data, aes(x = Source, y = Count, fill = Source)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = Count), vjust = -0.5, size = 5) +
  labs(title = "Number of Documents by Source",
       x = "Dataset",
       y = "Number of Documents") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))
### 5. Bigram Analysis (Two-Word Phrases)
# Custom function to extract n-grams using base R
extract_ngrams <- function(text, n = 2) {
  # 1. Use unlist() to turn the list from strsplit into a character vector
  words <- unlist(strsplit(text, "\\s+"))
  # 2. Clean up empty strings or NAs
  words <- words[words != "" & !is.na(words)]
  # 3. Safety check: if the line is too short, return NULL
  if (length(words) < n) return(NULL)
  # 4. Generate n-grams using a sliding window
  ngrams <- sapply(seq_len(length(words) - n + 1), function(i) {
    paste(words[i:(i + n - 1)], collapse = " ")
  })
  return(ngrams)
}
# Extract bigrams from all documents
cat("Extracting bigrams from corpus...\n")
## Extracting bigrams from corpus...
all_bigrams <- c()
for (i in seq_len(min(1000, length(Sample_Text)))) {
  bg <- extract_ngrams(Sample_Text[i], n = 2)
  if (!is.null(bg) && length(bg) > 0) {
    all_bigrams <- c(all_bigrams, bg)
  }
}
# Check if we have any bigrams
if (length(all_bigrams) == 0) {
  cat("No bigrams extracted. Check data input.\n")
  top_bigrams <- NULL
} else {
  # Calculate bigram frequencies
  bigram_table <- table(all_bigrams)
  bigram_freq <- sort(bigram_table, decreasing = TRUE)
  # Top 20 bigrams
  top_bigrams <- head(bigram_freq, 20)
  # Display top bigrams
  cat("Top 20 Most Common Bigrams:\n")
  print(top_bigrams)
  # Visualize top bigrams (only if we have data)
  if (length(top_bigrams) > 0 && all(!is.na(top_bigrams)) && all(top_bigrams > 0)) {
    barplot(top_bigrams,
            main = "Top 20 Most Common Bigrams",
            xlab = "Bigram",
            ylab = "Frequency",
            las = 2,
            cex.names = 0.7,
            col = "#2E86AB")
  } else {
    cat("No valid bigrams to plot.\n")
  }
}
## Top 20 Most Common Bigrams:
## all_bigrams
## of the in the to the on the to be for the at the is a
## 123 117 75 57 52 49 40 39
## in a and the for a of a from the that I I have I was
## 35 32 32 31 28 26 25 25
## to a will be it is as a
## 25 25 24 23
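The sliding window used above can also be written without a per-position loop, by pasting shifted copies of the token vector together. The sketch below is one possible variant under that assumption; `ngrams_vec()` is a hypothetical name and is not the function used to produce the counts in this report.

```r
# Hypothetical vectorised variant of the sliding-window extraction.
ngrams_vec <- function(text, n = 2) {
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[!is.na(words) & words != ""]
  if (length(words) < n) return(character(0))
  k <- length(words) - n + 1   # number of n-grams in this line
  # Element j holds the j-th word of every n-gram: words[j], words[j + 1], ...
  cols <- lapply(seq_len(n), function(j) words[j:(j + k - 1)])
  do.call(paste, cols)         # paste() joins the columns with a single space
}

ngrams_vec("to be or not to be", n = 2)
# "to be"  "be or"  "or not" "not to" "to be"

# Counting over several lines (toy input)
toy_lines <- c("to be or not to be", "not to be")
sort(table(unlist(lapply(toy_lines, ngrams_vec, n = 2))), decreasing = TRUE)
```

Returning `character(0)` instead of `NULL` for short lines lets the results be combined with `unlist()` without a separate `NULL` check.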
# Extract trigrams by reusing extract_ngrams() defined above, called with n = 3
# Extract trigrams from all documents
cat("Extracting trigrams from corpus...\n")
## Extracting trigrams from corpus...
all_trigrams <- c()
for (i in seq_len(min(1000, length(Sample_Text)))) {
  tg <- extract_ngrams(Sample_Text[i], n = 3)
  if (!is.null(tg) && length(tg) > 0) {
    all_trigrams <- c(all_trigrams, tg)
  }
}
# Check if we have any trigrams
if (length(all_trigrams) == 0) {
  cat("No trigrams extracted. Check data input.\n")
  top_trigrams <- NULL
} else {
  # Calculate trigram frequencies
  trigram_table <- table(all_trigrams)
  trigram_freq <- sort(trigram_table, decreasing = TRUE)
  # Top 15 trigrams
  top_trigrams <- head(trigram_freq, 15)
  # Display top trigrams
  cat("Top 15 Most Common Trigrams:\n")
  print(top_trigrams)
  # Visualize top trigrams (only if we have data)
  if (length(top_trigrams) > 0 && all(!is.na(top_trigrams)) && all(top_trigrams > 0)) {
    barplot(top_trigrams,
            main = "Top 15 Most Common Trigrams",
            xlab = "Trigram",
            ylab = "Frequency",
            las = 2,
            cex.names = 0.6,
            col = "#A23B72")
  } else {
    cat("No valid trigrams to plot.\n")
  }
}
## Top 15 Most Common Trigrams:
## all_trigrams
## a lot of as well as one of the One of the . . . in the first
## 8 7 6 6 5 5
## some of the to be in are going to end of the I need to I want to
## 5 5 4 4 4 4
## is in the is one of it to the
## 4 4 4
### 7. Word Frequency Distribution
# Plot word frequency distribution
top_words_df <- data.frame(
  Word = names(top_50_words),
  Frequency = as.numeric(top_50_words)
)
# Ensure we have valid data
if (nrow(top_words_df) > 0 && all(!is.na(top_words_df$Frequency))) {
  ggplot(top_words_df, aes(x = reorder(Word, Frequency), y = Frequency, fill = Frequency)) +
    geom_bar(stat = "identity", show.legend = FALSE) +
    coord_flip() +
    labs(title = "Top 50 Most Common Words",
         x = "Word",
         y = "Frequency") +
    theme_minimal() +
    theme(axis.text.y = element_text(size = 8))
} else {
  cat("No word frequency data available for plotting.\n")
}
| Metric | Description |
|---|---|
| Total Documents | Combined count across blogs, news, and Twitter datasets |
| Data Types | Unstructured text data from 3 different sources |
| Language | English language text |
| Content Variety | Blog posts, news articles, social media posts |
| Document Length | Highly variable (Twitter shorter, Blogs longer) |
| Vocabulary Size | Large vocabulary with significant overlap across sources |
| N-gram Patterns | Common words, phrases, and idioms identifiable |
Based on the exploratory data analysis, I will develop an n-gram language model for next-word prediction. Here’s my plan:
Steps:

- Load and clean all three corpora completely
- Apply consistent preprocessing (lowercase, remove punctuation, remove stopwords)
- Tokenize text into individual words
- Handle edge cases (special characters, numbers, URLs)
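The preprocessing steps above could be sketched in base R roughly as follows. This is a minimal illustration, not the final pipeline: `normalize_text()` and `tokenize()` are hypothetical names, the regular expressions are illustrative choices, and stopword removal is left out for brevity (it could be added by filtering tokens against `tm::stopwords("english")`).

```r
# Hypothetical normaliser covering lowercasing, URLs, numbers and punctuation.
normalize_text <- function(x) {
  x <- tolower(x)
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)  # drop URLs
  x <- gsub("[0-9]+", " ", x)        # drop digits
  x <- gsub("[^a-z' ]+", " ", x)     # keep only letters, ASCII apostrophes, spaces
  x <- gsub("\\s+", " ", x)          # collapse runs of whitespace
  trimws(x)
}

# Split each normalised line into word tokens (returns a list, one entry per line)
tokenize <- function(x) strsplit(normalize_text(x), " ", fixed = TRUE)

normalize_text("Check https://example.com NOW!!  It's 100% free...")
# "check now it's free"

tokenize("Hello,   World")[[1]]
# "hello" "world"
```

Note that this sketch treats typographic apostrophes (such as the one in "Here’s") as punctuation, so contractions written that way would be split; mapping them to `'` first is one way to handle that edge case.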
| N-gram Type | Purpose |
|---|---|
| Unigrams | Base word frequency probabilities |
| Bigrams | Two-word sequence predictions |
| Trigrams | Three-word sequence predictions |
Strategy:

- Build separate models for each n-gram level (1-4)
- Use Kneser-Ney smoothing to handle unseen n-grams
- Implement backoff strategy for rare/unseen sequences
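To make the backoff idea concrete, here is a small sketch on toy data. It deliberately uses plain relative frequencies with a fixed penalty per backoff step (0.4, the value commonly associated with the "stupid backoff" scheme) rather than Kneser-Ney smoothing, which needs more bookkeeping. `build_counts()`, `rank_hits()`, `predict_next()` and the `alpha` parameter are hypothetical names chosen for this example.

```r
# Count all n-grams of order 1..max_n in a list of token vectors.
build_counts <- function(token_list, max_n = 3) {
  lapply(seq_len(max_n), function(n) {
    grams <- unlist(lapply(token_list, function(w) {
      if (length(w) < n) return(character(0))
      k <- length(w) - n + 1
      do.call(paste, lapply(seq_len(n), function(j) w[j:(j + k - 1)]))
    }))
    table(grams)
  })
}

# Turn matching counts into a ranked data frame of candidate words and scores.
rank_hits <- function(words, n, penalty, top) {
  out <- data.frame(word = words,
                    score = penalty * as.numeric(n) / sum(n),
                    stringsAsFactors = FALSE)
  head(out[order(-out$score, out$word), ], top)
}

# Try the longest available context first; drop the oldest word when nothing matches.
predict_next <- function(counts, context, top = 3, alpha = 0.4) {
  context <- tail(context, length(counts) - 1)  # at most max_n - 1 context words
  penalty <- 1
  while (length(context) > 0) {
    prefix <- paste0(paste(context, collapse = " "), " ")
    tab <- counts[[length(context) + 1]]
    nm <- as.character(names(tab))
    keep <- startsWith(nm, prefix)
    if (any(keep)) {
      return(rank_hits(substring(nm[keep], nchar(prefix) + 1), tab[keep], penalty, top))
    }
    context <- context[-1]       # back off to a shorter context
    penalty <- penalty * alpha   # and discount the resulting scores
  }
  uni <- counts[[1]]
  rank_hits(as.character(names(uni)), uni, penalty, top)  # unigram fallback
}

toy_tokens <- list(
  c("i", "want", "to", "go"),
  c("i", "want", "to", "eat"),
  c("i", "want", "a", "nap"),
  c("you", "want", "to", "go")
)
toy_counts <- build_counts(toy_tokens, max_n = 3)

predict_next(toy_counts, c("i", "want"))     # trigram context matches directly
predict_next(toy_counts, c("they", "want"))  # unseen trigram context, backs off to bigrams
```

The scores are relative frequencies scaled by the backoff penalty, so they are useful for ranking candidates but are not normalised probabilities.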
Metrics to evaluate:

- Perplexity: Lower is better (measures prediction uncertainty)
- Accuracy: Percentage of correct next-word predictions
- Holdout validation: Test on unseen data (20% split)
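As a rough illustration of these metrics, the sketch below holds out 20% of a toy set of lines, fits a unigram model with add-one smoothing on the rest, and computes perplexity as `exp(-mean(log p))` over the held-out words. The unigram model is only a stand-in for the planned n-gram model, and `perplexity()`, `top1_accuracy()`, `unigram_prob()` and the toy lines are hypothetical.

```r
# Evaluation helpers (hypothetical names).
# p: probability the model assigned to each held-out word
perplexity <- function(p) exp(-mean(log(p)))
# Share of positions where the top prediction equals the actual next word
top1_accuracy <- function(predicted, actual) mean(predicted == actual)

set.seed(123)
toy_lines <- c("the cat sat", "the dog sat", "a cat ran", "the dog ran", "a dog sat",
               "the cat ran", "a cat sat", "the dog barked", "a dog ran", "the cat slept")

# 80/20 holdout split by line
test_idx <- sample(length(toy_lines), size = round(0.2 * length(toy_lines)))
train <- unlist(strsplit(toy_lines[-test_idx], " ", fixed = TRUE))
test  <- unlist(strsplit(toy_lines[test_idx], " ", fixed = TRUE))

# Unigram model with add-one smoothing; one extra vocabulary slot covers unseen words
vocab  <- unique(train)
counts <- as.numeric(table(factor(train, levels = vocab)))
unigram_prob <- function(w) {
  n <- counts[match(w, vocab)]
  n[is.na(n)] <- 0                      # words never seen in training
  (n + 1) / (length(train) + length(vocab) + 1)
}

perplexity(unigram_prob(test))

# Baseline predictor: always guess the most frequent training word
baseline <- rep(vocab[which.max(counts)], length(test))
top1_accuracy(baseline, test)
```

A model that spreads probability evenly over four words has perplexity 4, which gives an intuitive reading of the number: it is the effective count of equally likely choices the model faces at each position.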
Deliverable: An R function/predictor that:

- Takes user input (previous words) as context
- Returns ranked list of predicted next words
- Provides confidence scores for predictions
- Works in real-time for keyboard integration
Tools and Packages:

- tm for text processing
- Base R (strsplit, paste, table) for n-gram extraction
- Custom implementation for n-gram models
- ggplot2 for visualization
- dplyr for data manipulation
This exploratory data analysis has revealed the key characteristics of the SwiftKey text prediction dataset: