Introduction

This milestone report is part of the Johns Hopkins University Data Science Specialization Capstone Project on Coursera, in collaboration with SwiftKey.

The main objective of this capstone project is to develop a predictive text model similar to those used by SwiftKey for mobile keyboard text prediction. The goal is to transform raw text data into a usable data product that can predict the next word in a sentence.

Data Overview

Data Source

The data for this analysis comes from three different corpora containing English text data:

| Dataset | Description |
|---------|-------------|
| blogs   | Blog posts from various websites |
| news    | News articles from multiple sources |
| twitter | Twitter posts/tweets |

The data is available in three text files:

- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

Exploratory Data Analysis

Loading Required Libraries

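# Set a CRAN mirror so install.packages() also works in non-interactive sessions (e.g. when knitting)
options(repos = c(CRAN = "https://cloud.r-project.org"))

# stringi and knitr are used later in the report (stri_count_words, kable), so they are listed here too
required_packages <- c("ggplot2", "dplyr", "tm", "SnowballC", "wordcloud", "RColorBrewer", "stringr", "tidyr", "stringi", "knitr")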

for (pkg in required_packages) {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE, repos = "https://cloud.r-project.org")
    library(pkg, character.only = TRUE, quietly = TRUE)
  }
}

cat("All required packages loaded successfully.\n")
## All required packages loaded successfully.

Loading and Sampling the Data

set.seed(123)

# Read the text files
blogs <- readLines("data/en_US.blogs.txt", warn = FALSE)
news <- readLines("data/en_US.news.txt", warn = FALSE)
twitter <- readLines("data/en_US.twitter.txt", warn = FALSE)

cat("Data loading complete. Ready for analysis.\n")
## Data loading complete. Ready for analysis.
cat("Blogs:", length(blogs), "documents\n")
## Blogs: 899288 documents
cat("News:", length(news), "documents\n")
## News: 1010206 documents
cat("Twitter:", length(twitter), "documents\n")
## Twitter: 2360148 documents

Data Size and Structure

cat("=== Dataset Information ===\n\n")
## === Dataset Information ===
cat("Blogs dataset:", length(blogs), "documents\n")
## Blogs dataset: 899288 documents
cat("News dataset:", length(news), "documents\n")
## News dataset: 1010206 documents
cat("Twitter dataset:", length(twitter), "documents\n")
## Twitter dataset: 2360148 documents
cat("\nTotal documents:", length(blogs) + length(news) + length(twitter), "\n")
## 
## Total documents: 4269642

Text Preprocessing and Cleaning

set.seed(123)

# Sample a subset for analysis
sample_size <- 10000

Sample_Text <- rbind(
  sample(blogs, min(sample_size, length(blogs))),
  sample(news, min(sample_size, length(news))),
  sample(twitter, min(sample_size, length(twitter)))
)

# Create corpus from the text
corpus <- Corpus(VectorSource(Sample_Text))

# Text cleaning pipeline
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument)
  return(corpus)
}

cat("Text preprocessing functions defined.\n")
## Text preprocessing functions defined.
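# rbind() above stacks the three samples into a 3 x 10000 character matrix,
# so nrow() would report 3; length() counts every sampled document
cat("Sample size:", length(Sample_Text), "documents\n")
## Sample size: 30000 documents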

Major Features of the Data

1. File Size Analysis

# Get file sizes
blogs_size <- file.info("data/en_US.blogs.txt")$size
news_size <- file.info("data/en_US.news.txt")$size
twitter_size <- file.info("data/en_US.twitter.txt")$size

# Convert to MB
size_mb <- c(Blogs = blogs_size/1024/1024, 
             News = news_size/1024/1024, 
             Twitter = twitter_size/1024/1024)

# Visualize file sizes
barplot(size_mb, 
        main = "Dataset File Sizes (MB)",
        xlab = "Dataset",
        ylab = "Size (MB)",
        col = c("#2E86AB", "#A23B72", "#F18F01"),
        border = NA)

2. Document Length Distribution

# Count the words in each document
library(stringi)
blogs_length <- stri_count_words(blogs)
news_length <- stri_count_words(news)
twitter_length <- stri_count_words(twitter)

# Create summary statistics
doc_stats <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Mean_Length = c(mean(blogs_length), mean(news_length), mean(twitter_length)),
  Median_Length = c(median(blogs_length), median(news_length), median(twitter_length)),
  SD_Length = c(sd(blogs_length), sd(news_length), sd(twitter_length)),
  Min_Length = c(min(blogs_length), min(news_length), min(twitter_length)),
  Max_Length = c(max(blogs_length), max(news_length), max(twitter_length))
)

knitr::kable(doc_stats, 
             caption = "Document Length Statistics by Dataset",
             digits = 2)
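Document Length Statistics by Dataset

| Dataset | Mean_Length | Median_Length | SD_Length | Min_Length | Max_Length |
|---------|-------------|---------------|-----------|------------|------------|
| Blogs   | 41.75       | 28            | 46.59     | 0          | 6726       |
| News    | 34.41       | 32            | 22.83     | 1          | 1796       |
| Twitter | 12.75       | 12            | 6.91      | 1          | 47         |

# Plot document length distributions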
par(mfrow = c(1, 3))
boxplot(blogs_length, names = "Blogs", main = "Blogs Document Length", col = "#2E86AB", ylab = "Word Count")
boxplot(news_length, names = "News", main = "News Document Length", col = "#A23B72", ylab = "Word Count")
boxplot(twitter_length, names = "Twitter", main = "Twitter Document Length", col = "#F18F01", ylab = "Word Count")

par(mfrow = c(1, 1))

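3. Most Common Words (Unigrams)

# Create Document Term Matrix from the raw corpus (clean_corpus() is defined above but not applied here).
# By default DocumentTermMatrix() lowercases terms and drops words shorter than 3 characters,
# which is why stopwords such as "the" and "and" still dominate the counts below.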
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.99)

# Calculate word frequencies
word_freq <- colSums(as.matrix(dtm))
word_freq <- sort(word_freq, decreasing = TRUE)

# Top 50 most common words
top_50_words <- head(word_freq, 50)

# Display top 20 words
cat("Top 20 Most Common Words:\n")
## Top 20 Most Common Words:
print(head(word_freq, 20))
##   the   and   for  that  with   was   you  have  this   but   are   not  from 
## 43574 22462  9090  9047  6441  5852  5742  4479  4335  4183  3980  3461  3341 
##   his  they  will   all   has about   one 
##  2924  2910  2709  2438  2421  2409  2313

4. Data Distribution by Source

# Plot distribution of documents across sources
source_data <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Count = c(length(blogs), length(news), length(twitter))
)

ggplot(source_data, aes(x = Source, y = Count, fill = Source)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = Count), vjust = -0.5, size = 5) +
  labs(title = "Number of Documents by Source",
       x = "Dataset",
       y = "Number of Documents") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

5. Bigram Analysis (Two-Word Phrases)

# Custom function to extract n-grams using base R
extract_ngrams <- function(text, n = 2) {
  # 1. Use unlist() to turn the list from strsplit into a character vector
  words <- unlist(strsplit(text, "\\s+"))
  
  # 2. Clean up empty strings or NAs
  words <- words[words != "" & !is.na(words)]
  
  # 3. Safety check: if the line is too short, return NULL
  if (length(words) < n) return(NULL)
  
  # 4. Generate n-grams using a sliding window
  ngrams <- sapply(1:(length(words) - n + 1), function(i) {
    paste(words[i:(i + n - 1)], collapse = " ")
  })
  
  return(ngrams)
}

# Extract bigrams from the first 1,000 sampled documents (raw text: case and punctuation are kept)
cat("Extracting bigrams from corpus...\n")
## Extracting bigrams from corpus...
all_bigrams <- c()
for (i in 1:min(1000, length(Sample_Text))) {
  bg <- extract_ngrams(Sample_Text[i], n = 2)
  if (!is.null(bg) && length(bg) > 0) {
    all_bigrams <- c(all_bigrams, bg)
  }
}

# Check if we have any bigrams
if (length(all_bigrams) == 0) {
  cat("No bigrams extracted. Check data input.\n")
  top_bigrams <- NULL
} else {
  # Calculate bigram frequencies
  bigram_table <- table(all_bigrams)
  bigram_freq <- sort(bigram_table, decreasing = TRUE)

  # Top 20 bigrams
  top_bigrams <- head(bigram_freq, 20)

  # Display top bigrams
  cat("Top 20 Most Common Bigrams:\n")
  print(top_bigrams)
  
  # Visualize top bigrams (only if we have data)
  if (length(top_bigrams) > 0 && all(!is.na(top_bigrams)) && all(top_bigrams > 0)) {
    barplot(top_bigrams, 
            main = "Top 20 Most Common Bigrams",
            xlab = "Bigram",
            ylab = "Frequency",
            las = 2,
            cex.names = 0.7,
            col = "#2E86AB")
  } else {
    cat("No valid bigrams to plot.\n")
  }
}
## Top 20 Most Common Bigrams:
## all_bigrams
##   of the   in the   to the   on the    to be  for the   at the     is a 
##      123      117       75       57       52       49       40       39 
##     in a  and the    for a     of a from the   that I   I have    I was 
##       35       32       32       31       28       26       25       25 
##     to a  will be    it is     as a 
##       25       25       24       23
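
The sliding-window step in extract_ngrams() can also be written without a per-position sapply() call, which may help when the extraction is scaled beyond 1,000 documents. The following is only a sketch: `ngrams_vec` is a hypothetical name that does not appear in the report, and it returns an empty character vector rather than NULL for short inputs.

```r
# Hypothetical alternative to extract_ngrams(): build all n-grams of one
# document at once by pasting n shifted copies of the word vector.
ngrams_vec <- function(text, n = 2) {
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[nzchar(words)]                  # drop empty strings
  if (length(words) < n) return(character(0))    # too short for any n-gram
  k <- length(words) - n + 1                     # number of n-grams
  shifted <- lapply(seq_len(n), function(j) words[j:(j + k - 1)])
  do.call(paste, shifted)                        # paste() joins with " " by default
}

ngrams_vec("the quick brown fox", n = 2)
```

Because paste() is vectorized, each document needs only n subsetting operations instead of one function call per word position.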

6. Trigram Analysis (Three-Word Phrases)

# Reuse extract_ngrams() from the bigram analysis; n = 3 is passed explicitly below
# Extract trigrams from the first 1,000 sampled documents (raw text: case and punctuation are kept)
cat("Extracting trigrams from corpus...\n")
## Extracting trigrams from corpus...
all_trigrams <- c()
for (i in 1:min(1000, length(Sample_Text))) {
  tg <- extract_ngrams(Sample_Text[i], n = 3)
  if (!is.null(tg) && length(tg) > 0) {
    all_trigrams <- c(all_trigrams, tg)
  }
}

# Check if we have any trigrams
if (length(all_trigrams) == 0) {
  cat("No trigrams extracted. Check data input.\n")
  top_trigrams <- NULL
} else {
  # Calculate trigram frequencies
  trigram_table <- table(all_trigrams)
  trigram_freq <- sort(trigram_table, decreasing = TRUE)

  # Top 15 trigrams
  top_trigrams <- head(trigram_freq, 15)

  # Display top trigrams
  cat("Top 15 Most Common Trigrams:\n")
  print(top_trigrams)
  
  # Visualize top trigrams (only if we have data)
  if (length(top_trigrams) > 0 && all(!is.na(top_trigrams)) && all(top_trigrams > 0)) {
    barplot(top_trigrams, 
            main = "Top 15 Most Common Trigrams",
            xlab = "Trigram",
            ylab = "Frequency",
            las = 2,
            cex.names = 0.6,
            col = "#A23B72")
  } else {
    cat("No valid trigrams to plot.\n")
  }
}
## Top 15 Most Common Trigrams:
## all_trigrams
##     a lot of   as well as   one of the   One of the        . . . in the first 
##            8            7            6            6            5            5 
##  some of the     to be in are going to   end of the    I need to    I want to 
##            5            5            4            4            4            4 
##    is in the    is one of    it to the 
##            4            4            4

7. Word Frequency Distribution

# Plot word frequency distribution
top_words_df <- data.frame(
  Word = names(top_50_words),
  Frequency = as.numeric(top_50_words)
)

# Ensure we have valid data
if (nrow(top_words_df) > 0 && all(!is.na(top_words_df$Frequency))) {
  ggplot(top_words_df, aes(x = reorder(Word, Frequency), y = Frequency, fill = Frequency)) +
    geom_bar(stat = "identity", show.legend = FALSE) +
    coord_flip() +
    labs(title = "Top 50 Most Common Words",
         x = "Word",
         y = "Frequency") +
    theme_minimal() +
    theme(axis.text.y = element_text(size = 8))
} else {
  cat("No word frequency data available for plotting.\n")
}

Summary Statistics

Key Findings from EDA

| Metric | Description |
|--------|-------------|
| Total Documents | Combined count across blogs, news, and Twitter datasets |
| Data Types | Unstructured text data from 3 different sources |
| Language | English language text |
| Content Variety | Blog posts, news articles, social media posts |
| Document Length | Highly variable (Twitter shorter, Blogs longer) |
| Vocabulary Size | Large vocabulary with significant overlap across sources |
| N-gram Patterns | Common words, phrases, and idioms identifiable |

Data Characteristics

  1. Heterogeneous Sources: The data combines formal (news) and informal (Twitter, blogs) writing styles
  2. Variable Length: Document lengths vary significantly by source type
  3. Rich Vocabulary: Large vocabulary with domain-specific terminology
  4. Common Patterns: Frequent n-grams that can be leveraged for prediction

Plan for Creating the Text Predictive Model

Approach Overview

Based on the exploratory data analysis, I will develop an n-gram language model for next-word prediction. Here’s my plan:

1. Data Preparation

Steps:

- Load and clean all three corpora completely
- Apply consistent preprocessing (lowercase, remove punctuation, remove stopwords)
- Tokenize text into individual words
- Handle edge cases (special characters, numbers, URLs)
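
The preprocessing steps above could look roughly like the following base R sketch. The helper names `clean_text` and `tokenize` are hypothetical and not part of the report. The sketch assumes ASCII English text, so accented letters and curly apostrophes are treated as separators, and stopword removal is left out for brevity.

```r
# Hypothetical helper: normalise raw lines before tokenising.
clean_text <- function(x) {
  x <- tolower(x)
  x <- gsub("(https?://|www\\.)\\S+", " ", x, perl = TRUE)  # drop URLs
  x <- gsub("[0-9]+", " ", x)                                 # drop digits
  x <- gsub("[^a-z' ]", " ", x)                               # keep letters, apostrophes, spaces
  x <- gsub("\\s+", " ", x)                                   # collapse runs of whitespace
  trimws(x)
}

# Hypothetical helper: split each cleaned line into a vector of words.
tokenize <- function(x) {
  tokens <- strsplit(clean_text(x), " ", fixed = TRUE)
  lapply(tokens, function(w) w[nzchar(w)])
}

tokenize("Check https://example.com NOW, it's 100% free!")
```

Keeping apostrophes preserves contractions such as "it's" as single tokens, which is one possible choice for a keyboard-style predictor.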

2. N-gram Model Development

| N-gram Type | Purpose |
|-------------|---------|
| Unigrams | Base word frequency probabilities |
| Bigrams | Two-word sequence predictions |
| Trigrams | Three-word sequence predictions |

Strategy:

- Build separate models for each n-gram level (1-4)
- Use Kneser-Ney smoothing to handle unseen n-grams
- Implement backoff strategy for rare/unseen sequences
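
To make the backoff idea concrete, here is a minimal sketch on a toy corpus. It does not implement Kneser-Ney smoothing. Instead it uses a simplified scheme in the spirit of "stupid backoff": take the longest context that has any observed continuation, and multiply the score by a fixed penalty (`alpha`, assumed to be 0.4 here) each time the context is shortened. All names (`toy_lines`, `count_ngrams`, `predict_next`) are hypothetical.

```r
# Toy corpus standing in for the cleaned training text (hypothetical).
toy_lines <- c("i want to go home", "i want to eat", "i need to go")

# Count every n-gram of order n; returns a named numeric vector.
count_ngrams <- function(lines, n) {
  grams <- unlist(lapply(strsplit(lines, " ", fixed = TRUE), function(w) {
    if (length(w) < n) return(character(0))
    k <- length(w) - n + 1
    do.call(paste, lapply(seq_len(n), function(j) w[j:(j + k - 1)]))
  }))
  tab <- table(grams)
  setNames(as.numeric(tab), names(tab))
}

# counts[[1]] = unigrams, counts[[2]] = bigrams, counts[[3]] = trigrams
counts <- lapply(1:3, function(n) count_ngrams(toy_lines, n))

# Score candidates from the longest context with observed continuations,
# multiplying by alpha for every backoff step taken.
predict_next <- function(context, counts, alpha = 0.4, top = 3) {
  words <- strsplit(tolower(trimws(context)), "\\s+")[[1]]
  start <- min(length(counts), length(words) + 1)  # longest usable order
  penalty <- 1
  for (n in start:1) {
    tab <- counts[[n]]
    if (n > 1) {
      prefix <- paste0(paste(tail(words, n - 1), collapse = " "), " ")
      tab <- tab[startsWith(names(tab), prefix)]
      names(tab) <- substring(names(tab), nchar(prefix) + 1)  # keep last word only
    }
    if (length(tab) > 0) {
      scores <- sort(penalty * tab / sum(tab), decreasing = TRUE)
      return(head(scores, top))
    }
    penalty <- penalty * alpha
  }
  numeric(0)
}

predict_next("to", counts)
predict_next("banana split", counts)
```

In this toy corpus "to" is followed by "go" twice and "eat" once, so the first call ranks "go" above "eat". The second call finds no matching trigram or bigram context and falls back to penalised unigram frequencies.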

3. Model Evaluation

Metrics to evaluate:

- Perplexity: Lower is better (measures prediction uncertainty)
- Accuracy: Percentage of correct next-word predictions
- Holdout validation: Test on unseen data (20% split)
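
The three metrics can be illustrated on a toy example. This is a sketch under stated assumptions rather than the planned evaluation: it uses a bigram model with add-one (Laplace) smoothing instead of Kneser-Ney, a closed vocabulary drawn from all lines, and hypothetical names throughout (`toy_lines`, `bigram_pairs`, `bigram_prob`, `top1`).

```r
set.seed(123)
# Toy data standing in for the cleaned corpus (hypothetical).
toy_lines <- c("i want to go home", "i want to eat", "i need to go",
               "we want to go", "we need to eat")

# 80/20 holdout split by line
test_idx <- sample(length(toy_lines), size = ceiling(0.2 * length(toy_lines)))
train <- toy_lines[-test_idx]
test  <- toy_lines[test_idx]

# Turn lines into a data frame of (previous word, next word) pairs
bigram_pairs <- function(x) {
  do.call(rbind, lapply(strsplit(x, " ", fixed = TRUE), function(w) {
    if (length(w) < 2) return(NULL)
    data.frame(prev = head(w, -1), nxt = tail(w, -1), stringsAsFactors = FALSE)
  }))
}

train_pairs <- bigram_pairs(train)
test_pairs  <- bigram_pairs(test)
# Assumption: closed vocabulary, so every test word has a smoothed probability
vocab <- unique(unlist(strsplit(toy_lines, " ", fixed = TRUE)))

# Add-one smoothed bigram probability P(nxt | prev), estimated on train only
bigram_prob <- function(prev, nxt) {
  joint <- sum(train_pairs$prev == prev & train_pairs$nxt == nxt)
  ctx   <- sum(train_pairs$prev == prev)
  (joint + 1) / (ctx + length(vocab))
}

# Most likely next word under the same model
top1 <- function(prev) {
  p <- vapply(vocab, function(w) bigram_prob(prev, w), numeric(1))
  names(which.max(p))
}

# Perplexity: exponential of the average negative log probability
probs <- mapply(bigram_prob, test_pairs$prev, test_pairs$nxt)
perplexity <- exp(-mean(log(probs)))

# Accuracy: share of test positions where the top prediction is correct
accuracy <- mean(vapply(test_pairs$prev, top1, character(1)) == test_pairs$nxt)

cat("Perplexity:", perplexity, "\n")
cat("Top-1 accuracy:", accuracy, "\n")
```

Perplexity here is computed per predicted word, so the first word of each test line is not scored.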

4. Final Data Product

Deliverable: An R function/predictor that:

- Takes user input (previous words) as context
- Returns ranked list of predicted next words
- Provides confidence scores for predictions
- Works in real-time for keyboard integration

5. Technical Implementation

Tools and Packages:

- tm for text processing
- Base R (strsplit, paste, table) for n-gram extraction
- Custom implementation for n-gram models
- ggplot2 for visualization
- dplyr for data manipulation

Conclusion

This exploratory data analysis has revealed the key characteristics of the SwiftKey text prediction dataset:

  1. Rich, diverse text data from three complementary sources
  2. Clear patterns in word usage and n-gram frequencies
  3. A structure well suited to an n-gram language modeling approach