Coursera Data Science Capstone

Introduction

This milestone report is part of the Johns Hopkins University Data Science Specialization Capstone Project on Coursera, in collaboration with SwiftKey.

The main objective of this capstone project is to develop a predictive text model similar to those used by SwiftKey for mobile keyboard text prediction. The goal is to transform raw text data into a usable data product that can predict the next word in a sentence.

Data Overview

Data Source

The data for this analysis comes from three different corpora containing English text data:

Dataset	Description
blogs	Blog posts from various websites
news	News articles from multiple sources
twitter	Twitter posts/tweets

The data is available in three text files:
- en_US.blogs.txt
- en_US.news.txt
- en_US.twitter.txt

Exploratory Data Analysis

Loading Required Libraries

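# Set a CRAN mirror so install.packages() works in non-interactive sessions
options(repos = c(CRAN = "https://cloud.r-project.org"))

# stringi and knitr are included because later chunks call stri_count_words() and knitr::kable()
required_packages <- c("ggplot2", "dplyr", "tm", "SnowballC", "wordcloud", "RColorBrewer", "stringr", "tidyr", "stringi", "knitr")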

for (pkg in required_packages) {
  if (!require(pkg, character.only = TRUE, quietly = TRUE)) {
    install.packages(pkg, dependencies = TRUE, repos = "https://cloud.r-project.org")
    library(pkg, character.only = TRUE, quietly = TRUE)
  }
}

cat("All required packages loaded successfully.\n")

## All required packages loaded successfully.

Loading and Sampling the Data

set.seed(123)

# Read the text files
blogs <- readLines("data/en_US.blogs.txt", warn = FALSE)
news <- readLines("data/en_US.news.txt", warn = FALSE)
twitter <- readLines("data/en_US.twitter.txt", warn = FALSE)

cat("Data loading complete. Ready for analysis.\n")

## Data loading complete. Ready for analysis.

cat("Blogs:", length(blogs), "documents\n")

## Blogs: 899288 documents

cat("News:", length(news), "documents\n")

## News: 1010206 documents

cat("Twitter:", length(twitter), "documents\n")

## Twitter: 2360148 documents

Data Size and Structure

cat("=== Dataset Information ===\n\n")

## === Dataset Information ===

cat("Blogs dataset:", length(blogs), "documents\n")

## Blogs dataset: 899288 documents

cat("News dataset:", length(news), "documents\n")

## News dataset: 1010206 documents

cat("Twitter dataset:", length(twitter), "documents\n")

## Twitter dataset: 2360148 documents

cat("\nTotal documents:", length(blogs) + length(news) + length(twitter), "\n")

## 
## Total documents: 4269642

Text Preprocessing and Cleaning

set.seed(123)

# Sample a subset for analysis
sample_size <- 10000

# rbind() stacks the three samples into a 3 x 10000 character matrix;
# as.vector() flattens it column by column into a plain character vector
# that interleaves blogs, news, and Twitter lines.
Sample_Text <- as.vector(rbind(
  sample(blogs, min(sample_size, length(blogs))),
  sample(news, min(sample_size, length(news))),
  sample(twitter, min(sample_size, length(twitter)))
))

# Create corpus from the text
corpus <- Corpus(VectorSource(Sample_Text))

# Text cleaning pipeline
clean_corpus <- function(corpus) {
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, stemDocument)
  return(corpus)
}

cat("Text preprocessing functions defined.\n")

## Text preprocessing functions defined.

# Count sampled lines with length(); nrow() would report the 3 rows of a matrix
cat("Sample size:", length(Sample_Text), "documents\n")

## Sample size: 30000 documents

Major Features of the Data

1. File Size Analysis

# Get file sizes
blogs_size <- file.info("data/en_US.blogs.txt")$size
news_size <- file.info("data/en_US.news.txt")$size
twitter_size <- file.info("data/en_US.twitter.txt")$size

# Convert to MB
size_mb <- c(Blogs = blogs_size/1024/1024, 
             News = news_size/1024/1024, 
             Twitter = twitter_size/1024/1024)

# Visualize file sizes
barplot(size_mb, 
        main = "Dataset File Sizes (MB)",
        xlab = "Dataset",
        ylab = "Size (MB)",
        col = c("#2E86AB", "#A23B72", "#F18F01"),
        border = NA)

2. Document Length Distribution

# Count the number of words in each document
library(stringi)
blogs_length <- stri_count_words(blogs)
news_length <- stri_count_words(news)
twitter_length <- stri_count_words(twitter)

# Create summary statistics
doc_stats <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Mean_Length = c(mean(blogs_length), mean(news_length), mean(twitter_length)),
  Median_Length = c(median(blogs_length), median(news_length), median(twitter_length)),
  SD_Length = c(sd(blogs_length), sd(news_length), sd(twitter_length)),
  Min_Length = c(min(blogs_length), min(news_length), min(twitter_length)),
  Max_Length = c(max(blogs_length), max(news_length), max(twitter_length))
)

knitr::kable(doc_stats, 
             caption = "Document Length Statistics by Dataset",
             digits = 2)

Document Length Statistics by Dataset
Dataset	Mean_Length	Median_Length	SD_Length	Min_Length	Max_Length
Blogs	41.75	28	46.59	0	6726
News	34.41	32	22.83	1	1796
Twitter	12.75	12	6.91	1	47

# Plot document length distribution
par(mfrow = c(1, 3))
boxplot(blogs_length, names = "Blogs", main = "Blogs Document Length", col = "#2E86AB", ylab = "Word Count")
boxplot(news_length, names = "News", main = "News Document Length", col = "#A23B72", ylab = "Word Count")
boxplot(twitter_length, names = "Twitter", main = "Twitter Document Length", col = "#F18F01", ylab = "Word Count")

par(mfrow = c(1, 1))

3. Most Common Words (Unigrams)

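# Create Document Term Matrix from the raw corpus. clean_corpus() is not
# applied here, so stopwords such as "the" and "and" remain in the counts below.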
dtm <- DocumentTermMatrix(corpus)
dtm <- removeSparseTerms(dtm, 0.99)

# Calculate word frequencies
word_freq <- colSums(as.matrix(dtm))
word_freq <- sort(word_freq, decreasing = TRUE)

# Top 50 most common words
top_50_words <- head(word_freq, 50)

# Display top 20 words
cat("Top 20 Most Common Words:\n")

## Top 20 Most Common Words:

print(head(word_freq, 20))

##   the   and   for  that  with   was   you  have  this   but   are   not  from 
## 43574 22462  9090  9047  6441  5852  5742  4479  4335  4183  3980  3461  3341 
##   his  they  will   all   has about   one 
##  2924  2910  2709  2438  2421  2409  2313
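One way to read the frequency table above is to ask how many distinct words are needed to cover a given share of all word occurrences. The sketch below assumes a named frequency vector shaped like `word_freq`; the `words_for_coverage` helper and the toy counts are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: smallest number of top-ranked words whose counts
# add up to at least `target` (a fraction between 0 and 1) of all occurrences.
words_for_coverage <- function(freq, target = 0.5) {
  freq <- sort(freq, decreasing = TRUE)
  cum_share <- cumsum(freq) / sum(freq)
  unname(which(cum_share >= target)[1])
}

# Toy counts, made up for illustration; substitute word_freq from above
toy_freq <- c(the = 50, and = 25, with = 15, that = 7, from = 3)

words_for_coverage(toy_freq, target = 0.5)
words_for_coverage(toy_freq, target = 0.8)
```

Applied to the full vocabulary rather than the sparsity-filtered matrix, the same calculation would indicate how small a dictionary the prediction model can use before coverage drops noticeably.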

4. Data Distribution by Source

# Plot distribution of documents across sources
source_data <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Count = c(length(blogs), length(news), length(twitter))
)

ggplot(source_data, aes(x = Source, y = Count, fill = Source)) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  geom_text(aes(label = Count), vjust = -0.5, size = 5) +
  labs(title = "Number of Documents by Source",
       x = "Dataset",
       y = "Number of Documents") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

5. Bigram Analysis (Two-Word Phrases)

# Custom function to extract n-grams using base R
extract_ngrams <- function(text, n = 2) {
  # 1. Use unlist() to turn the list from strsplit into a character vector
  words <- unlist(strsplit(text, "\\s+"))
  
  # 2. Clean up empty strings or NAs
  words <- words[words != "" & !is.na(words)]
  
  # 3. Safety check: if the line is too short, return NULL
  if (length(words) < n) return(NULL)
  
  # 4. Generate n-grams using a sliding window
  ngrams <- sapply(1:(length(words) - n + 1), function(i) {
    paste(words[i:(i + n - 1)], collapse = " ")
  })
  
  return(ngrams)
}

# Extract bigrams from the first 1,000 sampled documents (raw text, case preserved)
cat("Extracting bigrams from corpus...\n")

## Extracting bigrams from corpus...

all_bigrams <- c()
for (i in 1:min(1000, length(Sample_Text))) {
  bg <- extract_ngrams(Sample_Text[i], n = 2)
  if (!is.null(bg) && length(bg) > 0) {
    all_bigrams <- c(all_bigrams, bg)
  }
}

# Check if we have any bigrams
if (length(all_bigrams) == 0) {
  cat("No bigrams extracted. Check data input.\n")
  top_bigrams <- NULL
} else {
  # Calculate bigram frequencies
  bigram_table <- table(all_bigrams)
  bigram_freq <- sort(bigram_table, decreasing = TRUE)

  # Top 20 bigrams
  top_bigrams <- head(bigram_freq, 20)

  # Display top bigrams
  cat("Top 20 Most Common Bigrams:\n")
  print(top_bigrams)
  
  # Visualize top bigrams (only if we have data)
  if (length(top_bigrams) > 0 && all(!is.na(top_bigrams)) && all(top_bigrams > 0)) {
    barplot(top_bigrams, 
            main = "Top 20 Most Common Bigrams",
            xlab = "Bigram",
            ylab = "Frequency",
            las = 2,
            cex.names = 0.7,
            col = "#2E86AB")
  } else {
    cat("No valid bigrams to plot.\n")
  }
}

## Top 20 Most Common Bigrams:
## all_bigrams
##   of the   in the   to the   on the    to be  for the   at the     is a 
##      123      117       75       57       52       49       40       39 
##     in a  and the    for a     of a from the   that I   I have    I was 
##       35       32       32       31       28       26       25       25 
##     to a  will be    it is     as a 
##       25       25       24       23
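The loop above grows `all_bigrams` one document at a time, which becomes slow as the sample grows. A possible alternative, sketched here under the assumption that each document is a single character string, builds the n-grams from shifted copies of the word vector and collects them with `lapply()`. The names `ngrams_vectorized` and `collect_ngrams` are hypothetical.

```r
# Hypothetical vectorized alternative to the sliding-window sapply() version.
ngrams_vectorized <- function(text, n = 2) {
  words <- unlist(strsplit(text, "\\s+"))
  words <- words[words != "" & !is.na(words)]
  if (length(words) < n) return(character(0))
  # Build n shifted copies of the word vector and paste them element-wise
  starts <- seq_len(length(words) - n + 1)
  shifted <- lapply(seq_len(n), function(k) words[starts + k - 1])
  do.call(paste, shifted)
}

# Collect n-grams from many documents without growing a vector in a loop
collect_ngrams <- function(docs, n = 2) {
  unlist(lapply(docs, ngrams_vectorized, n = n), use.names = FALSE)
}

# Toy documents for illustration only
docs <- c("one of the best", "one of the", "short")
bigram_counts <- sort(table(collect_ngrams(docs, n = 2)), decreasing = TRUE)
print(bigram_counts)
```

Because the helper returns `character(0)` for lines that are too short, the caller does not need a separate NULL check.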

6. Trigram Analysis (Three-Word Phrases)

# Reuse extract_ngrams() from the bigram section, passing n = 3 explicitly
# Extract trigrams from the first 1,000 sampled documents (raw text, case preserved)
cat("Extracting trigrams from corpus...\n")

## Extracting trigrams from corpus...

all_trigrams <- c()
for (i in 1:min(1000, length(Sample_Text))) {
  tg <- extract_ngrams(Sample_Text[i], n = 3)
  if (!is.null(tg) && length(tg) > 0) {
    all_trigrams <- c(all_trigrams, tg)
  }
}

# Check if we have any trigrams
if (length(all_trigrams) == 0) {
  cat("No trigrams extracted. Check data input.\n")
  top_trigrams <- NULL
} else {
  # Calculate trigram frequencies
  trigram_table <- table(all_trigrams)
  trigram_freq <- sort(trigram_table, decreasing = TRUE)

  # Top 15 trigrams
  top_trigrams <- head(trigram_freq, 15)

  # Display top trigrams
  cat("Top 15 Most Common Trigrams:\n")
  print(top_trigrams)
  
  # Visualize top trigrams (only if we have data)
  if (length(top_trigrams) > 0 && all(!is.na(top_trigrams)) && all(top_trigrams > 0)) {
    barplot(top_trigrams, 
            main = "Top 15 Most Common Trigrams",
            xlab = "Trigram",
            ylab = "Frequency",
            las = 2,
            cex.names = 0.6,
            col = "#A23B72")
  } else {
    cat("No valid trigrams to plot.\n")
  }
}

## Top 15 Most Common Trigrams:
## all_trigrams
##     a lot of   as well as   one of the   One of the        . . . in the first 
##            8            7            6            6            5            5 
##  some of the     to be in are going to   end of the    I need to    I want to 
##            5            5            4            4            4            4 
##    is in the    is one of    it to the 
##            4            4            4

7. Word Frequency Distribution

# Plot word frequency distribution
top_words_df <- data.frame(
  Word = names(top_50_words),
  Frequency = as.numeric(top_50_words)
)

# Ensure we have valid data
if (nrow(top_words_df) > 0 && all(!is.na(top_words_df$Frequency))) {
  ggplot(top_words_df, aes(x = reorder(Word, Frequency), y = Frequency, fill = Frequency)) +
    geom_bar(stat = "identity", show.legend = FALSE) +
    coord_flip() +
    labs(title = "Top 50 Most Common Words",
         x = "Word",
         y = "Frequency") +
    theme_minimal() +
    theme(axis.text.y = element_text(size = 8))
} else {
  cat("No word frequency data available for plotting.\n")
}

Summary Statistics

Key Findings from EDA

Metric	Description
Total Documents	4,269,642 lines: 899,288 blogs, 1,010,206 news, 2,360,148 Twitter
Data Types	Unstructured text data from 3 different sources
Language	English language text
Content Variety	Blog posts, news articles, social media posts
Document Length	Highly variable: mean of 12.75 words per line for Twitter, 34.41 for news, 41.75 for blogs
Vocabulary Size	Expected to be large, with overlap across sources; not yet measured directly
N-gram Patterns	Function-word sequences dominate (e.g. "of the", "in the", "a lot of", "as well as")

Data Characteristics

Heterogeneous Sources: The data combines formal (news) and informal (Twitter, blogs) writing styles
Variable Length: Document lengths vary significantly by source type
Rich Vocabulary: Large vocabulary with domain-specific terminology
Common Patterns: Frequent n-grams that can be leveraged for prediction

Plan for Creating the Text Predictive Model

Approach Overview

Based on the exploratory data analysis, I will develop an n-gram language model for next-word prediction. Here’s my plan:

1. Data Preparation

Steps:
- Load and clean all three corpora completely
- Apply consistent preprocessing (lowercase, remove punctuation, remove stopwords)
- Tokenize text into individual words
- Handle edge cases (special characters, numbers, URLs)
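The preparation steps could be sketched in base R roughly as follows. The `tokenize_line` helper and its regular expressions are illustrative assumptions rather than the final pipeline. It covers lowercasing, URL and digit removal, and punctuation stripping; stopword removal is omitted here.

```r
# Hypothetical cleaning helper; the patterns below are illustrative choices.
tokenize_line <- function(line) {
  line <- tolower(line)
  line <- gsub("(https?://|www\\.)\\S+", " ", line, perl = TRUE)  # drop URLs
  line <- gsub("[0-9]+", " ", line)                               # drop digits
  line <- gsub("[^a-z' ]", " ", line)           # keep ASCII letters, apostrophes, spaces
  tokens <- unlist(strsplit(line, "\\s+"))
  tokens <- gsub("^'+|'+$", "", tokens)         # trim stray leading/trailing apostrophes
  tokens[tokens != ""]
}

tokenize_line("Check https://example.com NOW, it's 100% free!!")
```

Note that this sketch discards non-ASCII characters, including curly apostrophes, so a real pipeline would probably normalize those before this step.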

2. N-gram Model Development

N-gram Type	Purpose
Unigrams	Base word frequency probabilities
Bigrams	Predict the next word from the previous word
Trigrams	Predict the next word from the previous two words
4-grams	Predict the next word from the previous three words

Strategy:
- Build separate models for each n-gram level (1-4)
- Use Kneser-Ney smoothing to handle unseen n-grams
- Implement backoff strategy for rare/unseen sequences
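To make the smoothing step concrete, here is a bigram-only sketch of interpolated Kneser-Ney with a single fixed discount. It is a simplified illustration, not the planned implementation: the function names, the discount of 0.75, and the dense `table()` representation (which would not scale to the full vocabulary) are all assumptions.

```r
# Bigram-only sketch of interpolated Kneser-Ney smoothing.
# `tokens` is a character vector of words in reading order.
build_kn_bigram <- function(tokens, discount = 0.75) {
  first  <- head(tokens, -1)
  second <- tail(tokens, -1)
  counts <- table(first, second)                 # c(v, w)
  list(
    counts       = counts,
    context_n    = rowSums(counts),              # c(v, .)
    followers    = rowSums(counts > 0),          # number of distinct words after v
    continuation = colSums(counts > 0) / sum(counts > 0),  # P_cont(w)
    discount     = discount
  )
}

# P_KN(w | v) for every candidate w that was seen as a second word
kn_probs <- function(model, context) {
  if (!context %in% rownames(model$counts)) {
    return(model$continuation)                   # unseen context: use P_cont only
  }
  c_vw   <- model$counts[context, ]
  c_v    <- model$context_n[[context]]
  lambda <- model$discount * model$followers[[context]] / c_v
  pmax(c_vw - model$discount, 0) / c_v + lambda * model$continuation
}

# Toy text for illustration only
tokens <- strsplit("i want to go to the store i want to eat", " ")[[1]]
model <- build_kn_bigram(tokens)
probs <- kn_probs(model, "want")
names(probs)[which.max(probs)]
```

The discounted mass taken from seen bigrams is redistributed according to how many distinct contexts each word follows, so the probabilities for a given context still sum to 1.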

3. Model Evaluation

Metrics to evaluate:
- Perplexity: Lower is better (measures prediction uncertainty)
- Accuracy: Percentage of correct next-word predictions
- Holdout validation: Test on unseen data (20% split)
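A minimal sketch of these three pieces in base R is shown below. The helper names are hypothetical, and the inputs are assumed to come from whatever model is eventually built: `probs` holds the probability the model assigned to each word that actually followed, and `predicted` holds its top-ranked guess.

```r
# Hypothetical evaluation helpers.

# Random holdout split; `lines` stands in for the cleaned corpus
split_holdout <- function(lines, test_fraction = 0.2, seed = 123) {
  set.seed(seed)
  n_test <- floor(length(lines) * test_fraction)
  is_test <- seq_along(lines) %in% sample(seq_along(lines), n_test)
  list(train = lines[!is_test], test = lines[is_test])
}

# Perplexity: exp of the average negative log-probability of the observed words
perplexity <- function(probs) exp(-mean(log(probs)))

# Share of cases where the top prediction equals the word that followed
top1_accuracy <- function(predicted, actual) mean(predicted == actual)

# Toy inputs for illustration only
parts <- split_holdout(paste("line", 1:10))
perplexity(rep(0.25, 8))
top1_accuracy(c("a", "b", "c", "d"), c("a", "b", "x", "d"))
```

A model that always assigns probability 0.25 to the correct word has a perplexity of 4, which matches the intuition of choosing uniformly among four candidates.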

4. Final Data Product

Deliverable: An R function/predictor that:
- Takes user input (previous words) as context
- Returns ranked list of predicted next words
- Provides confidence scores for predictions
- Works in real-time for keyboard integration
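The shape of such a predictor might look like the sketch below. For brevity it swaps in a simpler scheme than the planned Kneser-Ney model: a longest-context-first lookup with a fixed penalty per backoff step (often called "stupid backoff"), so the returned scores are relative rankings rather than probabilities. The function names, the penalty of 0.4, and the toy text are assumptions.

```r
# Hypothetical predictor sketch using fixed-penalty backoff.

# Count tables for n = 1..max_n from a token vector (lowercase assumed)
build_ngram_counts <- function(tokens, max_n = 3) {
  stopifnot(length(tokens) >= max_n)
  lapply(seq_len(max_n), function(n) {
    starts <- seq_len(length(tokens) - n + 1)
    table(do.call(paste, lapply(seq_len(n), function(k) tokens[starts + k - 1])))
  })
}

predict_next <- function(counts, context, top_k = 3, penalty = 0.4) {
  context <- tolower(unlist(strsplit(trimws(context), "\\s+")))
  scores <- numeric(0)
  weight <- 1
  for (n in length(counts):1) {
    k <- n - 1                                  # context words used at this level
    if (length(context) < k) next               # not enough context for this level
    tab <- counts[[n]]
    grams <- names(tab)
    hits <- if (k > 0) {
      startsWith(grams, paste0(paste(tail(context, k), collapse = " "), " "))
    } else {
      rep(TRUE, length(grams))
    }
    if (any(hits)) {
      level <- weight * as.numeric(tab[hits]) / sum(tab[hits])
      names(level) <- sub(".* ", "", grams[hits])   # last word of each n-gram
      # Keep the score from the longest context in which a word was found
      scores <- c(scores, level[!names(level) %in% names(scores)])
    }
    weight <- weight * penalty                  # penalize each backoff step
  }
  head(sort(scores, decreasing = TRUE), top_k)
}

# Toy text for illustration only
tokens <- strsplit("i want to go to the store i want to eat", " ")[[1]]
counts <- build_ngram_counts(tokens, max_n = 3)
predict_next(counts, "I want")
```

Scanning n-gram names with `startsWith()` is fine for a toy example; a real-time keyboard would likely need a keyed lookup structure instead.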

5. Technical Implementation

Tools and Packages:
- tm for text processing
- Base R (strsplit, paste, table) for n-gram extraction
- Custom implementation for n-gram models
- ggplot2 for visualization
- dplyr for data manipulation

Conclusion

This exploratory data analysis has revealed the key characteristics of the SwiftKey text prediction dataset:

Rich, diverse text data from three complementary sources
Clear patterns in word usage and n-gram frequencies
A dataset well suited to an n-gram language modeling approach

Coursera Data Science Capstone - Milestone Report

Swarup Kumar Roy

2026-05-28