Executive Summary

This report presents an exploratory analysis of a large text corpus to be used for building a predictive text application. The dataset consists of blog posts, news articles, and Twitter messages in English. Our analysis reveals key characteristics of the data and outlines our strategy for developing a word prediction algorithm.

Key Findings:

  • Dataset contains over 4 million lines of text across three sources
  • Vocabulary analysis shows distinct patterns between formal (news) and informal (Twitter) content
  • Most common words follow expected natural language patterns
  • A sampling-based approach will enable efficient model building

1. Introduction

Project Goal

Develop a data-driven text prediction application (similar to smartphone keyboards) that suggests the next word as users type.

Dataset Overview

The Coursera-SwiftKey dataset contains English text from three sources:

  • Blogs: Personal writing, moderate formality
  • News: Professional journalism, high formality
  • Twitter: Social media, informal communication

2. Data Loading and Basic Statistics

# Load required libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringi)
library(knitr)
library(gridExtra)
# Define file paths
blogs_file <- "final/en_US/en_US.blogs.txt"
news_file <- "final/en_US/en_US.news.txt"
twitter_file <- "final/en_US/en_US.twitter.txt"

2.1 File Statistics

# Function to get file statistics
get_file_stats <- function(filepath) {
  # File size
  size_mb <- file.info(filepath)$size / (1024^2)
  
  # Read file for line and word counts
  con <- file(filepath, "r")
  lines <- readLines(con, warn = FALSE)
  close(con)
  
  # Line count
  line_count <- length(lines)
  
  # Word count
  word_count <- sum(stri_count_words(lines))
  
  # Character count
  char_count <- sum(nchar(lines))
  
  # Average words per line
  avg_words <- word_count / line_count
  
  return(data.frame(
    Size_MB = round(size_mb, 2),
    Lines = line_count,
    Words = word_count,
    Characters = char_count,
    Avg_Words_Per_Line = round(avg_words, 2)
  ))
}

# Get statistics for all files (if files exist)
file_stats <- data.frame()

if(file.exists(blogs_file)) {
  blogs_stats <- get_file_stats(blogs_file)
  blogs_stats$Source <- "Blogs"
  file_stats <- rbind(file_stats, blogs_stats)
}

if(file.exists(news_file)) {
  news_stats <- get_file_stats(news_file)
  news_stats$Source <- "News"
  file_stats <- rbind(file_stats, news_stats)
}

if(file.exists(twitter_file)) {
  twitter_stats <- get_file_stats(twitter_file)
  twitter_stats$Source <- "Twitter"
  file_stats <- rbind(file_stats, twitter_stats)
}

# Reorder columns
file_stats <- file_stats[, c("Source", "Size_MB", "Lines", "Words", 
                              "Characters", "Avg_Words_Per_Line")]

# Display table
kable(file_stats, 
      format.args = list(big.mark = ","),
      caption = "Table 1: Summary Statistics of Text Datasets")
Table 1: Summary Statistics of Text Datasets
Source     Size_MB       Lines        Words    Characters   Avg_Words_Per_Line
--------  --------  ----------  -----------  ------------  -------------------
Blogs       200.42     899,288   37,546,621   206,824,505                41.75
News        196.28   1,010,242   34,760,761   203,223,159                34.41
Twitter     159.36   2,360,148   30,109,954   162,096,031                12.76

Observations:

  • The datasets are substantial, totaling roughly 556 MB of text data
  • Twitter has the most entries but shortest average length (character limit)
  • Blogs and news articles are longer-form content

2.2 Visual Comparison

# Reshape data for plotting
file_stats_long <- file_stats %>%
  select(Source, Lines, Words) %>%
  pivot_longer(cols = c(Lines, Words), 
               names_to = "Metric", 
               values_to = "Count")

# Create comparison plot
ggplot(file_stats_long, aes(x = Source, y = Count, fill = Source)) +
  geom_bar(stat = "identity") +
  facet_wrap(~Metric, scales = "free_y") +
  scale_y_continuous(labels = scales::comma) +
  theme_minimal() +
  labs(title = "Figure 1: Comparison of Dataset Sizes",
       subtitle = "Lines vs Words across three text sources",
       y = "Count",
       x = "") +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

3. Sampling Strategy

Given the large size of the datasets, we’ll use a 1% random sample for detailed exploration. This approach:

  • Reduces computation time significantly
  • Maintains statistical representativeness
  • Enables rapid prototyping and iteration

set.seed(12345)  # For reproducibility

# Function to sample lines from file
sample_file <- function(filepath, sample_rate = 0.01) {
  con <- file(filepath, "r")
  lines <- readLines(con, warn = FALSE)
  close(con)
  
  # Random sampling
  sample_size <- floor(length(lines) * sample_rate)
  sampled_lines <- sample(lines, sample_size)
  
  return(sampled_lines)
}

# Create samples (default to empty so the combine step below still works
# if a file is missing)
blogs_sample <- news_sample <- twitter_sample <- character(0)

if(file.exists(blogs_file)) {
  blogs_sample <- sample_file(blogs_file)
}

if(file.exists(news_file)) {
  news_sample <- sample_file(news_file)
}

if(file.exists(twitter_file)) {
  twitter_sample <- sample_file(twitter_file)
}

# Combine all samples
all_samples <- c(blogs_sample, news_sample, twitter_sample)

cat("Sample sizes:\n")
## Sample sizes:
cat("  Blogs:", length(blogs_sample), "lines\n")
##   Blogs: 8992 lines
cat("  News:", length(news_sample), "lines\n")
##   News: 10102 lines
cat("  Twitter:", length(twitter_sample), "lines\n")
##   Twitter: 23601 lines
cat("  Total:", length(all_samples), "lines\n")
##   Total: 42695 lines

4. Text Preprocessing and Tokenization

# Function to clean and tokenize text
clean_and_tokenize <- function(text) {
  # Convert to lowercase
  text <- tolower(text)
  
  # Remove URLs (perl = TRUE is required for the non-capturing groups)
  text <- gsub("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+", 
               "", text, perl = TRUE)
  
  # Remove email addresses
  text <- gsub("\\S+@\\S+", "", text)
  
  # Remove Twitter handles
  text <- gsub("@\\w+", "", text)
  
  # Extract words (including contractions)
  words <- unlist(stri_extract_all_words(text))
  
  return(words)
}

# Tokenize all samples
all_words <- clean_and_tokenize(all_samples)

cat("Total words in sample:", length(all_words), "\n")
## Total words in sample: 1023771
cat("Unique words in sample:", length(unique(all_words)), "\n")
## Unique words in sample: 55061

5. Word Frequency Analysis

5.1 Most Common Words

# Calculate word frequencies
word_freq <- as.data.frame(table(all_words))
colnames(word_freq) <- c("Word", "Frequency")
word_freq <- word_freq %>%
  arrange(desc(Frequency)) %>%
  mutate(Proportion = Frequency / sum(Frequency) * 100)

# Top 20 words
top20 <- head(word_freq, 20)

kable(top20, 
      row.names = FALSE,
      caption = "Table 2: Top 20 Most Frequent Words",
      digits = 2)
Table 2: Top 20 Most Frequent Words
Word    Frequency   Proportion
-----  ----------  -----------
the         48048         4.69
to          27627         2.70
a           23871         2.33
and         23862         2.33
of          20199         1.97
in          16653         1.63
i           16523         1.61
for         10905         1.07
is          10814         1.06
that        10398         1.02
you          9780         0.96
it           9157         0.89
on           8210         0.80
with         7058         0.69
was          6177         0.60
my           6030         0.59
at           5723         0.56
this         5514         0.54
be           5447         0.53
have         5369         0.52

# Visualize top 30 words
top30 <- head(word_freq, 30)

ggplot(top30, aes(x = reorder(Word, Frequency), y = Frequency)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Figure 2: Top 30 Most Frequent Words",
       subtitle = "Based on 1% sample of combined datasets",
       x = "Word",
       y = "Frequency") +
  theme(plot.title = element_text(face = "bold"))

Observations:

  • Common English stop words (the, to, and, a, of) dominate
  • These will be important for prediction but may need special handling

5.2 Coverage Analysis

# Calculate cumulative coverage
word_freq <- word_freq %>%
  mutate(Cumulative_Prop = cumsum(Proportion))

# How many words needed for 50% and 90% coverage?
words_50 <- which(word_freq$Cumulative_Prop >= 50)[1]
words_90 <- which(word_freq$Cumulative_Prop >= 90)[1]

cat("Words needed to cover:\n")
## Words needed to cover:
cat("  50% of text:", words_50, "words\n")
##   50% of text: 155 words
cat("  90% of text:", words_90, "words\n")
##   90% of text: 7517 words
# Plot coverage curve
coverage_data <- word_freq[1:1000, ]

ggplot(coverage_data, aes(x = 1:nrow(coverage_data), y = Cumulative_Prop)) +
  geom_line(color = "darkblue", size = 1) +
  geom_hline(yintercept = 50, linetype = "dashed", color = "red") +
  geom_hline(yintercept = 90, linetype = "dashed", color = "orange") +
  annotate("text", x = 700, y = 52, label = "50% coverage", color = "red") +
  annotate("text", x = 700, y = 92, label = "90% coverage", color = "orange") +
  theme_minimal() +
  labs(title = "Figure 3: Cumulative Word Coverage",
       subtitle = "How many unique words are needed to cover percentage of text",
       x = "Number of Unique Words (ranked by frequency)",
       y = "Cumulative Coverage (%)") +
  theme(plot.title = element_text(face = "bold"))

Key Finding: A relatively small vocabulary can cover a large portion of the text, which will help optimize our prediction model.

5.3 Word Length Distribution

# Calculate word lengths
word_lengths <- nchar(all_words)

# Create histogram
ggplot(data.frame(Length = word_lengths), aes(x = Length)) +
  geom_histogram(binwidth = 1, fill = "forestgreen", color = "black") +
  scale_x_continuous(breaks = seq(0, 20, 2)) +
  theme_minimal() +
  labs(title = "Figure 4: Distribution of Word Lengths",
       subtitle = "Character count per word",
       x = "Word Length (characters)",
       y = "Frequency") +
  theme(plot.title = element_text(face = "bold"))

6. N-gram Analysis

N-grams are sequences of N consecutive words; for example, the phrase “thanks for the” contains the bigrams “thanks for” and “for the” and is itself a trigram. N-grams are essential for predicting the next word based on context.

6.1 Bigrams (2-word sequences)

# Function to create n-grams
create_ngrams <- function(words, n = 2) {
  if(length(words) < n) return(character(0))
  
  ngrams <- character(length(words) - n + 1)
  for(i in 1:(length(words) - n + 1)) {
    ngrams[i] <- paste(words[i:(i + n - 1)], collapse = " ")
  }
  return(ngrams)
}

# Create bigrams
bigrams <- create_ngrams(all_words, 2)

# Calculate bigram frequencies
bigram_freq <- as.data.frame(table(bigrams))
colnames(bigram_freq) <- c("Bigram", "Frequency")
bigram_freq <- bigram_freq %>%
  arrange(desc(Frequency))

# Top 20 bigrams
top20_bigrams <- head(bigram_freq, 20)

kable(top20_bigrams, 
      row.names = FALSE,
      caption = "Table 3: Top 20 Most Frequent Bigrams")
Table 3: Top 20 Most Frequent Bigrams
Bigram      Frequency
---------  ----------
of the           4317
in the           4162
to the           2098
for the          2014
on the           1923
to be            1612
at the           1434
and the          1270
in a             1201
with the         1057
is a             1008
it was            977
i have            903
for a             887
from the          834
i was             833
going to          818
of a              815
and i             813
it is             796

# Visualize top 20 bigrams
ggplot(top20_bigrams, aes(x = reorder(Bigram, Frequency), y = Frequency)) +
  geom_bar(stat = "identity", fill = "coral") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Figure 5: Top 20 Most Frequent Bigrams",
       subtitle = "Two-word sequences",
       x = "Bigram",
       y = "Frequency") +
  theme(plot.title = element_text(face = "bold"))

6.2 Trigrams (3-word sequences)

# Create trigrams
trigrams <- create_ngrams(all_words, 3)

# Calculate trigram frequencies
trigram_freq <- as.data.frame(table(trigrams))
colnames(trigram_freq) <- c("Trigram", "Frequency")
trigram_freq <- trigram_freq %>%
  arrange(desc(Frequency))

# Top 20 trigrams
top20_trigrams <- head(trigram_freq, 20)

kable(top20_trigrams, 
      row.names = FALSE,
      caption = "Table 4: Top 20 Most Frequent Trigrams")
Table 4: Top 20 Most Frequent Trigrams
Trigram               Frequency
-------------------  ----------
one of the                  348
a lot of                    302
thanks for the              233
to be a                     186
going to be                 174
out of the                  160
i want to                   156
as well as                  150
it was a                    141
part of the                 138
the end of                  138
be able to                  137
some of the                 137
the rest of                 123
i have a                    121
looking forward to          118
thank you for               115
there is a                  111
i need to                   110
is going to                 109

# Visualize top 20 trigrams
ggplot(top20_trigrams, aes(x = reorder(Trigram, Frequency), y = Frequency)) +
  geom_bar(stat = "identity", fill = "mediumpurple") +
  coord_flip() +
  theme_minimal() +
  labs(title = "Figure 6: Top 20 Most Frequent Trigrams",
       subtitle = "Three-word sequences",
       x = "Trigram",
       y = "Frequency") +
  theme(plot.title = element_text(face = "bold"))

7. Interesting Findings

7.1 Rare Words Distribution

# Words that appear only once (hapax legomena)
singleton_words <- word_freq %>% filter(Frequency == 1)

# Frequency of frequencies
freq_of_freq <- as.data.frame(table(word_freq$Frequency))
colnames(freq_of_freq) <- c("Occurrence", "Number_of_Words")
freq_of_freq$Occurrence <- as.numeric(as.character(freq_of_freq$Occurrence))

# Plot first 50 frequency levels
ggplot(head(freq_of_freq, 50), 
       aes(x = Occurrence, y = Number_of_Words)) +
  geom_bar(stat = "identity", fill = "darkorange") +
  scale_y_log10(labels = scales::comma) +
  theme_minimal() +
  labs(title = "Figure 7: Frequency of Frequencies",
       subtitle = "How many words appear exactly N times (log scale)",
       x = "Number of Occurrences",
       y = "Number of Words (log scale)") +
  theme(plot.title = element_text(face = "bold"))

cat("Words appearing only once:", nrow(singleton_words), "\n")
## Words appearing only once: 28328
cat("Percentage of vocabulary:", 
    round(nrow(singleton_words) / nrow(word_freq) * 100, 2), "%\n")
## Percentage of vocabulary: 51.45 %

Observation: Many words appear very rarely. This “long tail” is typical in natural language and will require special handling in our model.
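
One option we are considering, sketched below, is to collapse these rare words into a single <UNK> placeholder before building the n-gram tables. The one-occurrence cutoff shown here is illustrative and would be tuned later.

# Sketch only (not run above): collapse hapax legomena into an <UNK> token
# before building n-grams; the one-occurrence cutoff is illustrative.
rare_words    <- as.character(singleton_words$Word)
all_words_unk <- ifelse(all_words %in% rare_words, "<UNK>", all_words)

cat("Vocabulary size before:", length(unique(all_words)),
    "after:", length(unique(all_words_unk)), "\n")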

7.2 Source-Specific Patterns

# Tokenize each source separately
blogs_words <- clean_and_tokenize(blogs_sample)
news_words <- clean_and_tokenize(news_sample)
twitter_words <- clean_and_tokenize(twitter_sample)

# Calculate unique word counts
unique_counts <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Total_Words = c(length(blogs_words), length(news_words), length(twitter_words)),
  Unique_Words = c(length(unique(blogs_words)), 
                   length(unique(news_words)), 
                   length(unique(twitter_words)))
)

unique_counts <- unique_counts %>%
  mutate(Vocabulary_Richness = round(Unique_Words / Total_Words * 100, 2))

kable(unique_counts, 
      caption = "Table 5: Vocabulary Richness by Source",
      format.args = list(big.mark = ","))
Table 5: Vocabulary Richness by Source
Source     Total_Words   Unique_Words   Vocabulary_Richness
--------  ------------  -------------  --------------------
Blogs          373,397         28,723                  7.69
News           348,594         30,983                  8.89
Twitter        301,780         25,638                  8.50

ggplot(unique_counts, aes(x = Source, y = Vocabulary_Richness, fill = Source)) +
  geom_bar(stat = "identity") +
  theme_minimal() +
  labs(title = "Figure 8: Vocabulary Richness by Source",
       subtitle = "Percentage of unique words",
       y = "Vocabulary Richness (%)",
       x = "") +
  theme(legend.position = "none",
        plot.title = element_text(face = "bold"))

8. Plans for Prediction Algorithm

8.1 Proposed Approach

Our text prediction algorithm will use an N-gram model with backoff (a simplified sketch follows this list):

  1. N-gram Model: Use sequences of 2-5 words to predict the next word
  2. Backoff Strategy: If no match in 5-grams, try 4-grams, then 3-grams, etc.
  3. Smoothing: Handle unseen word combinations using Kneser-Ney smoothing
  4. Profanity Filter: Remove offensive words from predictions
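
As a concrete starting point, the sketch below shows a heavily simplified backoff lookup built on the frequency tables from Sections 5 and 6: it only goes from trigrams down to bigrams and then unigrams, with no smoothing or profanity filtering yet, so it illustrates the idea rather than the final implementation.

# Simplified backoff lookup (sketch): trigrams -> bigrams -> unigrams.
# Assumes the word_freq, bigram_freq and trigram_freq tables built in
# Sections 5 and 6; no smoothing or profanity filtering yet.
predict_next_word <- function(phrase, n_suggestions = 3) {
  words <- unlist(stri_extract_all_words(tolower(phrase)))

  # Try the trigram table first: match on the last two words typed
  if (length(words) >= 2) {
    prefix <- paste(tail(words, 2), collapse = " ")
    hits <- trigram_freq[startsWith(as.character(trigram_freq$Trigram),
                                    paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      return(head(sapply(strsplit(as.character(hits$Trigram), " "), tail, 1),
                  n_suggestions))
    }
  }

  # Back off to the bigram table: match on the last word only
  if (length(words) >= 1) {
    prefix <- tail(words, 1)
    hits <- bigram_freq[startsWith(as.character(bigram_freq$Bigram),
                                   paste0(prefix, " ")), ]
    if (nrow(hits) > 0) {
      return(head(sapply(strsplit(as.character(hits$Bigram), " "), tail, 1),
                  n_suggestions))
    }
  }

  # Final fallback: the most frequent unigrams overall
  head(as.character(word_freq$Word), n_suggestions)
}

predict_next_word("thanks for")   # expected to suggest "the" first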

8.2 Model Building Steps

Table 6: Model Development Plan
Step   Task                                 Description
-----  -----------------------------------  ----------------------------------------------
1      Data Sampling & Cleaning             Sample data, remove profanity, clean text
2      N-gram Generation (2-5 grams)        Create all n-grams from training corpus
3      Frequency Calculation & Pruning      Keep only n-grams above frequency threshold
4      Implement Backoff Algorithm          Implement fallback to shorter n-grams
5      Add Smoothing & Profanity Filter     Handle unseen combinations, filter bad words
6      Performance Testing & Optimization   Test accuracy and speed, optimize size
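
To illustrate step 3, the snippet below prunes the bigram and trigram tables from Section 6 at a frequency threshold; the cutoff of 2 is an assumption and would be chosen by trading model size against held-out accuracy.

# Illustrative sketch of step 3: keep only n-grams at or above a frequency
# threshold. The cutoff of 2 is an assumption to be tuned later.
min_count <- 2
bigram_pruned  <- bigram_freq  %>% filter(Frequency >= min_count)
trigram_pruned <- trigram_freq %>% filter(Frequency >= min_count)

cat("Bigrams kept: ", nrow(bigram_pruned), "of", nrow(bigram_freq), "\n")
cat("Trigrams kept:", nrow(trigram_pruned), "of", nrow(trigram_freq), "\n")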

8.3 Shiny App Features

The final application will include the following features (a minimal interface sketch follows the list):

  • Real-time prediction: Suggest words as user types
  • Multiple suggestions: Show top 3-5 predictions
  • Responsive interface: Fast predictions (< 100ms)
  • User feedback: Allow users to report issues
  • Statistics dashboard: Show model performance metrics
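
A minimal Shiny skeleton for this interface is sketched below; it assumes a predict_next_word() function like the one outlined in Section 8.1, and the layout, suggestion count, and dashboard are placeholders.

# Minimal Shiny skeleton (sketch); assumes a predict_next_word() function
# such as the one outlined in Section 8.1. Layout and features are placeholders.
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  h4("Suggestions:"),
  verbatimTextOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("(start typing)")
    paste(predict_next_word(input$phrase, n_suggestions = 3), collapse = " | ")
  })
}

shinyApp(ui, server)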

8.4 Technical Challenges

Challenge 1: Model Size - Solution: Prune rare n-grams, use efficient data structures (hash tables)

Challenge 2: Speed - Solution: Pre-compute lookup tables, implement caching

Challenge 3: Accuracy - Solution: Use larger n-grams (4-5 words) where possible, validate against test set

Challenge 4: Unknown Words - Solution: Implement backoff to shorter n-grams, use word similarity for suggestions
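
For Challenges 1 and 2, one approach we may take is to pre-compute the top continuations for each two-word prefix and store them in an R environment, which behaves like a hash table; the sketch below caps the table size purely for illustration.

# Sketch for Challenges 1 and 2: pre-compute the top continuations for each
# two-word prefix and store them in an environment, R's built-in hash table.
# The 50,000-row cap is purely illustrative (a real table would be pruned).
top_tris <- head(trigram_freq, 50000)
splits   <- strsplit(as.character(top_tris$Trigram), " ")
prefixes <- sapply(splits, function(w) paste(w[1:2], collapse = " "))
nexts    <- sapply(splits, function(w) w[3])

lookup <- new.env(hash = TRUE)
for (i in seq_along(prefixes)) {
  key     <- prefixes[i]
  current <- if (exists(key, envir = lookup, inherits = FALSE)) {
    lookup[[key]]
  } else {
    character(0)
  }
  # top_tris is sorted by frequency, so the first three continuations seen
  # for a prefix are also its most frequent ones
  if (length(current) < 3) assign(key, c(current, nexts[i]), envir = lookup)
}

# Constant-time retrieval at prediction time
lookup[["thanks for"]]   # expected to include "the"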

9. Next Steps

  1. Expand Sample: Test with a 5-10% sample to validate findings (see the snippet after this list)
  2. Build N-gram Database: Create comprehensive n-gram frequency tables
  3. Implement Prediction Function: Code the core algorithm
  4. Develop Shiny Interface: Create user-friendly web application
  5. Validate & Test: Measure prediction accuracy and speed
  6. Deploy: Publish to shinyapps.io
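
For step 1, the sample_file() helper from Section 3 already takes a sampling rate, so expanding the sample is a small change; a 10% rate is shown below as an example.

# Step 1 (illustrative): re-sample at a larger rate using the sample_file()
# helper from Section 3; 10% is shown as an example.
set.seed(12345)
blogs_sample_10   <- sample_file(blogs_file,   sample_rate = 0.10)
news_sample_10    <- sample_file(news_file,    sample_rate = 0.10)
twitter_sample_10 <- sample_file(twitter_file, sample_rate = 0.10)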

10. Conclusion

Our exploratory analysis has revealed:

  • Successful data loading of 4+ million lines across three text sources
  • Clear patterns in word frequency that follow natural language distributions
  • An efficient sampling strategy that enables rapid analysis of massive datasets
  • N-gram patterns that show promise for building accurate predictions
  • A feasible approach using backoff models and smoothing

The data is clean, well-understood, and ready for model building. Our planned n-gram approach with backoff is well-suited for this prediction task and will deliver a responsive, accurate text prediction application.


Report generated on 2025-11-25 for Coursera Data Science Capstone Project