This report presents an exploratory analysis of a large text corpus to be used for building a predictive text application. The dataset consists of blog posts, news articles, and Twitter messages in English. Our analysis reveals key characteristics of the data and outlines our strategy for developing a word prediction algorithm.
Key Findings:

- A small, high-frequency vocabulary covers most of the text: roughly 155 words account for 50% of all word occurrences in our sample, and about 7,500 words account for 90%.
- Word frequencies follow the long-tailed distribution typical of natural language; over half of the sampled vocabulary appears only once.
- Twitter lines are much shorter on average (about 13 words) than blog lines (about 42) or news lines (about 34).

Our goal is to develop a data-driven text prediction application (similar to smartphone keyboards) that suggests the next word as users type.

The Coursera-SwiftKey dataset contains English text from three sources: blog posts, news articles, and Twitter messages.
# Load required libraries
library(ggplot2)
library(dplyr)
library(tidyr)
library(stringi)
library(knitr)
library(gridExtra)

# Define file paths
blogs_file <- "final/en_US/en_US.blogs.txt"
news_file <- "final/en_US/en_US.news.txt"
twitter_file <- "final/en_US/en_US.twitter.txt"

# Function to get file statistics
get_file_stats <- function(filepath) {
# File size
size_mb <- file.info(filepath)$size / (1024^2)
# Read file for line and word counts
con <- file(filepath, "r")
lines <- readLines(con, warn = FALSE)
close(con)
# Line count
line_count <- length(lines)
# Word count
word_count <- sum(stri_count_words(lines))
# Character count
char_count <- sum(nchar(lines))
# Average words per line
avg_words <- word_count / line_count
return(data.frame(
Size_MB = round(size_mb, 2),
Lines = line_count,
Words = word_count,
Characters = char_count,
Avg_Words_Per_Line = round(avg_words, 2)
))
}
# Get statistics for all files (if files exist)
file_stats <- data.frame()
if(file.exists(blogs_file)) {
blogs_stats <- get_file_stats(blogs_file)
blogs_stats$Source <- "Blogs"
file_stats <- rbind(file_stats, blogs_stats)
}
if(file.exists(news_file)) {
news_stats <- get_file_stats(news_file)
news_stats$Source <- "News"
file_stats <- rbind(file_stats, news_stats)
}
if(file.exists(twitter_file)) {
twitter_stats <- get_file_stats(twitter_file)
twitter_stats$Source <- "Twitter"
file_stats <- rbind(file_stats, twitter_stats)
}
# Reorder columns
file_stats <- file_stats[, c("Source", "Size_MB", "Lines", "Words",
"Characters", "Avg_Words_Per_Line")]
# Display table
kable(file_stats,
format.args = list(big.mark = ","),
caption = "Table 1: Summary Statistics of Text Datasets")| Source | Size_MB | Lines | Words | Characters | Avg_Words_Per_Line |
|---|---|---|---|---|---|
| Blogs | 200.42 | 899,288 | 37,546,621 | 206,824,505 | 41.75 |
| News | 196.28 | 1,010,242 | 34,760,761 | 203,223,159 | 34.41 |
| Twitter | 159.36 | 2,360,148 | 30,109,954 | 162,096,031 | 12.76 |
Observations:

- The Twitter file has by far the most lines (about 2.4 million) but the shortest lines, averaging only about 13 words each, consistent with the platform's character limit.
- Blogs have the longest lines (about 42 words on average) and contribute the most words overall.
- The three files are comparable in raw size (roughly 160-200 MB each); combined, they contain over 4 million lines and more than 100 million words.
# Reshape data for plotting
file_stats_long <- file_stats %>%
select(Source, Lines, Words) %>%
pivot_longer(cols = c(Lines, Words),
names_to = "Metric",
values_to = "Count")
# Create comparison plot
ggplot(file_stats_long, aes(x = Source, y = Count, fill = Source)) +
geom_bar(stat = "identity") +
facet_wrap(~Metric, scales = "free_y") +
scale_y_continuous(labels = scales::comma) +
theme_minimal() +
labs(title = "Figure 1: Comparison of Dataset Sizes",
subtitle = "Lines vs Words across three text sources",
y = "Count",
x = "") +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))

Given the large size of the datasets, we'll use a 1% random sample for detailed exploration. This approach keeps memory use and processing time manageable while preserving the overall statistical properties of the full corpus.
set.seed(12345) # For reproducibility
# Function to sample lines from file
sample_file <- function(filepath, sample_rate = 0.01) {
con <- file(filepath, "r")
lines <- readLines(con, warn = FALSE)
close(con)
# Random sampling
sample_size <- floor(length(lines) * sample_rate)
sampled_lines <- sample(lines, sample_size)
return(sampled_lines)
}
# Create samples
if(file.exists(blogs_file)) {
blogs_sample <- sample_file(blogs_file)
}
if(file.exists(news_file)) {
news_sample <- sample_file(news_file)
}
if(file.exists(twitter_file)) {
twitter_sample <- sample_file(twitter_file)
}
# Combine all samples
all_samples <- c(blogs_sample, news_sample, twitter_sample)
cat("Sample sizes:\n")## Sample sizes:
## Blogs: 8992 lines
## News: 10102 lines
## Twitter: 23601 lines
## Total: 42695 lines
# Function to clean and tokenize text
clean_and_tokenize <- function(text) {
# Convert to lowercase
text <- tolower(text)
# Remove URLs
text <- gsub("http[s]?://(?:[a-zA-Z]|[0-9]|[$-_@.&+]|[!*\\(\\),]|(?:%[0-9a-fA-F][0-9a-fA-F]))+",
"", text)
# Remove email addresses
text <- gsub("\\S+@\\S+", "", text)
# Remove Twitter handles
text <- gsub("@\\w+", "", text)
# Extract words (including contractions)
words <- unlist(stri_extract_all_words(text))
return(words)
}
# Tokenize all samples
all_words <- clean_and_tokenize(all_samples)
cat("Total words in sample:", length(all_words), "\n")## Total words in sample: 1023771
## Unique words in sample: 55061
# Calculate word frequencies
word_freq <- as.data.frame(table(all_words))
colnames(word_freq) <- c("Word", "Frequency")
word_freq <- word_freq %>%
arrange(desc(Frequency)) %>%
mutate(Proportion = Frequency / sum(Frequency) * 100)
# Top 20 words
top20 <- head(word_freq, 20)
kable(top20,
row.names = FALSE,
caption = "Table 2: Top 20 Most Frequent Words",
digits = 2)

| Word | Frequency | Proportion |
|---|---|---|
| the | 48048 | 4.69 |
| to | 27627 | 2.70 |
| a | 23871 | 2.33 |
| and | 23862 | 2.33 |
| of | 20199 | 1.97 |
| in | 16653 | 1.63 |
| i | 16523 | 1.61 |
| for | 10905 | 1.07 |
| is | 10814 | 1.06 |
| that | 10398 | 1.02 |
| you | 9780 | 0.96 |
| it | 9157 | 0.89 |
| on | 8210 | 0.80 |
| with | 7058 | 0.69 |
| was | 6177 | 0.60 |
| my | 6030 | 0.59 |
| at | 5723 | 0.56 |
| this | 5514 | 0.54 |
| be | 5447 | 0.53 |
| have | 5369 | 0.52 |
# Visualize top 30 words
top30 <- head(word_freq, 30)
ggplot(top30, aes(x = reorder(Word, Frequency), y = Frequency)) +
geom_bar(stat = "identity", fill = "steelblue") +
coord_flip() +
theme_minimal() +
labs(title = "Figure 2: Top 30 Most Frequent Words",
subtitle = "Based on 1% sample of combined datasets",
x = "Word",
y = "Frequency") +
theme(plot.title = element_text(face = "bold"))

Observations:

- The most frequent words are almost all function words ("the", "to", "a", "and", "of") rather than content words.
- "the" alone accounts for nearly 5% of all word occurrences in the sample.
- Unlike many other NLP tasks, we keep these stop words: they are exactly the words a next-word predictor must suggest most often.
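To see which words rise to the top once the dominant function words are set aside, here is a small illustrative sketch; the stop-word list below is simply the top 20 words from Table 2, hand-picked for this example rather than drawn from any particular package.

```r
# Illustrative only: treat the 20 most frequent words from Table 2 as stop words
stop_words <- as.character(head(word_freq$Word, 20))

content_freq <- word_freq %>%
  filter(!Word %in% stop_words) %>%   # drop the dominant function words
  arrange(desc(Frequency))

head(content_freq, 10)                # most frequent remaining words
```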
# Calculate cumulative coverage
word_freq <- word_freq %>%
mutate(Cumulative_Prop = cumsum(Proportion))
# How many words needed for 50% and 90% coverage?
words_50 <- which(word_freq$Cumulative_Prop >= 50)[1]
words_90 <- which(word_freq$Cumulative_Prop >= 90)[1]
cat("Words needed to cover:\n")## Words needed to cover:
## 50% of text: 155 words
## 90% of text: 7517 words
# Plot coverage curve
coverage_data <- word_freq[1:1000, ]
ggplot(coverage_data, aes(x = 1:nrow(coverage_data), y = Cumulative_Prop)) +
geom_line(color = "darkblue", size = 1) +
geom_hline(yintercept = 50, linetype = "dashed", color = "red") +
geom_hline(yintercept = 90, linetype = "dashed", color = "orange") +
annotate("text", x = 700, y = 52, label = "50% coverage", color = "red") +
annotate("text", x = 700, y = 92, label = "90% coverage", color = "orange") +
theme_minimal() +
labs(title = "Figure 3: Cumulative Word Coverage",
subtitle = "How many unique words are needed to cover percentage of text",
x = "Number of Unique Words (ranked by frequency)",
y = "Cumulative Coverage (%)") +
theme(plot.title = element_text(face = "bold"))

Key Finding: A relatively small vocabulary covers a large portion of the text, which will let us keep the prediction model small without giving up much coverage.
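As a minimal sketch of how this finding could be exploited (the 90% cutoff and the "<unk>" placeholder mentioned in the comment are assumptions, not final design choices), the coverage computation above already yields a pruned vocabulary:

```r
# The top `words_90` words (computed above) already cover ~90% of the sample
pruned_vocab <- as.character(head(word_freq$Word, words_90))
cat("Pruned vocabulary size:", length(pruned_vocab), "words\n")

# Words outside this vocabulary could later be mapped to a shared placeholder
# such as "<unk>" before building n-grams (an assumption about the final design).
```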
# Calculate word lengths
word_lengths <- nchar(all_words)
# Create histogram
ggplot(data.frame(Length = word_lengths), aes(x = Length)) +
geom_histogram(binwidth = 1, fill = "forestgreen", color = "black") +
scale_x_continuous(breaks = seq(0, 20, 2)) +
theme_minimal() +
labs(title = "Figure 4: Distribution of Word Lengths",
subtitle = "Character count per word",
x = "Word Length (characters)",
y = "Frequency") +
theme(plot.title = element_text(face = "bold"))

N-grams are sequences of N consecutive words. They're essential for predicting the next word based on context.
# Function to create n-grams
create_ngrams <- function(words, n = 2) {
if(length(words) < n) return(character(0))
ngrams <- character(length(words) - n + 1)
for(i in 1:(length(words) - n + 1)) {
ngrams[i] <- paste(words[i:(i + n - 1)], collapse = " ")
}
return(ngrams)
}
# Create bigrams
bigrams <- create_ngrams(all_words, 2)
# Calculate bigram frequencies
bigram_freq <- as.data.frame(table(bigrams))
colnames(bigram_freq) <- c("Bigram", "Frequency")
bigram_freq <- bigram_freq %>%
arrange(desc(Frequency))
# Top 20 bigrams
top20_bigrams <- head(bigram_freq, 20)
kable(top20_bigrams,
row.names = FALSE,
caption = "Table 3: Top 20 Most Frequent Bigrams")| Bigram | Frequency |
|---|---|
| of the | 4317 |
| in the | 4162 |
| to the | 2098 |
| for the | 2014 |
| on the | 1923 |
| to be | 1612 |
| at the | 1434 |
| and the | 1270 |
| in a | 1201 |
| with the | 1057 |
| is a | 1008 |
| it was | 977 |
| i have | 903 |
| for a | 887 |
| from the | 834 |
| i was | 833 |
| going to | 818 |
| of a | 815 |
| and i | 813 |
| it is | 796 |
# Visualize top 20 bigrams
ggplot(top20_bigrams, aes(x = reorder(Bigram, Frequency), y = Frequency)) +
geom_bar(stat = "identity", fill = "coral") +
coord_flip() +
theme_minimal() +
labs(title = "Figure 5: Top 20 Most Frequent Bigrams",
subtitle = "Two-word sequences",
x = "Bigram",
y = "Frequency") +
theme(plot.title = element_text(face = "bold"))

# Create trigrams
trigrams <- create_ngrams(all_words, 3)
# Calculate trigram frequencies
trigram_freq <- as.data.frame(table(trigrams))
colnames(trigram_freq) <- c("Trigram", "Frequency")
trigram_freq <- trigram_freq %>%
arrange(desc(Frequency))
# Top 20 trigrams
top20_trigrams <- head(trigram_freq, 20)
kable(top20_trigrams,
row.names = FALSE,
caption = "Table 4: Top 20 Most Frequent Trigrams")| Trigram | Frequency |
|---|---|
| one of the | 348 |
| a lot of | 302 |
| thanks for the | 233 |
| to be a | 186 |
| going to be | 174 |
| out of the | 160 |
| i want to | 156 |
| as well as | 150 |
| it was a | 141 |
| part of the | 138 |
| the end of | 138 |
| be able to | 137 |
| some of the | 137 |
| the rest of | 123 |
| i have a | 121 |
| looking forward to | 118 |
| thank you for | 115 |
| there is a | 111 |
| i need to | 110 |
| is going to | 109 |
# Visualize top 20 trigrams
ggplot(top20_trigrams, aes(x = reorder(Trigram, Frequency), y = Frequency)) +
geom_bar(stat = "identity", fill = "mediumpurple") +
coord_flip() +
theme_minimal() +
labs(title = "Figure 6: Top 20 Most Frequent Trigrams",
subtitle = "Three-word sequences",
x = "Trigram",
y = "Frequency") +
theme(plot.title = element_text(face = "bold"))

# Words that appear only once (hapax legomena)
singleton_words <- word_freq %>% filter(Frequency == 1)
# Frequency of frequencies
freq_of_freq <- as.data.frame(table(word_freq$Frequency))
colnames(freq_of_freq) <- c("Occurrence", "Number_of_Words")
freq_of_freq$Occurrence <- as.numeric(as.character(freq_of_freq$Occurrence))
# Plot first 50 frequency levels
ggplot(head(freq_of_freq, 50),
aes(x = Occurrence, y = Number_of_Words)) +
geom_bar(stat = "identity", fill = "darkorange") +
scale_y_log10(labels = scales::comma) +
theme_minimal() +
labs(title = "Figure 7: Frequency of Frequencies",
subtitle = "How many words appear exactly N times (log scale)",
x = "Number of Occurrences",
y = "Number of Words (log scale)") +
theme(plot.title = element_text(face = "bold"))

cat("Words appearing only once:", nrow(singleton_words), "\n")
cat("Percentage of vocabulary:", round(nrow(singleton_words) / nrow(word_freq) * 100, 2), "%\n")

## Words appearing only once: 28328
## Percentage of vocabulary: 51.45 %
Observation: Over half of the vocabulary appears only once. This "long tail" is typical of natural language and will require special handling (for example, smoothing or an unknown-word token) in our model; a minimal sketch follows.
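One common way to handle the long tail, sketched below (the "<unk>" token and the appears-only-once threshold are assumptions, not settled choices), is to collapse hapax legomena into a single unknown-word token before counting n-grams:

```r
# Collapse words seen only once into a shared "<unk>" token (illustrative choice)
rare_words <- as.character(singleton_words$Word)
all_words_unk <- ifelse(all_words %in% rare_words, "<unk>", all_words)

cat("Vocabulary before:", length(unique(all_words)), "\n")
cat("Vocabulary after: ", length(unique(all_words_unk)), "\n")
```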
# Tokenize each source separately
blogs_words <- clean_and_tokenize(blogs_sample)
news_words <- clean_and_tokenize(news_sample)
twitter_words <- clean_and_tokenize(twitter_sample)
# Calculate unique word counts
unique_counts <- data.frame(
Source = c("Blogs", "News", "Twitter"),
Total_Words = c(length(blogs_words), length(news_words), length(twitter_words)),
Unique_Words = c(length(unique(blogs_words)),
length(unique(news_words)),
length(unique(twitter_words)))
)
unique_counts <- unique_counts %>%
mutate(Vocabulary_Richness = round(Unique_Words / Total_Words * 100, 2))
kable(unique_counts,
caption = "Table 5: Vocabulary Richness by Source",
format.args = list(big.mark = ","))

| Source | Total_Words | Unique_Words | Vocabulary_Richness |
|---|---|---|---|
| Blogs | 373,397 | 28,723 | 7.69 |
| News | 348,594 | 30,983 | 8.89 |
| Twitter | 301,780 | 25,638 | 8.50 |
ggplot(unique_counts, aes(x = Source, y = Vocabulary_Richness, fill = Source)) +
geom_bar(stat = "identity") +
theme_minimal() +
labs(title = "Figure 8: Vocabulary Richness by Source",
subtitle = "Percentage of unique words",
y = "Vocabulary Richness (%)",
x = "") +
theme(legend.position = "none",
plot.title = element_text(face = "bold"))

Our text prediction algorithm will use an N-gram model with backoff: look up the longest matching context first, then fall back to progressively shorter n-grams when no match is found. The development plan is summarized below, and a small illustrative sketch of the backoff step follows the table.
| Step | Task | Description |
|---|---|---|
| 1 | Data Sampling & Cleaning | Sample data, remove profanity, clean text |
| 2 | N-gram Generation (2-5 grams) | Create all n-grams from training corpus |
| 3 | Frequency Calculation & Pruning | Keep only n-grams above frequency threshold |
| 4 | Implement Backoff Algorithm | Implement fallback to shorter n-grams |
| 5 | Add Smoothing & Profanity Filter | Handle unseen combinations, filter bad words |
| 6 | Performance Testing & Optimization | Test accuracy and speed, optimize size |
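As a minimal sketch of the backoff lookup in step 4, assuming the bigram and trigram frequency tables computed earlier in this report (and omitting smoothing, pruning, and the profanity filter), prediction could look like this:

```r
# Minimal backoff sketch: try trigrams first, then bigrams, then the most
# frequent unigrams. Assumes word_freq, bigram_freq and trigram_freq exist
# as computed earlier in this report.
predict_next_word <- function(phrase, n = 3) {
  words <- clean_and_tokenize(phrase)

  # Trigrams: match on the last two words of the phrase
  if (length(words) >= 2) {
    context <- paste(tail(words, 2), collapse = " ")
    hits <- trigram_freq[startsWith(as.character(trigram_freq$Trigram),
                                    paste0(context, " ")), ]
    if (nrow(hits) > 0) {
      return(head(sub(".* ", "", as.character(hits$Trigram)), n))
    }
  }

  # Back off to bigrams: match on the last word only
  if (length(words) >= 1) {
    context <- tail(words, 1)
    hits <- bigram_freq[startsWith(as.character(bigram_freq$Bigram),
                                   paste0(context, " ")), ]
    if (nrow(hits) > 0) {
      return(head(sub(".* ", "", as.character(hits$Bigram)), n))
    }
  }

  # Final fallback: the most frequent unigrams overall
  head(as.character(word_freq$Word), n)
}

predict_next_word("thanks for")   # likely "the" (cf. "thanks for the" in Table 4)
```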
The final application will provide a simple interface where the user types a phrase and the model returns the most likely next word(s), updating the suggestions as the user continues to type.
Challenge 1: Model Size - Solution: Prune rare n-grams, use efficient data structures (hash tables)
Challenge 2: Speed - Solution: Pre-compute lookup tables, implement caching
Challenge 3: Accuracy - Solution: Use larger n-grams (4-5 words) where possible, validate against test set
Challenge 4: Unknown Words - Solution: Implement backoff to shorter n-grams, use word similarity for suggestions
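To illustrate the pre-computed lookup idea from Challenge 2, here is a sketch that stores the top next-word candidates for each two-word context in an R environment (environments are hash-backed); the "top 3" cutoff and the object names are our assumptions, not settled design choices.

```r
# Pre-compute the top next-word candidates for every two-word context and keep
# them in a hash-backed environment for fast retrieval at prediction time.
context_of   <- sub(" [^ ]+$", "", as.character(trigram_freq$Trigram))  # first two words
next_word_of <- sub(".* ", "", as.character(trigram_freq$Trigram))      # last word

# trigram_freq is sorted by frequency, so each group is already in best-first order
by_context <- split(next_word_of, context_of)
lookup <- list2env(lapply(by_context, head, 3), envir = new.env(hash = TRUE))

# Constant-time retrieval at prediction time
get0("thanks for", envir = lookup, ifnotfound = character(0))
```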
Our exploratory analysis has revealed:
✓ Successful data loading of 4+ million lines across three text sources
✓ Clear patterns in word frequency following natural language distributions
✓ Efficient sampling strategy enables rapid analysis of massive datasets
✓ N-gram patterns show promise for building accurate predictions
✓ Feasible approach identified using backoff models and smoothing
The data is well understood, a cleaning pipeline is in place, and the corpus is ready for model building. Our planned n-gram approach with backoff is well suited to this prediction task and should deliver a responsive, accurate text prediction application.
Report generated on 2025-11-25 for Coursera Data Science Capstone Project