Executive Summary

This report presents an exploratory data analysis of three English-language text corpora (blogs, news, and Twitter) to inform the development of a word prediction algorithm. The analysis examines file characteristics, vocabulary distributions, n-gram patterns, and other text properties essential for building an effective predictive text application.

Most Significant Findings:

  • Word frequencies follow a heavy-tailed (Zipfian) distribution: a small number of words accounts for most of the text.
  • N-gram analysis reveals common phrase structures across the corpora that are directly useful for prediction.
  • The three text sources exhibit distinct linguistic characteristics, most visibly in line length and formality.

1. Environment Setup

# Load required libraries
library(tidyverse)
library(stringi)
library(knitr)
library(gridExtra)

# Set seed for reproducibility
set.seed(123)

2. Data Loading and Basic Statistics

files <- c(
  blogs = "en_US.blogs.txt",
  news = "en_US.news.txt",
  twitter = "en_US.twitter.txt"
)

# Function to get file statistics
get_file_stats <- function(filepath) {
  # Check if file exists
  if (!file.exists(filepath)) {
    return(list(
      size_mb = NA,
      lines = NA,
      words = NA,
      chars = NA,
      lines_sample = character(0)
    ))
  }
  
  # Get file size
  size_mb <- file.info(filepath)$size / 1024^2
  
  # Read file
  con <- file(filepath, "r")
  lines <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
  close(con)
  
  # Calculate statistics
  n_lines <- length(lines)
  n_words <- sum(stri_count_words(lines))
  n_chars <- sum(stri_length(lines))
  
  list(
    size_mb = size_mb,
    lines = n_lines,
    words = n_words,
    chars = n_chars,
    lines_sample = lines
  )
}

# Collect statistics for all files
stats_list <- lapply(files, get_file_stats)

# Create summary table
summary_table <- data.frame(
  Source = names(files),
  Size_MB = sapply(stats_list, function(x) round(x$size_mb, 2)),
  Total_Lines = sapply(stats_list, function(x) format(x$lines, big.mark = ",")),
  Total_Words = sapply(stats_list, function(x) format(x$words, big.mark = ",")),
  Total_Characters = sapply(stats_list, function(x) format(x$chars, big.mark = ",")),
  Avg_Words_Per_Line = sapply(stats_list, function(x) 
    round(x$words / x$lines, 1))
)

kable(summary_table, caption = "Table 1: Corpus File Statistics",
      row.names = FALSE,
      align = c('l', 'r', 'r', 'r', 'r', 'r'))
Table 1: Corpus File Statistics
Source Size_MB Total_Lines Total_Words Total_Characters Avg_Words_Per_Line
blogs 200.42 899,288 37,546,250 206,824,505 41.8
news 196.28 1,010,242 34,762,395 203,223,159 34.4
twitter 159.36 2,360,148 30,093,413 162,096,241 12.8

Notable Observations

The table above summarizes the fundamental characteristics of each corpus. Twitter entries average far fewer words per line (12.8) than blogs (41.8) or news (34.4), reflecting the platform's character limit and more conversational style.

3. Sampling Strategy

For efficient analysis and model development, we’ll work with representative samples from each corpus.

# Function to sample lines from corpus
sample_corpus <- function(lines, sample_rate = 0.01) {
  if (length(lines) == 0) return(character(0))
  n_sample <- max(1, floor(length(lines) * sample_rate))
  sample(lines, n_sample)
}

# Create samples (1% of each corpus for demonstration)
samples <- lapply(stats_list, function(x) sample_corpus(x$lines_sample, 0.01))

# Combine all samples
all_samples <- unlist(samples)

cat(sprintf("Total sampled lines: %s\n", format(length(all_samples), big.mark = ",")))
## Total sampled lines: 42,695
cat(sprintf("Sample from blogs: %s lines\n", format(length(samples$blogs), big.mark = ",")))
## Sample from blogs: 8,992 lines
cat(sprintf("Sample from news: %s lines\n", format(length(samples$news), big.mark = ",")))
## Sample from news: 10,102 lines
cat(sprintf("Sample from twitter: %s lines\n", format(length(samples$twitter), big.mark = ",")))
## Sample from twitter: 23,601 lines

4. Text Preprocessing

# Function to clean text using regular expressions
clean_text <- function(text) {
  text %>%
    # Convert to lowercase
    tolower() %>%
    # Remove URLs
    str_replace_all("http\\S+|www\\S+", "") %>%
    # Remove email addresses
    str_replace_all("\\S+@\\S+", "") %>%
    # Keep only letters, apostrophes, and spaces
    str_replace_all("[^a-z' ]", " ") %>%
    # Remove extra spaces
    str_replace_all("\\s+", " ") %>%
    str_trim()
}

# Clean the samples
clean_samples <- lapply(samples, clean_text)

# Example: show before and after cleaning
cat("Original text sample:\n")
## Original text sample:
cat(samples$twitter[1], "\n\n")
## you guys have that also?? It runs our campus for a week
cat("Cleaned text sample:\n")
## Cleaned text sample:
cat(clean_samples$twitter[1], "\n")
## you guys have that also it runs our campus for a week

5. Word Frequency Analysis

# Function to get word frequencies
get_word_freq <- function(text_vector) {
  words <- unlist(strsplit(text_vector, "\\s+"))
  words <- words[words != ""]
  
  word_freq <- table(words)
  data.frame(
    word = names(word_freq),
    frequency = as.numeric(word_freq),
    stringsAsFactors = FALSE
  ) %>%
    arrange(desc(frequency))
}

# Get word frequencies for each corpus
word_freqs <- lapply(clean_samples, get_word_freq)

# Top 20 words by corpus
top_words_comparison <- bind_rows(
  word_freqs$blogs %>% head(20) %>% mutate(source = "Blogs"),
  word_freqs$news %>% head(20) %>% mutate(source = "News"),
  word_freqs$twitter %>% head(20) %>% mutate(source = "Twitter")
)

# Plot top words by source
ggplot(top_words_comparison, aes(x = reorder(word, frequency), y = frequency, fill = source)) +
  geom_col() +
  coord_flip() +
  facet_wrap(~source, scales = "free_y", ncol = 3) +
  labs(title = "Figure 1: Top 20 Most Frequent Words by Corpus",
       x = "Word", y = "Frequency") +
  theme_minimal() +
  theme(legend.position = "none")

Vocabulary Size Analysis

# Calculate vocabulary statistics
vocab_stats <- data.frame(
  Source = c("Blogs", "News", "Twitter", "Combined"),
  Unique_Words = c(
    nrow(word_freqs$blogs),
    nrow(word_freqs$news),
    nrow(word_freqs$twitter),
    length(unique(unlist(lapply(clean_samples, function(x) 
      unlist(strsplit(x, "\\s+"))))))
  ),
  Total_Words = c(
    sum(word_freqs$blogs$frequency),
    sum(word_freqs$news$frequency),
    sum(word_freqs$twitter$frequency),
    sum(sapply(word_freqs, function(x) sum(x$frequency)))
  )
)

vocab_stats$Type_Token_Ratio <- round(
  vocab_stats$Unique_Words / vocab_stats$Total_Words, 4
)

kable(vocab_stats, caption = "Table 2: Vocabulary Statistics",
      format.args = list(big.mark = ","))
Table 2: Vocabulary Statistics
Source Unique_Words Total_Words Type_Token_Ratio
Blogs 27,713 374,449 0.0740
News 28,866 340,639 0.0847
Twitter 24,383 297,318 0.0820
Combined 51,534 1,012,406 0.0509

6. Coverage Analysis

Understanding vocabulary coverage is crucial for prediction algorithm efficiency.

# Function to calculate coverage
calculate_coverage <- function(word_freq_df) {
  word_freq_df <- word_freq_df %>% arrange(desc(frequency))
  word_freq_df$cumulative_freq <- cumsum(word_freq_df$frequency)
  word_freq_df$coverage <- word_freq_df$cumulative_freq / sum(word_freq_df$frequency)
  word_freq_df$rank <- 1:nrow(word_freq_df)
  word_freq_df
}

# Calculate coverage for combined corpus
combined_words <- bind_rows(word_freqs) %>%
  group_by(word) %>%
  summarise(frequency = sum(frequency), .groups = "drop")

coverage_data <- calculate_coverage(combined_words)

# Find words needed for different coverage levels
coverage_levels <- c(0.5, 0.75, 0.9, 0.95)
coverage_summary <- data.frame(
  Coverage = paste0(coverage_levels * 100, "%"),
  Words_Needed = sapply(coverage_levels, function(level) {
    min(which(coverage_data$coverage >= level))
  })
)

kable(coverage_summary, caption = "Table 3: Words Required for Coverage Levels")
Table 3: Words Required for Coverage Levels
Coverage Words_Needed
50% 141
75% 1433
90% 6928
95% 15163
# Plot coverage curve
ggplot(coverage_data %>% filter(rank <= 5000), 
       aes(x = rank, y = coverage * 100)) +
  geom_line(color = "#2C3E50", size = 1.2) +
  geom_hline(yintercept = c(50, 75, 90, 95), 
             linetype = "dashed", color = "red", alpha = 0.5) +
  labs(title = "Figure 2: Cumulative Word Coverage",
       subtitle = "Percentage of total words covered by top N unique words",
       x = "Number of Unique Words (Ranked by Frequency)",
       y = "Coverage (%)") +
  theme_minimal() +
  scale_y_continuous(breaks = seq(0, 100, 10))

Insights

Approximately 141 unique words account for 50% of all word occurrences in the sample, while 15,163 words are needed to reach 95% coverage. This steep curve reflects the Zipfian distribution typical of natural language.
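
As a quick visual check of this Zipfian pattern, the rank and frequency columns already computed in coverage_data can be plotted on log-log axes; an approximately straight line is the signature of a power-law relationship. The snippet below is a minimal sketch using the objects built above (it is not one of the numbered figures).

# Sketch: rank-frequency plot on log-log axes as a Zipf check
ggplot(coverage_data, aes(x = rank, y = frequency)) +
  geom_point(alpha = 0.3, color = "#2C3E50") +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Word Rank vs. Frequency (log-log scale)",
       x = "Rank", y = "Frequency") +
  theme_minimal()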

7. N-gram Analysis

N-grams (sequences of n words) are fundamental to word prediction algorithms.

# Function to generate n-grams
generate_ngrams <- function(text_vector, n = 2) {
  # Build the n-grams for each line, then flatten
  # (avoids growing a vector inside nested loops)
  ngrams <- unlist(lapply(text_vector, function(line) {
    words <- unlist(strsplit(line, "\\s+"))
    words <- words[words != ""]
    
    if (length(words) < n) return(character(0))
    
    sapply(seq_len(length(words) - n + 1), function(i)
      paste(words[i:(i + n - 1)], collapse = " "))
  }))
  
  ngram_freq <- table(ngrams)
  data.frame(
    ngram = names(ngram_freq),
    frequency = as.numeric(ngram_freq),
    stringsAsFactors = FALSE
  ) %>%
    arrange(desc(frequency))
}

# Generate bigrams and trigrams (using smaller sample for speed)
sample_for_ngrams <- sample(all_samples, min(length(all_samples), 1000))
clean_for_ngrams <- clean_text(sample_for_ngrams)

bigrams <- generate_ngrams(clean_for_ngrams, 2)
trigrams <- generate_ngrams(clean_for_ngrams, 3)

# Display top n-grams
cat("Top 10 Bigrams:\n")
## Top 10 Bigrams:
kable(head(bigrams, 10), caption = "Table 4: Most Frequent Bigrams")
Table 4: Most Frequent Bigrams
ngram frequency
in the 91
of the 85
to the 60
on the 54
for the 37
at the 35
to be 34
and the 28
but i 27
is a 26
cat("\n\nTop 10 Trigrams:\n")
## 
## 
## Top 10 Trigrams:
kable(head(trigrams, 10), caption = "Table 5: Most Frequent Trigrams")
Table 5: Most Frequent Trigrams
ngram frequency
one of the 10
w sunset blvd 10
the u s 7
is one of 6
a bit of 5
a lot of 5
i have to 5
thanks for the 5
there is a 5
a good thing 4
# Visualize top bigrams
top_bigrams <- head(bigrams, 15)

ggplot(top_bigrams, aes(x = reorder(ngram, frequency), y = frequency)) +
  geom_col(fill = "#3498DB") +
  coord_flip() +
  labs(title = "Figure 3: Top 15 Most Frequent Bigrams",
       x = "Bigram", y = "Frequency") +
  theme_minimal()
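
To connect these n-gram counts to prediction, the bigram table can already be queried for the most likely continuations of a given word. The helper below is a minimal sketch assuming the bigrams data frame from this section; suggest_next is an illustrative name, not part of the analysis code above.

# Sketch: top-k continuations of a word, read straight off the bigram table
suggest_next <- function(prev_word, bigram_df, k = 3) {
  bigram_df %>%
    filter(str_starts(ngram, fixed(paste0(prev_word, " ")))) %>%
    mutate(next_word = word(ngram, 2)) %>%  # second token of each matching bigram
    slice_head(n = k) %>%
    pull(next_word)
}

# Example: given the counts in Table 4, suggest_next("to", bigrams)
# should rank "the" and "be" first.
suggest_next("to", bigrams)

Because bigrams is already sorted by descending frequency, slice_head() returns the most frequent continuations; Section 8 extends this idea to a backoff across trigrams, bigrams, and unigrams.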

8. Application Design Recommendations

Based on this exploratory analysis, here are some recommendations for building a word prediction application:

Algorithm Strategy

1. N-gram Model with Backoff

  • Implement a hierarchical model using trigrams, bigrams, and unigrams
  • Use higher-order n-grams when available, backing off to lower orders when necessary (a minimal backoff sketch appears after this list)
  • Apply smoothing techniques (e.g., Kneser-Ney) to handle unseen combinations

2. Vocabulary Management

  • Focus on the most frequent ~5,000-10,000 words for efficient prediction
  • These words cover 90%+ of typical usage
  • Implement an “unknown word” category for rare terms

3. Corpus-Specific Models

Consider context-aware predictions:

  • Different models for formal (news) vs. informal (Twitter/blogs) contexts
  • User-selectable or automatic context detection

4. Performance Optimization

  • Pre-compute n-gram probabilities offline
  • Use efficient data structures (hash tables, tries) for fast lookup
  • Implement caching for recently predicted sequences
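
To make the backoff strategy from item 1 concrete, the sketch below chains the trigram, bigram, and unigram tables built in Sections 6 and 7. predict_next is a hypothetical helper for illustration only: it ranks candidates by raw counts rather than smoothed probabilities, and a deployed model would pre-compute these lookups into keyed data structures (item 4) instead of filtering data frames at query time.

# Sketch: "stupid backoff"-style next-word lookup over the n-gram tables above
predict_next <- function(prev_words, trigram_df, bigram_df, unigram_df, k = 3) {
  prev_words <- tolower(prev_words)
  n <- length(prev_words)
  
  # 1. Try trigrams: match the last two context words
  if (n >= 2) {
    prefix <- paste(prev_words[(n - 1):n], collapse = " ")
    hits <- trigram_df %>%
      filter(str_starts(ngram, fixed(paste0(prefix, " ")))) %>%
      mutate(prediction = word(ngram, 3))
    if (nrow(hits) > 0) return(head(hits$prediction, k))
  }
  
  # 2. Back off to bigrams: match the last context word only
  hits <- bigram_df %>%
    filter(str_starts(ngram, fixed(paste0(prev_words[n], " ")))) %>%
    mutate(prediction = word(ngram, 2))
  if (nrow(hits) > 0) return(head(hits$prediction, k))
  
  # 3. Fall back to the most frequent unigrams overall
  unigram_df %>%
    arrange(desc(frequency)) %>%
    slice_head(n = k) %>%
    pull(word)
}

# Example usage with the objects created in Sections 6 and 7:
predict_next(c("one", "of"), trigrams, bigrams, combined_words)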

Application Features

Core Functionality:

  • Real-time word suggestions as user types
  • Display top 3-5 predictions ranked by probability
  • Update predictions with each keystroke

9. Conclusions

This exploratory analysis reveals several key insights for word prediction:

  1. Language follows predictable patterns: Common words and phrases appear with high frequency across all corpora

  2. Efficient coverage: A relatively small vocabulary (~10,000 words) covers the vast majority of everyday usage

  3. Context matters: Different text sources show distinct linguistic characteristics that could improve prediction accuracy