Executive Summary

This report presents an exploratory analysis of text data from three sources (blogs, news articles, and Twitter) that will be used to build a predictive text algorithm. The analysis reveals key characteristics of the dataset and outlines our strategy for developing a Shiny application that can predict the next word as users type.

Key Findings:

  • The dataset contains over 4 million lines of text across three sources
  • Twitter data has the shortest messages, while blogs contain the longest
  • A relatively small vocabulary covers the majority of word usage
  • Clear patterns in word combinations provide a foundation for the prediction algorithm


Data Loading and Basic Statistics

# Load required packages
library(stringi)       # stri_count_words() for fast word counts
library(knitr)         # kable() for formatted tables
library(dplyr)         # data manipulation and the pipe operator
library(reshape2)      # melt() for reshaping plot data
library(ggplot2)       # plots
library(tm)            # corpus creation and text preprocessing
library(wordcloud)     # word cloud visualization
library(RColorBrewer)  # color palettes (brewer.pal)

# Set data path
data_path <- "C:/capstone/rawData/final/en_US"

# Function to safely read large text files
safe_read_lines <- function(file_path, encoding = "UTF-8") {
  tryCatch({
    readLines(file_path, encoding = encoding, warn = FALSE)
  }, error = function(e) {
    message(paste("Error reading file:", file_path))
    return(character(0))
  })
}

# Read the three main text files
blogs <- safe_read_lines(file.path(data_path, "en_US.blogs.txt"))
news <- safe_read_lines(file.path(data_path, "en_US.news.txt"))
twitter <- safe_read_lines(file.path(data_path, "en_US.twitter.txt"))

File Summary Statistics

# Calculate basic file statistics
file_stats <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Characters = c(sum(nchar(blogs)), sum(nchar(news)), sum(nchar(twitter))),
  Words = c(sum(stri_count_words(blogs)), sum(stri_count_words(news)), sum(stri_count_words(twitter))),
  stringsAsFactors = FALSE
)

# Add file sizes in MB
file_sizes <- c(
  file.info(file.path(data_path, "en_US.blogs.txt"))$size,
  file.info(file.path(data_path, "en_US.news.txt"))$size,
  file.info(file.path(data_path, "en_US.twitter.txt"))$size
) / (1024^2)

file_stats$Size_MB <- round(file_sizes, 2)
file_stats$Avg_Words_Per_Line <- round(file_stats$Words / file_stats$Lines, 2)

# Display formatted table
kable(file_stats, format.args = list(big.mark = ","), 
      caption = "Summary Statistics for Text Data Sources")
Summary Statistics for Text Data Sources

Source       Lines   Characters        Words   Size_MB   Avg_Words_Per_Line
Blogs      899,288  206,824,505   37,546,806    200.42                41.75
News     1,010,206  203,214,543   34,761,151    196.28                34.41
Twitter  2,360,148  162,096,031   30,096,649    159.36                12.75

The dataset consists of three distinct text sources with different characteristics:

  • Blogs: 899,288 entries with longer, more formal content
  • News: 1,010,206 articles with structured journalism writing
  • Twitter: 2,360,148 short messages reflecting casual communication
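
For reference, the overall totals quoted in the Key Insights section later in this report can be derived directly from the file_stats table; a short sketch:

# Overall totals across the three sources
total_lines <- sum(file_stats$Lines)   # 4,269,642 lines
total_words <- sum(file_stats$Words)   # 102,404,606 words
format(c(Lines = total_lines, Words = total_words), big.mark = ",")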

Data Visualization

File Statistics Comparison

# Create visualization of basic stats
stats_viz <- file_stats %>%
  select(Source, Lines, Words, Characters) %>%
  melt(id.vars = "Source") %>%
  ggplot(aes(x = Source, y = value, fill = Source)) +
  geom_bar(stat = "identity") +
  facet_wrap(~variable, scales = "free_y") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  labs(title = "Comparison of Data Sources",
       subtitle = "Lines, Words, and Characters by Source Type",
       x = "Data Source", y = "Count") +
  scale_fill_brewer(type = "qual", palette = "Set2")

print(stats_viz)

Average Message Length

# Create histogram of average words per line
length_plot <- ggplot(file_stats, aes(x = Source, y = Avg_Words_Per_Line, fill = Source)) +
  geom_col() +
  theme_minimal() +
  theme(legend.position = "none") +
  labs(title = "Average Words per Message/Article",
       subtitle = "Twitter messages are significantly shorter than blogs and news",
       x = "Source", y = "Average Words per Line") +
  scale_fill_brewer(type = "qual", palette = "Set2") +
  geom_text(aes(label = Avg_Words_Per_Line), vjust = -0.5)

print(length_plot)


Word Frequency Analysis

# Sample data for analysis (managing memory)
set.seed(123)
sample_size <- 10000

# Create samples from each source
blogs_sample <- sample(blogs, min(sample_size, length(blogs)))
news_sample <- sample(news, min(sample_size, length(news)))
twitter_sample <- sample(twitter, min(sample_size, length(twitter)))
combined_sample <- c(blogs_sample, news_sample, twitter_sample)

# Text preprocessing
preprocess_text <- function(text_vector) {
  corpus <- Corpus(VectorSource(text_vector))
  corpus <- tm_map(corpus, content_transformer(tolower))
  corpus <- tm_map(corpus, removePunctuation)
  corpus <- tm_map(corpus, removeNumbers)
  corpus <- tm_map(corpus, stripWhitespace)
  corpus <- tm_map(corpus, removeWords, stopwords("english"))
  return(corpus)
}

# Process the sample
clean_corpus <- preprocess_text(combined_sample)
dtm <- DocumentTermMatrix(clean_corpus)
term_freq <- sort(colSums(as.matrix(dtm)), decreasing = TRUE)

Most Frequent Words

# Top 20 most frequent words
top_words <- head(term_freq, 20)
word_freq_df <- data.frame(
  word = names(top_words),
  freq = as.numeric(top_words),
  stringsAsFactors = FALSE
)

word_plot <- ggplot(word_freq_df, aes(x = reorder(word, freq), y = freq)) +
  geom_bar(stat = "identity", fill = "steelblue", alpha = 0.8) +
  coord_flip() +
  theme_minimal() +
  labs(title = "Top 20 Most Frequent Words",
       subtitle = "After removing common stop words",
       x = "Words", y = "Frequency") +
  theme(axis.text.y = element_text(size = 10))

print(word_plot)

Word Cloud Visualization

# Create word cloud
wordcloud(names(term_freq), term_freq, max.words = 100, 
          random.order = FALSE, colors = brewer.pal(8, "Dark2"),
          scale = c(3, 0.5))


N-gram Analysis

# Function to create n-grams using base R
create_ngrams_base <- function(text_vector, n, max_lines = 5000) {
  text_subset <- text_vector[1:min(length(text_vector), max_lines)]
  
  # Clean text: lowercase, then keep only letters and whitespace
  clean_text <- tolower(text_subset)
  clean_text <- gsub("[^a-z[:space:]]", "", clean_text)
  clean_text <- gsub("\\s+", " ", clean_text)
  clean_text <- trimws(clean_text)
  
  # Split into words and remove stop words
  all_words <- unlist(strsplit(clean_text, "\\s+"))
  all_words <- all_words[nchar(all_words) > 0]
  
  stop_words <- c("the", "a", "an", "and", "or", "but", "in", "on", "at", "to", 
                  "for", "of", "with", "by", "is", "are", "was", "were", "be")
  all_words <- all_words[!all_words %in% stop_words]
  
  # Create n-grams
  if (length(all_words) < n) return(character(0))
  
  # Paste each sliding window of n consecutive words into an n-gram
  starts <- seq_len(length(all_words) - n + 1)
  ngrams <- vapply(starts, function(i) {
    paste(all_words[i:(i + n - 1)], collapse = " ")
  }, character(1))
  
  ngram_freq <- table(ngrams)
  return(sort(ngram_freq, decreasing = TRUE))
}

# Generate bigrams and trigrams
bigrams <- create_ngrams_base(combined_sample, 2)
trigrams <- create_ngrams_base(combined_sample, 3)

Two-Word Combinations (Bigrams)

if(length(bigrams) > 0) {
  top_bigrams <- head(bigrams, 15)
  bigram_df <- data.frame(
    bigram = names(top_bigrams),
    freq = as.numeric(top_bigrams),
    stringsAsFactors = FALSE
  )
  
  bigram_plot <- ggplot(bigram_df, aes(x = reorder(bigram, freq), y = freq)) +
    geom_bar(stat = "identity", fill = "darkgreen", alpha = 0.8) +
    coord_flip() +
    theme_minimal() +
    labs(title = "Top 15 Two-Word Combinations",
         subtitle = "Most common word pairs in the dataset",
         x = "Word Pairs", y = "Frequency") +
    theme(axis.text.y = element_text(size = 9))
  
  print(bigram_plot)
}

Three-Word Combinations (Trigrams)

if(length(trigrams) > 0) {
  top_trigrams <- head(trigrams, 15)
  trigram_df <- data.frame(
    trigram = names(top_trigrams),
    freq = as.numeric(top_trigrams),
    stringsAsFactors = FALSE
  )
  
  trigram_plot <- ggplot(trigram_df, aes(x = reorder(trigram, freq), y = freq)) +
    geom_bar(stat = "identity", fill = "darkred", alpha = 0.8) +
    coord_flip() +
    theme_minimal() +
    labs(title = "Top 15 Three-Word Combinations",
         subtitle = "Most common three-word phrases in the dataset",
         x = "Word Combinations", y = "Frequency") +
    theme(axis.text.y = element_text(size = 9))
  
  print(trigram_plot)
}


Vocabulary Coverage Analysis

# Calculate vocabulary coverage
total_words <- sum(term_freq)
cumulative_coverage <- cumsum(term_freq) / total_words

# Key coverage milestones
# unname() prevents word names from becoming row labels in the table below
coverage_50 <- unname(which(cumulative_coverage >= 0.5)[1])
coverage_90 <- unname(which(cumulative_coverage >= 0.9)[1])

coverage_stats <- data.frame(
  Coverage = c("50%", "90%"),
  Words_Needed = c(coverage_50, coverage_90),
  Percentage_of_Vocabulary = c(
    round(coverage_50 / length(term_freq) * 100, 2),
    round(coverage_90 / length(term_freq) * 100, 2)
  )
)

kable(coverage_stats, 
      caption = "Vocabulary Coverage Analysis: How many words are needed to cover X% of all text?")
Vocabulary Coverage Analysis: How many words are needed to cover X% of all text?

Coverage   Words_Needed   Percentage_of_Vocabulary
50%               1,077                       1.98
90%              16,079                      29.56

# Coverage visualization
coverage_df <- data.frame(
  rank = 1:min(1000, length(cumulative_coverage)),
  coverage = cumulative_coverage[1:min(1000, length(cumulative_coverage))]
)

coverage_plot <- ggplot(coverage_df, aes(x = rank, y = coverage)) +
  geom_line(color = "blue", size = 1.2) +
  geom_hline(yintercept = 0.5, linetype = "dashed", color = "red", alpha = 0.7) +
  geom_hline(yintercept = 0.9, linetype = "dashed", color = "red", alpha = 0.7) +
  theme_minimal() +
  labs(title = "Vocabulary Efficiency",
       subtitle = "A small number of common words covers most of the text",
       x = "Number of Most Frequent Words", 
       y = "Percentage of Text Covered") +
  scale_y_continuous(labels = scales::percent) +
  annotate("text", x = 250, y = 0.5, label = "50% Coverage", vjust = -0.5, color = "red") +
  annotate("text", x = 250, y = 0.9, label = "90% Coverage", vjust = -0.5, color = "red")

print(coverage_plot)


Key Insights and Findings

Data Characteristics

  1. Volume: Our dataset contains 4,269,642 total lines with 102,404,606 words across all sources.

  2. Diversity: The three data sources show distinct writing styles:

    • Twitter: Short, informal messages averaging 12.75 words
    • News: Structured articles averaging 34.41 words
    • Blogs: Longer content averaging 41.75 words
  3. Vocabulary Efficiency: Only 1,077 words (about 2% of the sampled vocabulary) cover 50% of all text usage; the short sketch below shows how this translates into a pruned prediction dictionary.
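
As a minimal sketch, the objects from the coverage analysis (term_freq, coverage_90) can be reused to build such a dictionary; the counts in the comments are the sampled values reported above.

# Keep only the most frequent terms needed to reach 90% coverage
dictionary_90 <- names(term_freq)[seq_len(coverage_90)]

length(dictionary_90)  # about 16,079 terms in the sampled data
head(dictionary_90)    # highest-frequency words come first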

Prediction Algorithm Implications

  • Memory Optimization: The vocabulary coverage analysis shows we can achieve good prediction accuracy with a relatively small dictionary
  • Context Patterns: Clear bigram and trigram patterns provide a strong foundation for next-word prediction
  • Source Diversity: Multiple text sources ensure the algorithm works across different writing styles

Next Steps: Prediction Algorithm and Shiny App

Algorithm Strategy

Approach: We will implement a Katz Back-off Model with Good-Turing smoothing for robust prediction.

N-gram Implementation:

  • Build 4-gram, 3-gram, 2-gram, and 1-gram models
  • Use a back-off strategy: try the 4-gram prediction first, then fall back to shorter n-grams as needed (a minimal sketch of this lookup follows)
  • Apply smoothing techniques to handle unseen word combinations
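
The sketch below illustrates the back-off lookup order only; it is not the full Katz/Good-Turing estimator, and it reuses the bigrams and trigrams frequency tables built earlier (a 4-gram table would be tried first in the same way). The function name and structure are assumptions for illustration.

# Simplified back-off lookup (illustrative only)
predict_backoff <- function(context_words, trigrams, bigrams, top_n = 3) {
  # Longest context first (last two words), then back off to one word
  contexts <- c(paste(tail(context_words, 2), collapse = " "),
                tail(context_words, 1))
  tables <- list(trigrams, bigrams)

  for (k in seq_along(tables)) {
    pattern <- paste0("^", contexts[k], " ")
    hits <- tables[[k]][grepl(pattern, names(tables[[k]]))]
    if (length(hits) > 0) {
      hits <- sort(hits, decreasing = TRUE)
      # The predicted word is the last token of each matching n-gram
      candidates <- sub(".* ", "", names(hits))
      return(head(unique(candidates), top_n))
    }
  }
  character(0)  # no context matched at any level
}

# Example call (output depends on the sampled data):
# predict_backoff(c("thanks", "for"), trigrams, bigrams)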

Shiny Application Features

User Interface:

  • Clean, intuitive text input box
  • Real-time prediction as the user types
  • Display of the top 3 word suggestions with confidence indicators
  • Mobile-friendly responsive design
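
To make the interface concrete, the following is a minimal Shiny sketch; it assumes the hypothetical predict_backoff helper above and the n-gram tables from this analysis are available in the app environment. The production app would add confidence indicators, styling, and a mobile-friendly layout.

# Minimal Shiny sketch of the planned interface (illustrative only)
library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("user_text", "Type your sentence:", width = "100%"),
  h4("Suggestions"),
  textOutput("predictions")
)

server <- function(input, output) {
  output$predictions <- renderText({
    # Basic cleaning to roughly mirror the preprocessing of the n-gram tables
    words <- strsplit(gsub("[^a-z ]", " ", tolower(input$user_text)), "\\s+")[[1]]
    words <- words[nchar(words) > 0]
    if (length(words) == 0) return("Start typing to see suggestions...")
    suggestions <- predict_backoff(words, trigrams, bigrams)
    if (length(suggestions) == 0) "No suggestion yet" else paste(suggestions, collapse = " | ")
  })
}

# shinyApp(ui, server)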

Performance Targets:

  • Speed: <100 ms response time for predictions
  • Accuracy: >15% top-1 accuracy, >40% top-3 accuracy
  • Size: <50 MB total app size for web deployment
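
The speed target can be checked during development with a rough timing sketch like the one below (again assuming the hypothetical predict_backoff helper and in-memory tables; real latency will depend on the final data structures).

# Rough latency check against the <100 ms target (illustrative only)
timing <- system.time(
  for (i in 1:100) predict_backoff(c("thanks", "for"), trigrams, bigrams)
)
timing["elapsed"] / 100 * 1000   # approximate milliseconds per prediction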

Technical Implementation

Data Structures:

  • Compressed hash tables for fast n-gram lookup
  • Sparse matrices to minimize memory usage
  • Efficient caching of frequent predictions
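
As a small sketch of the hash-table idea, an R environment created with hash = TRUE gives near constant-time lookup by prefix; the keys and values below are hypothetical placeholders, not model output.

# R environment used as a hash table for prefix -> candidate lookups
ngram_hash <- new.env(hash = TRUE)
assign("thanks for", c("the", "your", "all"), envir = ngram_hash)

# Constant-time retrieval; returns character(0) if the prefix is unseen
get0("thanks for", envir = ngram_hash, ifnotfound = character(0))
get0("unseen prefix", envir = ngram_hash, ifnotfound = character(0))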

Optimization Strategies:

  • Prune rare n-grams to balance accuracy vs. memory
  • Implement lazy loading of prediction models
  • Use data.table for high-performance operations
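
A short sketch of the pruning and data.table points, built from the bigrams table computed earlier (the count cutoff is a hypothetical tuning parameter):

# Prune rare bigrams and store the survivors in a data.table for fast access
library(data.table)

bigram_dt <- data.table(ngram = names(bigrams), count = as.integer(bigrams))
min_count <- 2L                              # hypothetical pruning threshold
bigram_dt <- bigram_dt[count >= min_count]   # drop singletons to save memory
setorder(bigram_dt, -count)                  # most frequent first

nrow(bigram_dt)                              # size of the pruned model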


Conclusion

This exploratory analysis demonstrates successful data loading and reveals key characteristics that will guide our prediction algorithm development. The dataset’s vocabulary efficiency and clear n-gram patterns provide a strong foundation for building an accurate and fast text prediction application.

The next phase will focus on implementing the Katz back-off model and creating an intuitive Shiny interface that delivers real-time predictions to users.

## Analysis completed on: 2025-09-11 17:17:36.366236
## R version: R version 4.5.0 (2025-04-11 ucrt)