This report summarizes our exploratory analysis of text data from three sources (blogs, news articles, and Twitter) and outlines our approach to building a next-word prediction application. The ultimate goal is a user-friendly Shiny app that uses this text data to suggest the next word as a user types, improving the text input experience. Our analysis shows that a relatively small subset of words accounts for the large majority of all word occurrences, which should allow us to build an efficient and effective prediction algorithm.
We successfully loaded and processed three text datasets:
# Load the packages used throughout (readr, dplyr, tidytext, ggplot2) and read in the data
library(readr); library(dplyr); library(tidytext); library(ggplot2)
blogs <- read_lines("en_US.blogs.txt")
news <- read_lines("en_US.news.txt")
twitter <- read_lines("en_US.twitter.txt")
For development efficiency, we worked with a 1% random sample of each dataset, which provided sufficient data for our exploratory analysis while allowing faster processing.
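The exact sampling code is not shown here; below is a minimal sketch, assuming a fixed seed for reproducibility (the seed value itself is arbitrary):
# Draw a reproducible 1% sample of lines from each dataset
set.seed(1234)  # assumed seed; any fixed value works
blogs_sample   <- sample(blogs, floor(length(blogs) * 0.01))
news_sample    <- sample(news, floor(length(news) * 0.01))
twitter_sample <- sample(twitter, floor(length(twitter) * 0.01))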
Original data sizes:
Blogs: 899288 lines
News: 1010242 lines
Twitter: 2360148 lines
Sample data sizes (1%):
Blogs sample: 8992 lines
News sample: 10102 lines
Twitter sample: 23601 lines
The number of lines, words, and average words per line for each dataset are as follows:
Dataset Number_of_Lines Number_of_Words Average_Words_per_Line
1 Blogs 8992 375737 41.78570
2 News 10102 342345 33.88883
3 Twitter 23601 304576 12.90522
Our text preprocessing step cleaned each sample, split it into sentences, filtered profanity, and tokenized the sentences into words.
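The process_text_file() helper itself is not reproduced in this report. The sketch below illustrates one plausible implementation; the sentence-splitting rule, lowercasing, and character filtering are assumptions on our part, and profanity_list is assumed to be a character vector of words to exclude.
# Hypothetical sketch of the process_text_file() helper (actual implementation not shown)
process_text_file <- function(lines, profanity_list) {
  # Split each line into sentences on basic end-of-sentence punctuation
  sentences <- unlist(strsplit(lines, "(?<=[.!?])\\s+", perl = TRUE))
  # Lowercase and keep only letters, apostrophes, and spaces (assumed cleaning rules)
  clean <- tolower(sentences)
  clean <- gsub("[^a-z' ]", " ", clean)
  clean <- gsub("\\s+", " ", trimws(clean))
  clean <- clean[nchar(clean) > 0]
  # Tokenize each sentence into words and drop words on the profanity list
  tokens <- lapply(strsplit(clean, " ", fixed = TRUE),
                   function(w) w[!w %in% profanity_list])
  list(clean_sentences = clean, tokenized_sentences = tokens)
}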
# Process the sampled data
blogs_processed <- process_text_file(blogs_sample, profanity_list)
news_processed <- process_text_file(news_sample, profanity_list)
twitter_processed <- process_text_file(twitter_sample, profanity_list)
# Summary statistics of processed samples
cat("\nProcessed blogs sample:\n")
Processed blogs sample:
cat("Number of clean sentences:", length(blogs_processed$clean_sentences), "\n")
Number of clean sentences: 17958
cat("Total number of tokens:", sum(sapply(blogs_processed$tokenized_sentences, length)), "\n\n")
Total number of tokens: 330502
cat("Processed news sample:\n")
Processed news sample:
cat("Number of clean sentences:", length(news_processed$clean_sentences), "\n")
Number of clean sentences: 14288
cat("Total number of tokens:", sum(sapply(news_processed$tokenized_sentences, length)), "\n\n")
Total number of tokens: 274204
cat("Processed twitter sample:\n")
Processed twitter sample:
cat("Number of clean sentences:", length(twitter_processed$clean_sentences), "\n")
Number of clean sentences: 22967
cat("Total number of tokens:", sum(sapply(twitter_processed$tokenized_sentences, length)), "\n")
Total number of tokens: 256271
Next, we combined the cleaned sentences from all three sources into a single corpus and converted it into a tidy format suitable for analysis.
# Combine datasets into a single corpus
combined_sentences <- c(
blogs_processed$clean_sentences,
news_processed$clean_sentences,
twitter_processed$clean_sentences
)
# Convert to a tibble
text_data <- tibble(text = combined_sentences)
# Unnest tokens: Converting text to a tidy format
word_counts <- text_data %>%
unnest_tokens(word, text) %>%
filter(!word %in% profanity_list) %>%
count(word, sort = TRUE)
# Display the top 10 words
print(head(word_counts, 10), n=10)
# A tibble: 10 × 2
word n
<chr> <int>
1 the 35878
2 to 21053
3 and 18268
4 a 17678
5 of 14922
6 i 12572
7 in 12383
8 for 8463
9 is 8022
10 that 7731
The most frequently occurring words were primarily common English articles, prepositions, and conjunctions.
# Collect the top 20 words
top_words <- word_counts %>%
head(20)
# Create a plot with the top 20 words
ggplot(top_words, aes(x = reorder(word, n), y = n)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Top 20 Most Frequent Words",
subtitle = "From combined corpus (blogs, news, twitter)",
x = "",
y = "Frequency") +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
axis.text.y = element_text(size = 10),
panel.grid.major.y = element_blank()
)
More interesting patterns emerged when examining sequences of words (n-grams). These n-grams reveal common phrases and word combinations that will form the foundation of our prediction algorithm. Note that the <NA> entries in the tables below come from lines with fewer words than the n-gram length (unnest_tokens() returns NA for those lines); we drop these rows before building the model, as shown in the snippet after the n-gram plots.
# 2-grams
bigrams <- text_data %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
count(bigram, sort = TRUE)
# 3-grams
trigrams <- text_data %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
count(trigram, sort = TRUE)
# View the most common bigrams and trigrams
print(head(bigrams, 10))
# A tibble: 10 × 2
bigram n
<chr> <int>
1 of the 3275
2 in the 3074
3 <NA> 2070
4 to the 1636
5 for the 1537
6 on the 1486
7 to be 1194
8 at the 1016
9 and the 909
10 in a 851
print(head(trigrams, 10))
# A tibble: 10 × 2
trigram n
<chr> <int>
1 <NA> 4728
2 one of the 259
3 a lot of 195
4 thanks for the 190
5 the end of 129
6 to be a 128
7 going to be 126
8 some of the 124
9 out of the 119
10 i want to 117
# Visualize top 15 bigrams
bigrams %>%
head(15) %>%
ggplot(aes(x = reorder(bigram, n), y = n)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Top 15 Bigrams",
subtitle = "Most frequent word pairs in corpus",
x = NULL,
y = "Frequency"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold"),
panel.grid.major.y = element_blank()
)
# Visualize top 15 trigrams
trigrams %>%
head(15) %>%
ggplot(aes(x = reorder(trigram, n), y = n)) +
geom_col(fill = "darkgreen") +
coord_flip() +
labs(
title = "Top 15 Trigrams",
subtitle = "Most frequent word triplets in corpus",
x = NULL,
y = "Frequency"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold"),
panel.grid.major.y = element_blank()
)
# Create a comparison plot of top 10 bigrams and trigrams
top_bigrams <- bigrams %>%
head(10) %>%
mutate(type = "Bigram")
top_trigrams <- trigrams %>%
head(10) %>%
mutate(type = "Trigram") %>%
rename(bigram = trigram)
combined_ngrams <- bind_rows(top_bigrams, top_trigrams)
# Create comparison plot
ggplot(combined_ngrams, aes(x = reorder(bigram, n), y = n, fill = type)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = c("Bigram" = "steelblue", "Trigram" = "darkgreen")) +
labs(
title = "Comparison of Top N-grams",
x = NULL,
y = "Frequency",
fill = "N-gram Type"
) +
theme_minimal() +
theme(
legend.position = "top",
panel.grid.major.y = element_blank()
)
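Before these n-gram tables feed into the prediction model, the <NA> rows noted earlier should be removed. A minimal way to do this with dplyr:
# Drop the <NA> rows produced by lines shorter than the n-gram length
bigrams  <- bigrams %>% filter(!is.na(bigram))
trigrams <- trigrams %>% filter(!is.na(trigram))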
To analyze word coverage, we computed how many unique words are needed to cover a given percentage (specifically 50% and 90%) of all word instances in the corpus.
# Get the word frequencies sorted from most to least frequent
word_frequencies <- word_counts %>%
arrange(desc(n))
# Calculate the total number of word instances
total_words <- sum(word_frequencies$n)
# Calculate the cumulative sum and percentage
word_coverage <- word_frequencies %>%
mutate(
cumulative_count = cumsum(n),
coverage_percentage = cumulative_count / total_words * 100
)
# Find how many words needed for specific coverage percentages
words_for_50_percent <- min(which(word_coverage$coverage_percentage >= 50))
words_for_90_percent <- min(which(word_coverage$coverage_percentage >= 90))
# Display the results
cat("Word coverage analysis:\n")
Word coverage analysis:
cat("Total unique words:", nrow(word_coverage), "\n")
Total unique words: 44614
cat("Total word instances:", total_words, "\n")
Total word instances: 755196
cat("Words needed for 50% coverage:", words_for_50_percent, "\n")
Words needed for 50% coverage: 140
cat("Words needed for 90% coverage:", words_for_90_percent, "\n\n")
Words needed for 90% coverage: 6698
# Create a coverage curve plot
ggplot(word_coverage %>% head(5000), aes(x = 1:5000, y = coverage_percentage)) +
geom_line(color = "blue") +
geom_hline(yintercept = c(50, 90), linetype = "dashed", color = "red") +
geom_vline(xintercept = c(words_for_50_percent, words_for_90_percent),
linetype = "dashed", color = "green") +
scale_y_continuous(breaks = seq(0, 100, by = 10)) +
scale_x_log10(
breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x))
) +
annotation_logticks(sides = "b") +
labs(
title = "Word Coverage Analysis",
subtitle = "Number of unique words needed to cover percentage of all word instances",
x = "Number of unique words (log scale)",
y = "Cumulative percentage coverage"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank()
)
The coverage analysis results are very informative: just 140 unique words cover 50% of all word instances in the sample, and about 6,700 words, roughly 15% of the 44,614 unique words observed, cover 90%. This steep drop-off means that a relatively small dictionary of frequent words and n-grams can account for the vast majority of what users type, which is exactly the property an efficient prediction model needs.
The following steps are planned for building the predictive text algorithm:
1. Build n-gram frequency tables (unigrams, bigrams, trigrams, and possibly 4-grams) from a larger sample of the combined corpus.
2. Implement a prediction function that matches the longest available prefix and backs off to shorter n-grams when no match is found (a simplified sketch of this lookup follows this list).
3. Prune low-frequency n-grams, guided by the coverage analysis above, to keep the lookup tables small and fast.
4. Evaluate prediction accuracy and response time, and tune the trade-off between model size and performance.
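This sketch shows a simplified backoff lookup; the table layout (columns prefix, next_word, and n) and the function name predict_next_word() are assumptions for illustration, not code from this analysis:
# Hypothetical sketch of the planned backoff lookup (table layout is assumed)
predict_next_word <- function(phrase, trigram_table, bigram_table, word_counts) {
  words <- unlist(strsplit(tolower(trimws(phrase)), "\\s+"))
  n <- length(words)
  # Try a trigram match first: the last two typed words form the prefix
  if (n >= 2) {
    hit <- trigram_table[trigram_table$prefix == paste(words[n - 1], words[n]), ]
    if (nrow(hit) > 0) return(hit$next_word[which.max(hit$n)])
  }
  # Back off to the bigram table: the last typed word forms the prefix
  if (n >= 1) {
    hit <- bigram_table[bigram_table$prefix == words[n], ]
    if (nrow(hit) > 0) return(hit$next_word[which.max(hit$n)])
  }
  # Final fallback: the single most frequent word in the corpus
  word_counts$word[1]
}
Given the trigram counts above, a call like predict_next_word("thanks for", ...) would most likely return "the", since "thanks for the" is among the most frequent trigrams.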
The Shiny application will have the following features:
1. A text box where the user types a phrase or partial sentence.
2. A display of the predicted next word, updated as the user types.
3. A simple, responsive interface that returns suggestions quickly enough for interactive use.
A minimal sketch of the intended app structure follows.
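This sketch is illustrative only; it assumes the predict_next_word() function and n-gram tables described above, and the final UI and server logic will differ:
# Minimal sketch of the planned Shiny app (illustrative only)
library(shiny)
ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  h4("Predicted next word:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    # predict_next_word() and the n-gram tables come from the model sketched above
    predict_next_word(input$phrase, trigram_table, bigram_table, word_counts)
  })
}
shinyApp(ui, server)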