Text Prediction Application Development


Exploratory Analysis and Implementation Plan


Executive Summary

This report summarizes our exploratory analysis of text data from three sources (blogs, news articles, and Twitter) and outlines our approach to developing a next-word prediction application. The ultimate goal is a user-friendly Shiny app that uses this text data to suggest the next word as the user types, improving the text input experience. Our analysis shows that a relatively small subset of words accounts for the majority of word usage, which will allow us to build an efficient and effective prediction algorithm.

Data Overview

We successfully loaded and processed three text datasets:

  • Blog posts
  • News articles
  • Twitter posts
# Load required packages and read in the raw data
library(tidyverse)   # readr, dplyr, tibble, stringr, ggplot2
library(tidytext)    # unnest_tokens()
blogs <- read_lines("en_US.blogs.txt")
news <- read_lines("en_US.news.txt")
twitter <- read_lines("en_US.twitter.txt")

For development efficiency, we worked with a 1% random sample of each dataset, which provided sufficient data for our exploratory analysis while allowing faster processing.
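The sampling step itself is straightforward; a minimal sketch of how the 1% samples could be drawn is shown below (the seed value is illustrative, not the one actually used):

# Draw reproducible 1% random samples of each dataset (seed value is illustrative)
set.seed(1234)
blogs_sample   <- sample(blogs,   floor(length(blogs)   * 0.01))
news_sample    <- sample(news,    floor(length(news)    * 0.01))
twitter_sample <- sample(twitter, floor(length(twitter) * 0.01))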

Original data sizes:
Blogs: 899288 lines
News: 1010242 lines
Twitter: 2360148 lines
Sample data sizes (1%):
Blogs sample: 8992 lines
News sample: 10102 lines
Twitter sample: 23601 lines


Summary Statistics

The number of lines, number of words, and average words per line for each 1% sample are as follows:

  Dataset Number_of_Lines Number_of_Words Average_Words_per_Line
1   Blogs            8992          375737               41.78570
2    News           10102          342345               33.88883
3 Twitter           23601          304576               12.90522
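
For reference, a minimal sketch of how these statistics can be computed from the samples (the whitespace-based word count via stringr::str_count is an assumption; the original computation is not shown):

# Sketch: count lines, whitespace-delimited words, and words per line for a sample
summarize_sample <- function(name, lines) {
  n_words <- sum(str_count(lines, "\\S+"))
  tibble(Dataset = name,
         Number_of_Lines = length(lines),
         Number_of_Words = n_words,
         Average_Words_per_Line = n_words / length(lines))
}

bind_rows(summarize_sample("Blogs", blogs_sample),
          summarize_sample("News", news_sample),
          summarize_sample("Twitter", twitter_sample))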


Text Preprocessing

Our text preprocessing, implemented in a helper function process_text_file() (an illustrative sketch follows the code below), successfully:

  • Tokenized text into individual words
  • Removed profanity and inappropriate content
  • Standardized formatting (quotes, dashes, etc.)
  • Removed URLs and numbers
# Process the sampled data
blogs_processed <- process_text_file(blogs_sample, profanity_list)
news_processed <- process_text_file(news_sample, profanity_list)
twitter_processed <- process_text_file(twitter_sample, profanity_list)
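
The helper process_text_file() is defined elsewhere in our scripts; the sketch below only illustrates the kind of cleaning listed above (sentence splitting via the tokenizers package and the specific regular expressions are assumptions, not the original implementation):

# Illustrative sketch of the cleaning steps (not the original implementation)
process_text_file <- function(lines, profanity_list) {
  sentences <- unlist(tokenizers::tokenize_sentences(lines))
  clean <- sentences %>%
    str_to_lower() %>%
    str_replace_all("https?://\\S+|www\\.\\S+", " ") %>%   # remove URLs
    str_replace_all("[0-9]+", " ") %>%                      # remove numbers
    str_replace_all("[\u2018\u2019]", "'") %>%              # standardize quotes
    str_replace_all("[\u2013\u2014]", "-") %>%              # standardize dashes
    str_squish()
  tokens <- lapply(tokenizers::tokenize_words(clean),
                   function(w) w[!w %in% profanity_list])   # drop profanity
  list(clean_sentences = clean, tokenized_sentences = tokens)
}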

# Summary statistics of processed samples
cat("Processed blogs sample:\n")
cat("Number of clean sentences:", length(blogs_processed$clean_sentences), "\n")
cat("Total number of tokens:", sum(sapply(blogs_processed$tokenized_sentences, length)), "\n\n")
cat("Processed news sample:\n")
cat("Number of clean sentences:", length(news_processed$clean_sentences), "\n")
cat("Total number of tokens:", sum(sapply(news_processed$tokenized_sentences, length)), "\n\n")
cat("Processed twitter sample:\n")
cat("Number of clean sentences:", length(twitter_processed$clean_sentences), "\n")
cat("Total number of tokens:", sum(sapply(twitter_processed$tokenized_sentences, length)), "\n")

Processed blogs sample:
Number of clean sentences: 17958
Total number of tokens: 330502

Processed news sample:
Number of clean sentences: 14288
Total number of tokens: 274204

Processed twitter sample:
Number of clean sentences: 22967
Total number of tokens: 256271


Exploratory Analysis

We processed the data by combining the datasets into a single corpus, cleaning the text data, and converting it into a format suitable for analysis.

# Combine datasets into a single corpus 
combined_sentences <- c(
  blogs_processed$clean_sentences,
  news_processed$clean_sentences,
  twitter_processed$clean_sentences
)

# Convert to a tibble
text_data <- tibble(text = combined_sentences)

# Unnest tokens: Converting text to a tidy format
word_counts <- text_data %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% profanity_list) %>%
  count(word, sort = TRUE)

# Display the top 10 words
print(head(word_counts, 10), n=10)
# A tibble: 10 × 2
   word      n
   <chr> <int>
 1 the   35878
 2 to    21053
 3 and   18268
 4 a     17678
 5 of    14922
 6 i     12572
 7 in    12383
 8 for    8463
 9 is     8022
10 that   7731


1. Word Frequency Distribution

The most frequently occurring words were primarily common English articles, prepositions, and conjunctions.

# Collect the top 20 words
top_words <- word_counts %>% 
  head(20)  

# Create a plot with the top 20 words
ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +  
  coord_flip() +
  labs(
    title = "Top 20 Most Frequent Words",
    subtitle = "From combined corpus (blogs, news, twitter)",
    x = "",  
    y = "Frequency") +
  theme_minimal() +  
  theme(
    plot.title = element_text(face = "bold", size = 14),
    axis.text.y = element_text(size = 10),  
    panel.grid.major.y = element_blank()  
  )


2. Bigram and Trigram Frequencies

More interesting patterns emerged when examining sequences of words (n-grams). These n-grams reveal common phrases and word combinations that will form the foundation of our prediction algorithm.

# 2-grams
bigrams <- text_data %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

# 3-grams
trigrams <- text_data %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  count(trigram, sort = TRUE)

# View the most common bigrams and trigrams
print(head(bigrams, 10))
# A tibble: 10 × 2
   bigram      n
   <chr>   <int>
 1 of the   3275
 2 in the   3074
 3 <NA>     2070
 4 to the   1636
 5 for the  1537
 6 on the   1486
 7 to be    1194
 8 at the   1016
 9 and the   909
10 in a      851
print(head(trigrams, 10))
# A tibble: 10 × 2
   trigram            n
   <chr>          <int>
 1 <NA>            4728
 2 one of the       259
 3 a lot of         195
 4 thanks for the   190
 5 the end of       129
 6 to be a          128
 7 going to be      126
 8 some of the      124
 9 out of the       119
10 i want to        117
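The <NA> rows appear because unnest_tokens() returns NA for sentences that contain fewer tokens than the n-gram length. These rows carry no predictive information and can be dropped before the n-gram tables are used further, for example:

# Drop NA n-grams produced by sentences shorter than the n-gram length
bigrams  <- bigrams  %>% filter(!is.na(bigram))
trigrams <- trigrams %>% filter(!is.na(trigram))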
# Visualize top 15 bigrams
bigrams %>%
  head(15) %>%
  ggplot(aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(
    title = "Top 15 Bigrams",
    subtitle = "Most frequent word pairs in corpus",
    x = NULL,
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major.y = element_blank()
  )

# Visualize top 15 trigrams
trigrams %>%
  head(15) %>%
  ggplot(aes(x = reorder(trigram, n), y = n)) +
  geom_col(fill = "darkgreen") +
  coord_flip() +
  labs(
    title = "Top 15 Trigrams",
    subtitle = "Most frequent word triplets in corpus",
    x = NULL,
    y = "Frequency"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.major.y = element_blank()
  )

# Create a comparison plot of top 10 bigrams and trigrams
top_bigrams <- bigrams %>%
  head(10) %>%
  mutate(type = "Bigram")

top_trigrams <- trigrams %>%
  head(10) %>%
  mutate(type = "Trigram") %>%
  rename(bigram = trigram)  # align column name with top_bigrams for bind_rows()

combined_ngrams <- bind_rows(top_bigrams, top_trigrams)

# Create comparison plot
ggplot(combined_ngrams, aes(x = reorder(bigram, n), y = n, fill = type)) +
  geom_col() +
  coord_flip() +
  scale_fill_manual(values = c("Bigram" = "steelblue", "Trigram" = "darkgreen")) +
  labs(
    title = "Comparison of Top N-grams",
    x = NULL,
    y = "Frequency",
    fill = "N-gram Type"
  ) +
  theme_minimal() +
  theme(
    legend.position = "top",
    panel.grid.major.y = element_blank()
  )


Coverage Analysis

To analyze word coverage, we compute how many unique words are needed to cover a certain percentage (e.g., 50% and 90%) of all instances.

# Get the word frequencies sorted from most to least frequent
word_frequencies <- word_counts %>%
  arrange(desc(n))

# Calculate the total number of word instances
total_words <- sum(word_frequencies$n)

# Calculate the cumulative sum and percentage
word_coverage <- word_frequencies %>%
  mutate(
    cumulative_count = cumsum(n),
    coverage_percentage = cumulative_count / total_words * 100
  )

# Find how many words needed for specific coverage percentages
words_for_50_percent <- min(which(word_coverage$coverage_percentage >= 50))
words_for_90_percent <- min(which(word_coverage$coverage_percentage >= 90))

# Display the results
cat("Word coverage analysis:\n")
cat("Total unique words:", nrow(word_coverage), "\n")
cat("Total word instances:", total_words, "\n")
cat("Words needed for 50% coverage:", words_for_50_percent, "\n")
cat("Words needed for 90% coverage:", words_for_90_percent, "\n")

Word coverage analysis:
Total unique words: 44614
Total word instances: 755196
Words needed for 50% coverage: 140
Words needed for 90% coverage: 6698
# Create a coverage curve plot (plot enough ranked words to include the 90% mark)
word_coverage %>%
  mutate(rank = row_number()) %>%
  head(10000) %>%
  ggplot(aes(x = rank, y = coverage_percentage)) +
  geom_line(color = "blue") +
  geom_hline(yintercept = c(50, 90), linetype = "dashed", color = "red") +
  geom_vline(xintercept = c(words_for_50_percent, words_for_90_percent),
             linetype = "dashed", color = "green") +
  scale_y_continuous(breaks = seq(0, 100, by = 10)) +
  scale_x_log10(
    breaks = scales::trans_breaks("log10", function(x) 10^x),
    labels = scales::trans_format("log10", scales::math_format(10^.x))
  ) +
  annotation_logticks(sides = "b") +
  labs(
    title = "Word Coverage Analysis",
    subtitle = "Number of unique words needed to cover percentage of all word instances",
    x = "Number of unique words (log scale)",
    y = "Cumulative percentage coverage"
  ) +
  theme_minimal() +
  theme(
    plot.title = element_text(face = "bold"),
    panel.grid.minor = element_blank()
  )

The coverage analysis results are very informative:

  • There are 44,614 unique words in the corpus.
  • Only 140 words (0.3% of unique words) are needed to cover 50% of all word usage and 6,698 words (15% of unique words) to cover 90% of usage.
  • This pattern is consistent with Zipf’s law, which states that a word’s frequency is roughly inversely proportional to its frequency rank (a quick visual check is shown below). In practice, a relatively small vocabulary covers most common usage, which is very helpful for developing an efficient text prediction algorithm.
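
As a quick sanity check of the Zipf pattern, the rank-frequency relationship can be plotted on log-log axes, where it should appear roughly linear:

# Zipf check: word frequency vs. rank on log-log axes
word_frequencies %>%
  mutate(rank = row_number()) %>%
  ggplot(aes(x = rank, y = n)) +
  geom_line(color = "steelblue") +
  scale_x_log10() +
  scale_y_log10() +
  labs(title = "Rank-Frequency Plot (Zipf's Law)",
       x = "Word rank (log scale)", y = "Word frequency (log scale)") +
  theme_minimal()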

Goals for the Prediction Algorithm and Shiny App


Prediction Algorithm

The following steps are planned for building the predictive text algorithm:

  • N-gram Model Construction: Create a model using n-gram frequency tables (e.g., bigrams and trigrams) to predict the next word based on the one or two preceding words.
  • Handling Unseen N-grams: Implement a backoff strategy that falls back to lower-order n-grams when a higher-order n-gram has not been observed, and incorporate Laplace smoothing for sparse counts (a minimal sketch follows this list).
  • Evaluation: Use metrics such as prediction accuracy and perplexity on held-out text to evaluate the model’s performance.
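
To make the backoff idea concrete, the sketch below shows one possible shape for the predictor. It assumes pre-computed trigram, bigram, and unigram count tables (trigram_tab, bigram_tab, unigram_tab) with columns prefix, word, and n; the names and structure are placeholders rather than the final design:

# Sketch of a simple backoff predictor (table names and columns are assumptions)
predict_next_word <- function(input, trigram_tab, bigram_tab, unigram_tab, k = 3) {
  words    <- str_split(str_squish(str_to_lower(input)), " ")[[1]]
  last_two <- paste(tail(words, 2), collapse = " ")
  last_one <- tail(words, 1)

  # Try the trigram table (two-word prefix) first, then back off
  candidates <- trigram_tab %>% filter(prefix == last_two)
  if (nrow(candidates) == 0) candidates <- bigram_tab %>% filter(prefix == last_one)
  if (nrow(candidates) == 0) candidates <- unigram_tab

  candidates %>% arrange(desc(n)) %>% head(k) %>% pull(word)
}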

Shiny App Development

The Shiny application will have the following features (a minimal interface sketch follows the list):

  • Text Input Field: Users can type in sentences, and the app will suggest the next word.
  • Dynamic Suggestions: Display the top predicted words based on the user’s input in real-time.
  • User Interface: Create a simple and intuitive layout that is accessible.
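
A minimal sketch of the intended interface is shown below. It assumes the predict_next_word() helper and n-gram tables from the previous sketch; widget names and layout are placeholders:

# Minimal Shiny sketch (widget names and layout are placeholders)
library(shiny)

ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("user_text", "Type a sentence:", width = "100%"),
  h4("Suggested next words:"),
  textOutput("suggestions")
)

server <- function(input, output) {
  output$suggestions <- renderText({
    req(input$user_text)
    paste(predict_next_word(input$user_text, trigram_tab, bigram_tab, unigram_tab),
          collapse = " | ")
  })
}

shinyApp(ui, server)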