This report summarizes our exploratory analysis of text data from three sources (blogs, news articles, and Twitter) and outlines our approach to building a next-word prediction application. The ultimate goal is a user-friendly Shiny app that uses this text data to suggest the next word as a user types, improving the text input experience. Our analysis shows that a relatively small subset of words accounts for the large majority of all word occurrences, which should allow us to build an efficient and effective prediction algorithm.
We successfully loaded and processed three text datasets:
# Load the packages used throughout (readr, dplyr, tidytext, ggplot2) and read in the data
library(readr); library(dplyr); library(tidytext); library(ggplot2)
blogs <- read_lines("en_US.blogs.txt")
news <- read_lines("en_US.news.txt")
twitter <- read_lines("en_US.twitter.txt")
For development efficiency, we worked with a 1% random sample of each dataset, which provided sufficient data for our exploratory analysis while allowing faster processing.
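The exact sampling code is not shown here; below is a minimal sketch, assuming a fixed seed for reproducibility (the seed value itself is arbitrary):
# Draw a reproducible 1% sample of lines from each dataset
set.seed(1234)  # assumed seed; any fixed value works
blogs_sample   <- sample(blogs, floor(length(blogs) * 0.01))
news_sample    <- sample(news, floor(length(news) * 0.01))
twitter_sample <- sample(twitter, floor(length(twitter) * 0.01))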
Original data sizes:
Blogs: 899288 lines
News: 1010242 lines
Twitter: 2360148 lines
Sample data sizes (1%):
Blogs sample: 8992 lines
News sample: 10102 lines
Twitter sample: 23601 lines
The number of lines, words, and average words per line for each dataset are as follows:
Dataset Number_of_Lines Number_of_Words Average_Words_per_Line
1 Blogs 8992 375737 41.78570
2 News 10102 342345 33.88883
3 Twitter 23601 304576 12.90522
Our text preprocessing step cleaned each sample, split it into sentences, filtered profanity, and tokenized the sentences into words.
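The process_text_file() helper itself is not reproduced in this report. The sketch below illustrates one plausible implementation; the sentence-splitting rule, lowercasing, and character filtering are assumptions on our part, and profanity_list is assumed to be a character vector of words to exclude.
# Hypothetical sketch of the process_text_file() helper (actual implementation not shown)
process_text_file <- function(lines, profanity_list) {
  # Split each line into sentences on basic end-of-sentence punctuation
  sentences <- unlist(strsplit(lines, "(?<=[.!?])\\s+", perl = TRUE))
  # Lowercase and keep only letters, apostrophes, and spaces (assumed cleaning rules)
  clean <- tolower(sentences)
  clean <- gsub("[^a-z' ]", " ", clean)
  clean <- gsub("\\s+", " ", trimws(clean))
  clean <- clean[nchar(clean) > 0]
  # Tokenize each sentence into words and drop words on the profanity list
  tokens <- lapply(strsplit(clean, " ", fixed = TRUE),
                   function(w) w[!w %in% profanity_list])
  list(clean_sentences = clean, tokenized_sentences = tokens)
}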
# Process the sampled data
blogs_processed <- process_text_file(blogs_sample, profanity_list)
news_processed <- process_text_file(news_sample, profanity_list)
twitter_processed <- process_text_file(twitter_sample, profanity_list)
# Summary statistics of processed samples
cat("\nProcessed blogs sample:\n")
Processed blogs sample:
cat("Number of clean sentences:", length(blogs_processed$clean_sentences), "\n")
Number of clean sentences: 17958
cat("Total number of tokens:", sum(sapply(blogs_processed$tokenized_sentences, length)), "\n\n")
Total number of tokens: 330502
cat("Processed news sample:\n")
Processed news sample:
cat("Number of clean sentences:", length(news_processed$clean_sentences), "\n")
Number of clean sentences: 14288
cat("Total number of tokens:", sum(sapply(news_processed$tokenized_sentences, length)), "\n\n")
Total number of tokens: 274204
cat("Processed twitter sample:\n")
Processed twitter sample:
cat("Number of clean sentences:", length(twitter_processed$clean_sentences), "\n")
Number of clean sentences: 22967
cat("Total number of tokens:", sum(sapply(twitter_processed$tokenized_sentences, length)), "\n")
Total number of tokens: 256271
Next, we combined the cleaned sentences from all three sources into a single corpus and converted it into a tidy format suitable for analysis.
# Combine datasets into a single corpus
combined_sentences <- c(
blogs_processed$clean_sentences,
news_processed$clean_sentences,
twitter_processed$clean_sentences
)
# Convert to a tibble
text_data <- tibble(text = combined_sentences)
# Unnest tokens: Converting text to a tidy format
word_counts <- text_data %>%
unnest_tokens(word, text) %>%
filter(!word %in% profanity_list) %>%
count(word, sort = TRUE)
# Display the top 10 words
print(head(word_counts, 10), n=10)
# A tibble: 10 × 2
word n
<chr> <int>
1 the 35878
2 to 21053
3 and 18268
4 a 17678
5 of 14922
6 i 12572
7 in 12383
8 for 8463
9 is 8022
10 that 7731
The most frequently occurring words were primarily common English articles, prepositions, and conjunctions.
# Collect the top 20 words
top_words <- word_counts %>%
head(20)
# Create a plot with the top 20 words
ggplot(top_words, aes(x = reorder(word, n), y = n)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Top 20 Most Frequent Words",
subtitle = "From combined corpus (blogs, news, twitter)",
x = "",
y = "Frequency") +
theme_minimal() +
theme(
plot.title = element_text(face = "bold", size = 14),
axis.text.y = element_text(size = 10),
panel.grid.major.y = element_blank()
)
More interesting patterns emerged when examining sequences of words (n-grams). These n-grams reveal common phrases and word combinations that will form the foundation of our prediction algorithm. Note that the <NA> entries in the tables below come from lines with fewer words than the n-gram length (unnest_tokens() returns NA for those lines); we drop these rows before building the model, as shown in the snippet after the n-gram plots.
# 2-grams
bigrams <- text_data %>%
unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
count(bigram, sort = TRUE)
# 3-grams
trigrams <- text_data %>%
unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
count(trigram, sort = TRUE)
# View the most common bigrams and trigrams
print(head(bigrams, 10))
# A tibble: 10 × 2
bigram n
<chr> <int>
1 of the 3275
2 in the 3074
3 <NA> 2070
4 to the 1636
5 for the 1537
6 on the 1486
7 to be 1194
8 at the 1016
9 and the 909
10 in a 851
print(head(trigrams, 10))
# A tibble: 10 × 2
trigram n
<chr> <int>
1 <NA> 4728
2 one of the 259
3 a lot of 195
4 thanks for the 190
5 the end of 129
6 to be a 128
7 going to be 126
8 some of the 124
9 out of the 119
10 i want to 117
# Visualize top 15 bigrams
bigrams %>%
head(15) %>%
ggplot(aes(x = reorder(bigram, n), y = n)) +
geom_col(fill = "steelblue") +
coord_flip() +
labs(
title = "Top 15 Bigrams",
subtitle = "Most frequent word pairs in corpus",
x = NULL,
y = "Frequency"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold"),
panel.grid.major.y = element_blank()
)
# Visualize top 15 trigrams
trigrams %>%
head(15) %>%
ggplot(aes(x = reorder(trigram, n), y = n)) +
geom_col(fill = "darkgreen") +
coord_flip() +
labs(
title = "Top 15 Trigrams",
subtitle = "Most frequent word triplets in corpus",
x = NULL,
y = "Frequency"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold"),
panel.grid.major.y = element_blank()
)
# Create a comparison plot of top 10 bigrams and trigrams
top_bigrams <- bigrams %>%
head(10) %>%
mutate(type = "Bigram")
top_trigrams <- trigrams %>%
head(10) %>%
mutate(type = "Trigram") %>%
rename(bigram = trigram)
combined_ngrams <- bind_rows(top_bigrams, top_trigrams)
# Create comparison plot
ggplot(combined_ngrams, aes(x = reorder(bigram, n), y = n, fill = type)) +
geom_col() +
coord_flip() +
scale_fill_manual(values = c("Bigram" = "steelblue", "Trigram" = "darkgreen")) +
labs(
title = "Comparison of Top N-grams",
x = NULL,
y = "Frequency",
fill = "N-gram Type"
) +
theme_minimal() +
theme(
legend.position = "top",
panel.grid.major.y = element_blank()
)
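Before these n-gram tables feed into the prediction model, the <NA> rows noted earlier should be removed. A minimal way to do this with dplyr:
# Drop the <NA> rows produced by lines shorter than the n-gram length
bigrams  <- bigrams %>% filter(!is.na(bigram))
trigrams <- trigrams %>% filter(!is.na(trigram))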
To analyze word coverage, we computed how many unique words are needed to cover a given percentage (specifically 50% and 90%) of all word instances in the corpus.
# Get the word frequencies sorted from most to least frequent
word_frequencies <- word_counts %>%
arrange(desc(n))
# Calculate the total number of word instances
total_words <- sum(word_frequencies$n)
# Calculate the cumulative sum and percentage
word_coverage <- word_frequencies %>%
mutate(
cumulative_count = cumsum(n),
coverage_percentage = cumulative_count / total_words * 100
)
# Find how many words needed for specific coverage percentages
words_for_50_percent <- min(which(word_coverage$coverage_percentage >= 50))
words_for_90_percent <- min(which(word_coverage$coverage_percentage >= 90))
# Display the results
cat("Word coverage analysis:\n")
Word coverage analysis:
cat("Total unique words:", nrow(word_coverage), "\n")
Total unique words: 44614
cat("Total word instances:", total_words, "\n")
Total word instances: 755196
cat("Words needed for 50% coverage:", words_for_50_percent, "\n")
Words needed for 50% coverage: 140
cat("Words needed for 90% coverage:", words_for_90_percent, "\n\n")
Words needed for 90% coverage: 6698
# Create a coverage curve plot
ggplot(word_coverage %>% head(5000), aes(x = 1:5000, y = coverage_percentage)) +
geom_line(color = "blue") +
geom_hline(yintercept = c(50, 90), linetype = "dashed", color = "red") +
geom_vline(xintercept = c(words_for_50_percent, words_for_90_percent),
linetype = "dashed", color = "green") +
scale_y_continuous(breaks = seq(0, 100, by = 10)) +
scale_x_log10(
breaks = scales::trans_breaks("log10", function(x) 10^x),
labels = scales::trans_format("log10", scales::math_format(10^.x))
) +
annotation_logticks(sides = "b") +
labs(
title = "Word Coverage Analysis",
subtitle = "Number of unique words needed to cover percentage of all word instances",
x = "Number of unique words (log scale)",
y = "Cumulative percentage coverage"
) +
theme_minimal() +
theme(
plot.title = element_text(face = "bold"),
panel.grid.minor = element_blank()
)
The coverage analysis results are very informative: just 140 unique words cover 50% of all word instances in the sample, and about 6,700 words, roughly 15% of the 44,614 unique words observed, cover 90%. This steep drop-off means that a relatively small dictionary of frequent words and n-grams can account for the vast majority of what users type, which is exactly the property an efficient prediction model needs.
The following steps are planned for building the predictive text algorithm:
1. Build n-gram frequency tables (unigrams, bigrams, trigrams, and possibly 4-grams) from a larger sample of the combined corpus.
2. Implement a prediction function that matches the longest available prefix and backs off to shorter n-grams when no match is found (a simplified sketch of this lookup follows this list).
3. Prune low-frequency n-grams, guided by the coverage analysis above, to keep the lookup tables small and fast.
4. Evaluate prediction accuracy and response time, and tune the trade-off between model size and performance.
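This sketch shows a simplified backoff lookup; the table layout (columns prefix, next_word, and n) and the function name predict_next_word() are assumptions for illustration, not code from this analysis:
# Hypothetical sketch of the planned backoff lookup (table layout is assumed)
predict_next_word <- function(phrase, trigram_table, bigram_table, word_counts) {
  words <- unlist(strsplit(tolower(trimws(phrase)), "\\s+"))
  n <- length(words)
  # Try a trigram match first: the last two typed words form the prefix
  if (n >= 2) {
    hit <- trigram_table[trigram_table$prefix == paste(words[n - 1], words[n]), ]
    if (nrow(hit) > 0) return(hit$next_word[which.max(hit$n)])
  }
  # Back off to the bigram table: the last typed word forms the prefix
  if (n >= 1) {
    hit <- bigram_table[bigram_table$prefix == words[n], ]
    if (nrow(hit) > 0) return(hit$next_word[which.max(hit$n)])
  }
  # Final fallback: the single most frequent word in the corpus
  word_counts$word[1]
}
Given the trigram counts above, a call like predict_next_word("thanks for", ...) would most likely return "the", since "thanks for the" is among the most frequent trigrams.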
The Shiny application will have the following features:
1. A text box where the user types a phrase or partial sentence.
2. A display of the predicted next word, updated as the user types.
3. A simple, responsive interface that returns suggestions quickly enough for interactive use.
A minimal sketch of the intended app structure follows.
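This sketch is illustrative only; it assumes the predict_next_word() function and n-gram tables described above, and the final UI and server logic will differ:
# Minimal sketch of the planned Shiny app (illustrative only)
library(shiny)
ui <- fluidPage(
  titlePanel("Next-Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  h4("Predicted next word:"),
  textOutput("prediction")
)
server <- function(input, output) {
  output$prediction <- renderText({
    if (nchar(trimws(input$phrase)) == 0) return("")
    # predict_next_word() and the n-gram tables come from the model sketched above
    predict_next_word(input$phrase, trigram_table, bigram_table, word_counts)
  })
}
shinyApp(ui, server)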