This report presents an exploratory analysis of text data from three sources (blogs, news articles, and Twitter) that will be used to build a predictive text algorithm. The analysis reveals key characteristics of the dataset and outlines our strategy for developing a Shiny application that can predict the next word as users type.
Key Findings:

- The dataset contains over 4 million lines of text across the three sources
- Twitter messages are the shortest, while blog posts are the longest
- A relatively small vocabulary covers the majority of word usage
- Clear patterns in word combinations provide the foundation for the prediction algorithm
# Load the packages used throughout this analysis
library(knitr)        # kable() tables
library(stringi)      # stri_count_words()
library(dplyr)        # data manipulation and the %>% pipe
library(reshape2)     # melt()
library(ggplot2)      # plots
library(tm)           # corpus creation and cleaning
library(wordcloud)    # word cloud
library(RColorBrewer) # brewer.pal() palettes

# Set data path
data_path <- "C:/capstone/rawData/final/en_US"
# Function to safely read large text files
safe_read_lines <- function(file_path, encoding = "UTF-8") {
tryCatch({
readLines(file_path, encoding = encoding, warn = FALSE)
}, error = function(e) {
message(paste("Error reading file:", file_path))
return(character(0))
})
}
# Read the three main text files
blogs <- safe_read_lines(file.path(data_path, "en_US.blogs.txt"))
news <- safe_read_lines(file.path(data_path, "en_US.news.txt"))
twitter <- safe_read_lines(file.path(data_path, "en_US.twitter.txt"))
# Calculate basic file statistics
file_stats <- data.frame(
Source = c("Blogs", "News", "Twitter"),
Lines = c(length(blogs), length(news), length(twitter)),
Characters = c(sum(nchar(blogs)), sum(nchar(news)), sum(nchar(twitter))),
Words = c(sum(stri_count_words(blogs)), sum(stri_count_words(news)), sum(stri_count_words(twitter))),
stringsAsFactors = FALSE
)
# Add file sizes in MB
file_sizes <- c(
file.info(file.path(data_path, "en_US.blogs.txt"))$size,
file.info(file.path(data_path, "en_US.news.txt"))$size,
file.info(file.path(data_path, "en_US.twitter.txt"))$size
) / (1024^2)
file_stats$Size_MB <- round(file_sizes, 2)
file_stats$Avg_Words_Per_Line <- round(file_stats$Words / file_stats$Lines, 2)
# Display formatted table
kable(file_stats, format.args = list(big.mark = ","),
caption = "Summary Statistics for Text Data Sources")
Source | Lines | Characters | Words | Size_MB | Avg_Words_Per_Line |
---|---|---|---|---|---|
Blogs | 899,288 | 206,824,505 | 37,546,806 | 200.42 | 41.75 |
News | 1,010,206 | 203,214,543 | 34,761,151 | 196.28 | 34.41 |
Twitter | 2,360,148 | 162,096,031 | 30,096,649 | 159.36 | 12.75 |
The dataset consists of three distinct text sources whose scale and typical line length differ substantially, as the table above and the plots below show:
# Create visualization of basic stats
stats_viz <- file_stats %>%
select(Source, Lines, Words, Characters) %>%
melt(id.vars = "Source") %>%
ggplot(aes(x = Source, y = value, fill = Source)) +
geom_bar(stat = "identity") +
facet_wrap(~variable, scales = "free_y") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
labs(title = "Comparison of Data Sources",
subtitle = "Lines, Words, and Characters by Source Type",
x = "Data Source", y = "Count") +
scale_fill_brewer(type = "qual", palette = "Set2")
print(stats_viz)
# Create histogram of average words per line
length_plot <- ggplot(file_stats, aes(x = Source, y = Avg_Words_Per_Line, fill = Source)) +
geom_col() +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Average Words per Message/Article",
subtitle = "Twitter messages are significantly shorter than blogs and news",
x = "Source", y = "Average Words per Line") +
scale_fill_brewer(type = "qual", palette = "Set2") +
geom_text(aes(label = Avg_Words_Per_Line), vjust = -0.5)
print(length_plot)
# Sample data for analysis (managing memory)
set.seed(123)
sample_size <- 10000
# Create samples from each source
blogs_sample <- sample(blogs, min(sample_size, length(blogs)))
news_sample <- sample(news, min(sample_size, length(news)))
twitter_sample <- sample(twitter, min(sample_size, length(twitter)))
combined_sample <- c(blogs_sample, news_sample, twitter_sample)
# Text preprocessing
preprocess_text <- function(text_vector) {
corpus <- Corpus(VectorSource(text_vector))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
corpus <- tm_map(corpus, removeWords, stopwords("english"))
return(corpus)
}
# Process the sample
clean_corpus <- preprocess_text(combined_sample)
dtm <- DocumentTermMatrix(clean_corpus)
# slam::col_sums() sums term counts without densifying the sparse document-term matrix
term_freq <- sort(slam::col_sums(dtm), decreasing = TRUE)
# Top 20 most frequent words
top_words <- head(term_freq, 20)
word_freq_df <- data.frame(
word = names(top_words),
freq = as.numeric(top_words),
stringsAsFactors = FALSE
)
word_plot <- ggplot(word_freq_df, aes(x = reorder(word, freq), y = freq)) +
geom_bar(stat = "identity", fill = "steelblue", alpha = 0.8) +
coord_flip() +
theme_minimal() +
labs(title = "Top 20 Most Frequent Words",
subtitle = "After removing common stop words",
x = "Words", y = "Frequency") +
theme(axis.text.y = element_text(size = 10))
print(word_plot)
# Create word cloud
wordcloud(names(term_freq), term_freq, max.words = 100,
random.order = FALSE, colors = brewer.pal(8, "Dark2"),
scale = c(3, 0.5))
# Function to create n-grams using base R
create_ngrams_base <- function(text_vector, n, max_lines = 5000) {
text_subset <- text_vector[1:min(length(text_vector), max_lines)]
# Clean text
clean_text <- tolower(text_subset)
clean_text <- gsub("[^a-zA-Z\\s]", "", clean_text)
clean_text <- gsub("\\s+", " ", clean_text)
clean_text <- trimws(clean_text)
# Split into words and remove stop words
all_words <- unlist(strsplit(clean_text, "\\s+"))
all_words <- all_words[nchar(all_words) > 0]
stop_words <- c("the", "a", "an", "and", "or", "but", "in", "on", "at", "to",
"for", "of", "with", "by", "is", "are", "was", "were", "be")
all_words <- all_words[!all_words %in% stop_words]
# Create n-grams
if (length(all_words) < n) return(character(0))
# Build all n-grams at once instead of growing a vector inside a loop
ngrams <- vapply(
seq_len(length(all_words) - n + 1),
function(i) paste(all_words[i:(i + n - 1)], collapse = " "),
character(1)
)
ngram_freq <- table(ngrams)
return(sort(ngram_freq, decreasing = TRUE))
}
# Generate bigrams and trigrams
bigrams <- create_ngrams_base(combined_sample, 2)
trigrams <- create_ngrams_base(combined_sample, 3)
if(length(bigrams) > 0) {
top_bigrams <- head(bigrams, 15)
bigram_df <- data.frame(
bigram = names(top_bigrams),
freq = as.numeric(top_bigrams),
stringsAsFactors = FALSE
)
bigram_plot <- ggplot(bigram_df, aes(x = reorder(bigram, freq), y = freq)) +
geom_bar(stat = "identity", fill = "darkgreen", alpha = 0.8) +
coord_flip() +
theme_minimal() +
labs(title = "Top 15 Two-Word Combinations",
subtitle = "Most common word pairs in the dataset",
x = "Word Pairs", y = "Frequency") +
theme(axis.text.y = element_text(size = 9))
print(bigram_plot)
}
if(length(trigrams) > 0) {
top_trigrams <- head(trigrams, 15)
trigram_df <- data.frame(
trigram = names(top_trigrams),
freq = as.numeric(top_trigrams),
stringsAsFactors = FALSE
)
trigram_plot <- ggplot(trigram_df, aes(x = reorder(trigram, freq), y = freq)) +
geom_bar(stat = "identity", fill = "darkred", alpha = 0.8) +
coord_flip() +
theme_minimal() +
labs(title = "Top 15 Three-Word Combinations",
subtitle = "Most common three-word phrases in the dataset",
x = "Word Combinations", y = "Frequency") +
theme(axis.text.y = element_text(size = 9))
print(trigram_plot)
}
# Calculate vocabulary coverage
total_words <- sum(term_freq)
cumulative_coverage <- cumsum(term_freq) / total_words
# Key coverage milestones
coverage_50 <- which(cumulative_coverage >= 0.5)[1]
coverage_90 <- which(cumulative_coverage >= 0.9)[1]
coverage_stats <- data.frame(
Coverage = c("50%", "90%"),
Words_Needed = unname(c(coverage_50, coverage_90)),  # unname() keeps term names out of the row names
Percentage_of_Vocabulary = c(
round(coverage_50 / length(term_freq) * 100, 2),
round(coverage_90 / length(term_freq) * 100, 2)
)
)
kable(coverage_stats,
caption = "Vocabulary Coverage Analysis: How many words are needed to cover X% of all text?")
Coverage | Words_Needed | Percentage_of_Vocabulary |
---|---|---|
50% | 1077 | 1.98 |
90% | 16079 | 29.56 |
# Coverage visualization
coverage_df <- data.frame(
rank = 1:min(1000, length(cumulative_coverage)),
coverage = cumulative_coverage[1:min(1000, length(cumulative_coverage))]
)
coverage_plot <- ggplot(coverage_df, aes(x = rank, y = coverage)) +
geom_line(color = "blue", size = 1.2) +
geom_hline(yintercept = 0.5, linetype = "dashed", color = "red", alpha = 0.7) +
geom_hline(yintercept = 0.9, linetype = "dashed", color = "red", alpha = 0.7) +
theme_minimal() +
labs(title = "Vocabulary Efficiency",
subtitle = "A small number of common words covers most of the text",
x = "Number of Most Frequent Words",
y = "Percentage of Text Covered") +
scale_y_continuous(labels = scales::percent) +
annotate("text", x = 250, y = 0.5, label = "50% Coverage", vjust = -0.5, color = "red") +
annotate("text", x = 250, y = 0.9, label = "90% Coverage", vjust = -0.5, color = "red")
print(coverage_plot)
Volume: Our dataset contains 4,269,642 total lines with 102,404,606 words across all sources.
Diversity: The three sources show distinct writing styles: blog posts average about 42 words per line, news articles about 34, and tweets only about 13, reflecting Twitter's character limit.
Vocabulary Efficiency: Only 1,077 words (about 2% of the observed vocabulary) cover 50% of all word usage in the sample, and roughly 16,079 words (about 30%) cover 90%.
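A pruned prediction dictionary can be derived directly from this coverage computation; the short sketch below reuses the term_freq and coverage_90 objects defined above.

# Keep only the words needed for 90% coverage as the candidate dictionary
dictionary_90 <- names(term_freq)[seq_len(coverage_90)]
length(dictionary_90)  # roughly 16,000 words in this sample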
Approach: We will implement a Katz Back-off Model with Good-Turing smoothing for robust prediction.
N-gram Implementation:

- Build 4-gram, 3-gram, 2-gram, and 1-gram models
- Use a back-off strategy: try the 4-gram prediction first, then fall back to shorter n-grams if needed (a simplified back-off sketch follows this list)
- Apply smoothing techniques to handle unseen word combinations
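The sketch below illustrates the back-off lookup in its simplest form. The table structure (a list ngram_tables where element k holds counts for a k-word prefix with prefix, next_word, and freq columns, plus a unigram_fallback vector) is an assumption for illustration, and Good-Turing discounting is omitted; the final model will add proper smoothing.

# Illustrative sketch only: simplified back-off over assumed prefix/next_word/freq tables
predict_next_word <- function(input_text, ngram_tables, unigram_fallback, top_n = 3) {
  words <- unlist(strsplit(tolower(trimws(input_text)), "\\s+"))
  # Try the longest available context first, then back off to shorter prefixes
  for (k in rev(seq_along(ngram_tables))) {
    if (length(words) < k) next
    prefix <- paste(tail(words, k), collapse = " ")
    tbl <- ngram_tables[[k]]
    matches <- tbl[tbl$prefix == prefix, , drop = FALSE]
    if (nrow(matches) > 0) {
      matches <- matches[order(-matches$freq), , drop = FALSE]
      return(head(matches$next_word, top_n))
    }
  }
  # No context matched: fall back to the most frequent single words
  head(unigram_fallback, top_n)
}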
User Interface:

- Clean, intuitive text input box
- Real-time prediction as the user types
- Display of the top 3 word suggestions with confidence indicators
- Mobile-friendly responsive design (a minimal skeleton is sketched below)
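A minimal skeleton of that interface might look like the following; predict_next_word(), ngram_tables, and unigram_fallback refer to the illustrative sketch above and are placeholders rather than final app code.

# Minimal Shiny skeleton (sketch only)
library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("user_text", "Type a phrase:", value = "", width = "100%"),
  h4("Suggestions:"),
  textOutput("predictions")
)

server <- function(input, output, session) {
  output$predictions <- renderText({
    req(nzchar(input$user_text))
    preds <- predict_next_word(input$user_text, ngram_tables, unigram_fallback)
    paste(preds, collapse = " | ")
  })
}

# shinyApp(ui, server)  # not run in this report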
Performance Targets:

- Speed: < 100 ms response time per prediction (a rough timing check is sketched below)
- Accuracy: > 15% top-1 accuracy, > 40% top-3 accuracy
- Size: < 50 MB total app size for web deployment
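During development, the speed target can be checked with a quick timing loop like the one below; it assumes the predict_next_word() sketch and its tables from above, and the test phrase is arbitrary.

# Rough latency check (sketch only)
timing <- system.time(
  replicate(100, predict_next_word("thanks for the", ngram_tables, unigram_fallback))
)
round(timing["elapsed"] / 100 * 1000, 2)  # average milliseconds per prediction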
Data Structures:

- Compressed hash tables for fast n-gram lookup
- Sparse matrices to minimize memory usage
- Efficient caching of frequent predictions
Optimization Strategies:

- Prune rare n-grams to balance accuracy against memory
- Implement lazy loading of prediction models
- Use data.table for high-performance operations (a keyed-lookup and pruning sketch follows this list)
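The sketch below shows the intended pattern with data.table: prune low-frequency n-grams, then key the table on the prefix so lookups use an indexed search rather than a full scan. The toy counts and the pruning threshold are made up for illustration.

# Illustrative sketch only: keyed bigram lookup with pruning
library(data.table)

bigram_dt <- data.table(
  prefix    = c("new", "new", "thank", "right", "right"),
  next_word = c("york", "year", "you", "now", "here"),
  freq      = c(120L, 85L, 300L, 40L, 2L)
)

# Prune rare n-grams to save memory (threshold chosen arbitrarily here)
bigram_dt <- bigram_dt[freq >= 3]

# Key on the prefix so lookups are indexed
setkey(bigram_dt, prefix)

# Top suggestions for the prefix "new"
head(bigram_dt["new"][order(-freq)]$next_word, 3)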
This exploratory analysis demonstrates successful data loading and reveals key characteristics that will guide our prediction algorithm development. The dataset’s vocabulary efficiency and clear n-gram patterns provide a strong foundation for building an accurate and fast text prediction application.
The next phase will focus on implementing the Katz back-off model and creating an intuitive Shiny interface that delivers real-time predictions to users.
## Analysis completed on: 2025-09-11 17:17:36.366236
## R version: R version 4.5.0 (2025-04-11 ucrt)