This report presents a basic exploratory data analysis of three English text datasets: blogs, news, and tweets. The goal is to understand their general structure before building a predictive model.
We use a small sample (the first 10,000 lines of each file) to keep processing manageable.
library(stringi)  # stri_count_words()
library(ggplot2)  # plotting

sample_size <- 10000

# Adjust the path as needed
blogs   <- readLines("dados/final/en_US/en_US.blogs.txt", n = sample_size, warn = FALSE)
news    <- readLines("dados/final/en_US/en_US.news.txt", n = sample_size, warn = FALSE)
twitter <- readLines("dados/final/en_US/en_US.twitter.txt", n = sample_size, warn = FALSE)
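Note that readLines(n = sample_size) keeps the first 10,000 lines of each file, not a random subset. A minimal sketch of random sampling instead, assuming the full files fit in memory (the sample_lines() helper and the seed are illustrative, not part of the original analysis):

set.seed(123)
sample_lines <- function(path, n) {
  all_lines <- readLines(path, warn = FALSE)  # read the whole file
  sample(all_lines, min(n, length(all_lines)))
}
# e.g. blogs <- sample_lines("dados/final/en_US/en_US.blogs.txt", sample_size)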
We compute basic statistics for each sample: line count, word count, in-memory size in megabytes, and average words per line.
generate_summary <- function(text_data, name) {
  word_counts <- stri_count_words(text_data)
  # object.size() returns an "object_size" object; convert to numeric
  # before scaling, otherwise the result keeps a misleading "bytes" label
  size_mb <- as.numeric(object.size(text_data)) / (1024^2)
  data.frame(
    Source = name,
    Lines = length(text_data),
    Words = sum(word_counts),
    Size_MB = round(size_mb, 2),
    Avg_Words_Per_Line = round(mean(word_counts), 2)
  )
}
# Avoid naming the result "summary", which would mask base::summary()
corpus_summary <- rbind(
  generate_summary(blogs, "Blogs"),
  generate_summary(news, "News"),
  generate_summary(twitter, "Twitter")
)
corpus_summary
##    Source Lines  Words Size_MB Avg_Words_Per_Line
## 1   Blogs 10000 412805     2.8              41.28
## 2    News 10000 348070     2.6              34.81
## 3 Twitter 10000 126511     1.4              12.65
twitter_word_counts <- stri_count_words(twitter)

# qplot() was deprecated in ggplot2 3.4.0, so we call ggplot() directly
ggplot(data.frame(words = twitter_word_counts), aes(x = words)) +
  geom_histogram(bins = 30) +
  labs(title = "Words per Line Distribution (Twitter)",
       x = "Words per Line", y = "Frequency")
The next step is to build a word prediction model using
n-grams (sequences of 1, 2, or 3 words).
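To make that concrete, here is a minimal n-gram counting sketch in base R and stringi; the tokenize() helper, its cleaning rules, and the use of plain table() objects are assumptions for illustration, not the final model.

tokenize <- function(lines) {
  # Lowercase, keep letters and apostrophes, split on whitespace
  cleaned <- stri_replace_all_regex(stri_trans_tolower(lines), "[^a-z' ]", " ")
  unlist(stri_split_regex(cleaned, "\\s+", omit_empty = TRUE))
}

count_ngrams <- function(tokens, n) {
  if (length(tokens) < n) return(table(character(0)))
  grams <- vapply(seq_len(length(tokens) - n + 1),
                  function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
                  character(1))
  sort(table(grams), decreasing = TRUE)  # most frequent first
}

# Note: n-grams here cross line boundaries, which a real model would avoid
tokens   <- tokenize(twitter)
unigrams <- count_ngrams(tokens, 1)
bigrams  <- count_ngrams(tokens, 2)
trigrams <- count_ngrams(tokens, 3)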
We will apply a backoff strategy to predict the next
word when a direct match is not found.
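One simple form of that backoff, sketched below: look up the last two words in the trigram table, fall back to the bigram table on the last word, and finally return the most frequent unigram. It reuses tokenize() and the tables from the sketch above; the exact strategy (e.g. a weighted stupid backoff versus this first-hit version) is still an open design choice.

predict_next <- function(input, trigrams, bigrams, unigrams) {
  words <- tokenize(input)
  n <- length(words)
  if (n >= 2) {
    prefix <- paste(words[n - 1], words[n])
    hits <- trigrams[startsWith(names(trigrams), paste0(prefix, " "))]
    if (length(hits) > 0)  # tables are sorted, so the first hit is the most frequent
      return(stri_extract_last_regex(names(hits)[1], "\\S+"))
  }
  if (n >= 1) {
    hits <- bigrams[startsWith(names(bigrams), paste0(words[n], " "))]
    if (length(hits) > 0)
      return(stri_extract_last_regex(names(hits)[1], "\\S+"))
  }
  names(unigrams)[1]  # no context matched: most frequent word overall
}

predict_next("thanks for the", trigrams, bigrams, unigrams)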
This model will be deployed in a simple Shiny app where
users can enter text and receive word suggestions.
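A minimal sketch of such an app follows; the layout, input IDs, and the call to the predict_next() sketch above are placeholders rather than the final design.

library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("user_text", "Enter a phrase:"),
  textOutput("suggestion")
)

server <- function(input, output) {
  output$suggestion <- renderText({
    req(input$user_text)
    predict_next(input$user_text, trigrams, bigrams, unigrams)
  })
}

shinyApp(ui, server)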