We successfully downloaded and loaded the datasets.
# Change these paths to point to your local copies of the data
blogs <- readLines("en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
length(blogs); length(news); length(twitter)
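On some platforms, notably Windows, reading these files in text mode can stop early at an embedded control character, so the line counts are worth a sanity check. The following is a minimal sketch, not part of the original analysis: `read_corpus` is a hypothetical helper that opens the file through a binary-mode connection before calling `readLines`, and it is demonstrated on a temporary file rather than the real corpus.

```r
# Hypothetical helper (an assumption, not in the original report): read a text
# file through a binary-mode connection so that embedded control characters
# do not end the read early.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Self-contained demonstration on a temporary file instead of the real data.
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_corpus(tmp)
length(demo_lines)  # 3
unlink(tmp)
```

If the counts from this helper match those from the direct `readLines` calls, the text-mode read was not truncated.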
We computed three basic statistics for each dataset: line count, total word count, and average words per line.
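# Summarise a character vector of lines: line count, total words, mean words per line
summary_stats <- function(data) {
  # Split each line on runs of whitespace; "\\s+" needs a doubled backslash in R
  words_per_line <- sapply(strsplit(data, "\\s+"), length)
  data.frame(
    Lines = length(data),
    TotalWords = sum(words_per_line),
    AvgWordsPerLine = round(mean(words_per_line), 2)
  )
}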
stats <- rbind(
Blogs = summary_stats(blogs),
News = summary_stats(news),
Twitter = summary_stats(twitter)
)
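# kable() comes from the knitr package; the namespace prefix avoids a missing library() call
knitr::kable(stats, caption = "Summary Statistics of the Three Datasets")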
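Splitting on whitespace, as `summary_stats` does, is a rough word count, and a small example shows one of its edge cases. The sketch below uses toy input rather than the corpus, and `count_words` is a hypothetical variant that trims each line first; it is an illustration, not the method used for the table.

```r
# Toy input, not taken from the corpus.
toy <- c("the quick brown fox", "  leading spaces here", "")

# Same approach as summary_stats(): split on runs of whitespace.
raw_counts <- sapply(strsplit(toy, "\\s+"), length)

# Hypothetical variant (an assumption): trim first, so leading whitespace
# does not produce an empty first token.
count_words <- function(x) lengths(strsplit(trimws(x), "\\s+"))
trimmed_counts <- count_words(toy)

raw_counts      # 4 4 0
trimmed_counts  # 4 3 0
```

The second line is counted as four words without trimming because `strsplit` returns an empty string before a separator at the start of the input. Lines with leading whitespace are therefore overcounted by one.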
Here we show histograms of words per line to illustrate differences between datasets.
# Words per line for each source ("\\s+" needs a doubled backslash in R)
blogs_words <- sapply(strsplit(blogs, "\\s+"), length)
news_words <- sapply(strsplit(news, "\\s+"), length)
twitter_words <- sapply(strsplit(twitter, "\\s+"), length)
df_plot <- data.frame(
  words = c(blogs_words, news_words, twitter_words),
  source = factor(c(
    rep("Blogs", length(blogs_words)),
    rep("News", length(news_words)),
    rep("Twitter", length(twitter_words))
  ))
)
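library(ggplot2)  # provides ggplot(), geom_histogram(), xlim(), labs()
# Overlaid histograms, one per source; lines longer than 200 words are not shown
ggplot(df_plot, aes(x = words, fill = source)) +
  geom_histogram(bins = 50, alpha = 0.6, position = "identity") +
  xlim(0, 200) +
  labs(title = "Distribution of Words per Line",
       x = "Words per Line", y = "Count")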
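Because `xlim(0, 200)` drops observations outside the limits, it can help to check how much of the data the histogram leaves out. The following is a minimal sketch on made-up word counts; `toy_words` stands in for `blogs_words`, `news_words`, or `twitter_words`, and the numbers are illustrative, not results from the corpus.

```r
# Toy word counts (an assumption), standing in for one of the real vectors.
toy_words <- c(5, 12, 30, 45, 80, 150, 210, 640)

# Share of lines that fall outside the plotted range of 0 to 200 words.
share_clipped <- mean(toy_words > 200)

# Quantiles give a numeric complement to the histogram.
q <- quantile(toy_words, probs = c(0.5, 0.9, 0.99))

share_clipped  # 0.25
```

Running the same two calls on each real vector shows whether the clipped tail is negligible or large enough to mention alongside the plot.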
We confirmed successful data loading and exploration.
Key differences across datasets were identified.
Next step: build an n-gram prediction model and a Shiny app.
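As a rough illustration of the n-gram step, bigram counts can be built in base R. This is only a sketch under simplifying assumptions (whitespace tokenisation, lower-casing, no punctuation handling), run on toy sentences. `make_bigrams` is a hypothetical helper, not the planned implementation.

```r
# Toy sentences, not taken from the corpus.
toy <- c("i love r", "i love data", "r is fun")

# Hypothetical helper (an assumption): lower-case, split on whitespace, and
# pair each word with its successor to form bigrams.
make_bigrams <- function(line) {
  words <- strsplit(tolower(trimws(line)), "\\s+")[[1]]
  if (length(words) < 2) return(character(0))
  paste(head(words, -1), tail(words, -1))
}

# Frequency table of all bigrams, most frequent first.
bigram_counts <- sort(table(unlist(lapply(toy, make_bigrams))), decreasing = TRUE)

bigram_counts[["i love"]]  # 2
```

A prediction model would use tables like this to look up the most frequent word following a given prefix. Doing so on the full corpus would likely require sampling or a dedicated text-mining package.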