The purpose of this report is to present an initial exploration of the text data that will later be used to build a prediction algorithm and Shiny application.
At this stage, the goal is not to build a model, but to understand the size, structure, and basic characteristics of the data.
This report is written for a non-technical audience and highlights only the most important findings.
# Load stringi for stri_count_words(), which is used in the summaries below
library(stringi)

# Example if files are in "final/en_US/" folder
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
cat("Data loaded successfully!\n")
## Data loaded successfully!
data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(stri_count_words(blogs)),
    sum(stri_count_words(news)),
    sum(stri_count_words(twitter))
  )
)
knitr::kable(data_summary, format.args = list(big.mark = ","),
             caption = "Summary of Text Data Sources")
| Source | Lines | Words |
|---|---|---|
| Blogs | 899,288 | 37,546,250 |
| News | 1,010,242 | 34,762,395 |
| Twitter | 2,360,148 | 30,093,413 |
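The table above shows the number of lines and the total number of words in each source. Twitter has the most lines (2,360,148) but the fewest words (30,093,413), while blogs have the fewest lines (899,288) but the most words (37,546,250).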
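As a rough cross-check, dividing each word total by its line count gives the average number of words per line. The sketch below hard-codes the figures from the table rather than recomputing them from the files; the names `lines_per_source`, `words_per_source`, and `avg_words_per_line` are illustrative and not part of the original analysis.

```r
# Figures copied from the summary table above (not recomputed from the files).
lines_per_source <- c(Blogs = 899288, News = 1010242, Twitter = 2360148)
words_per_source <- c(Blogs = 37546250, News = 34762395, Twitter = 30093413)

# Average words per line for each source.
avg_words_per_line <- words_per_source / lines_per_source
print(round(avg_words_per_line, 2))
```

Blog and news lines come out roughly three times longer on average than tweets, which is consistent with Twitter having the most lines yet the fewest words.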
Let’s examine how many words appear in each line across the three sources.
blog_words <- stri_count_words(blogs)
news_words <- stri_count_words(news)
twitter_words <- stri_count_words(twitter)
# Summary statistics
summary_stats <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Mean = c(mean(blog_words, na.rm = TRUE),
           mean(news_words, na.rm = TRUE),
           mean(twitter_words, na.rm = TRUE)),
  Median = c(median(blog_words, na.rm = TRUE),
             median(news_words, na.rm = TRUE),
             median(twitter_words, na.rm = TRUE)),
  Max = c(max(blog_words, na.rm = TRUE),
          max(news_words, na.rm = TRUE),
          max(twitter_words, na.rm = TRUE))
)
knitr::kable(summary_stats, digits = 2,
             caption = "Word Count Statistics per Line")
| Source | Mean | Median | Max |
|---|---|---|---|
| Blogs | 41.75 | 28 | 6726 |
| News | 34.41 | 32 | 1796 |
| Twitter | 12.75 | 12 | 47 |
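For blogs the mean (41.75) is well above the median (28), which usually indicates that a small number of very long lines pull the average up. The toy example below illustrates that effect with made-up counts; `toy_counts` is a hypothetical vector, not a sample from the real data.

```r
# Hypothetical word counts: mostly short lines plus one very long line.
toy_counts <- c(5, 12, 20, 28, 30, 45, 60, 6726)

# The single long line inflates the mean but barely moves the median.
toy_mean <- mean(toy_counts)
toy_median <- median(toy_counts)

cat("Mean:  ", round(toy_mean, 2), "\n")
cat("Median:", toy_median, "\n")
```

For skewed data like this, the median is typically the better description of a "typical" line.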
# Draw the three histograms side by side
par(mfrow = c(1, 3))
hist(blog_words[blog_words < 200],
     breaks = 50,
     main = "Blogs",
     xlab = "Words per Line",
     col = "lightblue",
     border = "white")
hist(news_words[news_words < 200],
     breaks = 50,
     main = "News",
     xlab = "Words per Line",
     col = "lightgreen",
     border = "white")
hist(twitter_words[twitter_words < 200],
     breaks = 50,
     main = "Twitter",
     xlab = "Words per Line",
     col = "lightcoral",
     border = "white")
# Restore the default single-panel layout
par(mfrow = c(1, 1))
Note: The histograms show only lines with fewer than 200 words, so that the bulk of each distribution is visible rather than being compressed by a few very long lines.
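Because the histograms drop lines of 200 words or more, it can be useful to check how much of the data the cutoff actually keeps. The sketch below is one possible way to do that; `share_below` is a hypothetical helper, not part of the original analysis, and it is demonstrated on made-up counts.

```r
# Hypothetical helper: fraction of lines with fewer than `limit` words.
share_below <- function(word_counts, limit = 200) {
  mean(word_counts < limit, na.rm = TRUE)
}

# Made-up counts for illustration only.
toy_counts <- c(12, 35, 80, 150, 199, 200, 450)
toy_share <- share_below(toy_counts)
print(toy_share)
```

Applied to the report's own vectors, this would be `share_below(blog_words)`, `share_below(news_words)`, and `share_below(twitter_words)`. Since the longest tweet has 47 words, the cutoff removes nothing from the Twitter panel.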
Future analysis will include: