This report provides an exploratory analysis of three English text corpora: blogs, news articles, and Twitter posts. These datasets serve as training data for natural language processing applications. The analysis reveals key characteristics of each corpus, including size, vocabulary diversity, and common linguistic patterns.
The analysis examines three text files:

- Blogs: Personal blog posts and articles
- News: News articles from various sources
- Twitter: Social media posts from Twitter
# File paths
blogs_path <- "final/en_US/en_US.blogs.txt"
news_path <- "final/en_US/en_US.news.txt"
twitter_path <- "final/en_US/en_US.twitter.txt"
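Each of these files is large; the character counts in the summary table below imply roughly 160-210 MB of text apiece. A quick base R check of the sizes on disk before loading:

# Check file sizes on disk (in megabytes) before reading into memory
sapply(c(blogs_path, news_path, twitter_path),
function(p) round(file.info(p)$size / 1024^2, 1))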
# Function to read and analyze text files
analyze_text_file <- function(file_path, corpus_name) {
# Read lines (skipNul guards against embedded null bytes;
# warn = FALSE suppresses the missing-final-newline warning)
lines <- readLines(file_path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
# Basic statistics
line_count <- length(lines)
char_count <- sum(nchar(lines))
# Word analysis: split on whitespace (punctuation stays attached to tokens)
words <- unlist(strsplit(paste(lines, collapse = " "), "\\s+"))
words <- words[words != ""] # Remove empty strings left by splitting
word_count <- length(words)
unique_words <- length(unique(tolower(words)))
# Line length analysis
line_lengths <- nchar(lines)
avg_line_length <- mean(line_lengths)
median_line_length <- median(line_lengths)
# Word length analysis
word_lengths <- nchar(words)
avg_word_length <- mean(word_lengths)
return(list(
corpus_name = corpus_name,
line_count = line_count,
char_count = char_count,
word_count = word_count,
unique_words = unique_words,
avg_line_length = avg_line_length,
median_line_length = median_line_length,
avg_word_length = avg_word_length,
lines = lines,
words = words,
line_lengths = line_lengths,
word_lengths = word_lengths
))
}
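The function above keeps the full line and word vectors in its return value, so analyzing all three corpora holds roughly 100 million tokens in memory at once. If that is impractical, the same analysis can be run on a random subset of lines; the sketch below is one way to do it, with sample_fraction (here 10%) as an illustrative parameter. The results reported in this document come from the full files.

# Optional variant: analyze a random sample of lines to limit memory use
analyze_text_sample <- function(file_path, corpus_name, sample_fraction = 0.1) {
lines <- readLines(file_path, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
set.seed(42) # make the sample reproducible
keep <- sort(sample(length(lines), floor(length(lines) * sample_fraction)))
tmp <- tempfile(fileext = ".txt")
writeLines(lines[keep], tmp)
result <- analyze_text_file(tmp, corpus_name)
unlink(tmp)
result
}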
# Analyze all three corpora
blogs_data <- analyze_text_file(blogs_path, "Blogs")
news_data <- analyze_text_file(news_path, "News")
twitter_data <- analyze_text_file(twitter_path, "Twitter")
# Combine results for summary table
make_summary_row <- function(d) {
data.frame(
Corpus = d$corpus_name,
Lines = format(d$line_count, big.mark = ","),
Characters = format(d$char_count, big.mark = ","),
Words = format(d$word_count, big.mark = ","),
Unique_Words = format(d$unique_words, big.mark = ","),
Avg_Line_Length = round(d$avg_line_length, 1),
Avg_Word_Length = round(d$avg_word_length, 1)
)
}
summary_data <- do.call(rbind, lapply(list(blogs_data, news_data, twitter_data), make_summary_row))

| Corpus | Lines | Characters | Words | Unique_Words | Avg_Line_Length | Avg_Word_Length |
|---|---|---|---|---|---|---|
| Blogs | 899,288 | 206,824,509 | 37,334,131 | 964,404 | 230.0 | 4.6 |
| News | 1,010,206 | 203,214,543 | 34,371,031 | 790,241 | 201.2 | 4.9 |
| Twitter | 2,360,148 | 162,122,651 | 30,373,543 | 1,078,280 | 68.7 | 4.4 |
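A simple way to compare vocabulary diversity across corpora of different sizes is the type-token ratio, i.e. unique words divided by total words. From the statistics above, Twitter has the highest ratio (about 0.036, versus 0.026 for blogs and 0.023 for news), consistent with its informal spellings, abbreviations, and hashtags:

# Type-token ratio: unique words as a fraction of total words
ttr <- sapply(list(blogs_data, news_data, twitter_data),
function(d) round(d$unique_words / d$word_count, 4))
names(ttr) <- c("Blogs", "News", "Twitter")
ttr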
# Create line length distribution plots
par(mfrow = c(2, 3))
# Histograms
hist(blogs_data$line_lengths, main = "Blogs - Line Length Distribution",
xlab = "Characters per Line", col = "lightblue", breaks = 50)
hist(news_data$line_lengths, main = "News - Line Length Distribution",
xlab = "Characters per Line", col = "lightgreen", breaks = 50)
hist(twitter_data$line_lengths, main = "Twitter - Line Length Distribution",
xlab = "Characters per Line", col = "lightcoral", breaks = 50)
# Box plots
boxplot(blogs_data$line_lengths, main = "Blogs - Line Length", col = "lightblue")
boxplot(news_data$line_lengths, main = "News - Line Length", col = "lightgreen")
boxplot(twitter_data$line_lengths, main = "Twitter - Line Length", col = "lightcoral")

# Create word length distribution plots
par(mfrow = c(2, 3))
# Histograms
hist(blogs_data$word_lengths, main = "Blogs - Word Length Distribution",
xlab = "Characters per Word", col = "lightblue", breaks = 30)
hist(news_data$word_lengths, main = "News - Word Length Distribution",
xlab = "Characters per Word", col = "lightgreen", breaks = 30)
hist(twitter_data$word_lengths, main = "Twitter - Word Length Distribution",
xlab = "Characters per Word", col = "lightcoral", breaks = 30)
# Box plots
boxplot(blogs_data$word_lengths, main = "Blogs - Word Length", col = "lightblue")
boxplot(news_data$word_lengths, main = "News - Word Length", col = "lightgreen")
boxplot(twitter_data$word_lengths, main = "Twitter - Word Length", col = "lightcoral")

# Function to get most common words
get_common_words <- function(words, top_n = 20) {
word_freq <- table(tolower(words))
word_freq <- sort(word_freq, decreasing = TRUE)
return(head(word_freq, top_n))
}
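Because no stopword filtering is applied, the top of each frequency table will be dominated by function words such as "the" and "to". If content words are of more interest, a short stopword list can be removed first; a sketch with an illustrative (deliberately incomplete) stopword vector:

# Variant that filters a short, illustrative stopword list before counting
get_common_content_words <- function(words, top_n = 20) {
stopwords <- c("the", "to", "and", "a", "of", "in", "i", "is", "that", "it",
"for", "on", "you", "was", "with")
w <- tolower(words)
w <- w[!w %in% stopwords]
head(sort(table(w), decreasing = TRUE), top_n)
}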
# Get common words for each corpus
blogs_common <- get_common_words(blogs_data$words)
news_common <- get_common_words(news_data$words)
twitter_common <- get_common_words(twitter_data$words)
# Create word frequency plots
par(mfrow = c(3, 1))
# Blogs
barplot(blogs_common[1:10], main = "Blogs - Top 10 Most Common Words",
col = "lightblue", las = 2, cex.names = 0.8)
# News
barplot(news_common[1:10], main = "News - Top 10 Most Common Words",
col = "lightgreen", las = 2, cex.names = 0.8)
# Twitter
barplot(twitter_common[1:10], main = "Twitter - Top 10 Most Common Words",
col = "lightcoral", las = 2, cex.names = 0.8)# Create comparison data frame
comparison_df <- data.frame(
Corpus = c("Blogs", "News", "Twitter"),
Word_Count = c(blogs_data$word_count, news_data$word_count, twitter_data$word_count),
Unique_Words = c(blogs_data$unique_words, news_data$unique_words, twitter_data$unique_words),
Avg_Line_Length = c(blogs_data$avg_line_length, news_data$avg_line_length, twitter_data$avg_line_length)
)
# Normalize for better comparison
comparison_df$Word_Count_Norm <- comparison_df$Word_Count / max(comparison_df$Word_Count)
comparison_df$Unique_Words_Norm <- comparison_df$Unique_Words / max(comparison_df$Unique_Words)
comparison_df$Avg_Line_Length_Norm <- comparison_df$Avg_Line_Length / max(comparison_df$Avg_Line_Length)
# Create radar chart-like comparison
par(mfrow = c(1, 1))
plot(1, type = "n", xlim = c(0, 4), ylim = c(0, 1),
main = "Corpus Characteristics Comparison (Normalized)",
xlab = "", ylab = "Normalized Value", xaxt = "n")
# Add axis labels
axis(1, at = 1:3, labels = c("Word Count", "Unique Words", "Avg Line Length"))
# Add lines for each corpus
lines(1:3, c(comparison_df$Word_Count_Norm[1], comparison_df$Unique_Words_Norm[1], comparison_df$Avg_Line_Length_Norm[1]),
col = "blue", lwd = 2, type = "b", pch = 19)
lines(1:3, c(comparison_df$Word_Count_Norm[2], comparison_df$Unique_Words_Norm[2], comparison_df$Avg_Line_Length_Norm[2]),
col = "green", lwd = 2, type = "b", pch = 19)
lines(1:3, c(comparison_df$Word_Count_Norm[3], comparison_df$Unique_Words_Norm[3], comparison_df$Avg_Line_Length_Norm[3]),
col = "red", lwd = 2, type = "b", pch = 19)
legend("topright", legend = c("Blogs", "News", "Twitter"),
col = c("blue", "green", "red"), lwd = 2, pch = 19)## === SAMPLE FROM BLOGS ===
## In the years thereafter, most of the Oil fields and platforms were named after pagan “gods”.
##
## We love you Mr. Brown.
##
## Chad has been awesome with the kids and holding down the fort while I work later than usual! The kids have been busy together playing Skylander on the XBox together, after Kyan cashed in his $$$ from his piggy bank. He wanted that game so bad and used his gift card from his birthday he has been saving and the money to get it (he never taps into that thing either, that is how we know he wanted it so bad). We made him count all of his money to make sure that he had enough! It was very cute to watch his reaction when he realized he did! He also does a very good job of letting Lola feel like she is playing too, by letting her switch out the characters! She loves it almost as much as him.
##
## === SAMPLE FROM NEWS ===
## He wasn't home alone, apparently.
##
## The St. Louis plant had to close. It would die of old age. Workers had been making cars there since the onset of mass automotive production in the 1920s.
##
## WSU's plans quickly became a hot topic on local online sites. Though most people applauded plans for the new biomedical center, many deplored the potential loss of the building.
##
## === SAMPLE FROM TWITTER ===
## How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long.
##
## When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason.
##
## they've decided its more fun if I don't.
Data Volume: The combined dataset contains 102,078,705 words (roughly 102 million), providing substantial training material for language models.
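The total follows directly from the per-corpus counts in the summary table:

# Combined word count across the three corpora
total_words <- blogs_data$word_count + news_data$word_count + twitter_data$word_count
format(total_words, big.mark = ",") # 37,334,131 + 34,371,031 + 30,373,543 = 102,078,705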
Diversity: Each corpus offers distinct characteristics: blogs have the longest lines on average (230.0 characters), news articles use the longest words (4.9 characters on average), and Twitter combines the shortest lines (68.7 characters, reflecting the platform's character limit) with the largest unique vocabulary (1,078,280 words).
Quality Indicators: The statistics and samples above suggest the files loaded cleanly, though the whitespace tokenizer leaves punctuation attached to words, which inflates the unique-word counts.
This analysis provides a solid foundation for understanding the training data characteristics and informing downstream natural language processing applications.