This code summarizes the three text files (blogs, news, and tweets) by
word count and line count. For each file it reads all lines with
readLines(), splits every line on whitespace and totals the resulting
tokens, and counts the lines that are not empty. The results are bound
into a table with one row per file, giving the file name, word count,
and line count.
# Load libraries
library(tidyverse)
# Define file names
file_names <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
# Function to calculate word and line count
calculate_summary <- function(file_path) {
  # Read the file content, one element per line
  text <- readLines(file_path)
  # Calculate word count: split each line on whitespace and total the tokens
  word_count <- sum(lengths(str_split(text, "\\s+")))
  # Calculate line count, ignoring empty lines
  line_count <- sum(nzchar(text))
  # Return a named vector with summary statistics
  # (c() coerces the counts to character because file is a string)
  c(
    file = file_path,
    word_count = word_count,
    line_count = line_count
  )
}
# Apply the function to each file and store results
summaries <- lapply(file_names, calculate_summary)
# Create a summary table
summary_df <- do.call(rbind, summaries)
print(summary_df)
##      file                word_count line_count
## [1,] "en_US.blogs.txt"   "37334131" "899288"
## [2,] "en_US.news.txt"    "2643969"  "77259"
## [3,] "en_US.twitter.txt" "30373545" "2360148"
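Because c() combines a file name with numbers, the matrix printed above holds every count as a character string. The following is a minimal sketch of one alternative that keeps the counts numeric by returning a one-row data frame per file. The name summarise_file and the temporary demo file are hypothetical and not part of the original analysis. The sketch also drops empty tokens, so on the real files its word counts could differ slightly from the figures above.

```r
library(stringr)

# Hypothetical variant of calculate_summary() that returns a one-row data frame
summarise_file <- function(file_path) {
  text <- readLines(file_path, warn = FALSE)
  # Split each line on whitespace, then ignore the empty tokens that blank
  # lines and leading/trailing spaces would otherwise contribute
  tokens <- unlist(str_split(text, "\\s+"))
  data.frame(
    file = basename(file_path),
    word_count = sum(nzchar(tokens)),
    line_count = sum(nzchar(text))
  )
}

# Demo on a small temporary file (assumed data, for illustration only)
demo_path <- tempfile(fileext = ".txt")
writeLines(c("one two three", "", "four five"), demo_path)

# rbind() on data frames keeps word_count and line_count as integers
demo_summary <- do.call(rbind, lapply(demo_path, summarise_file))
print(demo_summary)
```

With numeric columns the table can be sorted or plotted directly, without converting the counts back from text.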
This code analyzes word length in the text files. It splits the text of
each file (blogs, news, tweets) into words, counts the characters in
each word, and averages those counts. It then draws a bar chart
comparing the average word length across the three datasets.
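# Read each file's contents so the function receives text rather than a file name
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")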
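# Function to calculate average word length
calculate_avg_word_length <- function(text) {
  # Split on whitespace and flatten to a single vector of words
  words <- text %>%
    str_split("\\s+") %>%
    unlist()
  # Drop empty strings left by blank lines or leading/trailing whitespace
  words <- words[nzchar(words)]
  mean(nchar(words))
}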
# Calculate average word length for each file
avg_word_length_blogs <- calculate_avg_word_length(blogs)
avg_word_length_news <- calculate_avg_word_length(news)
avg_word_length_twitter <- calculate_avg_word_length(twitter)
# Combine results into a data frame
avg_lengths <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Avg_Word_Length = c(avg_word_length_blogs, avg_word_length_news, avg_word_length_twitter)
)
# Create a bar plot
ggplot(avg_lengths, aes(x = Dataset, y = Avg_Word_Length)) +
  geom_col(fill = "steelblue") +
  labs(title = "Average Word Length by Dataset", x = "Dataset", y = "Average Word Length")
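As a quick sanity check of the word-length calculation, here is a minimal sketch on a tiny in-memory sample where the expected mean can be worked out by hand. The sample text and the name avg_word_length are hypothetical. The sketch assumes that empty tokens should be ignored before averaging.

```r
library(stringr)

# Hypothetical helper: mean number of characters per whitespace-separated word
avg_word_length <- function(text) {
  words <- unlist(str_split(text, "\\s+"))
  words <- words[nzchar(words)]  # ignore empty tokens from blank lines
  mean(nchar(words))
}

# Assumed sample: the word lengths are 3, 5, 5, 3 and 5 characters
sample_text <- c("the quick brown fox", "", "jumps")
sample_avg <- avg_word_length(sample_text)
print(sample_avg)  # (3 + 5 + 5 + 3 + 5) / 5 = 4.2
```

The input here is a character vector of lines, as returned by readLines(). Passing a file name string instead would measure the characters of the name itself.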
