Introduction: This project aims to explore three large text datasets, understand their characteristics, and lay the groundwork for building a predictive model and a user-friendly application.

Data Loading: The necessary text data was downloaded from the specified source and loaded into R for analysis.
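
The loading step can be sketched roughly as follows. This is a minimal sketch under assumptions, not the code used for this report: the helper name `read_corpus_file` and its `data_dir` argument are hypothetical, and the files are assumed to be UTF-8 text. Opening the connection in binary mode is a precaution so that stray control characters in the raw text do not end the read early on some platforms.

```r
# Hypothetical helper: read one text file into a character vector,
# one element per line.
read_corpus_file <- function(file_name, data_dir = ".") {
  path <- file.path(data_dir, file_name)
  con <- file(path, open = "rb")   # binary mode: control characters are not treated specially
  on.exit(close(con), add = TRUE)  # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Usage: load whichever of the three files are present in the working directory
file_names <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
present <- file_names[file.exists(file_names)]
corpus <- lapply(setNames(present, present), read_corpus_file)
```

The result is a named list, so `corpus[["en_US.blogs.txt"]]` would hold the blog lines if that file exists.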

Data Summaries: Basic statistics, namely word and line counts, were calculated for each text file to gauge the scale of the data, and the average word length was compared across the three files.

This code processes the three text files (blogs, news, tweets) and calculates the total word and line count for each. It reads each file, splits every line on whitespace to count words, and counts the non-empty lines. The results are combined into a table with one row per file, showing the file name, word count, and line count.

# Load libraries
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.2     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.0
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.2     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# Define file names
file_names <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")

# Function to calculate word and line count
calculate_summary <- function(file_path) {
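  # Read the file content, one element per line
  text <- readLines(file_path)
  
  # Calculate word count: split each line on runs of whitespace and total the pieces
  # (an empty line still contributes one empty piece)
  word_count <- sum(str_split(text, "\\s+") %>% lengths())
  
  # Calculate line count, ignoring empty lines
  line_count <- sum(nzchar(text))
  
  # Return a named vector; c() coerces the counts to character alongside the file name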
  c(
    file = file_path,
    word_count = word_count,
    line_count = line_count
  )
}

# Apply the function to each file and store results
summaries <- lapply(file_names, calculate_summary)

# Create a summary table
summary_df <- do.call(rbind, summaries)
print(summary_df)
##      file                word_count line_count
## [1,] "en_US.blogs.txt"   "37334131" "899288"  
## [2,] "en_US.news.txt"    "2643969"  "77259"   
## [3,] "en_US.twitter.txt" "30373545" "2360148"
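
Because `c()` mixes a file name with numbers, the table above is a character matrix, and every empty line adds one empty token to the word count. The news counts are also far smaller than those of the other two files, which may mean the read stopped early and is worth checking before relying on the numbers. One possible variant, sketched under assumptions (the helper name `summarise_lines` is hypothetical, and it takes lines already read into memory rather than a path), keeps the counts numeric and drops empty tokens. Its word counts would therefore differ slightly from those printed above.

```r
# Hypothetical helper: summarise a character vector of lines as one data-frame row.
summarise_lines <- function(lines, label) {
  tokens <- unlist(strsplit(lines, "\\s+"))  # split every line on runs of whitespace
  tokens <- tokens[nzchar(tokens)]           # drop empty tokens from blank or indented lines
  data.frame(
    file = label,
    word_count = length(tokens),
    line_count = sum(nzchar(lines)),         # non-empty lines only
    stringsAsFactors = FALSE
  )
}

# Toy input standing in for a real file
demo_lines <- c("the quick brown fox", "", "  jumps over")
demo_summary <- summarise_lines(demo_lines, "demo")
print(demo_summary)
```

Rows for several files can be stacked with `do.call(rbind, ...)` as before, and the count columns stay numeric, so they can be sorted or plotted directly.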

This code analyzes word length in the text files. It breaks each file (blogs, news, tweets) into words and counts the characters in each word. It then creates a chart visualizing the average word length for each dataset (blogs vs. news vs. tweets).

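# Read each file into a character vector of lines
# (passing the file name itself would measure the name, not the text)
blogs <- readLines("en_US.blogs.txt")
news <- readLines("en_US.news.txt")
twitter <- readLines("en_US.twitter.txt")

# Function to calculate average word length
calculate_avg_word_length <- function(text) {
  words <- text %>%
    str_split("\\s+") %>%
    unlist()
  words <- words[nzchar(words)]  # drop empty strings left by blank or padded lines
  mean(nchar(words))
}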

# Calculate average word length for each file
avg_word_length_blogs <- calculate_avg_word_length(blogs)
avg_word_length_news <- calculate_avg_word_length(news)
avg_word_length_twitter <- calculate_avg_word_length(twitter)

# Combine results into a data frame
avg_lengths <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Avg_Word_Length = c(avg_word_length_blogs, avg_word_length_news, avg_word_length_twitter)
)

# Create a bar plot
ggplot(avg_lengths, aes(x = Dataset, y = Avg_Word_Length)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  labs(title = "Average Word Length by Dataset", x = "Dataset", y = "Average Word Length")
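
An average alone says little about how word lengths are spread. As a minimal sketch (the helper `word_length_table` is hypothetical, and tokens are simply whitespace-separated strings, so punctuation attached to a word counts toward its length), the share of words at each length could be tabulated like this:

```r
# Hypothetical helper: share of tokens at each word length.
word_length_table <- function(lines) {
  tokens <- unlist(strsplit(lines, "\\s+"))  # split on runs of whitespace
  tokens <- tokens[nzchar(tokens)]           # ignore empty tokens
  lengths_tbl <- table(nchar(tokens))        # count tokens per length
  data.frame(
    word_length = as.integer(names(lengths_tbl)),
    share = as.numeric(lengths_tbl) / length(tokens)
  )
}

# Toy input standing in for a real file
demo_dist <- word_length_table(c("a bb cc", "ddd a"))
print(demo_dist)
```

Applied to each dataset in turn, the resulting data frames could be passed to `ggplot()` to compare the three distributions side by side.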

Future Plans: A predictive model will be developed to extract useful information from the text data, and a Shiny app will be created to provide an interactive interface for users to explore the results.