1. Project Goal

This report explores the text data from blogs, news, and Twitter. The goal is to understand the structure of this data to inform the creation of a predictive text algorithm and a user-friendly Shiny application.

2. Data Loading & Basic Summaries

First, we load the three files and compute basic statistics for each. The table below confirms that the data loaded successfully and summarizes line counts, approximate word counts, and file sizes.

# Define the file paths. Adjust these if your files are stored elsewhere;
# relative paths are resolved against the current working directory.
blog_path <- "en_US.blogs.txt"
news_path <- "en_US.news.txt"
twitter_path <- "en_US.twitter.txt"

# Read all lines from a file in binary mode, marking the text as UTF-8
# and skipping embedded NUL bytes
read_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # close the connection even if readLines() fails
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Load the data
blogs <- read_file(blog_path)
news <- read_file(news_path)
twitter <- read_file(twitter_path)

# Calculate basic statistics
library(stringr)

# Approximate words per line as (runs of non-word characters) + 1.
# This is a rough estimate: leading or trailing punctuation inflates the count.
count_words <- function(x) str_count(x, "\\W+") + 1

texts <- list(Blogs = blogs, News = news, Twitter = twitter)
paths <- c(blog_path, news_path, twitter_path)
words <- lapply(texts, count_words)  # count once per source, reuse below

summary_data <- data.frame(
  Source = names(texts),
  `Number of Lines` = lengths(texts, use.names = FALSE),
  `Total Words` = vapply(words, sum, numeric(1), USE.NAMES = FALSE),
  `Mean Words per Line` = vapply(words, mean, numeric(1), USE.NAMES = FALSE),
  `File Size (MB)` = round(file.info(paths)$size / 1024^2, 2),
  check.names = FALSE  # keep readable column names instead of Number.of.Lines
)

# Display the table nicely
library(knitr)
kable(summary_data, caption = "Summary Statistics of the Three Text Data Sources", align = c('l', 'r', 'r', 'r', 'r'))
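Summary Statistics of the Three Text Data Sources

| Source  | Number of Lines | Total Words | Mean Words per Line | File Size (MB) |
|:--------|----------------:|------------:|--------------------:|---------------:|
| Blogs   |          899288 |    39116174 |            43.49683 |         200.42 |
| News    |         1010242 |    36719075 |            36.34681 |         196.28 |
| Twitter |         2360148 |    32793641 |            13.89474 |         159.36 |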
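
The word counts above come from counting runs of non-word characters and adding one, which is only an approximation. The sketch below compares that rule with a simple token count on three sample lines. It uses base R only, so the `\\W+` pattern here stands in for the stringr call and may differ from it on non-ASCII text. The helper names `count_nonword_runs` and `count_tokens` and the sample lines are hypothetical and not part of the original analysis.

```r
# Two ways to count words per line, using base R only.
# Helper names and sample lines are illustrative assumptions.

# The approximation used for the summary table: non-word runs + 1
count_nonword_runs <- function(x) {
  vapply(gregexpr("\\W+", x, perl = TRUE),
         function(m) sum(m > 0),   # gregexpr returns -1 when nothing matches
         integer(1)) + 1L
}

# An alternative: count word-like tokens (letters, digits, apostrophes)
count_tokens <- function(x) {
  lengths(regmatches(x, gregexpr("[[:alnum:]']+", x)))
}

sample_lines  <- c("Hello, world!", "  two words", "")
approx_counts <- count_nonword_runs(sample_lines)  # 3 3 1
token_counts  <- count_tokens(sample_lines)        # 2 2 0

print(data.frame(line = sample_lines,
                 approx = approx_counts,
                 tokens = token_counts))
```

Trailing punctuation, leading whitespace, and empty lines each add one extra "word" under the approximation, so the Total Words and Mean Words per Line figures in the table may overstate what a token-based count would give. For exploratory purposes the relative ordering of the three sources is probably unaffected, but a token-based count would be a safer basis for any later modeling decisions.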