Exploratory Data Analysis for Text Prediction Algorithm

2. Data Loading & Basic Summaries

First, we load the data and calculate basic statistics. The table below confirms that all three files were read successfully and summarizes each source: line count, approximate word count, mean words per line, and file size on disk.

# Define the file paths. CHANGE THESE PATHS IF YOUR FILES ARE IN A DIFFERENT LOCATION.
# The files MUST be in your RStudio project folder or the path must be correct.
blog_path <- "en_US.blogs.txt"
news_path <- "en_US.news.txt"
twitter_path <- "en_US.twitter.txt"

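# Function to safely read files and handle encoding
read_file <- function(path) {
  # Binary mode reads the bytes as-is, without platform-specific text-mode translation
  con <- file(path, open="rb")
  on.exit(close(con))  # close the connection even if readLines() fails
  # skipNul=TRUE drops embedded NUL bytes instead of truncating the line at them
  lines <- readLines(con, encoding="UTF-8", skipNul=TRUE)
  return(lines)
}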

# Load the data
blogs <- read_file(blog_path)
news <- read_file(news_path)
twitter <- read_file(twitter_path)

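# Calculate basic statistics
library(stringr)

# Approximate the words on each line as (runs of non-word characters) + 1.
# This overcounts lines that start or end with punctuation, so treat the
# totals as estimates.
count_words <- function(lines) str_count(lines, "\\W+") + 1
# File size in megabytes (1 MB = 1024^2 bytes), rounded to two decimals
size_mb <- function(path) round(file.info(path)$size / 1024^2, 2)

blog_words <- count_words(blogs)
news_words <- count_words(news)
twitter_words <- count_words(twitter)

# check.names = FALSE keeps the spaces in the column names; without it,
# data.frame() rewrites them as Number.of.Lines, Total.Words, and so on.
summary_data <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  `Number of Lines` = c(length(blogs), length(news), length(twitter)),
  `Total Words` = c(sum(blog_words), sum(news_words), sum(twitter_words)),
  `Mean Words per Line` = c(mean(blog_words), mean(news_words), mean(twitter_words)),
  `File Size (MB)` = c(size_mb(blog_path), size_mb(news_path), size_mb(twitter_path)),
  check.names = FALSE
)

# Display the table nicely
library(knitr)
kable(summary_data, caption = "Summary Statistics of the Three Text Data Sources", align = c('l', 'r', 'r', 'r', 'r'))

Summary Statistics of the Three Text Data Sources
Source	Number of Lines	Total Words	Mean Words per Line	File Size (MB)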
Blogs	899288	39116174	43.49683	200.42
News	1010242	36719075	36.34681	196.28
Twitter	2360148	32793641	13.89474	159.36
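
The word totals in the table rest on a heuristic: count the runs of non-word characters on each line and add one. The sketch below compares that heuristic with a plain whitespace split on three made-up lines. It uses base R rather than stringr so that it runs without extra packages. The helper `count_matches` and the example strings are illustrative assumptions, not part of the original analysis.

```r
# Hypothetical helper: number of regex matches in each element of x.
count_matches <- function(x, pattern) {
  lengths(regmatches(x, gregexpr(pattern, x, perl = TRUE)))
}

# Made-up example lines (not taken from the corpus).
examples <- c("Hello, world!", "don't stop", "")

# Heuristic used for the summary table: runs of non-word characters + 1.
heuristic <- count_matches(examples, "\\W+") + 1

# Alternative: count runs of non-whitespace characters.
whitespace <- count_matches(examples, "\\S+")

comparison <- data.frame(line = examples, heuristic, whitespace)
print(comparison)
```

On these lines the heuristic reports 3, 3, and 1 words where the whitespace split reports 2, 2, and 0. Trailing punctuation, apostrophes, and empty lines all shift the count, so the totals in the table are best read as estimates.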

Rishi Jain

December 23, 2025
