This report explores the text data from blogs, news, and Twitter. The goal is to understand the structure of this data to inform the creation of a predictive text algorithm and a user-friendly Shiny application.
First, we load the data and calculate basic statistics. The table below confirms that all three files were read successfully and summarizes their key features: line counts, approximate word counts, and file sizes.
# Define the file paths. Adjust them if your files are stored elsewhere.
# Relative paths are resolved against the working directory (the RStudio project folder by default).
blog_path <- "en_US.blogs.txt"
news_path <- "en_US.news.txt"
twitter_path <- "en_US.twitter.txt"
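# Read a text file as UTF-8 lines.
# Binary mode ("rb") keeps embedded control characters from ending the read early,
# and skipNul = TRUE drops embedded NUL bytes.
read_file <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))  # close the connection even if readLines() fails
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}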
# Load the data
blogs <- read_file(blog_path)
news <- read_file(news_path)
twitter <- read_file(twitter_path)
# Calculate basic statistics
library(stringr)
# Approximate the words per line: count runs of non-word characters and add 1.
# This is a rough estimate; punctuation and apostrophes can inflate the count.
count_words <- function(lines) str_count(lines, "\\W+") + 1
blog_words <- count_words(blogs)
news_words <- count_words(news)
twitter_words <- count_words(twitter)
# File size in megabytes, rounded to two decimals
file_size_mb <- function(path) round(file.info(path)$size / (1024^2), 2)
summary_data <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  `Number of Lines` = c(length(blogs), length(news), length(twitter)),
  `Total Words` = c(sum(blog_words), sum(news_words), sum(twitter_words)),
  `Mean Words per Line` = c(mean(blog_words), mean(news_words), mean(twitter_words)),
  `File Size (MB)` = c(file_size_mb(blog_path), file_size_mb(news_path), file_size_mb(twitter_path)),
  check.names = FALSE  # keep the column names as written instead of Number.of.Lines etc.
)
# Display the table
library(knitr)
kable(summary_data, caption = "Summary Statistics of the Three Text Data Sources", align = c("l", "r", "r", "r", "r"))
| Source | Number of Lines | Total Words | Mean Words per Line | File Size (MB) |
|:---|---:|---:|---:|---:|
| Blogs | 899288 | 39116174 | 43.49683 | 200.42 |
| News | 1010242 | 36719075 | 36.34681 | 196.28 |
| Twitter | 2360148 | 32793641 | 13.89474 | 159.36 |
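
The word totals above come from counting runs of non-word characters (`"\\W+"`) and adding 1, which is a heuristic rather than a true word count. As a rough check of how it behaves, the sketch below compares it with a plain whitespace split on three made-up lines. It uses base R only; the helper names `count_nonword_runs` and `count_whitespace_tokens` and the sample lines are illustrative assumptions, not part of the original analysis.

```r
# Illustrative sample lines (made up for this check, not taken from the corpus)
sample_lines <- c("Hello, world!", "don't stop", "")

# Hypothetical helper: base-R equivalent of str_count(x, "\\W+") + 1
count_nonword_runs <- function(x) {
  lengths(regmatches(x, gregexpr("\\W+", x, perl = TRUE))) + 1
}

# Hypothetical helper: number of whitespace-separated tokens per line
count_whitespace_tokens <- function(x) {
  x <- trimws(x)
  n <- lengths(strsplit(x, "\\s+"))
  n[x == ""] <- 0L  # an empty line contains no tokens
  n
}

heuristic <- count_nonword_runs(sample_lines)
tokens <- count_whitespace_tokens(sample_lines)

# Side-by-side comparison of the two counts
print(data.frame(line = sample_lines, heuristic = heuristic, tokens = tokens))
```

On these samples the heuristic reports 3, 3, and 1 words where whitespace splitting reports 2, 2, and 0. Trailing punctuation, apostrophes, and empty lines each add to the heuristic's count, so the totals in the table are best read as approximations.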