1. Introduction

The goal of this project is to build a predictive text model using a large corpus of English text sourced from blogs, news articles, and Twitter posts. This report performs an initial exploratory analysis of the dataset and outlines the planned steps for building the prediction model and Shiny web application.


2. Data Loading

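# Packages used throughout this report
library(stringi); library(knitr)                      # stri_count_words(), kable()
library(dplyr); library(tidytext); library(ggplot2)   # %>%, unnest_tokens(), ggplot()

# Adjust file paths if needed
dataset_path <- "./"
blogs <- readLines(paste0(dataset_path, "en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
news <- readLines(paste0(dataset_path, "en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(paste0(dataset_path, "en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)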

3. Summary Statistics

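# Number of lines (one entry per line read) in each file
blogs_lines <- length(blogs)
news_lines <- length(news)
twitter_lines <- length(twitter)

# Total word count per file, using stringi's word-boundary counter
blogs_words <- sum(stri_count_words(blogs))
news_words <- sum(stri_count_words(news))
twitter_words <- sum(stri_count_words(twitter))

# Length of the longest line in each file, in characters
blogs_max <- max(nchar(blogs))
news_max <- max(nchar(news))
twitter_max <- max(nchar(twitter))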

summary_df <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Line_Count = c(blogs_lines, news_lines, twitter_lines),
  Word_Count = c(blogs_words, news_words, twitter_words),
  Max_Line_Length = c(blogs_max, news_max, twitter_max)
)

kable(summary_df)
Dataset   Line_Count   Word_Count   Max_Line_Length
Blogs         899288     37546806             40833
News         1010206     34761151             11384
Twitter      2360148     30096690               140
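
The line and word counts above can be combined into an average words-per-line figure, which makes the three sources easier to compare. The following is a minimal sketch, not part of the original analysis: counts_df and the Words_Per_Line column are illustrative names, and the numbers are copied by hand from the summary table.

```r
# Counts copied from the summary table in this report
counts_df <- data.frame(
  Dataset    = c("Blogs", "News", "Twitter"),
  Line_Count = c(899288, 1010206, 2360148),
  Word_Count = c(37546806, 34761151, 30096690),
  stringsAsFactors = FALSE
)

# Words_Per_Line is a derived column (an assumption of this sketch,
# not something the report itself computes)
counts_df$Words_Per_Line <- round(counts_df$Word_Count / counts_df$Line_Count, 1)

print(counts_df)
```

On these counts, blog lines average about 42 words, news lines about 34, and tweets about 13, which is consistent with Twitter's 140-character maximum line length.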

4. Word Frequencies

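# Take the first 5,000 blog lines (a head slice, not a random sample)
sample_size <- 5000
blogs_sample <- blogs[1:sample_size]
blogs_df <- data.frame(text = blogs_sample, stringsAsFactors = FALSE)

# Split into one lowercase word per row, drop English stop words, then count
blogs_words_df <- blogs_df %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(word, sort = TRUE)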

# Keep the 20 most frequent words (ties at the cutoff are all kept)
top_words <- blogs_words_df %>% top_n(20, wt = n)
ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Frequent Words in Blogs Sample", x = "Word", y = "Count")
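
The same counting idea extends from single words to adjacent word pairs (bigrams), which is the kind of statistic an n-gram model is built on. Below is a minimal base-R sketch on a made-up three-line corpus. The names toy_text and count_bigrams are hypothetical and not part of the report's code, and the simple letters-only tokenizer is an assumption that will not match tidytext's tokenization exactly.

```r
# Hypothetical toy corpus standing in for blogs_sample
toy_text <- c("I love data science",
              "I love R",
              "data science is fun")

# count_bigrams is an illustrative helper, not part of the report's code
count_bigrams <- function(lines) {
  bigrams <- unlist(lapply(lines, function(line) {
    # Lowercase, then split on anything that is not a letter or apostrophe
    words <- strsplit(tolower(line), "[^a-z']+")[[1]]
    words <- words[words != ""]
    if (length(words) < 2) return(character(0))
    # Pair each word with its successor within the same line
    paste(head(words, -1), tail(words, -1))
  }))
  # Tabulate the pairs and order them from most to least frequent
  counts <- sort(table(bigrams), decreasing = TRUE)
  data.frame(bigram = names(counts), n = as.integer(counts),
             stringsAsFactors = FALSE)
}

bigram_counts <- count_bigrams(toy_text)
print(bigram_counts)
```

In this toy corpus "i love" and "data science" each occur twice and every other pair once. On the real sample, a comparable result could likely be obtained with tidytext by calling unnest_tokens() with token = "ngrams" and n = 2, though stop-word handling for pairs would need a separate decision.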


6. Prediction Plan

I will create a word prediction algorithm using n-gram language modeling. The steps include: