The goal of this project is to build a predictive text model using a large corpus of English text sourced from blogs, news articles, and Twitter posts. This report performs an initial exploratory analysis of the dataset and outlines the planned steps for building the prediction model and Shiny web application.
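# Load the packages used throughout the analysis
library(stringi)   # stri_count_words()
library(knitr)     # kable()
library(dplyr)     # %>%, filter(), count(), top_n()
library(tidytext)  # unnest_tokens(), stop_words
library(ggplot2)   # plotting
# Adjust file paths if needed
dataset_path <- "./"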
blogs <- readLines(paste0(dataset_path, "en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
news <- readLines(paste0(dataset_path, "en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines(paste0(dataset_path, "en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
# Number of lines in each file
blogs_lines <- length(blogs)
news_lines <- length(news)
twitter_lines <- length(twitter)
# Total number of words in each file
blogs_words <- sum(stri_count_words(blogs))
news_words <- sum(stri_count_words(news))
twitter_words <- sum(stri_count_words(twitter))
# Length, in characters, of the longest line in each file
blogs_max <- max(nchar(blogs))
news_max <- max(nchar(news))
twitter_max <- max(nchar(twitter))
# Collect the statistics into one summary table
summary_df <- data.frame(
  Dataset = c("Blogs", "News", "Twitter"),
  Line_Count = c(blogs_lines, news_lines, twitter_lines),
  Word_Count = c(blogs_words, news_words, twitter_words),
  Max_Line_Length = c(blogs_max, news_max, twitter_max)
)
kable(summary_df)
| Dataset | Line_Count | Word_Count | Max_Line_Length |
|---|---|---|---|
| Blogs | 899288 | 37546806 | 40833 |
| News | 1010206 | 34761151 | 11384 |
| Twitter | 2360148 | 30096690 | 140 |
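The three statistics in the table follow the same pattern for every corpus, so they could also be computed by a single helper applied to each one. The following is only a minimal sketch under stated assumptions: `corpora` holds toy stand-in lines rather than the real files, `summarise_corpus` and `toy_summary` are hypothetical names, and words are counted with a plain whitespace split in base R instead of `stri_count_words()`, so counts on real text may differ slightly from those reported above.

```r
# Toy stand-ins for the three corpora (hypothetical data, not the real files)
corpora <- list(
  Blogs   = c("a short blog line", "another one"),
  News    = c("news line"),
  Twitter = c("tweet", "second tweet here")
)

# Compute line count, word count, and longest line for one character vector.
# Assumption: words are separated by whitespace (a simplification of
# stri_count_words(), which uses ICU word-boundary rules).
summarise_corpus <- function(lines) {
  c(
    Line_Count      = length(lines),
    Word_Count      = sum(lengths(strsplit(trimws(lines), "\\s+"))),
    Max_Line_Length = max(nchar(lines))
  )
}

# Apply the helper to every corpus; vapply() returns one column per corpus,
# so transpose to get one row per corpus.
template <- c(Line_Count = 0, Word_Count = 0, Max_Line_Length = 0)
summary_mat <- t(vapply(corpora, summarise_corpus, template))

toy_summary <- data.frame(
  Dataset = rownames(summary_mat),
  summary_mat,
  row.names = NULL
)
print(toy_summary)
```

With the real data, the list would instead hold the `blogs`, `news`, and `twitter` vectors read earlier, and the word count line could call `stri_count_words()` to match the figures in the table.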
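# Use the first 5,000 blog lines as a quick (non-random) sample
sample_size <- 5000
blogs_sample <- blogs[1:sample_size]
blogs_df <- data.frame(text = blogs_sample, stringsAsFactors = FALSE)
# Tokenize into lowercase words, drop stop words, and count word frequencies
blogs_words_df <- blogs_df %>%
  unnest_tokens(word, text) %>%
  filter(!word %in% stop_words$word) %>%
  count(word, sort = TRUE)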
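# Keep the 20 most frequent words, ranking explicitly by the count column n
# (ties at the cutoff can return more than 20 rows)
top_words <- blogs_words_df %>% top_n(20, n)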
ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Frequent Words in Blogs Sample", x = "Word", y = "Count")
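Because the planned model is based on n-grams, the same frequency counting can be extended from single words to adjacent word pairs. The sketch below is illustrative only and uses base R on toy sentences: `sample_lines`, `tokenize`, and `line_bigrams` are hypothetical names, and the cleaning rule (lowercase, keep only letters, apostrophes, and spaces) is an assumption rather than the cleaning used in this report.

```r
# Toy sample (hypothetical text, not drawn from the corpus)
sample_lines <- c("I love data science", "I love R", "data science is fun")

# Lowercase a line, replace anything other than letters, apostrophes, and
# spaces with a space, then split on whitespace.
tokenize <- function(line) {
  cleaned <- gsub("[^a-z' ]", " ", tolower(line))
  tokens <- strsplit(trimws(cleaned), "\\s+")[[1]]
  tokens[nzchar(tokens)]
}

# Pair each word with the word that follows it within the same line.
line_bigrams <- function(tokens) {
  if (length(tokens) < 2) return(character(0))
  paste(head(tokens, -1), tail(tokens, -1))
}

# Collect bigrams from every line and count them, most frequent first.
bigrams <- unlist(lapply(sample_lines, function(l) line_bigrams(tokenize(l))))
bigram_counts <- sort(table(bigrams), decreasing = TRUE)
print(bigram_counts)
```

Bigrams are built per line so that no pair spans two unrelated lines. To stay within the tidytext workflow used above, `unnest_tokens(bigram, text, token = "ngrams", n = 2)` is the package's equivalent for producing bigrams from a data frame column.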
I will create a word prediction algorithm using n-gram language modeling. The steps include: