This report presents an initial exploration of the HC Corpora text dataset. The ultimate goal is to build a predictive text model (in the style of SwiftKey) that suggests the next word in a sequence. This milestone report documents early progress: data ingestion, exploratory analysis, and a basic plan for model development.
The dataset includes three English files: `en_US.blogs.txt`, `en_US.news.txt`, and `en_US.twitter.txt`.
```r
library(stringi)

# Load the three English source files; skipNul drops embedded nulls
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```
```r
# Per-source line and word counts
data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines  = c(length(blogs), length(news), length(twitter)),
  Words  = c(sum(stri_count_words(blogs)),
             sum(stri_count_words(news)),
             sum(stri_count_words(twitter)))
)
knitr::kable(data_summary)
```
| Source  | Lines   | Words    |
|---------|---------|----------|
| Blogs   | 899288  | 37546806 |
| News    | 77259   | 2674561  |
| Twitter | 2360148 | 30096690 |
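The News line count stands out as far lower than the other two sources. A likely cause is that `readLines()` in text mode stops at an embedded control character partway through `en_US.news.txt` on some platforms. A common workaround, sketched below, is to read through a binary-mode connection; the resulting counts may differ from the table above.

```r
# Reading through a binary-mode connection avoids text-mode
# truncation at embedded control characters.
con <- file("final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
length(news)  # typically far more lines than the text-mode read reports
```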
We drew a 5% random sample from each file to keep processing fast. The text was lowercased and tokenized into single words (tokenization also strips punctuation), and common English stop words were removed using tidytext's `stop_words` lexicon.
```r
library(dplyr)
library(tidytext)
library(ggplot2)

data("stop_words")

# Draw a reproducible 5% sample from each source; floor() guards
# against a fractional sample size
set.seed(42)
sample_text <- c(
  sample(blogs,   floor(length(blogs)   * 0.05)),
  sample(news,    floor(length(news)    * 0.05)),
  sample(twitter, floor(length(twitter) * 0.05))
)
```
```r
# Tokenize into single words (unnest_tokens lowercases and strips
# punctuation by default), then drop stop words
clean_df <- data.frame(text = sample_text, stringsAsFactors = FALSE) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

top_words <- clean_df %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 20)

ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Words", x = "Word", y = "Frequency")
```
We also examined frequent word pairs (bigrams); a trigram sketch follows the output below.
```r
# Bigrams; lines shorter than two words yield NA tokens, which we drop
bigrams <- data.frame(text = sample_text, stringsAsFactors = FALSE) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)
head(bigrams, 10)
```
```
##      bigram     n
## 1    of the 12476
## 2    in the 12345
## 3    to the  6809
## 4   for the  6711
## 5    on the  6426
## 6     to be  6114
## 7    at the  4546
## 8    i have  3992
## 9   and the  3792
## 10    i was  3766
```
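Trigrams follow the same pattern using the same tokenizer with `n = 3` (trigram output is not shown here):

```r
# Trigrams; lines shorter than three words yield NA tokens, which we drop
trigrams <- data.frame(text = sample_text, stringsAsFactors = FALSE) %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE)
head(trigrams, 10)
```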
The raw top bigrams are dominated by function-word pairs such as "of the" and "in the", which is expected since stop words were not removed before building n-grams. After filtering stop words, content pairs such as "right now", "last year", and "high school" rise to the top, reflecting frequent conversational and journalistic patterns.
- **Model type:** a Stupid Backoff n-gram model with a trigram → bigram → unigram hierarchy (a scoring sketch follows this list).
- **Smoothing:** unseen combinations are handled by backing off to a lower-order n-gram with a fixed penalty, with a small default score when all orders miss.
- **Efficiency:** rare n-grams will be pruned to reduce model size.
- **Shiny app:** a web interface will allow users to input a phrase and receive a suggested next word.
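To make the backoff logic concrete, here is a minimal sketch of Stupid Backoff scoring (Brants et al., 2007). It assumes counts have already been aggregated into named lists `tri`, `bi`, and `uni` keyed by space-joined n-grams; the function names, table layout, and toy counts below are illustrative, not the final model. The 0.4 penalty is the value suggested in the original paper.

```r
# Return a stored count, or 0 if the n-gram was never seen
lookup <- function(tbl, key) {
  count <- tbl[[key]]
  if (is.null(count)) 0 else count
}

# Stupid Backoff score for word w3 given context (w1, w2).
# Assumes consistent count tables: if an n-gram is present, its
# lower-order prefix is present too, so divisions are safe.
stupid_backoff <- function(w1, w2, w3, tri, bi, uni, lambda = 0.4) {
  t <- lookup(tri, paste(w1, w2, w3))
  if (t > 0) return(t / lookup(bi, paste(w1, w2)))  # trigram evidence
  b <- lookup(bi, paste(w2, w3))
  if (b > 0) return(lambda * b / lookup(uni, w2))   # back off to bigram
  lambda^2 * lookup(uni, w3) / sum(unlist(uni))     # back off to unigram
}

# Toy counts for illustration only
uni <- list("the" = 100, "cat" = 12, "sat" = 5)
bi  <- list("the cat" = 8, "cat sat" = 4)
tri <- list("the cat sat" = 3)

stupid_backoff("the", "cat", "sat", tri, bi, uni)  # 3 / 8 = 0.375
```

In the full model, the same scoring would run over every candidate `w3` seen after the context, returning the highest-scoring word; pruning rare n-grams before building the tables keeps them small enough for a Shiny deployment.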
This report demonstrates that the data has been downloaded, loaded, and explored. Word frequency and basic token structure have been analyzed. The next step is to build the predictive model and integrate it into a deployable Shiny application.