This report presents an initial exploration of the HC Corpora text dataset. The ultimate goal is to build a predictive text model (in the style of SwiftKey) that suggests the next word in a sequence. This milestone report documents early progress: data ingestion, exploratory analysis, and a basic plan for model development.
The dataset includes three English files: `en_US.blogs.txt`, `en_US.news.txt`, and `en_US.twitter.txt`.
```r
library(stringi)

# Load the three English source files; skipNul drops embedded nulls
blogs   <- readLines("final/en_US/en_US.blogs.txt",   encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt",    encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
```
```r
# Per-source line and word counts
data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines  = c(length(blogs), length(news), length(twitter)),
  Words  = c(sum(stri_count_words(blogs)),
             sum(stri_count_words(news)),
             sum(stri_count_words(twitter)))
)
knitr::kable(data_summary)
```
| Source  | Lines   | Words    |
|---------|---------|----------|
| Blogs   | 899288  | 37546806 |
| News    | 77259   | 2674561  |
| Twitter | 2360148 | 30096690 |
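The News line count stands out as far lower than the other two sources. A likely cause is that `readLines()` in text mode stops at an embedded control character partway through `en_US.news.txt` on some platforms. A common workaround, sketched below, is to read through a binary-mode connection; the resulting counts may differ from the table above.

```r
# Reading through a binary-mode connection avoids text-mode
# truncation at embedded control characters.
con <- file("final/en_US/en_US.news.txt", open = "rb")
news <- readLines(con, encoding = "UTF-8", skipNul = TRUE)
close(con)
length(news)  # typically far more lines than the text-mode read reports
```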
We drew a 5% random sample from each file to keep processing fast. The text was lowercased and tokenized into single words (tokenization also strips punctuation), and common English stop words were removed using tidytext's `stop_words` lexicon.
```r
library(dplyr)
library(tidytext)
library(ggplot2)

data("stop_words")

# Draw a reproducible 5% sample from each source; floor() guards
# against a fractional sample size
set.seed(42)
sample_text <- c(
  sample(blogs,   floor(length(blogs)   * 0.05)),
  sample(news,    floor(length(news)    * 0.05)),
  sample(twitter, floor(length(twitter) * 0.05))
)
```
```r
# Tokenize into single words (unnest_tokens lowercases and strips
# punctuation by default), then drop stop words
clean_df <- data.frame(text = sample_text, stringsAsFactors = FALSE) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word")

top_words <- clean_df %>%
  count(word, sort = TRUE) %>%
  slice_max(n, n = 20)

ggplot(top_words, aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Words", x = "Word", y = "Frequency")
```
We also examined frequent word pairs (bigrams); a trigram sketch follows the output below.
```r
# Bigrams; lines shorter than two words yield NA tokens, which we drop
bigrams <- data.frame(text = sample_text, stringsAsFactors = FALSE) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  count(bigram, sort = TRUE)
head(bigrams, 10)
```
```
##      bigram     n
## 1    of the 12476
## 2    in the 12345
## 3    to the  6809
## 4   for the  6711
## 5    on the  6426
## 6     to be  6114
## 7    at the  4546
## 8    i have  3992
## 9   and the  3792
## 10    i was  3766
```
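Trigrams follow the same pattern using the same tokenizer with `n = 3` (trigram output is not shown here):

```r
# Trigrams; lines shorter than three words yield NA tokens, which we drop
trigrams <- data.frame(text = sample_text, stringsAsFactors = FALSE) %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE)
head(trigrams, 10)
```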
The raw top bigrams are dominated by function-word pairs such as "of the" and "in the", which is expected since stop words were not removed before building n-grams. After filtering stop words, content pairs such as "right now", "last year", and "high school" rise to the top, reflecting frequent conversational and journalistic patterns.
- **Model type:** a Stupid Backoff n-gram model with a trigram → bigram → unigram hierarchy (a scoring sketch follows this list).
- **Smoothing:** unseen combinations are handled by backing off to a lower-order n-gram with a fixed penalty, with a small default score when all orders miss.
- **Efficiency:** rare n-grams will be pruned to reduce model size.
- **Shiny app:** a web interface will allow users to input a phrase and receive a suggested next word.
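To make the backoff logic concrete, here is a minimal sketch of Stupid Backoff scoring (Brants et al., 2007). It assumes counts have already been aggregated into named lists `tri`, `bi`, and `uni` keyed by space-joined n-grams; the function names, table layout, and toy counts below are illustrative, not the final model. The 0.4 penalty is the value suggested in the original paper.

```r
# Return a stored count, or 0 if the n-gram was never seen
lookup <- function(tbl, key) {
  count <- tbl[[key]]
  if (is.null(count)) 0 else count
}

# Stupid Backoff score for word w3 given context (w1, w2).
# Assumes consistent count tables: if an n-gram is present, its
# lower-order prefix is present too, so divisions are safe.
stupid_backoff <- function(w1, w2, w3, tri, bi, uni, lambda = 0.4) {
  t <- lookup(tri, paste(w1, w2, w3))
  if (t > 0) return(t / lookup(bi, paste(w1, w2)))  # trigram evidence
  b <- lookup(bi, paste(w2, w3))
  if (b > 0) return(lambda * b / lookup(uni, w2))   # back off to bigram
  lambda^2 * lookup(uni, w3) / sum(unlist(uni))     # back off to unigram
}

# Toy counts for illustration only
uni <- list("the" = 100, "cat" = 12, "sat" = 5)
bi  <- list("the cat" = 8, "cat sat" = 4)
tri <- list("the cat sat" = 3)

stupid_backoff("the", "cat", "sat", tri, bi, uni)  # 3 / 8 = 0.375
```

In the full model, the same scoring would run over every candidate `w3` seen after the context, returning the highest-scoring word; pruning rare n-grams before building the tables keeps them small enough for a Shiny deployment.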
This report demonstrates that the data has been downloaded, loaded, and explored. Word frequency and basic token structure have been analyzed. The next step is to build the predictive model and integrate it into a deployable Shiny application.