This report explores the text data provided for the JHU Data Science Capstone project. The data, sourced from HC Corpora, consists of three English-language text files scraped from blogs, news sites, and Twitter. The goal of the project is to build a predictive text application that suggests the next word as a user types.
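In this milestone, we load the three files, summarize their size and content, draw a random sample, examine the most frequent words, and outline the next steps toward the prediction model.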
# Load stringi, which provides stri_count_words() used in the summary below
library(stringi)
# Read each file into a character vector, one element per line
blogs <- readLines("~/capstone/final/en_US/en_US.blogs.txt",
  encoding = "UTF-8", skipNul = TRUE)
news <- readLines("~/capstone/final/en_US/en_US.news.txt",
  encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("~/capstone/final/en_US/en_US.twitter.txt",
  encoding = "UTF-8", skipNul = TRUE)
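On some platforms (notably Windows), a text-mode read can stop early if a file contains an embedded control character, which silently lowers the line count. A minimal sketch of a more defensive reader is shown below; `read_corpus_file` is a hypothetical helper name, not part of the original analysis, and the demo uses a temporary file so it runs without the capstone data.

```r
# Hypothetical helper: read a text file through an explicit binary connection
read_corpus_file <- function(path) {
  con <- file(path, open = "rb")   # binary mode: no end-of-file translation
  on.exit(close(con))              # always release the connection
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demo on a temporary file (stand-in for the capstone files)
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
demo_lines <- read_corpus_file(tmp)
length(demo_lines)
```

Comparing `length()` of the result against an external line count (for example from `wc -l`) is a quick way to confirm that a file was read in full.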
We summarize each file by its size on disk, number of lines, number of words, and the length of the longest line.
file_sizes <- c(
  file.info("~/capstone/final/en_US/en_US.blogs.txt")$size,
  file.info("~/capstone/final/en_US/en_US.news.txt")$size,
  file.info("~/capstone/final/en_US/en_US.twitter.txt")$size
) / 1024^2 # convert to MB
line_counts <- c(length(blogs), length(news), length(twitter))
word_counts <- c(
  sum(stri_count_words(blogs)),
  sum(stri_count_words(news)),
  sum(stri_count_words(twitter))
)
max_line_lengths <- c(
  max(nchar(blogs)),
  max(nchar(news)),
  max(nchar(twitter))
)
summary_df <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Size_MB = round(file_sizes, 2),
  Lines = line_counts,
  Words = word_counts,
  Longest_Line = max_line_lengths
)
knitr::kable(summary_df, caption = "Summary statistics for the three text files")
| File | Size_MB | Lines | Words | Longest_Line |
|---|---|---|---|---|
| Blogs | 200.42 | 899288 | 37546250 | 40833 |
| News | 196.28 | 1010242 | 34762395 | 11384 |
| Twitter | 159.36 | 2360148 | 30093413 | 140 |
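Observations: the three files are of comparable size (roughly 160 to 200 MB each), but their structure differs. Twitter has by far the most lines (about 2.36 million), and none is longer than 140 characters, consistent with Twitter's 140-character limit. Blogs contain the longest single line (40,833 characters).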
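One derived figure that makes the stylistic difference concrete is the average number of words per line. The sketch below recomputes it from the line and word counts in the summary table; the data frame name `table_figures` is introduced here only for illustration.

```r
# Line and word counts copied from the summary table
table_figures <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010242, 2360148),
  Words = c(37546250, 34762395, 30093413)
)

# Average words per line, rounded to one decimal place
table_figures$Words_Per_Line <- round(table_figures$Words / table_figures$Lines, 1)
table_figures
```

This gives roughly 41.8 words per line for blogs, 34.4 for news, and 12.8 for Twitter, so a tweet carries less than a third of the text of a typical blog line.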
To make exploration manageable, we take a 1% random sample from each file.
set.seed(42)
sample_blogs <- sample(blogs, length(blogs) * 0.01)
sample_news <- sample(news, length(news) * 0.01)
sample_twitter <- sample(twitter, length(twitter) * 0.01)
# Combine into one corpus
sample_all <- c(sample_blogs, sample_news, sample_twitter)
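Because `length(x) * 0.01` is rarely a whole number, `sample()` silently truncates the requested size. A small wrapper can make that rounding explicit and guard against a sample size of zero on short inputs. This is a sketch only; `sample_lines` is a hypothetical helper, and the toy vector stands in for the real files.

```r
# Hypothetical helper: draw a fixed fraction of lines with an explicit size
sample_lines <- function(x, frac = 0.01) {
  n <- max(1L, floor(length(x) * frac))  # whole number of lines, at least one
  sample(x, n)                           # sampling without replacement
}

set.seed(42)
toy_lines <- paste("line", 1:1000)       # toy stand-in for a text file
toy_sample <- sample_lines(toy_lines, frac = 0.01)
length(toy_sample)
```

Calling `set.seed()` once before all sampling calls, as the report does, keeps the combined sample reproducible across runs.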
We tokenize the sample into individual words and count the most common ones (excluding common stopwords like “the”, “a”, “is”).
# Simple tokenization: lowercase, remove punctuation/numbers, split on whitespace
clean_text <- tolower(sample_all)
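# Use the POSIX class [:space:]; R's default regex engine does not treat \\s as whitespace inside []
clean_text <- gsub("[^a-z[:space:]]", " ", clean_text)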
words <- unlist(strsplit(clean_text, "\\s+"))
words <- words[nchar(words) > 0]
# Common English stopwords
stopwords <- c("the", "a", "an", "and", "or", "but", "is", "are", "was",
  "were", "be", "been", "being", "have", "has", "had", "do",
  "does", "did", "will", "would", "could", "should", "may",
  "might", "must", "can", "to", "of", "in", "on", "at", "by",
  "for", "with", "about", "as", "it", "its", "this", "that",
  "these", "those", "i", "you", "he", "she", "we", "they",
  "me", "him", "her", "us", "them", "my", "your", "his",
  "our", "their", "if", "then", "not", "no", "so")
words_filtered <- words[!words %in% stopwords]
word_freq <- sort(table(words_filtered), decreasing = TRUE)
top_words <- head(word_freq, 20)
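To see what the tokenization and stopword filter do, it can help to run the same steps on a couple of toy sentences. The sketch below wraps the steps in a hypothetical function, `tokenize_words`, and uses a deliberately tiny stopword list; neither name comes from the original analysis.

```r
# Hypothetical helper mirroring the steps above: lowercase, strip
# non-letters, split on whitespace, drop empty tokens
tokenize_words <- function(text) {
  text <- tolower(text)
  text <- gsub("[^a-z[:space:]]", " ", text)
  tokens <- unlist(strsplit(text, "[[:space:]]+"))
  tokens[nchar(tokens) > 0]
}

toy_text <- c("The cat sat on the mat.", "The dog sat, too!")
toy_stopwords <- c("the", "on")          # tiny list, for illustration only

toy_tokens <- tokenize_words(toy_text)
toy_freq <- sort(table(toy_tokens[!toy_tokens %in% toy_stopwords]),
                 decreasing = TRUE)
head(toy_freq, 3)
```

Here "sat" comes out on top with a count of 2. Note that this simple approach also splits contractions: "don't" becomes "don" and "t", which may matter once the tokens feed a prediction model.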
The next steps for building the predictive text application are to build an n-gram language model from the sampled text and to store it in an efficient lookup structure (such as data.table) so predictions return in under a second.
The dataset is large but workable through sampling. Basic exploration confirms that word frequencies follow the expected Zipfian distribution and that the three sources differ meaningfully in style. The next phase will focus on building and tuning the n-gram language model behind the prediction app.
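As a rough illustration of the kind of n-gram model planned for the next phase, the sketch below counts bigrams in a few toy sentences and returns the most frequent follower of a given word. It is an assumption-laden toy, not the final design: `build_bigrams` and `predict_next` are hypothetical names, and it uses base R data frames rather than data.table to stay dependency-free.

```r
# Hypothetical helper: count adjacent word pairs across a vector of lines
build_bigrams <- function(lines) {
  pairs <- lapply(lines, function(line) {
    tokens <- unlist(strsplit(tolower(line), "[^a-z]+"))
    tokens <- tokens[nchar(tokens) > 0]
    if (length(tokens) < 2) return(NULL)   # no bigram in a one-word line
    data.frame(first = head(tokens, -1), second = tail(tokens, -1),
               stringsAsFactors = FALSE)
  })
  bigrams <- do.call(rbind, Filter(Negate(is.null), pairs))
  bigrams$n <- 1L
  aggregate(n ~ first + second, data = bigrams, FUN = sum)
}

# Hypothetical helper: the k most frequent words seen after `word`
predict_next <- function(counts, word, k = 3) {
  hits <- counts[counts$first == tolower(word), ]
  head(hits$second[order(-hits$n)], k)
}

toy_lines <- c("i love data science", "i love coffee",
               "i love data", "you love data")
toy_counts <- build_bigrams(toy_lines)
predict_next(toy_counts, "love", k = 1)
```

In this toy corpus "love" is followed by "data" three times and by "coffee" once, so the top suggestion is "data". A production version would likely add longer n-grams with a fallback to shorter ones, and a keyed lookup table to meet the sub-second target.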
Report generated on June 02, 2026