Executive Summary

This report explores the text data provided for the JHU Data Science Capstone project. The data, sourced from HC Corpora, consists of three English-language text files scraped from blogs, news sites, and Twitter. The goal of the project is to build a predictive text application that suggests the next word as a user types.

In this milestone, we:

  1. Load and inspect the three text files
  2. Summarize their size, line counts, and word counts
  3. Visualize word frequencies on a representative sample
  4. Outline the plan for building the prediction algorithm

Loading the Data

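# stringi provides stri_count_words(), used in the summary statistics below
library(stringi)

# Read each file into a character vector, one element per line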
blogs <- readLines("~/capstone/final/en_US/en_US.blogs.txt",
                   encoding = "UTF-8", skipNul = TRUE)
news <- readLines("~/capstone/final/en_US/en_US.news.txt",
                  encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("~/capstone/final/en_US/en_US.twitter.txt",
                     encoding = "UTF-8", skipNul = TRUE)

Basic Summary Statistics

We summarize each file by its size on disk, number of lines, number of words, and the length of the longest line.

file_sizes <- c(
  file.info("~/capstone/final/en_US/en_US.blogs.txt")$size,
  file.info("~/capstone/final/en_US/en_US.news.txt")$size,
  file.info("~/capstone/final/en_US/en_US.twitter.txt")$size
) / 1024^2  # convert to MB

line_counts <- c(length(blogs), length(news), length(twitter))

word_counts <- c(
  sum(stri_count_words(blogs)),
  sum(stri_count_words(news)),
  sum(stri_count_words(twitter))
)

max_line_lengths <- c(
  max(nchar(blogs)),
  max(nchar(news)),
  max(nchar(twitter))
)

summary_df <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Size_MB = round(file_sizes, 2),
  Lines = line_counts,
  Words = word_counts,
  Longest_Line = max_line_lengths
)

knitr::kable(summary_df, caption = "Summary statistics for the three text files")

Summary statistics for the three text files

File      Size_MB   Lines     Words      Longest_Line
Blogs     200.42    899288    37546250   40833
News      196.28    1010242   34762395   11384
Twitter   159.36    2360148   30093413   140

Size_MB is the file size in bytes divided by 1024^2; Longest_Line is measured in characters.

Observations:
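
One figure that follows directly from the table is the average number of words per line. The sketch below recomputes it from the table values; the words_per_line column is introduced here for illustration and is not part of the original report.

```r
# Values copied from the summary table above
summary_df <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010242, 2360148),
  Words = c(37546250, 34762395, 30093413)
)

# Average words per line (illustrative derived column)
summary_df$words_per_line <- round(summary_df$Words / summary_df$Lines, 1)
print(summary_df)
```

On these numbers, blog lines average roughly 42 words, news lines roughly 34, and tweets roughly 13, which fits the Twitter file having a longest line of only 140 characters.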

Sampling the Data

To make exploration manageable, we take a 1% random sample from each file.

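set.seed(42)  # fixed seed so the sample is reproducible
# floor() makes the integer sample size explicit
sample_blogs <- sample(blogs, floor(length(blogs) * 0.01))
sample_news <- sample(news, floor(length(news) * 0.01))
sample_twitter <- sample(twitter, floor(length(twitter) * 0.01))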

# Combine into one corpus
sample_all <- c(sample_blogs, sample_news, sample_twitter)

Word Frequency Analysis

We tokenize the sample into individual words and count the most common ones (excluding common stopwords like “the”, “a”, “is”).

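# Simple tokenization: lowercase, replace everything except letters and
# whitespace with a space, then split on whitespace
clean_text <- tolower(sample_all)
# [:space:] is the portable way to name whitespace inside a bracket expression;
# "\\s" is not reliably read as a whitespace class there without perl = TRUE.
# Note that apostrophes are replaced too, so "don't" splits into "don" and "t".
clean_text <- gsub("[^a-z[:space:]]", " ", clean_text)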
words <- unlist(strsplit(clean_text, "\\s+"))
words <- words[nchar(words) > 0]

# Common English stopwords
stopwords <- c("the", "a", "an", "and", "or", "but", "is", "are", "was",
               "were", "be", "been", "being", "have", "has", "had", "do",
               "does", "did", "will", "would", "could", "should", "may",
               "might", "must", "can", "to", "of", "in", "on", "at", "by",
               "for", "with", "about", "as", "it", "its", "this", "that",
               "these", "those", "i", "you", "he", "she", "we", "they",
               "me", "him", "her", "us", "them", "my", "your", "his",
               "our", "their", "if", "then", "not", "no", "so")

words_filtered <- words[!words %in% stopwords]
word_freq <- sort(table(words_filtered), decreasing = TRUE)
top_words <- head(word_freq, 20)

Top 20 Most Frequent Words
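
The chart for this section is not reproduced here. As a minimal sketch of how such a chart could be drawn with base R graphics, the block below builds a frequency table from a small made-up word vector and plots it as a horizontal bar chart. The names toy_words, toy_freq, and toy_top are illustrative stand-ins for words_filtered, word_freq, and top_words, and the counts are not results from the corpus.

```r
# Toy stand-in for words_filtered; the real vector comes from the sampled corpus
toy_words <- c("time", "love", "day", "time", "good", "love", "time", "day", "new")

# Same counting approach as in the analysis code: tabulate, then sort
toy_freq <- sort(table(toy_words), decreasing = TRUE)
toy_top  <- head(toy_freq, 3)

# rev() puts the most frequent word at the top of a horizontal bar chart
barplot(rev(toy_top), horiz = TRUE, las = 1,
        main = "Top words (toy data)", xlab = "Count")
```

With the real data, passing top_words in place of toy_top should give the 20-word version of the same chart.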

Distribution of Line Lengths by Source
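
The figure for this section is likewise not reproduced here. One way such a comparison could be drawn, assuming base R graphics, is a box plot of characters per line for each source. The toy_sources list below is made-up data standing in for the blogs, news, and twitter vectors.

```r
# Toy stand-ins for the blogs, news and twitter vectors (made-up lines)
toy_sources <- list(
  Blogs   = c("a fairly long blog sentence with many words", "another blog line"),
  News    = c("a short news line", "a somewhat longer news sentence here"),
  Twitter = c("short tweet", "lol")
)

# Character count of every line, grouped by source
toy_lengths <- lapply(toy_sources, nchar)

# A log scale keeps a few very long lines from flattening the other boxes
boxplot(toy_lengths, log = "y",
        ylab = "Characters per line (log scale)",
        main = "Line lengths by source (toy data)")
```

The log scale is a design choice suggested by the summary table, where the longest blog line is several hundred times longer than the longest tweet.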

Findings

Plan for the Prediction Algorithm and Shiny App

The next steps for building the predictive text application are:

  1. Build n-gram models — Generate unigrams, bigrams, trigrams, and quadgrams from the sampled corpus to capture common word sequences.
  2. Smoothing — Use a back-off (Katz) or Kneser-Ney smoothing strategy so the model handles unseen word combinations gracefully.
  3. Optimization — Store the n-gram lookup tables efficiently (e.g. using data.table) so predictions return in under a second.
  4. Shiny App — Build a simple interface with a text input box. As the user types, the app extracts the last 1-3 words and looks up the most likely next word.
  5. Deploy — Publish the Shiny app on shinyapps.io for grading.
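
Steps 1, 2, and 4 can be illustrated on a toy corpus. The sketch below counts bigrams and looks up the most frequent follower of a word, falling back to the most frequent unigram when the word has not been seen. This plain fallback is a deliberate simplification, not Katz back-off or Kneser-Ney smoothing (both of which discount counts), and predict_next and toy_corpus are hypothetical names introduced for illustration.

```r
# Toy corpus standing in for the cleaned sample (made-up sentences)
toy_corpus <- c("i love new york", "i love new shoes",
                "i love new music", "new york is big")

# Split each line into tokens
tokens <- strsplit(toy_corpus, "\\s+")

# Build "w1 w2" bigram strings within each line, then count them
bigrams <- unlist(lapply(tokens, function(tk) {
  if (length(tk) < 2) return(character(0))
  paste(head(tk, -1), tail(tk, -1))
}))
bigram_counts  <- table(bigrams)
unigram_counts <- table(unlist(tokens))

# Hypothetical helper: most frequent follower of `word`, falling back to the
# most frequent unigram when `word` never starts an observed bigram
predict_next <- function(word) {
  hits <- bigram_counts[startsWith(names(bigram_counts), paste0(word, " "))]
  if (length(hits) > 0) {
    best <- names(hits)[which.max(hits)]
    return(sub("^[^ ]+ ", "", best))  # drop the first word, keep the follower
  }
  names(unigram_counts)[which.max(unigram_counts)]
}

predict_next("love")   # seen word: uses bigram counts
predict_next("zebra")  # unseen word: falls back to unigram counts
```

A production version would likely extend the same lookup to trigrams and quadgrams and store the counts in keyed tables (for example with data.table, as the plan suggests) rather than scanning names on every call.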

Conclusion

The dataset is large but workable through sampling. Basic exploration confirms that word frequencies follow the expected Zipfian distribution and that the three sources differ meaningfully in style. The next phase will focus on building and tuning the n-gram language model behind the prediction app.
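
One common way to check the Zipfian claim is to fit a line to log frequency against log rank; a slope near -1 is the classic signature. The sketch below uses made-up counts constructed to follow 1/rank exactly, so it only demonstrates the method and reports nothing about the corpus. On the real data it would presumably be applied to the unfiltered word table, since removing stopwords cuts off the head of the distribution.

```r
# Made-up counts that follow freq = 1200 / rank exactly (not corpus results)
toy_counts <- c(the = 1200, of = 600, and = 400, to = 300, a = 240)

freq      <- sort(toy_counts, decreasing = TRUE)
word_rank <- seq_along(freq)

# Under Zipf's law, log(freq) is roughly linear in log(rank) with slope near -1
fit <- lm(log(freq) ~ log(word_rank))
zipf_slope <- unname(coef(fit)[2])
print(zipf_slope)
```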

Report generated on June 02, 2026