Capstone Milestone Report: Exploratory Analysis of Text Data

Executive Summary

This report explores the text data provided for the JHU Data Science Capstone project. The data, sourced from HC Corpora, consists of three English-language text files scraped from blogs, news sites, and Twitter. The goal of the project is to build a predictive text application that suggests the next word as a user types.

In this milestone, we:

Load and inspect the three text files
Summarize their size, line counts, and word counts
Visualize word frequencies on a representative sample
Outline the plan for building the prediction algorithm

Loading the Data

# Read each file into a character vector, one element per line
blogs <- readLines("~/capstone/final/en_US/en_US.blogs.txt",
                   encoding = "UTF-8", skipNul = TRUE)
news <- readLines("~/capstone/final/en_US/en_US.news.txt",
                  encoding = "UTF-8", skipNul = TRUE)
twitter <- readLines("~/capstone/final/en_US/en_US.twitter.txt",
                     encoding = "UTF-8", skipNul = TRUE)
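On some platforms, notably Windows, reading a file in text mode can stop early when it contains stray control characters. A minimal sketch of a more defensive reader follows; the helper name read_corpus is hypothetical and not part of the original analysis. It opens the file through a binary-mode connection and closes it explicitly.

```r
# Hypothetical helper (not from the original report): read a UTF-8 text file
# through a binary-mode connection so that embedded control characters are
# not treated as end-of-file on platforms that do so in text mode.
read_corpus <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con), add = TRUE)
  readLines(con, encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
}

# Self-contained demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line"), tmp)
demo_lines <- read_corpus(tmp)
print(length(demo_lines))
unlink(tmp)
```

If the line counts reported by readLines() on a file path look too low, swapping in a reader like this is one thing to try.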

Basic Summary Statistics

We summarize each file by its size on disk, number of lines, number of words, and the length of the longest line.

file_sizes <- c(
  file.info("~/capstone/final/en_US/en_US.blogs.txt")$size,
  file.info("~/capstone/final/en_US/en_US.news.txt")$size,
  file.info("~/capstone/final/en_US/en_US.twitter.txt")$size
) / 1024^2  # convert to MB

line_counts <- c(length(blogs), length(news), length(twitter))

# stri_count_words() comes from the stringi package
library(stringi)

word_counts <- c(
  sum(stri_count_words(blogs)),
  sum(stri_count_words(news)),
  sum(stri_count_words(twitter))
)

max_line_lengths <- c(
  max(nchar(blogs)),
  max(nchar(news)),
  max(nchar(twitter))
)

summary_df <- data.frame(
  File = c("Blogs", "News", "Twitter"),
  Size_MB = round(file_sizes, 2),
  Lines = line_counts,
  Words = word_counts,
  Longest_Line = max_line_lengths
)

knitr::kable(summary_df, caption = "Summary statistics for the three text files")

Summary statistics for the three text files
File	Size_MB	Lines	Words	Longest_Line
Blogs	200.42	899288	37546250	40833
News	196.28	1010242	34762395	11384
Twitter	159.36	2360148	30093413	140

Observations:

The blogs file is the largest on disk and contains the longest line (40,833 characters).
Twitter has the most lines but the fewest words per line (about 13, against roughly 42 for blogs and 34 for news), consistent with the 140-character limit.
Together the three files contain over 102 million words, far too many to process in full, so the analysis below works with samples.
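The per-line averages behind these observations can be derived directly from the summary table. The sketch below copies the line and word counts from that table into a new data frame (the name counts_df is an assumption, chosen so it does not overwrite summary_df) and computes words per line and the combined word total.

```r
# Line and word counts copied from the summary table in this report
counts_df <- data.frame(
  File  = c("Blogs", "News", "Twitter"),
  Lines = c(899288, 1010242, 2360148),
  Words = c(37546250, 34762395, 30093413)
)

# Average number of words per line for each source
counts_df$Words_Per_Line <- round(counts_df$Words / counts_df$Lines, 1)

# Combined word count across all three files, in millions
total_words_millions <- sum(counts_df$Words) / 1e6

print(counts_df)
print(total_words_millions)
```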

Sampling the Data

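To keep exploration manageable, we draw a 1% random sample of lines from each file, without replacement and with a fixed seed for reproducibility. This yields roughly 42,700 lines in total.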

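set.seed(42)  # fixed seed so the same lines are drawn on every run

# floor() makes the integer sample size explicit
sample_blogs <- sample(blogs, floor(length(blogs) * 0.01))
sample_news <- sample(news, floor(length(news) * 0.01))
sample_twitter <- sample(twitter, floor(length(twitter) * 0.01))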

# Combine into one corpus
sample_all <- c(sample_blogs, sample_news, sample_twitter)

Word Frequency Analysis

We tokenize the sample into individual words and count the most common ones (excluding common stopwords like “the”, “a”, “is”).

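# Simple tokenization: lowercase, replace everything except letters and
# whitespace with a space, then split on whitespace.
# Apostrophes are replaced too, so "don't" becomes the tokens "don" and "t".
clean_text <- tolower(sample_all)
clean_text <- gsub("[^a-z[:space:]]", " ", clean_text)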
words <- unlist(strsplit(clean_text, "\\s+"))
words <- words[nchar(words) > 0]

# Common English stopwords
stopwords <- c("the", "a", "an", "and", "or", "but", "is", "are", "was",
               "were", "be", "been", "being", "have", "has", "had", "do",
               "does", "did", "will", "would", "could", "should", "may",
               "might", "must", "can", "to", "of", "in", "on", "at", "by",
               "for", "with", "about", "as", "it", "its", "this", "that",
               "these", "those", "i", "you", "he", "she", "we", "they",
               "me", "him", "her", "us", "them", "my", "your", "his",
               "our", "their", "if", "then", "not", "no", "so")

words_filtered <- words[!words %in% stopwords]
word_freq <- sort(table(words_filtered), decreasing = TRUE)
top_words <- head(word_freq, 20)

Top 20 Most Frequent Words
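A minimal sketch of how a chart like this could be drawn with base graphics. The helper name plot_top_words and the toy input are assumptions for illustration; the plotting code actually used for this report is not shown in the source.

```r
# Hypothetical helper: tabulate a character vector of tokens and draw the
# n most frequent ones as a horizontal bar chart.
plot_top_words <- function(tokens, n = 20) {
  freq <- sort(table(tokens), decreasing = TRUE)
  top <- head(freq, n)
  top_df <- data.frame(word = names(top), count = as.integer(top),
                       stringsAsFactors = FALSE)
  # Reverse the order so the most frequent word is drawn at the top
  barplot(rev(top_df$count), names.arg = rev(top_df$word),
          horiz = TRUE, las = 1, xlab = "Count",
          main = sprintf("Top %d Most Frequent Words", nrow(top_df)))
  invisible(top_df)
}

# Toy input so the block runs on its own
toy_tokens <- c("time", "just", "like", "just", "people", "just", "like")
toy_top <- plot_top_words(toy_tokens, n = 3)
print(toy_top)
```

With the objects defined earlier, the equivalent call would be plot_top_words(words_filtered, n = 20).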

Distribution of Line Lengths by Source
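One way to compare line lengths across sources is a box plot on a log scale. The sketch below is an assumption about how such a figure could be produced, not the report's own plotting code; plot_line_lengths and the toy input are hypothetical.

```r
# Hypothetical helper: compare line-length distributions across sources.
# `sources` is a named list of character vectors, one vector per source.
plot_line_lengths <- function(sources) {
  lengths_df <- do.call(rbind, lapply(names(sources), function(src) {
    data.frame(source = src,
               chars = nchar(sources[[src]], type = "chars", allowNA = TRUE),
               stringsAsFactors = FALSE)
  }))
  # Drop empty or unreadable lines, which cannot be shown on a log axis
  lengths_df <- lengths_df[!is.na(lengths_df$chars) & lengths_df$chars > 0, ]
  # A log scale keeps very long blog lines from flattening everything else
  boxplot(chars ~ source, data = lengths_df, log = "y",
          xlab = "Source", ylab = "Characters per line (log scale)",
          main = "Distribution of Line Lengths by Source")
  invisible(lengths_df)
}

# Toy input so the block runs on its own
toy_sources <- list(
  Blogs   = c("a fairly long line of blog text that keeps going", "short post"),
  News    = c("a medium length news sentence", "another news sentence here"),
  Twitter = c("lol", "so true")
)
toy_lengths <- plot_line_lengths(toy_sources)
```

For the sampled data, the call would be plot_line_lengths(list(Blogs = sample_blogs, News = sample_news, Twitter = sample_twitter)).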

Findings

Twitter lines are short, capped at 140 characters, and dominated by casual language.
News lines are more uniformly sized, with formal vocabulary.
Blogs vary widely in length, sometimes spanning entire articles in a single line.
Even after removing stopwords, the top words (e.g. “just”, “like”, “time”, “people”) reflect everyday English usage.

Plan for the Prediction Algorithm and Shiny App

The next steps for building the predictive text application are:

Build n-gram models — Generate unigrams, bigrams, trigrams, and quadgrams from the sampled corpus to capture common word sequences.
Smoothing — Use a back-off (Katz) or Kneser-Ney smoothing strategy so the model handles unseen word combinations gracefully.
Optimization — Store the n-gram lookup tables efficiently (e.g. using data.table) so predictions return in under a second.
Shiny App — Build a simple interface with a text input box. As the user types, the app extracts the last 1-3 words and looks up the most likely next word.
Deploy — Publish the Shiny app on shinyapps.io for grading.
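The first two steps can be illustrated with a deliberately simplified sketch. It counts bigrams and predicts the most frequent follower of the last word typed, falling back to the most common single word when the context is unseen. This is a plain most-frequent lookup with a unigram fallback, not Katz back-off or Kneser-Ney smoothing, and all function names and the toy corpus are hypothetical.

```r
# Split text into lowercase word tokens (letters and apostrophes only)
tokenize <- function(text) {
  tokens <- unlist(strsplit(gsub("[^a-z']+", " ", tolower(text)), " +"))
  tokens[nchar(tokens) > 0]
}

# Count adjacent word pairs within each line
build_bigrams <- function(lines) {
  pairs <- do.call(rbind, lapply(lines, function(line) {
    tok <- tokenize(line)
    if (length(tok) < 2) return(NULL)
    data.frame(w1 = head(tok, -1), w2 = tail(tok, -1),
               stringsAsFactors = FALSE)
  }))
  if (is.null(pairs)) {
    return(data.frame(w1 = character(), w2 = character(), n = integer()))
  }
  counts <- aggregate(n ~ w1 + w2, data = cbind(pairs, n = 1L), FUN = sum)
  counts[order(counts$w1, -counts$n), ]
}

# Predict the next word from the last word of the phrase
predict_next <- function(phrase, bigrams, fallback) {
  tok <- tokenize(phrase)
  if (length(tok) == 0) return(fallback)
  hits <- bigrams[bigrams$w1 == tail(tok, 1), ]
  if (nrow(hits) == 0) return(fallback)  # unseen context: use unigram guess
  hits$w2[which.max(hits$n)]
}

# Toy corpus so the block runs on its own
toy_lines <- c("thanks for the follow", "thanks for the support",
               "thanks for the follow back", "see you at the game")
toy_bigrams <- build_bigrams(toy_lines)
toy_fallback <- names(which.max(table(tokenize(toy_lines))))

print(predict_next("Thanks for the", toy_bigrams, toy_fallback))
print(predict_next("unseen zebra", toy_bigrams, toy_fallback))
```

A production version would extend the lookup to trigrams and 4-grams, replace the raw counts with smoothed probabilities, and store the tables in a keyed structure such as a data.table, as the plan describes.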

Conclusion

The dataset is large but workable through sampling. Basic exploration confirms that word frequencies follow the expected Zipfian distribution and that the three sources differ meaningfully in style. The next phase will focus on building and tuning the n-gram language model behind the prediction app.
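A quick way to check the Zipfian claim is to regress log frequency on log rank: under Zipf's law the slope is close to -1. The sketch below is an assumption about how that check could be run; zipf_slope is a hypothetical helper, and the demonstration uses synthetic tokens rather than the report's data.

```r
# Hypothetical helper: estimate the slope of log(frequency) against
# log(rank) for a character vector of tokens.
zipf_slope <- function(tokens) {
  freq <- sort(as.integer(table(tokens)), decreasing = TRUE)
  rank <- seq_along(freq)
  fit <- lm(log(freq) ~ log(rank))
  unname(coef(fit)[2])
}

# Synthetic tokens whose counts are proportional to 1 / rank
toy_ranks <- 1:50
toy_zipf_tokens <- rep(paste0("w", toy_ranks), times = round(1000 / toy_ranks))
toy_slope <- zipf_slope(toy_zipf_tokens)
print(toy_slope)
```

On the sampled corpus, the check would be run on the unfiltered words vector, since removing stopwords strips out exactly the highest-ranked words.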

Report generated on June 02, 2026