Introduction

This milestone report was prepared for the Data Science Capstone (Johns Hopkins University, Coursera).
It follows the usual RPubs milestone structure and covers data loading, cleaning, exploratory analysis, and visualization.

Data

Dataset: Coursera Capstone dataset (en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt).
Make sure the files are placed in final/en_US/ or update the file paths below.

Load and sample data safely

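# Packages used throughout the report
library(tm)         # VCorpus, tm_map, TermDocumentMatrix
library(ggplot2)    # bar plot
library(wordcloud)  # word cloud

# Paths to your dataset files
blogs_path   <- "final/en_US/en_US.blogs.txt"
news_path    <- "final/en_US/en_US.news.txt"
twitter_path <- "final/en_US/en_US.twitter.txt"

# Read only the first sample_lines lines of each file to limit memory use
sample_lines <- 10000  # reduce if needed for your system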

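# Return the first n lines of a UTF-8 text file, or character(0) if the file is missing.
# Note: this takes the head of the file, not a random sample.
read_sample <- function(path, n = 10000) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(character(0))
  }
  con <- file(path, "r", encoding = "UTF-8")
  on.exit(close(con), add = TRUE)
  readLines(con, n = n, warn = FALSE, skipNul = TRUE)
}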

blogs   <- read_sample(blogs_path, sample_lines)
news    <- read_sample(news_path, sample_lines)
twitter <- read_sample(twitter_path, sample_lines)

all_text <- c(blogs, news, twitter)
length(all_text)

## [1] 30000
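
Note that read_sample() returns the first n lines of each file rather than a random draw, so the result may not be representative if the files are ordered in some way. If a random sample is preferred, one possible approach is sketched below. The function name sample_file_lines and its prob, chunk_size, and seed parameters are hypothetical and not part of the original report; the sketch reads the file in chunks and keeps each line independently with probability prob.

```r
# Hypothetical helper: keep each line with probability `prob`,
# reading the file in chunks so it is never held in memory all at once.
sample_file_lines <- function(path, prob = 0.01, chunk_size = 10000, seed = 1234) {
  set.seed(seed)  # fixed seed so repeated runs give the same sample
  con <- file(path, "r", encoding = "UTF-8")
  on.exit(close(con), add = TRUE)
  kept <- character(0)
  repeat {
    # readLines() on an open connection continues where the last call stopped
    chunk <- readLines(con, n = chunk_size, warn = FALSE, skipNul = TRUE)
    if (length(chunk) == 0) break
    keep <- rbinom(length(chunk), size = 1, prob = prob) == 1
    kept <- c(kept, chunk[keep])
  }
  kept
}
```

For example, sample_file_lines(blogs_path, prob = 0.01) would keep roughly 1% of the blog lines. The exact number of lines returned depends on the seed, unlike the fixed n used by read_sample().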

Data cleaning and corpus preparation

# Create corpus and clean text
corpus <- VCorpus(VectorSource(all_text))

clean_corpus <- function(corp) {
  corp <- tm_map(corp, content_transformer(tolower))
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removeWords, stopwords("en"))
  corp <- tm_map(corp, stripWhitespace)
  corp
}

corpus_clean <- clean_corpus(corpus)
# Preview first cleaned document
if (length(corpus_clean) > 0) {
  cat(content(corpus_clean[[1]]), sep = "\n")
}

##  years thereafter oil fields platforms named pagan “gods”
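
The preview above still contains the curly quotes around “gods”, which suggests that the default punctuation removal did not treat them as punctuation. A minimal base-R sketch of a Unicode-aware cleaning step for a single string follows. The helper name clean_string is hypothetical, and the sketch only approximates the tm pipeline (it skips stop word removal). Recent versions of tm also document a ucp argument to removePunctuation for a similar purpose; that is worth checking against the installed version.

```r
# Hypothetical helper: lowercase, strip Unicode punctuation and digits,
# then collapse runs of whitespace. Not a drop-in replacement for clean_corpus().
clean_string <- function(x) {
  x <- tolower(x)
  x <- gsub("\\p{P}+", "", x, perl = TRUE)  # \p{P} also matches curly quotes
  x <- gsub("[[:digit:]]+", "", x)
  x <- gsub("\\s+", " ", x)
  trimws(x)
}

cleaned <- clean_string("Named after pagan \u201cGods\u201d in 2012!")
print(cleaned)
```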

Word Frequency Analysis

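tdm <- TermDocumentMatrix(corpus_clean, control = list(wordLengths = c(1, Inf)))
# Sum counts on the sparse matrix; as.matrix(tdm) would allocate a dense
# terms x documents matrix, which can exhaust memory with 30,000 documents.
# slam is installed as a dependency of tm.
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
freq_df <- data.frame(term = names(freq), freq = as.integer(freq), row.names = NULL)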

# Top 20 words
head(freq_df, 20)
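
To make the counting step concrete, the toy example below builds the same kind of term and frequency table with base R only. The three toy_docs strings are made up for illustration and are not drawn from the Capstone data; splitting on whitespace stands in for the tokenization that TermDocumentMatrix performs.

```r
# Made-up documents, already cleaned (lowercase, no punctuation)
toy_docs <- c("oil fields named", "oil platforms", "fields oil")

# Split each document into words and count occurrences across all documents
tokens   <- unlist(strsplit(toy_docs, "\\s+"))
toy_freq <- sort(table(tokens), decreasing = TRUE)

toy_freq_df <- data.frame(term = names(toy_freq),
                          freq = as.integer(toy_freq),
                          row.names = NULL)
print(toy_freq_df)
```

Each row's freq is the total number of times the term appears across all documents, which is what summing the rows of the term-document matrix computes.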

Visualizations

# Bar plot of top 20 words
top_n <- 20
top_words <- head(freq_df, top_n)  # avoids NA rows when fewer than top_n terms exist
ggplot(top_words, aes(x = reorder(term, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Words", x = "Word", y = "Frequency")

# Wordcloud
if (nrow(freq_df) > 50) {
  suppressWarnings(wordcloud(words = freq_df$term, freq = freq_df$freq,
                             min.freq = 2, max.words = 100, random.order = FALSE))
}

Conclusion

This report demonstrates a workflow for the Capstone dataset that limits memory use by reading only the first 10,000 lines of each file.
It can be published directly to RPubs without Java/RWeka dependencies, while retaining the same milestone structure.

Knit the document to HTML and publish to RPubs to obtain your shareable link.

Milestone Report - Capstone Project

Nikhil Nilesh

October 11, 2025