This milestone report was prepared for the Data Science
Capstone (Johns Hopkins University, Coursera).
It follows the usual RPubs milestone structure and covers data loading,
cleaning, exploratory analysis, and visualization.
Dataset: Coursera Capstone dataset
(en_US.blogs.txt, en_US.news.txt, en_US.twitter.txt).
Make sure the files are placed in final/en_US/, or update
the file paths below.
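# Packages used by the code below
library(tm)         # VCorpus, tm_map, stopwords, TermDocumentMatrix
library(ggplot2)    # bar plot
library(wordcloud)  # word cloud
# Paths to your dataset files
blogs_path   <- "final/en_US/en_US.blogs.txt"
news_path    <- "final/en_US/en_US.news.txt"
twitter_path <- "final/en_US/en_US.twitter.txt"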
# Read only the first n lines of each file to prevent memory issues
# (this takes the head of the file, not a random sample)
sample_lines <- 10000 # reduce if needed for your system
read_sample <- function(path, n = 10000) {
  if (!file.exists(path)) {
    message("File not found: ", path)
    return(character(0))
  }
  con <- file(path, "r", encoding = "UTF-8")
  on.exit(close(con))  # always release the connection
  readLines(con, n = n, warn = FALSE, skipNul = TRUE)
}
blogs   <- read_sample(blogs_path, sample_lines)
news    <- read_sample(news_path, sample_lines)
twitter <- read_sample(twitter_path, sample_lines)
all_text <- c(blogs, news, twitter)
length(all_text)
## [1] 30000
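Because read_sample() takes the first n lines of each file, the subset may not be representative of the whole file. A minimal sketch of a random alternative follows; it reads the file in chunks and keeps each line with a fixed probability, so the full file is never held in memory. The helper name sample_lines_random and its parameters prob, chunk_size, and seed are assumptions made for illustration and are not part of the original report.

```r
# Hypothetical helper (not in the original report): keep each line with
# probability `prob`, reading `chunk_size` lines at a time.
sample_lines_random <- function(path, prob = 0.01, chunk_size = 10000, seed = 42) {
  stopifnot(file.exists(path), prob >= 0, prob <= 1)
  set.seed(seed)  # reproducible draws; note this resets the global RNG state
  con <- file(path, "r", encoding = "UTF-8")
  on.exit(close(con))
  kept <- list()
  repeat {
    chunk <- readLines(con, n = chunk_size, warn = FALSE, skipNul = TRUE)
    if (length(chunk) == 0) break  # end of file
    keep <- rbinom(length(chunk), size = 1, prob = prob) == 1
    kept[[length(kept) + 1]] <- chunk[keep]
  }
  # as.character() turns the NULL from an empty list into character(0)
  as.character(unlist(kept, use.names = FALSE))
}

# Small demonstration on a temporary file
tmp <- tempfile(fileext = ".txt")
writeLines(sprintf("line %d", 1:1000), tmp)
s <- sample_lines_random(tmp, prob = 0.1, chunk_size = 100)
length(s)  # roughly 10% of the lines; the exact count depends on the seed
```

The number of lines returned is random rather than fixed, so choose prob from the file size and the memory you can spare.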
# Create corpus and clean text
corpus <- VCorpus(VectorSource(all_text))
clean_corpus <- function(corp) {
  corp <- tm_map(corp, content_transformer(tolower))
  # Punctuation is removed before stopwords, so contractions such as
  # "don't" become "dont" and are not matched by the stopword list
  corp <- tm_map(corp, removePunctuation)
  corp <- tm_map(corp, removeNumbers)
  corp <- tm_map(corp, removeWords, stopwords("en"))
  corp <- tm_map(corp, stripWhitespace)
  corp
}
corpus_clean <- clean_corpus(corpus)
# Preview first cleaned document
# (each document holds a single string, so indexing [1:3] only yields NAs)
if (length(corpus_clean) > 0) {
  cat(content(corpus_clean[[1]]), sep = "\n")
}
## years thereafter oil fields platforms named pagan “gods”
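The preview shows curly quotes around "gods" surviving the punctuation step, because they are not ASCII punctuation. One possible fix, sketched here in base R, is to strip every character in the Unicode punctuation class. The helper name strip_unicode_punct is hypothetical and not part of the original report.

```r
# Hypothetical helper: remove all Unicode punctuation (class \p{P}),
# which covers curly quotes as well as ASCII marks such as "," and "!".
strip_unicode_punct <- function(x) {
  gsub("\\p{P}", "", x, perl = TRUE)
}

example <- c("pagan \u201cgods\u201d", "it\u2019s fine, really!")
cleaned <- strip_unicode_punct(example)
print(cleaned)
```

Assuming tm is loaded, the helper could be applied inside clean_corpus() with tm_map(corp, content_transformer(strip_unicode_punct)). If your installed version of tm documents a ucp argument for removePunctuation, that may achieve the same result without a custom function.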
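tdm <- TermDocumentMatrix(corpus_clean, control = list(wordLengths = c(1, Inf)))
# Sum term counts on the sparse matrix; as.matrix(tdm) would allocate a
# dense terms x documents matrix and can exhaust memory
freq <- sort(slam::row_sums(tdm), decreasing = TRUE)
freq_df <- data.frame(term = names(freq), freq = as.integer(freq), row.names = NULL)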
# Top 20 words
head(freq_df, 20)
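To make clear what the term counts computed above represent, here is a minimal base-R sketch on toy data: split each document on whitespace, tabulate the tokens, and sort by count. The documents and the names docs, counts, and toy_freq_df are invented for illustration only; they are not drawn from the Capstone dataset.

```r
# Toy documents (made up for this example)
docs <- c("oil fields named", "oil platforms", "fields oil")

# Tokenize on whitespace and drop any empty strings
tokens <- unlist(strsplit(docs, "\\s+"))
tokens <- tokens[nzchar(tokens)]

# Count each distinct term and sort from most to least frequent
counts <- sort(table(tokens), decreasing = TRUE)
toy_freq_df <- data.frame(term = names(counts),
                          freq = as.integer(counts),
                          row.names = NULL)

# Share of all token occurrences accounted for by each term
toy_freq_df$share <- toy_freq_df$freq / sum(toy_freq_df$freq)
print(toy_freq_df)
```

The share column gives a quick sense of how concentrated the vocabulary is, which is useful when deciding how many terms to keep.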
# Bar plot of top 20 words
top_n <- 20
top_words <- head(freq_df, top_n)  # avoids NA rows if there are fewer than top_n terms
ggplot(top_words, aes(x = reorder(term, freq), y = freq)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Words", x = "Word", y = "Frequency")
# Wordcloud
if (nrow(freq_df) > 50) {
  suppressWarnings(wordcloud(words = freq_df$term, freq = freq_df$freq,
                             min.freq = 2, max.words = 100, random.order = FALSE))
}
This report demonstrates a memory-conscious workflow for the Capstone
dataset: it reads only the first 10,000 lines of each file.
It has no Java/RWeka dependencies and keeps the same milestone structure.
Knit the document to HTML and publish it to RPubs to obtain your shareable link.