This is the milestone report for the final project of the Data Science Specialization at Johns Hopkins University. My objective is to demonstrate that the text data has been downloaded, loaded, and pre-processed, and to present an exploratory analysis of the corpus. I focus on three English-language datasets (Twitter, News, and Blogs) and provide summary statistics, data cleaning steps, and visualizations that reveal the structure of the data. Finally, I outline the plan for building a predictive text model and developing an interactive Shiny application.
First, I load the necessary packages for data handling and text processing.
library(tm)
library(stringr)
library(dplyr)
library(stringi)
library(knitr)
library(kableExtra)
library(ggplot2)
Next, I download the dataset provided for the course and load its English-language text files.
# Define the URL and local filename for the zip file
local_zip_file <- "Coursera-SwiftKey.zip"
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
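# Download the archive only if it is not already present
if (!file.exists(local_zip_file)) {
  download.file(url, destfile = local_zip_file, mode = "wb")
}
# Extract only if the data directory is missing
if (!dir.exists("final")) {
  unzip(local_zip_file)
}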
# Load the text files
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
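One thing worth checking at this step: on Windows, `readLines()` called on a file path may stop early if a file contains an embedded control character such as SUB (0x1A), which silently shrinks the row count. A row count that looks small relative to the file size is one symptom. A minimal sketch of a more defensive reader follows; the helper name `read_lines_binary` is hypothetical, and the temporary file is only a stand-in for demonstration.

```r
# Hypothetical helper: read a text file through a binary-mode connection so
# that embedded control characters are not interpreted as end-of-file.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Demonstration on a temporary stand-in file with a SUB character in line 2
tmp <- tempfile(fileext = ".txt")
writeBin(charToRaw("first line\nsecond\x1a line\nthird line\n"), tmp)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
```

If this approach were adopted, each `readLines()` call above would become, for example, `read_lines_binary("final/en_US/en_US.news.txt")`.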
I generate summary statistics for each dataset, including file size, number of rows, total word count, and average words per line.
twitter_word_counts <- stri_count_words(twitter)
news_word_counts <- stri_count_words(news)
blogs_word_counts <- stri_count_words(blogs)
summary_table <- data.frame(
  Dataset = c("Twitter", "News", "Blogs"),
  File_Size_MB = c(
    file.size("final/en_US/en_US.twitter.txt") / 1024 / 1024,
    file.size("final/en_US/en_US.news.txt") / 1024 / 1024,
    file.size("final/en_US/en_US.blogs.txt") / 1024 / 1024
  ),
  Number_of_Rows = c(length(twitter), length(news), length(blogs)),
  Total_Words = c(sum(twitter_word_counts), sum(news_word_counts), sum(blogs_word_counts)),
  Mean_Words_Per_Line = c(mean(twitter_word_counts), mean(news_word_counts), mean(blogs_word_counts))
)
kable(summary_table, format = "html", caption = "Summary of Text Files") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE, border_right = TRUE) %>%
  column_spec(2:5, width = "3cm") %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")
| Dataset | File_Size_MB | Number_of_Rows | Total_Words | Mean_Words_Per_Line |
|---|---|---|---|---|
| Twitter | 159.3641 | 2360148 | 30093413 | 12.75065 |
| News | 196.2775 | 77259 | 2674536 | 34.61779 |
| Blogs | 200.4242 | 899288 | 37546250 | 41.75109 |
In addition, I create a boxplot to examine the distribution of words per line in each dataset, based on a sample.
set.seed(123)
sample_twitter <- sample(twitter_word_counts, min(10000, length(twitter_word_counts)))
sample_news <- sample(news_word_counts, min(10000, length(news_word_counts)))
sample_blogs <- sample(blogs_word_counts, min(10000, length(blogs_word_counts)))
words_per_line_sample <- data.frame(
  Dataset = rep(c("Twitter", "News", "Blogs"),
                times = c(length(sample_twitter), length(sample_news), length(sample_blogs))),
  Words = c(sample_twitter, sample_news, sample_blogs)
)
ggplot(words_per_line_sample, aes(x = Dataset, y = Words, fill = Dataset)) +
  geom_boxplot(alpha = 0.8, outlier.color = "red", outlier.shape = 8) +
  labs(title = "Distribution of Words per Line by Dataset", x = "Dataset", y = "Words per Line") +
  theme_classic(base_size = 14) +
  theme(legend.position = "none")
To prepare the data for the predictive text model, I first remove lines that are unlikely to be useful (i.e., those with fewer than 5 characters or only one word). Then, I take a random sample (1% of the data) for efficient processing and apply standard text cleaning techniques: converting text to lowercase, removing punctuation, numbers, and extra whitespace.
# Combine all text files into one vector
combined_text <- c(twitter, news, blogs)
# Remove lines with fewer than 5 characters or with only one word
cleaned_lines <- combined_text[nchar(combined_text) >= 5]
cleaned_lines <- cleaned_lines[str_count(cleaned_lines, boundary("word")) > 1]
# Take a 1% random sample for processing (set seed for reproducibility)
set.seed(123)
sample_text <- sample(cleaned_lines, floor(length(cleaned_lines) * 0.01))
# Create a text corpus and apply cleaning transformations
corpus <- Corpus(VectorSource(sample_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)
# Create a Term-Document Matrix
tdm <- TermDocumentMatrix(corpus)
# Sum term counts on the sparse matrix; as.matrix(tdm) would build a dense matrix and can exhaust memory
word_freqs <- sort(slam::row_sums(tdm), decreasing = TRUE)
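To make the effect of these transformations concrete, the sketch below applies base-R approximations of the four cleaning steps to a made-up sentence. The `gsub()` patterns are assumed stand-ins for tm's `removePunctuation`, `removeNumbers`, and `stripWhitespace` with default settings, not their exact implementation.

```r
# Made-up example line (not taken from the corpus)
x <- "It's 9am -- DON'T stop!"

x <- tolower(x)                    # convert to lowercase
x <- gsub("[[:punct:]]+", "", x)   # drop punctuation, including apostrophes
x <- gsub("[[:digit:]]+", "", x)   # drop digits
x <- gsub("[[:space:]]+", " ", x)  # collapse runs of whitespace
x
# "its am dont stop"
```

Two side effects are visible: contractions lose their apostrophe ("don't" becomes "dont"), and removing digits can leave fragments such as "am". In addition, `TermDocumentMatrix()` by default keeps only terms of at least three characters, so very short words do not appear in `word_freqs`.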
The bar plot below shows the top 20 most frequent words.
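A minimal sketch of how such a bar plot could be produced is shown here. It assumes a named numeric vector shaped like `word_freqs` from the previous step; the vector `word_freqs_demo` is a placeholder so the example runs on its own, and the styling choices are illustrative.

```r
library(ggplot2)

# Placeholder for word_freqs (names are words, values are counts)
word_freqs_demo <- setNames(25:1, paste0("word", 1:25))

# Keep the 20 most frequent words
top_words <- head(sort(word_freqs_demo, decreasing = TRUE), 20)
top_words_df <- data.frame(Word = names(top_words), Frequency = as.numeric(top_words))

# Horizontal bars, ordered by frequency
top_words_plot <- ggplot(top_words_df, aes(x = reorder(Word, Frequency), y = Frequency)) +
  geom_col(fill = "#2980B9") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words", x = "Word", y = "Frequency") +
  theme_classic(base_size = 14)
top_words_plot
```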
The following histogram displays the distribution of word frequencies in the sample corpus.
word_freq_df <- data.frame(Frequency = word_freqs)
ggplot(word_freq_df, aes(x = Frequency)) +
  geom_histogram(bins = 50, fill = "#27AE60", color = "white", alpha = 0.9) +
  labs(title = "Distribution of Word Frequencies", x = "Word Frequency", y = "Count") +
  theme_classic(base_size = 14) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))
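Word-frequency distributions are typically highly skewed, so most of the mass in a histogram like this sits in the first bin. A complementary view is to ask how many distinct words are needed to cover a given share of all word occurrences. The sketch below illustrates that calculation; the function name `words_for_coverage` and the toy counts are hypothetical, and in this report it would be applied to `word_freqs`.

```r
# Hypothetical helper: number of most-frequent words needed to reach a given
# share (between 0 and 1) of all word occurrences.
words_for_coverage <- function(freqs, coverage) {
  sorted <- sort(freqs, decreasing = TRUE)
  cumulative_share <- cumsum(sorted) / sum(sorted)
  which(cumulative_share >= coverage)[1]
}

# Toy counts, not taken from the corpus
toy_freqs <- c(the = 50, and = 30, cat = 10, sat = 5, mat = 5)
words_for_coverage(toy_freqs, 0.50)  # 1 word covers half of the occurrences
words_for_coverage(toy_freqs, 0.85)  # 3 words are needed for 85%
```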
The density plot below shows the distribution of words per line in each dataset, using the same samples as the boxplot. Differences between the curves reflect variations in text structure and writing style.
ggplot(words_per_line_sample, aes(x = Words, fill = Dataset)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Words per Line", x = "Words per Line", y = "Density") +
  theme_classic(base_size = 14) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))
Based on the visualizations, I observe the following:
Based on the exploratory analysis, the next steps include:
This milestone report demonstrates that the text data was downloaded, cleaned, and explored successfully. The summary statistics and visualizations provide a foundation for developing the predictive text model and the interactive Shiny application. The next phase will focus on constructing, optimizing, and validating the model, with the goal of deploying a user-friendly tool for real-time text prediction.