Introduction

This milestone report is part of the final project of the Data Science Specialization at Johns Hopkins University. My objective is to demonstrate that the text data has been successfully downloaded, loaded, and pre-processed, and to conduct an exploratory analysis of the corpus. I focus on three datasets (Twitter, News, and Blogs) and provide summary statistics, data cleaning steps, and visualizations that reveal the underlying structure of the data. Finally, I outline the plan for building a predictive text model and developing an interactive Shiny application.

Loading the data

First, I load the necessary packages for data handling and text processing.

library(tm)
library(stringr)
library(dplyr)
library(stringi)
library(knitr)
library(kableExtra)
library(ggplot2)

Next, I download and load the English text data provided for the course.

# Define the URL and local filename for the zip file
local_zip_file <- "Coursera-SwiftKey.zip"
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

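# Download the archive only if it is not already present
if (!file.exists(local_zip_file)) {
  download.file(url, destfile = local_zip_file, mode = "wb")
}

# Extract the archive only if the English files are not already present
if (!dir.exists("final/en_US")) {
  unzip(local_zip_file)
}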

# Load the text files
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
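
On some platforms (notably Windows), readLines() on a text-mode connection can stop early when a file contains an embedded end-of-file control character, which would leave a corpus file only partially loaded. The News row count in the summary table is low relative to its file size, so this may be worth checking. A minimal sketch of a workaround follows; the helper read_lines_binary() is hypothetical and not part of the original analysis. It opens the file in binary mode before reading.

```r
# Hypothetical helper (not from the original report): read a text file through
# a binary-mode connection so that embedded control characters do not end the
# read early on platforms where text-mode reads are affected.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-check on a temporary three-line file
tmp <- tempfile(fileext = ".txt")
writeLines(c("first line", "second line", "third line"), tmp)
length(read_lines_binary(tmp))
```

Comparing length(read_lines_binary("final/en_US/en_US.news.txt")) with length(news) would show whether the News file was read in full.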

Data Summary

I generate summary statistics for each dataset, including file size, number of rows, total word count, and average words per line.

twitter_word_counts <- stri_count_words(twitter)
news_word_counts    <- stri_count_words(news)
blogs_word_counts   <- stri_count_words(blogs)

summary_table <- data.frame(
  Dataset             = c("Twitter", "News", "Blogs"),
  File_Size_MB        = c(
    file.size("final/en_US/en_US.twitter.txt") / 1024 / 1024,
    file.size("final/en_US/en_US.news.txt") / 1024 / 1024,
    file.size("final/en_US/en_US.blogs.txt") / 1024 / 1024
  ),
  Number_of_Rows      = c(length(twitter), length(news), length(blogs)),
  Total_Words         = c(sum(twitter_word_counts), sum(news_word_counts), sum(blogs_word_counts)),
  Mean_Words_Per_Line = c(mean(twitter_word_counts), mean(news_word_counts), mean(blogs_word_counts))
)

kable(summary_table, format = "html", caption = "Summary of Text Files") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE, border_right = TRUE) %>%
  column_spec(2:5, width = "3cm") %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")

Summary of Text Files

Dataset   File_Size_MB   Number_of_Rows   Total_Words   Mean_Words_Per_Line
Twitter       159.3641          2360148      30093413              12.75065
News          196.2775            77259       2674536              34.61779
Blogs         200.4242           899288      37546250              41.75109

In addition, I create a boxplot to examine the distribution of words per line in each dataset, based on a sample.

set.seed(123)
sample_twitter <- sample(twitter_word_counts, min(10000, length(twitter_word_counts)))
sample_news    <- sample(news_word_counts, min(10000, length(news_word_counts)))
sample_blogs   <- sample(blogs_word_counts, min(10000, length(blogs_word_counts)))

words_per_line_sample <- data.frame(
  Dataset = rep(c("Twitter", "News", "Blogs"),
                times = c(length(sample_twitter), length(sample_news), length(sample_blogs))),
  Words = c(sample_twitter, sample_news, sample_blogs)
)

ggplot(words_per_line_sample, aes(x = Dataset, y = Words, fill = Dataset)) +
  geom_boxplot(alpha = 0.8, outlier.color = "red", outlier.shape = 8) +
  labs(title = "Distribution of Words per Line by Dataset", x = "Dataset", y = "Words per Line") +
  theme_classic(base_size = 14) +
  theme(legend.position = "none")

Data Cleaning and Preprocessing

To prepare the data for the predictive text model, I first remove lines that are unlikely to be useful (those with fewer than 5 characters or only one word). I then take a 1% random sample for efficient processing and apply standard text cleaning steps: converting text to lowercase and removing punctuation, numbers, and extra whitespace.

# Combine all text files into one vector
combined_text <- c(twitter, news, blogs)

# Remove lines with fewer than 5 characters or with only one word
cleaned_lines <- combined_text[nchar(combined_text) >= 5]
cleaned_lines <- cleaned_lines[str_count(cleaned_lines, boundary("word")) > 1]

# Take a 1% random sample for processing (set seed for reproducibility)
set.seed(123)
sample_text <- sample(cleaned_lines, floor(length(cleaned_lines) / 100))

# Create a text corpus and apply cleaning transformations
corpus <- Corpus(VectorSource(sample_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

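# Create a Term-Document Matrix and compute term frequencies.
# Note: by default TermDocumentMatrix() keeps only terms of at least 3 characters.
# slam::row_sums() sums the sparse matrix directly and avoids a large dense copy.
tdm <- TermDocumentMatrix(corpus)
word_freqs <- sort(slam::row_sums(tdm), decreasing = TRUE)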
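
To make the effect of the cleaning steps concrete, the sketch below applies a base-R approximation of the same transformations to a single string. The helper clean_line() and the sample sentence are hypothetical; they only mirror what the tm transformations above do by default and are not part of the original pipeline.

```r
# Hypothetical helper: base-R approximation of the tm cleaning steps above
clean_line <- function(x) {
  x <- tolower(x)                      # convert to lowercase
  x <- gsub("[[:punct:]]+", "", x)     # remove punctuation
  x <- gsub("[[:digit:]]+", "", x)     # remove numbers
  x <- gsub("[[:space:]]+", " ", x)    # collapse runs of whitespace
  trimws(x)                            # extra step: trim leading/trailing spaces
}

clean_line("Hello, World!! 123  Test")
# returns "hello world test"
```

Removing punctuation also deletes apostrophes, so a contraction such as "don't" becomes "dont". This may matter when the cleaned text is later used for word prediction.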

Visualization

Top 20 Most Frequent Words

The bar plot below shows the top 20 most frequent words.
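
One way to produce such a plot from the named frequency vector word_freqs is sketched below. The helper names top_n_words() and plot_top_words(), the fill colour, and the toy input are hypothetical and not taken from the original analysis.

```r
# Hypothetical helper: keep the n most frequent terms of a named frequency vector
top_n_words <- function(word_freqs, n = 20) {
  top <- head(sort(word_freqs, decreasing = TRUE), n)
  data.frame(Word = names(top), Frequency = as.numeric(top),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: horizontal bar plot of the top n terms (requires ggplot2)
plot_top_words <- function(word_freqs, n = 20) {
  top_df <- top_n_words(word_freqs, n)
  ggplot2::ggplot(top_df,
                  ggplot2::aes(x = stats::reorder(Word, Frequency), y = Frequency)) +
    ggplot2::geom_col(fill = "#2980B9", alpha = 0.9) +
    ggplot2::coord_flip() +
    ggplot2::labs(title = paste("Top", n, "Most Frequent Words"),
                  x = "Word", y = "Frequency") +
    ggplot2::theme_classic(base_size = 14)
}

# Toy input to show the shape of the intermediate data frame
toy_freqs <- c(the = 50, and = 30, you = 20, that = 10)
top_n_words(toy_freqs, n = 3)
```

With the objects defined earlier, plot_top_words(word_freqs) would draw the chart for the sampled corpus.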

Distribution of Word Frequencies

The following histogram displays the distribution of word frequencies in the sample corpus.

word_freq_df <- data.frame(Frequency = word_freqs)
ggplot(word_freq_df, aes(x = Frequency)) +
  geom_histogram(bins = 50, fill = "#27AE60", color = "white", alpha = 0.9) +
  labs(title = "Distribution of Word Frequencies", x = "Word Frequency", y = "Count") +
  theme_classic(base_size = 14) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Density of Words per Line

This density plot provides insight into the distribution of words per line across datasets. The differences in density reflect variations in text structure and writing style.

ggplot(words_per_line_sample, aes(x = Words, fill = Dataset)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Words per Line", x = "Words per Line", y = "Density") +
  theme_classic(base_size = 14) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Insights

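Based on the summary table and the visualizations, I observe the following: Twitter has by far the most lines but the shortest ones (about 13 words per line on average), while News and Blogs lines are much longer (about 35 and 42 words per line on average, respectively).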

Plans for the Predictive Model and Shiny Application

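Based on the exploratory analysis, the next steps are to build the predictive text model, optimize and validate it, and deploy it in an interactive Shiny application.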
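
The modelling approach is not fixed in this report. As one common possibility, the sketch below shows a toy bigram-frequency predictor in base R. The function names build_bigram_table() and predict_next() and the sample sentences are hypothetical and purely illustrative.

```r
# Hypothetical sketch: count adjacent word pairs (bigrams) in already-cleaned lines
build_bigram_table <- function(lines) {
  tokens <- strsplit(lines, " +")
  pairs <- lapply(tokens, function(w) {
    if (length(w) < 2) return(NULL)
    data.frame(first = head(w, -1), second = tail(w, -1),
               stringsAsFactors = FALSE)
  })
  pairs <- do.call(rbind, pairs)
  pairs$n <- 1L
  counts <- aggregate(n ~ first + second, data = pairs, FUN = sum)
  counts[order(counts$first, -counts$n), ]
}

# Hypothetical sketch: return the most frequent follower of a word, or NA if unseen
predict_next <- function(bigrams, word) {
  candidates <- bigrams[bigrams$first == word, ]
  if (nrow(candidates) == 0) return(NA_character_)
  candidates$second[which.max(candidates$n)]
}

toy_lines <- c("i love data science", "i love r", "i like data")
bigrams <- build_bigram_table(toy_lines)
predict_next(bigrams, "i")
# returns "love"
```

A model for the application would likely need longer n-grams and a fallback for unseen words; this sketch only illustrates the basic lookup.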

Conclusions

This milestone report shows that the text data has been downloaded, cleaned, and explored. The summary statistics and visualizations provide a foundation for developing the predictive text model and the interactive Shiny application. The next phase will focus on constructing, optimizing, and validating the model, with the goal of deploying a user-friendly tool for real-time text prediction.