Introduction

This report corresponds to the milestone for the final project of the Data Science Specialization at Johns Hopkins University. My objective is to demonstrate that the text data has been successfully downloaded, loaded, and pre-processed, and to conduct an exploratory analysis of the corpus. In this report, I focus on three datasets—Twitter, News, and Blogs—and provide summary statistics, data cleaning steps, and visualizations that reveal the underlying structure of the data. Finally, I outline the plan for building a predictive text model and developing an interactive Shiny application.

Loading the data

First, I load the necessary packages for data handling and text processing.

library(tm)
library(stringr)
library(dplyr)
library(stringi)
library(knitr)
library(kableExtra)
library(ggplot2)

Next, I download and load the necessary text data in English provided for the course.

# Define the URL and local filename for the zip file
local_zip_file <- "Coursera-SwiftKey.zip"
url <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"

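# Download the archive only if it is not already present
if (!file.exists(local_zip_file)) {
  download.file(url, destfile = local_zip_file, mode = "wb")
}

# Unzip only if the extracted folder is missing
if (!dir.exists("final")) {
  unzip(local_zip_file)
}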

# Load the text files
twitter <- readLines("final/en_US/en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE)
news    <- readLines("final/en_US/en_US.news.txt", encoding = "UTF-8", skipNul = TRUE)
blogs   <- readLines("final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)
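One thing worth checking, as an assumption rather than something verified in this report, is whether text-mode reading stops early on some platforms (notably Windows) when a file contains an embedded control character. That would show up as a line count that looks small relative to the file size. The sketch below reads through a binary-mode connection instead. The helper name read_lines_binary is hypothetical and is not part of the original code.

```r
# Hypothetical helper (not part of the original report): read all lines of a
# UTF-8 text file through a binary-mode connection.
read_lines_binary <- function(path) {
  con <- file(path, open = "rb")
  on.exit(close(con))
  readLines(con, encoding = "UTF-8", skipNul = TRUE)
}

# Small self-contained check using a temporary file whose second line
# contains an embedded SUB control character (0x1A)
tmp <- tempfile(fileext = ".txt")
writeBin(
  c(charToRaw("first line\nsecond"), as.raw(0x1a), charToRaw("line\nthird line\n")),
  tmp
)
demo_lines <- read_lines_binary(tmp)
length(demo_lines)
unlink(tmp)
```

If the line counts differ between this helper and the plain readLines calls above, the binary-mode result is the one to trust for the summary statistics.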

Data Summary

I generate summary statistics for each dataset, including file size, number of rows, total word count, and average words per line.

twitter_word_counts <- stri_count_words(twitter)
news_word_counts    <- stri_count_words(news)
blogs_word_counts   <- stri_count_words(blogs)

summary_table <- data.frame(
  Dataset             = c("Twitter", "News", "Blogs"),
  File_Size_MB        = c(
    file.size("final/en_US/en_US.twitter.txt") / 1024 / 1024,
    file.size("final/en_US/en_US.news.txt") / 1024 / 1024,
    file.size("final/en_US/en_US.blogs.txt") / 1024 / 1024
  ),
  Number_of_Rows      = c(length(twitter), length(news), length(blogs)),
  Total_Words         = c(sum(twitter_word_counts), sum(news_word_counts), sum(blogs_word_counts)),
  Mean_Words_Per_Line = c(mean(twitter_word_counts), mean(news_word_counts), mean(blogs_word_counts))
)

kable(summary_table, format = "html", caption = "Summary of Text Files") %>%
  kable_styling(bootstrap_options = c("striped", "hover"), full_width = FALSE, font_size = 12) %>%
  column_spec(1, bold = TRUE, border_right = TRUE) %>%
  column_spec(2:5, width = "3cm") %>%
  row_spec(0, bold = TRUE, background = "#D3D3D3")

Summary of Text Files
Dataset	File_Size_MB	Number_of_Rows	Total_Words	Mean_Words_Per_Line
Twitter	159.3641	2360148	30093413	12.75065
News	196.2775	77259	2674536	34.61779
Blogs	200.4242	899288	37546250	41.75109

In addition, I create a boxplot to examine the distribution of words per line in each dataset, based on a sample.

set.seed(123)
sample_twitter <- sample(twitter_word_counts, min(10000, length(twitter_word_counts)))
sample_news    <- sample(news_word_counts, min(10000, length(news_word_counts)))
sample_blogs   <- sample(blogs_word_counts, min(10000, length(blogs_word_counts)))

words_per_line_sample <- data.frame(
  Dataset = rep(c("Twitter", "News", "Blogs"),
                times = c(length(sample_twitter), length(sample_news), length(sample_blogs))),
  Words = c(sample_twitter, sample_news, sample_blogs)
)

ggplot(words_per_line_sample, aes(x = Dataset, y = Words, fill = Dataset)) +
  geom_boxplot(alpha = 0.8, outlier.color = "red", outlier.shape = 8) +
  labs(title = "Distribution of Words per Line by Dataset", x = "Dataset", y = "Words per Line") +
  theme_classic(base_size = 14) +
  theme(legend.position = "none")

Data Cleaning and Preprocessing

To prepare the data for the predictive text model, I first remove lines that are unlikely to be useful (i.e., those with fewer than 5 characters or only one word). Then, I take a random sample (1% of the data) for efficient processing and apply standard text cleaning techniques: converting text to lowercase, removing punctuation, numbers, and extra whitespace.

# Combine all text files into one vector
combined_text <- c(twitter, news, blogs)

# Remove lines with fewer than 5 characters or with only one word
cleaned_lines <- combined_text[nchar(combined_text) >= 5]
cleaned_lines <- cleaned_lines[str_count(cleaned_lines, boundary("word")) > 1]

# Take a 1% random sample for processing (set seed for reproducibility)
set.seed(123)
sample_text <- sample(cleaned_lines, floor(length(cleaned_lines) / 100))

# Create a text corpus and apply cleaning transformations
corpus <- Corpus(VectorSource(sample_text))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

# Create a Term-Document Matrix
tdm <- TermDocumentMatrix(corpus)
# Sum term counts directly on the sparse matrix (avoids building a large dense copy)
word_freqs <- sort(slam::row_sums(tdm), decreasing = TRUE)

Visualization

Top 20 Most Frequent Words

The bar plot below shows the top 20 most frequent words.
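The code for this plot does not appear in the text, so the following is a minimal sketch of how such a bar plot could be produced, not necessarily the code used for the report. The helpers top_words_df and plot_top_words are hypothetical names, and the toy frequencies stand in for word_freqs so that the example runs on its own.

```r
library(ggplot2)

# Hypothetical helper: data frame of the n most frequent terms from a
# named frequency vector such as word_freqs
top_words_df <- function(freqs, n = 20) {
  top <- head(sort(freqs, decreasing = TRUE), n)
  data.frame(Word = names(top), Frequency = as.numeric(top),
             stringsAsFactors = FALSE)
}

# Hypothetical helper: horizontal bar plot of the top n terms
plot_top_words <- function(freqs, n = 20) {
  top_df <- top_words_df(freqs, n)
  ggplot(top_df, aes(x = reorder(Word, Frequency), y = Frequency)) +
    geom_col(fill = "#2980B9", alpha = 0.9) +
    coord_flip() +
    labs(title = paste("Top", nrow(top_df), "Most Frequent Words"),
         x = "Word", y = "Frequency") +
    theme_classic(base_size = 14) +
    theme(plot.title = element_text(face = "bold", hjust = 0.5))
}

# Toy frequencies for illustration only; in the report this would be word_freqs
toy_freqs <- setNames(c(50, 30, 20, 18, 12),
                      c("the", "and", "for", "you", "that"))
top3 <- top_words_df(toy_freqs, n = 3)
p <- plot_top_words(toy_freqs, n = 3)
```

With the real data, the call would be plot_top_words(word_freqs, n = 20).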

Distribution of Word Frequencies

The following histogram displays the distribution of word frequencies in the sample corpus.

word_freq_df <- data.frame(Frequency = word_freqs)
ggplot(word_freq_df, aes(x = Frequency)) +
  geom_histogram(bins = 50, fill = "#27AE60", color = "white", alpha = 0.9) +
  labs(title = "Distribution of Word Frequencies", x = "Word Frequency", y = "Count") +
  theme_classic(base_size = 14) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Density of Words per Line

This density plot provides insight into the distribution of words per line across datasets. The differences in density reflect variations in text structure and writing style.

ggplot(words_per_line_sample, aes(x = Words, fill = Dataset)) +
  geom_density(alpha = 0.5) +
  labs(title = "Density Plot of Words per Line", x = "Words per Line", y = "Density") +
  theme_classic(base_size = 14) +
  theme(plot.title = element_text(face = "bold", hjust = 0.5))

Insights

Based on the visualizations, I observe the following:

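The boxplot indicates that the number of words per line varies considerably among the datasets: the summary table gives means of about 12.8 words per line for Twitter, 34.6 for News, and 41.8 for Blogs, suggesting differences in language style and content length.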
The top words bar plot highlights common words that dominate the corpus, which will be important when building n-gram models.
The histogram of word frequencies shows a long-tail distribution, which is typical in text data and implies that many words appear infrequently.
The density plot reinforces the variability in text structure, guiding decisions on sampling and model parameterization.
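The long-tail observation can be made more concrete by asking how many distinct words are needed to cover a given share of all word occurrences. The sketch below shows one way to compute this. The function name words_for_coverage and the toy frequencies are illustrative assumptions; in the report the input would be word_freqs.

```r
# Hypothetical helper: smallest number of distinct words whose combined
# frequency reaches the requested share of all word occurrences
words_for_coverage <- function(freqs, coverage) {
  sorted <- sort(as.numeric(freqs), decreasing = TRUE)
  cum_share <- cumsum(sorted) / sum(sorted)
  which(cum_share >= coverage)[1]
}

# Toy frequencies for illustration only
toy_freqs <- c(50, 30, 10, 5, 5)
cov_40 <- words_for_coverage(toy_freqs, 0.40)
cov_85 <- words_for_coverage(toy_freqs, 0.85)
```

A small number of words needed for 50% coverage, compared with a much larger number for 90%, would confirm the long tail and help decide how much of the vocabulary the model needs to keep.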

Plans for the Predictive Model and Shiny Application

Based on the exploratory analysis, the next steps include:

Building N-gram Models: I will construct bigram and trigram models from the cleaned corpus to capture contextual word sequences.
Model Optimization: I plan to apply smoothing techniques and back-off strategies to address unseen n-grams and improve prediction accuracy.
Model Evaluation: The performance of the predictive text model will be assessed using a held-out test set to ensure reliability.
Shiny Application Development: I will develop an interactive Shiny application where a user inputs a phrase and receives a real-time prediction for the next word. The interface will be designed to be intuitive and accessible to non-technical stakeholders.
The summary statistics and visualizations above will guide the choice of model parameters and features for the final predictive algorithm.
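To illustrate the n-gram and back-off steps, here is a minimal base R sketch under simplifying assumptions. It counts n-grams in already cleaned text and predicts the next word by trying the trigram table first, then bigrams, then the most frequent single word. It applies no smoothing or discounting, so it is a simplified back-off and not the final model. The function names count_ngrams and predict_next and the toy sentences are hypothetical.

```r
# Hypothetical helper: count n-grams in a character vector of cleaned lines
count_ngrams <- function(lines, n) {
  tokens <- strsplit(lines, "\\s+")
  grams <- unlist(lapply(tokens, function(tok) {
    tok <- tok[nzchar(tok)]
    if (length(tok) < n) return(character(0))
    idx <- seq_len(length(tok) - n + 1)
    vapply(idx, function(i) paste(tok[i:(i + n - 1)], collapse = " "),
           character(1))
  }))
  sort(table(grams), decreasing = TRUE)
}

# Hypothetical helper: simple back-off from trigrams to bigrams to unigrams
predict_next <- function(phrase, trigrams, bigrams, unigrams) {
  words <- strsplit(tolower(trimws(phrase)), "\\s+")[[1]]
  k <- length(words)
  if (k >= 2) {
    # Look for trigrams that begin with the last two words of the phrase
    prefix <- paste0(paste(words[(k - 1):k], collapse = " "), " ")
    hit <- trigrams[startsWith(names(trigrams), prefix)]
    if (length(hit) > 0) return(sub(".* ", "", names(hit)[1]))
  }
  if (k >= 1) {
    # Back off to bigrams that begin with the last word
    prefix <- paste0(words[k], " ")
    hit <- bigrams[startsWith(names(bigrams), prefix)]
    if (length(hit) > 0) return(sub(".* ", "", names(hit)[1]))
  }
  # Final fallback: the most frequent single word
  names(unigrams)[1]
}

# Toy corpus for illustration only
toy_lines <- c("i love data science", "i love data analysis",
               "i love data science projects", "you love text mining")
uni <- count_ngrams(toy_lines, 1)
bi  <- count_ngrams(toy_lines, 2)
tri <- count_ngrams(toy_lines, 3)

pred_trigram <- predict_next("I love data", tri, bi, uni)
pred_bigram  <- predict_next("we love", tri, bi, uni)
pred_unigram <- predict_next("zzz", tri, bi, uni)
```

The three calls exercise each level in turn: the first is answered from the trigram table, the second backs off to bigrams, and the third falls through to the most frequent word.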

Conclusions

This milestone report demonstrates the successful download, cleaning, and exploratory analysis of the text data. The summary statistics and visualizations provide a solid foundation for developing the predictive text model and the interactive Shiny application. The next phase will focus on constructing, optimizing, and validating the model, with the ultimate goal of deploying a user-friendly tool for real-time text prediction.

Data Science Project Milestone Report

Gabriel Sotomayor

2025-03-01