1. Introduction

This Milestone Report is part of the Capstone Project for the Johns Hopkins Data Science Specialization. The final goal is to build a predictive text application similar to SwiftKey. In this report, I demonstrate that the data has been successfully loaded, cleaned, and explored, and I outline my plan for building the prediction model and Shiny app.


2. Data Summary

The data includes three English text files: blogs, news, and Twitter. Below is a summary of the number of lines and total word counts in each file.

library(stringr)
library(ggplot2)
library(tidytext)
library(dplyr)
library(tibble)
library(tidyr)

# Load data
blogs <- readLines("en_US.blogs.txt", warn = FALSE)
news <- readLines("en_US.news.txt", warn = FALSE)
twitter <- readLines("en_US.twitter.txt", warn = FALSE)

# Line and word counts
data_summary <- data.frame(
  Source = c("Blogs", "News", "Twitter"),
  Lines = c(length(blogs), length(news), length(twitter)),
  Words = c(
    sum(str_count(blogs, "\\S+")),
    sum(str_count(news, "\\S+")),
    sum(str_count(twitter, "\\S+"))
  )
)

data_summary
##    Source   Lines    Words
## 1   Blogs  899288 37334131
## 2    News   77259  2643969
## 3 Twitter 2360148 30373543

3. Word Count by Source

ggplot(data_summary, aes(x = Source, y = Words, fill = Source)) +
  geom_col() +
  labs(title = "Word Count by Source", x = "Source", y = "Word Count") +
  theme_minimal()


4. Most Frequent Words

We take the first 25,000 lines from each file, clean the text, remove stopwords, and visualize the 20 most frequent words.

# Sample and clean text
sample_text <- c(blogs[1:25000], news[1:25000], twitter[1:25000])
sample_text <- tolower(sample_text)
sample_text <- gsub("[^a-z\\s]", "", sample_text, perl = TRUE)  # keep only letters and whitespace
sample_text <- gsub("\\s+", " ", sample_text)
sample_text <- trimws(sample_text)
sample_text <- sample_text[sample_text != ""]

# Tokenize and remove stopwords
data("stop_words")
tokens <- tibble(text = sample_text) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words)
## Joining with `by = join_by(word)`

# Count top words
word_freq <- tokens %>%
  count(word, sort = TRUE) %>%
  filter(nchar(word) < 20)   # drop very long junk tokens

# Plot
word_freq %>%
  slice_max(n, n = 20) %>%
  ggplot(aes(x = reorder(word, n), y = n)) +
  geom_col(fill = "steelblue") +
  coord_flip() +
  labs(title = "Top 20 Most Frequent Words (Cleaned)", x = "Word", y = "Frequency") +
  theme_minimal()


5. Most Frequent Bigrams

# Create and clean bigrams
bigrams <- tibble(text = sample_text) %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(bigram)) %>%
  separate(bigram, c("word1", "word2"), sep = " ") %>%
  filter(!word1 %in% stop_words$word,
         !word2 %in% stop_words$word) %>%
  unite(bigram, word1, word2, sep = " ") %>%
  count(bigram, sort = TRUE) %>%
  filter(nchar(bigram) < 25)   # drop very long junk bigrams

# Plot
bigrams %>%
  slice_max(n, n = 20) %>%
  ggplot(aes(x = reorder(bigram, n), y = n)) +
  geom_col(fill = "darkorange") +
  coord_flip() +
  labs(title = "Top 20 Bigrams (No Stopwords)", x = "Bigram", y = "Frequency") +
  theme_minimal()


6. Most Frequent Trigrams

Filtering stopwords out of trigrams left too few results to plot, so here we take a larger slice (the first 40,000 lines of each file) and skip stopword filtering.

# Larger sample for trigrams
sample_text_trigrams <- c(blogs[1:40000], news[1:40000], twitter[1:40000])
sample_text_trigrams <- tolower(sample_text_trigrams)
sample_text_trigrams <- gsub("[^a-z\\s]", "", sample_text_trigrams, perl = TRUE)  # keep only letters and whitespace
sample_text_trigrams <- gsub("\\s+", " ", sample_text_trigrams)
sample_text_trigrams <- trimws(sample_text_trigrams)
sample_text_trigrams <- sample_text_trigrams[sample_text_trigrams != ""]

# Create and count trigrams (no stopword filtering)
trigrams <- tibble(text = sample_text_trigrams) %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) %>%
  count(trigram, sort = TRUE) %>%
  filter(nchar(trigram) < 40)   # drop very long junk trigrams

# Plot
trigrams %>%
  slice_max(n, n = 20) %>%
  ggplot(aes(x = reorder(trigram, n), y = n)) +
  geom_col(fill = "purple") +
  coord_flip() +
  labs(title = "Top 20 Trigrams (Basic Clean)", x = "Trigram", y = "Frequency") +
  theme_minimal()


7. Plan for Prediction Algorithm and Shiny App

For the final deliverable, I plan to:

- Build an n-gram model using bigrams, trigrams, and quadgrams
- Use a backoff strategy: try the longest matching n-gram first, then back off to shorter ones (sketched below)
- Apply smoothing (Laplace or similar) to handle unseen phrases
- Minimize memory use by filtering out low-frequency n-grams
- Create a Shiny app that takes text input and predicts the next word
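
As a rough illustration of the backoff idea, the sketch below builds pruned bigram and trigram lookup tables from the cleaned sample_text of Section 4 (keeping stopwords, since they are valid predictions) and backs off from trigrams to bigrams. The names bigram_tbl, trigram_tbl, and predict_next_word are placeholders rather than the final implementation; quadgrams and smoothing would be layered on in the same way.

# Uses the packages loaded in Section 2 and sample_text from Section 4
# Pruned lookup tables (singletons dropped to save memory)
bigram_tbl <- tibble(text = sample_text) %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 2) %>%
  filter(!is.na(ngram)) %>%
  separate(ngram, c("w1", "w2"), sep = " ") %>%
  count(w1, w2, sort = TRUE) %>%
  filter(n > 1)

trigram_tbl <- tibble(text = sample_text) %>%
  unnest_tokens(ngram, text, token = "ngrams", n = 3) %>%
  filter(!is.na(ngram)) %>%
  separate(ngram, c("w1", "w2", "w3"), sep = " ") %>%
  count(w1, w2, w3, sort = TRUE) %>%
  filter(n > 1)

predict_next_word <- function(phrase) {
  # Clean the input the same way as the training text
  phrase <- gsub("[^a-z\\s]", "", tolower(phrase), perl = TRUE)
  words <- strsplit(trimws(phrase), "\\s+")[[1]]
  k <- length(words)

  # Try the trigram table first (last two words as context)
  if (k >= 2) {
    hit <- trigram_tbl %>%
      filter(w1 == words[k - 1], w2 == words[k]) %>%
      slice_max(n, n = 1, with_ties = FALSE)
    if (nrow(hit) > 0) return(hit$w3)
  }

  # Back off to the bigram table (last word as context)
  if (k >= 1) {
    hit <- bigram_tbl %>%
      filter(w1 == words[k]) %>%
      slice_max(n, n = 1, with_ties = FALSE)
    if (nrow(hit) > 0) return(hit$w2)
  }

  "the"   # last-resort fallback: a very common word
}

predict_next_word("thanks for the")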

The Shiny app will be deployed on shinyapps.io and designed to run efficiently on limited resources.
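
To make the interface concrete, below is a minimal sketch of the planned app, assuming a predict_next_word() function like the placeholder sketched above. The deployed app would load precomputed, pruned n-gram tables at startup rather than rebuilding them, to stay within the memory limits of shinyapps.io.

library(shiny)

ui <- fluidPage(
  titlePanel("Next Word Prediction"),
  textInput("phrase", "Type a phrase:", value = ""),
  h4("Predicted next word:"),
  textOutput("prediction")
)

server <- function(input, output) {
  output$prediction <- renderText({
    if (trimws(input$phrase) == "") return("")
    predict_next_word(input$phrase)
  })
}

shinyApp(ui, server)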


8. Conclusion

The data has been loaded, summarized, and explored through word, bigram, and trigram frequencies. The next step is to build the predictive model and integrate it into a user-friendly Shiny app.