Coursera Capstone Project Milestone Report

Data Science Specialization from Johns Hopkins University

Author

Daniel Morales

Published

June 24, 2025

Introduction

This is the Milestone Report for the Capstone Project of the Coursera and Johns Hopkins University Data Science Specialization. The goal of the Capstone Project is to build a Shiny app with a text box that, using the given data and much like a smartphone keyboard, suggests three options for what the next typed word might be.

The goal of this Milestone Report is to show that we are able to download, explore, and begin modeling the data. The data is available to download here, and we will be using the English-language files listed below:

  • en_US.blogs.txt
  • en_US.news.txt
  • en_US.twitter.txt

We assume that the data has already been downloaded, unzipped, and placed in the active R working directory.
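
For reference, a minimal sketch of that download step is shown below. It assumes the standard Coursera-SwiftKey.zip archive URL used in the course and its final/en_US/ folder layout; adjust the URL and paths if your copy differs.

# Sketch only: download and unzip the dataset if it is not already available.
# The URL and archive paths are assumptions based on the standard course archive.
zip_url  <- "https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip"
zip_file <- "Coursera-SwiftKey.zip"

if (!file.exists("en_US.blogs.txt")) {
  download.file(zip_url, destfile = zip_file, mode = "wb")
  unzip(zip_file,
        files = c("final/en_US/en_US.blogs.txt",
                  "final/en_US/en_US.news.txt",
                  "final/en_US/en_US.twitter.txt"),
        junkpaths = TRUE)  # drop folder structure, extract to working directory
}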

Setup

We start by loading the required R packages and the data.

Show/hide code
library(dplyr)
library(ggplot2)
library(patchwork)
library(knitr)
library(stringi)
library(stringr)
library(tidyr)
library(tidytext)
library(tm)

blogs <- readLines(
  "en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
news <- readLines(
  "en_US.news.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)
twitter <- readLines(
  "en_US.twitter.txt", encoding = "UTF-8", skipNul = TRUE, warn = FALSE)

Data Summary

Let us start by investigating the data we will be using. Basic information about the files and their character counts is provided below (CPL = characters per line).

Show/hide code
# Characters per line for each file, in the same order as the table rows below
char_count <- list(nchar(blogs), nchar(news), nchar(twitter))

eda_files_chars <- data.frame(
  "File Name" = c("en_US.blogs.txt", 
                  "en_US.news.txt",
                  "en_US.twitter.txt"),
  "File Size" = paste(round(
    file.info(c("en_US.blogs.txt",
                "en_US.news.txt",
                "en_US.twitter.txt"))$size / 2^20,
    digits = 1
  ), "MB"),
  "Line Count" = sapply(list(blogs,
                             news,
                             twitter), length),
  "Character Count" = sapply(char_count, sum),
  "Min CPL" = sapply(char_count, min),
  "Mean CPL" = round(sapply(char_count, mean), 1),
  "Max CPL" = sapply(char_count, max),
  check.names = FALSE
)

kable(eda_files_chars, format.args = list(big.mark = ","))
File Name File Size Line Count Character Count Min CPL Mean CPL Max CPL
en_US.blogs.txt 200.4 MB 899,288 206,824,505 1 230.0 40,833
en_US.news.txt 196.3 MB 1,010,206 203,214,543 1 201.2 11,384
en_US.twitter.txt 159.4 MB 2,360,148 162,096,241 2 68.7 140

As expected, the maximum line length in the Twitter data is 140 characters, which matches the platform's character limit at the time the data was collected. Now let us look at some statistics on word counts and words per line (WPL).

Show/hide code
words_per_line <- lapply(list(blogs, news, twitter), stri_count_words)

eda_files_words <- data.frame(
  "File Name" = c("en_US.blogs.txt",
                  "en_US.news.txt",
                  "en_US.twitter.txt"),
  "Word Count" = sapply(words_per_line, sum),
  "Min WPL" = sapply(words_per_line, min),
  "Mean WPL" = round(sapply(words_per_line, mean)),
  "Max WPL" = sapply(words_per_line, max),
  check.names = FALSE
)

kable(eda_files_words, format.args = list(big.mark = ","))
File Name Word Count Min WPL Mean WPL Max WPL
en_US.blogs.txt 37,546,806 0 42 6,726
en_US.news.txt 34,761,151 1 34 1,796
en_US.twitter.txt 30,096,690 1 13 47

Next, we visualize the distribution of words per line in each dataset.

Show/hide code
p1 <- ggplot(data.frame(blogs_wpl = words_per_line[[1]]), aes(x = blogs_wpl)) +
  geom_histogram(binwidth = 40, color = "black", fill = "lightblue") + 
  labs(title = "US Blogs", x = "Words per Line", y = "Frequency") +
  theme_bw() +
  theme(
    panel.background = element_rect(fill = "transparent", color = NA),
    plot.background = element_rect(fill = "transparent", color = NA),
    legend.background = element_rect(fill = "transparent", color = NA),
    legend.box.background = element_rect(fill = "transparent", color = NA)
  )

p2 <- ggplot(data.frame(news_wpl = words_per_line[[2]]), aes(x = news_wpl)) +
  geom_histogram(binwidth = 20, color = "black", fill = "lightblue") + 
  labs(title = "US News", x = "Words per Line", y = "Frequency") +
  theme_bw() +
  theme(
    panel.background = element_rect(fill = "transparent", color = NA),
    plot.background = element_rect(fill = "transparent", color = NA),
    legend.background = element_rect(fill = "transparent", color = NA),
    legend.box.background = element_rect(fill = "transparent", color = NA)
  )

p3 <- ggplot(data.frame(twitter_wpl = words_per_line[[3]]), aes(x = twitter_wpl)) +
  geom_histogram(binwidth = 2, color = "black", fill = "lightblue") + 
  labs(title = "US Twitter", x = "Words per Line", y = "Frequency") +
  theme_bw() +
  theme(
    panel.background = element_rect(fill = "transparent", color = NA),
    plot.background = element_rect(fill = "transparent", color = NA),
    legend.background = element_rect(fill = "transparent", color = NA),
    legend.box.background = element_rect(fill = "transparent", color = NA)
  )

p1 / p2 / p3

Preparing the Data

To prepare the data for modeling, we begin by drawing a random sample of 10,000 lines from each of the three text sources (blogs, news articles, and Twitter posts). This sampling step helps reduce computational cost while retaining diversity across different writing styles.

Next, we create a text corpus using the tm package’s VCorpus function, which structures the sampled data for further text processing. We then apply a series of preprocessing steps to clean the text and make it suitable for natural language processing:

  • Lowercasing all text to ensure consistency (e.g., “The” and “the” are treated the same).

  • Removing punctuation and numbers, which usually do not contribute meaningful information for word prediction.

  • Removing stopwords, such as “the”, “is”, and “and”, which are extremely common but add little value to the predictive model.

  • Stripping excess whitespace introduced by earlier transformations.

  • Removing profanity, using a predefined list of offensive terms obtained from Carnegie Mellon University’s resource.

These cleaning steps help reduce noise and standardize the text, preparing it for tokenization and n-gram modeling in the next stages.

Show/hide code
set.seed(314159)
sample_size <- 10000
data_sample <- c(sample(blogs, sample_size), 
                 sample(news, sample_size), 
                 sample(twitter, sample_size))

corpus <- VCorpus(VectorSource(data_sample))
bad_words <- readLines("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt")

corpus_treated <- corpus |> 
  tm_map(content_transformer(tolower)) |>
  tm_map(content_transformer(removePunctuation)) |>
  tm_map(removeNumbers) |>
  tm_map(removeWords, stopwords("en")) |>
  tm_map(removeWords, bad_words) |>
  tm_map(stripWhitespace)

To illustrate the effect of the cleaning process, the table below shows a few examples of text entries before and after preprocessing.

Show/hide code
n_show <- 5
kable(data.frame(
  Original = sapply(corpus$content[1:n_show], function(x) x$content),
  Cleaned = sapply(corpus_treated$content[1:n_show], function(x) x$content)
))
Original: I’m not sure if I’ll get this entire treatment since it’s just my back that needs work, but I hope it’s what I get. I kind of can’t imagine a ‘smaller’ version of this.
Cleaned: ’m sure ’ll get entire treatment since ’s just back needs work hope ’s get kind can’t imagine ‘smaller’ version

Original: Here’s what I finally did for the frosting…
Cleaned: heres finally frosting

Original: PS ~ no news on whether or not Anna is going to the tourney… looks like I’ll have to wait till Friday to see if I’m going to see Swimmer. Also, my friend correctly guessed his name on the first try. Lucky guess…
Cleaned: ps news whether anna going tourney… looks like ’ll wait till friday see ’m going see swimmer also friend correctly guessed name first try lucky guess…

Original: The upmarket grocery retailer has matched prices on 1,000 branded lines since September 2010 and is now expanding the offer to 7,000 products.
Cleaned: upmarket grocery retailer matched prices branded lines since september now expanding offer products

Original: Bible Doctrines I
Cleaned: doctrines

Exploratory Data Analysis

To better understand the structure and most common word patterns in the dataset, we tokenize the cleaned text into unigrams (single words), bigrams (two-word combinations), and trigrams (three-word combinations). This process helps reveal the most frequent terms and phrases, which will be valuable for building the predictive model later.

To ensure clarity and avoid visual clutter, we remove any missing values (NA) produced by the tokenization process and display only the top 20 most frequent entries in each category, which effectively discards rare combinations.

This analysis gives us insight into the common language patterns used across the different text sources, and will serve as a foundation for training our n-gram model for word prediction.

Show/hide code
# Convert the cleaned text corpus into a tidy data frame
text_df <- data.frame(text = sapply(corpus_treated, as.character), 
                      stringsAsFactors = FALSE)

# Unigram
unigrams <- text_df |>
  unnest_tokens(output = word, input = text) |>
  filter(!is.na(word)) |>
  count(word, sort = TRUE) |>
  top_n(20, n)

ggplot(unigrams, aes(x = reorder(word, n), y = n)) +
  geom_col(color = "black", fill = "lightblue") +
  coord_flip() +
  labs(title = "Top 20 Unigrams", x = NULL, y = "Frequency") +
  theme_bw()

Show/hide code
bigrams <- text_df |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  filter(!is.na(bigram)) |>
  count(bigram, sort = TRUE) |>
  top_n(20, n)

ggplot(bigrams, aes(x = reorder(bigram, n), y = n)) +
  geom_col(color = "black", fill = "lightblue") +
  coord_flip() +
  labs(title = "Top 20 Bigrams", x = NULL, y = "Frequency") +
  theme_bw()

Show/hide code
trigrams <- text_df |>
  unnest_tokens(trigram, text, token = "ngrams", n = 3) |>
  filter(!is.na(trigram)) |>
  count(trigram, sort = TRUE) |>
  top_n(20, n)

ggplot(trigrams, aes(x = reorder(trigram, n), y = n)) +
  geom_col(color = "black", fill = "lightblue") +
  coord_flip() +
  labs(title = "Top 20 Trigrams", x = NULL, y = "Frequency") +
  theme_bw()

Naive Predictor Prototype

As a preliminary step toward building the final word prediction model, a simple, rule-based predictor was implemented using n-gram frequency tables. This naive predictor uses the cleaned and tokenized corpus to estimate the most likely next word based on the user’s most recent one or two words.

The prediction logic follows a basic back-off strategy:

  • If the user input ends with two or more words, the model looks up the most frequent trigrams and returns the top three most common continuations.

  • If only one word is provided, it falls back to bigrams.

  • If no matching bigrams or trigrams are found, the model defaults to suggesting the most common unigrams overall.

This early prototype serves as a proof of concept for the final app and demonstrates that meaningful word predictions can be generated from the n-gram structure of the data, even using a simple frequency-based approach. While it lacks the sophistication of a smoothed probabilistic model, it provides a functional baseline to test the predictive pipeline.

Show/hide code
# Preprocessing n-grams
unigrams <- text_df |>
  unnest_tokens(word, text) |>
  count(word, sort = TRUE)

bigrams <- text_df |>
  unnest_tokens(bigram, text, token = "ngrams", n = 2) |>
  separate(bigram, into = c("w1", "w2"), sep = " ") |>
  count(w1, w2, sort = TRUE)

trigrams <- text_df |>
  unnest_tokens(trigram, text, token = "ngrams", n = 3) |>
  separate(trigram, into = c("w1", "w2", "w3"), sep = " ") |>
  count(w1, w2, w3, sort = TRUE)

predict_next_word <- function(input_text) {
  # Clean input
  input_text <- tolower(input_text)
  input_text <- str_replace_all(input_text, "[^a-z\\s]", "")
  input_words <- str_split(input_text, "\\s+")[[1]]
  input_words <- tail(input_words[input_words != ""], 2)
  
  if (length(input_words) == 2) {
    preds <- trigrams |>
      filter(w1 == input_words[1], w2 == input_words[2]) |>
      arrange(desc(n)) |>
      pull(w3)
  } else if (length(input_words) == 1) {
    preds <- bigrams |>
      filter(w1 == input_words[1]) |>
      arrange(desc(n)) |>
      pull(w2)
  } else {
    preds <- unigrams |>
      arrange(desc(n)) |>
      pull(word)
  }
  
  return(head(preds, 3))
}

To illustrate the behavior of the initial naive prediction model, we tested the function with a set of common input phrases. The table below shows each input along with the top three word predictions generated by the model.

Show/hide code
examples <- c(
  "I love",
  "New York",
  "happy",
  "thank you",
  "the weather is",
  "looking forward",
  ""
)
predictions <- lapply(examples, predict_next_word)

kable(data.frame(
  "Input" = examples,
  "Prediction 1" = sapply(predictions, function(x) x[1]),
  "Prediction 2" = sapply(predictions, function(x) x[2]),
  "Prediction 3" = sapply(predictions, function(x) x[3]),
  check.names = FALSE
))
Input Prediction 1 Prediction 2 Prediction 3
I love NA NA NA
New York city times giants
happy birthday hour easter
thank you NA NA NA
the weather is NA NA NA
looking forward seeing tweets visit
(empty input) said will one

Note that although the model handles an empty input gracefully, a key limitation of this naive predictor is its inability to handle unseen inputs, that is, phrases or word combinations that do not appear in the training data. This is compounded by the fact that stopwords were removed from the training corpus but not from the user input, so common phrases such as "thank you" or "the weather is" find no matching n-grams. Since the model relies entirely on matching sequences in the n-gram frequency tables, it cannot generate meaningful predictions for novel or rare contexts. This highlights the need for a more robust approach in the final model, incorporating smoothing techniques or fallback strategies that generalize beyond the observed data.
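
To illustrate the kind of fallback strategy we have in mind, the sketch below adapts the n-gram tables already computed above into a Stupid Backoff-style scorer. The function name predict_backoff, the back-off factor of 0.4, and the relative-frequency scoring are illustrative assumptions, not part of the current model.

# Illustrative Stupid Backoff-style sketch; predict_backoff and lambda = 0.4
# are assumptions for demonstration, using the unigrams/bigrams/trigrams above.
predict_backoff <- function(input_text, k = 3, lambda = 0.4) {
  # Clean the input the same way as predict_next_word()
  words <- str_split(str_replace_all(tolower(input_text), "[^a-z\\s]", ""), "\\s+")[[1]]
  words <- tail(words[words != ""], 2)

  # Unigram scores, discounted twice (lowest-priority fallback)
  scores <- unigrams |>
    mutate(score = lambda^2 * n / sum(n)) |>
    select(word, score)

  # Bigram continuations of the last word, discounted once
  if (length(words) >= 1) {
    bi <- bigrams |>
      filter(w1 == tail(words, 1)) |>
      mutate(score = lambda * n / sum(n)) |>
      select(word = w2, score)
    scores <- bind_rows(bi, scores)
  }

  # Trigram continuations of the last two words, undiscounted
  if (length(words) == 2) {
    tri <- trigrams |>
      filter(w1 == words[1], w2 == words[2]) |>
      mutate(score = n / sum(n)) |>
      select(word = w3, score)
    scores <- bind_rows(tri, scores)
  }

  # Keep the best score per candidate word and return the top k
  scores |>
    group_by(word) |>
    summarise(score = max(score), .groups = "drop") |>
    arrange(desc(score)) |>
    head(k) |>
    pull(word)
}

predict_backoff("New York")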

Next Steps

  • Refine the prediction algorithm using probabilistic methods, such as Stupid Backoff or Kneser-Ney smoothing, to improve accuracy and handling of unseen inputs.

  • Optimize performance by using efficient data structures (e.g., data.table) and pruning low-frequency n-grams; a minimal sketch of this idea follows this list.

  • Build and deploy the final Shiny app, which will allow users to enter text and receive three suggestions for the next word in real time.

  • Evaluate the model’s effectiveness and consider user feedback for additional improvements.
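
As a rough sketch of the optimization item above, the snippet below stores the trigram counts in a keyed data.table and prunes singleton trigrams; the pruning threshold and the example lookup are illustrative choices, not final design decisions.

library(data.table)

# Sketch only: prune low-frequency trigrams (threshold of 1 is illustrative)
# and key the table so next-word lookups are fast.
trigrams_dt <- as.data.table(trigrams)[n > 1]
setkey(trigrams_dt, w1, w2)

# Keyed lookup of the top three continuations for a two-word context
trigrams_dt[.("new", "york")][order(-n)][1:3, w3]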

These steps will bring the project closer to its final goal: a functional and responsive word prediction app similar to those used in mobile text input systems.