This milestone report describes the exploratory analysis and data preparation undertaken for the capstone project of the Johns Hopkins Data Science Specialization. The ultimate goal of the project is to develop a text prediction model that improves text input efficiency in English by guessing the next complete word based on the previous two words. This report briefly summarizes my approach to creating the prediction algorithm in a way that is understandable to a non-data scientist audience.

1. Load the data

The data used to build the predictive model consists of three files of English language text provided in the en_US directory of the capstone dataset. They contain text samples from blogs, news stories, and tweets, respectively: en_US.blogs.txt, en_US.news.txt, and en_US.twitter.txt.

After downloading the files, I use the wc command to view the number of lines, words, and characters in each file. The commands below assume the working directory has been set to the downloaded en_US directory.

wc_output <- system("wc en_US*.txt", intern=TRUE)
read.table(text=wc_output, col.names = c("Lines", "Words", "Characters", "Filename"))
##     Lines     Words Characters          Filename
## 1  899288  37334690  210160014   en_US.blogs.txt
## 2 1010242  34372720  205811889    en_US.news.txt
## 3 2360148  30374206  167105338 en_US.twitter.txt
## 4 4269678 102081616  583077241             total

The goal is to build a general text prediction model that works in many settings. However, I have decided to prioritize the more formal writing expected in blogs and news stories over the often abbreviated forms in Twitter/X posts, so the Twitter file will not be used. To reduce processing time, I create a data_subset that contains every tenth row of the blogs and news files in the text column, along with a file column that records the source file to make it easier to check and troubleshoot at later stages.
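
The subsetting step itself is not shown in this report. The following is a minimal sketch of one way it could be done, assuming rows 1, 11, 21, and so on are kept. It uses toy vectors in place of the real files, and `keep_every_kth` is a hypothetical helper name.

```r
# Hypothetical helper: keep rows 1, k + 1, 2k + 1, ... of a character vector
keep_every_kth <- function(lines, k = 10) {
  lines[(seq_along(lines) - 1) %% k == 0]
}

# Toy stand-ins; the real input would come from something like
# readLines("en_US.blogs.txt") and readLines("en_US.news.txt")
blogs <- paste("blog line", 1:25)
news  <- paste("news line", 1:12)

# One row per sampled line, with the source file recorded next to the text
data_subset <- rbind(
  data.frame(file = "blogs", text = keep_every_kth(blogs)),
  data.frame(file = "news",  text = keep_every_kth(news))
)

nrow(data_subset)  # 3 blog rows + 2 news rows = 5
```

A systematic sample like this is reproducible without a random seed, which makes it easy to trace a row back to its line in the original file.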

Initial text cleaning splits hyphenated words, removes non-sentence-ending punctuation, removes extra spaces, and expands contractions like “I’m” and “don’t” based on a named vector of contractions and their expanded forms (contraction_vector). I do not convert to lowercase because this will happen later as part of tokenization.
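
The contents of contraction_vector are not listed in this report. As an illustration only, a vector of this kind might look like the sketch below. The three entries are assumptions chosen for the example, not the actual list.

```r
library(stringr)

# Hypothetical entries: names are the patterns to find, values are the replacements
contraction_vector <- c(
  "I'm"   = "I am",
  "don't" = "do not",
  "can't" = "cannot"
)

# str_replace_all() applies each name/value pair in turn
expanded <- str_replace_all("I'm sure they don't mind", contraction_vector)
expanded  # "I am sure they do not mind"
```

Matching is case-sensitive, so an entry for "I'm" would not expand "i'm". This is one reason to expand contractions before the text is converted to lowercase.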

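library(dplyr)   # %>%, mutate(), relocate()
library(stringr) # str_replace_all(), str_squish()

# contraction_vector must already exist: a named character vector with
# contractions as names and their expanded forms as values
clean_text <- function(text) {
  text <- str_replace_all(text, "-", " ") # split hyphenated words
  text <- str_replace_all(text, "’", "'") # make all apostrophes non-curly for contraction matching
  text <- str_replace_all(text, "[^a-zA-Z0-9\\s\\.\\!\\?']", "") # remove most punctuation
  text <- str_squish(text)                # remove extra spaces
  text <- str_replace_all(text, contraction_vector) # expand contractions
  return(text)
}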

data_subset <- data_subset %>%
        mutate(id = row_number()) %>%
        relocate(id) %>%
        mutate(text = clean_text(text))

The result is a partially cleaned data frame containing a 10% sample of the blogs and news files (190,954 rows). For now, spelling errors and profanity are kept in the data: removing them at this early stage would change the word sequence and could produce trigrams that never occurred in the original text.

2. Exploratory analysis

The next step is to get to know the contents of data_subset$text. The code below counts words and bigrams, then lists those that occur most often.

library(tidytext)
library(ggplot2)
library(tidyr)

word_counts <- data_subset %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)

head(word_counts)
##   word      n
## 1  the 381962
## 2  and 198070
## 3   to 197452
## 4    a 178894
## 5   of 164822
## 6   in 128249

bigram_counts <- data_subset %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

head(bigram_counts)
##   bigram     n
## 1 of the 37165
## 2 in the 33018
## 3 to the 16839
## 4 on the 14853
## 5  it is 14487
## 6   i am 13498

Words and bigrams that occur most often are what many projects would consider ‘stop words’ to ignore or remove. I will leave them, however, because they support the project purpose: selecting the desired word from a short list is faster than typing, even for very short words.

This graph shows how word frequency drops off considerably after the first few words.

top_words <- word_counts %>% slice_max(n, n = 50)

ggplot(top_words, aes(x = reorder(word, -n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Word Frequencies", x = "Words", y = "Frequency") +
  theme_minimal()

The graph illustrates that a relatively small number of words account for a large percentage of the data_subset text. The code below adds a cumulative_coverage variable to word_counts to show how many unique words are required to cover 50% and 90% of the text.

total_words <- sum(word_counts$n)

word_counts <- word_counts %>%
  mutate(
    cumulative_coverage = cumsum(n) / total_words,
    rank = row_number()
  )

word_counts %>%
  filter(cumulative_coverage >= 0.5) %>%                 # words at or beyond 50% coverage
  group_by(reached_90 = cumulative_coverage >= 0.9) %>%  # split at the 90% mark
  slice_head(n = 1) %>%                                  # first word to reach each threshold
  ungroup() %>%
  select(word, cumulative_coverage, rank)
## # A tibble: 2 × 3
##   word         cumulative_coverage  rank
##   <chr>                      <dbl> <int>
## 1 1                          0.500   137
## 2 anticipation               0.900  8022

There are 149,146 unique “words” in the data_subset (nrow(word_counts)), but 90% of the text is covered by just 8,022 words (5.4% of the words). This means we can likely reduce the dataset later to improve model efficiency with little impact on predictive performance.
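
The coverage lookup can also be expressed as a small function. This is a sketch on toy counts rather than the real data, and `words_needed` is a hypothetical helper name.

```r
# Hypothetical helper: how many of the most frequent words are needed
# to reach a given share of all word occurrences?
words_needed <- function(counts, coverage) {
  counts <- sort(counts, decreasing = TRUE)
  cumulative <- cumsum(counts) / sum(counts)
  unname(which(cumulative >= coverage)[1])
}

# Toy counts that sum to 100, so shares are easy to read
toy_counts <- c(the = 50, and = 20, to = 15, a = 10, of = 5)

words_needed(toy_counts, 0.5)  # 1: "the" alone covers 50%
words_needed(toy_counts, 0.9)  # 4: 50 + 20 + 15 + 10 = 95 of 100

# Share of the vocabulary needed for 90% coverage in data_subset
round(100 * 8022 / 149146, 1)  # 5.4
```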

3. Create trigrams

Trigrams, or strings of three consecutive words, will be the building blocks for the predictive model: the first two words are the input and the third is the word to predict. It is important to split the text into sentences before creating trigrams so that the trigrams do not cross sentence boundaries, which could result in unlikely or grammatically incorrect trigrams.
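
A toy example may help show why the sentence split matters. The sketch below uses the tokenizers package on an invented two-sentence string. It compares trigrams taken from the whole string with trigrams taken one sentence at a time.

```r
library(tokenizers)

toy_text <- "I like tea. Dogs bark loudly."

# Trigrams from the whole string run straight across the full stop
crossing <- tokenize_ngrams(toy_text, n = 3)[[1]]

# Trigrams built one sentence at a time stay inside each sentence
sentences <- tokenize_sentences(toy_text)[[1]]
within <- unlist(tokenize_ngrams(sentences, n = 3))

# Trigrams that exist only because a sentence boundary was ignored
setdiff(crossing, within)
```

The difference should contain sequences such as "like tea dogs", which no writer produced as a phrase and which would only add noise to the model.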

library(tokenizers)

split_into_sentences <- function(text) {
  tokenize_sentences(text) %>% unlist()
}
        
sentence_data <- data_subset %>%
  rowwise() %>%
  mutate(sentences = list(split_into_sentences(text))) %>%
  unnest(cols = c(sentences)) %>%
  group_by(id) %>% # Group by 'id' to reset sentence numbering for each text
  mutate(sentence_id = row_number()) %>% # Add a unique sentence identifier per row
  ungroup() # Ungroup after assigning sentence IDs

# Generate trigrams from `sentence_data$sentences` using the tidytext package, then count them
trigrams <- sentence_data %>%
  unnest_tokens(trigram, sentences, token = "ngrams", n = 3) %>%
  filter(!is.na(trigram)) # drop the NA rows returned for sentences shorter than three words

trigram_counts <- trigrams %>%
        count(trigram, sort=TRUE)

There are more than four million trigrams in the dataset at this point, many of which appear only a few times in the sample text. One way to reduce the dataset size and improve efficiency is to remove trigrams that appear less often. Limiting to trigrams that appear more than four times reduces the list to about 100K unique trigrams, a more manageable number for model building.
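
The filtering step that produces trigram_counts_reduced is not shown in this report. A minimal sketch of it, using a toy table in place of the real counts (`min_count` is a hypothetical name for the threshold):

```r
library(dplyr)

# Toy stand-in for trigram_counts
trigram_counts <- tibble(
  trigram = c("one of the", "a lot of", "thanks for the", "as well as", "purple tea dogs"),
  n       = c(12L, 9L, 5L, 4L, 1L)
)

min_count <- 4  # keep trigrams that appear MORE than this many times

trigram_counts_reduced <- trigram_counts %>%
  filter(n > min_count)

nrow(trigram_counts_reduced)  # 3: the rows with n = 4 and n = 1 are dropped
```

Raising the threshold shrinks the table further at the cost of losing rarer but valid word sequences, so the value is a tuning choice rather than a fixed rule.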

Numbers, spelling mistakes, and profanity were left in the data so that trigrams reflect the word order of the original text. At this point we can flag and remove that unwanted content. The code below splits trigrams into individual words, then identifies numerical content, profanity and spelling errors. Trigrams with any of these are removed completely.

library(hunspell)

# remove numbers
cleaned_trigrams <- trigram_counts_reduced %>%
        mutate(numbers = str_detect(trigram, "[0-9]")) %>%
        filter(numbers == FALSE) %>%
        select(!numbers)

# remove spelling errors
cleaned_trigrams <- cleaned_trigrams %>%
        separate_wider_delim(trigram, " ", names = c("w1","w2","w3")) %>%
        mutate(spell_1 = !hunspell_check(w1)) %>%
        mutate(spell_2 = !hunspell_check(w2)) %>%
        mutate(spell_3 = !hunspell_check(w3)) %>%
        filter(spell_1 == FALSE, spell_2 == FALSE, spell_3 == FALSE) %>%
        select(w1, w2, w3, n)

# remove profanity
profanity_list <- scan("fb_bad_words_list.txt", what = "character", sep = ",")
cleaned_trigrams <- cleaned_trigrams %>%
        mutate(prof_1 = w1 %in% profanity_list) %>%
        mutate(prof_2 = w2 %in% profanity_list) %>%
        mutate(prof_3 = w3 %in% profanity_list) %>%
        filter(prof_1 == FALSE, prof_2 == FALSE, prof_3 == FALSE) %>%
        select(w1, w2, w3, n)

The result is a list of lowercase trigrams that respect sentence boundaries and exclude misspelled words, profanity, and numerical content. This is the source data for model building. Before splitting the trigrams into training (70%), validation (15%), and testing (15%) sets, the counts need to be expanded so that each row represents one occurrence of a trigram rather than a trigram and its count.
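
To illustrate what expanding the counts means, here is a small sketch with tidyr's uncount() on an invented two-row table:

```r
library(tidyr)

# Toy stand-in for cleaned_trigrams: one row per unique trigram plus its count
toy_trigrams <- data.frame(
  w1 = c("one", "a"),
  w2 = c("of", "lot"),
  w3 = c("the", "of"),
  n  = c(3, 1)
)

# uncount() repeats each row n times and drops the n column
toy_expanded <- uncount(toy_trigrams, n)

nrow(toy_expanded)   # 3 + 1 = 4
names(toy_expanded)  # "w1" "w2" "w3"
```

With one row per occurrence, a random split assigns individual occurrences to the training, validation, and testing sets, so frequent trigrams stay frequent in each set.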

library(caret)

expanded_trigrams <- cleaned_trigrams %>%
        uncount(n) 

set.seed(1017)
train_index <- createDataPartition(expanded_trigrams$w3, p = 0.7, list = FALSE)
train_data <- droplevels(expanded_trigrams[train_index, ])
temp_data <- expanded_trigrams[-train_index, ]

test_index <- createDataPartition(temp_data$w3, p = 0.5, list = FALSE)
test_data <- droplevels(temp_data[test_index, ])
val_data <- droplevels(temp_data[-test_index, ])

4. Model building

Model building and app creation are the next steps of the project. This report does not describe those processes in detail, but the overall plan is to test two approaches to building a model and select the option that represents the best trade-off between efficiency/size and accuracy.

It is likely that during model building I will need to refine the data preparation process, particularly if the models are too large or inefficient to incorporate into a Shiny app. In implementation, the selected approach will be enhanced with smoothing to make sure the model does not break when it encounters text that is not in the training data.
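
The smoothing method has not been chosen yet. As an illustration of the general idea only, the sketch below shows a simple backoff lookup, which is a related but simpler technique than probability smoothing: if the two-word context was never seen, fall back to the last word, and then to the most frequent word overall. All table contents and names here (`trigram_table`, `bigram_table`, `top_word`, `predict_next`) are assumptions made for the example.

```r
# Toy lookup tables; in practice these would be built from the training data
trigram_table <- data.frame(
  w1 = c("one", "one"), w2 = c("of", "of"), w3 = c("the", "my"), n = c(3, 1)
)
bigram_table <- data.frame(
  w2 = c("of", "in"), w3 = c("the", "a"), n = c(10, 8)
)
top_word <- "the"  # most frequent single word

# Hypothetical predictor: try trigrams, then bigrams, then the top word
predict_next <- function(w1, w2) {
  hits <- trigram_table[trigram_table$w1 == w1 & trigram_table$w2 == w2, ]
  if (nrow(hits) == 0) {
    hits <- bigram_table[bigram_table$w2 == w2, ]  # back off to the last word only
  }
  if (nrow(hits) == 0) {
    return(top_word)                               # nothing matched at all
  }
  hits$w3[which.max(hits$n)]
}

predict_next("one", "of")       # "the", from the trigram table
predict_next("stay", "in")      # "a", from the bigram table
predict_next("hello", "zebra")  # "the", the overall fallback
```

Whatever method is selected, the key property is the same: every input, including unseen word pairs, returns a usable prediction.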