This milestone report describes the exploratory analysis and data preparation undertaken for the capstone project of the Johns Hopkins Data Science Specialization. The ultimate goal of the project is to develop a text prediction model that improves text input efficiency in English by guessing the next complete word based on the previous two words. This report briefly summarizes my approach to creating the prediction algorithm in a way that is understandable to a non-data scientist audience.
The data used to build the predictive model consists of three files
of English language text provided in the en_US directory of
the capstone
dataset. They contain text samples from blogs, news stories, and
tweets, respectively:
en_US.blogs.txt
en_US.news.txt
en_US.twitter.txt
After downloading the files, I use the wc command to
view the number of lines, words, and characters in each file. These
commands assume the working directory has been set to the downloaded
en_US directory.
wc_output <- system("wc en_US*.txt", intern=TRUE)
read.table(text=wc_output, col.names = c("Lines", "Words", "Characters", "Filename"))
##     Lines     Words Characters          Filename
## 1  899288  37334690  210160014   en_US.blogs.txt
## 2 1010242  34372720  205811889    en_US.news.txt
## 3 2360148  30374206  167105338 en_US.twitter.txt
## 4 4269678 102081616  583077241             total
The goal is to build a general text prediction model that works in
many settings. However, I have decided to prioritize the more formal
writing expected in blogs and news stories over the often
abbreviated forms in Twitter/X posts, so the Twitter file will not be
used. To keep processing time manageable, I create data_subset, which
holds every tenth row of the blogs and news files in a
text column, along with a file column to make
it easier to check and troubleshoot at later stages.
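The code that builds data_subset is not shown in this report. The sketch below is one way such a sample could be drawn, assuming the files are read with readLines(); the helper sample_lines, the toy input vectors, and the values stored in the file column are hypothetical.

```r
library(dplyr)

# Hypothetical helper: keep every `step`-th line (starting with the first)
# and record which file the line came from. Assumes `lines` is not empty.
sample_lines <- function(lines, file_name, step = 10) {
  keep <- seq(1, length(lines), by = step)
  tibble(file = file_name, text = lines[keep])
}

# Toy stand-ins for readLines("en_US.blogs.txt") and readLines("en_US.news.txt")
blog_lines <- paste("blog line", 1:25)
news_lines <- paste("news line", 1:12)

toy_subset <- bind_rows(
  sample_lines(blog_lines, "blogs"),
  sample_lines(news_lines, "news")
)
toy_subset
```

Applied to the line counts reported by wc above, this kind of indexing would keep 89,929 blog rows and 101,025 news rows, or 190,954 rows in total.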
Initial text cleaning splits hyphenated words, removes
non-sentence-ending punctuation, removes extra spaces, and expands
contractions like “I’m” and “don’t” based on a named vector of
contractions and their expanded forms (contraction_vector).
I do not convert to lowercase because this will happen later as part of
tokenization.
library(dplyr)
library(stringr)
clean_text <- function(text) {
  text <- str_replace_all(text, "-", " ") # split hyphenated words
  text <- str_replace_all(text, "’", "'") # make all apostrophes non-curly for contraction matching
  text <- str_replace_all(text, "[^a-zA-Z0-9\\s\\.\\!\\?']", "") # remove most punctuation
  text <- str_squish(text) # remove extra spaces
  text <- str_replace_all(text, contraction_vector) # expand contractions
  return(text)
}
data_subset <- data_subset %>%
  mutate(id = row_number()) %>% # add a row identifier
  relocate(id) %>% # move id to the first column
  mutate(text = clean_text(text))
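The contents of contraction_vector are not listed in this report. As a minimal sketch of how a named vector drives the expansion step, the example below uses a hypothetical three-entry vector, toy_contractions, in which each name is the pattern to find and each value is its replacement.

```r
library(stringr)

# Hypothetical, much-shortened stand-in for contraction_vector:
# names are the patterns to match, values are the expanded forms
toy_contractions <- c(
  "I'm"   = "I am",
  "don't" = "do not",
  "can't" = "cannot"
)

expanded <- str_replace_all("I'm sure they don't mind", toy_contractions)
expanded
```

Matching is case-sensitive by default, and the text has not been lowercased at this stage, so a vector like this would also need capitalized entries (or case-insensitive patterns) to catch a contraction such as "Don't" at the start of a sentence.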
The result is a partially cleaned data frame holding a 10% sample of the blogs and news files (190,954 rows). For now, spelling errors and profanity are kept in the data: removing them at this early stage could change the word sequence and produce unlikely trigrams later.
The next step is to get to know the contents of
data_subset$text. The code below counts words and bigrams,
then lists those that occur most often.
library(tidytext)
library(ggplot2)
library(tidyr)
word_counts <- data_subset %>%
  unnest_tokens(word, text) %>%
  count(word, sort = TRUE)
head(word_counts)
##   word      n
## 1  the 381962
## 2  and 198070
## 3   to 197452
## 4    a 178894
## 5   of 164822
## 6   in 128249
bigram_counts <- data_subset %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)
head(bigram_counts)
##   bigram     n
## 1 of the 37165
## 2 in the 33018
## 3 to the 16839
## 4 on the 14853
## 5  it is 14487
## 6   i am 13498
Words and bigrams that occur most often are what many projects would consider ‘stop words’ to ignore or remove. I will leave them, however, because they support the project purpose: selecting the desired word from a short list is faster than typing, even for very short words.
This graph shows how word frequency drops off considerably after the first few words.
top_words <- word_counts %>% slice_max(n, n = 50)
ggplot(top_words, aes(x = reorder(word, -n), y = n)) +
  geom_bar(stat = "identity", fill = "steelblue") +
  coord_flip() +
  labs(title = "Word Frequencies", x = "Words", y = "Frequency") +
  theme_minimal()
The graph illustrates that a relatively small number of words account
for a large percentage of the
data_subset text. The code
below adds a cumulative_coverage variable to
word_counts to show how many unique words are required to
cover 50% and 90% of the text.
total_words <- sum(word_counts$n)
word_counts <- word_counts %>%
  mutate(
    cumulative_coverage = cumsum(n) / total_words,
    rank = row_number()
  )
# first word at which coverage reaches 50%, and first at which it reaches 90%
word_counts %>%
  filter(cumulative_coverage >= 0.5) %>%
  group_by(reaches_90 = cumulative_coverage >= 0.9) %>%
  slice_head(n = 1) %>%
  ungroup() %>%
  select(word, cumulative_coverage, rank)
## # A tibble: 2 × 3
##   word         cumulative_coverage  rank
##   <chr>                      <dbl> <int>
## 1 1                          0.500   137
## 2 anticipation               0.900  8022
There are 149,146 unique “words” in the data_subset
(nrow(word_counts)), but 90% of the text is covered by just
8,022 words (5.4% of the words). This means we can likely reduce the
dataset later to improve model efficiency with little impact on
predictive performance.
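The coverage calculation can be packaged as a small function for checking other thresholds. This is a sketch only: words_needed and toy_counts are hypothetical names, and the function assumes a data frame with one row per word and a count column n, as in word_counts.

```r
library(dplyr)

# Hypothetical helper: the smallest number of top-ranked words whose
# combined count reaches `target` (a share between 0 and 1) of all words
words_needed <- function(counts, target) {
  counts %>%
    arrange(desc(n)) %>%
    mutate(cumulative_coverage = cumsum(n) / sum(n)) %>%
    summarise(rank = which(cumulative_coverage >= target)[1]) %>%
    pull(rank)
}

# Toy counts that sum to 100, so coverage is easy to read off
toy_counts <- tibble(
  word = c("the", "and", "to", "cat", "sat"),
  n    = c(50, 20, 15, 10, 5)
)

words_needed(toy_counts, 0.5)  # 1: "the" alone covers 50%
words_needed(toy_counts, 0.9)  # 4: the top four words cover 95%

# Share of unique words needed for 90% coverage in the report's figures
round(100 * 8022 / 149146, 1)  # 5.4
```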
Trigrams, or strings of three consecutive words, will be the building block for the predictive model. It is important to split the text into sentences before creating trigrams so that the trigrams do not cross sentence boundaries, which could result in unlikely or grammatically incorrect trigrams.
library(tokenizers)
split_into_sentences <- function(text) {
  tokenize_sentences(text) %>% unlist()
}
sentence_data <- data_subset %>%
  rowwise() %>%
  mutate(sentences = list(split_into_sentences(text))) %>%
  unnest(cols = c(sentences)) %>%
  group_by(id) %>% # group by 'id' to restart sentence numbering for each text
  mutate(sentence_id = row_number()) %>% # number the sentences within each text
  ungroup() # ungroup after assigning sentence IDs
# Generate trigrams from `sentence_data$sentences` using the tidytext package, then count them
trigrams <- sentence_data %>%
  unnest_tokens(trigram, sentences, token = "ngrams", n = 3)
trigram_counts <- trigrams %>%
  count(trigram, sort = TRUE)
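To illustrate why the sentence split matters, the toy example below builds trigrams from a two-sentence text with and without splitting first. The objects toy_text, crossing, and split_first are hypothetical and exist only for this illustration.

```r
library(dplyr)
library(tidyr)
library(tidytext)
library(tokenizers)

toy_text <- tibble(id = 1, text = "I like tea. You like coffee.")

# Trigrams built from the whole text can span the sentence boundary
crossing <- toy_text %>%
  unnest_tokens(trigram, text, token = "ngrams", n = 3)

# Splitting into sentences first keeps each trigram inside one sentence
split_first <- toy_text %>%
  mutate(sentences = tokenize_sentences(text)) %>%
  unnest(cols = c(sentences)) %>%
  unnest_tokens(trigram, sentences, token = "ngrams", n = 3)

crossing$trigram     # includes "like tea you" and "tea you like"
split_first$trigram  # only "i like tea" and "you like coffee"
```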
There are more than four million trigrams in the dataset at this point, many of which appear only a few times in the sample text. One way to reduce the dataset size and improve efficiency is to remove trigrams that appear less often. Limiting to trigrams that appear more than four times reduces the list to about 100K unique trigrams, a more manageable number for model building.
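The filtering step itself is not shown in this report, although the later code refers to an object named trigram_counts_reduced. A minimal sketch of such a frequency cut-off follows, using hypothetical toy counts and a hypothetical min_count setting.

```r
library(dplyr)

# Toy stand-in for trigram_counts
toy_trigram_counts <- tibble(
  trigram = c("one of the", "a lot of", "thanks for the", "purple cat sat"),
  n       = c(12, 9, 5, 1)
)

min_count <- 5  # "more than four times" is the same as "at least five"

toy_reduced <- toy_trigram_counts %>%
  filter(n >= min_count)
toy_reduced
```

Raising or lowering min_count trades model size against coverage of less common phrases, so it is a natural setting to revisit if the model turns out too large.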
Numbers, spelling mistakes, and profanity were left in the data so that trigrams reflect the word order of the original text. At this point we can flag and remove that unwanted content. The code below splits trigrams into individual words, then identifies numerical content, profanity and spelling errors. Trigrams with any of these are removed completely.
library(hunspell)
# remove numbers
cleaned_trigrams <- trigram_counts_reduced %>%
  mutate(numbers = str_detect(trigram, "[0-9]")) %>%
  filter(numbers == FALSE) %>%
  select(!numbers)
# remove spelling errors
cleaned_trigrams <- cleaned_trigrams %>%
  separate_wider_delim(trigram, " ", names = c("w1", "w2", "w3")) %>%
  mutate(spell_1 = !hunspell_check(w1)) %>%
  mutate(spell_2 = !hunspell_check(w2)) %>%
  mutate(spell_3 = !hunspell_check(w3)) %>%
  filter(spell_1 == FALSE, spell_2 == FALSE, spell_3 == FALSE) %>%
  select(w1, w2, w3, n)
# remove profanity
profanity_list <- scan("fb_bad_words_list.txt", what = "character", sep = ",")
cleaned_trigrams <- cleaned_trigrams %>%
  mutate(prof_1 = w1 %in% profanity_list) %>%
  mutate(prof_2 = w2 %in% profanity_list) %>%
  mutate(prof_3 = w3 %in% profanity_list) %>%
  filter(prof_1 == FALSE, prof_2 == FALSE, prof_3 == FALSE) %>%
  select(w1, w2, w3, n)
The result is a list of lowercase trigrams that respect sentence structure and exclude misspelled words, profanity, and numerical content. This is the source data for model building. Before splitting the trigrams into training (70%), validation (15%), and testing (15%) sets, the counts need to be expanded so that each row represents one occurrence of a trigram rather than one unique trigram.
library(caret)
expanded_trigrams <- cleaned_trigrams %>%
  uncount(n)
set.seed(1017)
train_index <- createDataPartition(expanded_trigrams$w3, p = 0.7, list = FALSE)
train_data <- droplevels(expanded_trigrams[train_index, ])
temp_data <- expanded_trigrams[-train_index, ]
test_index <- createDataPartition(temp_data$w3, p = 0.5, list = FALSE)
test_data <- droplevels(temp_data[test_index, ])
val_data <- droplevels(temp_data[-test_index, ])
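As a small illustration of what the expansion step does, the sketch below applies uncount() to a hypothetical two-row table, toy_trigrams. Each row is repeated n times and the n column is dropped.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for cleaned_trigrams
toy_trigrams <- tibble(
  w1 = c("one", "a"),
  w2 = c("of", "lot"),
  w3 = c("the", "of"),
  n  = c(3, 2)
)

toy_expanded <- toy_trigrams %>%
  uncount(n)  # repeat each row n times, then drop the n column

nrow(toy_expanded)   # 5
names(toy_expanded)  # "w1" "w2" "w3"
```

After expansion every row is a single observed occurrence, so a random split draws common trigrams more often than rare ones, in proportion to how frequently they appeared in the sample.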
Model building and app creation are the next steps of the project. This report does not describe those processes in detail, but the overall plan is to test two approaches to building a model and select the option that represents the best trade-off between efficiency/size and accuracy.
One of these approaches uses the caret package to build a machine
learning model (e.g. gradient boosting) that predicts the third word
based on the previous two words.
It is likely that during model building I will need to refine the data preparation process, particularly if the models are too large or inefficient to incorporate into a Shiny app. In implementation, the selected approach will be enhanced with smoothing to make sure the model does not break when it encounters text that is not in the training data.
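The report does not say which smoothing method will be used. Purely to illustrate the requirement that prediction should not fail on unseen input, the sketch below looks up the most frequent third word for a two-word context and returns a default word when the context is missing. This is a simple fallback, not a smoothing technique, and predict_next, toy_trigrams, and the default word are all hypothetical.

```r
library(dplyr)

# Toy stand-in for a table of trigram counts
toy_trigrams <- tibble(
  w1 = c("one", "one", "a", "a"),
  w2 = c("of", "of", "lot", "lot"),
  w3 = c("the", "my", "of", "more"),
  n  = c(10, 3, 8, 2)
)

# Hypothetical helper: most frequent third word for a two-word context,
# falling back to a fixed default when the context was never seen
predict_next <- function(prev1, prev2, trigram_table, default = "the") {
  matches <- trigram_table %>%
    filter(w1 == prev1, w2 == prev2) %>%
    arrange(desc(n))
  if (nrow(matches) == 0) {
    return(default)
  }
  matches$w3[1]
}

predict_next("a", "lot", toy_trigrams)       # "of"
predict_next("purple", "cat", toy_trigrams)  # "the" (fallback default)
```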