Data Science Capstone: Milestone Report

Author

Anthony DiFiore

Load and Glimpse Data

Our first step is to load the three text corpora and combine them into a single tibble for easier processing. We’ll also generate a summary table showing the number of rows and the average word count per source, to get a better sense of the breadth and depth of each file.

Code
# Packages used throughout this report
library(here)        # project-relative file paths
library(tidyverse)   # dplyr, tidyr, stringr, readr, ggplot2
library(tidytext)    # unnest_tokens()
library(formattable) # color-tiled summary table
library(janitor)     # get_dupes()
library(ggridges)    # ridge plot of sentence lengths

twitter <- readLines(here("data/raw", "en_US.twitter.txt"), encoding = "UTF-8", skipNul = TRUE)
blogs <- readLines(here("data/raw", "en_US.blogs.txt"), encoding = "UTF-8", skipNul = TRUE)
news <- readLines(here("data/raw", "en_US.news.txt"), encoding = "UTF-8", skipNul = TRUE)
blogs <- tibble(text = blogs) %>%
      mutate(doc_id = row_number()) %>%
      relocate(doc_id)
news <- tibble(text = news) %>%
      mutate(doc_id = row_number()) %>%
      relocate(doc_id)
twitter <- tibble(text = twitter) %>%
      mutate(doc_id = row_number()) %>%
      relocate(doc_id)
dat <- bind_rows(twitter, blogs, news)

summary_table <- tibble(
  Source = c("Twitter", "Blogs", "News"),
  Rows = c(nrow(twitter), nrow(blogs), nrow(news)),
  Avg_Word_Count = c(mean(str_count(twitter$text, "\\s+") + 1),
                     mean(str_count(blogs$text, "\\s+") + 1),
                     mean(str_count(news$text, "\\s+") + 1))
) %>%
  formattable(list(
    Rows = color_tile("lightblue", "lightgreen"),
    Avg_Word_Count = color_tile("lightpink", "lightyellow")
  ))

summary_table
Source      Rows  Avg_Word_Count
Twitter  2360148        12.86936
Blogs     899288        41.51514
News     1010242        34.02406
Code
head(dat$text, 10)
 [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."  
 [2] "When you meet someone special... you'll know. Your heart will beat more rapidly and you'll smile for no reason."
 [3] "they've decided its more fun if I don't."                                                                       
 [4] "So Tired D; Played Lazer Tag & Ran A LOT D; Ughh Going To Sleep Like In 5 Minutes ;)"                           
 [5] "Words from a complete stranger! Made my birthday even better :)"                                                
 [6] "First Cubs game ever! Wrigley field is gorgeous. This is perfect. Go Cubs Go!"                                  
 [7] "i no! i get another day off from skool due to the wonderful snow (: and THIS wakes me up...damn thing"          
 [8] "I'm coo... Jus at work hella tired r u ever in cali"                                                            
 [9] "The new sundrop commercial ...hehe love at first sight"                                                         
[10] "we need to reconnect THIS WEEK"                                                                                 

Tokenize Words

In order to better understand the text data, we’ll first tokenize it into individual tokens. I chose to split the text on whitespace to preserve punctuation, which can provide meaningful context later on when this text is used to train an autoregressive model for next-word prediction.

Code
toks <- dat %>%
      unnest_tokens(word, text, token = stringr::str_split, pattern = " ")
toksCT_original <- toks %>%
      count(word, sort = TRUE)
summary(toksCT_original)
     word                 n          
 Length:2219410     Min.   :      1  
 Class :character   1st Qu.:      1  
 Mode  :character   Median :      1  
                    Mean   :     46  
                    3rd Qu.:      2  
                    Max.   :4704703  

Remove Profanity and Domain-Specific Words

The token frequency summary above reveals that there are over 2.2 million unique tokens in the dataset. In order to have any hope of training a useful language model, we’ll need to reduce the vocabulary size significantly. First, we’ll trim the number of tokens by targeting profanity and domain-specific words, and by running some basic custom cleaning functions that perform tasks like removing non-ASCII characters and collapsing repeated punctuation.

Code
# Custom cleaning helpers defined outside this report (see the sketch after this chunk)
toks <- clean_word_column(toks)

domain_words <- filter_domain_words(toks)
toks <- toks %>%
      anti_join(domain_words, by = "word")

toks <- toks %>% mutate(word = replace(word, word == "rt", ""))  # blank out the Twitter retweet marker
toks <- collapse_punctuation(toks)
toks <- clean_whitespace(toks)

toks <- toks %>%
  mutate(word = case_when(
    str_detect(word, "^[[:punct:]]+$") ~ NA_character_,
    TRUE ~ word
  ))

profanity <- read_csv(here("profanity_words.csv")) %>%
      distinct()

toks <- toks %>%
      mutate(word = case_when(
            word %in% profanity$word ~ NA_character_,
            TRUE ~ word
      ))
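
The helper functions called above (clean_word_column(), filter_domain_words(), collapse_punctuation(), and clean_whitespace()) are not defined in this report. As a rough illustration of the kind of cleaning three of them perform, a minimal sketch under those assumptions might look like the code below; the actual implementations may differ, and the domain-word filter is omitted here.

Code
clean_word_column <- function(df) {
      # Assumed behavior: drop non-ASCII characters from each token
      df %>% mutate(word = iconv(word, from = "UTF-8", to = "ASCII", sub = ""))
}

collapse_punctuation <- function(df) {
      # Assumed behavior: collapse runs of repeated punctuation, e.g. "!!!" -> "!"
      df %>% mutate(word = str_replace_all(word, "([[:punct:]])\\1+", "\\1"))
}

clean_whitespace <- function(df) {
      # Assumed behavior: trim edges and squeeze internal runs of whitespace
      df %>% mutate(word = str_squish(word))
}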

Token Cleaning Strategies

Next, we’ll apply a set of fine-tuned functions that target specific cleaning tasks identified while inspecting the token frequency counts. These functions are designed to clean carefully and to preserve tokens that might otherwise be removed accidentally by more general cleaning methods.

Code
toks <- toks %>%
  mutate(word = remove_edge_quotes(word)) %>%
  mutate(word = replace_qm_excl_with_period(word)) %>%
  mutate(word = replace_letter_hyphen_with_space(word)) %>%
  mutate(word = remove_nonfinal_periods(word)) %>%
  mutate(word = remove_punct_only(word)) %>%
  mutate(word = replace_colon_semicolon_with_comma(word)) %>%
  mutate(word = remove_nonessential_one_char(word))

toksCT <- toks %>%
      count(word, sort = TRUE)

two_char_words <- toksCT %>%
      filter(nchar(word) == 2)
glimpse(two_char_words)

# Flag two-character tokens containing anything other than letters or digits for removal
tokens_2char_to_remove <- two_char_words %>%
  filter(str_detect(word, "[^A-Za-z0-9]")) %>%
  pull(word)

toks <- toks %>%
  mutate(
    word = if_else(word %in% tokens_2char_to_remove, "", word)
  )
toksCT <- toks %>%
      count(word, sort = TRUE)
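
Like the helpers above, the fine-tuned functions in this chunk (remove_edge_quotes(), replace_qm_excl_with_period(), remove_nonfinal_periods(), and the rest) are defined outside of this report. Two hedged, illustrative sketches of the sort of targeted rules they apply are shown below; the real definitions may differ.

Code
remove_edge_quotes <- function(word) {
      # Assumed behavior: strip quotation marks that wrap a token, e.g. "\"hello\"" -> "hello"
      str_remove_all(word, "^[\"']+|[\"']+$")
}

remove_nonfinal_periods <- function(word) {
      # Assumed behavior: drop periods that are not the final character, e.g. "u.s.a." -> "usa."
      str_replace_all(word, "\\.(?=.)", "")
}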

Reduce Vocabulary

The token cleaning strategies implemented above have already made a significant dent in the vocabulary size, but we can go a step further by filtering out all infrequent tokens. For this analysis, we’ll replace any token that appears fewer than 25 times in the dataset with an NA value. The NA values are important because, when we reconstruct the sentences later on, they let us easily filter out sentences that had too many words removed and therefore no longer make grammatical sense.

Code
count_empty <- toks %>%
      filter(word == "") %>%
      nrow()
print(paste("Number of empty strings in toks:", count_empty))
[1] "Number of empty strings in toks: 371080"
Code
toks <- toks %>%
      mutate(word = na_if(word, ""))
toksCT <- toks %>%
      count(word, sort = TRUE)
tokens_to_remove <- toksCT %>%
  filter(n < 25) %>%
  select(word)

toks <- toks %>%
  mutate(word = if_else(word %in% tokens_to_remove$word, "", word))
toks <- toks %>%
      mutate(word = na_if(word, ""))
toksCT <- toks %>%
      count(word, sort = TRUE)

p1 <- tibble(stage = c("Before Cleaning", "After Cleaning"),
            vocab_size = c(nrow(toksCT_original), nrow(toksCT))) %>%
  ggplot(aes(x = stage, y = vocab_size, fill = stage)) +
  geom_col() +
  geom_text(aes(label = scales::comma(vocab_size)), vjust = -0.5) +
  scale_fill_manual(values = c("#E76F51", "#2A9D8F")) +
  labs(title = "Vocabulary Reduction", y = "Unique Words") +
  theme_minimal()
p1

Reconstruct Sentences and Remove Duplicates

As shown above, the token cleaning steps have significantly reduced the vocabulary from over 2 million unique tokens to just under 100,000, a feasible size for SentencePiece, which I plan to use for subword tokenization later on. First, we’ll reconstruct the original sentences from the cleaned tokens and then remove any duplicates in the dataset.

Code
sen <- toks %>%
      group_by(doc_id) %>%
      summarize(text = paste(word, collapse = " ")) %>%
      ungroup()

duplicates <- sen %>%
      get_dupes(text)
head(duplicates)
Code
summary(duplicates)
     text             dupe_count          doc_id       
 Length:68657       Min.   :   2.00   Min.   : 935035  
 Class :character   1st Qu.:   2.00   1st Qu.:1346102  
 Mode  :character   Median :   5.00   Median :1685057  
                    Mean   :  67.08   Mean   :1684434  
                    3rd Qu.:  23.00   3rd Qu.:2021271  
                    Max.   :1184.00   Max.   :2360147  
Code
sen <- sen %>%
      anti_join(duplicates, by = "text")

sen <- sen %>%
  mutate(text = sub("(?<!\\.)$", ".", text, perl = TRUE))

Filter out Sentences with High NA Counts

After removing over 68,000 duplicate rows from the dataset, we’re ready to tackle the NA values in the reconstructed sentences. As mentioned earlier, these are important because they tell us which sentences had words removed during preprocessing. Sentences with too many NA values are likely to be grammatically incorrect and would hinder a language model’s chances of making useful next-word predictions, so we’ll filter out sentences that have more than 5 NA values.

Code
count_NA <- "NA"  # tokens were lowercased during tokenization, so an uppercase "NA" only comes from removed words
sen <- sen %>%
      mutate(na_count = str_count(text, count_NA))
summary(sen$na_count)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  0.000   0.000   1.000   1.691   2.000 770.000 
Code
no_NA <- sen %>%
      filter(na_count == 0) %>%
      select(-na_count)
low_NA <- sen %>%
      filter(na_count > 0 & na_count < 6) %>%
      select(-na_count)

sen <- bind_rows(no_NA, low_NA) %>%
      select(text)

Split Data into One Row per Sentence

Since I plan to train next-word prediction with a Seq2Seq model, it’s important that each row in the dataset corresponds to a single sentence. Therefore, we’ll split the text into individual sentences on periods and then analyze the distribution of sentence lengths in terms of word count. We’ll use a ridge plot to visualize the new distribution of sentence lengths.

Code
sen <- sen %>%
  separate_rows(text, sep = "\\.") %>%
  mutate(text = trimws(text)) %>%
  filter(text != "") %>%
  mutate(doc_id = row_number()) %>%
  relocate(doc_id)

sen$word_count <- str_count(sen$text, "\\s+") + 1

sen %>%
  ggplot(aes(x = word_count, y = "Density", fill = after_stat(x))) +
  geom_density_ridges_gradient(scale = 3, rel_min_height = 0.01) +
  scale_fill_viridis_c(option = "plasma") +
  xlim(0, 50) +
  labs(title = "Distribution of Sentence Lengths",
       x = "Words per Sentence", y = "") +
  theme_minimal()

Code
sentences <- sen %>%
      filter(word_count > 5 & word_count <= 50) %>%
      select(text) %>%
      mutate(doc_id = row_number()) %>%
      relocate(doc_id)

sentences <- sentences %>%
  mutate(text = sub("(?<!\\.)$", ".", text, perl = TRUE))
head(sentences$text, 15)
 [1] "you gonna be in dc anytime soon."                                                                                                                                                                           
 [2] "in the years thereafter, most of the oil fields and platforms were named after pagan gods."                                                                                                                 
 [3] "your heart will beat more rapidly and you'll smile for no reason."                                                                                                                                          
 [4] "it would die of old age."                                                                                                                                                                                   
 [5] "workers had been making cars there since the onset of mass automotive production in the 1920s."                                                                                                             
 [6] "looking for a new band to blog for the month."                                                                                                                                                              
 [7] "the one thing that was astounding though was the support from the marshals they were all phenomenal for being out in the rain for so long and remaining so cheery and supportive."                          
 [8] "owing to the nature of the course and very few closed roads meant that supporters who knew the area were able to skip around the course and a few people were seen about 5 or 6 times which was also super."
 [9] "let your hair down, it looks better."                                                                                                                                                                       
[10] "just seems they thought up that idea over sunday brunch and thought it was swell."                                                                                                                          
[11] "well, i'm off to clean up my craft space since it's a mess after working on my projects for this fun hop."                                                                                                  
[12] "have a wonderful week everyone and i hope to have a project or two to share very soon."                                                                                                                     
[13] "after the blue wall was painted i decided to brighten up the dining room table and chairs."                                                                                                                 
[14] "three of the five chairs were painted the same cream color as the interior trim, and a bold green replaced the mustard yellow on the table base."                                                           
[15] "thus began my frantic sprint down the corridor, pulling my reluctant suitcase as it bobbed off of anything in its path."                                                                                    

Next Steps

After careful cleaning, the dataset is now ready for subword tokenization with SentencePiece, which is designed to handle large vocabularies. My next steps will be to train a SentencePiece tokenizer on the cleaned sentences and then use the tokenized data to train a Seq2Seq (sequence-to-sequence) encoder-decoder model with attention for text generation, with the goal of predicting the next word in a sequence from the preceding words.
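
As a hedged sketch of that first step, the chunk below assumes the sentencepiece R package; the argument names, the data/processed output path, and the vocab_size = 16000 setting are illustrative assumptions rather than final choices.

Code
library(sentencepiece)

# Write the cleaned corpus to disk, one sentence per line (hypothetical path)
writeLines(sentences$text, here("data/processed", "sentences.txt"))

# Train a subword tokenizer on the cleaned sentences (arguments assumed; adjust as needed)
sp_model <- sentencepiece(
      here("data/processed", "sentences.txt"),
      type       = "unigram",
      vocab_size = 16000,
      model_dir  = here("models")
)

# Encode a sample sentence into subword ids for the Seq2Seq training pipeline
sentencepiece_encode(sp_model, "looking for a new band to blog for the month.", type = "ids")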