In this project, we build a word prediction algorithm and an online app for the Coursera Data Science Capstone course. We use an n-gram model to estimate the probability of the last word of an n-gram given the previous words, and to assign probabilities to entire sequences. The project is implemented entirely in R, using its NLP infrastructure.
The project uses the SwiftKey corpus, which includes different types of text (blogs, news, twitter) in several languages (English, German, Russian, and Finnish). For our prediction app, we used only the English corpus. The dataset was collected from publicly available sources by a web crawler.
The intuition of an n-gram model is that instead of computing the probability of a word given its entire history, we approximate the history by just the last few words (Jurafsky and Martin 2021). The assumption that the probability of a word depends only on the previous word (or the previous few words) is called a Markov assumption.
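In symbols (adapting the notation of Jurafsky and Martin 2021), an N-gram model approximates the conditional probability of a word by

$$P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1}),$$

and the probability of an entire sequence is then approximated by multiplying these conditional probabilities together.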
We read in the data in a “tidy” format using the vroom library. At this stage, we want to keep the information about the text type (blogs, news, twitter), which we extract from the file name.
## read in blogs, twitter, and news data
library(vroom)
library(stringr)
library(dplyr)
list_of_files <- list.files(path = "./final/en_US/", recursive = TRUE, pattern = "\\.txt$", full.names = TRUE)
docs.df <- vroom(list_of_files, id = "FileName", delim = "\n", col_names = FALSE)
## extract the text type from the file name, then rename the columns and convert type to a factor
docs.df$FileName <- str_remove_all(docs.df$FileName, fixed("./final/en_US//en_US."))
docs.df$FileName <- str_remove_all(docs.df$FileName, fixed(".txt"))
docs.tidy <- docs.df %>% rename(text = X1, type = FileName) %>% mutate(type = factor(type))
This summary gives a general idea of our data set; it comprises more than 3 million lines, most of them coming from twitter (although the twitter lines tend to contain fewer words than blog or news lines).
docs.tidy
## Warning: One or more parsing issues, see `problems()` for details
## # A tibble: 3,322,964 × 2
## type text
## <fct> <chr>
## 1 blogs In the years thereafter, most of the Oil fields and platforms were nam…
## 2 blogs We love you Mr. Brown.
## 3 blogs Chad has been awesome with the kids and holding down the fort while I …
## 4 blogs so anyways, i am going to share some home decor inspiration that i hav…
## 5 blogs With graduation season right around the corner, Nancy has whipped up a…
## 6 blogs If you have an alternative argument, let's hear it! :)
## 7 blogs If I were a bear,
## 8 blogs Other friends have similar stories, of how they were treated brusquely…
## 9 blogs Although our beloved Cantab can’t claim the international recognition …
## 10 blogs Peter Schiff: Hard to tell. It will look pretty bad for most Americans…
## # … with 3,322,954 more rows
summary(docs.tidy)
## type text
## blogs : 681062 Length:3322964
## news : 628797 Class :character
## twitter:2013105 Mode :character
As our training set, we use only a portion (1/10) of the dataset, sampled proportionally from each type (blogs, news, twitter).
set.seed(1234)
training <- docs.tidy %>% group_by(type) %>% sample_frac(0.1) %>% ungroup()
training
## # A tibble: 332,296 × 2
## type text
## <fct> <chr>
## 1 blogs "Fast Triggering Speed Up to 12 frames per second."
## 2 blogs "I should have advertised this recipe on TV."
## 3 blogs "— My foot is about one inch long (about 2½ cm). I probably have the b…
## 4 blogs "You listening to Gotye again?"
## 5 blogs "One of the things we tend to shy away from when we \"eat healthier\" …
## 6 blogs "And like every great writer, having dedicated so much of her life to …
## 7 blogs "it reminded me of many of alexandria's beautiful and thoughtful photo…
## 8 blogs "Thank you very much for visiting this blog. It has been close to a ye…
## 9 blogs "While I enjoyed the mudbugs…I must say that I feel like I got tricked…
## 10 blogs "Mix everything together and put into a large greased baking tin. Spri…
## # … with 332,286 more rows
The training set was tokenized into bigrams, trigrams, and 4-grams using the tidytext library. The benefit of using n-grams compared to single words is that n-grams capture word order that would otherwise be lost (Hvitfeldt and Silge 2021).
library(tidytext)
bigram <- training %>% unnest_tokens(bigram, text, token = "ngrams", n=2)
trigram <- training %>% unnest_tokens(trigram, text, token = "ngrams", n=3)
quadrugram <- training %>% unnest_tokens(quadrugram, text, token = "ngrams", n=4)
To filter out profanity, stop words, and non-word characters, we first need to separate the n-grams into individual words and then apply the filters to each word, as described in (Silge and Robinson 2017). This is a computationally demanding procedure, so we save the output to reuse later.
## separate 2-grams, 3-grams, and 4-grams
library(tidyr)
bigram.sep <- bigram %>% separate(bigram, c("word1", "word2"), sep = " ")
save(bigram.sep, file = "./data/BigramSep.Rdata")
trigram.sep <- trigram %>% separate(trigram, c("word1", "word2", "word3"), sep = " ")
save(trigram.sep, file = "./data/TrigramSep.Rdata")
quadrugram.sep <- quadrugram %>% separate(quadrugram, c("word1", "word2", "word3", "word4"), sep = " ")
save(quadrugram.sep, file = "./data/QuadrugramSep.Rdata")
The code that loads the profanity word list is not shown.
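For completeness, here is a sketch of how such a list could be loaded; the file path below is a placeholder and not the project's hidden code, which may use a different source.
## illustrative only: load a profanity word list (one entry per line)
## "profanity_list.txt" is a placeholder path, not part of the original project
profanity <- readLines("./data/profanity_list.txt")
profanity <- tolower(trimws(profanity))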
## load stop-words, profanity and non-word characters
data("stop_words")
## collect tokens that contain anything other than letters (numbers, punctuation, etc.)
non.word <- grepl("[^a-zA-Z]", bigram.sep$word1)
non.word <- bigram.sep$word1[non.word]
## remove stop-words, profanity and non-word characters
bigram.sep.clean <- bigram.sep %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  filter(!word1 %in% profanity, !word2 %in% profanity) %>%
  filter(!word1 %in% non.word, !word2 %in% non.word)
trigram.sep.clean <- trigram.sep %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word, !word3 %in% stop_words$word) %>%
  filter(!word1 %in% profanity, !word2 %in% profanity, !word3 %in% profanity) %>%
  filter(!word1 %in% non.word, !word2 %in% non.word, !word3 %in% non.word)
quadrugram.sep.clean <- quadrugram.sep %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word, !word3 %in% stop_words$word, !word4 %in% stop_words$word) %>%
  filter(!word1 %in% profanity, !word2 %in% profanity, !word3 %in% profanity, !word4 %in% profanity) %>%
  filter(!word1 %in% non.word, !word2 %in% non.word, !word3 %in% non.word, !word4 %in% non.word)
A frequency table of n-grams (n from 2 to 4) is created, counting how many times each unique n-gram occurs.
bigram.count <- bigram.sep.clean %>% count(word1, word2, sort = TRUE)
trigram.count <- trigram.sep.clean %>% count(word1, word2, word3, sort = TRUE)
quadrugram.count <- quadrugram.sep.clean %>% count(word1, word2, word3, word4, sort = TRUE)
Depending on the available input, the last one to three words typed by the user are fed to the algorithm, which searches the frequency tables for n-grams one word longer than the input and returns the top 3 most frequent matches. An example follows.
predicted <- quadrugram.count %>% filter(word1 == "extra", word2 == "virgin", word3 == "olive") %>% slice_head(n =3) %>% select(word4)
predicted
## # A tibble: 3 × 1
## word4
## <chr>
## 1 oil
## 2 divided
## 3 oils
When no matches are found, the last word of the input is removed and the resulting shorter n-gram is fed to the algorithm again. This back-off process can be repeated until we are left with a single word to feed to the algorithm.
predicted <- quadrugram.count %>% filter(word1 == "chinese", word2 == "food", word3 == "restaurant") %>% slice_head(n =3) %>% select(word4)
predicted
## # A tibble: 0 × 1
## # … with 1 variable: word4 <chr>
if(nrow(predicted) == 0) {
  predicted <- trigram.count %>% filter(word1 == "chinese", word2 == "food") %>% slice_head(n = 3) %>% select(word3)
}
predicted
## # A tibble: 3 × 1
## word3
## <chr>
## 1 birthday
## 2 blurs
## 3 current
if(nrow(predicted) == 0) {
  predicted <- bigram.count %>% filter(word1 == "chinese") %>% slice_head(n = 3) %>% select(word2)
}
predicted
## # A tibble: 3 × 1
## word3
## <chr>
## 1 birthday
## 2 blurs
## 3 current
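For the app, this back-off logic can be wrapped into a single helper function. The sketch below is illustrative only: the function and argument names are ours, not part of the original scripts, and it assumes the count tables built above (bigram.count, trigram.count, quadrugram.count) are already in memory.
## illustrative sketch of the back-off lookup shown above (function name is hypothetical)
predict_next <- function(w1, w2, w3, k = 3) {
  ## try the 4-gram table first, using all three input words
  predicted <- quadrugram.count %>% filter(word1 == w1, word2 == w2, word3 == w3) %>% slice_head(n = k) %>% select(word4)
  ## back off: drop the last input word and try the trigram table
  if (nrow(predicted) == 0) {
    predicted <- trigram.count %>% filter(word1 == w1, word2 == w2) %>% slice_head(n = k) %>% select(word3)
  }
  ## back off again: keep only the first word and try the bigram table
  if (nrow(predicted) == 0) {
    predicted <- bigram.count %>% filter(word1 == w1) %>% slice_head(n = k) %>% select(word2)
  }
  predicted
}
predict_next("extra", "virgin", "olive")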