In this project, we build a word prediction algorithm and an online app for the Coursera Data Science Capstone course. We use an n-gram model to estimate the probability of the last word of an n-gram given the previous words, and to assign probabilities to entire sequences. The project is implemented entirely in R, using its NLP infrastructure.
The project uses the SwiftKey corpus, which includes different types of text (blogs, news, twitter) in several languages (English, German, Russian, and Finnish). For our prediction app, we used only the English corpus. The dataset was collected from publicly available sources by a web crawler.
The intuition of an n-gram model is that instead of computing the probability of a word given its entire history, we approximate the history by just the last few words (Jurafsky and Martin 2021). The assumption that the probability of a word depends only on the previous word (or the previous few words) is called a Markov assumption.
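In symbols (adapting the notation of Jurafsky and Martin 2021), an N-gram model approximates the conditional probability of a word by

$$P(w_n \mid w_{1:n-1}) \approx P(w_n \mid w_{n-N+1:n-1}),$$

and the probability of an entire sequence is then approximated by multiplying these conditional probabilities together.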
We read in the data in a “tidy” format using the vroom library. At this stage, we want to keep the information about the text type (blogs, news, twitter), which we extract from the file name.
## read in blogs, twitter, and news data
library(vroom)
library(stringr)
library(dplyr)
list_of_files <- list.files(path = "./final/en_US/", recursive = TRUE, pattern = "\\.txt$", full.names = TRUE)
docs.df <- vroom(list_of_files, id = "FileName", delim = "\n", col_names = FALSE)
## extract the text type from the file name, then rename the columns and convert type to a factor
docs.df$FileName <- str_remove_all(docs.df$FileName, fixed("./final/en_US//en_US."))
docs.df$FileName <- str_remove_all(docs.df$FileName, fixed(".txt"))
docs.tidy <- docs.df %>% rename(text = X1, type = FileName) %>% mutate(type = factor(type))
This summary gives a general idea of our data set; it comprises more than 3 million lines, most of them coming from twitter (although the twitter lines tend to contain fewer words than blog or news lines).
docs.tidy
## Warning: One or more parsing issues, see `problems()` for details
## # A tibble: 3,322,964 × 2
## type text
## <fct> <chr>
## 1 blogs In the years thereafter, most of the Oil fields and platforms were nam…
## 2 blogs We love you Mr. Brown.
## 3 blogs Chad has been awesome with the kids and holding down the fort while I …
## 4 blogs so anyways, i am going to share some home decor inspiration that i hav…
## 5 blogs With graduation season right around the corner, Nancy has whipped up a…
## 6 blogs If you have an alternative argument, let's hear it! :)
## 7 blogs If I were a bear,
## 8 blogs Other friends have similar stories, of how they were treated brusquely…
## 9 blogs Although our beloved Cantab can’t claim the international recognition …
## 10 blogs Peter Schiff: Hard to tell. It will look pretty bad for most Americans…
## # … with 3,322,954 more rows
summary(docs.tidy)
## type text
## blogs : 681062 Length:3322964
## news : 628797 Class :character
## twitter:2013105 Mode :character
As our training set, we use only a portion (1/10) of the dataset, sampled proportionally from each type (blogs, news, twitter).
set.seed(1234)
training <- docs.tidy %>% group_by(type) %>% sample_frac(0.1) %>% ungroup()
training
## # A tibble: 332,296 × 2
## type text
## <fct> <chr>
## 1 blogs "Fast Triggering Speed Up to 12 frames per second."
## 2 blogs "I should have advertised this recipe on TV."
## 3 blogs "— My foot is about one inch long (about 2½ cm). I probably have the b…
## 4 blogs "You listening to Gotye again?"
## 5 blogs "One of the things we tend to shy away from when we \"eat healthier\" …
## 6 blogs "And like every great writer, having dedicated so much of her life to …
## 7 blogs "it reminded me of many of alexandria's beautiful and thoughtful photo…
## 8 blogs "Thank you very much for visiting this blog. It has been close to a ye…
## 9 blogs "While I enjoyed the mudbugs…I must say that I feel like I got tricked…
## 10 blogs "Mix everything together and put into a large greased baking tin. Spri…
## # … with 332,286 more rows
The training set was tokenized into bigrams, trigrams, and 4-grams using the tidytext library. The benefit of using n-grams compared to single words is that n-grams capture word order that would otherwise be lost (Hvitfeldt and Silge 2021).
library(tidytext)
bigram <- training %>% unnest_tokens(bigram, text, token = "ngrams", n=2)
trigram <- training %>% unnest_tokens(trigram, text, token = "ngrams", n=3)
quadrugram <- training %>% unnest_tokens(quadrugram, text, token = "ngrams", n=4)
To filter out profanity, stop words, and non-word characters, we first need to separate the n-grams into individual words and then apply the filters to each word, as described in (Silge and Robinson 2017). This is a computationally demanding procedure, so we save the output to reuse later.
## separate 2-grams, 3-grams, and 4-grams
library(tidyr)
bigram.sep <- bigram %>% separate(bigram, c("word1", "word2"), sep = " ")
save(bigram.sep, file = "./data/BigramSep.Rdata")
trigram.sep <- trigram %>% separate(trigram, c("word1", "word2", "word3"), sep = " ")
save(trigram.sep, file = "./data/TrigramSep.Rdata")
quadrugram.sep <- quadrugram %>% separate(quadrugram, c("word1", "word2", "word3", "word4"), sep = " ")
save(quadrugram.sep, file = "./data/QuadrugramSep.Rdata")
The code that loads the profanity word list is not shown.
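For completeness, here is a sketch of how such a list could be loaded; the file path below is a placeholder and not the project's hidden code, which may use a different source.
## illustrative only: load a profanity word list (one entry per line)
## "profanity_list.txt" is a placeholder path, not part of the original project
profanity <- readLines("./data/profanity_list.txt")
profanity <- tolower(trimws(profanity))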
## load stop-words, profanity and non-word characters
data("stop_words")
## collect tokens that contain anything other than letters (numbers, punctuation, etc.)
non.word <- grepl("[^a-zA-Z]", bigram.sep$word1)
non.word <- bigram.sep$word1[non.word]
## remove stop-words, profanity and non-word characters
bigram.sep.clean <- bigram.sep %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word) %>%
  filter(!word1 %in% profanity, !word2 %in% profanity) %>%
  filter(!word1 %in% non.word, !word2 %in% non.word)
trigram.sep.clean <- trigram.sep %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word, !word3 %in% stop_words$word) %>%
  filter(!word1 %in% profanity, !word2 %in% profanity, !word3 %in% profanity) %>%
  filter(!word1 %in% non.word, !word2 %in% non.word, !word3 %in% non.word)
quadrugram.sep.clean <- quadrugram.sep %>%
  filter(!word1 %in% stop_words$word, !word2 %in% stop_words$word, !word3 %in% stop_words$word, !word4 %in% stop_words$word) %>%
  filter(!word1 %in% profanity, !word2 %in% profanity, !word3 %in% profanity, !word4 %in% profanity) %>%
  filter(!word1 %in% non.word, !word2 %in% non.word, !word3 %in% non.word, !word4 %in% non.word)
A frequency table of n-grams (n from 2 to 4) is created, counting how many times each unique n-gram occurs.
bigram.count <- bigram.sep.clean %>% count(word1, word2, sort = TRUE)
trigram.count <- trigram.sep.clean %>% count(word1, word2, word3, sort = TRUE)
quadrugram.count <- quadrugram.sep.clean %>% count(word1, word2, word3, word4, sort = TRUE)
Depending on the available input, the last one to three words typed by the user are fed to the algorithm, which searches the frequency tables for n-grams one word longer than the input and returns the top 3 most frequent matches. An example follows.
predicted <- quadrugram.count %>% filter(word1 == "extra", word2 == "virgin", word3 == "olive") %>% slice_head(n =3) %>% select(word4)
predicted
## # A tibble: 3 × 1
## word4
## <chr>
## 1 oil
## 2 divided
## 3 oils
When no matches are found, the last word of the input is removed and the resulting shorter n-gram is fed to the algorithm again. This back-off process can be repeated until we are left with a single word to feed to the algorithm.
predicted <- quadrugram.count %>% filter(word1 == "chinese", word2 == "food", word3 == "restaurant") %>% slice_head(n =3) %>% select(word4)
predicted
## # A tibble: 0 × 1
## # … with 1 variable: word4 <chr>
if(nrow(predicted) == 0) {
  predicted <- trigram.count %>% filter(word1 == "chinese", word2 == "food") %>% slice_head(n = 3) %>% select(word3)
}
predicted
## # A tibble: 3 × 1
## word3
## <chr>
## 1 birthday
## 2 blurs
## 3 current
if(nrow(predicted) == 0) {
  predicted <- bigram.count %>% filter(word1 == "chinese") %>% slice_head(n = 3) %>% select(word2)
}
predicted
## # A tibble: 3 × 1
## word3
## <chr>
## 1 birthday
## 2 blurs
## 3 current
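For the app, this back-off logic can be wrapped into a single helper function. The sketch below is illustrative only: the function and argument names are ours, not part of the original scripts, and it assumes the count tables built above (bigram.count, trigram.count, quadrugram.count) are already in memory.
## illustrative sketch of the back-off lookup shown above (function name is hypothetical)
predict_next <- function(w1, w2, w3, k = 3) {
  ## try the 4-gram table first, using all three input words
  predicted <- quadrugram.count %>% filter(word1 == w1, word2 == w2, word3 == w3) %>% slice_head(n = k) %>% select(word4)
  ## back off: drop the last input word and try the trigram table
  if (nrow(predicted) == 0) {
    predicted <- trigram.count %>% filter(word1 == w1, word2 == w2) %>% slice_head(n = k) %>% select(word3)
  }
  ## back off again: keep only the first word and try the bigram table
  if (nrow(predicted) == 0) {
    predicted <- bigram.count %>% filter(word1 == w1) %>% slice_head(n = k) %>% select(word2)
  }
  predicted
}
predict_next("extra", "virgin", "olive")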