Motivation

In this project we are creating an algorithm that predicts the third word of a trigram (a three-word series), given just the first two words. This could be used as a typing assistant by predicting what the human operator intends to type and suggesting it ahead of time.

Summary

In the below code I go over the initial import and processing of the raw data. My initial model will be a Katz Back-off model, which requires three lists built from the raw data: unigram (single-word) counts, bigram counts, and trigram counts.

By the end of this document I build the necessary lists and lay out a path forward to the initial model.

Data Import and Processing

Import

We are building this algorithm from data scraped from the web from three general sources: blogs, news, and Twitter feeds. We don’t need all of this data to build a proof-of-principle algorithm, so in the below code I reduce the data to 10% of its original size (one example is shown below; the code for all three extractions is in the raw Rmd doc).

# This code was used to split the data into a training set 10% the original size
set.seed(45882)
con <- file("en_US.blogs.txt", "r")
out <- file("blogs_train.txt", "w")
buffer <- character()
counter <- 0
# randomly select lines of input
for(i in 1:899288){ # Total number of lines in the file (from `wc -l` in bash)
        line <- readLines(con, 1) # Read one line at a time
        if(as.logical(rbinom(1,1,0.1))){# 10% chance of TRUE
                counter <- counter+1
                buffer[counter] <- line
        }
}
write(buffer, out)
close(con)
close(out)

After sampling 10% of the data, I imported the samples so that we can do some initial processing on the documents; tokenization into individual words comes after the cleanup below (the import code is trivial and can be seen in the raw Rmd doc).
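
The import code itself lives in the raw Rmd. As a rough sketch of what it might look like (an assumption on my part, reusing the sampled file produced by the code above):

library(readr)
library(dplyr)

# Read the sampled file and record the original line number, which is used
# later to tie word tokens back to the line they came from
dataBlogs <- tibble(text = read_lines("blogs_train.txt")) %>%
        mutate(linenumber = row_number())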

Let’s check the size of these imported files to see how many lines of text we are dealing with:
Lines
Blogs 89391
News 101154
Tweets 235337

Remove non-English paragraphs

This dataset is largely English text, but some non-English text (ex. “高等教育”) lurks in the data. Here I use the cld2 package to detect English versus non-English text. (Notes on cld2: the package is robust to misspelled words, but it requires at least three English words to reliably detect English (otherwise it returns NA), and in my testing it sometimes overcalls English when English words are mixed with non-English ones.)

In the below code I detect the language and keep anything that is English. This also throws out some English lines that are very short and hence not useful for this analysis anyway. First, how many languages are detected?

lang <- detect_language(dataBlogs$text)
summary(as.factor(lang))
   af    ar    bs    ca   ceb    cs    cy    da    de    el    en    es 
    6     1     1     2    13     7     4    37    28     2 83585    27 
   et    eu    fi    fr    ga    gd    gl    hr    ht    hu    id    is 
    5     2     3    60     5     4     5     2     1     1    11     8 
   it    ja    jw    ko    ku    lg    lt    mg    ms    mt    ne    nl 
   26    11     5     1     5     3     2     6    11     1     1    12 
   no    ny    pl    pt    ro    ru    rw    sk    sq    st    su    sv 
   29     9     4    13     5     1    15     2     1     8     3     7 
   sw    tl    tr    ur    uz    zh  NA's 
    6     8     3     1     4     2  5366 

The above summary output shows the number of lines categorized as each language; English’s two-letter code is ‘en’. A lot of languages are being detected. Is this accurate?

lang_ex <- bind_cols(
    dataBlogs %>%
    filter(lang == "cy") %>%
    select(text) %>% 
    rename("Welsh" = text),
    dataBlogs %>%
    filter(lang == "fr") %>%
    select(text) %>% 
    rename("French" = text) %>%
    slice(1:4),
    dataBlogs %>%
    filter(lang == "ja") %>%
    select(text) %>% 
    rename("Japanese" = text) %>%
    slice(1:4)
)
kable(lang_ex) %>% kable_styling(bootstrap_options = "striped")
Detected as Welsh:
  Guthrie: really?
  Ocenol – See Oleyl Alcohol.
  S4-239 Classic Scall Border petites / S4-201 Deckle Rec sm
  Walsall, Staffordshire.

Detected as French:
  • Lancôme Trésor in Love
  07-06 Paris, France – Zenith
  1. Billy Budd (Vauxhall)
  1. La Femme Parallel

Detected as Japanese:
  ブッダ:ビールが必要だね。
  私は日本人の素晴らしいところの1つは、デリカシーのあるところだと思います。日本人の持つ繊細さ、優美さや思いやりというのは、アメリカで言われるそれとは質の異なったものに私には感じられます。
  題して「ザ☆SUSHI」(笑)。アメリカのメーカーのSideshow Stampsですが和風なスタンプが時々出てくるんです♪それがとっても嬉しいです。
  部屋を出る前に車輪付きのベットに載せられた所。まだ何が起きているか分かっていなかった時。

The above example text shows that the detection seems to fail on very short inputs (ex. all the “Welsh” lines) and seems to do well for longer inputs. The miscalled short inputs will not be very useful for our algorithm anyway, so this is no great loss.

Below is my code for removing the non-English items from the datasets.

dataBlogs <- dataBlogs %>% 
        filter(lang == "en")
lang <- detect_language(dataNews$text)
dataNews <- dataNews %>% 
        filter(lang == "en")
lang <- detect_language(dataTweets$text)
dataTweets <- dataTweets %>% 
        filter(lang == "en")

Tokenization

This is trivial and can be skipped. Tokenization is the process whereby raw text is split into tokens: word-grams that are n words long (here, single words). My code for tokenization follows, along with the total number of words:

# Convert the data into a list of single words
dataBlogs <- dataBlogs %>% 
        unnest_tokens(words, text, drop = TRUE)
dataNews <- dataNews %>% 
        unnest_tokens(words, text, drop = TRUE)
dataTweets <- dataTweets %>% 
        unnest_tokens(words, text, drop = TRUE)
# How many words?
raw_words <- data.frame(c(
    dim(dataBlogs)[1],
    dim(dataNews)[1],
    dim(dataTweets)[1])
)
rownames(raw_words) <- c("Blogs", "News", "Tweets")
colnames(raw_words) <- c("Words")
kable(raw_words, caption = "Total number of words in each data set") %>% 
        kable_styling(bootstrap_options = "striped", full_width = F)
Total number of words in each data set
Words
Blogs 3703312
News 3456766
Tweets 2968386

Spellcheck

Not all the words have been spelled correctly, and misspelled words could add a huge amount of redundant data to our model. Before building our model, I used the ‘hunspell’ package to detect and correct misspellings. My code defaults to the top suggestion returned by hunspell. This will be wrong for some words, but in my limited testing it works well most of the time.

It is worth noting that because misspellings are corrected out of the training data, our final model will not predict well on misspelled input. We will have to pass any input through the same spellchecker before predicting on it.
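
As a side note, a helper for cleaning future user input the same way might look like the sketch below (correct_input is my own illustrative name, not part of the processing code in this document):

# Hypothetical helper: spellcheck a user's input phrase before prediction
library(hunspell)
correct_input <- function(phrase) {
        tokens <- unlist(strsplit(tolower(phrase), "\\s+"))
        ok <- hunspell_check(tokens)
        if (any(!ok)) {
                # Take the top suggestion for each misspelled token, NA if none
                fixes <- vapply(hunspell_suggest(tokens[!ok]),
                                function(s) if (length(s) > 0) s[1] else NA_character_,
                                character(1))
                tokens[!ok] <- fixes
        }
        tokens[!is.na(tokens)]
}
# correct_input("craccc lovin it")

The rest of this section applies the same top-suggestion idea to the training data itself.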

# First, make a list of correct words (TRUE/FALSE), subset the list from the 
# data, make corrections and pass the corrections back to the data frame.
# Note that 'hunspell_suggest' creates a list containing multiple suggestions 
# for each incorrect word. The first suggestion is often the best one, so I 
# extract it for my calculations
correct <- hunspell_check(dataBlogs$words)
misspelled <- dataBlogs[!correct, 2] # column 2 is the 'words' column
uniq_misspelled <- unique(misspelled$words) # Large redundancy reduction

# Spell checking is computationally very expensive. Use PARALLEL
# The below code creates a nested (double) list. The double map extracts the 
# first suggestion for each word. Takes about three minutes to run.
suggestions <- map(map(
        mclapply(uniq_misspelled, hunspell_suggest, mc.cores = 6),
        1),1)
# Some words don't appear in the spelling dictionary and are given NULL values
# NULL messes up downstream processing and needs to be replaced with NA
counter <- 0
for(element in suggestions){
    counter <- counter + 1 # indexing
    if(is.null(element)){
        suggestions[counter] <- NA # replacement
    }
}
suggestions <- unlist(suggestions) # easier for downstream processing

Let’s see how well the spell checker works. Here I print a comparison of the first ten original and corrected words:
1 2 3 4 5 6 7 8 9 10
Original april i’ll i’ve i’m tolkien russian virginia woolf craccc lovin
Correction April I’ll I’ve I’m Tolkien Russian Virginia Woolf crack loin

Wow. “lovin” became “loin”. Otherwise, not too bad. In some hidden code I change “loin” to “loving”. I don’t have time to go through all > 47,000 elements by hand, so just note that this is not a perfect spell checker.
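
That hidden fix amounts to a one-line patch of the suggestions vector, something like:

suggestions[suggestions == "loin"] <- "loving" # manual fix for a bad suggestion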

Below is an example of my code for replacing misspelled words in the original data frame:

# match the misspelled words to the unique words (same index as suggestions index)
corrected_index <- match(misspelled$words, uniq_misspelled)
misspelled$words <- suggestions[corrected_index]

# feed corrections back into the original data frame
dataBlogs[!correct, 2] <- misspelled$words

# Because of the NULL correction up above, some lines now have NA values for 
# words. These lines need to be entirely removed.
na_lines <- dataBlogs %>% filter(is.na(words)) %>% pull(linenumber) # plain vector for %in%
dataBlogs <- dataBlogs %>%
        filter(!(linenumber %in% na_lines))

Some of the word corrections involved breaking compound words apart (ex. “dogteeth” = “dog teeth”). That means my “words” data are now contaminated with multi-word phrases. To correct this I re-tokenized the data. An example of the code:

dataBlogs <- dataBlogs %>% # Based off of a Julia Silge post on StackOverFlow
        nest(words) %>% 
        mutate(text = map(data, unlist),
               text = map_chr(text, paste, collapse = " ")) %>% 
        select(-data)
dataBlogs <- dataBlogs %>% 
        unnest_tokens(words, text, drop = TRUE)

Stopword and profanity removal

We need to remove common ‘stop’ words, such as ‘the’, from our analysis because these words would oversaturate our predictions while not being useful predictions themselves. We are also electing to remove profanity. Below are the sources for my stopword and profanity lists, as well as an example of my code.

Source of stopword list: https://github.com/stopwords-iso/stopwords-iso
Source of profanity list: https://github.com/LDNOOBW/List-of-Dirty-Naughty-Obscene-and-Otherwise-Bad-Words/blob/master/en

# Combine stopwords and profane words, then remove matching words 
# from the data by an anti join.
profane <- read_table("en_Profanity.txt", col_names = FALSE)
remove <- c(pull(get_stopwords(), 1), pull(profane, 1)) # combine first columns
remove <- c(remove, c("I'll", "I'm", "I'd", "I've")) # missing uppercase
remove <- as_tibble(remove)
names(remove) <- "words"

dataBlogs <- dataBlogs %>% 
        anti_join(remove, by = "words")

How many words are left in each dataset now? Let’s compare to what we started with:
Original Wordcount Processed Wordcount Fraction Remaining
Blogs 3703312 1931532 0.5215688
News 3456766 2043056 0.5910310
Tweets 2968386 1704574 0.5742427
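
For reference, the comparison above can be built with a few lines like the following sketch (assuming raw_words from the tokenization step is still in memory and all three datasets have been through the anti join):

processed <- c(nrow(dataBlogs), nrow(dataNews), nrow(dataTweets))
word_compare <- data.frame(row.names = c("Blogs", "News", "Tweets"),
                           Original  = raw_words$Words,
                           Processed = processed,
                           Fraction  = processed / raw_words$Words)
kable(word_compare, caption = "Word counts before and after cleaning") %>% 
        kable_styling(bootstrap_options = "striped", full_width = F)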

Unigrams (individual words)

Now we are ready to count up how many times each word appears in our data.

# Combine all the data sets together and calculate word counts
unigram_counts <- bind_rows(dataBlogs, dataNews, dataTweets) %>% 
        count(words, sort = TRUE)

Let’s try to get a sense of the most common themes and the overall structure of the data. Below are two plots: first a wordcloud of the top 100 words (more common words are larger), then a histogram of word frequencies.
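
(The plotting code is in the raw Rmd; a rough sketch of how such plots can be made, assuming the wordcloud and ggplot2 packages, is below.)

library(wordcloud)
library(ggplot2)

# Wordcloud of the 100 most common words, sized by count
with(unigram_counts[1:100, ], wordcloud(words, n, scale = c(4, 0.5)))

# Histogram of how often each word appears (log scale to tame the skew)
ggplot(unigram_counts, aes(x = n)) +
        geom_histogram(bins = 50) +
        scale_x_log10() +
        labs(x = "Number of appearances", y = "Number of words")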

The wordcloud seems fine, nothing too odd. The histogram shows that the most frequently used words make up an extremely small fraction of the total lexicon yet account for most of the usage. Another way to look at this is: “How many words does it take to capture 50% of the word counts? 90%?” 50% of the word counts are accounted for by the top 1,000 words (1.19% of all distinct words), and 90% of the word counts can be found in just the top 12,500 words (14.84% of all distinct words).
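
The coverage figures above come down to a cumulative sum over the sorted counts; a quick sketch:

# unigram_counts is already sorted by descending count
coverage <- cumsum(unigram_counts$n) / sum(unigram_counts$n)
which(coverage >= 0.5)[1] # number of top words covering 50% of all word counts
which(coverage >= 0.9)[1] # number of top words covering 90% of all word counts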

Lastly, let’s check the head and tail of our unigram counts to make sure there aren’t any obvious errors.

check <- bind_cols(unigram_counts[1:10,],tail(unigram_counts,10)) %>% 
        rename(unigramsHead = words, n_head = n, unigramsTail = words1, 
               n_tail = n1)
kable(check) %>% 
        kable_styling(bootstrap_options = "striped", full_width = F)
unigramsHead n_head unigramsTail n_tail
said 30453 zonked 1
just 30352 zookeepers 1
one 29590 zoologist 1
like 27102 zoologists 1
can 24591 zoology’s 1
get 23138 zorn 1
time 22616 zurich’s 1
new 19837 zygote 1
now 17950 zygote’s 1
day 17888 zygotes 1

Bigrams

Next, we need a list of bigram counts for all the data. My code is below, but it can largely be skipped over because it is nearly the same as for the word counts. The biggest difference is in the options of ‘unnest_tokens’.

# Re-nest the single-word tokens back into one text string per line
dataBlogs <- dataBlogs %>% 
        nest(words) %>% 
        mutate(text = map(data, unlist),
               text = map_chr(text, paste, collapse = " ")) %>% 
        select(-data)
dataNews <- dataNews %>% 
        nest(words) %>% 
        mutate(text = map(data, unlist),
               text = map_chr(text, paste, collapse = " ")) %>% 
        select(-data)
dataTweets <- dataTweets %>% 
        nest(words) %>% 
        mutate(text = map(data, unlist),
               text = map_chr(text, paste, collapse = " ")) %>% 
        select(-data)
# 2-grams
bigram_counts <- bind_rows(dataBlogs %>% 
        unnest_tokens(grams2, text, drop = TRUE, token = "ngrams", n = 2),
        dataNews %>% 
        unnest_tokens(grams2, text, drop = TRUE, token = "ngrams", n = 2),
        dataTweets %>% 
        unnest_tokens(grams2, text, drop = TRUE, token = "ngrams", n = 2)) %>%
        count(grams2, sort = TRUE) %>%
        filter(!is.na(grams2))

Here are a wordcloud and histogram for the bigrams:

Again, the wordcloud looks sensible. Now that we are considering actual phrases, a couple of interesting patterns emerge (“new york” and several references to time: “years ago”, “first time”, “last week”).

The histogram looks largely as before. 50% of the bigram counts are accounted for by the top 636,000 bigrams (19.43% of all distinct bigrams), and 90% of the bigram counts can be found in the top 2,750,000 bigrams (84.03% of all distinct bigrams). The bigram counts are much more spread out than the individual word counts.

Lastly, let’s check the head and tail of our data for any obvious errors:
bigramsHead n_head bigramsTail n_tail
right now 2501 zydeco music 1
year old 2002 zydeco washington 1
new york 1988 zygote still 1
last year 1885 zygote’s somebody 1
last night 1581 zygotes well 1
high school 1473 zygotic mark 1
years ago 1387 zymurgy man 1
last week 1316 zymurgy milometer 1
feel like 1305 zyrtec clarinet 1
first time 1264 zyrtec makes 1
Note: ‘zymurgy man’ is not a mascot for yeast but a spell-checking error on ‘kyrgyz’. In ‘zyrtec clarinet’, ‘clarinet’ should likely be ‘Clarinex’, a competing product to ‘zyrtec’.

Trigrams

The last bit of data we need for our model is the count of trigrams. My code for calculating trigrams is below. This did not change much from the bigram code so it can be skipped over if you like.

# 3-grams
trigram_counts <- bind_rows(dataBlogs %>% 
        unnest_tokens(grams3, text, drop = TRUE, token = "ngrams", n = 3),
        dataNews %>% 
        unnest_tokens(grams3, text, drop = TRUE, token = "ngrams", n = 3),
        dataTweets %>% 
        unnest_tokens(grams3, text, drop = TRUE, token = "ngrams", n = 3)) %>%
        count(grams3, sort = TRUE) %>%
        filter(!is.na(grams3))

Again the histogram is highly skewed. 50% of the trigram counts are accounted for by the top 2,230,000 trigrams (47.81% of all distinct trigrams), and 90% of the trigram counts can be found in the top 4,180,000 trigrams (89.62% of all distinct trigrams). The trigram counts are slightly more spread out than the bigram counts.

Let’s check for any obvious problems in the actual counts:

trigramsHead n_head trigramsTail n_tail
let us know 245 zygotes well megabyte 1
new york city 236 zygotic mark wolf 1
happy new year 194 zygotic wolf family 1
two years ago 176 zygotic wolf minnesota 1
happy mother’s day 169 zymurgy man attacked 1
happy mothers day 161 zymurgy milometer reading 1
new york times 145 zyrtec allegra benedictory 1
cinch de mayo 141 zyrtec allegra d 1
7 30 pm 139 zyrtec clarinet children 1
president barack obama 130 zyrtec makes sleepy 1

This output shows a couple more quirks of the spell checker. Because each word is checked out of context, “mothers” in “happy mothers day” was given a pass; otherwise, “happy mother’s day” would be the most common trigram. “Cinco” is apparently not in the spell checker’s lexicon, hence “cinch de mayo”.

Katz Back-Off Prediction of ngrams

Under Construction
This is where I will build a basic prediction model that estimates the likelihood that a particular word will follow two other words (i.e., predicts the third word of a trigram). I have not yet implemented the algorithm, but the theory, illustrated with bigrams, is below if you are interested.

Also, see: Katz, S. M. (1987). “Estimation of probabilities from sparse data for the language model component of a speech recognizer.” IEEE Transactions on Acoustics, Speech, and Signal Processing, 35(3), 400-401.

An overview of the Katz Back-Off model

The Katz Back-Off model defines two sets of words. For bigrams, these are:
\(A(w_{i-1}) = \{ w:Count(w_{i-1},w) > 0\}\)
\(B(w_{i-1}) = \{ w:Count(w_{i-1},w) = 0\}\)
where \(A(w_{i-1})\) contains the words \(w\) that were observed following \(w_{i-1}\) in our data, and \(B(w_{i-1})\) contains the words that were not. ‘\(w\)’ is the candidate word of interest and ‘\(w_{i-1}\)’ is the first word in the bigram.

The probability of seeing a given bigram in a new data set is calculated as follows:
\(P(w_i \mid w_{i-1}) = Discount(w_{i-1},w_i)\,\frac{Count(w_{i-1},w_i)}{Count(w_{i-1})} \quad \text{if } w_i \in A(w_{i-1})\)

The discounting term can take a few different forms. The Good-Turing form replaces the raw count \(c\) of our target bigram with the adjusted count \(c^* = (c+1)\frac{N_{c+1}}{N_{c}}\), where \(N_c\) is the number of n-grams observed exactly \(c\) times and \(N_{c+1}\) is the number observed \(c+1\) times; the discount applied above is then \(c^*/c\).
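
As a quick illustration with made-up numbers: if our target bigram was seen \(c = 2\) times, \(N_2 = 100{,}000\) bigrams were seen exactly twice, and \(N_3 = 50{,}000\) bigrams were seen three times, then \(c^* = 3 \times \frac{50{,}000}{100{,}000} = 1.5\) and the discount is \(1.5/2 = 0.75\).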

If the bigram was not observed (i.e., \(w_i \in B(w_{i-1})\)), then we back off to the unigram probability:
\(\alpha(w_{i-1}) = \frac{1-\sum_{w_i \in A(w_{i-1})} Discount(w_{i-1},w_i)\frac{Count(w_{i-1},w_i)}{Count(w_{i-1})}}{\sum_{w_i \in B(w_{i-1})} P(w_i)}\)

\(P(w_i \mid w_{i-1}) = \alpha(w_{i-1}) \, P(w_{i})\), where the numerator of \(\alpha\) is the probability mass left over after discounting the observed bigrams (to be redistributed) and the denominator normalizes over the unigram probabilities of the words \(w_i\) that were not observed following \(w_{i-1}\).
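
To make these formulas concrete, below is a minimal, illustrative sketch of a bigram back-off probability using the unigram_counts and bigram_counts tables built earlier. It is not the final implementation: for brevity it uses a fixed absolute discount in place of the Good-Turing discount described above.

library(dplyr)
library(tidyr)

katz_bigram_prob <- function(w_prev, w, discount = 0.5) {
        # Set A: words observed following w_prev in the data
        observed <- bigram_counts %>%
                separate(grams2, into = c("w1", "w2"), sep = " ") %>%
                filter(w1 == w_prev)
        c_prev <- sum(observed$n)                 # Count(w_{i-1})
        if (c_prev == 0) return(NA_real_)         # w_prev never seen as a first word
        if (w %in% observed$w2) {
                # Discounted maximum-likelihood estimate for an observed bigram
                c_bi <- observed$n[observed$w2 == w]
                return((c_bi - discount) / c_prev)
        }
        # Probability mass freed up by discounting the observed bigrams
        leftover <- discount * nrow(observed) / c_prev
        # Set B: spread that mass over the unigram probabilities of words
        # NOT observed after w_prev, normalized over that set
        unseen <- unigram_counts %>% filter(!(words %in% observed$w2))
        p_uni <- unseen$n[unseen$words == w] / sum(unseen$n)
        if (length(p_uni) == 0) return(0)         # word not in the vocabulary
        leftover * p_uni
}

# Example: how likely is "york" to follow "new"?
# katz_bigram_prob("new", "york")

Swapping the fixed discount for the Good-Turing ratio \(c^*/c\) is the main piece left for the actual model build.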