Executive summary

The task is to build a predictive language model using 1, 2 or 3 words. A sound approach would be to look at the grammatical structure of phrases. However, such task is not trivial as words can be different parts of speech depending on context. Language is difficult also because it is changing, and in practice, there are, for example, abbreviations, variations and misspellings. It is easy to simplify, but it is important to note that usual rules such as someone’s can be someone is, or someone has, do not always apply; for example, there is also the possessive form as in someone’s things. Another example: often chat includes text “your amazing” and it is clear that what was meant is “you’re amazing” but it is not always the case.

We evaluate the new model using different metrics such as accuracy for the first, second, and third words. We also evaluate our model for efficiency using timing software. The app actually performs thousands of replacements, which would not be practical without optimizations.

The task is suitable one for a data scientist because data sources almost always need to be cleaned. The databases require quite a bit of cleaning. Besides misspellings, foreign language text, web addresses, e-mail addresses, numbers, etc., there are numerous characters from different character sets.

There seem to a few authors who twitter often and who have unique writing styles such as prepending an a- to the beginning of almost any word, or preaching using archaic words such as whilst or giveth. One author created a new word cos which she used in countless ways. Many authors write in colloquial or guttural speech (including from different cultures) using expressions that only sound like but hardly resemble the actual words they represent. For instance, ne can stand for any (or the intercardinal direction northeast, the state of Nebraska, etc.). On top of all that, there are non-standard abbreviations which are poorly documented and/or used in multiple ways. Moreover, many expressions are shortened words by omitting certain letters and of course it is kind of hard to look up that which is missing! For example, il can stand for hail (most likely), the state of Illinois, the feeling ill (a typo), a person’s name as in il sung na, or even the gutteral version of i’ll. Even common words such as wit can actually mean with.

Hence, automating translations is non-trivial. It is a necessary task as the results are highly dependent on how the data is prepared. On the one hand, we need to keep enough of the text so as not to spoil the usefulness of the prediction methodology. On the other hand, we do not want to see too many predictions, or worse, invalid predictions.

Prior work

The goal is to build a simple model for the relationship between words. We completed the first steps in building a predictive text mining application, while exploring simple models and discovering more complicated modeling techniques. Summarizing, our prior work involved building a basic n-gram model, using the exploratory analysis, to predict the next word based on the previous 1, 2, or 3 words.

In this context, an n-gram model is a type of probabilistic language model for predicting the next word given a so-called n-gram, which is just a phrase consisting of n words (Wikipedia contributors 2021b). Since phrases are not randomly ordered, it is natural to ask: what is the next word given a phrase? Answering this question appears at first glance to be plausible if we can answer the mathematical analog: what is the conditional probability that we find such word appearing after such phrase. Considering just the fact that languages constantly change, it turns out computing the conditional probability exactly is a hard problem to solve. It might be surprising, but we can approximate the conditional probability that takes into account all of the words in the phrase, by just the conditional probability of the last (n-1) words.

In particular, bigrams (2-grams) and trigrams(3-grams) can be used to predict the next word, given one word and two words, respectively. The assumption that the probability of a word depends only on the previous (n - 1) words is called a Markov assumption (see Jurafsky and Martin 2020, 3–4). Naturally, the predictive power will depend on the size and frequencies of the n-grams.

The probabilities may be estimated using the so-called maximum likelihood estimation (MLE) by getting counts from a corpus and normalizing (see Jurafsky and Martin 2020, 4). Let C(p) denote the normalized count of the phrase p. Let P denote the approximate probability. We may calculate the MLEs as follows:

2-gram: \(P(xy \mid x) = \displaystyle \frac{C(xy)}{C(x)}\)

3-gram: \(P(xyz \mid xy) = \displaystyle \frac{C(xyz)}{C(xy)}\)

4-gram: \(P(wxyz \mid wxy) = \displaystyle \frac{C(wxyz)}{C(wxy)}\)

Google’s simpler Stupid Backoff (SB) method with a constant backoff parameter of 0.4 has been used in cases when there is no history for the given prefix (see Brants et al. 2007). There are other techniques such as adding one to all of the counts, or the so-called Simple Good–Turing technique (see Sampson 2017; and Gale and Sampson 1995)

In our algorithm described next, we use the idea that if we fail to find desired n-gram then we skip one by ignoring information a word. Another approach may be developed using Katz’s back-off model (see Wikipedia contributors 2021a). Three of our steps involve skip one. Which words do we skip? We do not skip words arbitrarily or in some mathematical pattern. We choose to omit “words” for which we do not expect to make useful predictions from the database provided. For instance, there are numerous references to movies, but we do not have evidence to show the database has enough data to answer relevant questions about movies. In particular, we do not expect to predict “Betty,” “Lou,” or the number 56. Mainly we prefer to remove items which we do not want to predict, such as numbers, email addresses, profanity, and proper names. We Perform the following steps:

  1. If \(wxyz_1\) is in 4-grams, predict \(z_1\) using highest observed ranking such 4-gram; otherwise, proceed to next step.
  2. If \(xyz_2\) is in 3-grams, predict \(z_2\) using highest observed ranking such 3-gram; otherwise, proceed to next step.
  3. If \(wyz_3\) is in 3-grams, predict \(z_3\) using highest observed ranking such 3-gram; otherwise, proceed to next step.
  4. If \(yz_4\) is in 2-grams, predict \(z_4\) using highest observed ranking such 2-gram; otherwise, proceed to next step.
  5. If \(wxz_5\) is in 3-grams, predict \(z_5\) using highest observed ranking such 3-gram; otherwise, proceed to next step.
  6. If \(xz_6\) is in 2-grams, predict \(z_6\) using highest observed ranking such 2-gram; otherwise, predict end of phrase.

What’s new?

The output! In our original attempt, we focused on finding the best completion. This approach was sound because it was necessary to demonstrate the methodology. Suppose, however, we return a word completion and then, the user say’s: “well oh yeah, that too, but no, I don’t like that word , give me another completion!” So we will return a list of possible completions whenever such info is readily available. What does this mean? We do not intend to do an exhaustive search! However, upon searching for a particular n-gram, if we find multiple results, we then return them approximately in ranked order. The words are ranked so that the first item in the result is ‘our’ best guess at completion.

What else is new? We do not want to return bad words such as misspelled words. So we clean the text more rigorously. Well, wait a minute, didn’t we already do this task? Yes, but we were conservative. Also, previously we disallowed hyphenated words and words with apostrophes, which we now allow. Previously, we used the package quanteda to form tokens. However, this approach is problematic in view of hyphenated words and words’ with apostrophes. So we form tokens directly.

Load libraries

library("tidyverse")
library("lubridate")
library("lemon")
library("devtools")
library("formatR")
library("knitr")

Data processing

Set working directory.

Download data and unzip.

if (!file.exists("final")) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
        destfile = "Coursera-SwiftKey.zip")
    if (file.exists("Coursera-SwiftKey.zip")) {
        unzip("Coursera-SwiftKey.zip")
        file.remove("Coursera-SwiftKey.zip")
    }
}

Read in text files and summarize data

countWords <- function(Text) {
    return(as.list(sapply(Text, str_count, "\\w+")))
}

countCharacters <- function(Text) {
    return(as.list(sapply(Text, nchar)))
}

text2df <- function(Text, NWords, NCharacters, Size) {
    MeanWordsPerLine = as.numeric(sapply(NWords, mean))
    MeanCharactersPerLine = as.numeric(sapply(NCharacters, mean))
    CharactersPerWord = as.numeric(MeanCharactersPerLine)/as.numeric(MeanWordsPerLine)
    return(data.frame(Name = as.factor(c("blogs", "news", "Twitter")), Size = Size, 
        Length = as.numeric(sapply(Text, length)), Words = as.numeric(sapply(NWords, 
            sum)), Characters = as.numeric(sapply(NCharacters, sum)), MeanWords = MeanWordsPerLine, 
        MaxWords = as.numeric(sapply(NWords, max)), MeanCharacters = MeanCharactersPerLine, 
        WordLength = CharactersPerWord))
}

Filename = c("./final/en_US/en_US.blogs.txt", "./final/en_US/en_US.news.txt", "./final/en_US/en_US.twitter.txt")
Size = as.numeric(sapply(Filename, file.size))  # original file size, treated as constant 

Texts = lapply(Filename, readLines, encoding = "UTF-8", skipNul = TRUE)

NWords = countWords(Texts)

NCharacters = countCharacters(Texts)

df_summary <- text2df(Texts, NWords, NCharacters, Size)

save(df_summary, file = "completed data/df_summary.rda")
save(NCharacters, file = "completed data/NCharacters.rda")
save(NWords, file = "completed data/NWords.rda")

require(varhandle)
varhandle::rm.all.but(keep = c("NWords", "NCharacters", "df_summary"))
load(file = "completed data/df_summary.rda")

head(df_summary)
Name Size Length Words Characters MeanWords MaxWords MeanCharacters WordLength
blogs 210160014 899288 38309620 206824505 42.59995 6851 229.98695 5.398762
news 205811889 77259 2741594 15639408 35.48576 1522 202.42830 5.704495
twitter 167105338 2360148 31003544 162096241 13.13627 47 68.68054 5.228313

Visually view the number of lines and words per file.

par(mfrow = c(1, 3))
barplot(df_summary$Length/1000, names.arg = df_summary$Name, main = "Thousands of Lines", 
    col = c("blue", "green", "purple"), beside = TRUE)
barplot(df_summary$Words/1e+06, names.arg = df_summary$Name, main = "Millions of Words", 
    col = c("blue", "green", "purple"), beside = TRUE)
barplot(df_summary$Characters/1e+06, names.arg = df_summary$Name, main = "Millions of Characters", 
    col = c("blue", "green", "purple"), beside = TRUE)

Look distribution of the length of lines.

load(file = "completed data/NCharacters.rda")

s <- sqrt(df_summary$Length)

par(mfrow = c(3, 1))
hist(NCharacters[[1]], main = "Frequency of Line Lengths for Blogs", xlim = c(0, 
    800), xlab = "", ylab = "", col = "blue", breaks = s[1])
hist(NCharacters[[2]], main = "Frequency of Line Lengths for News", xlim = c(0, 600), 
    xlab = "", ylab = "Frequency", col = "green", breaks = s[2])
hist(NCharacters[[3]], main = "Frequency of Line Lengths for Twitter", xlim = c(0, 
    180), xlab = "Length", ylab = "", col = "purple", breaks = s[3])

rm(NCharacters, s)

Look distribution of the number of words. The code to create the file NWords.rda is omitted.

Let’s examine the frequencies of words.

df <- read.csv(file = "completed data/WordFreq.csv")

require("Hmisc")
Hmisc::describe(df$freq)
## df$freq 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   297913        0     2864    0.827    96.29    187.8        1        1 
##      .25      .50      .75      .90      .95 
##        1        1        4       20       68 
## 
## lowest :      1      2      3      4      5, highest: 548259 617015 723572 788560 937242
df <- df[order(df$freq, decreasing = TRUE), ]
print(df[1:20, ], row.names = FALSE)
##  word   freq
##   the 937242
##    to 788560
##     i 723572
##     a 617015
##   you 548259
##   and 438425
##   for 385197
##    in 380293
##    of 359571
##    is 358748
##    it 295902
##    my 291870
##    on 277952
##  that 234832
##    me 202236
##    be 187884
##    at 186706
##  with 173447
##  your 171207
##  have 168642
f <- df[50:1500, ]

ggplot(f, aes(x = 1:nrow(f), y = freq)) + geom_bar(stat = "identity", width = 0.5) + 
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + labs(title = "Frequency of Less Common Words by Rank", 
    x = "Index of Words", y = "count")

tail(f, 20)
##           word freq
## 1481      hill 1836
## 1482       ive 1836
## 1483  decision 1835
## 1484    period 1834
## 1485 tonight's 1833
## 1486        ad 1831
## 1487     merry 1829
## 1488      pray 1828
## 1489 instagram 1826
## 1490    losing 1826
## 1491    member 1826
## 1492        ma 1825
## 1493  sometime 1823
## 1494  response 1820
## 1495  bringing 1817
## 1496    paying 1816
## 1497   schools 1816
## 1498     third 1814
## 1499       yum 1814
## 1500     bacon 1810
rm(df, f)

In view of the quantity of data and the distribution of line lengths, we would rather use only the Twitter dataset as the other datasets do not seem suitable to answer the type of question we are considering. In particular, the news dataset is relatively small. The distribution of line lengths for the blogs dataset is too skewed towards short lines that we do not expect to build an n-gram model with good predictive power. Furthermore, it is not only intuitive but also supported by the literature that we can expect better results if we test using data from similar sources.

Nevertheless, all datasets are similar in terms of application area and we can expect to make better predictions using a larger database. So we decide to combine them.

Data cleaning

We assume independence so that each word depends only on the last n − 1 words in accordance with a Markov model (see references at end of this document). We do not expect to predict URLs, email addresses, words containing numbers or foreign characters, profanity, or stylistic punctuation.

We clean the data to ignore content which is not useful or desired for prediction:

The idea of the this task is quite ambitious undertaking. Put simply, we attempt to transform any text into a ‘standard form.’ This is difficult precisely because language is difficult, with all of the abbreviations, misspellings, contractions, etc. There are too many possible grammatical errors to check all of them, e.g., was advice used where advise should have been used or vice versa, etc. And we could not do this without building upon much prior work, and many iterations.

In particular, we convert text to UTF-8, some interjections, spell check and clean the text, except for profanity. In addition, this function importantly splits lines to avoid creating n-grams from the end of one phrase and the beginning of another. Actually, this function does quite a bit, with a very brief summary here:

We also fix apostrophe/quote problems. For example, if the original text is

           idea'----joe
           

In addition, we replace any letters repeated more than twice by one, so

      spppppaaaaaaaaaaawn -> spawn

We remove common interjections as they do not provide useful structural information, and they are not useful predictions. However, as always we do not just hastily remove them! Why not? Well, keep in mind, there are numerous grammatical errors, esp. punctuation. Nevertheless, we can infer punctuation from some interjections, and in such cases split the line to minimize forming n-grams from two different phrases, which will improve predictions (small compared to using punctuation as we have done)

The goal of the ‘standard’ part is that even when we have failed to make all possible corrections, there is reasonable expectation that our predictions will be meaningful precisely because we used the same conversion ‘form.’

Discovered in text which is unicode for ‘PRIVATE USE TWO’ quote. We convert to UTF-8 and replace unconventional quote with standard one.

We omit the cleaning code.

We need to define what is profanity. Taking into account words have multiple meanings, we take a conservative view by considering a word as profane if

For example breast appears in the list we have chosen but that word is in Webster’s dictionary, and so we would not want to say it is a swear word. We use a couple lists in order to be more complete.

There are no absolutes, and some words that appear in a dictionary are often used almost exclusively in an offensive manner. This is subjective, but we make an effort to remove such offensive terms.

It is helpful to do cleaning iteratively because the text is so bad that it becomes more comprehensible with each iteration. We avoid dropping so many lines that it would adversely affect predictions. On the other hand, by removing ‘bad’ lines, we expect to be able to make better predictions.

We replace all items that are skipped by ‘X.’

We also look our prior spelling corrections and words that we consider to add to dictionary for the application being developed.

Forming n-grams

We do not utilize the notion of a corpus which is commonly used for text mining in R because we do not want tokens to be damaged, e.g. there are problems with hyphenation. We form tokens using strsplit only. We do utilize library quanteda to implement the ‘frequency’ part of our algorithm (see Puschmann 2019), i.e. to form the n-grams. We need to process them specially with the X’s we used to replace text.

require("quanteda")

get_tokens <- function(filename) {
    as.tokens(strsplit(readLines(paste0(filename, ".txt")), " "))
}

ngram <- function(c, n) {
    d <- textstat_frequency(dfm(tokens_ngrams(c, n = n, concatenator = " "), tolower = FALSE))
    d <- data.frame(ngram = d$feature, count = d$frequency)
}

setwd("~/R/drbique.github.io/Capstone")

ntokens <- get_tokens("qtext")

df4 <- ngram(ntokens, 4L)
df3 <- ngram(ntokens, 3L)
df2 <- ngram(ntokens, 2L)

save(df2, df3, df4, file = "n-grams2-4.rda")

rm(ntokens)

Next, we need to fix the n-grams! This means basically two things:

  1. Remove X
  2. Do a bit of grammar checking to reduce bad completions

The latter step means when the last word is “a” we do not want to see “a” as a prediction! We do not consider exhaustive checking, and focus on a small number of common words.

The former step is more tricky because upon removing x from an n-gram, we have at most an (n-1)-gram.

We omit the subsequent code which includes substantial testing.

Let’s make a picture representation of the 4-grams.

load("df4_0.rda")
df <- data.frame(word = paste0(df4_0$w, " ", df4_0$x, " ", df4_0$y, " ", df4_0$z), 
    freq = df4_0$count)
df <- df[order(-df[, 2]), ]
df <- df[df$freq >= 7, ]

library(wordcloud2)
wc <- wordcloud2(data = df, size = 1, fontFamily = "alternategothic")

library(webshot)
# webshot::install_phantomjs(force=TRUE)

library(htmlwidgets)

saveWidget(wc, "wc4.html", selfcontained = F)

library(webshot)
webshot("wc4.html", "wc4.png", delay = 10, vwidth = 480, vheight = 480)

References

Brants, Thorsten, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. “Large Language Models in Machine Translation.” In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and ComputationalNatural Language Learning, 858–67. Association for Computational Linguistics. https://www.aclweb.org/anthology/D07-1090.pdf.
Gale, William A., and Geoffrey Sampson. 1995. “Good–Turing Frequency Estimation Without Tears.” Journal of Quantitative Linguistics 2 (3): 217–37. https://doi.org/10.1080/09296179508590051.
Jurafsky, Daniel, and James H. Martin. 2020. “N-Gram Language Models.” Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/3.pdf.
Puschmann, Cornelius. 2019. “Advancing Text Mining with R and Quanteda.” Methods Bites. October 2019. https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/advancing-text-mining/.
Sampson, Geoffrey. 2017. “Good–Turing Frequency Estimation.” November 2017. https://www.grsampson.net/RGoodTur.html.
Wikipedia contributors. 2021a. “Katz’s Back-Off Model.” Wikipedia, The Free Encyclopedia. January 2021. https://en.wikipedia.org/w/index.php?title=Katz%27s_back-off_model&oldid=991317475.
———. 2021b. “N-Gram.” Wikipedia, The Free Encyclopedia. January 2021. https://en.wikipedia.org/w/index.php?title=N-gram&oldid=998158976.