Chat Assistant App Report

Executive summary

The task is to build a predictive language model using 1, 2 or 3 words. A sound approach would be to look at the grammatical structure of phrases. However, such task is not trivial as words can be different parts of speech depending on context. Language is difficult also because it is changing, and in practice, there are, for example, abbreviations, variations and misspellings. It is easy to simplify, but it is important to note that usual rules such as someone’s can be someone is, or someone has, do not always apply; for example, there is also the possessive form as in someone’s things. Another example: often chat includes text “your amazing” and it is clear that what was meant is “you’re amazing” but it is not always the case.

We evaluate the new model using different metrics such as accuracy for the first, second, and third words. We also evaluate our model for efficiency using timing software. The app actually performs thousands of replacements, which would not be practical without optimizations.

The task is suitable one for a data scientist because data sources almost always need to be cleaned. The databases require quite a bit of cleaning. Besides misspellings, foreign language text, web addresses, e-mail addresses, numbers, etc., there are numerous characters from different character sets.

There seem to a few authors who twitter often and who have unique writing styles such as prepending an a- to the beginning of almost any word, or preaching using archaic words such as whilst or giveth. One author created a new word cos which she used in countless ways. Many authors write in colloquial or guttural speech (including from different cultures) using expressions that only sound like but hardly resemble the actual words they represent. For instance, ne can stand for any (or the intercardinal direction northeast, the state of Nebraska, etc.). On top of all that, there are non-standard abbreviations which are poorly documented and/or used in multiple ways. Moreover, many expressions are shortened words by omitting certain letters and of course it is kind of hard to look up that which is missing! For example, il can stand for hail (most likely), the state of Illinois, the feeling ill (a typo), a person’s name as in il sung na, or even the gutteral version of i’ll. Even common words such as wit can actually mean with.

Hence, automating translations is non-trivial. It is a necessary task as the results are highly dependent on how the data is prepared. On the one hand, we need to keep enough of the text so as not to spoil the usefulness of the prediction methodology. On the other hand, we do not want to see too many predictions, or worse, invalid predictions.

Prior work

The goal is to build a simple model for the relationship between words. We completed the first steps in building a predictive text mining application, while exploring simple models and discovering more complicated modeling techniques. Summarizing, our prior work involved building a basic n-gram model, using the exploratory analysis, to predict the next word based on the previous 1, 2, or 3 words.

In this context, an n-gram model is a type of probabilistic language model for predicting the next word given a so-called n-gram, which is just a phrase consisting of n words (Wikipedia contributors 2021b). Since phrases are not randomly ordered, it is natural to ask: what is the next word given a phrase? Answering this question appears at first glance to be plausible if we can answer the mathematical analog: what is the conditional probability that we find such word appearing after such phrase. Considering just the fact that languages constantly change, it turns out computing the conditional probability exactly is a hard problem to solve. It might be surprising, but we can approximate the conditional probability that takes into account all of the words in the phrase, by just the conditional probability of the last (n-1) words.

In particular, bigrams (2-grams) and trigrams(3-grams) can be used to predict the next word, given one word and two words, respectively. The assumption that the probability of a word depends only on the previous (n - 1) words is called a Markov assumption (see Jurafsky and Martin 2020, 3–4). Naturally, the predictive power will depend on the size and frequencies of the n-grams.

The probabilities may be estimated using the so-called maximum likelihood estimation (MLE) by getting counts from a corpus and normalizing (see Jurafsky and Martin 2020, 4). Let C(p) denote the normalized count of the phrase p. Let P denote the approximate probability. We may calculate the MLEs as follows:

2-gram: \(P(xy \mid x) = \displaystyle \frac{C(xy)}{C(x)}\)

3-gram: \(P(xyz \mid xy) = \displaystyle \frac{C(xyz)}{C(xy)}\)

4-gram: \(P(wxyz \mid wxy) = \displaystyle \frac{C(wxyz)}{C(wxy)}\)

Google’s simpler Stupid Backoff (SB) method with a constant backoff parameter of 0.4 has been used in cases when there is no history for the given prefix (see Brants et al. 2007). There are other techniques such as adding one to all of the counts, or the so-called Simple Good–Turing technique (see Sampson 2017; and Gale and Sampson 1995)

In our algorithm described next, we use the idea that if we fail to find desired n-gram then we skip one by ignoring information a word. Another approach may be developed using Katz’s back-off model (see Wikipedia contributors 2021a). Three of our steps involve skip one. Which words do we skip? We do not skip words arbitrarily or in some mathematical pattern. We choose to omit “words” for which we do not expect to make useful predictions from the database provided. For instance, there are numerous references to movies, but we do not have evidence to show the database has enough data to answer relevant questions about movies. In particular, we do not expect to predict “Betty,” “Lou,” or the number 56. Mainly we prefer to remove items which we do not want to predict, such as numbers, email addresses, profanity, and proper names. We Perform the following steps:

If \(wxyz_1\) is in 4-grams, predict \(z_1\) using highest observed ranking such 4-gram; otherwise, proceed to next step.
If \(xyz_2\) is in 3-grams, predict \(z_2\) using highest observed ranking such 3-gram; otherwise, proceed to next step.
If \(wyz_3\) is in 3-grams, predict \(z_3\) using highest observed ranking such 3-gram; otherwise, proceed to next step.
If \(yz_4\) is in 2-grams, predict \(z_4\) using highest observed ranking such 2-gram; otherwise, proceed to next step.
If \(wxz_5\) is in 3-grams, predict \(z_5\) using highest observed ranking such 3-gram; otherwise, proceed to next step.
If \(xz_6\) is in 2-grams, predict \(z_6\) using highest observed ranking such 2-gram; otherwise, predict end of phrase.

What’s new?

The output! In our original attempt, we focused on finding the best completion. This approach was sound because it was necessary to demonstrate the methodology. Suppose, however, we return a word completion and then, the user say’s: “well oh yeah, that too, but no, I don’t like that word , give me another completion!” So we will return a list of possible completions whenever such info is readily available. What does this mean? We do not intend to do an exhaustive search! However, upon searching for a particular n-gram, if we find multiple results, we then return them approximately in ranked order. The words are ranked so that the first item in the result is ‘our’ best guess at completion.

What else is new? We do not want to return bad words such as misspelled words. So we clean the text more rigorously. Well, wait a minute, didn’t we already do this task? Yes, but we were conservative. Also, previously we disallowed hyphenated words and words with apostrophes, which we now allow. Previously, we used the package quanteda to form tokens. However, this approach is problematic in view of hyphenated words and words’ with apostrophes. So we form tokens directly.

Load libraries

library("tidyverse")
library("lubridate")
library("lemon")
library("devtools")
library("formatR")
library("knitr")

Data processing

Set working directory.

Download data and unzip.

if (!file.exists("final")) {
    download.file("https://d396qusza40orc.cloudfront.net/dsscapstone/dataset/Coursera-SwiftKey.zip", 
        destfile = "Coursera-SwiftKey.zip")
    if (file.exists("Coursera-SwiftKey.zip")) {
        unzip("Coursera-SwiftKey.zip")
        file.remove("Coursera-SwiftKey.zip")
    }
}

Read in text files and summarize data

countWords <- function(Text) {
    return(as.list(sapply(Text, str_count, "\\w+")))
}

countCharacters <- function(Text) {
    return(as.list(sapply(Text, nchar)))
}

text2df <- function(Text, NWords, NCharacters, Size) {
    MeanWordsPerLine = as.numeric(sapply(NWords, mean))
    MeanCharactersPerLine = as.numeric(sapply(NCharacters, mean))
    CharactersPerWord = as.numeric(MeanCharactersPerLine)/as.numeric(MeanWordsPerLine)
    return(data.frame(Name = as.factor(c("blogs", "news", "Twitter")), Size = Size, 
        Length = as.numeric(sapply(Text, length)), Words = as.numeric(sapply(NWords, 
            sum)), Characters = as.numeric(sapply(NCharacters, sum)), MeanWords = MeanWordsPerLine, 
        MaxWords = as.numeric(sapply(NWords, max)), MeanCharacters = MeanCharactersPerLine, 
        WordLength = CharactersPerWord))
}

Filename = c("./final/en_US/en_US.blogs.txt", "./final/en_US/en_US.news.txt", "./final/en_US/en_US.twitter.txt")
Size = as.numeric(sapply(Filename, file.size))  # original file size, treated as constant 

Texts = lapply(Filename, readLines, encoding = "UTF-8", skipNul = TRUE)

NWords = countWords(Texts)

NCharacters = countCharacters(Texts)

df_summary <- text2df(Texts, NWords, NCharacters, Size)

save(df_summary, file = "completed data/df_summary.rda")
save(NCharacters, file = "completed data/NCharacters.rda")
save(NWords, file = "completed data/NWords.rda")

require(varhandle)
varhandle::rm.all.but(keep = c("NWords", "NCharacters", "df_summary"))

load(file = "completed data/df_summary.rda")

head(df_summary)

Name	Size	Length	Words	Characters	MeanWords	MaxWords	MeanCharacters	WordLength
blogs	210160014	899288	38309620	206824505	42.59995	6851	229.98695	5.398762
news	205811889	77259	2741594	15639408	35.48576	1522	202.42830	5.704495
twitter	167105338	2360148	31003544	162096241	13.13627	47	68.68054	5.228313

Visually view the number of lines and words per file.

par(mfrow = c(1, 3))
barplot(df_summary$Length/1000, names.arg = df_summary$Name, main = "Thousands of Lines", 
    col = c("blue", "green", "purple"), beside = TRUE)
barplot(df_summary$Words/1e+06, names.arg = df_summary$Name, main = "Millions of Words", 
    col = c("blue", "green", "purple"), beside = TRUE)
barplot(df_summary$Characters/1e+06, names.arg = df_summary$Name, main = "Millions of Characters", 
    col = c("blue", "green", "purple"), beside = TRUE)

Look distribution of the length of lines.

load(file = "completed data/NCharacters.rda")

s <- sqrt(df_summary$Length)

par(mfrow = c(3, 1))
hist(NCharacters[[1]], main = "Frequency of Line Lengths for Blogs", xlim = c(0, 
    800), xlab = "", ylab = "", col = "blue", breaks = s[1])
hist(NCharacters[[2]], main = "Frequency of Line Lengths for News", xlim = c(0, 600), 
    xlab = "", ylab = "Frequency", col = "green", breaks = s[2])
hist(NCharacters[[3]], main = "Frequency of Line Lengths for Twitter", xlim = c(0, 
    180), xlab = "Length", ylab = "", col = "purple", breaks = s[3])

rm(NCharacters, s)

Look distribution of the number of words. The code to create the file NWords.rda is omitted.

Let’s examine the frequencies of words.

df <- read.csv(file = "completed data/WordFreq.csv")

require("Hmisc")
Hmisc::describe(df$freq)

## df$freq 
##        n  missing distinct     Info     Mean      Gmd      .05      .10 
##   297913        0     2864    0.827    96.29    187.8        1        1 
##      .25      .50      .75      .90      .95 
##        1        1        4       20       68 
## 
## lowest :      1      2      3      4      5, highest: 548259 617015 723572 788560 937242

df <- df[order(df$freq, decreasing = TRUE), ]
print(df[1:20, ], row.names = FALSE)

##  word   freq
##   the 937242
##    to 788560
##     i 723572
##     a 617015
##   you 548259
##   and 438425
##   for 385197
##    in 380293
##    of 359571
##    is 358748
##    it 295902
##    my 291870
##    on 277952
##  that 234832
##    me 202236
##    be 187884
##    at 186706
##  with 173447
##  your 171207
##  have 168642

f <- df[50:1500, ]

ggplot(f, aes(x = 1:nrow(f), y = freq)) + geom_bar(stat = "identity", width = 0.5) + 
    theme(axis.text.x = element_text(angle = 90, vjust = 0.5)) + labs(title = "Frequency of Less Common Words by Rank", 
    x = "Index of Words", y = "count")

tail(f, 20)

##           word freq
## 1481      hill 1836
## 1482       ive 1836
## 1483  decision 1835
## 1484    period 1834
## 1485 tonight's 1833
## 1486        ad 1831
## 1487     merry 1829
## 1488      pray 1828
## 1489 instagram 1826
## 1490    losing 1826
## 1491    member 1826
## 1492        ma 1825
## 1493  sometime 1823
## 1494  response 1820
## 1495  bringing 1817
## 1496    paying 1816
## 1497   schools 1816
## 1498     third 1814
## 1499       yum 1814
## 1500     bacon 1810

rm(df, f)

In view of the quantity of data and the distribution of line lengths, we would rather use only the Twitter dataset as the other datasets do not seem suitable to answer the type of question we are considering. In particular, the news dataset is relatively small. The distribution of line lengths for the blogs dataset is too skewed towards short lines that we do not expect to build an n-gram model with good predictive power. Furthermore, it is not only intuitive but also supported by the literature that we can expect better results if we test using data from similar sources.

Nevertheless, all datasets are similar in terms of application area and we can expect to make better predictions using a larger database. So we decide to combine them.

Data cleaning

We assume independence so that each word depends only on the last n − 1 words in accordance with a Markov model (see references at end of this document). We do not expect to predict URLs, email addresses, words containing numbers or foreign characters, profanity, or stylistic punctuation.

We clean the data to ignore content which is not useful or desired for prediction:

remove words containing a digit
remove words containing @ symbol such as email addresses
remove tags in angled brackets <> such as HTML tags that may contain URLs
remove URLs
remove words containing such as folder names
remove punctuation and excessive hyphens
Americanize a number of british variations
transform to lower case
remove words containing letters other than a-z such as a foreign character
remove extra white space
remove profanity

The idea of the this task is quite ambitious undertaking. Put simply, we attempt to transform any text into a ‘standard form.’ This is difficult precisely because language is difficult, with all of the abbreviations, misspellings, contractions, etc. There are too many possible grammatical errors to check all of them, e.g., was advice used where advise should have been used or vice versa, etc. And we could not do this without building upon much prior work, and many iterations.

In particular, we convert text to UTF-8, some interjections, spell check and clean the text, except for profanity. In addition, this function importantly splits lines to avoid creating n-grams from the end of one phrase and the beginning of another. Actually, this function does quite a bit, with a very brief summary here:

converts to UTF-8 character set and fixes quotes
remove lines containing strings which appear to be poor grammar, hate speech including some vulgarity
removes tags and things in nested parentheses and such
removes web addresses, twitter RT tags, and long hashtags
removes foreign words with foreign characters and anything containing such characters
removes many chat symbols
allow & only in certain forms such as AT&T
allow numbers only in certain forms such as 3m . . . Removing is never perfect. For example removing entirely such things as ‘(it is a fun project)’ is not really a good thing, but most of the time what is inside parentheses is just a single word or thing, which are not helpful. Additional profanity is removed later. Removing matching quotes is difficult when there are contractions and possessives inside, and there are also mistakes.

We also fix apostrophe/quote problems. For example, if the original text is

           idea'----joe

In addition, we replace any letters repeated more than twice by one, so

      spppppaaaaaaaaaaawn -> spawn

We remove common interjections as they do not provide useful structural information, and they are not useful predictions. However, as always we do not just hastily remove them! Why not? Well, keep in mind, there are numerous grammatical errors, esp. punctuation. Nevertheless, we can infer punctuation from some interjections, and in such cases split the line to minimize forming n-grams from two different phrases, which will improve predictions (small compared to using punctuation as we have done)

The goal of the ‘standard’ part is that even when we have failed to make all possible corrections, there is reasonable expectation that our predictions will be meaningful precisely because we used the same conversion ‘form.’

Discovered in text which is unicode for ‘PRIVATE USE TWO’ quote. We convert to UTF-8 and replace unconventional quote with standard one.

We omit the cleaning code.

We need to define what is profanity. Taking into account words have multiple meanings, we take a conservative view by considering a word as profane if

it appears in what appears as a comprehensive list of such words, and
is not in a dictionary for the language.

For example breast appears in the list we have chosen but that word is in Webster’s dictionary, and so we would not want to say it is a swear word. We use a couple lists in order to be more complete.

There are no absolutes, and some words that appear in a dictionary are often used almost exclusively in an offensive manner. This is subjective, but we make an effort to remove such offensive terms.

It is helpful to do cleaning iteratively because the text is so bad that it becomes more comprehensible with each iteration. We avoid dropping so many lines that it would adversely affect predictions. On the other hand, by removing ‘bad’ lines, we expect to be able to make better predictions.

We replace all items that are skipped by ‘X.’

We also look our prior spelling corrections and words that we consider to add to dictionary for the application being developed.

Forming n-grams

We do not utilize the notion of a corpus which is commonly used for text mining in R because we do not want tokens to be damaged, e.g. there are problems with hyphenation. We form tokens using strsplit only. We do utilize library quanteda to implement the ‘frequency’ part of our algorithm (see Puschmann 2019), i.e. to form the n-grams. We need to process them specially with the X’s we used to replace text.

require("quanteda")

get_tokens <- function(filename) {
    as.tokens(strsplit(readLines(paste0(filename, ".txt")), " "))
}

ngram <- function(c, n) {
    d <- textstat_frequency(dfm(tokens_ngrams(c, n = n, concatenator = " "), tolower = FALSE))
    d <- data.frame(ngram = d$feature, count = d$frequency)
}

setwd("~/R/drbique.github.io/Capstone")

ntokens <- get_tokens("qtext")

df4 <- ngram(ntokens, 4L)
df3 <- ngram(ntokens, 3L)
df2 <- ngram(ntokens, 2L)

save(df2, df3, df4, file = "n-grams2-4.rda")

rm(ntokens)

Next, we need to fix the n-grams! This means basically two things:

Remove X
Do a bit of grammar checking to reduce bad completions

The latter step means when the last word is “a” we do not want to see “a” as a prediction! We do not consider exhaustive checking, and focus on a small number of common words.

The former step is more tricky because upon removing x from an n-gram, we have at most an (n-1)-gram.

We omit the subsequent code which includes substantial testing.

Let’s make a picture representation of the 4-grams.

load("df4_0.rda")
df <- data.frame(word = paste0(df4_0$w, " ", df4_0$x, " ", df4_0$y, " ", df4_0$z), 
    freq = df4_0$count)
df <- df[order(-df[, 2]), ]
df <- df[df$freq >= 7, ]

library(wordcloud2)
wc <- wordcloud2(data = df, size = 1, fontFamily = "alternategothic")

library(webshot)
# webshot::install_phantomjs(force=TRUE)

library(htmlwidgets)

saveWidget(wc, "wc4.html", selfcontained = F)

library(webshot)
webshot("wc4.html", "wc4.png", delay = 10, vwidth = 480, vheight = 480)

References

Brants, Thorsten, Ashok C. Popat, Peng Xu, Franz J. Och, and Jeffrey Dean. 2007. “Large Language Models in Machine Translation.” In Proceedings of the 2007 Joint Conference on Empirical Methods in Natural Language Processing and ComputationalNatural Language Learning, 858–67. Association for Computational Linguistics. https://www.aclweb.org/anthology/D07-1090.pdf.

Gale, William A., and Geoffrey Sampson. 1995. “Good–Turing Frequency Estimation Without Tears.” Journal of Quantitative Linguistics 2 (3): 217–37. https://doi.org/10.1080/09296179508590051.

Jurafsky, Daniel, and James H. Martin. 2020. “N-Gram Language Models.” Speech and Language Processing. https://web.stanford.edu/~jurafsky/slp3/3.pdf.

Puschmann, Cornelius. 2019. “Advancing Text Mining with R and Quanteda.” Methods Bites. October 2019. https://www.mzes.uni-mannheim.de/socialsciencedatalab/article/advancing-text-mining/.

Sampson, Geoffrey. 2017. “Good–Turing Frequency Estimation.” November 2017. https://www.grsampson.net/RGoodTur.html.

Wikipedia contributors. 2021a. “Katz’s Back-Off Model.” Wikipedia, The Free Encyclopedia. January 2021. https://en.wikipedia.org/w/index.php?title=Katz%27s_back-off_model&oldid=991317475.

———. 2021b. “N-Gram.” Wikipedia, The Free Encyclopedia. January 2021. https://en.wikipedia.org/w/index.php?title=N-gram&oldid=998158976.