Management summary

The goal we are working towards is the development of a prediction model that suggests 3 words the user may want to type next, based on what the user has already typed. The solution should balance the memory used by the model against the speed with which suggestions are offered, so that the model can run, for instance, on a mobile phone. This document details the exploratory analysis of the data that will be used for training such a model, as well as first ideas on the modelling itself.

Creating a training and test set

We read in the American English blogs, news and Twitter data sets and create smaller training and testing data sets from them. These smaller data sets are much easier to handle and still form a representative sample from which inferences about the underlying data can be made. From each data set we select at random 1/100 of the lines for training and 1/1000 of the lines for testing, making sure that the two selections do not overlap. For now we ignore the data sets provided in other languages. A sketch of this sampling step is given below.
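
As a rough sketch of this sampling step, assuming the raw files sit in ./final/en_US/ and the samples are written to training and testing subdirectories (the testing directory name is an assumption); only the blogs file is shown, and the news and Twitter files are handled the same way:

set.seed(1234)  # fix the random seed so the sample is reproducible

lines <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

## Draw 1/100 of the lines for training and, from the remaining lines, 1/1000 for testing,
## so that the two samples cannot overlap
trainingIndices <- sample(seq_along(lines), round(length(lines) / 100))
remainingIndices <- setdiff(seq_along(lines), trainingIndices)
testingIndices <- sample(remainingIndices, round(length(lines) / 1000))

writeLines(lines[trainingIndices], "./final/en_US/training/en_US.blogs.training.txt")
writeLines(lines[testingIndices], "./final/en_US/testing/en_US.blogs.testing.txt")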

Some statistics of the source files
Source file          Number of lines   Number of words   Average number of words per line
en_US.blogs.txt               899288          37334131                              41.52
en_US.news.txt               1010242          34372530                              34.02
en_US.twitter.txt            2360148          30373543                              12.87
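
The counts in this table can be gathered along the following lines; this is a sketch that assumes the raw files sit in ./final/en_US/ and that counts words simply as whitespace-separated tokens:

sourceFiles <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
sourceStats <- do.call(rbind, lapply(sourceFiles, function(f) {
    lines <- readLines(file.path("./final/en_US", f), encoding = "UTF-8", skipNul = TRUE)
    numberOfWords <- sum(lengths(strsplit(lines, "\\s+")))
    data.frame(Source.file = f,
               Number.of.lines = length(lines),
               Number.of.words = numberOfWords,
               Average.number.of.words.per.line = numberOfWords / length(lines),
               stringsAsFactors = FALSE)
}))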

Reading in the smaller dataset and cleaning it up

We read in the training sets and start cleaning them up for further processing. Cleaning consists of:

- removing words that contain non-ASCII characters (which also drops foreign-language words containing such characters),
- removing URLs,
- removing punctuation such as colons, hyphens and quotes,
- transforming the text to lower case, and
- removing numbers.

We leave in the stop words (like ‘the’, ‘a’ and ‘to’) as we would also like to be able to predict those.

docs <- VCorpus(DirSource("./final/en_US/training"))

## First we get rid of any words containing non-ASCII characters (which also drops foreign-language
## words that contain non-ASCII characters)
toASCII <- content_transformer(function(x) {
    ## Split the line into words and mark every word that contains a non-ASCII character
    words.split <- unlist(strsplit(x, split = " "))
    words.ascii <- iconv(words.split, "UTF-8", "ASCII", sub = "<nonASCII>")
    wordsWithNonASCII <- grep("<nonASCII>", words.ascii)
    ## Only drop words when there is something to drop; indexing with -integer(0) would drop everything
    if (length(wordsWithNonASCII) > 0) {
        words.ascii <- words.ascii[-wordsWithNonASCII]
    }
    return(paste(words.ascii, collapse = " "))
})
docs <- tm_map(docs, toASCII)

## Next we remove all URLs from the text
removeURL <- content_transformer(function(x) {
    gsub("(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)", "", x)
})
docs <- tm_map(docs, removeURL)

## Next we eliminate punctuation, colons, hyphens, quotes etc.
docs <- tm_map(docs, removePunctuation, preserve_intra_word_dashes = FALSE)
## Helper transformers that replace a matched pattern with a space or remove it entirely
## (available for additional pattern-based cleaning)
toSpace <- content_transformer(function(x, pattern) {return(gsub(pattern, " ", x))})
toNon <- content_transformer(function(x, pattern) {return(gsub(pattern, "", x))})

## We transform the text to lower case and remove numbers
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)

Profanity filtering

Now that we’ve cleaned up the documents, we remove any words that may be offensive so that we won’t suggest any of these to the end user. Here too we focus on the American English data set for now.

## A list of English bad words is available on the Carnegie Mellon site.
badWordsList <- readLines("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
docs <- tm_map(docs, removeWords, badWordsList)
docs <- tm_map(docs, stripWhitespace)

Exploratory analysis

Now the training data set is ready for exploratory analysis. To do this we transform the data sets into so-called ngrams. These represent either single words (i.e. n = 1 ngrams) or combinations of 2, 3 or 4 consecutive words (i.e. n = 2, n = 3 or n = 4 ngrams respectively). We can then look at how often these ngrams occur in the texts and use these frequencies to analyze the structure of the texts. Below we first give a small illustration of what such ngrams look like, and then build the frequency tables and investigate them.
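
As a quick illustration (not part of the processing pipeline itself), the ngrams function from the NLP package, which the tokenizers below rely on, turns a tokenized sentence into ngrams as follows. The example sentence is made up for illustration.

library(NLP)
tokens <- strsplit("thanks for the follow", " ")[[1]]
## Bigrams: every pair of consecutive words, e.g. "thanks for", "for the", "the follow"
sapply(ngrams(tokens, 2), paste, collapse = " ")
## Trigrams: every run of three consecutive words, e.g. "thanks for the", "for the follow"
sapply(ngrams(tokens, 3), paste, collapse = " ")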

## We create unigram, bigram, trigram and quadgram term-document matrices and start investigating the ngram frequencies
unigram_tdm <- TermDocumentMatrix(docs, control = list(wordLengths = c(1, Inf)))
phrases_unigram <- data.frame(n = 1, phrase = rownames(unigram_tdm), frequency = slam::row_sums(unigram_tdm), stringsAsFactors = FALSE)
phrases_unigram$relativeFrequency <- (phrases_unigram$frequency / sum(phrases_unigram$frequency)) * 100

BigramTokenizer <- function(x) {
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
bigram_tdm <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer, wordLengths = c(1, Inf)))
phrases_bigram <- data.frame(n = 2, phrase = rownames(bigram_tdm), frequency = slam::row_sums(bigram_tdm), stringsAsFactors = FALSE)
phrases_bigram$relativeFrequency <- (phrases_bigram$frequency / sum(phrases_bigram$frequency)) * 100


TrigramTokenizer <- function(x) {
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
trigram_tdm <- TermDocumentMatrix(docs, control = list(tokenize = TrigramTokenizer, wordLengths = c(1, Inf)))
phrases_trigram <- data.frame(n = 3, phrase = rownames(trigram_tdm), frequency = slam::row_sums(trigram_tdm), stringsAsFactors = FALSE)
phrases_trigram$relativeFrequency <- (phrases_trigram$frequency / sum(phrases_trigram$frequency)) * 100


QuadgramTokenizer <- function(x) {
    unlist(lapply(ngrams(words(x), 4), paste, collapse = " "), use.names = FALSE)
}
quadgram_tdm <- TermDocumentMatrix(docs, control = list(tokenize = QuadgramTokenizer, wordLengths = c(1, Inf)))
phrases_quadgram <- data.frame(n = 4, phrase = rownames(quadgram_tdm), frequency = slam::row_sums(quadgram_tdm), stringsAsFactors = FALSE)
phrases_quadgram$relativeFrequency <- (phrases_quadgram$frequency / sum(phrases_quadgram$frequency)) * 100

## Next we investigate the frequencies
phrases <- rbind(phrases_unigram, phrases_bigram, phrases_trigram, phrases_quadgram)
phrases <- phrases[order(phrases$n, phrases$relativeFrequency, decreasing = TRUE),]
percentagesDataframe <- data.frame(n = integer(), frequency = integer(), percentageDrop = numeric())

## We create four plots for the top 30 most frequent ngrams per n.
for (i in 1:4) {
    nphrases <- phrases[phrases$n == i,]
    ## We also determine the percentage of ngram instances still covered when dropping low frequencies
    frequencyCounts <- table(nphrases$frequency)
    frequencyValues <- as.integer(names(frequencyCounts))
    instancesDropped <- cumsum(as.integer(frequencyCounts) * frequencyValues)
    percentagesWhenDroppingFrequencies <- round((sum(nphrases$frequency) - instancesDropped) / sum(nphrases$frequency) * 100, 2)
    df <- data.frame(n = i, frequency = frequencyValues, percentageDrop = percentagesWhenDroppingFrequencies)
    percentagesDataframe <- rbind(percentagesDataframe, df)

    nphrases <- nphrases[1:30,]
    nphrases$phrase <- factor(nphrases$phrase, nphrases$phrase)
    
    print(ggplot(nphrases, aes(phrase, relativeFrequency)) +
              geom_bar(stat = "identity", colour = "lightblue", fill = "darkblue") +
              theme(axis.text.x = element_text(angle = 90, hjust = 1),
                    plot.title = element_text(size = 10, face = "bold")) +
              xlab("ngram") + ylab("Relative frequency as a %") +
              ggtitle(paste("Relative frequency distribution of top 30 most frequent n =", i,
                            "ngrams (out of", nrow(phrases[phrases$n == i,]), "ngrams)")))
}

## Next we plot, per n, the percentage of ngram instances still covered as low-frequency ngrams are dropped
ggplot(data = percentagesDataframe, aes(x = frequency, y = percentageDrop, col = I("darkblue"))) +
    geom_point() + facet_grid(. ~ n) +
    ggtitle("Percentage of instances still covered when dropping low-frequency ngrams, per n value") +
    ylab("Percentage of instances covered") +
    xlab("Frequency (frequency == 1 already dropped)")

## What fraction of the n = 2, 3 and 4 ngrams ends in the most frequent unigram (which is "the")?
## We could use this information to simply suggest "the" as a default.
mostFrequentUnigram <- phrases_unigram[order(phrases_unigram$frequency, decreasing = TRUE),][1, "phrase"]

for (i in 2:4) {
    nphrases <- phrases[phrases$n == i,]
    numberOfngramsEnding <- nrow(nphrases[grepl(paste(mostFrequentUnigram, "$", sep = ""), nphrases$phrase),])
    print(paste("Percentage of n = ", i, " ngrams ending on the most frequent unigram ", mostFrequentUnigram, " = ", round((numberOfngramsEnding / nrow(nphrases))*100, 2), "%", sep = ""))
}
## [1] "Percentage of n = 2 ngrams ending on the most frequent unigram the = 1.39%"
## [1] "Percentage of n = 3 ngrams ending on the most frequent unigram the = 3.85%"
## [1] "Percentage of n = 4 ngrams ending on the most frequent unigram the = 4.69%"

Modelling ideas

The goal remains a prediction model that suggests 3 words the user may want to type next, based on what the user has already typed. The frequencies with which ngrams are observed in the training data can be used as the basis for this model. As noted in the management summary, the solution should balance memory usage against the speed with which suggestions are offered, so that the model can run on a mobile phone; in particular, the algorithm should respond with 3 suggestions within a fraction of a second after the user enters a word. Computationally intensive work at prediction time is therefore prohibitive in a production setting. This leads to the following modelling ideas, prototyped below:

- Precompute frequency tables of bigrams, trigrams and quadgrams, so that prediction reduces to a fast lookup in precomputed tables.
- Trim low-frequency ngrams from these tables to limit memory usage; the frequency-drop analysis above shows how much coverage is retained for a given trimming threshold.
- Given the last 1 to 3 words the user has typed, look up the matching (n+1)-grams and suggest the 3 with the highest relative frequency.

## First we build trimmed lookup tables per n and define a function that returns the top 3
## suggestions (if there are any) for the words typed so far.
## Assumption: trimmedPhrases is the phrases data frame with low-frequency ngrams dropped; as a
## sketch we keep only ngrams that occur more than once (the exact trimming threshold is a tuning choice).
trimmedPhrases <- phrases[phrases$frequency > 1,]
trimmedPhrases2 <- trimmedPhrases[trimmedPhrases$n == 2,]
trimmedPhrases3 <- trimmedPhrases[trimmedPhrases$n == 3,]
trimmedPhrases4 <- trimmedPhrases[trimmedPhrases$n == 4,]
trimmedPhrasesList <- list(trimmedPhrases2, trimmedPhrases3, trimmedPhrases4)

suggestNextWord <- function(aPhrase) {
    ## The number of words typed determines which ngram table to use:
    ## n typed words are completed by (n+1)-grams, stored at position n of the list
    n <- length(unlist(strsplit(aPhrase, " ")))
    rightLengthPhrases <- trimmedPhrasesList[[n]]

    ## Find all ngrams that start with the typed phrase and return the 3 most frequent ones
    matchString <- paste("^", aPhrase, " ", sep = "")
    matches <- rightLengthPhrases[grep(matchString, rightLengthPhrases$phrase), c("phrase", "relativeFrequency")]
    return(head(matches[order(matches$relativeFrequency, decreasing = TRUE),], 3))
}


## We calculate the memory size that is occupied by the data structure that holds the ngrams.
print(object.size(trimmedPhrasesList))
## 2348816 bytes
## Next we give it a test
kable(suggestNextWord("a"), row.names = FALSE, align = "c", digits = 4)
phrase       relativeFrequency
a little            0.0934
a few               0.0584
a lot               0.0467
kable(suggestNextWord("a lot"), row.names = FALSE, align = "c", digits = 4)
phrase       relativeFrequency
a lot of            0.0351
a lot for           0.0117
kable(suggestNextWord("going to be"), row.names = FALSE, align = "c", digits = 4)
phrase                 relativeFrequency
going to be a                 0.0234
going to be shoved            0.0117
going to be the               0.0117
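
A further modelling idea worth exploring, sketched here as a possible extension rather than something already implemented, is to back off to a shorter phrase when the words typed so far yield no suggestions:

## Possible backoff wrapper (assumed extension, not part of the prototype above): if no trimmed
## ngram matches the full phrase, retry with progressively shorter endings of the typed text
suggestWithBackoff <- function(aPhrase) {
    words.typed <- unlist(strsplit(aPhrase, " "))
    ## Use at most the last 3 typed words, since the longest stored ngrams are quadgrams
    for (len in min(3, length(words.typed)):1) {
        suffix <- paste(tail(words.typed, len), collapse = " ")
        suggestions <- suggestNextWord(suffix)
        if (nrow(suggestions) > 0) {
            return(suggestions)
        }
    }
    ## If nothing matches at all we could fall back to the overall most frequent unigrams
    return(data.frame(phrase = character(), relativeFrequency = numeric()))
}

## For example, "definitely going to be" backs off to the last three words "going to be"
## if no quadgram starting with all four typed words survived the trimming
suggestWithBackoff("definitely going to be")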