Management summary

The goal we are working towards is the development of a prediction model that suggests 3 words the user may want to type next, based on what the user has already typed. The solution should balance the memory used by the model against the speed with which suggestions are offered, so that the model can run, for instance, on a mobile phone. This document details the exploratory analysis of the data that will be used for training such a model, as well as first ideas on the modelling itself.

Creating a training and test set

We read in the American English blogs, news and Twitter data sets and create smaller training and testing data sets from them. These smaller data sets are much easier to handle and still form a representative sample from which inferences about the underlying data can be made. From each data set we select at random 1/100 of the lines for training and 1/1000 of the lines for testing, making sure that the two selections do not overlap. For now we ignore the data sets provided in other languages. A sketch of this sampling step is given below.
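
As a rough sketch of this sampling step, assuming the raw files sit in ./final/en_US/ and the samples are written to training and testing subdirectories (the testing directory name is an assumption); only the blogs file is shown, and the news and Twitter files are handled the same way:

set.seed(1234)  # fix the random seed so the sample is reproducible

lines <- readLines("./final/en_US/en_US.blogs.txt", encoding = "UTF-8", skipNul = TRUE)

## Draw 1/100 of the lines for training and, from the remaining lines, 1/1000 for testing,
## so that the two samples cannot overlap
trainingIndices <- sample(seq_along(lines), round(length(lines) / 100))
remainingIndices <- setdiff(seq_along(lines), trainingIndices)
testingIndices <- sample(remainingIndices, round(length(lines) / 1000))

writeLines(lines[trainingIndices], "./final/en_US/training/en_US.blogs.training.txt")
writeLines(lines[testingIndices], "./final/en_US/testing/en_US.blogs.testing.txt")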

Some statistics of the source files
Source file          Number of lines   Number of words   Average number of words per line
en_US.blogs.txt               899288          37334131                              41.52
en_US.news.txt               1010242          34372530                              34.02
en_US.twitter.txt            2360148          30373543                              12.87
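
The counts in this table can be gathered along the following lines; this is a sketch that assumes the raw files sit in ./final/en_US/ and that counts words simply as whitespace-separated tokens:

sourceFiles <- c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")
sourceStats <- do.call(rbind, lapply(sourceFiles, function(f) {
    lines <- readLines(file.path("./final/en_US", f), encoding = "UTF-8", skipNul = TRUE)
    numberOfWords <- sum(lengths(strsplit(lines, "\\s+")))
    data.frame(Source.file = f,
               Number.of.lines = length(lines),
               Number.of.words = numberOfWords,
               Average.number.of.words.per.line = numberOfWords / length(lines),
               stringsAsFactors = FALSE)
}))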

Reading in the smaller dataset and cleaning it up

We read in the training sets and start cleaning them up for further processing. Cleaning consists of:

- removing words that contain non-ASCII characters (which also drops foreign-language words containing such characters),
- removing URLs,
- removing punctuation such as colons, hyphens and quotes,
- transforming the text to lower case, and
- removing numbers.

We leave in the stop words (like ‘the’, ‘a’ and ‘to’) as we would also like to be able to predict those.

docs <- VCorpus(DirSource("./final/en_US/training"))

## First we get rid of any words containing non-ASCII characters (which also drops foreign-language
## words that contain non-ASCII characters)
toASCII <- content_transformer(function(x) {
    ## Split the line into words and mark every word that contains a non-ASCII character
    words.split <- unlist(strsplit(x, split = " "))
    words.ascii <- iconv(words.split, "UTF-8", "ASCII", sub = "<nonASCII>")
    wordsWithNonASCII <- grep("<nonASCII>", words.ascii)
    ## Only drop words when there is something to drop; indexing with -integer(0) would drop everything
    if (length(wordsWithNonASCII) > 0) {
        words.ascii <- words.ascii[-wordsWithNonASCII]
    }
    return(paste(words.ascii, collapse = " "))
})
docs <- tm_map(docs, toASCII)

## Next we remove all URLs from the text
removeURL <- content_transformer(function(x) {
    gsub("(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)", "", x)
})
docs <- tm_map(docs, removeURL)

## Next we eliminate punctuation, colons, hyphens, quotes etc.
docs <- tm_map(docs, removePunctuation, preserve_intra_word_dashes = FALSE)
## Helper transformers that replace a matched pattern with a space or remove it entirely
## (available for additional pattern-based cleaning)
toSpace <- content_transformer(function(x, pattern) {return(gsub(pattern, " ", x))})
toNon <- content_transformer(function(x, pattern) {return(gsub(pattern, "", x))})

## We transform the text to lower case and remove numbers
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)

Profanity filtering

Now that we’ve cleaned up the documents, we remove any words that may be offensive so that we won’t suggest any of these to the end user. Here too we focus on the American English data set for now.

## A list of English bad words is available on the Carnegie Mellon site.
badWordsList <- readLines("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
docs <- tm_map(docs, removeWords, badWordsList)
docs <- tm_map(docs, stripWhitespace)

Exploratory analysis

Now the training data set is ready for exploratory analysis. To do this we transform the data sets into so-called ngrams. These represent either single words (i.e. n = 1 ngrams) or combinations of 2, 3 or 4 consecutive words (i.e. n = 2, n = 3 or n = 4 ngrams respectively). We can then look at how often these ngrams occur in the texts and use these frequencies to analyze the structure of the texts. Below we first give a small illustration of what such ngrams look like, and then build the frequency tables and investigate them.
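
As a quick illustration (not part of the processing pipeline itself), the ngrams function from the NLP package, which the tokenizers below rely on, turns a tokenized sentence into ngrams as follows. The example sentence is made up for illustration.

library(NLP)
tokens <- strsplit("thanks for the follow", " ")[[1]]
## Bigrams: every pair of consecutive words, e.g. "thanks for", "for the", "the follow"
sapply(ngrams(tokens, 2), paste, collapse = " ")
## Trigrams: every run of three consecutive words, e.g. "thanks for the", "for the follow"
sapply(ngrams(tokens, 3), paste, collapse = " ")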

## We create unigram, bigram, trigram and quadgram term-document matrices and start investigating the ngram frequencies
unigram_tdm <- TermDocumentMatrix(docs, control = list(wordLengths = c(1, Inf)))
phrases_unigram <- data.frame(n = 1, phrase = rownames(unigram_tdm), frequency = slam::row_sums(unigram_tdm), stringsAsFactors = FALSE)
phrases_unigram$relativeFrequency <- (phrases_unigram$frequency / sum(phrases_unigram$frequency)) * 100

BigramTokenizer <- function(x) {
    unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
bigram_tdm <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer, wordLengths = c(1, Inf)))
phrases_bigram <- data.frame(n = 2, phrase = rownames(bigram_tdm), frequency = slam::row_sums(bigram_tdm), stringsAsFactors = FALSE)
phrases_bigram$relativeFrequency <- (phrases_bigram$frequency / sum(phrases_bigram$frequency)) * 100


TrigramTokenizer <- function(x) {
    unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
trigram_tdm <- TermDocumentMatrix(docs, control = list(tokenize = TrigramTokenizer, wordLengths = c(1, Inf)))
phrases_trigram <- data.frame(n = 3, phrase = rownames(trigram_tdm), frequency = slam::row_sums(trigram_tdm), stringsAsFactors = FALSE)
phrases_trigram$relativeFrequency <- (phrases_trigram$frequency / sum(phrases_trigram$frequency)) * 100


QuadgramTokenizer <- function(x) {
    unlist(lapply(ngrams(words(x), 4), paste, collapse = " "), use.names = FALSE)
}
quadgram_tdm <- TermDocumentMatrix(docs, control = list(tokenize = QuadgramTokenizer, wordLengths = c(1, Inf)))
phrases_quadgram <- data.frame(n = 4, phrase = rownames(quadgram_tdm), frequency = slam::row_sums(quadgram_tdm), stringsAsFactors = FALSE)
phrases_quadgram$relativeFrequency <- (phrases_quadgram$frequency / sum(phrases_quadgram$frequency)) * 100

## Next we investigate the frequencies
phrases <- rbind(phrases_unigram, phrases_bigram, phrases_trigram, phrases_quadgram)
phrases <- phrases[order(phrases$n, phrases$relativeFrequency, decreasing = TRUE),]
percentagesDataframe <- data.frame(n = integer(), frequency = integer(), percentageDrop = numeric())

## We create four plots for the top 30 most frequent ngrams per n.
for (i in 1:4) {
    nphrases <- phrases[phrases$n == i,]
    ## We also determine the percentage of ngram instances still covered when dropping low frequencies
    frequencyCounts <- table(nphrases$frequency)
    frequencyValues <- as.integer(names(frequencyCounts))
    instancesDropped <- cumsum(as.integer(frequencyCounts) * frequencyValues)
    percentagesWhenDroppingFrequencies <- round((sum(nphrases$frequency) - instancesDropped) / sum(nphrases$frequency) * 100, 2)
    df <- data.frame(n = i, frequency = frequencyValues, percentageDrop = percentagesWhenDroppingFrequencies)
    percentagesDataframe <- rbind(percentagesDataframe, df)

    nphrases <- nphrases[1:30,]
    nphrases$phrase <- factor(nphrases$phrase, nphrases$phrase)
    
    print(ggplot(nphrases, aes(phrase, relativeFrequency)) +
              geom_bar(stat = "identity", colour = "lightblue", fill = "darkblue") +
              theme(axis.text.x = element_text(angle = 90, hjust = 1),
                    plot.title = element_text(size = 10, face = "bold")) +
              xlab("ngram") + ylab("Relative frequency as a %") +
              ggtitle(paste("Relative frequency distribution of top 30 most frequent n =", i,
                            "ngrams (out of", nrow(phrases[phrases$n == i,]), "ngrams)")))
}

## Next we plot, per n, the percentage of ngram instances still covered as low-frequency ngrams are dropped
ggplot(data = percentagesDataframe, aes(x = frequency, y = percentageDrop, col = I("darkblue"))) +
    geom_point() + facet_grid(. ~ n) +
    ggtitle("Percentage of instances still covered when dropping low-frequency ngrams, per n value") +
    ylab("Percentage of instances covered") +
    xlab("Frequency (frequency == 1 already dropped)")

## What fraction of the n = 2, 3 and 4 ngrams ends in the most frequent unigram (which is "the")?
## We could use this information to simply suggest "the" as a default.
mostFrequentUnigram <- phrases_unigram[order(phrases_unigram$frequency, decreasing = TRUE),][1, "phrase"]

for (i in 2:4) {
    nphrases <- phrases[phrases$n == i,]
    numberOfngramsEnding <- nrow(nphrases[grepl(paste(mostFrequentUnigram, "$", sep = ""), nphrases$phrase),])
    print(paste("Percentage of n = ", i, " ngrams ending on the most frequent unigram ", mostFrequentUnigram, " = ", round((numberOfngramsEnding / nrow(nphrases))*100, 2), "%", sep = ""))
}
## [1] "Percentage of n = 2 ngrams ending on the most frequent unigram the = 1.39%"
## [1] "Percentage of n = 3 ngrams ending on the most frequent unigram the = 3.85%"
## [1] "Percentage of n = 4 ngrams ending on the most frequent unigram the = 4.69%"

Modelling ideas

The goal remains a prediction model that suggests 3 words the user may want to type next, based on what the user has already typed. The frequencies with which ngrams are observed in the training data can be used as the basis for this model. As noted in the management summary, the solution should balance memory usage against the speed with which suggestions are offered, so that the model can run on a mobile phone; in particular, the algorithm should respond with 3 suggestions within a fraction of a second after the user enters a word. Computationally intensive work at prediction time is therefore prohibitive in a production setting. This leads to the following modelling ideas, prototyped below:

- Precompute frequency tables of bigrams, trigrams and quadgrams, so that prediction reduces to a fast lookup in precomputed tables.
- Trim low-frequency ngrams from these tables to limit memory usage; the frequency-drop analysis above shows how much coverage is retained for a given trimming threshold.
- Given the last 1 to 3 words the user has typed, look up the matching (n+1)-grams and suggest the 3 with the highest relative frequency.

## First we build trimmed lookup tables per n and define a function that returns the top 3
## suggestions (if there are any) for the words typed so far.
## Assumption: trimmedPhrases is the phrases data frame with low-frequency ngrams dropped; as a
## sketch we keep only ngrams that occur more than once (the exact trimming threshold is a tuning choice).
trimmedPhrases <- phrases[phrases$frequency > 1,]
trimmedPhrases2 <- trimmedPhrases[trimmedPhrases$n == 2,]
trimmedPhrases3 <- trimmedPhrases[trimmedPhrases$n == 3,]
trimmedPhrases4 <- trimmedPhrases[trimmedPhrases$n == 4,]
trimmedPhrasesList <- list(trimmedPhrases2, trimmedPhrases3, trimmedPhrases4)

suggestNextWord <- function(aPhrase) {
    ## The number of words typed determines which ngram table to use:
    ## n typed words are completed by (n+1)-grams, stored at position n of the list
    n <- length(unlist(strsplit(aPhrase, " ")))
    rightLengthPhrases <- trimmedPhrasesList[[n]]

    ## Find all ngrams that start with the typed phrase and return the 3 most frequent ones
    matchString <- paste("^", aPhrase, " ", sep = "")
    matches <- rightLengthPhrases[grep(matchString, rightLengthPhrases$phrase), c("phrase", "relativeFrequency")]
    return(head(matches[order(matches$relativeFrequency, decreasing = TRUE),], 3))
}


## We calculate the memory size that is occupied by the data structure that holds the ngrams.
print(object.size(trimmedPhrasesList))
## 2348816 bytes
## Next we give it a test
kable(suggestNextWord("a"), row.names = FALSE, align = "c", digits = 4)
phrase       relativeFrequency
a little            0.0934
a few               0.0584
a lot               0.0467
kable(suggestNextWord("a lot"), row.names = FALSE, align = "c", digits = 4)
phrase       relativeFrequency
a lot of            0.0351
a lot for           0.0117
kable(suggestNextWord("going to be"), row.names = FALSE, align = "c", digits = 4)
phrase                 relativeFrequency
going to be a                 0.0234
going to be shoved            0.0117
going to be the               0.0117
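
A further modelling idea worth exploring, sketched here as a possible extension rather than something already implemented, is to back off to a shorter phrase when the words typed so far yield no suggestions:

## Possible backoff wrapper (assumed extension, not part of the prototype above): if no trimmed
## ngram matches the full phrase, retry with progressively shorter endings of the typed text
suggestWithBackoff <- function(aPhrase) {
    words.typed <- unlist(strsplit(aPhrase, " "))
    ## Use at most the last 3 typed words, since the longest stored ngrams are quadgrams
    for (len in min(3, length(words.typed)):1) {
        suffix <- paste(tail(words.typed, len), collapse = " ")
        suggestions <- suggestNextWord(suffix)
        if (nrow(suggestions) > 0) {
            return(suggestions)
        }
    }
    ## If nothing matches at all we could fall back to the overall most frequent unigrams
    return(data.frame(phrase = character(), relativeFrequency = numeric()))
}

## For example, "definitely going to be" backs off to the last three words "going to be"
## if no quadgram starting with all four typed words survived the trimming
suggestWithBackoff("definitely going to be")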