The goal we are working towards is the development of a prediction model that suggests 3 words the user may want to type next, based on what the user has already typed. The solution should balance the memory used by the model against the speed with which the suggestions are offered; it should be possible, for instance, to use the prediction model on a mobile phone. This document details the exploratory analysis of the data that will be used for training such a model, as well as first ideas on the modelling itself.
We read in the American English blogs, news and Twitter data sets and create smaller training and testing data sets from them. These smaller data sets are much easier to handle and still form a representative sample from which inferences about the underlying data can be drawn. From each data set we select at random 1/100 of the lines for training purposes and 1/1000 of the lines for testing purposes, making sure that the two selections do not overlap. For now we ignore the data sets provided in other languages.
| Source file | Number of lines | Number of words | Average number of words per line |
|---|---|---|---|
| en_US.blogs.txt | 899288 | 37334131 | 41.52 |
| en_US.news.txt | 1010242 | 34372530 | 34.02 |
| en_US.twitter.txt | 2360148 | 30373543 | 12.87 |
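As an illustration, the sampling could be done along the following lines. This is a minimal sketch: the seed, the rounding and the path of the testing directory are assumptions rather than the exact code used.

set.seed(1234) ## assumed seed, for reproducibility
sampleDataSet <- function(inFile, trainDir, testDir) {
allLines <- readLines(inFile, encoding = "UTF-8", skipNul = TRUE)
n <- length(allLines)
## 1/100 of the lines for training, 1/1000 for testing, with no overlap between the two
trainIdx <- sample(seq_len(n), size = round(n / 100))
testIdx <- sample(setdiff(seq_len(n), trainIdx), size = round(n / 1000))
writeLines(allLines[trainIdx], file.path(trainDir, basename(inFile)))
writeLines(allLines[testIdx], file.path(testDir, basename(inFile)))
}
for (f in c("en_US.blogs.txt", "en_US.news.txt", "en_US.twitter.txt")) {
sampleDataSet(file.path("./final/en_US", f), "./final/en_US/training", "./final/en_US/testing")
}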
We read in the training sets and start cleaning them up for further processing. Cleaning them up consists of:

- removing words that contain non-ASCII characters (which also drops foreign-language words),
- removing URLs,
- removing punctuation and numbers,
- transforming all text to lower case, and
- stripping superfluous whitespace.

We leave in the stop words (like ‘the’, ‘a’ and ‘to’) as we would also like to be able to predict those.
library(tm)      ## also attaches NLP, which provides the words() and ngrams() functions used below
library(ggplot2)
library(knitr)
docs <- VCorpus(DirSource("./final/en_US/training"))
## First we get rid of any words containing non-ASCII letters (which also drops foreign language
## words that contain non-ASCII letters)
toASCII <- content_transformer(function(x) {
characters.unlist <- unlist(strsplit(x, split=" "))
characters.non.ASCII <- iconv(characters.unlist, "UTF-8", "ASCII", sub="<nonASCII>")
wordsWithNonASCII <- grep("<nonASCII>", characters.non.ASCII)
## Only drop words when some were actually flagged; indexing with -integer(0) would drop everything
if (length(wordsWithNonASCII) > 0) characters.non.ASCII <- characters.non.ASCII[-wordsWithNonASCII]
return(paste(characters.non.ASCII, collapse = " "))
})
docs <- tm_map(docs, toASCII)
## Next we remove all URLs in the text
removeURL <- content_transformer(function(x) {
gsub("(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)", "", x)
})
docs <- tm_map(docs, removeURL)
## Next we eliminate punctuation, colons, hyphens, quotes etc.
docs <- tm_map(docs, removePunctuation, preserve_intra_word_dashes = FALSE)
## Helper transformers for replacing a pattern with a space or removing it altogether
## (defined here but not used in the steps shown below)
toSpace <- content_transformer(function(x, pattern) {return (gsub(pattern, " ", x))})
toNon <- content_transformer(function(x, pattern) {return (gsub(pattern, "", x))})
## We transform the text to lower case and remove numbers
docs <- tm_map(docs, content_transformer(tolower))
docs <- tm_map(docs, removeNumbers)
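As a quick illustration of what the URL pattern above removes, applied to a made-up string (output shown as a comment):

gsub("(http[^ ]*)|(ftp[^ ]*)|(www\\.[^ ]*)", "", "see http://t.co/abc and www.example.com for details")
## [1] "see  and  for details"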
Now that we’ve cleaned up the documents, we remove any words that may be offensive so that we never suggest these to the end user. Here too we focus for now on the American English data set.
## A list of English bad words can be found on the Carnegie Mellon site.
badWordsList <- readLines("https://www.cs.cmu.edu/~biglou/resources/bad-words.txt")
docs <- tm_map(docs, removeWords, badWordsList)
docs <- tm_map(docs, stripWhitespace)
Now the training data set is ready for exploratory analysis. To do this we transform the data sets into so-called ngrams. These represent either single words in the data sets (i.e. n = 1 ngrams) or combinations of 2, 3 or 4 consecutive words (i.e. n = 2, n = 3 or n = 4 ngrams respectively). We then look at how often these ngrams occur in the texts and use these frequencies to analyze the structure of the texts. The exploratory analysis below looks at the relative frequencies of the most common ngrams per n, at how much coverage is lost when low-frequency ngrams are dropped, and at how often ngrams end in the most frequent unigram.
## We create unigram, bigram, trigram and quadgram term-document matrices and start investigating the ngram frequencies
unigram_tdm <- TermDocumentMatrix(docs, control = list(wordLengths = c(1, Inf)))
phrases_unigram <- data.frame(n = 1, phrase = rownames(unigram_tdm), frequency = slam::row_sums(unigram_tdm), stringsAsFactors = FALSE)
phrases_unigram$relativeFrequency <- (phrases_unigram$frequency / sum(phrases_unigram$frequency)) * 100
BigramTokenizer <- function(x) {
unlist(lapply(ngrams(words(x), 2), paste, collapse = " "), use.names = FALSE)
}
bigram_tdm <- TermDocumentMatrix(docs, control = list(tokenize = BigramTokenizer, wordLengths = c(1, Inf)))
phrases_bigram <- data.frame(n = 2, phrase = rownames(bigram_tdm), frequency = slam::row_sums(bigram_tdm), stringsAsFactors = FALSE)
phrases_bigram$relativeFrequency <- (phrases_bigram$frequency / sum(phrases_bigram$frequency)) * 100
TrigramTokenizer <- function(x) {
unlist(lapply(ngrams(words(x), 3), paste, collapse = " "), use.names = FALSE)
}
trigram_tdm <- TermDocumentMatrix(docs, control = list(tokenize = TrigramTokenizer, wordLengths = c(1, Inf)))
phrases_trigram <- data.frame(n = 3, phrase = rownames(trigram_tdm), frequency = slam::row_sums(trigram_tdm), stringsAsFactors = FALSE)
phrases_trigram$relativeFrequency <- (phrases_trigram$frequency / sum(phrases_trigram$frequency)) * 100
QuadgramTokenizer <- function(x) {
unlist(lapply(ngrams(words(x), 4), paste, collapse = " "), use.names = FALSE)
}
quadgram_tdm <- TermDocumentMatrix(docs, control = list(tokenize = QuadgramTokenizer, wordLengths = c(1, Inf)))
phrases_quadgram <- data.frame(n = 4, phrase = rownames(quadgram_tdm), frequency = slam::row_sums(quadgram_tdm), stringsAsFactors = FALSE)
phrases_quadgram$relativeFrequency <- (phrases_quadgram$frequency / sum(phrases_quadgram$frequency)) * 100
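## To illustrate what the tokenizers above produce: ngrams() simply glues consecutive words together.
## On a toy word vector (output shown as a comment):
unlist(lapply(ngrams(c("thanks", "for", "the", "follow"), 3), paste, collapse = " "), use.names = FALSE)
## [1] "thanks for the" "for the follow"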
## Next we investigate the frequencies
phrases <- rbind(phrases_unigram, phrases_bigram, phrases_trigram, phrases_quadgram)
phrases <- phrases[order(phrases$n, phrases$relativeFrequency, decreasing = TRUE),]
percentagesDataframe <- data.frame(n = integer(), frequency = integer(), percentageDrop = numeric())
## We create four plots for the top 30 most frequent ngrams per n.
for (i in 1:4) {
nphrases <- phrases[phrases$n == i,]
## We also determine the percentage of instances still covered when all ngrams with a frequency
## at or below a given value are dropped
frequencyCounts <- table(nphrases$frequency)
frequencyValues <- as.integer(names(frequencyCounts))
instancesDropped <- cumsum(frequencyCounts * frequencyValues)
percentagesWhenDroppingFrequencies <- round(((sum(nphrases$frequency) - instancesDropped) / sum(nphrases$frequency)) * 100, 2)
df <- data.frame(n = i, frequency = frequencyValues, percentageDrop = percentagesWhenDroppingFrequencies)
percentagesDataframe <- rbind(percentagesDataframe, df)
nphrases <- nphrases[1:30,]
nphrases$phrase <- factor(nphrases$phrase, nphrases$phrase)
print(ggplot(nphrases, aes(phrase, relativeFrequency)) +
geom_bar(stat = "identity", colour = "lightblue", fill = "darkblue") +
theme(axis.text.x = element_text(angle = 90, hjust = 1), plot.title = element_text(size = 10, face = "bold")) +
xlab("ngram") + ylab("Relative frequency as a %") +
ggtitle(paste("Relative frequency distribution of top 30 most frequent n =", i, "ngrams (out of", nrow(phrases[phrases$n == i,]), "ngrams)")))
}
## Next we plot, per n, the percentage of instances still covered after dropping all ngrams
## with a frequency at or below a given value
ggplot(data = percentagesDataframe, aes(x = frequency, y = percentageDrop, col = I("darkblue"))) +
geom_point() + facet_grid(. ~ n) +
ggtitle("Percentage of instances still covered after dropping ngrams with frequency <= x, per n value") +
ylab("Percentage of instances covered") +
xlab("Frequency x (at x = 1 the ngrams occurring only once have been dropped)")
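## As a small worked example of the coverage calculation above, with made-up frequencies
## (four ngrams seen once, two seen twice, one seen five times):
toyFreq <- c(1, 1, 1, 1, 2, 2, 5)
toyCounts <- table(toyFreq)
toyValues <- as.integer(names(toyCounts))
round((sum(toyFreq) - cumsum(toyCounts * toyValues)) / sum(toyFreq) * 100, 2)
## Dropping the ngrams seen once keeps 69.23% of instances; dropping those seen up to twice keeps 38.46%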
## What percentage of ngrams have the most frequent unigram (which is "the") as their final word?
## We could use this information to simply suggest "the" as a default
mostFrequentUnigram <- phrases_unigram[order(phrases_unigram$frequency, decreasing = TRUE),][1, "phrase"]
for (i in 2:4) {
nphrases <- phrases[phrases$n == i,]
numberOfngramsEnding <- nrow(nphrases[grepl(paste(mostFrequentUnigram, "$", sep = ""), nphrases$phrase),])
print(paste("Percentage of n = ", i, " ngrams ending on the most frequent unigram ", mostFrequentUnigram, " = ", round((numberOfngramsEnding / nrow(nphrases))*100, 2), "%", sep = ""))
}
## [1] "Percentage of n = 2 ngrams ending on the most frequent unigram the = 1.39%"
## [1] "Percentage of n = 3 ngrams ending on the most frequent unigram the = 3.85%"
## [1] "Percentage of n = 4 ngrams ending on the most frequent unigram the = 4.69%"
As stated at the outset, the goal we are working towards is the development of a prediction model that suggests 3 words the user may want to type next, based on what the user has already typed. The frequencies with which ngrams are observed in the training data can be used as the basis for this prediction model. Furthermore, the solution should balance the memory used by the algorithm against the speed with which the suggestions are offered; it should be possible, for instance, to use the prediction model on a mobile phone. When the user enters a word, the prediction algorithm should respond with 3 suggestions in well under a second, so computationally intensive work at prediction time is prohibitive in a production situation. This leads to the following modelling ideas:

- Use the observed ngram frequencies directly: given the last 1 to 3 words typed, look up the matching 2-, 3- or 4-grams and suggest the 3 most frequent continuations.
- Trim the ngram tables by dropping the lowest-frequency ngrams, so that the memory footprint stays small while relatively little coverage is lost (see the coverage analysis above).
- Keep the lookup itself a simple prefix match on precomputed tables, so that suggestions can be returned almost instantly.

A first sketch of such a lookup is given below.
## First we define a function that returns the top 3 suggestions (if there are any) for a given ngram.
## trimmedPhrases holds the ngram tables after dropping the lowest-frequency ngrams; the exact cut-off is
## an assumption here (we drop the ngrams occurring only once, in line with the coverage analysis above).
trimmedPhrases <- phrases[phrases$frequency > 1,]
trimmedPhrases2 <- trimmedPhrases[trimmedPhrases$n == 2,]
trimmedPhrases3 <- trimmedPhrases[trimmedPhrases$n == 3,]
trimmedPhrases4 <- trimmedPhrases[trimmedPhrases$n == 4,]
trimmedPhrasesList <- list(trimmedPhrases2, trimmedPhrases3, trimmedPhrases4)
suggestNextWord <- function(aPhrase) {
n <- length(unlist(strsplit(aPhrase, " ")))
rightLengthPhrases <- trimmedPhrasesList[[n]]
matchString <- paste("^", aPhrase, " ", sep = "")
matches <- rightLengthPhrases[grep(matchString, rightLengthPhrases$phrase), c("phrase", "relativeFrequency")]
## Return at most the 3 most frequent matching ngrams
return(head(matches[order(matches$relativeFrequency, decreasing = TRUE),], 3))
}
## We calculate the memory size that is occupied by the data structure that holds the ngrams.
print(object.size(trimmedPhrasesList))
## 2348816 bytes
## Next we give it a test
kable(suggestNextWord("a"), row.names = FALSE, align = "c", digits = 4)
| phrase | relativeFrequency |
|---|---|
| a little | 0.0934 |
| a few | 0.0584 |
| a lot | 0.0467 |
kable(suggestNextWord("a lot"), row.names = FALSE, align = "c", digits = 4)
| phrase | relativeFrequency |
|---|---|
| a lot of | 0.0351 |
| a lot for | 0.0117 |
kable(suggestNextWord("going to be"), row.names = FALSE, align = "c", digits = 4)
| phrase | relativeFrequency |
|---|---|
| going to be a | 0.0234 |
| going to be shoved | 0.0117 |
| going to be the | 0.0117 |
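Since the end goal is to suggest 3 words rather than full phrases, and since the lookup above only handles phrases of at most 3 words, a thin wrapper could take care of both. Below is a minimal sketch; the name suggestWords and the truncation to the last 3 words typed are our own additions rather than part of the model above.

suggestWords <- function(aPhrase) {
## Only the last 3 words typed matter, since we only store ngrams up to n = 4
lastWords <- tail(unlist(strsplit(aPhrase, " ")), 3)
suggestions <- suggestNextWord(paste(lastWords, collapse = " "))
## Keep just the predicted word, i.e. the final word of each suggested ngram
vapply(strsplit(suggestions$phrase, " "), function(w) tail(w, 1), character(1))
}
suggestWords("going to be")
## e.g. "a" "shoved" "the" (cf. the quadgram suggestions above)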