knitr::opts_chunk$set(cache=TRUE)

library(plyr)      # function mapvalues()
library(tidyverse) # data frame manipulation
library(stringr)   # string manipulation
library(stringi)   # string manipulation
library(tau)       # generate N-grams
library(knitr)     # function kable()
library(cowplot)   # plotting several gg

Understanding the Problem

Mobile devices are no longer a luxury; they have become a basic need. But lacking a full-size keyboard, text entry on touch-screen devices in particular can be rather annoying. Natural Language Processing (NLP), the field concerned with computer understanding and manipulation of human language, offers possibilities to build models that tackle this problem.

But how can we build models that predict the word people may have in mind while writing? The basic idea is to learn from a so-called corpus, which essentially is a large collection of words and groups of words from the language. Based on the information provided by the corpus, one can then build models that assign a probability to each possible next word. One of these language models (LMs) is the N-gram model, which is built around the idea that the probability of a future word depends only on a limited history of preceding words (the Markov assumption).
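
As a toy illustration (not part of the actual pipeline; the three sentences are made up), the maximum-likelihood estimate of the bigram probability of "like" following "i" is just a ratio of counts:

# made-up mini corpus, for illustration only
toy     <- c("i like cats", "i like dogs", "i am here")
words   <- unlist(strsplit(toy, " "))
bigrams <- unlist(lapply(strsplit(toy, " "),
                         function(w) paste(head(w, -1), tail(w, -1))))
# MLE estimate of P("like" | "i") = count("i like") / count("i")
sum(bigrams == "i like") / sum(words == "i")
## [1] 0.6666667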

While this may sound complicated at first sight, the actual process is quite straightforward. Once a "nice and clean" corpus is available, one uses N-gram models to calculate the frequencies of words and groups of words within the corpus. The resulting N-grams, along with the calculated probabilities, are then stored and form the basis for the word prediction algorithm. In essence, the prediction mechanism simply goes through the N-gram tables, identifies the N-grams that match the given input, and suggests the ones with the highest probability.

Note that this paper is not a summary but rather an in-depth documentation of all steps and considerations that were required to train the final algorithm and deploy it in a web application. Whenever repetitive steps were required, example code is shown only for the first element. Regarding model parameters, my aim is to build an algorithm that runs in less than 0.5 seconds and requires a maximum of 250 MB of background data.

Getting and Sampling the Text Corpus

Coursera provided a fairly large dataset which consists of several hundred thousand news articles, blog entries, and Twitter messages. So first things first, let's read in the data. I use readLines() because it is rather fast (reading the Twitter messages, for example, takes 8 seconds) and it matches the structure of the txt files. I set the encoding to UTF-8 because otherwise some special characters are not displayed properly.

# read data
twitter   <- readLines("en_US.twitter.txt", encoding = "UTF-8")
news      <- readLines("en_US.news.txt", encoding = "UTF-8")
blog      <- readLines("en_US.blogs.txt", encoding = "UTF-8")
profanity <- readLines("swearWords.txt", encoding = "UTF-8")

# example to get a grasp of the data
twitter[1]
## [1] "How are you? Btw thanks for the RT. You gonna be in DC anytime soon? Love to see you. Been way, way too long."
# mean characters across data sources
print(c(mean(nchar(twitter)), mean(nchar(news)), mean(nchar(blog))))
## [1]  68.68045 202.42830 229.98695
# number of twitter, news, and blog text blocks
print(c(length(twitter), length(news), length(blog)))
## [1] 2360148   77259  899288

Having data from these three different sources is quite nice because the words and word patterns used may differ considerably across them. However, because the data is fairly large, as seen above, and because I need to keep the number of N-grams in check later on, sampling seems to be a valid approach. After sampling I merge the vectors and further break everything down into sentences. Note that I use more Twitter entries because they contain on average far fewer characters, and thus fewer sentences, than the other text sources.

# sample data
set.seed(2017)
twitter.sample <- sample(twitter, 200000, replace = FALSE)
news.sample    <- sample(news, 50000, replace = FALSE)
blog.sample    <- sample(blog, 100000, replace = FALSE)
corpus         <- c(twitter.sample, news.sample, blog.sample)

# convert to ASCII to deal with smileys, emoji and other special characters
corpus <- iconv(corpus, "UTF-8", "ASCII", "?")

# split into single sentences (as good as possible)
corpus <- unlist(strsplit(corpus , "[\\.\\!\\?\\:]+"))

# number of sentences
length(corpus)
## [1] 904558

Running the code above leaves me with a manageable number of sentences for comparing different approaches. I will increase the size of the corpus later on once the code is written; getting to around one million sentences eventually would be good.

Normalizing the Text Corpus

Let's remind ourselves that ideally, for N-gram tokenization later on, one needs a nice and clean corpus which consists of nothing but "valid" words. Hence, the first and arguably one of the most important steps before building any language model is normalizing the text corpus. I wrote the function norm.text() for that purpose; it performs a number of cleaning actions, each documented by a comment in the code below.

Note that even after running the code below, the corpus is far from perfect. For example, there are probably plenty of typos as well as what one might call "chat words" (e.g. lol), which strictly speaking are not real words but are nevertheless widely used on the Internet. Stop words and words sharing the same word stem also remain. While the basic normalization performed by norm.text() is more or less mandatory, whether to address these remaining issues is arguably best decided via accuracy measures later on.

norm.text <- function(string, swearwords){
                # lower case
                string <- tolower(string)
                # e-mail addresses
                string <- str_replace_all(string, "\\S+@\\S+", "") 
                # URLs
                string <- str_replace_all(string, "http[[:alnum:]]*", "")
                # hashtags
                string <- str_replace_all(string, "#[[:alnum:]]*", "")
                string <- str_replace_all(string, "# [[:alnum:]]*", "")
                # @-mentions
                string <- str_replace_all(string, "@[[:alnum:]]*", "")
                string <- str_replace_all(string, "@ [[:alnum:]]*", "")
                # twitter language ("rt" = retweet) and "pm"; word boundaries
                # prevent mangling words such as "part" or "important"
                string <- str_replace_all(string, "\\brt\\b", "")
                string <- str_replace_all(string, "\\bpm\\b", "")
                # expand apostrophe contractions
                string <- str_replace_all(string, "'ll", " will")
                string <- str_replace_all(string, "'d", " would")
                string <- str_replace_all(string, "can't", "cannot")
                string <- str_replace_all(string, "n't", " not")
                string <- str_replace_all(string, "'re", " are")
                string <- str_replace_all(string, "'m", " am")
                string <- str_replace_all(string, "n'", " and")
                # 's-genitive
                string <- str_replace_all(string, "'s", " ")
                string <- str_replace_all(string, "s'", " ")
                # everything that is not a letter or a digit becomes a space
                string <- str_replace_all(string, "[^[:alnum:]]", " ")
                # digits
                string <- str_replace_all(string, "[[:digit:]]", "")
                # collapse runs of whitespace into a single space
                string <- str_replace_all(string, "\\s+", " ")
                # trim leading and trailing whitespace
                string <- str_trim(string, side = "both")
                # deal with "don t" and "dont"
                string <- str_replace_all(string, "\\bdon t\\b", "do not")
                string <- str_replace_all(string, "\\bdont\\b", "do not")
                # deal with "u s" for U.S.
                string <- str_replace_all(string, "\\bu s\\b", "usa")
                # profanity (the loop takes a while); word boundaries avoid
                # removing matches hidden inside longer, harmless words
                for (i in swearwords){ 
                        string <- str_replace_all(string, paste0("\\b", i, "\\b"), "")
                }
return(string)
}
# normalize text
corpus.clean <- norm.text(corpus, profanity)

Splitting the Data

Next, I split the data into a training and a test set, so that I can check the accuracy of my algorithm later on.

set.seed(2007)
idx <- sample.int(length(corpus.clean),size=length(corpus.clean)*0.1, replace=FALSE) 
testing  <- corpus.clean[idx]
training  <- corpus.clean[-idx]

Tokenization

In lexical analysis, tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens. The list of tokens becomes the input for further processing. Eventually, I want to break the text down into words in order to emit N-grams (an N-gram is a contiguous sequence of n items from a given sequence of text or speech). R provides several packages for creating N-grams, including "tau", "ngram", "RWeka" and "textcat". I chose tau because it is rather fast.

# create N-grams
unigram_tau  <- textcnt(training, n = 1L, method = "string", split = " ")
bigram_tau   <- textcnt(training, n = 2L, method = "string", split = " ")
trigram_tau  <- textcnt(training, n = 3L, method = "string", split = " ")
fourgram_tau <- textcnt(training, n = 4L, method = "string", split = " ")
fivegram_tau <- textcnt(training, n = 5L, method = "string", split = " ")

# transform trie encodings to dataframe (illustrated only for unigram_tau)
unigram.df <- data.frame(counts = unclass(unigram_tau), size = nchar(names(unigram_tau)))
unigram.df$n.gram <- rownames(unigram.df)
rownames(unigram.df) <- NULL

Calculating the N-gram frequencies took quite a while, even when using just the rather small sample. However, since they must be created only once, speeding up this step is less important. What is important, though, is storing them efficiently. Why is that? First of all, storing them in R requires RAM, which is a limited resource. Besides, the time the algorithm takes to make a prediction may be affected by the way the N-grams are stored. So let's first take a look at how large the generated N-grams are when stored in a data frame.

# used storage of the N-grams
print(object.size(unigram.df) + object.size(bigram.df) + object.size(trigram.df) + object.size(fourgram.df) + object.size(fivegram.df))
## 1850985728 bytes

Unfortunately, it seems that the classical data frame is not really suitable, since the N-grams of my sample alone already require a lot of RAM. Thus, I may have to switch to some other storage type later on; using SQLite might be an option. Yet, let's turn to some data exploration first to see if we can get some additional valuable insights. Maybe I can handle the size issue without using an external database.
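
Just for reference, here is a minimal sketch of what the SQLite route could look like with the RSQLite package (I did not pursue this in the end; the file name ngrams.sqlite is a placeholder and bigram.df is the data frame created above):

library(DBI)
library(RSQLite)
# 'ngrams.sqlite' is just a placeholder file name
con <- dbConnect(RSQLite::SQLite(), "ngrams.sqlite")
# store the bigram counts in a database table
dbWriteTable(con, "bigrams", bigram.df, overwrite = TRUE)
# look up the three most frequent bigrams starting with "of"
dbGetQuery(con, "SELECT \"n.gram\", counts FROM bigrams
                 WHERE \"n.gram\" LIKE 'of %'
                 ORDER BY counts DESC LIMIT 3")
dbDisconnect(con)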

Exploring the Data

Next, I am going to build some figures to understand the variation in the frequencies of words and word pairs in the data (I use a log scale because the data is highly skewed). Note that I restricted the plots to samples of 70,000 of the respective N-grams. The code below is illustrative for the two plot types:

set.seed(2017)
p1 <- ggplot(sample_n(unigram.df, 70000, replace = FALSE), 
             aes(x = log(counts))) + 
                geom_histogram(color = "steelblue", 
                               fill = "white", 
                               alpha=.7, 
                               binwidth = 0.5) +
                theme_gray() + 
                xlab("") +
                ylab("")
   
set.seed(2017)
p4 <- top_n(sample_n(unigram.df, 70000, replace = FALSE), n = 10, counts) %>%
                ggplot(., aes(x = n.gram, y = counts)) +
                geom_bar(stat='identity', 
                         color = "steelblue", 
                         fill = "white") +
                theme_gray() +
                coord_flip() +
                xlab("") +
                ylab("")

Well, obviously there is a tremendous number of N-grams that show up only a few times (often even just once). Many of them are useless strings like "aaarrrgggghhh". This finding is rather interesting since it suggests that deleting those cases to lower the number of N-grams might be a valid approach to save some memory and run time. The variable size, which counts the characters of each N-gram, also supports that approach, because there seem to be far too many one- and two-letter words as well as words with an oddly large number of characters. Regarding the most used words and word pairs, the plots show that their frequency drops off quickly and that, unsurprisingly, stop words are most prevalent.

Next, let’s investigate how many unique n-grams one needs in a frequency sorted dictionary to cover a certain amount of all n-gram instances in the corpus.

# illustrative variable creation for unigrams
unigram.df.p <- sample_n(unigram.df, 70000, replace = FALSE) %>%
                arrange(desc(counts)) %>%
                mutate(counts.cs.p = cumsum(counts)/sum(counts)) %>%
                mutate(id = c(1:70000))

x <- unigram.df.p %>%
filter(counts.cs.p > 0.7499)  %>%
head(n = 3L)
kable(x)
 counts   size   n.gram    counts.cs.p    id
    673      7   winning     0.7500313   633
    673      6   period      0.7501961   634
    673      6   except      0.7503609   635

Above you can see that covering 75 percent of the 1-gram instances in the sample corpus requires only a few hundred of the most frequent words (the id column shows around 630). Let's plot this for all N-grams to get some more information. The code below is illustrative for one of the three plots.

p7 <- ggplot(unigram.df.p, aes(x=id, y=counts.cs.p)) +
                geom_line(color = "steelblue") +
                theme_gray() + 
                xlab("") +
                ylab("")

Clearly, at least for the 1-grams, the cumulative coverage rises very steeply at first and then flattens out; this pattern becomes less extreme as N increases.

Building the Algorithm: Stupid Backoff - A First Approach Using dplyr

Now that I have some workable data and an idea of its descriptive characteristics, the next step is to create an algorithm which assigns probabilities to the various N-grams and returns the top predictions based on an input string. The basic idea is to start with a simple baseline model which can then be used to tweak parameters. What follows is an implementation of a "Stupid Backoff" model which starts with 5-grams. Here, backoff basically means that one falls back to an (N-1)-gram model whenever the current N-gram level yields no match (i.e. a probability of zero). Each time I back off one level, the probabilities are multiplied by the backoff factor alpha (0.4). The aim here is not to code the final product. Rather, I want a functioning base model which is then used to test further specifications regarding things such as efficient coding, data storage, sampling size, or corpus composition.

In order to illustrate the algorithm let’s walk through an example where we want to complete the sentence “I must know what are you going to”.

  1. Take the last four words of the input sentence (“are you going to”).
  2. Look for any 5-grams that start with the input sentence.
  3. If applicable, store the last word of each encountered 5-gram and calculate its probability.
  4. The probability is calculated as follows: the number of times the 5-gram ended with that word, divided by how often its first four words show up in the 4-gram table.
  5. For example, "are you going to be" may show up x times in the 5-gram table and "are you going to" may be listed y times in the 4-gram table. The probability is therefore x/y.
  6. Store the 3 most probable words.
  7. Step by step, back off down to the 2-gram level. At each level, repeat the procedure with the probability multiplied by 0.4 for every level backed off, and store the 3 most probable words (a small numeric sketch with made-up counts follows after this list).
  8. Drop any words that are suggested again during the backing off process.
  9. Print the three most probable words along with their probability.
  10. Lastly, if there are no matches just print the three most common words in the corpus.
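
To make the scoring in steps 4, 5 and 7 concrete, here is a small numeric sketch; the counts are made up purely for illustration:

# hypothetical counts: "are you going to be" occurs 12 times in the 5-gram
# table, its prefix "are you going to" occurs 40 times in the 4-gram table
12 / 40             # score for "be" at the 5-gram level
## [1] 0.3
# a candidate found one level lower is discounted by alpha = 0.4, e.g.
# "you going to see" (5 times) vs. its prefix "you going to" (50 times)
0.4 * 5 / 50        # score for "see" after backing off once
## [1] 0.04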

Before writing the function, I perform two actions to reduce the size of the N-gram tables. First, I delete the now-useless variables that were created during the analysis so far. Second, I drop all N-grams which show up only once, which reduces the size of the N-gram tables by about 50 percent.

# based on the findings in the descriptive part, I drop any N-gram that shows up only once
unigram.df.short  <- filter(unigram.df, counts > 1) %>%
                     select(n.gram, counts)
bigram.df.short   <- filter(bigram.df, counts > 1) %>%
                     select(n.gram, counts)
trigram.df.short  <- filter(trigram.df, counts > 1) %>%
                     select(n.gram, counts)
fourgram.df.short <- filter(fourgram.df, counts > 1) %>%
                     select(n.gram, counts)
fivegram.df.short <- filter(fivegram.df, counts > 1)%>%
                     select(n.gram, counts)

# used storage of the N-grams
print(object.size(unigram.df.short) + object.size(bigram.df.short) + object.size(trigram.df.short) + object.size(fourgram.df.short) + object.size(fivegram.df.short))
## 112371168 bytes

Now that everything is prepared, let's create the algorithm. Basically, I create four of them: one for input sentences of at least 4 words, and three more for input sentences of length 3, 2, and 1, respectively. Code is shown only for the first one because the others remain essentially the same except that the starting point is a different N-gram table.

# useful function to get the last n words, borrowed from a fellow student
getLastWords <- function(string, words) {
    pattern <- paste("[a-z']+( [a-z']+){", words - 1, "}$", sep="")
    return(substring(string, str_locate(string, pattern)[,1]))
}
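
# quick sanity check on the walkthrough sentence (purely illustrative example of my own)
getLastWords("I must know what are you going to", 4)
## [1] "are you going to"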
############################
######## ALGORITHM #########
############################

# pred.word4 is for cases where the input sentence is at least 4 words
pred.word4 <- function(input.sentence){
        
# take the last four words from input sentence as input for 5-gram table
isr5 <- getLastWords(input.sentence, 4)
# take the last three words from input sentence as input for 4-gram table
isr4 <- getLastWords(isr5, 3)
# take the last two words from input sentence as input for 3-gram table
isr3 <- getLastWords(isr4, 2)
# take the last word from input sentence as input for 2-gram table
isr2 <- getLastWords(isr3, 1)

# start at the 5-gram table
pred.five <- fivegram.df.short %>%
    # in the 5-gram table, keep only the 5-grams that start with the last four
    # words of the input sentence followed by a whitespace
    filter(grepl(paste0("^", isr5," "), n.gram)) %>%
    # create variable that shows last word of the filtered 5-grams
    mutate(predicted.word = getLastWords(n.gram, 1))  %>%
    # calculate the probability of the last word
    mutate(propability = counts/fourgram.df.short$counts[fourgram.df.short$n.gram == isr5]) %>%
    # arrange descending
    arrange(desc(propability)) %>%
    # only keep the predicted word and its probability
    select(predicted.word, propability) %>%
    # only keep top 3 
    top_n(3) 
        
# do exactly the same stuff for the 4-gram table plus backoff factor
pred.four <- fourgram.df.short %>%  
    filter(grepl(paste0("^", isr4," "), n.gram)) %>%
    mutate(predicted.word = getLastWords(n.gram, 1))  %>% 
    mutate(propability = (counts*0.4)/trigram.df.short$counts[trigram.df.short$n.gram == isr4]) %>%
    arrange(desc(propability)) %>%
    select(predicted.word, propability) %>%
    top_n(3) 
        
# do exactly the same stuff for the 3-gram table plus backoff factor
pred.tri <- trigram.df.short %>%  
    filter(grepl(paste0("^", isr3," "), n.gram)) %>%
    mutate(predicted.word = getLastWords(n.gram, 1))  %>% 
    mutate(propability = (counts*0.4*0.4)/bigram.df.short$counts[bigram.df.short$n.gram == isr3]) %>%
    arrange(desc(propability)) %>%
    select(predicted.word, propability) %>%
    top_n(3) 
           
# do exactly the same stuff for the 2-gram table plus backoff factor
pred.two <- bigram.df.short %>%  
    filter(grepl(paste0("^", isr2," "), n.gram)) %>%
    mutate(predicted.word = getLastWords(n.gram, 1))  %>% 
    mutate(propability = (counts*0.4*0.4*0.4)/unigram.df.short$counts[unigram.df.short$n.gram == isr2]) %>%
    arrange(desc(propability)) %>%
    select(predicted.word, propability) %>%
    top_n(3) 
        
# do exactly the same stuff for the 1-gram table plus backoff factor
pred.one <- unigram.df.short %>%  
    arrange(desc(counts)) %>% 
    mutate(predicted.word = n.gram)  %>% 
    mutate(propability = (counts*0.4*0.4*0.4*0.4)/sum(unigram.df.short$counts)) %>%
    arrange(desc(propability)) %>%
    select(predicted.word, propability) %>%
    top_n(3) 

# combine the results into a data frame and sort
prop.table <- rbind(pred.five,pred.four,pred.tri,pred.two, pred.one) %>%
    arrange(desc(propability))  

# drop words which are suggested again at a lower N-gram level
prop.table <- prop.table[!duplicated(prop.table$predicted.word),]

# show results
prop.table <- prop.table %>%
    top_n(3)

return(prop.table)
}

# test 
pred.word4("this thing works pretty")
##   predicted.word propability
## 1           much 0.007028553
## 2           good 0.005663006
## 3           sure 0.005341701

The final prediction function shown below simply chooses the appropriate sub-function based on the length of the input string. The reason for this approach is that I eventually want to predict a word whenever the user hits the space bar.

# final function that chooses the appropriate pred.word() depending on input length
# and normalizes the input sentence with norm.text()
word.pred <- function(input){
        input.clean <- norm.text(input, profanity)
        inlength <- length(unlist(strsplit(input.clean, " ")))
        # the sub-functions receive the normalized input, since the
        # N-gram tables were built from normalized text
        if (inlength >= 4) {
                p <- pred.word4(input.clean)
        }  else if (inlength == 3) {
                p <- pred.word3(input.clean)
        }  else if (inlength == 2) {
                p <- pred.word2(input.clean)
        }  else
                p <- pred.word1(input.clean)
return(p)
}

Now that I have a working model, there are quite a few things to consider for improving the results, some of which were already outlined above.

Testing the Model on some of the Quiz Data Provided by Coursera

word.pred("The guy in front of me just bought a pound of bacon, a bouquet, and a case of")
##   predicted.word propability
## 1            the  0.13877551
## 2           beer  0.02448980
## 3            too  0.01632653
word.pred("You're the reason why I smile everyday. Can you follow me please? It would mean the")
##   predicted.word propability
## 1          world 0.304000000
## 2            one 0.007017544
## 3     difference 0.005614035
word.pred("Hey sunshine, can you follow me and make me the")
##   predicted.word propability
## 1       happiest 0.250000000
## 2           most 0.005477889
## 3          other 0.004336662
word.pred("Ohhhhh #PointBreak is on tomorrow. Love that film and haven't seen it in quite some")
##   predicted.word propability
## 1           time 0.400000000
## 2             of 0.009030925
## 3         people 0.001746605

Testing the Model on the Test Data Set

Before I can address any of those points, I obviously need to measure the accuracy on some test data, so that I am eventually able to judge the algorithm not only by its required storage and its speed but also by its accuracy. Below you find some code where the algorithm predicts the last word of each sentence in a subset of the test data. Note that word prediction is not comparable to other machine learning fields where accuracy measures are much higher. In my case, getting something around 15% would be decent (someone told me that SwiftKey reaches about 30 percent).

# drop really short sentences/single words and sample 
testing.short <- testing[nchar(testing) > 25]
set.seed(2017)
testing.short <- sample(testing.short, 10000, replace = FALSE)

# delete last word to get input sentence
input <- lapply(testing.short, function(x) gsub("\\s*\\w*$", "", x))

# save last true word  
last.word.real <- lapply(testing.short, function(x) getLastWords(x, 1))

# apply the prediction function to the input; keep only the predicted words
last.word.pred <- lapply(input , function(x) word.pred(x)[,1])

# create data frame that includes pass and fail 
accuracy.df <- as.data.frame(cbind(last.word.real, last.word.pred)) %>%
    mutate(last.word.real = as.character(last.word.real)) %>%
    mutate(last.word.pred = as.character(last.word.pred)) %>%
    mutate(pass = ifelse(str_detect(last.word.pred,last.word.real), 1, 0))

head(accuracy.df)
##   last.word.real                                   last.word.pred pass
## 1   paddleboards                 c("lord", "best", "eah", "hill")    0
## 2     difference                   c("difference", "time", "way")    1
## 3            air                              c("and", "we", "i")    0
## 4        station c("of", "and", "for", "in", "is", "the", "with")    0
## 5        history                              c("the", "a", "my")    0
## 6         family                    c("hea", "brothers", "might")    0

Evaluation of the Model - Summary

The summary table below shows the accuracy (as a proportion of the 10,000 test sentences for which the true word was among the three suggestions), the execution time for a single input string (in seconds), the size of the required N-gram tables (in bytes), and the number of N-grams used.

 accuracy   size_byte   speed_sec   n_grams
    0.189   112371168       0.673   1559118
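
The figures above were obtained roughly along these lines (a sketch reusing the objects created earlier, not necessarily the exact code used):

# accuracy: share of test sentences where the true word was among the suggestions
mean(accuracy.df$pass)

# size of the backing N-gram tables
object.size(unigram.df.short) + object.size(bigram.df.short) +
    object.size(trigram.df.short) + object.size(fourgram.df.short) +
    object.size(fivegram.df.short)

# execution time for a single prediction
system.time(word.pred("this thing works pretty"))

# total number of stored N-grams
nrow(unigram.df.short) + nrow(bigram.df.short) + nrow(trigram.df.short) +
    nrow(fourgram.df.short) + nrow(fivegram.df.short)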

Using 1.5 million N-grams I get an accuracy of roughly 19 percent. The execution time is becoming a problem, though, because I do not stay below the maximum of 0.5 seconds. Thus, there is no way to increase the N-gram library further using the current approach. Regarding storage, 112 MB is alright but still improvable. In order to improve the results I plan to implement two changes: using indexes and shifting to data.table, which should lead to significant improvements in speed and storage.

Building the Algorithm: Stupid Backoff 2.0 - Indexing and data.table

Indexing means that the 1-gram table is used as an index dictionary, in the sense that each word gets a unique index number. Every other N-gram table is then translated into sequences of these numbers. The reason why this helps is that searching for numbers instead of strings is considerably faster. Besides, I can probably save some storage because the integer data type is much smaller than character. So let's give that approach a shot and convert the N-grams accordingly.

# give each word a unique index number
unigram.df.short <- unigram.df.short  %>%
mutate(word.index = c(1:nrow(unigram.df.short)))

# split higher N-grams into single words
bigram.df.short.sep <- bigram.df.short  %>%
    separate(n.gram, c("word1", "word2"), " ")
trigram.df.short.sep <- trigram.df.short  %>%
    separate(n.gram, c("word1", "word2", "word3"), " ")
fourgram.df.short.sep <- fourgram.df.short  %>%
    separate(n.gram, c("word1", "word2", "word3", "word4"), " ")
fivegram.df.short.sep <- fivegram.df.short  %>%
    separate(n.gram, c("word1", "word2","word3", "word4", "word5"), " ")
# change N-gram values to corresponding index number
bigram.df.short.sep   <- tbl_df(sapply(bigram.df.short.sep,   mapvalues, from = unigram.df.short$n.gram, to = unigram.df.short$word.index))
trigram.df.short.sep  <- tbl_df(sapply(trigram.df.short.sep,  mapvalues, from = unigram.df.short$n.gram, to = unigram.df.short$word.index))
fourgram.df.short.sep <- tbl_df(sapply(fourgram.df.short.sep, mapvalues, from = unigram.df.short$n.gram, to = unigram.df.short$word.index))
fivegram.df.short.sep <- tbl_df(sapply(fivegram.df.short.sep, mapvalues, from = unigram.df.short$n.gram, to = unigram.df.short$word.index))
# adjust variable type
bigram.df.short.sep[]   <- lapply(bigram.df.short.sep, as.integer)
trigram.df.short.sep[]  <- lapply(trigram.df.short.sep, as.integer)
fourgram.df.short.sep[] <- lapply(fourgram.df.short.sep, as.integer)
fivegram.df.short.sep[] <- lapply(fivegram.df.short.sep, as.integer)
print(object.size(unigram.df.short) + object.size(bigram.df.short.sep) + object.size(trigram.df.short.sep) + object.size(fourgram.df.short.sep) + object.size(fivegram.df.short.sep))
## 28092488 bytes

Nice, the amount of required storage is much lower than before (only about a fourth). Now I only have to adjust my code (and maybe combine the N-grams into a single table), which is not shown here because the report already fulfills the requirements given by Coursera and I have to meet the deadline. An updated version of this report will eventually be available on GitHub along with the final web application.
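
As a rough illustration of where the data.table switch is headed (a sketch only, not the final implementation; the word index 42 is a placeholder):

library(data.table)
bigram.dt <- as.data.table(bigram.df.short.sep)
# set a key on the first word so that lookups use a fast binary search
setkey(bigram.dt, word1)
# the three most frequent bigrams whose first word has index 42
head(bigram.dt[.(42L)][order(-counts)], 3)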