Introduction

A key-value store (in computer science often called an associative array or symbol table) is an abstract data type composed of a collection of (key, value) pairs, such that each possible key appears at most once in the collection.

Many modern programming languages include associative arrays as primitive data types, or they are available in software libraries. They come under different names in several languages, such as

  • ‘dictionary’ (Python, Julia),
  • ‘hash’ (Perl, Ruby, JavaScript),
  • ‘map’ (Java, Haskell), or
  • ‘table’ (Lua).

Operations associated with this data type allow:

  • the addition of a pair to the collection;
  • the removal of a pair from the collection;
  • the modification of an existing pair;
  • the look-up of a value associated with a particular key.

Example: dictionary usage in Python

D = {'x':42, 'pi':3.1415, 'z':3+4}  # or: D = {}
D.get('x')                          # 42
D['x'] = 0.5772                     # changes the value
D['e'] = 2.71828                    # adds key-value pair
D.pop('z')                          # removes key and returns value
D.keys()  # or: values(), items()   # return all keys, values, items
D.clear()                           # empty the dictionary

Note: Base R does not have such a key-value data type (although environments can be used for this purpose, see below). We will look at some R packages that provide this functionality.

Write your own ‘hash’ table

It’s not that difficult to write one’s own hash package. In the following implementation the key-value pairs are stored as variable names and their values in an underlying environment.

def.h <- function() new.env(hash=TRUE, parent=emptyenv())  # create an empty hash
len.h <- function(dict) length(ls(envir=dict))             # number of key-value pairs
set.h <- function(key, val, dict) assign(as.character(key), val, envir=dict)
get.h <- function(key, dict, default=NULL) {
    key <- as.character(key)
    if (exists(key, envir=dict)) { get(key, envir=dict)
    } else { default }
}
del.h <- function(key, dict) {                             # remove a key, return its value
    key <- as.character(key)
    if (exists(key, envir=dict)) {
        val <- get.h(key, dict)
        rm(list=c(key), envir=dict)
    } else {
        val <- NULL
    }
    invisible(val)
}
has_key <- function(key, dict) exists(as.character(key), envir=dict)
keys.h <- function(dict) ls(envir=dict)                    # all keys (sorted)
items.h <- function(dict) as.list(dict)                    # named list of all pairs
values.h <- function(dict, mode='character') {             # all values as a vector
    l <- as.list(dict)
    n <- length(l)
    if (n == 0) return(invisible(NULL))
    v <- vector('character', n)
    for (i in 1:n) v[i] <- l[[i]]
    if (mode == 'numeric') v <- as.numeric(v)
    return(v)
}
clear.h <- function(dict) {                                # remove all key-value pairs
    rm(list=keys.h(dict), envir=dict)
}

As it turns out, this simple implementation is quite competitive with the hash package, and (as we will see below) it handled a million Collatz sequences where the hash package ran into problems.

Define a ‘counter’ function

Using these ‘hash’ functions we will define our own counting function as a replacement for the R ‘table’ function.

source("hash.R")
counter <- def.h()                  # define a hash counter ... and count
count <- function(x, hash = counter) {
  set.h(x, get.h(x, hash, default = 0) + 1, hash)
}

We generate some (capital) letters at random and count how many there are of each.

C = sample(LETTERS, 1000, replace = TRUE)
for (ch in C) count(ch, counter)

ks <- keys.h(counter)
vs <- numeric(length = length(ks))
for (i in 1:length(ks)) vs[i] <- get.h(ks[i], counter)

Cdf = data.frame(Letter = ks, Number = vs)
head(Cdf)
##   Letter Number
## 1      A     44
## 2      B     38
## 3      C     40
## 4      D     45
## 5      E     47
## 6      F     39

table(C)
## C
##  A  B  C  D  E  F  G  H  I  J  K  L  M  N  O  P  Q  R  S  T  U  V  W  X  Y  Z 
## 44 38 40 45 47 39 30 30 48 33 33 45 45 41 37 35 42 28 31 42 36 35 44 37 37 38

Example: Collatz Problem

The Collatz problem, also known as the \(3n+1\) problem, is an unsolved mathematical problem.

The problem is about sequences of numbers that are constructed according to a simple recursive law:

Start with any natural number \(n > 0\);
If \(n\) is even, take \(n/2\) next;
If \(n\) is odd, take \(3n+1\) next.
Repeat the procedure with the number obtained. Stop when \(n=1\) is reached.

For example, starting with \(n=7\) one gets the sequence

7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1

Apparently, each such sequence will at some point reach a power of 2 and then end with …, 4, 2, 1, regardless of the start number \(n\). That is, apparently no such sequence is infinite or runs into a cycle without ever reaching 1.

The Collatz conjecture is thus:

Every sequence of numbers constructed in this way will end in 1, no matter with which natural number one starts.

Computing Collatz sequences

numbers is an R package with functions from number theory that includes a generalized collatz() function. Here we will use a simplified version that takes a number n as input and generates the sequence as defined above. We assume the Collatz conjecture is true, so after a finite number of steps we will reach 1 and stop.

collatz <- function(n) {
    cseq <- c(n)
    while (n > 1) {
        if (n %% 2 == 0) n <- n / 2
        else n <- 3*n + 1
        cseq <- c(cseq, n)
    }
    cseq
}

We apply it to the number n=7 as above:

numbers::collatz(7)
##  [1]  7 22 11 34 17 52 26 13 40 20 10  5 16  8  4  2  1

We want to find numbers \(n\) whose Collatz sequences are very long or reach very high numbers on their way down to 1. Let’s write the results into a matrix with three columns: the number n, the length of the sequence, and its maximum.

N = 10; C <- matrix(0, N, 3)
for (n in 1:N) {
    cn = numbers::collatz(n)
    C[n, ] <- c(n, length(cn), max(cn))
}
C
##       [,1] [,2] [,3]
##  [1,]    1    1    1
##  [2,]    2    2    2
##  [3,]    3    8   16
##  [4,]    4    3    4
##  [5,]    5    6   16
##  [6,]    6    9   16
##  [7,]    7   17   52
##  [8,]    8    4    8
##  [9,]    9   20   52
## [10,]   10    7   16

Here we tested it for the first 10 numbers. The output shows that n=9 has a slightly longer sequence and at some point also runs into the number 52.

N = 1000
collatz_db <- def.h()
m = 1; l = 1
tictoc::tic()
for (n in 1:N) {
  cn = collatz(n); mn = max(cn); ln = length(cn)
  if (mn > m) m = mn
  if (ln > l) l = ln
  set.h(n, numbers::collatz(n), collatz_db)
}
tictoc::toc()
cat("Maximal length:", l, '\n')
cat("Maximal height:", m, '\n')
0.276 sec elapsed
Maximal length: 179
Maximal height: 250504 

Doing some benchmark tests, we can state that with our own implementation of dictionaries/hashes in R an insertion takes about 0.23 msec per item, while an extraction takes about 0.01 msec.
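
For illustration, such per-item timings can be obtained with a small sketch like the following (assuming hash.R has been sourced and the tictoc package is available); the absolute numbers will of course depend on the machine.

source("hash.R")
B <- def.h()
N <- 100000

tictoc::tic("insertion")
for (n in 1:N) set.h(n, n*n, B)     # store N key-value pairs
tictoc::toc()

tictoc::tic("extraction")
for (n in 1:N) x <- get.h(n, B)     # read them back again
tictoc::toc()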

Packages with key-value stores

We would like to find longer sequences for numbers up to, say, one million, also storing intermediate results such as the calculated sequences, their lengths and maxima. A matrix (or even a data frame) does not seem a perfect place for this.

So we will store the sequences in a key-value store, and their lengths and maxima in a database for later statistical calculations. As database we chose SQLite through the R package RSQLite (unfortunately, the column-oriented MonetDBLite is not on CRAN anymore).

For key-value stores the following packages provide implementations:

  • hash
  • filehash
  • fastmap
  • rredis, RcppRedis

Package hash uses R environments to implement key-value pairs as variable names and values, similar to what we have done above. filehash stores values as serialized objects on disk, with the database itself handled as an S4 object. fastmap stores keys as C++ objects and values in R lists. rredis connects to redis, an open-source, in-memory data store (which has to be installed by the user independently of R).

We will look at all these possibilities and also at their timings.

The ‘hash’ package

For the hash package, H <- hash() generates a hash, H[[key]] <- value will store a value for a key (which has to be a string), and H[[key]] will retrieve the value for that key.

Besides that, there are clear and delete methods to remove all or specific key-values. keys returns all keys as a character vector, and values all values. Values can also be set or retrieved with the $ operator, setting it to NULL removes the key-value pair.
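
For example, a small sketch of these operations:

library(hash)
h0 <- hash(a = 1, b = 2)            # create a hash with two pairs
h0$c <- 3                           # set a value via $
h0$b                                # 2
h0$a <- NULL                        # removes the key-value pair 'a'
keys(h0)                            # "b" "c"
clear(h0)                           # empty the hash again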

library(hash)
H <- hash::hash()

N <- 10000
for (n in 1:N) {
  cseq <- collatz(n)
  H[[as.character(n)]] <- cseq
}

tictoc::tic()
for (n in 1:N) {
   item <- H[[as.character(n)]]
}
tictoc::toc()
Error: protect(): protection stack overflow

So we get a “protection stack overflow”, showing that this is not the right key-value store for our task. I was not able to run this with more than 50,000 items. That is astonishing, as the simple implementation above was capable of storing a million Collatz sequences without a problem.

With, say, 10,000 items the timings were about 0.12 msec per insertion and 0.0053 msec per extraction.

Usage:

# library(hash)
H <- hash::hash()                   # hash(a=1, b=2, c=3, ...)
H[[key]] <- value                   # H$key <- value
value <- H[[key]]                   # H$key (no default)
has.key(key, H)                     #   all vectorized !
keys(H)  # values(H)                # length(H)
del(key, H)                         # also: delete(key, H), or H[[key]] <- NULL
clear(H)                            # is.empty(H)

The ‘filehash’ package

The filehash package provides a disk-based approach. dbCreate() creates a database file on disk, dbInit() initializes it and returns an R object associated with that file. dbInsert() and dbFetch() set and get the values for keys.

library(filehash)
## filehash: Simple Key-Value Database (2.4-2)
filehash::dbCreate("collatz.db")
db <- filehash::dbInit("collatz.db")

N <- 1000000
for (n in 1:N) {
  cn <- collatz(n)
  dbInsert(db, as.character(n), cn)
}

dbUnlink(db)

A rough estimate of the time for inserting a key-value pair is \(5\cdot 10^{-4}\) seconds, for fetching \(0.08 \cdot 10^{-4}\) seconds, and the size of the file on disk is about 1.2 GB. With dbUnlink(db) the file will be erased from disk.

An overview of the available functions for handling filehash stores:

Usage

library(filehash)
dbCreate(filename)              # creates file-based hash table
db <- dbInit(filename)          # initialize the connection
dbLoad(db)                      # load hash table into an environment
dbInsert(db, "key", value)      # or: db[[key]] <- , db$'key' <- 
value <- dbFetch(db, key)       # or: db[[key]], db$'key'
dbExists(db, key)
dbDelete(db, key)
dbList(db)                      # list all keys
dbUnlink(db)                    # delete the database from disk

dbLoad() will load the entire database into an environment (via ‘promises’). This is only useful for read-only databases, as changes in this environment will not be reflected in the file.

If you do not unlink the database (and, e.g., just set db <- NULL), the file keeps existing and can be reopened later with dbInit().
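
A small sketch of this (assuming the file "collatz.db" created above still exists on disk):

library(filehash)
db <- dbInit("collatz.db")          # reconnect to the existing file
dbExists(db, "7")                   # TRUE if the key was stored
cseq7 <- dbFetch(db, "7")           # the Collatz sequence stored for n = 7
# dbUnlink(db)                      # only when the file should really be deleted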

The ‘fastmap’ package

While filehash is written in pure R, fastmap has a C++ implementation.

The fastmap package provides a quite easy and elegant interface for use. H <- fastmap() will create a hash, and H$set(key, value), H$get(), etc., have the expected meaning.

library(fastmap)
H <- fastmap::fastmap()

N <- 1000000
for (n in 1:N) {
  cseq <- collatz(n)
  H$set(as.character(n), cseq)
}

This takes about 180 secs on my machine. Given that the loop and running the collatz function for all these numbers takes 90 secs, we would have about \(1 \cdot 10^{-4}\) seconds for each set operation; for fetching operations we have \(5 \cdot 10^{-6}\) seconds.
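
The fetching time was measured with a retrieval loop of this kind (a sketch, analogous to the one used for the hash package):

tictoc::tic()
for (n in 1:N) {
    item <- H$get(as.character(n))  # retrieve each stored sequence
}
tictoc::toc()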

An overview of the available functions for handling fastmap stores:

Usage

library(fastmap)
db <- fastmap()             # optionally: fastmap(missing_default = 0)
db$set(key, value)          # also: mset
db$get(key)                 # also: mget; missing = <default>
db$has(keys)
db$remove(keys)
db$keys()                   # sort = FALSE|TRUE
db$size()                   # no. of items in the set
db$as_list()                # returns a named list
db$reset()                  # clear all items, reset the store

Note: ‘fastmap’ avoids the memory leak issue that can arise when hash tables are implemented as R environments. And ‘fastmap’ also supports queues and stacks!
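
A small sketch of these helpers (assuming the constructors fastqueue() and faststack() and their documented methods):

library(fastmap)

q <- fastqueue()
q$add(1); q$add(2); q$add(3)        # enqueue three items
q$remove()                          # 1  (first in, first out)

s <- faststack()
s$push("a"); s$push("b")            # push two items
s$pop()                             # "b" (last in, first out)
s$size()                            # 1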

‘fastmap’ was developed by Winston Chang from RStudio.
You can find manual-like information on GitHub: https://r-lib.github.io/fastmap/

The ‘rredis’ package

redis is an open source, in-memory data structure store and can be used as a (non-relational) key-value database. redis supports different kinds of abstract data structures, such as strings, lists, maps, sets, sorted sets, etc. It is quite suited for storing time series data.

There are two packages on CRAN connecting to redis, rredis and RcppRedis. It is assumed that the user has installed the redis software, and that the redis-cli command is available in the “path”.

The rredis package provides an R client interface to redis. The RcppRedis package was developed by Dirk Eddelbuettel and connects using the C-language client library ‘hiredis’. Here we will make use of rredis as it appears to be simpler to use.

Installation

On Ubuntu, for example, redis can be installed and configured with the following commands:

sudo apt install redis-server           # install
sudo nano /etc/redis/redis.conf         # configure
sudo systemctl restart redis.service    # restart
sudo systemctl status redis             # check if running
sudo systemctl disable redis            # start manually next time
redis-cli
    > ping                              # pong
    > set test "..."                    # store a test string
    > get test                          # read it back
    > exit

For more information see how to install and secure redis on Ubuntu

Connect with rredis

library(rredis)
redisConnect()
N <- 1000000

for (n in 1:N) {
  cn <- collatz(n)
  redisSet(as.character(n), cn)
}

The connection through rredis is relatively slow, 0.6 msecs for insertion and 0.4 msecs for fetching sequences back.
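
The fetching timing was obtained with a loop of this kind (a sketch, analogous to the earlier measurements):

tictoc::tic()
for (n in 1:N) {
    item <- redisGet(as.character(n))   # fetch each stored sequence back
}
tictoc::toc()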

Usage

library(rredis)
redisConnect()                  # port 6379
redisSet(key, value)            # also: redisMSet
redisGet(key)                   # also: redisMGet
redisExists(key)                # key exists
redisDelete(key)                # delete key/value pair
redisExpire('z', 1)             # expires in 1 second
redisKeys()                     # return all keys
redisFlushAll()                 # remove all key/value pairs
redisSelect(n)                  # select database n; default: 0
redisClose()                    # close open connection

With redisCmd("cmd", ...) one can perform any redis operation.

redis also provides list and set data types which may be utilized to set up data queues, etc.
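
For example, a simple first-in, first-out job queue could be sketched with the list commands (here assuming the wrappers redisRPush() and redisLPop(); redisCmd() works as a fallback for any raw command):

redisRPush("jobs", "task-1")        # append to the right end of the list
redisRPush("jobs", "task-2")
redisLPop("jobs")                   # "task-1"  (taken from the left end)
# equivalently with a raw command:
# redisCmd("rpush", "jobs", "task-3")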

The ‘RcppRedis’ package

The RcppRedis package provides another interface to the redis key/value store (or database). Its interface is quite simple, essentially a string-based exchange.

library(RcppRedis)
redis <- new(Redis)
redis$exec("ping")

N <- 1000000
for (n in 1:N) {
    cn <- collatz(n)
    redis$set(as.character(n), serialize(cn, NULL))
}

Timings are 0.19 msec per insertion and 0.048 msecs for fetching values.

Usage

library(RcppRedis)
redis <- new(Redis)             # new connection object
redis$exec("cmd")               # send and receive string commands
# special:
redis$set(key, serialize(val, NULL, ascii = FALSE|TRUE))
redis$get(key)

Through the redis$exec() interface many more commands can be used, see the redis reference.
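
For example (a small sketch; the command strings follow the standard redis syntax):

redis$exec("dbsize")                # number of keys in the current database
redis$exec("exists 1000000")        # 1 if the key exists, 0 otherwise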

Using a relational database

Storing max and length

For each \(n\) we want to store the length and maximum of its Collatz sequence in a relational database. Unfortunately, MonetDBLite has been removed from CRAN and MonetDB requires the Monet database system to be installed by the user. Therefore we go back to SQLite.

library(DBI)
library(RSQLite)
# ':memory:' in-memory database
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
rs <- dbSendQuery(con,
    "CREATE TABLE Collatz (n INTEGER, imax INTEGER, ilen INTEGER)")
dbClearResult(rs)

To verify, we list the table and its fields, and read back the (still empty) table.

dbListTables(con)
dbListFields(con, "Collatz")
dbReadTable(con, "Collatz")
[1] "Collatz
[1] n    imax ilen
<0 rows> (or 0-length row.names)

We are ready to scan all Collatz sequences and fill the database for a set \(N\) of natural numbers.

collatz_filldb <- function(N) {
    for (n in 1:N) {
        cseq <- numbers::collatz(n)
        imax <- max(cseq); ilen <- length(cseq)
        rs <- dbSendQuery(con,
                  paste("INSERT INTO Collatz VALUES (",
                        n, ",", imax, ",", ilen, ")", sep = ""))
        dbClearResult(rs)
    }
}
N <- 1000000
collatz_filldb(N)

This takes about 320 elapsed seconds, that is \(2 \cdot 10^{-4}\) seconds per database operation.
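
Most of that time goes into the one-row INSERT statements. A possible alternative (a sketch only, not timed here) is to collect the lengths and maxima in vectors and append them with a single dbWriteTable() call, avoiding the per-row query overhead:

collatz_filldb2 <- function(N) {
    imax <- numeric(N); ilen <- numeric(N)
    for (n in 1:N) {
        cseq <- collatz(n)
        imax[n] <- max(cseq); ilen[n] <- length(cseq)
    }
    # one bulk append instead of N single INSERTs
    dbWriteTable(con, "Collatz",
                 data.frame(n = 1:N, imax = imax, ilen = ilen),
                 append = TRUE)
}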

Example evaluations

Now we can look at some results. For example, which numbers have the highest maximum in our range \(1 \ldots 10^6\)? We can see that some of them reach a maximum of over 50 billion.

rs <- dbSendQuery(con,
          "SELECT * FROM Collatz WHERE imax >= 50000000000")
ans <- dbFetch(rs)
dbClearResult(rs)

ans
       n        imax ilen
1 665215 52483285312  442
2 704511 56991483520  243
3 886953 52483285312  445
4 997823 52483285312  440

And what are the numbers with the longest sequences?

rs <- dbSendQuery(con,
          "SELECT * FROM Collatz WHERE ilen >= 500")
ans <- dbFetch(rs)
dbClearResult(rs)

ans
       n       imax ilen
1 626331 7222283188  509
2 704623 7222283188  504
3 837799 2974984576  525
4 939497 7222283188  507

Example plots

We take a look at the last 10000 of our numbers …

rs <- dbSendQuery(con,
          "SELECT * FROM Collatz WHERE n >= 990001")
m <- dbFetch(rs)
dbClearResult(rs)

dbDisconnect(con)

… and plot length vs. maximum in gray, and as red circles all prime numbers among those numbers.

par(mar = c(2,2,2,1))
plot(m$ilen, log(m$imax), col = "darkgray", pch = 20,
     main = "Collatz Seqs: length vs. log(max)")

pinds <- numbers::isPrime(m$n)
points(m$ilen[pinds], log(m$imax[pinds]), col = "darkred")

Efficiency considerations

It does not seem necessary to recalculate the whole Collatz sequence in every case. For instance, if \(n\) is even then we have already generated the sequence for \(n/2\) and can simply reuse it. This calls for a combination of storing and computing:

H <- fastmap::fastmap()
H$set('1', c(1)); H$set('2', c(2,1))    # seed with the two known sequences
nxt <- function(n) {                    # one Collatz step
    if (n %% 2 == 0) n/2 else 3*n + 1
}

for (n in seq(3,1000000)) {
    if (n %% 2 == 0) {
        # for even n the sequence of n/2 has already been stored
        cseq <- c(n, H$get(as.character(n/2)))
    } else {
        # for odd n iterate until we drop below n, then reuse the stored tail
        m <- 3*n + 1
        cseq <- c(n, m)
        while (m >= n) {
            m <- nxt(m)
            cseq <- c(cseq, m)
        }
        cseq <- c(cseq, H$get(as.character(m))[-1])
    }
    H$set(as.character(n), cseq)
}

This takes about 20 seconds, while the straightforward fastmap version above took 180 seconds. Intermediate versions – where we look up keys to find out whether they have already been calculated – take about 60 seconds.

Timing comparisons

Package     Insertion    Fetching     Remarks
hash.R      0.23 msec    0.010 msec   R environment
hash        0.12 msec    0.005 msec   R environment, S4
filehash    0.50 msec    0.008 msec   file-based, S4
fastmap     0.18 msec    0.005 msec   C++ implementation, S3
rredis      0.65 msec    0.413 msec   key-value database
RcppRedis   0.19 msec    0.048 msec   key-value database, S3
PostgreSQL  ?? msec      ?? msec      relational database

PostgreSQL is a relational database, but it has a special module for storing time series. It might be a reasonable alternative, but one that I have not yet tried.

References

Roger Peng’s filehash vignette (April 2019)
https://cran.r-project.org/web/packages/filehash/vignettes/filehash.pdf

Winston Chang’s fastmap documentation (January 2021)
https://r-lib.github.io/fastmap/

redis key-value database, version 6.2.1 (March 2021)
Homepage: https://redis.io/

The rredis package vignette (July 2015)
https://cran.r-project.org/web/packages/rredis/vignettes/rredis.pdf

Jeffrey Horner. “Hash Table Performance in R: Parts I-IV”. R-bloggers 2015.
Part I | Part II | Part III | Part IV