A key-value store (in computer science often called an associative array or symbol table) is an abstract data type composed of a collection of (key, value) pairs, such that each possible key appears at most once in the collection.
Many modern programming languages include associative arrays as primitive data types, or they are available in software libraries. They go under different names in different languages: dictionary in Python, hash in Ruby and Perl, map in Go, and so on.
Operations associated with this data type allow adding a key-value pair, removing a pair, modifying the value of an existing key, and looking up the value associated with a given key.
Example: dictionary usage in Python
D = {'x':42, 'pi':3.1415, 'z':3+4} # or: D = {}
D.get('x') # 42
D['x'] = 0.5772 # changes the value
D['e'] = 2.71828 # adds key-value pair
D.pop('z') # removes key and returns value
D.keys() # or: values(), items() # return all keys, values, items
D.clear() # empty the dictionary
Note: Base R does not have such a key-value data type (and for a long time none was available; see below). We will look at some R packages that provide this functionality.
It’s not that difficult to write one’s own hash package. In the following implementation the key-value pairs are stored as variable names and their values in an underlying environment.
def.h <- function() new.env(hash=TRUE, parent=emptyenv())
len.h <- function(dict) length(ls(envir=dict))
set.h <- function(key, val, dict) assign(as.character(key), val, envir=dict)
get.h <- function(key, dict, default=NULL) {
    key <- as.character(key)
    if (exists(key, envir=dict)) { get(key, envir=dict)
    } else { default }
}
del.h <- function(key, dict) {
    key <- as.character(key)
    if (exists(key, envir=dict)) {
        val <- get.h(key, dict)
        rm(list=c(key), envir=dict)
    } else {
        val <- NULL
    }
    invisible(val)
}
has_key <- function(key, dict) exists(as.character(key), envir=dict)
keys.h <- function(dict) ls(envir=dict)
items.h <- function(dict) as.list(dict)
values.h <- function(dict, mode='character') {
    l <- as.list(dict)
    n <- length(l)
    if (n == 0) return(invisible(NULL))  # 'return' is needed here
    v <- vector('character', n)
    for (i in 1:n) v[i] <- l[[i]]
    if (mode == 'numeric') v <- as.numeric(v)
    return(v)
}
clear.h <- function(dict) {
    rm(list=keys.h(dict), envir=dict)
}
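A quick interactive check of these functions (a minimal example; note that keys are always converted to strings):
d <- def.h()
set.h('a', 1, d); set.h('b', 2, d)
get.h('a', d)                # 1
get.h('z', d, default = 0)   # 0
keys.h(d)                    # "a" "b"
del.h('b', d); len.h(d)      # 1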
As it turns out, this simple implementation compares quite well with the one in the hash package (see the timings below), and unlike that package it handles very large stores without problems.
Using these ‘hash’ functions we will define our own counting function as a replacement for the R ‘table’ function.
source("hash.R")
counter <- def.h() # define a hash counter ... and count
count <- function(x, hash = counter) {
    set.h(x, get.h(x, hash, default = 0) + 1, hash)
}
We generate some (capital) letters at random and count how many there are of each.
C = sample(LETTERS, 1000, replace = TRUE)
for (ch in C) count(ch, counter)
ks <- keys.h(counter)
vs <- numeric(length = length(ks))
for (i in 1:length(ks)) vs[i] <- get.h(ks[i], counter)
Cdf = data.frame(Letter = ks, Number = vs)
head(Cdf)
## Letter Number
## 1 A 44
## 2 B 38
## 3 C 40
## 4 D 45
## 5 E 47
## 6 F 39
table(C)
## C
## A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
## 44 38 40 45 47 39 30 30 48 33 33 45 45 41 37 35 42 28 31 42 36 35 44 37 37 38
The Collatz problem, also known as the \(3n+1\) problem, is an unsolved mathematical problem.
The problem is about sequences of numbers that are constructed according to a simple recursive law:
Start with any natural number \(n > 0\);
If \(n\) is even, take \(n/2\) next;
If \(n\) is odd, take \(3n+1\) next.
Repeat the procedure with the number obtained. Stop when \(n=1\) is reached.
For example, starting with \(n=7\) one gets the sequence
7, 22, 11, 34, 17, 52, 26, 13, 40, 20, 10, 5, 16, 8, 4, 2, 1
Apparently, each such sequence at some point generates a power of 2 and then ends with \(\ldots, 4, 2, 1\), regardless of the start number \(n\). That is, no such sequence seems to be infinite or to enter a cycle that never reaches 1.
The Collatz conjecture is thus:
Every sequence of numbers constructed in this way will end in 1, no matter with which natural number one starts.
numbers is a package with functions from number theory and has a generalized collatz() function. Here we will use a simplified version that takes a number n as input and generates the sequence as defined above. We assume the Collatz conjecture is true, so after a finite number of steps we will reach 1 and stop.
collatz <- function(n) {
    cseq <- c(n)
    while (n > 1) {
        if (n %% 2 == 0) n <- n / 2
        else             n <- 3*n + 1
        cseq <- c(cseq, n)
    }
    cseq
}
We apply it to the number n=7 as above:
numbers::collatz(7)
## [1] 7 22 11 34 17 52 26 13 40 20 10 5 16 8 4 2 1
We want to find numbers \(n\) whose Collatz sequences are very long or reach very high values on their way down to 1. Let’s write the results into a matrix with three columns: the number \(n\), the length of the sequence, and its maximum.
N = 10; C <- matrix(0, N, 3)
for (n in 1:N) {
    cn = numbers::collatz(n)
    C[n, ] <- c(n, length(cn), max(cn))
}
C
## [,1] [,2] [,3]
## [1,] 1 1 1
## [2,] 2 2 2
## [3,] 3 8 16
## [4,] 4 3 4
## [5,] 5 6 16
## [6,] 6 9 16
## [7,] 7 17 52
## [8,] 8 4 8
## [9,] 9 20 52
## [10,] 10 7 16
Here we tested it for the first 10 numbers. The output shows that n=9 has a slightly longer sequence and at some point also runs into the number 52.
N = 1000
collatz_db <- def.h()
m = 1; l = 1
tictoc::tic()
for (n in 1:N) {
    cn = collatz(n); mn = max(cn); ln = length(cn)
    if (mn > m) m = mn
    if (ln > l) l = ln
    set.h(n, cn, collatz_db)  # store the already computed sequence
}
tictoc::toc()
tictoc::toc()
cat("Maximal length:", l, '\n')
cat("Maximal height:", m, '\n')
## 0.276 sec elapsed
## Maximal length: 179
## Maximal height: 250504
Doing some benchmark tests, we can state that with our own implementation of dictionaries/hashes in R an insertion takes about 0.23 msecs and an extraction about 0.01 msecs per item.
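Such per-item times can be measured with simple loop timings, for instance with tictoc (a minimal sketch; divide the elapsed times by N, and absolute values will of course depend on the machine):
N <- 100000
h <- def.h()
tictoc::tic("insert")
for (n in 1:N) set.h(n, n, h)
tictoc::toc()
tictoc::tic("extract")
for (n in 1:N) x <- get.h(n, h)
tictoc::toc()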
We would like to find longer sequences for numbers up to, say, one million, also storing intermediate results such as the calculated sequences, their lengths and maxima. A matrix (or even a data frame) does not seem a perfect place for this.
So we will load the sequences into a key-value store and the lengths and maxima into a relational database (for later statistical calculations). As the database we chose SQLite through the R package RSQLite (unfortunately, the column-oriented MonetDBLite is not on CRAN anymore).
For key-value stores the following packages provide implementations:
Package hash uses R environments to implement key-value pairs as variable names and values, similar to what we have done above. filehash implements keys (as S4 objects) with values as serialized objects stored on disk. fastmap stores keys as C++ objects and values as R lists. rredis connects to redis, an open-source, in-memory data store (which has to be installed by the user independently of R).
We will look at all these possibilities and also at their timings.
For the hash package, H <- hash() generates a hash, H[[key]] <- value will store a value for a key (which has to be a string), and H[[key]] will retrieve the value for that key.
Besides that, there are clear and delete methods to remove all or specific key-value pairs. keys returns all keys as a character vector, and values all values. Values can also be set or retrieved with the $ operator; setting a key to NULL removes the key-value pair.
library(hash)
H <- hash::hash()
N <- 10000
for (n in 1:N) {
    cseq <- collatz(n)
    H[[as.character(n)]] <- cseq
}
tictoc::tic()
for (n in 1:N) {
    item <- H[[as.character(n)]]
}
tictoc::toc()
## Error: protect(): protection stack overflow
So we get a "protection stack overflow", showing that this is not the right key-value store for our task. I was not able to run this with more than 50000 items. That is astonishing, as the simple implementation above was capable of storing a million Collatz sequences without a problem.
With, say, 10000 items the timings were 0.12 msecs per insertion and 0.0053 msecs per extraction.
Usage:
# library(hash)
H <- hash::hash() # hash(a=1, b=2, c=3, ...)
H[[key]] <- value # H$key <- value
value <- H[[key]] # H$key (no default)
has.key(key, H) # all vectorized !
keys(H) # values(H) # length(H)
del(key, H) # also: delete(key, H); or H[[key]] <- NULL
clear(H) # is.empty(H)
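A minimal illustration of these operations (assuming the hash package is attached):
H <- hash(a = 1, b = 2)
H$c <- 3               # set a value with the $ operator
H$a                    # retrieve a value: 1
H$b <- NULL            # setting to NULL removes the pair
keys(H)                # "a" "c"
clear(H); is.empty(H)  # TRUE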
The filehash package provides a disk-based approach. dbCreate() creates a database file on disk, dbInit() initializes a connection to it. dbInsert() and dbFetch() set and get the values for keys.
library(filehash)
## filehash: Simple Key-Value Database (2.4-2)
filehash::dbCreate("collatz.db")
db <- filehash::dbInit("collatz.db")
N <- 1000000
for (n in 1:N) {
    cn <- collatz(n)
    dbInsert(db, as.character(n), cn)
}
dbUnlink(db)
A rough estimate of the time for inserting a key-value pair is \(5\cdot 10^{-4}\) seconds, for fetching \(0.08 \cdot 10^{-4}\) seconds, and the size of the file on disk is about 1.2 GB. With dbUnlink(db) the file will be erased from disk.
An overview of the available functions for handling filehash stores:
Usage
library(filehash)
dbCreate(filename) # creates file-based hash table
db <- dbInit(filename) # initialize the connection
dbLoad(db) # load hash table into an environment
dbInsert(db, "key", value) # or: db[[key]] <- , db$'key' <-
value <- dbFetch(db, key) # or: db[[key]], db$'key'
dbExists(db, key)
dbDelete(db, key)
dbList(db) # list all keys
dbUnlink(db) # delete the database from disk
dbLoad() will load the entire database into an environment (via ‘promises’). This is only useful for read-only databases, as changes in this environment will not be reflected in the file.
If you don’t unlink the file (and, e.g., set db <- NULL), the file keeps existing and can be reloaded with dbInit() again.
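For example, a later session can reconnect to the stored data (a small sketch; the file "collatz.db" is assumed to still exist on disk):
db <- filehash::dbInit("collatz.db")  # reconnect to the existing file
filehash::dbFetch(db, "7")            # 7 22 11 34 17 52 26 13 40 ... 1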
While filehash is written in pure R, fastmap has a C++ implementation.
The fastmap package provides a quite easy and elegant interface for use. H <- fastmap() will create a hash, and H$set(key, value), H$get(), etc., have the expected meaning.
library(fastmap)
H <- fastmap::fastmap()
N <- 1000000
for (n in 1:N) {
    cseq <- collatz(n)
    H$set(as.character(n), cseq)
}
This takes about 180 secs on my machine. Given that the loop and running the collatz function for all these numbers takes 90 secs, we would have about \(1 \cdot 10^{-4}\) seconds for each set operation; for fetching operations we have \(5 \cdot 10^{-6}\) seconds.
An overview of the available functions for handling fastmap stores:
Usage
library(fastmap)
db <- fastmap(missing_default = NULL) # or e.g.: missing_default = 0
db$set(key, value) # also: mset
db$get(key) # also: mget; missing = <default>
db$has(keys)
db$remove(keys)
db$keys() # sort = FALSE|TRUE
db$size() # number of items in the map
db$as_list() # returns a named list
db$reset() # clear all items, reset the store
Note: ‘fastmap’ avoids the memory-leak issue that arises when hash tables are implemented as R environments. And ‘fastmap’ also supports queues and stacks, as sketched below!
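A short sketch of the queue and stack interfaces (assuming the functions fastqueue() and faststack() from the fastmap package):
Q <- fastmap::fastqueue()
Q$add(1); Q$add(2); Q$add(3)
Q$remove()   # 1 (first in, first out)
S <- fastmap::faststack()
S$push(1); S$push(2); S$push(3)
S$pop()      # 3 (last in, first out)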
‘fastmap’ was developed by Winston Chang from RStudio.
You can find manual-like information on GitHub: fastmap
redis is an open source, in-memory data structure store and can be used as a (non-relational) key-value database. redis supports different kinds of abstract data structures, such as strings, lists, maps, sets, sorted sets, etc. It is quite suited for storing time series data.
There are two packages on CRAN connecting to redis, rredis and RcppRedis. It is assumed that the user has installed the redis software, and that the redis-cli command is available in the “path”.
The rredis package provides an R client interface to redis. The RcppRedis package was developed by Dirk Eddelbuettel and connects using the C-language client library ‘hiredis’. Here we will make use of rredis as it appears to be simpler to use.
On Ubuntu, for example, redis can be installed and configured with the command:
sudo apt install redis-server # install
sudo nano /etc/redis/redis.conf # configure
sudo systemctl restart redis.service # restart
sudo systemctl status redis # check if running
sudo systemctl disable redis # start manually next time
redis-cli
> ping # pong
> set test "..."; get test # store and read test string
> exit
For more information see how to install and secure redis on Ubuntu
library(rredis)
redisConnect()
N <- 1000000
for (n in 1:N) {
    cn <- collatz(n)
    redisSet(as.character(n), cn)
}
The connection through rredis is relatively slow, 0.6 msecs for insertion and 0.4 msecs for fetching sequences back.
Usage
library(rredis)
redisConnect() # port 6379
redisSet(key, value) # also: redisMSet
redisGet(key) # also: redisMGet
redisExists(key) # key exists
redisDelete(key) # delete key/value pair
redisExpire('z', 1) # expires in 1 second
redisKeys() # return all keys
redisFlushAll() # remove all key/value pairs
redisSelect(n) # connect to table n; default: 0
redisClose() # close open connection
With redisCmd("cmd", ...) one can perform any redis operation.
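For instance (a small sketch, assuming a running redis server):
redisCmd("ping")              # "PONG"
redisCmd("incr", "visits")    # a server-side counter: 1
redisCmd("get", "visits")     # "1"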
redis also provides list and set data types which may be utilized to set up data queues, etc.
The RcppRedis package provides another interface to the redis key/value store (or database): Its interface is quite simple, just using a string-based exchange.
library(RcppRedis)
redis <- new(Redis)
redis$exec("ping")
N <- 1000000
for (n in 1:N) {
    cn <- collatz(n)
    redis$set(as.character(n), serialize(cn, NULL))
}
Timings are 0.19 msecs per insertion and 0.048 msecs per fetch.
Usage
library(RcppRedis)
redis <- new(Redis) # new connection object
redis$exec("cmd") # send and receive string commands
# special:
redis$set(key, serialize(val, NULL, ascii = FALSE|TRUE))
redis$get(key)
Through the redis$exec() interface many more commands can be used, see the redis reference.
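For example (a sketch; commands are passed as plain strings, as in redis-cli):
redis$exec("dbsize")      # number of stored keys
redis$exec("flushall")    # remove all key-value pairs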
For each \(n\) we want to store the length and maximum of its Collatz sequence in a relational database. Unfortunately, MonetDBLite has been removed from CRAN and MonetDB requires the Monet database system to be installed by the user. Therefore we go back to SQLite.
library(DBI)
library(RSQLite)
# ':memory:' in-memory database
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")
rs <- dbSendQuery(con,
"CREATE TABLE Collatz (n INTEGER, imax INTEGER, ilen INTEGER)")
dbClearResult(rs)
To verify, we list the table(s) and their fields.
dbListTables(con)
dbListFields(con, "Collatz")
dbReadTable(con, "Collatz")
## [1] "Collatz"
## [1] "n"    "imax" "ilen"
## [1] n    imax ilen
## <0 rows> (or 0-length row.names)
We are ready to scan all Collatz sequences and fill the database for a set \(N\) of natural numbers.
collatz_filldb <- function(N) {
    for (n in 1:N) {
        cseq <- numbers::collatz(n)
        imax <- max(cseq); ilen <- length(cseq)
        rs <- dbSendQuery(con,
            paste("INSERT INTO Collatz VALUES (",
                  n, ",", imax, ",", ilen, ")", sep = ""))
        dbClearResult(rs)
    }
}
N <- 1000000
collatz_filldb(N)
This takes about 320 elapsed seconds, that is \(2 \cdot 10^{-4}\) seconds per database operation.
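The row-by-row INSERT statements are the main cost here. A sketch of a faster alternative, using the standard DBI function dbWriteTable to append everything as one data frame (collatz_filldb2 is a hypothetical helper, not from the text above):
collatz_filldb2 <- function(N) {
    imax <- numeric(N); ilen <- integer(N)  # maxima exceed 32-bit integers
    for (n in 1:N) {
        cseq <- numbers::collatz(n)
        imax[n] <- max(cseq); ilen[n] <- length(cseq)
    }
    DBI::dbWriteTable(con, "Collatz",
                      data.frame(n = 1:N, imax = imax, ilen = ilen),
                      append = TRUE)
}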
Now we can look at some results. For example, which numbers have the highest maximum in our range \(1 \ldots 10^6\)? We can see that some of them have a maximum of over 50 billion.
rs <- dbSendQuery(con,
"SELECT * FROM Collatz WHERE imax >= 50000000000")
ans <- dbFetch(rs)
dbClearResult(rs)
ans
##        n        imax ilen
## 1 665215 52483285312  442
## 2 704511 56991483520  243
## 3 886953 52483285312  445
## 4 997823 52483285312  440
And what are the numbers with the longest sequences?
rs <- dbSendQuery(con,
"SELECT * FROM Collatz WHERE ilen >= 500")
ans <- dbFetch(rs)
dbClearResult(rs)
ans
##        n       imax ilen
## 1 626331 7222283188  509
## 2 704623 7222283188  504
## 3 837799 2974984576  525
## 4 939497 7222283188  507
We take a look at the last 10000 of our numbers …
rs <- dbSendQuery(con,
"SELECT * FROM Collatz WHERE n >= 990001")
m <- dbFetch(rs)
dbClearResult(rs)
dbDisconnect(con)
… and plot length vs. maximum in gray, and as red circles all prime numbers among those numbers.
par(mar = c(2,2,2,1))
plot(m$ilen, log(m$imax), col = "darkgray", pch = 20,
main = "Collatz Seqs: length vs. log(max)")
pinds <- numbers::isPrime(m$n)
points(m$ilen[pinds], log(m$imax[pinds]), col = "darkred")
It does not seem necessary to recalculate the whole Collatz sequence in every case. For instance, if \(n\) is even then we have already generated the sequence for \(n/2\) and can simply reuse it. This calls for a combination of storing and computing.
H <- fastmap::fastmap()
H$set('1', c(1)); H$set('2', c(2, 1))
nxt <- function(n) {
    if (n %% 2 == 0) n/2 else 3*n + 1
}
for (n in seq(3, 1000000)) {
    if (n %% 2 == 0) {
        # for even n the sequence of n/2 is already stored
        cseq <- c(n, H$get(as.character(n/2)))
    } else {
        # for odd n, iterate until the sequence drops below n ...
        m <- 3*n + 1
        cseq <- c(n, m)
        while (m >= n) {
            m <- nxt(m)
            cseq <- c(cseq, m)
        }
        # ... then append the stored sequence of m (without repeating m)
        cseq <- c(cseq, H$get(as.character(m))[-1])
    }
    H$set(as.character(n), cseq)
}
This takes about 20 seconds, while the naive fastmap version above takes 180 seconds. Intermediate versions – where we look up keys to find out whether they have already been calculated – take about 60 seconds.
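One such intermediate version might look like the following sketch (hypothetical, not necessarily the variant timed above; collatz_memo reuses stored sequences via lookup, memoizing only keys in our range):
H2 <- fastmap::fastmap(missing_default = NULL)
collatz_memo <- function(n) {
    key <- as.character(n)
    val <- H2$get(key)
    if (!is.null(val)) return(val)       # already calculated earlier
    if (n == 1) cseq <- c(1)
    else cseq <- c(n, collatz_memo(if (n %% 2 == 0) n/2 else 3*n + 1))
    if (n <= 1000000) H2$set(key, cseq)  # only memoize keys up to N
    cseq
}
for (n in 1:1000000) collatz_memo(n)
The following table summarizes the per-item timings for all variants discussed: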
| Package | Insertion | Fetching | Remarks |
|---|---|---|---|
| hash.R | 0.23 msecs | 0.010 msecs | R environment |
| hash | 0.12 msecs | 0.005 msecs | R environment, S4 |
| filehash | 0.50 msecs | 0.008 msecs | file-based, S4 |
| fastmap | 0.18 msecs | 0.005 msecs | C++ implementation, S3 |
| rredis | 0.65 msecs | 0.413 msecs | key-value database |
| RcppRedis | 0.19 msecs | 0.048 msecs | key-value database, S3 |
| PostgreSQL | ?? msecs | ?? msecs | relational database |
PostgreSQL is a relational database, but it has a special module for storing time series. It might be a reasonable alternative, one that I haven’t dared to try yet.
Roger Peng’s filehash vignette (April 2019)
https://cran.r-project.org/web/packages/filehash/vignettes/filehash.pdf
Winston Chang’s fastmap documentation (January 2021)
https://r-lib.github.io/fastmap/
redis key-value database, version 6.2.1 (March 2021)
Homepage: https://redis.io/
The rredis package vignette (July 2015)
https://cran.r-project.org/web/packages/rredis/vignettes/rredis.pdf
Jeffrey Horner. “Hash Table Performance in R: Parts I-IV”. R-bloggers 2015.
Part I | Part II | Part III | Part IV