rmongodb

Introduction to the rmongodb package

Intro to rmongodb vignette, somewhat adapted for my environment.

MongoDB (www.mongodb.org) is a scalable, high-performance, document-oriented NoSQL database. The rmongodb package provides an interface between the statistical software R (www.r-project.org) and MongoDB, built on the MongoDB C driver.

Installing and loading the rmongodb package

A stable version of rmongodb is available on CRAN:

install.packages("rmongodb")

You can also install the latest development release from GitHub. Current versions of devtools expect the repository in "user/repo" form.

library(devtools)
install_github("mongosoup/rmongodb")

Load the package as you would any other package.

library("rmongodb")

Connecting R to MongoDB

First, create a connection to a MongoDB instance. Called with no arguments, mongo.create connects to localhost.

help("mongo.create")

## starting httpd help server ... done

mongo <- mongo.create()
mongo

## [1] 0
## attr(,"mongo")
## <pointer: 0x00000000077b4a38>
## attr(,"class")
## [1] "mongo"
## attr(,"host")
## [1] "127.0.0.1"
## attr(,"name")
## [1] ""
## attr(,"username")
## [1] ""
## attr(,"password")
## [1] ""
## attr(,"db")
## [1] "admin"
## attr(,"timeout")
## [1] 0

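Typing a function name without parentheses prints its definition. To test the connection, call mongo.is.connected(mongo), as the later examples do.

mongo.is.connected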

## function (mongo) 
## .Call(".mongo.is.connected", mongo)
## <environment: namespace:rmongodb>
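
As a minimal sketch of connecting to an explicitly named host, the helper below passes a host to mongo.create and stops with the driver's error code if the connection fails. The name connect_or_stop and the host value are assumptions made for illustration; mongo.get.err is the rmongodb function that reports the connection error code.

```r
# Hypothetical helper: connect to an explicit host and fail loudly.
connect_or_stop <- function(host = "127.0.0.1") {
    conn <- rmongodb::mongo.create(host = host)
    if (!rmongodb::mongo.is.connected(conn)) {
        stop("could not connect to ", host,
             " (error code ", rmongodb::mongo.get.err(conn), ")")
    }
    conn
}

# Usage (requires a running MongoDB instance):
# mongo <- connect_or_stop("127.0.0.1")
```

Failing early keeps the later examples from silently doing nothing when their connection check is FALSE.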

Getting Databases and Collections

Get all databases of your MongoDB connection:

if (mongo.is.connected(mongo) == TRUE) {
    mongo.get.databases(mongo)
}

## [1] "calibre"  "census"   "Fss06hid" "learn"    "names"    "people"  
## [7] "test"

Get all collections in a specific database of the local MongoDB instance, in this case “census”:

if (mongo.is.connected(mongo) == TRUE) {
    db <- "census"
    mongo.get.database.collections(mongo, db)
}

## [1] "census.cities"          "census.states"         
## [3] "census.[object Object]" "census.zips"

coll <- "census.zips"
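
Collections are addressed by their full namespace, "database.collection", which is why coll is set to "census.zips" rather than "zips". The sketch below shows one way to build a namespace and to strip the database prefix again, using plain base R. The namespaces vector is copied by hand from the output above, and the variable bare is introduced only for illustration.

```r
# Namespaces as returned by mongo.get.database.collections() above
namespaces <- c("census.cities", "census.states", "census.zips")

# Build a namespace from its parts
db <- "census"
coll <- paste(db, "zips", sep = ".")

# Strip the "census." prefix to get bare collection names
bare <- substring(namespaces, nchar(db) + 2)

print(coll)
print(bare)
```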

We will use the ‘zips’ collection in the following examples. The ‘zips’ collection holds the MongoDB example data set called “Zip Code Data Set” (http://docs.mongodb.org/manual/tutorial/aggregation-zip-code-data-set/). This data set is available as JSON and contains zip code data from the US.

Getting the size of collections, a sample document and values for a key

Use mongo.count to count the number of documents in a collection.

if (mongo.is.connected(mongo) == TRUE) {
    help("mongo.count")
    mongo.count(mongo, coll)
}

## [1] 29353

We can run some queries. The command mongo.find.one gets one document from the collection.

if (mongo.is.connected(mongo) == TRUE) {
    mongo.find.one(mongo, coll)
}

##  _id : 2      01001
##  city : 2     AGAWAM
##  loc : 4      
##      0 : 1    -72.622739
##      1 : 1    42.070206
## 
##  pop : 16     15338
##  state : 2    MA

The command mongo.distinct returns the distinct values for a given key.

if (mongo.is.connected(mongo) == TRUE) {
    res <- mongo.distinct(mongo, coll, "city")
}

Finding some first data

Now we can run the first queries on our MongoDB. In this case we ask for one document providing zip code data for the city, “COLORADO CITY”. The output of mongo.find.one is a BSON object, which cannot be used directly for further analysis in R. Using the command mongo.bson.to.list, an R object will be created from the BSON object.

if (mongo.is.connected(mongo) == TRUE) {
    cityone <- mongo.find.one(mongo, coll, "{\"city\":\"COLORADO CITY\"}")
    print(cityone)
    # keep the converted list; cityone itself stays a mongo.bson object
    cityone.list <- mongo.bson.to.list(cityone)
    class(cityone)
}

##  _id : 2      79512
##  city : 2     COLORADO CITY
##  loc : 4      
##      0 : 1    -100.860948
##      1 : 1    32.398736
## 
##  pop : 16     5874
##  state : 2    TX

## [1] "mongo.bson"
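
The conversion can be tried without a database connection. The sketch below builds a BSON object from a JSON string and converts it with mongo.bson.to.list, after which the fields are ordinary list elements. The document is a hand-written subset of the one printed above, and the names doc and doc_list are assumptions for illustration. The block does nothing if rmongodb is not installed.

```r
# Runs only if rmongodb is installed; no MongoDB server is needed here.
if (requireNamespace("rmongodb", quietly = TRUE)) {
    doc <- rmongodb::mongo.bson.from.JSON(
        '{"city":"COLORADO CITY", "pop":5874, "state":"TX"}')
    doc_list <- rmongodb::mongo.bson.to.list(doc)

    print(class(doc_list))   # class of the converted object
    print(doc_list$city)     # fields are accessed like any list element
    print(doc_list$pop)
}
```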

Creating BSON objects

Until recently, working with rmongodb was all about creating BSON objects in R, a tedious and error-prone task. Since rmongodb version 1.2 you can use JSON directly, which brings the syntax of the package much closer to the MongoDB shell. The older, buffer-based approach looks like this:

buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "city", "COLORADO CITY")

## [1] TRUE

query <- mongo.bson.from.buffer(buf)
query

##  city : 2     COLORADO CITY

The same BSON object can be created with one line of code and JSON:

mongo.bson.from.JSON("{\"city\":\"COLORADO CITY\"}")

##  city : 2     COLORADO CITY
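
The backslash escapes in these JSON strings are only needed because the R string is delimited by double quotes. As a small sketch, the same string can be written with single quotes, and a helper can fill in the city name. The helper city_query is hypothetical and assumes the name contains no double quotes or backslashes.

```r
# Two spellings of the same R string
escaped <- "{\"city\":\"COLORADO CITY\"}"
single  <- '{"city":"COLORADO CITY"}'
stopifnot(identical(escaped, single))

# Hypothetical helper: build the same query for any city name.
# Assumes the name contains no double quotes or backslashes.
city_query <- function(city) sprintf('{"city":"%s"}', city)

cat(city_query("COLORADO CITY"), "\n")
```

The resulting string can be passed wherever the examples here pass a JSON query, for instance to mongo.bson.from.JSON.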

Finding more data

For real analyses it is important to get more than one document from MongoDB. As an example, we first use the command mongo.distinct to get an overview of the population distribution. Then we look for all entries with at most two inhabitants (errors in the data set?).

if (mongo.is.connected(mongo) == TRUE) {
    pop <- mongo.distinct(mongo, coll, "pop")
    hist(pop)
    boxplot(pop, main = "City Populations")
    
    nr <- mongo.count(mongo, coll, "{\"pop\":{\"$lte\":2}}")
    print(nr)
    pops <- mongo.find.all(mongo, coll, "{\"pop\":{\"$lte\":2}}")
    print(dim(pops))
    head(pop)
}

[Figure: histogram of pop]

## [1] 85

## Warning: This fails for most NoSQL data structures. I am working on a new
## solution

[Figure: boxplot titled "City Populations"]

## [1] 85  5

## [1] 15338 36963  4546 10579  1240  3706

Finding more data with a more complex query

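The analysis gets more interesting with a more complex query that combines two conditions. Using the package jsonlite you can check and pretty-print your JSON syntax first; afterwards we query MongoDB with it. Note that this query repeats the key pop, and JSON parsers differ in how they treat duplicate keys, so the single-key form {"pop":{"$gte":1,"$lte":2}} is the more portable way to express the same range.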

library("jsonlite")

## 
## Attaching package: 'jsonlite'
## 
## The following object is masked from 'package:utils':
## 
##     View

json <- "{\"pop\":{\"$lte\":2}, \"pop\":{\"$gte\":1}}"
cat(prettify(json))

## {
##     "pop": {
##         "$lte": 2
##     },
##     "pop": {
##         "$gte": 1
##     }
## }

validate(json)

## [1] TRUE
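
A range condition on one field can also be written with a single key that carries both operators. The sketch below builds that form with sprintf. The helper pop_range_query and the variable json_range are hypothetical names, and the jsonlite check runs only if that package is installed.

```r
# Hypothetical helper: one "pop" key carrying both bounds.
pop_range_query <- function(lower, upper) {
    sprintf('{"pop":{"$gte":%d,"$lte":%d}}',
            as.integer(lower), as.integer(upper))
}

json_range <- pop_range_query(1, 2)
cat(json_range, "\n")

# Optional syntax check, only if jsonlite is installed
if (requireNamespace("jsonlite", quietly = TRUE)) {
    stopifnot(isTRUE(jsonlite::validate(json_range)))
}
```

Keeping both bounds under one key avoids relying on how a parser resolves repeated keys.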

if (mongo.is.connected(mongo) == TRUE) {
    pops <- mongo.find.all(mongo, coll, json)
    print(dim(pops))
    head(pops)
}

## Warning: This fails for most NoSQL data structures. I am working on a new
## solution

## [1] 19  5

##     _id     city               loc       pop state
## val "11042" "NEW HYDE PARK"    Numeric,2 1   "NY" 
## val "21240" "BALTIMORE"        Numeric,2 1   "MD" 
## val "24950" "KIEFFER"          Numeric,2 2   "WV" 
## val "26818" "FISHER"           Numeric,2 1   "WV" 
## val "32815" "KENNEDY SPACE CE" Numeric,2 1   "FL" 
## val "38132" "MEMPHIS"          Numeric,2 2   "TN"

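In this case you will get a warning because the construction of the data.frame object is still inefficient and error-prone due to the NoSQL data structure. We are currently working on a solution. The warning is raised inside mongo.cursor.to.data.frame, whose source shows the row-by-row construction: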

mongo.cursor.to.data.frame

## function (cursor, nullToNA = TRUE, ...) 
## {
##     warning("This fails for most NoSQL data structures. I am working on a new solution")
##     res <- data.frame()
##     while (mongo.cursor.next(cursor)) {
##         val <- mongo.bson.to.list(mongo.cursor.value(cursor))
##         if (nullToNA == TRUE) 
##             val[sapply(val, is.null)] <- NA
##         val <- val[sapply(val, class) != "mongo.oid"]
##         res <- rbind.fill(res, as.data.frame(val, ...))
##     }
##     return(as.data.frame(res))
## }
## <environment: namespace:rmongodb>
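
When the document layout is known in advance, one alternative is to collect the documents as R lists and build the data.frame by hand. The sketch below is an illustration, not the package's method: fetch_docs and zips_to_df are hypothetical helpers, the first using the same cursor functions that appear in the source above plus mongo.find and mongo.cursor.destroy, the second assuming every document has the fields of the zips collection. The demo runs on a hand-written document, so no connection is needed.

```r
# Hypothetical helper: fetch all matching documents as plain R lists.
fetch_docs <- function(mongo, ns, query_json) {
    query  <- rmongodb::mongo.bson.from.JSON(query_json)
    cursor <- rmongodb::mongo.find(mongo, ns, query)
    on.exit(rmongodb::mongo.cursor.destroy(cursor), add = TRUE)
    docs <- list()
    while (rmongodb::mongo.cursor.next(cursor)) {
        docs[[length(docs) + 1]] <-
            rmongodb::mongo.bson.to.list(rmongodb::mongo.cursor.value(cursor))
    }
    docs
}

# Hypothetical helper: flatten zip-code documents into a data.frame,
# splitting the two-element loc array into lon and lat columns.
zips_to_df <- function(docs) {
    rows <- lapply(docs, function(d) {
        data.frame(id = d[["_id"]], city = d[["city"]],
                   lon = d[["loc"]][[1]], lat = d[["loc"]][[2]],
                   pop = d[["pop"]], state = d[["state"]],
                   stringsAsFactors = FALSE)
    })
    do.call(rbind, rows)
}

# Demo on a hand-written document shaped like the find.one output
demo <- list(list(`_id` = "01001", city = "AGAWAM",
                  loc = c(-72.622739, 42.070206),
                  pop = 15338, state = "MA"))
print(zips_to_df(demo))

# With a live connection:
# zips_to_df(fetch_docs(mongo, coll, '{"pop":{"$lte":2}}'))
```

Naming the columns explicitly means a document with an unexpected shape produces an error rather than a silently malformed table.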

Inserting some data into MongoDB

You can also insert data into MongoDB from R. Here we create three BSON documents from JSON and insert them in one batch into a new collection, census.test.

a <- mongo.bson.from.JSON("{\"ident\":\"a\", \"name\":\"Markus\", \"age\":33}")
b <- mongo.bson.from.JSON("{\"ident\":\"b\", \"name\":\"MongoSoup\", \"age\":1}")
c <- mongo.bson.from.JSON("{\"ident\":\"c\", \"name\":\"UseR\", \"age\":18}")

if (mongo.is.connected(mongo) == TRUE) {
    icoll <- paste(db, "test", sep = ".")
    mongo.insert.batch(mongo, icoll, list(a, b, c))
    
    dbs <- mongo.get.database.collections(mongo, db)
    print(dbs)
    mongo.find.all(mongo, icoll)
}

## [1] "census.cities"          "census.states"         
## [3] "census.[object Object]" "census.zips"           
## [5] "census.test"

## Warning: This fails for most NoSQL data structures. I am working on a new
## solution

##     _id       ident name        age
## val 73206816  "a"   "Markus"    33 
## val 0         "b"   "MongoSoup" 1  
## val 123655216 "c"   "UseR"      18
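
If the data to insert already lives in a data.frame, each row can be turned into a named list and then into a BSON document with mongo.bson.from.list. The sketch below is one way to do that, under the assumption that every column holds simple scalar values. The names people, row_lists and docs are introduced for illustration, and the BSON step runs only if rmongodb is installed.

```r
people <- data.frame(ident = c("a", "b", "c"),
                     name  = c("Markus", "MongoSoup", "UseR"),
                     age   = c(33, 1, 18),
                     stringsAsFactors = FALSE)

# One named list per row
row_lists <- lapply(seq_len(nrow(people)),
                    function(i) as.list(people[i, , drop = FALSE]))
str(row_lists[[2]])

# With rmongodb installed, each list can become a BSON document
if (requireNamespace("rmongodb", quietly = TRUE)) {
    docs <- lapply(row_lists, rmongodb::mongo.bson.from.list)
    # mongo.insert.batch(mongo, icoll, docs)   # needs a live connection
}
```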

Updating documents and creating indices for efficient queries

You can also update your data in MongoDB from R and add indices for more efficient queries. The example below increments the age of the document with ident "b" by 3 and then creates an index on ident.

if (mongo.is.connected(mongo) == TRUE) {
    mongo.update(mongo, icoll, "{\"ident\":\"b\"}", "{\"$inc\":{\"age\":3}}")
    
    res <- mongo.find.all(mongo, icoll)
    print(res)
    
    # Creating an index for field 'ident'
    mongo.index.create(mongo, icoll, "{\"ident\":1}")
    # To verify, run in the mongo shell: db.test.getIndexes()
}

## Warning: This fails for most NoSQL data structures. I am working on a new
## solution

##     _id      ident name        age
## val 15925248 "a"   "Markus"    33 
## val 0        "b"   "MongoSoup" 4  
## val 65536    "c"   "UseR"      18

## NULL
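
Besides $inc, an update can set a field outright with $set, and the flags argument of mongo.update can request an upsert, which inserts the document when no match exists. The sketch below illustrates this under stated assumptions: age_update_docs and upsert_age are hypothetical helpers, and ident is assumed to contain no double quotes or backslashes.

```r
# Hypothetical helper: build the criteria and update documents as JSON.
age_update_docs <- function(ident, age) {
    list(criteria = sprintf('{"ident":"%s"}', ident),
         update   = sprintf('{"$set":{"age":%d}}', as.integer(age)))
}

# Hypothetical helper: set the age, inserting the document if absent.
upsert_age <- function(mongo, ns, ident, age) {
    d <- age_update_docs(ident, age)
    rmongodb::mongo.update(mongo, ns, d$criteria, d$update,
                           flags = rmongodb::mongo.update.upsert)
}

str(age_update_docs("b", 4))

# With a live connection:
# upsert_age(mongo, icoll, "d", 25)
```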

Dropping/removing collections and databases and closing the connection to MongoDB

Of course there are also commands to drop databases and collections in MongoDB. After you have finished all your analyses, it’s a good idea to destroy the connection to your MongoDB.

if (mongo.is.connected(mongo) == TRUE) {
    mongo.drop(mongo, icoll)
    # mongo.drop.database(mongo, db)
    res <- mongo.get.database.collections(mongo, db)
    print(res)
    
    # always close your connection when finished
    # mongo.command(mongo, db, getIndexes())
    mongo.destroy(mongo)
}

## [1] "census.cities"          "census.states"         
## [3] "census.[object Object]" "census.zips"

## NULL
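
To make sure the connection is closed even when an analysis step fails, the cleanup can be registered with on.exit. The sketch below wraps this in a hypothetical helper, with_mongo, whose create and destroy arguments default to the rmongodb functions but can be replaced, which is how the demo runs without a database.

```r
# Hypothetical helper: run f(conn) and always destroy the connection.
with_mongo <- function(f,
                       create  = rmongodb::mongo.create,
                       destroy = rmongodb::mongo.destroy) {
    conn <- create()
    on.exit(destroy(conn), add = TRUE)
    f(conn)
}

# Demo with stand-in functions instead of a real connection
state <- new.env()
state$closed <- FALSE
result <- with_mongo(function(conn) conn + 1,
                     create  = function() 41,
                     destroy = function(conn) state$closed <- TRUE)
print(result)
print(state$closed)

# With a live MongoDB instance:
# with_mongo(function(m) mongo.count(m, "census.zips"))
```

Because default arguments are evaluated lazily, the rmongodb functions are only looked up when no replacements are supplied.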