MongoDB (www.mongodb.org) is a scalable, high-performance, document-oriented NoSQL database. The rmongodb package provides an interface from the statistical software R (www.r-project.org) to MongoDB and back, using the MongoDB C driver library.
A stable version of rmongodb is available on CRAN:
install.packages("rmongodb")
You can also install the latest development release from GitHub:
library(devtools)
install_github("mongosoup/rmongodb")
Load the package as you would any other package.
library("rmongodb")
First, create a connection to a MongoDB instance. If no parameters are given, mongo.create connects to localhost.
help("mongo.create")
## starting httpd help server ... done
mongo <- mongo.create()
mongo
## [1] 0
## attr(,"mongo")
## <pointer: 0x00000000077b4a38>
## attr(,"class")
## [1] "mongo"
## attr(,"host")
## [1] "127.0.0.1"
## attr(,"name")
## [1] ""
## attr(,"username")
## [1] ""
## attr(,"password")
## [1] ""
## attr(,"db")
## [1] "admin"
## attr(,"timeout")
## [1] 0
mongo.is.connected
## function (mongo)
## .Call(".mongo.is.connected", mongo)
## <environment: namespace:rmongodb>
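The defaults can also be written out explicitly. The following is a minimal sketch, assuming the argument names host, db and timeout match the attributes printed above; the variable names conn_args and mongo_explicit are only illustrative. The connection attempt is skipped when rmongodb is not installed.

```r
# Connection settings written out explicitly; the values mirror the
# attributes printed for the default connection above.
conn_args <- list(host = "127.0.0.1", db = "admin", timeout = 0L)

if (requireNamespace("rmongodb", quietly = TRUE)) {
  # Open a second, explicitly configured connection
  mongo_explicit <- do.call(rmongodb::mongo.create, conn_args)
  # Report whether the server could be reached
  print(rmongodb::mongo.is.connected(mongo_explicit))
  # Release the connection again
  rmongodb::mongo.destroy(mongo_explicit)
} else {
  message("rmongodb is not installed; skipping the connection attempt")
}
```

To reach a remote server, replace the host value with that server's address.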
Get all databases of your MongoDB connection:
if (mongo.is.connected(mongo) == TRUE) {
mongo.get.databases(mongo)
}
## [1] "calibre" "census" "Fss06hid" "learn" "names" "people"
## [7] "test"
Get all collections in a specific database of the local MongoDB instance (in this case, “census”):
if (mongo.is.connected(mongo) == TRUE) {
db <- "census"
mongo.get.database.collections(mongo, db)
}
## [1] "census.cities" "census.states"
## [3] "census.[object Object]" "census.zips"
coll <- "census.zips"
We will use the ‘zips’ collection in the following examples. The ‘zips’ collection holds the MongoDB example data set called “Zip Code Data Set” (http://docs.mongodb.org/manual/tutorial/aggregation-zip-code-data-set/). This data set is available as JSON and contains zip code data from the US.
Use mongo.count to count the number of documents in a collection.
if (mongo.is.connected(mongo) == TRUE) {
help("mongo.count")
mongo.count(mongo, coll)
}
## [1] 29353
We can run some queries. The command mongo.find.one gets one document from the collection.
if (mongo.is.connected(mongo) == TRUE) {
mongo.find.one(mongo, coll)
}
## _id : 2 01001
## city : 2 AGAWAM
## loc : 4
## 0 : 1 -72.622739
## 1 : 1 42.070206
##
## pop : 16 15338
## state : 2 MA
The command mongo.distinct provides a list of all values for a given key.
if (mongo.is.connected(mongo) == TRUE) {
res <- mongo.distinct(mongo, coll, "city")
}
Now we can run a first filtered query against MongoDB. In this case we ask for one document providing zip code data for the city “COLORADO CITY”. The output of mongo.find.one is a BSON object, which cannot be used directly for further analysis in R. The command mongo.bson.to.list creates an R list from the BSON object.
if (mongo.is.connected(mongo) == TRUE) {
cityone <- mongo.find.one(mongo, coll, "{\"city\":\"COLORADO CITY\"}")
print(cityone)
mongo.bson.to.list(cityone)
class(cityone)
}
## _id : 2 79512
## city : 2 COLORADO CITY
## loc : 4
## 0 : 1 -100.860948
## 1 : 1 32.398736
##
## pop : 16 5874
## state : 2 TX
## [1] "mongo.bson"
Until recently, working with rmongodb was all about creating BSON objects in R, a difficult and error-prone task. Since rmongodb version 1.2 you can use JSON directly, and the syntax of the rmongodb package is much closer to that of the MongoDB shell. The older, buffer-based way of building a query looks like this:
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "city", "COLORADO CITY")
## [1] TRUE
query <- mongo.bson.from.buffer(buf)
query
## city : 2 COLORADO CITY
The same BSON object can be created with one line of code and JSON:
mongo.bson.from.JSON("{\"city\":\"COLORADO CITY\"}")
## city : 2 COLORADO CITY
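Escaping every double quote quickly becomes tedious. As a minimal sketch in plain R, single-quoted string literals let you write the same JSON without backslashes, and sprintf can fill in a value. The helper name city_query is hypothetical and not part of rmongodb, and it assumes the value contains no double quotes or backslashes.

```r
# The same query string, once with escaped quotes and once in single quotes
escaped <- "{\"city\":\"COLORADO CITY\"}"
plain   <- '{"city":"COLORADO CITY"}'
identical(escaped, plain)

# Hypothetical helper (not part of rmongodb): build the query for any city.
# Assumes the city name contains no double quotes or backslashes.
city_query <- function(city) {
  sprintf('{"city":"%s"}', city)
}

query_json <- city_query("COLORADO CITY")
cat(query_json, "\n")
```

The resulting string can be passed to mongo.bson.from.JSON, or used directly as the query argument of mongo.find.one as in the example further up.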
For real analyses it is important to get more than one document from MongoDB. As an example, we first use the command mongo.distinct to get an overview of the population distribution. Then we look for all cities with two or fewer inhabitants (errors in the data set?).
if (mongo.is.connected(mongo) == TRUE) {
pop <- mongo.distinct(mongo, coll, "pop")
hist(pop)
boxplot(pop, main = "City Populations")
nr <- mongo.count(mongo, coll, "{\"pop\":{\"$lte\":2}}")
print(nr)
pops <- mongo.find.all(mongo, coll, "{\"pop\":{\"$lte\":2}}")
print(dim(pops))
head(pop)
}
## [1] 85
## Warning: This fails for most NoSQL data structures. I am working on a new
## solution
## [1] 85 5
## [1] 15338 36963 4546 10579 1240 3706
The analysis gets more interesting with a more complex query that combines two conditions. Using the package jsonlite you can check and visualize your JSON syntax first. Afterwards we query MongoDB with this JSON query.
library("jsonlite")
##
## Attaching package: 'jsonlite'
##
## The following object is masked from 'package:utils':
##
## View
json <- "{\"pop\":{\"$lte\":2}, \"pop\":{\"$gte\":1}}"
cat(prettify(json))
## {
## "pop": {
## "$lte": 2
## },
## "pop": {
## "$gte": 1
## }
## }
validate(json)
## [1] TRUE
if (mongo.is.connected(mongo) == TRUE) {
pops <- mongo.find.all(mongo, coll, json)
print(dim(pops))
head(pops)
}
## Warning: This fails for most NoSQL data structures. I am working on a new
## solution
## [1] 19 5
## _id city loc pop state
## val "11042" "NEW HYDE PARK" Numeric,2 1 "NY"
## val "21240" "BALTIMORE" Numeric,2 1 "MD"
## val "24950" "KIEFFER" Numeric,2 2 "WV"
## val "26818" "FISHER" Numeric,2 1 "WV"
## val "32815" "KENNEDY SPACE CE" Numeric,2 1 "FL"
## val "38132" "MEMPHIS" Numeric,2 2 "TN"
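Note that the JSON query above repeats the key “pop”. JSON parsers are not required to treat duplicate keys consistently, so the query works here but depends on that behaviour. MongoDB's query syntax also accepts both operators inside a single sub-document, which avoids the duplicate key. A minimal sketch in plain R follows; the helper range_query is hypothetical and not part of rmongodb.

```r
# Hypothetical helper (not part of rmongodb): build a closed-range query
# on one field, with both operators inside a single sub-document.
range_query <- function(field, lower, upper) {
  sprintf('{"%s":{"$gte":%d,"$lte":%d}}', field, lower, upper)
}

# Population between 1 and 2, inclusive
json_range <- range_query("pop", 1L, 2L)
cat(json_range, "\n")
```

The string in json_range can be used in place of json in the mongo.find.all call above.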
The warning in the output above appears because the construction of the data.frame object is still inefficient and error-prone due to the NoSQL data structure. We are currently working on a solution. The current implementation looks like this:
mongo.cursor.to.data.frame
## function (cursor, nullToNA = TRUE, ...)
## {
## warning("This fails for most NoSQL data structures. I am working on a new solution")
## res <- data.frame()
## while (mongo.cursor.next(cursor)) {
## val <- mongo.bson.to.list(mongo.cursor.value(cursor))
## if (nullToNA == TRUE)
## val[sapply(val, is.null)] <- NA
## val <- val[sapply(val, class) != "mongo.oid"]
## res <- rbind.fill(res, as.data.frame(val, ...))
## }
## return(as.data.frame(res))
## }
## <environment: namespace:rmongodb>
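In the meantime, one option is to flatten the documents by hand. The sketch below assumes each document has already been converted with mongo.bson.to.list and has the shape of the ‘zips’ records. The two documents are typed in from the outputs shown earlier, flatten_doc is a hypothetical helper, and reading loc as longitude followed by latitude is an assumption based on the values.

```r
# Two documents shaped like the results of mongo.bson.to.list,
# copied from the AGAWAM and COLORADO CITY outputs shown earlier
docs <- list(
  list(`_id` = "01001", city = "AGAWAM",
       loc = c(-72.622739, 42.070206), pop = 15338, state = "MA"),
  list(`_id` = "79512", city = "COLORADO CITY",
       loc = c(-100.860948, 32.398736), pop = 5874, state = "TX")
)

# Hypothetical helper: turn one document into a one-row data.frame,
# splitting the nested 'loc' array into two plain numeric columns
flatten_doc <- function(doc) {
  data.frame(
    id    = doc[["_id"]],
    city  = doc$city,
    lon   = doc$loc[1],  # assumed to be the longitude
    lat   = doc$loc[2],  # assumed to be the latitude
    pop   = doc$pop,
    state = doc$state,
    stringsAsFactors = FALSE
  )
}

# Stack the one-row data.frames into a single table
zips_df <- do.call(rbind, lapply(docs, flatten_doc))
str(zips_df)
```

This only works when every document has the same fields; documents with missing or additional keys would need extra handling.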
Another interesting point is inserting data into MongoDB.
a <- mongo.bson.from.JSON("{\"ident\":\"a\", \"name\":\"Markus\", \"age\":33}")
b <- mongo.bson.from.JSON("{\"ident\":\"b\", \"name\":\"MongoSoup\", \"age\":1}")
c <- mongo.bson.from.JSON("{\"ident\":\"c\", \"name\":\"UseR\", \"age\":18}")
if (mongo.is.connected(mongo) == TRUE) {
icoll <- paste(db, "test", sep = ".")
mongo.insert.batch(mongo, icoll, list(a, b, c))
dbs <- mongo.get.database.collections(mongo, db)
print(dbs)
mongo.find.all(mongo, icoll)
}
## [1] "census.cities" "census.states"
## [3] "census.[object Object]" "census.zips"
## [5] "census.test"
## Warning: This fails for most NoSQL data structures. I am working on a new
## solution
## _id ident name age
## val 73206816 "a" "Markus" 33
## val 0 "b" "MongoSoup" 1
## val 123655216 "c" "UseR" 18
You can also update your data in MongoDB from R and add indices for more efficient queries. The example below increments the age of one document by 3 and then creates an index.
if (mongo.is.connected(mongo) == TRUE) {
mongo.update(mongo, icoll, "{\"ident\":\"b\"}", "{\"$inc\":{\"age\":3}}")
res <- mongo.find.all(mongo, icoll)
print(res)
# Creating an index for field 'ident'
mongo.index.create(mongo, icoll, "{\"ident\":1}")
# check mongo shell!
#> db.test.getIndexes()
}
## Warning: This fails for most NoSQL data structures. I am working on a new
## solution
## _id ident name age
## val 15925248 "a" "Markus" 33
## val 0 "b" "MongoSoup" 4
## val 65536 "c" "UseR" 18
## NULL
Of course there are also commands to drop databases and collections in MongoDB. After you have finished all your analyses, it’s a good idea to destroy the connection to your MongoDB.
if (mongo.is.connected(mongo) == TRUE) {
mongo.drop(mongo, icoll)
# mongo.drop.database(mongo, db)
res <- mongo.get.database.collections(mongo, db)
print(res)
# always close your connection when finished
# mongo.command(mongo, db, getIndexes())
mongo.destroy(mongo)
}
## [1] "census.cities" "census.states"
## [3] "census.[object Object]" "census.zips"
## NULL
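To make sure the connection is released even when an analysis step fails, the cleanup can be registered with on.exit. The following is a minimal sketch in plain R; with_connection is a hypothetical helper and not part of rmongodb, and the stand-in functions only imitate opening and closing a connection so that the example runs without a database.

```r
# Hypothetical helper: open a connection, run some work with it, and
# always close it afterwards, even if the work raises an error.
with_connection <- function(connect, disconnect, work) {
  conn <- connect()
  on.exit(disconnect(conn), add = TRUE)
  work(conn)
}

# Stand-ins for a real connection, used only for illustration
closed <- FALSE
result <- with_connection(
  connect    = function() "fake-connection",
  disconnect = function(conn) closed <<- TRUE,
  work       = function(conn) nchar(conn)
)

print(result)
print(closed)
```

With rmongodb, mongo.create and mongo.destroy would take the place of the stand-ins, and the work function could call, for example, mongo.count on the connection it receives.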