R Statistics with MongoDB

Dr. Markus Schmidberger
October 14th, 2013 Munich, Germany

Email: markus@mongosoup.de
Twitter: @cloudHPC

Outline

  • Introduction to Big Data, MongoSoup and R
  • R statistics with MongoDB and Examples
  • Summary & Questions

Big Data

  • Wikipedia: … a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing. …
    • storing
    • processing

Storing: NoSQL - MongoDB

  • databases using looser consistency models to store data
  • German MongoDB as a Service: MongoSoup
    • cloudControl Add-On
    • currently running on AWS EU-Region (Ireland)
    • all features available: shared / dedicated hosting, replica set, sharding
    • 24/7 support available

MongoSoup in < 5 min

  • go to cloudControl: www.cloudcontrol.com
  • add an account and a billing address
  • create a new app, e.g. “rmongodb”
  • install cloudControl command line tools: cctrlapp
  • enable your preferred MongoSoup hosting: cctrlapp rmongodb/default addon.add mongosoup.medium
  • go to the cloudControl Web-Console-AddOns and get your credentials

https://www.cloudcontrol.com/console/app/rmongodb

Processing: Analyzing with R and Hadoop

  • backward-looking analysis is outdated
  • today: quasi real-time analysis
  • tomorrow: forward-looking predictive analysis

  • more complex methods, more data available, more processing time required

  • Check my Strata London Tutorial “Big Data Analyses with R”

Introduction to R

  • R is a free software environment for statistical computing and graphics
  • offers tools to manage and analyze data
  • standard statistical methods are implemented
  • compiles and runs on different operating systems
  • support via huge community

www.r-project.org

  • huge online libraries with > 5000 R packages
  • possibility to write personalized code and to contribute new packages
  • widely known since January 6, 2009: The New York Times, “Data Analysts Captivated by R's Power”

RStudio IDE

R as calculator

(5+5) - 1 * 3
[1] 7
x <- 3
x
[1] 3
x^2 + 4
[1] 13
y <- c(1,2,3)
y
[1] 1 2 3
x <- 1:10
x
 [1]  1  2  3  4  5  6  7  8  9 10
x < 5
 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE
x[3:7]
[1] 3 4 5 6 7
mean(x)
[1] 5.5
help("mean")
?mean
library(onion)
data(bunny)
head(bunny, n=3)
            x      y        z
[1,] -0.03783 0.1279 0.004475
[2,] -0.04478 0.1289 0.001905
[3,] -0.06801 0.1512 0.037195
p3d(bunny,theta=3,
    phi=104,box=FALSE)

[Figure: 3D plot of the bunny data set]
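
The comparison and indexing steps above can be combined into logical subsetting. A minimal sketch in base R; the variable name below_five is chosen here for illustration only:

```r
x <- 1:10

# x < 5 yields a logical vector; using it as an index keeps the TRUE positions
below_five <- x[x < 5]
print(below_five)

# logical values count as 0/1, so sum() counts the matching elements
n_below <- sum(x < 5)
print(n_below)

# arithmetic is vectorized: every element of y is doubled
y <- c(1, 2, 3)
print(y * 2)

# statistics can be computed on the subset directly
print(mean(below_five))
```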

Many Statistical Functions

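# dat: numeric matrix with 100 rows and 2 columns (its creation is not shown on the slide)
cl <- kmeans(dat, 4)
cl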
K-means clustering with 4 clusters of sizes 21, 18, 30, 31

Cluster means:
     [,1]    [,2]
1  0.7755  0.8509
2 -0.1557 -0.2305
3  1.2299  1.1472
4  0.1510  0.1507

Clustering vector:
  [1] 4 2 4 4 2 4 4 4 2 4 4 4 2 2 4 4 1 4 2 2 2 4 4 4 2 4 2 4 4 2 4 2 2 4 4
 [36] 4 4 4 4 4 4 4 4 2 4 2 2 4 2 2 1 1 1 1 3 1 3 3 3 1 1 3 3 3 3 1 3 1 3 3
 [71] 1 3 1 1 3 3 3 3 1 1 3 3 1 1 1 3 3 3 3 1 3 1 3 3 3 3 1 3 3 3

Within cluster sum of squares by cluster:
[1] 3.318 1.166 4.019 3.195
 (between_SS / total_SS =  83.0 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"        
plot(dat, col = cl$cluster, cex=2, pch=16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)

[Figure: scatter plot of dat colored by cluster, with the four cluster centers marked]
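
The slide does not show how dat was created. A reproducible sketch, assuming two Gaussian point clouds around (0, 0) and (1, 1); the seed, means and standard deviation are assumptions, so the numbers will differ from the output above:

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible

# assumed test data: 50 points around (0, 0) and 50 points around (1, 1)
dat <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)
)

# k-means with 4 clusters; the result is stored so it can be reused for plotting
cl <- kmeans(dat, centers = 4)

print(cl$size)     # number of points per cluster
print(cl$centers)  # one row of coordinates per cluster center
```

With cl stored this way, the two plot calls from the slide (plot(dat, col = cl$cluster, ...) and points(cl$centers, ...)) can be run unchanged.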

R Shiny - easy web application

  • developed by RStudio
  • turns R analyses into interactive web applications that anyone can use
  • let your users choose input parameters using friendly controls like sliders, drop-downs, and text fields
  • easily incorporate any number of outputs like plots, tables, and summaries
  • no HTML or JavaScript knowledge is necessary, only R

http://www.rstudio.com/shiny/

R and Databases

  • SQL provides a standard language to filter, aggregate, group, sort data
  • SQL in new places: Hive, Impala, …
  • ODBC provides SQL interface to non-database data (Excel, CSV, text files)
  • R stores relational data in data.frames (extended lists)
data(iris)
head(iris, n=3)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
class(iris)
[1] "data.frame"
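
What "extended list" means can be checked directly in base R. A short sketch; the variable name col_classes is illustrative:

```r
data(iris)

# a data.frame is a list whose elements are the columns
print(is.list(iris))
print(length(iris))   # number of columns
print(names(iris))    # column names

# each column is a vector with its own type
col_classes <- sapply(iris, class)
print(col_classes)

# columns are accessed like list elements
print(iris[["Species"]][1:3])
```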

R package: sqldf

running SQL statements on R data frames

library(sqldf)
sqldf("select * from iris limit 2")
  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
sqldf("select count(*) from iris")
  count(*)
1      150
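
For comparison, the same filter, group, aggregate and sort operations can be written without SQL. A sketch in base R; the threshold 3 and the variable names are chosen for illustration:

```r
data(iris)

# select * from iris limit 2
first_two <- head(iris, n = 2)

# select count(*) from iris
n_rows <- nrow(iris)

# where Sepal.Width > 3
wide <- subset(iris, Sepal.Width > 3)

# select Species, avg(Sepal.Length) ... group by Species
by_species <- aggregate(Sepal.Length ~ Species, data = wide, FUN = mean)

# order by avg(Sepal.Length) desc
sorted <- by_species[order(-by_species$Sepal.Length), ]
print(sorted)
```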

Other relational R packages

  • RMySQL package provides an interface to MySQL
  • RPostgreSQL package provides an interface to PostgreSQL
  • ROracle package provides an interface for Oracle
  • RJDBC package provides access to databases through a JDBC interface
  • RSQLite package provides access to SQLite
    (SQLite engine is included)
  • One big problem:
    all packages read the full result into R memory

R and MongoDB

On CRAN there are two packages that connect R with MongoDB:

  • rmongodb supported by MongoDB, Inc.
    • powerful for big data
    • difficult to use due to BSON objects
  • RMongo
    • easy to use
    • limited functionality
    • reads full results into R memory
    • does not work on Mac OS X

R package: RMongo

library(RMongo)
mongo <- mongoDbConnect("cc_JwQcDLJSYQJb", "dbs001.mongosoup.de", 27017)
dbAuthenticate(mongo, username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")
dbShowCollections(mongo)
dbGetQuery(mongo, "zips","{'state':'AL'}")
dbInsertDocument(mongo, "test_data", '{"foo": "bar", "size": 5 }')
dbDisconnect(mongo)

R package: rmongodb

  • developed on top of the MongoDB-supported C driver
library(rmongodb)
mongo <- mongo.create(host="dbs001.mongosoup.de", db="cc_JwQcDLJSYQJb", username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")
mongo
[1] 0
attr(,"mongo")
<pointer: 0x105a1de80>
attr(,"class")
[1] "mongo"
attr(,"host")
[1] "dbs001.mongosoup.de"
attr(,"name")
[1] ""
attr(,"username")
[1] "JwQcDLJSYQJb"
attr(,"password")
[1] "RSXPkUkXXXXX"
attr(,"db")
[1] "cc_JwQcDLJSYQJb"
attr(,"timeout")
[1] 0
mongo.get.database.collections(mongo, "cc_JwQcDLJSYQJb")
[1] "cc_JwQcDLJSYQJb.zips" "cc_JwQcDLJSYQJb.ccp"  "cc_JwQcDLJSYQJb.test"
# disconnect only after all queries are done; the calls below still need the connection
# mongo <- mongo.disconnect(mongo)
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "state", "AL")
[1] TRUE
query <- mongo.bson.from.buffer(buf)
query
    state : 2    AL
res <- mongo.find.one(mongo, "cc_JwQcDLJSYQJb.zips", query)
res
    city : 2     ACMAR
    loc : 4      
        0 : 1    -86.515570
        1 : 1    33.584132

    pop : 16     6055
    state : 2    AL
    _id : 2      35004
out <- mongo.bson.to.list(res)
out$loc
[1] -86.52  33.58
typeof(out$loc)
[1] "double"
out$pop
[1] 6055
out$state
[1] "AL"
cursor <- mongo.find(mongo, "cc_JwQcDLJSYQJb.zips", query)
res <- NULL
while (mongo.cursor.next(cursor)){
    value <- mongo.cursor.value(cursor)
    Rvalue <- mongo.bson.to.list(value)
    res <- rbind(res, Rvalue)
}
err <- mongo.cursor.destroy(cursor)

head(res, n=4)
       city         loc       pop   state _id    
Rvalue "ACMAR"      Numeric,2 6055  "AL"  "35004"
Rvalue "ADAMSVILLE" Numeric,2 10616 "AL"  "35005"
Rvalue "ADGER"      Numeric,2 3205  "AL"  "35006"
Rvalue "KEYSTONE"   Numeric,2 14218 "AL"  "35007"
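
The rbind loop above yields a matrix of list cells, which is awkward for statistics. One possible alternative is to collect the converted documents in a list and build a data.frame from the scalar fields. A sketch in base R without a database connection: docs stands in for the values returned by mongo.bson.to.list, using the four rows printed above with the loc field left out:

```r
# stand-in for the converted documents (values taken from the output above)
docs <- list(
  list(city = "ACMAR",      pop = 6055,  state = "AL", `_id` = "35004"),
  list(city = "ADAMSVILLE", pop = 10616, state = "AL", `_id` = "35005"),
  list(city = "ADGER",      pop = 3205,  state = "AL", `_id` = "35006"),
  list(city = "KEYSTONE",   pop = 14218, state = "AL", `_id` = "35007")
)

# one typed column per field; vapply checks that every value has the expected type
zips <- data.frame(
  id    = vapply(docs, function(d) d[["_id"]],   character(1)),
  city  = vapply(docs, function(d) d[["city"]],  character(1)),
  pop   = vapply(docs, function(d) d[["pop"]],   numeric(1)),
  state = vapply(docs, function(d) d[["state"]], character(1)),
  stringsAsFactors = FALSE
)

print(zips)
print(sum(zips$pop))  # numeric columns can be used directly
```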
  • It is all about creating BSON query or field objects
b <- mongo.bson.from.list(
  list(name="Fred", age=29, city="Boston"))
b
    name : 2     Fred
    age : 1      29.000000
    city : 2     Boston
mongo.bson.to.list(b)
$name
[1] "Fred"

$age
[1] 29

$city
[1] "Boston"
?mongo.bson
?mongo.bson.buffer.append
?mongo.bson.buffer.start.array
?mongo.bson.buffer.start.object
# aggregate command: total population per state, keeping totals >= 10000
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "aggregate", "zips")
mongo.bson.buffer.start.array(buf, "pipeline")
  # array elements are objects keyed "0", "1", ...
  mongo.bson.buffer.start.object(buf, "0")
    mongo.bson.buffer.start.object(buf, "$group")
      mongo.bson.buffer.append(buf, "_id", "$state")
      mongo.bson.buffer.start.object(buf, "totalPop")
        mongo.bson.buffer.append(buf, "$sum", "$pop")
      mongo.bson.buffer.finish.object(buf)
    mongo.bson.buffer.finish.object(buf)
  mongo.bson.buffer.finish.object(buf)
  mongo.bson.buffer.start.object(buf, "1")
    mongo.bson.buffer.start.object(buf, "$match")
      mongo.bson.buffer.start.object(buf, "totalPop")
        # numeric threshold, not the string "10000"
        mongo.bson.buffer.append(buf, "$gte", 10000)
      mongo.bson.buffer.finish.object(buf)
    mongo.bson.buffer.finish.object(buf)
  mongo.bson.buffer.finish.object(buf)
mongo.bson.buffer.finish.object(buf)
query <- mongo.bson.from.buffer(buf)
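
To see the document shape an aggregate command is meant to have, the same structure can be written as a nested R list: the pipeline is an array of stage objects, each with a single operator key. This is an illustration in base R only; whether a given rmongodb version turns such a list into identical BSON via mongo.bson.from.list is not assumed here:

```r
# nested list mirroring the command document
cmd <- list(
  aggregate = "zips",
  pipeline = list(
    # stage 1: sum the population per state
    list(`$group` = list(
      `_id`    = "$state",
      totalPop = list(`$sum` = "$pop")
    )),
    # stage 2: keep states with a total population of at least 10000
    list(`$match` = list(
      totalPop = list(`$gte` = 10000)
    ))
  )
)

# str() prints the nesting, which helps when placing start/finish calls
str(cmd)
```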

CCP Web Analytics Challenge

# an empty BSON object as query matches all documents
buf <- mongo.bson.buffer.create()
query <- mongo.bson.from.buffer(buf)
# return only the fields "user" and "type"
buf <- mongo.bson.buffer.create()
err <- mongo.bson.buffer.append(buf, "user", 1)
err <- mongo.bson.buffer.append(buf, "type", 1)
field <- mongo.bson.from.buffer(buf)
out <- mongo.find(mongo, "cc_JwQcDLJSYQJb.ccp", query, fields=field, limit=1000)
# collect up to 1000 documents row by row
res <- NULL
while (mongo.cursor.next(out)){
    value <- mongo.cursor.value(out)
    Rvalue <- mongo.bson.to.list(value)
    res <- rbind(res, Rvalue)
}
err <- mongo.cursor.destroy(out)
boxplot( as.integer(table(unlist(res[,2])) ), cex=4, horizontal=TRUE, main="Number of actions per user")

[Figure: horizontal boxplot, “Number of actions per user”]
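
The counting step inside the boxplot call relies on table(), which tallies how often each value occurs. A small sketch with made-up user ids (hypothetical, not taken from the ccp collection):

```r
# hypothetical user column: one entry per recorded action
user <- c("u1", "u1", "u2", "u3", "u3", "u3")

# table() counts the actions per user; as.integer() drops the names
actions_per_user <- as.integer(table(user))
print(actions_per_user)

# the boxplot on the slide summarizes the distribution of these counts
print(summary(actions_per_user))
```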

Shiny Mongo

  • R based MongoDB User Interface
  • R packages shiny and rmongodb
  • less than 200 lines of code
  • DEMO: http://localhost:8100

https://github.com/comsysto/ShinyMongo

Summary

See you soon

  • thanks a lot for your attention
  • there are R trainings in December 2013 in Munich
  • we are hosting many events and meetups
  • meet you at the MongoSoup booth
