R Statistics with MongoDB

Dr. Markus Schmidberger
October 14th, 2013 Munich, Germany

Email: markus@mongosoup.de
Twitter: @cloudHPC

Outline

Introduction to Big Data, MongoSoup and R
R statistics with MongoDB and Examples
Summary & Questions

Big Data

Wikipedia: … a collection of data sets so large and complex that it becomes difficult to process using on-hand database management tools or traditional data processing. …
- storing
- processing

Storing: NoSQL - MongoDB

databases using looser consistency models to store data
German MongoDB as a Service: MongoSoup
- cloudControl Add-On
- currently running on AWS EU-Region (Ireland)
- all features available: shared / dedicated hosting, replica set, sharding
- 24/7 support available

MongoSoup in < 5 min

go to cloudControl: www.cloudcontrol.com
add an account and a billing address
create a new app, e.g. “rmongodb”
install cloudControl command line tools: cctrlapp
enable your preferred MongoSoup hosting: cctrlapp rmongodb/default addon.add mongosoup.medium
go to the cloudControl Web-Console-AddOns and get your credentials

https://www.cloudcontrol.com/console/app/rmongodb

Processing: Analyzing with R and Hadoop

backward-looking analysis is outdated
today: quasi real-time analysis
tomorrow: forward-looking predictive analysis
more complex methods, more data available, more processing time required
see my Strata London tutorial “Big Data Analyses with R”

Introduction to R

R is a free software environment for statistical computing and graphics
offers tools to manage and analyze data
standard statistical methods are implemented
compiles and runs on many operating systems
supported by a huge community

www.r-project.org

huge online library with > 5000 R packages:
- http://cran.r-project.org
possibility to write personalized code and to contribute new packages
really famous since January 6, 2009: The New York Times, “Data Analysts Captivated by R's Power”

RStudio IDE

http://www.rstudio.com

R as calculator

(5+5) - 1 * 3

[1] 7

x <- 3
x

[1] 3

x^2 + 4

[1] 13

y <- c(1,2,3)
y

[1] 1 2 3

x <- 1:10
x

 [1]  1  2  3  4  5  6  7  8  9 10

x < 5

 [1]  TRUE  TRUE  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE

x[3:7]

[1] 3 4 5 6 7

mean(x)

[1] 5.5

help("mean")
?mean
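
The comparison x < 5 above returns a logical vector, and such a vector can itself be used as an index. A minimal sketch in base R (the names idx and small are only illustrative):

```r
x <- 1:10

# a comparison yields one TRUE/FALSE value per element
idx <- x < 5

# a logical vector used as an index keeps the TRUE positions
small <- x[idx]
print(small)

# functions such as mean() work directly on the filtered vector
print(mean(x[x >= 5]))
```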

library(onion)
data(bunny)
head(bunny, n=3)

            x      y        z
[1,] -0.03783 0.1279 0.004475
[2,] -0.04478 0.1289 0.001905
[3,] -0.06801 0.1512 0.037195

p3d(bunny,theta=3,
    phi=104,box=FALSE)

[Figure: 3D plot of the bunny data produced by p3d()]

Many Statistical Functions

cl <- kmeans(dat, 4)
cl

K-means clustering with 4 clusters of sizes 21, 18, 30, 31

Cluster means:
     [,1]    [,2]
1  0.7755  0.8509
2 -0.1557 -0.2305
3  1.2299  1.1472
4  0.1510  0.1507

Clustering vector:
  [1] 4 2 4 4 2 4 4 4 2 4 4 4 2 2 4 4 1 4 2 2 2 4 4 4 2 4 2 4 4 2 4 2 2 4 4
 [36] 4 4 4 4 4 4 4 4 2 4 2 2 4 2 2 1 1 1 1 3 1 3 3 3 1 1 3 3 3 3 1 3 1 3 3
 [71] 1 3 1 1 3 3 3 3 1 1 3 3 1 1 1 3 3 3 3 1 3 1 3 3 3 3 1 3 3 3

Within cluster sum of squares by cluster:
[1] 3.318 1.166 4.019 3.195
 (between_SS / total_SS =  83.0 %)

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"    
[5] "tot.withinss" "betweenss"    "size"

plot(dat, col = cl$cluster, cex=2, pch=16)
points(cl$centers, col = 1:4, pch = 13, cex = 4)

[Figure: scatter plot of dat colored by cluster, with the four cluster centers marked]
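
The slides do not show how dat was created. The sketch below assumes two-dimensional data drawn from two normal distributions, in the style of the example in ?kmeans. The seed and the sample sizes are assumptions, so cluster sizes and centers will differ from the output shown above.

```r
set.seed(42)  # assumption: any seed works, results vary with it

# 100 points in 2 dimensions: 50 around (0, 0) and 50 around (1, 1)
dat <- rbind(
  matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2)
)

# fit 4 clusters and keep the result so it can be plotted afterwards
cl <- kmeans(dat, 4)

print(cl$size)     # number of points per cluster
print(cl$centers)  # one row per cluster center
```

With dat and cl defined this way, the plot() and points() calls above can be run as they are.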

R Shiny - easy web applications

developed by RStudio
turns R analyses into interactive web applications that anyone can use
let your users choose input parameters using friendly controls like sliders, drop-downs, and text fields
easily incorporate any number of outputs like plots, tables, and summaries
no HTML or JavaScript knowledge is necessary, only R

http://www.rstudio.com/shiny/
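
As a rough illustration of the points above, a minimal app might look like the following. This is a sketch assuming the shiny package is installed; the input id n, the output id hist and the slider range are made up for this example.

```r
library(shiny)

# user interface: one slider as input, one plot as output
ui <- fluidPage(
  sliderInput("n", "Number of observations", min = 10, max = 500, value = 100),
  plotOutput("hist")
)

# server logic: redraw the histogram whenever the slider changes
server <- function(input, output) {
  output$hist <- renderPlot({
    hist(rnorm(input$n), main = "Random normal values")
  })
}

# build the app object; runApp(app) would start it in a browser
app <- shinyApp(ui = ui, server = server)
```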

R and Databases

SQL provides a standard language to filter, aggregate, group, sort data
SQL in new places: Hive, Impala, …
ODBC provides SQL interface to non-database data (Excel, CSV, text files)
R stores relational data in data.frames (extended lists)

data(iris)
head(iris, n=3)

  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa

class(iris)

[1] "data.frame"
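
The "extended lists" remark can be checked directly: a data.frame is a list of equal-length columns, so list tools work on it. A short sketch in base R:

```r
data(iris)

# a data.frame is a list of equal-length column vectors
print(is.list(iris))

# columns can be accessed like list elements
print(head(iris$Sepal.Length, n = 3))

# list tools such as sapply() iterate over the columns
col_classes <- sapply(iris, class)
print(col_classes)
```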

R package: sqldf

running SQL statements on R data frames

library(sqldf)
sqldf("select * from iris limit 2")

  Sepal_Length Sepal_Width Petal_Length Petal_Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa

sqldf("select count(*) from iris")

  count(*)
1      150
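
For comparison, the two SQL statements above correspond to the following base R calls. This is a sketch without sqldf; the group-wise mean at the end is an extra illustration of what a GROUP BY would compute and is not part of the original slides.

```r
data(iris)

# select * from iris limit 2
first_two <- head(iris, n = 2)
print(first_two)

# select count(*) from iris
n_rows <- nrow(iris)
print(n_rows)

# roughly: select Species, avg(Sepal.Length) from iris group by Species
by_species <- aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
print(by_species)
```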

Other relational R packages

RMySQL package provides an interface to MySQL
RPostgreSQL package provides an interface to PostgreSQL
ROracle package provides an interface for Oracle
RJDBC package provides access to databases through a JDBC interface
RSQLite package provides access to SQLite (the SQLite engine is included)
One big problem:
all packages read the full result into R memory

R and MongoDB

on CRAN there are two packages to connect R with MongoDB

rmongodb supported by MongoDB, Inc.
- powerful for big data
- difficult to use due to BSON objects
RMongo
- easy to use
- limited functionality
- reads full results into R memory
- does not work on Mac OS X

R package: RMongo

library(RMongo)
mongo <- mongoDbConnect("cc_JwQcDLJSYQJb", "dbs001.mongosoup.de", 27017)
dbAuthenticate(mongo, username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")

dbShowCollections(mongo)
dbGetQuery(mongo, "zips","{'state':'AL'}")
dbInsertDocument(mongo, "test_data", '{"foo": "bar", "size": 5 }')

dbDisconnect(mongo)

R package: rmongodb

developed on top of the MongoDB-supported C driver

library(rmongodb)
mongo <- mongo.create(host="dbs001.mongosoup.de", db="cc_JwQcDLJSYQJb", username="JwQcDLJSYQJb", password="RSXPkUkXXXXX")

mongo

[1] 0
attr(,"mongo")
<pointer: 0x105a1de80>
attr(,"class")
[1] "mongo"
attr(,"host")
[1] "dbs001.mongosoup.de"
attr(,"name")
[1] ""
attr(,"username")
[1] "JwQcDLJSYQJb"
attr(,"password")
[1] "RSXPkUkXXXXX"
attr(,"db")
[1] "cc_JwQcDLJSYQJb"
attr(,"timeout")
[1] 0

mongo.get.database.collections(mongo, "cc_JwQcDLJSYQJb")

[1] "cc_JwQcDLJSYQJb.zips" "cc_JwQcDLJSYQJb.ccp"  "cc_JwQcDLJSYQJb.test"

# disconnect only after the last query; the examples that follow still use the connection
# mongo <- mongo.disconnect(mongo)

buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "state", "AL")

[1] TRUE

query <- mongo.bson.from.buffer(buf)
query

    state : 2    AL

res <- mongo.find.one(mongo, "cc_JwQcDLJSYQJb.zips", query)
res

    city : 2     ACMAR
    loc : 4      
        0 : 1    -86.515570
        1 : 1    33.584132

    pop : 16     6055
    state : 2    AL
    _id : 2      35004

out <- mongo.bson.to.list(res)
out$loc

[1] -86.52  33.58

typeof(out$loc)

[1] "double"

out$pop

[1] 6055

out$state

[1] "AL"

cursor <- mongo.find(mongo, "cc_JwQcDLJSYQJb.zips", query)

res <- NULL
while (mongo.cursor.next(cursor)){
    value <- mongo.cursor.value(cursor)
    Rvalue <- mongo.bson.to.list(value)
    res <- rbind(res, Rvalue)
}
err <- mongo.cursor.destroy(cursor)

head(res, n=4)

       city         loc       pop   state _id    
Rvalue "ACMAR"      Numeric,2 6055  "AL"  "35004"
Rvalue "ADAMSVILLE" Numeric,2 10616 "AL"  "35005"
Rvalue "ADGER"      Numeric,2 3205  "AL"  "35006"
Rvalue "KEYSTONE"   Numeric,2 14218 "AL"  "35007"
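
The rbind() loop above yields a matrix of list cells, with loc stored as a nested numeric vector. One possible way to get a regular data.frame is to flatten each document first. The sketch below works on plain R lists shaped like the output of mongo.bson.to.list(); flatten_doc and the column names lon and lat are hypothetical, and the second document has its coordinates set to NA because the slides do not show them.

```r
# documents as returned by mongo.bson.to.list(); values taken from the output above
docs <- list(
  list(city = "ACMAR", loc = c(-86.515570, 33.584132), pop = 6055,
       state = "AL", `_id` = "35004"),
  list(city = "ADAMSVILLE", loc = c(NA_real_, NA_real_), pop = 10616,
       state = "AL", `_id` = "35005")
)

# hypothetical helper: turn one document into a one-row data.frame
flatten_doc <- function(d) {
  data.frame(
    id = d[["_id"]], city = d$city, state = d$state, pop = d$pop,
    lon = d$loc[1], lat = d$loc[2],
    stringsAsFactors = FALSE
  )
}

# convert all documents, then bind the rows once instead of growing a matrix
zips <- do.call(rbind, lapply(docs, flatten_doc))
print(zips)
```

Collecting the documents in a list and binding once also avoids copying the result on every iteration, which matters for larger cursors.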

It is all about creating BSON query or field objects

b <- mongo.bson.from.list(
  list(name="Fred", age=29, city="Boston"))
b

    name : 2     Fred
    age : 1      29.000000
    city : 2     Boston

mongo.bson.to.list(b)

$name
[1] "Fred"

$age
[1] 29

$city
[1] "Boston"

?mongo.bson
?mongo.bson.buffer.append
?mongo.bson.buffer.start.array
?mongo.bson.buffer.start.object

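# aggregation command: total population per state, keeping totals >= 10000
buf <- mongo.bson.buffer.create()
mongo.bson.buffer.append(buf, "aggregate", "zips")
mongo.bson.buffer.start.array(buf, "pipeline")
  # array elements are objects keyed by their index ("0", "1", ...)
  mongo.bson.buffer.start.object(buf, "0")
    mongo.bson.buffer.start.object(buf, "$group")
      mongo.bson.buffer.append(buf, "_id", "$state")
      mongo.bson.buffer.start.object(buf, "totalPop")
        mongo.bson.buffer.append(buf, "$sum", "$pop")
      mongo.bson.buffer.finish.object(buf)
    mongo.bson.buffer.finish.object(buf)
  mongo.bson.buffer.finish.object(buf)
  mongo.bson.buffer.start.object(buf, "1")
    mongo.bson.buffer.start.object(buf, "$match")
      mongo.bson.buffer.start.object(buf, "totalPop")
        # numeric threshold; the string "10000" would be compared as text
        mongo.bson.buffer.append(buf, "$gte", 10000)
      mongo.bson.buffer.finish.object(buf)
    mongo.bson.buffer.finish.object(buf)
  mongo.bson.buffer.finish.object(buf)
mongo.bson.buffer.finish.object(buf)  # closes the "pipeline" array
query <- mongo.bson.from.buffer(buf)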
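
To make clear what this aggregation command is meant to compute (group by state, sum pop, keep totals of at least 10000), here is a rough base R equivalent on a small in-memory table. The data frame is hypothetical: the four AL populations are the ones shown in the output earlier, and the "XX" row is invented only to show the filter at work.

```r
# hypothetical extract of the zips collection
zips <- data.frame(
  state = c("AL", "AL", "AL", "AL", "XX"),
  pop   = c(6055, 10616, 3205, 14218, 500),
  stringsAsFactors = FALSE
)

# $group stage: sum pop per state
totals <- aggregate(pop ~ state, data = zips, FUN = sum)
names(totals)[names(totals) == "pop"] <- "totalPop"

# $match stage: keep states with totalPop >= 10000
big <- totals[totals$totalPop >= 10000, ]
print(big)
```

On real data the point of the MongoDB version is that grouping and filtering happen on the server, so only the small result has to fit into R memory.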

CCP Web Analytics Challenge

# empty query: match every document
buf <- mongo.bson.buffer.create()
query <- mongo.bson.from.buffer(buf)
# projection: return only the "user" and "type" fields
buf <- mongo.bson.buffer.create()
err <- mongo.bson.buffer.append(buf, "user", 1)
err <- mongo.bson.buffer.append(buf, "type", 1)
field <- mongo.bson.from.buffer(buf)
# fetch at most 1000 documents
out <- mongo.find(mongo, "cc_JwQcDLJSYQJb.ccp", query, fields=field, limit=1000)
res <- NULL
while (mongo.cursor.next(out)){
    value <- mongo.cursor.value(out)
    Rvalue <- mongo.bson.to.list(value)
    res <- rbind(res, Rvalue)
}
# release the cursor once all documents are read
err <- mongo.cursor.destroy(out)

boxplot( as.integer(table(unlist(res[,2])) ), cex=4, horizontal=TRUE, main="Number of actions per user")

[Figure: horizontal boxplot, "Number of actions per user"]
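
The boxplot call relies on table() to count how often each user occurs. A small sketch of that counting step with made-up user ids (the vector users is hypothetical and stands in for the user column of res):

```r
# hypothetical user ids, one entry per logged action
users <- c("u1", "u1", "u2", "u3", "u3", "u3")

# table() counts the occurrences of each distinct value (sorted by name)
counts <- as.integer(table(users))
print(counts)

# the five-number summary is what the boxplot draws
print(fivenum(counts))
```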

Shiny Mongo

R based MongoDB User Interface
R packages shiny and rmongodb
less than 200 lines of code
DEMO: http://localhost:8100

https://github.com/comsysto/ShinyMongo

Summary

R is a powerful statistical tool to analyse many different kinds of data
R can access databases
MongoDB and rmongodb are ready for Big Data
start playing around with R, Big Data and MongoDB

See you soon

thanks a lot for your attention
there are R training courses in December 2013 in Munich
- http://comsysto.com/events.html#r
we are hosting many events and meetups
meet you at the MongoSoup booth
