This is a notebook to explain how to connect Arm Treasure Data from R.

Ingest data

We will use Chicago Smart Green data, which is used in this tutorial.

Before executing this tutorial, ensure the data is uploaded as follows:

  1. Get City of Chicago energy benchmarking data
  2. Go to connection catalog on TD console and upload the dump CSV file to your own database

Install dependencies

First time you have to install all dependent packages.

install.packages("rJava")
install.packages("RJDBC", dependencies=TRUE)

For this tutorial, we will use following packages:

install.packages(c("dplyr", "dbplyr", "ggplot2"))

Connect to Arm Treasure Data

Connect TD using RJDBC package. Before connecting to TD, Download TD JDBC driver from here. Replace JDBC jar path for your environment.

library(RJDBC)
## Loading required package: DBI
## Loading required package: rJava
.jinit()

drv <- JDBC("com.treasuredata.jdbc.TreasureDataDriver",
  "/Users/ariga/bin/td-jdbc-0.5.10-jar-with-dependencies.jar",
  identifier.quote='"')

Then, you can connect TD as follows. Before using this command, you need to set “TD_API_KEY” as an environment variable.

con <- dbConnect(drv,
  "jdbc:td://api.treasuredata.com/chicago_smart_green;useSSL=true",
  type="presto",
  apikey=Sys.getenv("TD_API_KEY"))

This connection assumes to use chicago_smart_green database with Presto as a query engine. If you want to use Hive as a query engine, you can change type as "hive" instead.

You can list the table names in the chicago_smart_green DB using the function dbListTables from DBI.

library(DBI)

dbListTables(con)
## [1] "energy_benchmarking" "model"               "samples"            
## [4] "vectors"

You can list column names via dbListFields(con, "energy_benchmarking").

dbListFields(con, "energy_benchmarking")
## Note: method with signature 'DBIConnection#character' chosen for function 'dbListFields',
##  target signature 'JDBCConnection#character'.
##  "JDBCConnection#ANY" would also be valid
##  [1] "data_year"                                 
##  [2] "id"                                        
##  [3] "property_name"                             
##  [4] "address"                                   
##  [5] "zip_code"                                  
##  [6] "community_area"                            
##  [7] "primary_property_type"                     
##  [8] "gross_floor_area___buildings__sq_ft_"      
##  [9] "year_built"                                
## [10] "num_of_buildings"                          
## [11] "energy_star_score"                         
## [12] "electricity_use__kbtu_"                    
## [13] "natural_gas_use__kbtu_"                    
## [14] "district_steam_use__kbtu_"                 
## [15] "district_chilled_water_use__kbtu_"         
## [16] "all_other_fuel_use__kbtu_"                 
## [17] "site_eui__kbtu_sq_ft_"                     
## [18] "source_eui__kbtu_sq_ft_"                   
## [19] "weather_normalized_site_eui__kbtu_sq_ft_"  
## [20] "weather_normalized_source_eui__kbtu_sq_ft_"
## [21] "total_ghg_emissions__metric_tons_co2e_"    
## [22] "ghg_intensity__kg_co2e_sq_ft_"             
## [23] "latitude"                                  
## [24] "longitude"                                 
## [25] "location"                                  
## [26] "time"

Use dplyr for quering

dplyr is a popular R package for manupilating data. You can use dplyr with dbplyr for RJDBC.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
library(dbplyr)
## 
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
## 
##     ident, sql
tbl(con, "energy_benchmarking")

Aggregate data and plot it

Using dplyr, you can aggregate data easily as follows:

tbl(con, "energy_benchmarking") %>% 
  group_by(community_area) %>% 
  summarise(ave=mean(electricity_use__kbtu_)) %>% 
  arrange(desc(ave))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning

Note that this is a lazy executed result, so you need to collect() data for plotting. Let’s fetch top 10 energy consumption data in average into R data.frame.

df <- tbl(con, "energy_benchmarking") %>% 
  group_by(community_area) %>% 
  summarise(ave=mean(electricity_use__kbtu_)) %>% 
  arrange(desc(ave)) %>%
  top_n(10) %>%
  collect()
## Selecting by ave
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
df

Now you can plot them with ggplot2.

library(ggplot2)

p <- ggplot(data=df, aes(x=reorder(community_area,ave), y=ave)) + geom_bar(stat="identity")
p + coord_flip() + labs(x="community area", y="average energy consumption")

Alternate solutions

Presto JDBC Driver

We also have another option; Presto JDBC Driver, which is a faster approach only for Presto. Download JDBC driver from Presto official web site.

Replace JDBC jar file place where you downloaded.

drv2 <- JDBC("com.facebook.presto.jdbc.PrestoDriver",
  "/Users/ariga/bin/presto-jdbc-0.214.jar",
  identifier.quote='"')

Connect chicago_smart_green database via Presto JDBC driver.

con2 <- dbConnect(drv2,
  "jdbc:presto://api-presto.treasuredata.com:443/td-presto/chicago_smart_green?SSL=true",
  Sys.getenv("TD_API_KEY"),
  NULL)

Following executions are almost same as TD JDBC driver.

Note that behavior of dbListTables function is slightly different from TD JDBC driver. It shows all the tables of default database.

dbListTables(con2)
tbl(con2, "energy_benchmarking")
tbl(con2, "energy_benchmarking") %>% 
  group_by(community_area) %>%
  summarise(ave=mean(electricity_use__kbtu_)) %>%
  arrange(desc(ave))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning

RPresto

You can use RPresto as well. RPresto is a package developed by

install.packages("RPresto")

Since RPresto v1.3.2 lacks adding headers for some funcitons, we need to set httr config.

library(RPresto)

httr::set_config(
  httr::add_headers("X-Presto-User"=Sys.getenv("TD_API_KEY"))
)

Then, you can connect TD with RPresto.

library(DBI)
con <- dbConnect(
  RPresto::Presto(),
  host="https://api-presto.treasuredata.com",
  port=443,
  user=Sys.getenv("TD_API_KEY"),
  schema='chicago_smart_green',
  catalog='td-presto'
)

With DBI, it’s easy to use dbGetQuery function.

dbGetQuery(con, 'SELECT count(1) from energy_benchmarking')

You also can use dplyr with RPresto.

library(dplyr)

db <- src_presto(
  host="https://api-presto.treasuredata.com",
  port=443,
  user=Sys.getenv("TD_API_KEY"),
  schema='chicago_smart_green',
  catalog='td-presto'
)

eb <- tbl(db, 'energy_benchmarking')
eb %>%
  group_by(community_area) %>% 
  summarise(ave=mean(electricity_use__kbtu_)) %>% 
  arrange(desc(ave))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning