This is a notebook to explain how to connect Arm Treasure Data from R.
We will use Chicago Smart Green data, which is used in this tutorial.
Before executing this tutorial, ensure the data is uploaded as follows:
First time you have to install all dependent packages.
install.packages("rJava")
install.packages("RJDBC", dependencies=TRUE)
For this tutorial, we will use following packages:
install.packages(c("dplyr", "dbplyr", "ggplot2"))
Connect TD using RJDBC package. Before connecting to TD, Download TD JDBC driver from here. Replace JDBC jar path for your environment.
library(RJDBC)
## Loading required package: DBI
## Loading required package: rJava
.jinit()
drv <- JDBC("com.treasuredata.jdbc.TreasureDataDriver",
"/Users/ariga/bin/td-jdbc-0.5.10-jar-with-dependencies.jar",
identifier.quote='"')
Then, you can connect TD as follows. Before using this command, you need to set “TD_API_KEY” as an environment variable.
con <- dbConnect(drv,
"jdbc:td://api.treasuredata.com/chicago_smart_green;useSSL=true",
type="presto",
apikey=Sys.getenv("TD_API_KEY"))
This connection assumes to use chicago_smart_green database with Presto as a query engine. If you want to use Hive as a query engine, you can change type as "hive" instead.
You can list the table names in the chicago_smart_green DB using the function dbListTables from DBI.
library(DBI)
dbListTables(con)
## [1] "energy_benchmarking" "model" "samples"
## [4] "vectors"
You can list column names via dbListFields(con, "energy_benchmarking").
dbListFields(con, "energy_benchmarking")
## Note: method with signature 'DBIConnection#character' chosen for function 'dbListFields',
## target signature 'JDBCConnection#character'.
## "JDBCConnection#ANY" would also be valid
## [1] "data_year"
## [2] "id"
## [3] "property_name"
## [4] "address"
## [5] "zip_code"
## [6] "community_area"
## [7] "primary_property_type"
## [8] "gross_floor_area___buildings__sq_ft_"
## [9] "year_built"
## [10] "num_of_buildings"
## [11] "energy_star_score"
## [12] "electricity_use__kbtu_"
## [13] "natural_gas_use__kbtu_"
## [14] "district_steam_use__kbtu_"
## [15] "district_chilled_water_use__kbtu_"
## [16] "all_other_fuel_use__kbtu_"
## [17] "site_eui__kbtu_sq_ft_"
## [18] "source_eui__kbtu_sq_ft_"
## [19] "weather_normalized_site_eui__kbtu_sq_ft_"
## [20] "weather_normalized_source_eui__kbtu_sq_ft_"
## [21] "total_ghg_emissions__metric_tons_co2e_"
## [22] "ghg_intensity__kg_co2e_sq_ft_"
## [23] "latitude"
## [24] "longitude"
## [25] "location"
## [26] "time"
dplyr is a popular R package for manupilating data. You can use dplyr with dbplyr for RJDBC.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(dbplyr)
##
## Attaching package: 'dbplyr'
## The following objects are masked from 'package:dplyr':
##
## ident, sql
tbl(con, "energy_benchmarking")
Using dplyr, you can aggregate data easily as follows:
tbl(con, "energy_benchmarking") %>%
group_by(community_area) %>%
summarise(ave=mean(electricity_use__kbtu_)) %>%
arrange(desc(ave))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
Note that this is a lazy executed result, so you need to collect() data for plotting. Let’s fetch top 10 energy consumption data in average into R data.frame.
df <- tbl(con, "energy_benchmarking") %>%
group_by(community_area) %>%
summarise(ave=mean(electricity_use__kbtu_)) %>%
arrange(desc(ave)) %>%
top_n(10) %>%
collect()
## Selecting by ave
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
df
Now you can plot them with ggplot2.
library(ggplot2)
p <- ggplot(data=df, aes(x=reorder(community_area,ave), y=ave)) + geom_bar(stat="identity")
p + coord_flip() + labs(x="community area", y="average energy consumption")
We also have another option; Presto JDBC Driver, which is a faster approach only for Presto. Download JDBC driver from Presto official web site.
Replace JDBC jar file place where you downloaded.
drv2 <- JDBC("com.facebook.presto.jdbc.PrestoDriver",
"/Users/ariga/bin/presto-jdbc-0.214.jar",
identifier.quote='"')
Connect chicago_smart_green database via Presto JDBC driver.
con2 <- dbConnect(drv2,
"jdbc:presto://api-presto.treasuredata.com:443/td-presto/chicago_smart_green?SSL=true",
Sys.getenv("TD_API_KEY"),
NULL)
Following executions are almost same as TD JDBC driver.
Note that behavior of dbListTables function is slightly different from TD JDBC driver. It shows all the tables of default database.
dbListTables(con2)
tbl(con2, "energy_benchmarking")
tbl(con2, "energy_benchmarking") %>%
group_by(community_area) %>%
summarise(ave=mean(electricity_use__kbtu_)) %>%
arrange(desc(ave))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning
You can use RPresto as well. RPresto is a package developed by
install.packages("RPresto")
Since RPresto v1.3.2 lacks adding headers for some funcitons, we need to set httr config.
library(RPresto)
httr::set_config(
httr::add_headers("X-Presto-User"=Sys.getenv("TD_API_KEY"))
)
Then, you can connect TD with RPresto.
library(DBI)
con <- dbConnect(
RPresto::Presto(),
host="https://api-presto.treasuredata.com",
port=443,
user=Sys.getenv("TD_API_KEY"),
schema='chicago_smart_green',
catalog='td-presto'
)
With DBI, it’s easy to use dbGetQuery function.
dbGetQuery(con, 'SELECT count(1) from energy_benchmarking')
You also can use dplyr with RPresto.
library(dplyr)
db <- src_presto(
host="https://api-presto.treasuredata.com",
port=443,
user=Sys.getenv("TD_API_KEY"),
schema='chicago_smart_green',
catalog='td-presto'
)
eb <- tbl(db, 'energy_benchmarking')
eb %>%
group_by(community_area) %>%
summarise(ave=mean(electricity_use__kbtu_)) %>%
arrange(desc(ave))
## Warning: Missing values are always removed in SQL.
## Use `AVG(x, na.rm = TRUE)` to silence this warning