This is an example notebook to connect Arm Treasure Data (TD) from R. If you want to know how to access from R with RJDBC or RPresto, you can see How to connect Arm Treasure Data from R.

While RPresto enables you to query TD easily, it can’t handle some TD execution. In this notebook, I will explain how to query with RPresto and how to execute basic TD operation.

If you want to use embulk and embulk-output-td plugin, ensure PATH for embulk command to be set appropriately.

You also have to set your “TD_API_KEY” is set as an environement variable.

Let’s install RPresto and RTD packages.

install.packages(c("RPresto", "RTD", "dplyr", "dbplyr", "ggplot2"))

We will use nycflights13 package for analysis. Copy flights data to TD.

install.packages("nycflights13")
library(RTD)

client <- Td(apikey = Sys.getenv("TD_API_KEY"), endpoint="api.treasuredata.com")
td_upload(client, "test", "flights", nycflights13::flights, overwrite=T)
## Created bulk import session
## Created msgpack file
## Start uploading bulk import part file
## Freezing bulk import
## Performing bulk import
## Commiting bulk import
## Finished bulk import

If you want to use embulk for td_upload, you can set mode="embulk" like:

td_upload(client, "test", "flights", nycflights13::flights, mode="embulk", overwrite=T)

Then, connect the table with RPresto.

library(RPresto)
library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
db <- src_presto(
  host="https://api-presto.treasuredata.com",
  port=443,
  user=Sys.getenv("TD_API_KEY"),
  schema='test',
  catalog='td-presto'  
)

flights_tbl <- tbl(db, 'flights')

Execute simple dplyr operation.

# filter by departure delay
flights_tbl %>% filter(dep_delay == 2)

“Introduction to dplyr” introduces advanced dplyr examples. Let’s try grouped operations example from the tutorial and plot the result.

delay <- flights_tbl %>% 
  group_by(tailnum) %>%
  summarise(count = n(), dist = mean(distance, na.rm = TRUE), delay = mean(arr_delay, na.rm = TRUE)) %>%
  filter(count > 20, dist < 2000) %>%
  collect

# plot delays
library(ggplot2)
ggplot(delay, aes(dist, delay)) +
  geom_point(aes(size = count), alpha = 1/2) +
  geom_smooth() +
  scale_size_area()
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 1643 rows containing non-finite values (stat_smooth).
## Warning: Removed 1643 rows containing missing values (geom_point).

You can write SQL using DBI package.

library(DBI)

con <- dbConnect(
  RPresto::Presto(),
  host="https://api-presto.treasuredata.com",
  port=443,
  user=Sys.getenv("TD_API_KEY"),
  schema='test',
  catalog='td-presto'
)

flights_preview <- dbGetQuery(con, 'SELECT year, month, day, dep_time, dep_delay, carrier, flight from flights limit 10')
flights_preview

You can execute basic TD operation like delete table.

delete_table(client, "test", "flights")
## [1] TRUE

References