dbplot and sparklyr

1. Introduction
2. Example

1. Introduction

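The dbplot package was introduced in the post linked under References at the bottom.
In short, the idea is to use a database and RStudio together.
- You pull the data you need from the database and do the plotting in RStudio.
What impressed me, though, is that a sparklyr Spark DataFrame can be visualized directly.
In this document, the data is transformed as a Spark DataFrame and then visualized with dbplot.
- Only histograms are plotted.
dbplot provides four basic ggplot2 visualizations.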
- Bar plot
- Line plot
- Histogram
- Raster
It is not yet officially on CRAN, so it was installed from GitHub.
This document focuses on introducing the features, so there is no need to dwell on the analysis itself.

# Install dbplot from GitHub
devtools::install_github("edgararuiz/dbplot")
library(dbplot)
library(sparklyr)
library(rsparkling)

# Local Spark settings; adjust cores and driver memory to your machine
config <- spark_config()
config$sparklyr.cores.local <- 10
config$`sparklyr.shell.driver-memory` <- "70G"

sc <- spark_connect(master = "local", app_name = "Sparklyr", config = config)

2. Example

The data is the 2013 flights data from the nycflights13 package.

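library(dbplyr)
library(dplyr)
library(nycflights13)

# Copy the local data frames into Spark as named tables
FlightsTB <- copy_to(sc, flights, "FlightsTB")
AirlineTB <- copy_to(sc, airlines, "AirlineTB")
# airports is used in section 2.2
AirportsTB <- copy_to(sc, airports, "AirportsTB")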

src_tbls(sc)

[1] "airlinetb"  "airportstb" "flightstb"

2.1. dbplot functions

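By default, dbplot_histogram() draws a 30-bin histogram.
The function computes the bins with dplyr, so the calculation runs on the data source rather than in R.
Going forward, these functions should therefore work with sparklyr and with any other database that dplyr supports.
One caveat: the database must provide basic functions such as max() and min().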
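
To see why min(), max(), and plain arithmetic are enough, here is a minimal local sketch of fixed-width binning in base R. The helper name bin_lower_edge and the sample distances are made up for illustration, and this is not dbplot's actual implementation; it only shows the kind of calculation that can be pushed down to a database.

```r
# Hypothetical helper: lower edge of the fixed-width bin each value falls in.
# Illustrative only; uses nothing beyond min(), max(), floor() and arithmetic.
bin_lower_edge <- function(x, bins = 30) {
  lo <- min(x, na.rm = TRUE)
  hi <- max(x, na.rm = TRUE)
  width <- (hi - lo) / bins
  idx <- floor((x - lo) / width)
  idx <- pmin(idx, bins - 1)  # keep the maximum value inside the last bin
  lo + idx * width
}

# Made-up distances, not taken from the flights data
distance <- c(80, 200, 500, 760, 1000, 1400, 2475, 4983)
edges <- bin_lower_edge(distance, bins = 5)

# One row per non-empty bin, similar in shape to a histogram summary
counts <- as.data.frame(table(edges), responseName = "count")
print(counts)
```

Only the small table of bins and counts has to come back to R, which is what makes this approach practical for large tables.
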
Previously, a sparklyr Spark DataFrame had to be converted to an R data.frame before it could be plotted with ggplot2, using code like the following.

library(ggplot2)
FlightsTB %>% 
  select(distance) %>% 
  as.data.frame() %>% 
  ggplot(aes(x = distance)) + 
  geom_histogram(stat = "bin", bins = 30)

With the dbplot package, the same plot takes much less code.

library(dbplot)

tbl(sc, "FlightsTB") %>% 
  dbplot_histogram(distance)

ggplot2 functions that style a plot, such as labs() and theme_minimal(), can be chained onto the result as well.

library(ggplot2)
tbl(sc, "FlightsTB") %>% 
  dbplot_histogram(distance, binwidth = 500) + 
  labs(title = "Flight distance") + 
  theme_minimal()

2.2. db_compute functions

If you need more control over the plot, db_compute_bins() returns the data.frame needed for the visualization.
The data needed to draw a histogram can be obtained as follows.

AirportsTBbins <- tbl(sc, "AirportsTB") %>%  db_compute_bins(alt)

head(AirportsTBbins)

# A tibble: 6 x 2
     alt count
   <dbl> <dbl>
1  859.2   139
2  250.4   190
3  554.8   212
4  -54.0   582
5 1468.0    41
6 1772.4    20

Once computed, the result can be piped straight into ggplot2.

tbl(sc, "AirportsTB") %>% 
  db_compute_bins(alt) %>%
  ggplot() +
  geom_col(aes(alt, count, fill = count))
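
Because the result is an ordinary local data frame, it can also be reshaped before plotting. The sketch below is only an illustration in base R: it rebuilds the six rows printed by head() above by hand, so the shares are relative to those six rows and not to the full table, and the names bins_sorted and bin_width are made up for this example.

```r
# The six rows shown by head(AirportsTBbins), typed in by hand
bins <- data.frame(
  alt   = c(859.2, 250.4, 554.8, -54.0, 1468.0, 1772.4),
  count = c(139, 190, 212, 582, 41, 20)
)

# Rows come back in no particular order, so sort by the bin value first
bins_sorted <- bins[order(bins$alt), ]

# Relative frequency within these six rows only
bins_sorted$share <- bins_sorted$count / sum(bins_sorted$count)

# Smallest gap between neighbouring bin values, i.e. the bin width
bin_width <- min(diff(bins_sorted$alt))

print(bins_sorted)
print(bin_width)
```

The same steps could be applied to the full output of db_compute_bins() before handing it to geom_col().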

2.3. db_bin()

The dbplot package also provides a db_bin() function.
db_bin() converts a continuous variable into bin values.
Note that a call to db_bin() must be prefixed with !!, as shown below.

df <- tbl(sc, "FlightsTB") %>%
  group_by(x = !! db_bin(arr_delay, bins = 10)) %>%
  tally() %>%
  collect()

head(df)

# A tibble: 6 x 2
       x      n
   <dbl>  <dbl>
1  -86.0 293108
2   49.8  30699
3  728.8     19
4  185.6   3088
5 1136.2      1
6 1000.4      3
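
As far as I understand from the package documentation, the !! is needed because db_bin() returns an unevaluated expression that has to be spliced into the group_by() call before dplyr can translate it. The base R sketch below is only an analogy: it uses bquote() in place of !!, the expression is a guess at the general shape of a binning formula and not dbplot's own, and lo = -86 and width = 135.8 are placeholders read off the spacing of the x values printed above.

```r
# An unevaluated binning expression; lo and width are hypothetical placeholders
bin_expr <- quote(floor((arr_delay - lo) / width) * width + lo)

# Splice the expression into a call, analogous to !! inside group_by().
# Nothing is executed here; flights_tbl is just a symbol in the call.
spliced <- bquote(group_by(flights_tbl, x = .(bin_expr)))
print(spliced)

# The expression yields values only once it is evaluated against data
delays <- c(-86, -10, 60, 200)  # made-up arrival delays
binned <- eval(bin_expr, list(arr_delay = delays, lo = -86, width = 135.8))
print(binned)
```

Without the splice, group_by() would receive the call to db_bin() itself and not the arithmetic it stands for.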

References :