Accessing the ECEM ESCII data in the cloud with the tidyverse

The tidyverse meta-package provides unvaluable tools to access and manipulate data in many formats. One of its greatest features is the possibility to access data directly from a database, translating when possible the dplyr functions into the equivalent in the database-specific SQL language. A brief introduction on databases and R can be found here

The first step is obviously the command to load the tidyverse meta-package

library(tidyverse)

## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──

## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.4.1     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0

## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

I have prepared a PostgreSQL database (``the world’s most advanced open source database’’ according the developers) on a Amazon AWS account. Actually, I have activated an Amazon RDS (Relational Database Service, see here). You can try this at home asking for, as I did, an AWS Free Tier account. In this way you can have freely 750 hours per month of a small database (different engines are available like MySQL, PostgreSQL, MariaDB, SQL Server…).

Here the lines to connect to the database:

db_name <- src_postgres(dbname = 'entsoe_db', user = 'ecem_user', password = 'horsedice',
                       host = "entsoe-tp.cv6glosov0uu.eu-west-2.rds.amazonaws.com")

The user has been created with read-only permissions, so feel free to experiment without any problem. Now you can see that:

print(db_name)

## src:  postgres 9.6.3 [ecem_user@entsoe-tp.cv6glosov0uu.eu-west-2.rds.amazonaws.com:5432/entsoe_db]
## tbls: ecem_esciis, entsoe_dp

The output says that you are connected to a postgres database and two tables (tbls) are present: ecem_esciis and entsoe_dp. The two tables have been created me and they contain, respectively, ECEM ESCIIs data and all the hourly generation data from the ENTSO-E Transparency Platform. Now, let’s try to access the first:

d <- tbl(db_name, 'ecem_esciis') 
print(d)

## # Source:   table<ecem_esciis> [?? x 9]
## # Database: postgres 9.6.3
## #   [ecem_user@entsoe-tp.cv6glosov0uu.eu-west-2.rds.amazonaws.com:5432/entsoe_db]
##    row.names date       interval area  aggr  source variable model   value
##    <chr>     <date>     <chr>    <chr> <chr> <chr>  <chr>    <chr>   <dbl>
##  1 6176973   1980-01-08 01d      DE    CTRY  WIN    CFR      PhM01 0.00400
##  2 6176974   1980-01-09 01d      DE    CTRY  WIN    CFR      PhM01 0.00500
##  3 6176975   1980-01-10 01d      DE    CTRY  WIN    CFR      PhM01 0.0140 
##  4 6176976   1980-01-11 01d      DE    CTRY  WIN    CFR      PhM01 0.0280 
##  5 6176977   1980-01-12 01d      DE    CTRY  WIN    CFR      PhM01 0.0670 
##  6 6176978   1980-01-13 01d      DE    CTRY  WIN    CFR      PhM01 0.0320 
##  7 6176979   1980-01-14 01d      DE    CTRY  WIN    CFR      PhM01 0.0200 
##  8 6176980   1980-01-15 01d      DE    CTRY  WIN    CFR      PhM01 0.0150 
##  9 6176981   1980-01-16 01d      DE    CTRY  WIN    CFR      PhM01 0.0610 
## 10 6176982   1980-01-17 01d      DE    CTRY  WIN    CFR      PhM01 0.0270 
## # ... with more rows

Now d seems to store the entire database but if we check the size of the variable:

print(object.size(d))

## 3728 bytes

It seems too small to be the entire ESCII database, isn’t it? This is because the at this stage d has not still be evaluated, it is a sort of pointer to the original data stored on the remote database. This is called lazy evaluation and it is a powerful feature because it tries to postpone the actual gathering of the data only when needed, to allow a more effective memory management. In fact, the remote database could be even larger than the memory available on your workstation, and thanks to this mechanism it does not matter if you plan to work only with a subset of the original data. For example we can filter the data using dplyr:

d <- tbl(db_name, 'ecem_esciis') %>% 
  filter(area == 'DE', source == 'WIN')
print(object.size(d))

## 6880 bytes

print(d)

## # Source:   lazy query [?? x 9]
## # Database: postgres 9.6.3
## #   [ecem_user@entsoe-tp.cv6glosov0uu.eu-west-2.rds.amazonaws.com:5432/entsoe_db]
##    row.names date       interval area  aggr  source variable model   value
##    <chr>     <date>     <chr>    <chr> <chr> <chr>  <chr>    <chr>   <dbl>
##  1 6176973   1980-01-08 01d      DE    CTRY  WIN    CFR      PhM01 0.00400
##  2 6176974   1980-01-09 01d      DE    CTRY  WIN    CFR      PhM01 0.00500
##  3 6176975   1980-01-10 01d      DE    CTRY  WIN    CFR      PhM01 0.0140 
##  4 6176976   1980-01-11 01d      DE    CTRY  WIN    CFR      PhM01 0.0280 
##  5 6176977   1980-01-12 01d      DE    CTRY  WIN    CFR      PhM01 0.0670 
##  6 6176978   1980-01-13 01d      DE    CTRY  WIN    CFR      PhM01 0.0320 
##  7 6176979   1980-01-14 01d      DE    CTRY  WIN    CFR      PhM01 0.0200 
##  8 6176980   1980-01-15 01d      DE    CTRY  WIN    CFR      PhM01 0.0150 
##  9 6176981   1980-01-16 01d      DE    CTRY  WIN    CFR      PhM01 0.0610 
## 10 6176982   1980-01-17 01d      DE    CTRY  WIN    CFR      PhM01 0.0270 
## # ... with more rows

What happened here? The dbplyr PostgreSQL backend translated the dplyr function into a SQL query executed directly on the server. You can see in the first line that R does not even know the real size of the returned data but only the number of columns. You can in the end gather the data using the collect function:

d <- tbl(db_name, 'ecem_esciis') %>% 
  filter(area == 'DE', source == 'WIN') %>%
  collect()
print(object.size(d))

## 6883728 bytes

print(class(d))

## [1] "tbl_df"     "tbl"        "data.frame"

print(d)

## # A tibble: 57,344 x 9
##    row.names date       interval area  aggr  source variable model   value
##  * <chr>     <date>     <chr>    <chr> <chr> <chr>  <chr>    <chr>   <dbl>
##  1 6176973   1980-01-08 01d      DE    CTRY  WIN    CFR      PhM01 0.00400
##  2 6176974   1980-01-09 01d      DE    CTRY  WIN    CFR      PhM01 0.00500
##  3 6176975   1980-01-10 01d      DE    CTRY  WIN    CFR      PhM01 0.0140 
##  4 6176976   1980-01-11 01d      DE    CTRY  WIN    CFR      PhM01 0.0280 
##  5 6176977   1980-01-12 01d      DE    CTRY  WIN    CFR      PhM01 0.0670 
##  6 6176978   1980-01-13 01d      DE    CTRY  WIN    CFR      PhM01 0.0320 
##  7 6176979   1980-01-14 01d      DE    CTRY  WIN    CFR      PhM01 0.0200 
##  8 6176980   1980-01-15 01d      DE    CTRY  WIN    CFR      PhM01 0.0150 
##  9 6176981   1980-01-16 01d      DE    CTRY  WIN    CFR      PhM01 0.0610 
## 10 6176982   1980-01-17 01d      DE    CTRY  WIN    CFR      PhM01 0.0270 
## # ... with 57,334 more rows

Now, you have a proper tibble/data.frame in your memory and you can use it into the rest of your code.

my_plot <- d %>% filter(lubridate::year(date) >= 2000 & lubridate::year(date) < 2001) %>% 
  ggplot(aes(x = date, y = value)) + geom_line()
print(my_plot)

Accessing the ECEM ESCII data in the cloud with the tidyverse

matteo.defelice

24 January, 2018