The tidyverse meta-package provides unvaluable tools to access and manipulate data in many formats. One of its greatest features is the possibility to access data directly from a database, translating when possible the dplyr functions into the equivalent in the database-specific SQL language. A brief introduction on databases and R can be found here
The first step is obviously the command to load the tidyverse meta-package
library(tidyverse)
## ── Attaching packages ──────────────────────────────────────────────────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1 ✔ purrr 0.2.4
## ✔ tibble 1.4.1 ✔ dplyr 0.7.4
## ✔ tidyr 0.7.2 ✔ stringr 1.2.0
## ✔ readr 1.1.1 ✔ forcats 0.2.0
## ── Conflicts ─────────────────────────────────────────────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
I have prepared a PostgreSQL database (``the world’s most advanced open source database’’ according the developers) on a Amazon AWS account. Actually, I have activated an Amazon RDS (Relational Database Service, see here). You can try this at home asking for, as I did, an AWS Free Tier account. In this way you can have freely 750 hours per month of a small database (different engines are available like MySQL, PostgreSQL, MariaDB, SQL Server…).
Here the lines to connect to the database:
db_name <- src_postgres(dbname = 'entsoe_db', user = 'ecem_user', password = 'horsedice',
host = "entsoe-tp.cv6glosov0uu.eu-west-2.rds.amazonaws.com")
The user has been created with read-only permissions, so feel free to experiment without any problem. Now you can see that:
print(db_name)
## src: postgres 9.6.3 [ecem_user@entsoe-tp.cv6glosov0uu.eu-west-2.rds.amazonaws.com:5432/entsoe_db]
## tbls: ecem_esciis, entsoe_dp
The output says that you are connected to a postgres database and two tables (tbls) are present: ecem_esciis and entsoe_dp. The two tables have been created me and they contain, respectively, ECEM ESCIIs data and all the hourly generation data from the ENTSO-E Transparency Platform. Now, let’s try to access the first:
d <- tbl(db_name, 'ecem_esciis')
print(d)
## # Source: table<ecem_esciis> [?? x 9]
## # Database: postgres 9.6.3
## # [ecem_user@entsoe-tp.cv6glosov0uu.eu-west-2.rds.amazonaws.com:5432/entsoe_db]
## row.names date interval area aggr source variable model value
## <chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 6176973 1980-01-08 01d DE CTRY WIN CFR PhM01 0.00400
## 2 6176974 1980-01-09 01d DE CTRY WIN CFR PhM01 0.00500
## 3 6176975 1980-01-10 01d DE CTRY WIN CFR PhM01 0.0140
## 4 6176976 1980-01-11 01d DE CTRY WIN CFR PhM01 0.0280
## 5 6176977 1980-01-12 01d DE CTRY WIN CFR PhM01 0.0670
## 6 6176978 1980-01-13 01d DE CTRY WIN CFR PhM01 0.0320
## 7 6176979 1980-01-14 01d DE CTRY WIN CFR PhM01 0.0200
## 8 6176980 1980-01-15 01d DE CTRY WIN CFR PhM01 0.0150
## 9 6176981 1980-01-16 01d DE CTRY WIN CFR PhM01 0.0610
## 10 6176982 1980-01-17 01d DE CTRY WIN CFR PhM01 0.0270
## # ... with more rows
Now d seems to store the entire database but if we check the size of the variable:
print(object.size(d))
## 3728 bytes
It seems too small to be the entire ESCII database, isn’t it? This is because the at this stage d has not still be evaluated, it is a sort of pointer to the original data stored on the remote database. This is called lazy evaluation and it is a powerful feature because it tries to postpone the actual gathering of the data only when needed, to allow a more effective memory management. In fact, the remote database could be even larger than the memory available on your workstation, and thanks to this mechanism it does not matter if you plan to work only with a subset of the original data. For example we can filter the data using dplyr:
d <- tbl(db_name, 'ecem_esciis') %>%
filter(area == 'DE', source == 'WIN')
print(object.size(d))
## 6880 bytes
print(d)
## # Source: lazy query [?? x 9]
## # Database: postgres 9.6.3
## # [ecem_user@entsoe-tp.cv6glosov0uu.eu-west-2.rds.amazonaws.com:5432/entsoe_db]
## row.names date interval area aggr source variable model value
## <chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 6176973 1980-01-08 01d DE CTRY WIN CFR PhM01 0.00400
## 2 6176974 1980-01-09 01d DE CTRY WIN CFR PhM01 0.00500
## 3 6176975 1980-01-10 01d DE CTRY WIN CFR PhM01 0.0140
## 4 6176976 1980-01-11 01d DE CTRY WIN CFR PhM01 0.0280
## 5 6176977 1980-01-12 01d DE CTRY WIN CFR PhM01 0.0670
## 6 6176978 1980-01-13 01d DE CTRY WIN CFR PhM01 0.0320
## 7 6176979 1980-01-14 01d DE CTRY WIN CFR PhM01 0.0200
## 8 6176980 1980-01-15 01d DE CTRY WIN CFR PhM01 0.0150
## 9 6176981 1980-01-16 01d DE CTRY WIN CFR PhM01 0.0610
## 10 6176982 1980-01-17 01d DE CTRY WIN CFR PhM01 0.0270
## # ... with more rows
What happened here? The dbplyr PostgreSQL backend translated the dplyr function into a SQL query executed directly on the server. You can see in the first line that R does not even know the real size of the returned data but only the number of columns. You can in the end gather the data using the collect function:
d <- tbl(db_name, 'ecem_esciis') %>%
filter(area == 'DE', source == 'WIN') %>%
collect()
print(object.size(d))
## 6883728 bytes
print(class(d))
## [1] "tbl_df" "tbl" "data.frame"
print(d)
## # A tibble: 57,344 x 9
## row.names date interval area aggr source variable model value
## * <chr> <date> <chr> <chr> <chr> <chr> <chr> <chr> <dbl>
## 1 6176973 1980-01-08 01d DE CTRY WIN CFR PhM01 0.00400
## 2 6176974 1980-01-09 01d DE CTRY WIN CFR PhM01 0.00500
## 3 6176975 1980-01-10 01d DE CTRY WIN CFR PhM01 0.0140
## 4 6176976 1980-01-11 01d DE CTRY WIN CFR PhM01 0.0280
## 5 6176977 1980-01-12 01d DE CTRY WIN CFR PhM01 0.0670
## 6 6176978 1980-01-13 01d DE CTRY WIN CFR PhM01 0.0320
## 7 6176979 1980-01-14 01d DE CTRY WIN CFR PhM01 0.0200
## 8 6176980 1980-01-15 01d DE CTRY WIN CFR PhM01 0.0150
## 9 6176981 1980-01-16 01d DE CTRY WIN CFR PhM01 0.0610
## 10 6176982 1980-01-17 01d DE CTRY WIN CFR PhM01 0.0270
## # ... with 57,334 more rows
Now, you have a proper tibble/data.frame in your memory and you can use it into the rest of your code.
my_plot <- d %>% filter(lubridate::year(date) >= 2000 & lubridate::year(date) < 2001) %>%
ggplot(aes(x = date, y = value)) + geom_line()
print(my_plot)