You’ll need to install three packages today:
install.packages(c("tidyverse", "RSQLite", "nycflights13"))tidyverse installs an entire suite of tools for interacting with data. We’ll use dplyr, along with some of the things it uses behind the scenes, for interacting with a database.RSQLite is a package for interacting with SQLite (a lightweight SQL database system) through R.nycflights13 is a package that contains data about flights departing JFK, LaGuardia, and Newark airports in 2013.1The only packages we actually need to load in are tidyverse and DBI.
library(tidyverse)
library(DBI)We’ll create a SQLite database in memory to keep things easy today. It will disappear as soon as we disconnect, so if you want to keep it, replace ":memory:" with the name of a .sqlite file. We’ll pretend that it’s actually a larger database on a remote server somewhere.
con <- dbConnect(RSQLite::SQLite(), ":memory:")
con## <SQLiteConnection>
## Path: :memory:
## Extensions: TRUE
Now that we’ve made a database, let’s copy some data over to it using dbWriteTable().
dbWriteTable(con, "flights", nycflights13::flights)
dbWriteTable(con, "planes", nycflights13::planes)
dbWriteTable(con, "airlines", nycflights13::airlines)We can pass SQL queries directly as a character string using dbGetQuery(), but it requires knowledge of SQL. It will also collect the results of our query and store them as a local dataframe. This might not be desirable if it means gathering huge amounts of data from our remote connection and storing them in local memory.
For now, let’s try getting flights that arrived more than 20 minutes late.
delays <- dbGetQuery(con, "SELECT * FROM flights WHERE arr_delay > 20")
head(delays)We can use the tbl() function to create pointers to the remote tables on the database. We can save that pointer into a variable, and perform manipulations on it as though it were a local dataframe. It won’t store anything on our local machine until we’re ready, though, which means that we can have the server do some of the initial - sometimes heavy - processing rather than our machine.
flights <- tbl(con, "flights")
head(flights)Now that we have a reference to the table we want, but haven’t clogged up local memory with a large dataframe, let’s try getting the result we want. We can use the dplyr filter() command to only keep flights that arrive more than 20 minutes late.
delays <- flights %>%
filter(arr_delay > 20)
head(delays)Isn’t that easy? Let’s try adding pointers to the other tables we made, and taking a look at them.
planes <- tbl(con, "planes")
head(planes)airlines <- tbl(con, "airlines")
head(airlines)When we use dplyr functions like filter() on a remote database, dplyr is smart enough to know what’s going on. It generates a SQL query under the hood, but doesn’t send it off to the server until we want to get a local copy of our data.
We can inspect the query ourselves using the sql_render() function from dbplyr (which was installed with tidyverse).
delays %>%
dbplyr::sql_render()## <SQL> SELECT *
## FROM `flights`
## WHERE (`arr_delay` > 20.0)
When we’re manipulating data, it’s very important for us to have some basic information about it. With local data frames, we can always run commands like dim(), ncol(), and nrow() to understand the size and shape of our data. This doesn’t necessarily work with remote tables, though. Let’s try:
dim(flights)## [1] NA 19
That’s an interesting result…we have 19 columns, but the number of rows is NA. We can try in more specificity.
ncol(flights)## [1] 19
nrow(flights)## [1] NA
As it turns out, nrow() doesn’t work for remote tables. We can get the number of columns easily, because in the SQL database it’s just a list of field names. The number of rows is undetermined, though, because depending on the query defined by our variable flights, we could be dealing with millions and millions of rows – it could take the same computing power to get the number of rows as to run the entire query, because the main way we can get the number of rows accurately is to run the entire query. If we want to force that issue, we can run tally().
tally(flights)Note that we can still run some basic commands like head() on the remote table and get a result. We could also run glimpse() for something similar.
glimpse(flights)## Observations: 25
## Variables: 19
## $ year <int> 2013, 2013, 2013, 2013, 2013, 2013, 2013, 2013,...
## $ month <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ day <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,...
## $ dep_time <int> 517, 533, 542, 544, 554, 554, 555, 557, 557, 55...
## $ sched_dep_time <int> 515, 529, 540, 545, 600, 558, 600, 600, 600, 60...
## $ dep_delay <dbl> 2, 4, 2, -1, -6, -4, -5, -3, -3, -2, -2, -2, -2...
## $ arr_time <int> 830, 850, 923, 1004, 812, 740, 913, 709, 838, 7...
## $ sched_arr_time <int> 819, 830, 850, 1022, 837, 728, 854, 723, 846, 7...
## $ arr_delay <dbl> 11, 20, 33, -18, -25, 12, 19, -14, -8, 8, -2, -...
## $ carrier <chr> "UA", "UA", "AA", "B6", "DL", "UA", "B6", "EV",...
## $ flight <int> 1545, 1714, 1141, 725, 461, 1696, 507, 5708, 79...
## $ tailnum <chr> "N14228", "N24211", "N619AA", "N804JB", "N668DN...
## $ origin <chr> "EWR", "LGA", "JFK", "JFK", "LGA", "EWR", "EWR"...
## $ dest <chr> "IAH", "IAH", "MIA", "BQN", "ATL", "ORD", "FLL"...
## $ air_time <dbl> 227, 227, 160, 183, 116, 150, 158, 53, 140, 138...
## $ distance <dbl> 1400, 1416, 1089, 1576, 762, 719, 1065, 229, 94...
## $ hour <dbl> 5, 5, 5, 5, 6, 5, 6, 6, 6, 6, 6, 6, 6, 6, 6, 5,...
## $ minute <dbl> 15, 29, 40, 45, 0, 58, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ time_hour <dbl> 1357016400, 1357016400, 1357016400, 1357016400,...
Most major dplyr functions will work with remote tables, including (but not limited to):
group_by()summarize()filter()select()mutate()arrange()The dplyr pipe %>% is also supported, so we can write complicated code cleanly. Let’s try seeing which airport has the longest-delayed arrivals.
delays_by_airport <- flights %>%
group_by(origin) %>%
summarize(mean_dep_delay = mean(dep_delay), mean_arr_delay = mean(arr_delay)) %>%
arrange(-mean_arr_delay)
head(delays_by_airport)If nothing else, we’ve confirmed the universally-known truth that Newark airport is the worst.
Let’s check out the SQL behind that code – we can see that it’s more complicated, but because most dplyr functions are built to act like SQL commands, they port over fairly easily.
delays_by_airport %>%
dbplyr::sql_render()## <SQL> SELECT *
## FROM (SELECT `origin`, AVG(`dep_delay`) AS `mean_dep_delay`, AVG(`arr_delay`) AS `mean_arr_delay`
## FROM `flights`
## GROUP BY `origin`)
## ORDER BY -`mean_arr_delay`
We can also use any of the *_join() functions. Rather than performing SQL-like joins with local data frames, we’re actually performing SQL joins remotely! Let’s try finding which airline has the longest average delays, separated by airport.
The flights table contains the carrier code, but not the names of the airlines. That is in the airlines table. They have the carrier column in common, though, so we can perform a left join to bring that information together.
delays_airport_airline <- flights %>%
group_by(origin, carrier) %>%
summarize(mean_dep_delay = mean(dep_delay), mean_arr_delay = mean(arr_delay)) %>%
arrange(-mean_arr_delay) %>%
left_join(airlines)## Joining, by = "carrier"
head(delays_airport_airline)American Airlines happens to buy planes from the largest number of companies. Let’s see how each manufacturer performs for them. We can do this by filtering down our flights table, joining it with our planes table to get manufacturer information, and then performing a few other grouping and summarizing operations on it.
aa_delays_by_manufacturer <- flights %>%
filter(carrier == "AA") %>%
left_join(planes, by = "tailnum") %>%
group_by(manufacturer) %>%
summarize(mean_dep_delay = mean(dep_delay), mean_arr_delay = mean(arr_delay)) %>%
arrange(-mean_arr_delay)
head(aa_delays_by_manufacturer)One thing to note is that while we can use a lot of the base functions (mean(), max(), etc.) inside these dplyr statements, we can’t usually use functions from other packages because they don’t have direct SQL translations. For instance, we couldn’t use something from stringr as a filter for our data - let’s say we wanted tail numbers that included "ND" as an example.
delays_error <- flights %>%
filter(stringr::str_detect(tailnum, "ND"))## Error in stri_detect_regex(string, pattern, opts_regex = opts(pattern)): object 'tailnum' not found
So far we’ve been constructing queries, and having them run on a small scale with the head() function, which only returns a few rows rather than the entire result. When we’ve decided that we have a good query and are ready to gather all the results, we use the collect() function.
collect() will have the server perform the operations we’ve requested, and will bring the data back through the connection to our local machines. We can save this as a variable, like df.
df <- aa_delays_by_manufacturer %>%
collect()
dfWe can also pipe the results of collect() directly into a plotting function. If we were interested in seeing how each airport performed month-by-month, we could do something like this:
flights %>%
group_by(month, origin) %>%
summarize(mean_arr_delay = mean(arr_delay)) %>%
collect()Once the data is local, we can pipe it into a ggplot() function, like so:
flights %>%
group_by(month, origin) %>%
summarize(mean_arr_delay = mean(arr_delay)) %>%
collect() %>%
ggplot(aes(x = month, y = mean_arr_delay)) +
geom_line(aes(color = origin)) +
geom_point(aes(color = origin))Once we’re done, it’s important to close our database connection. We do that with another DBI function, dbDisconnect():
dbDisconnect(con)The idea for using this data, and some of this code, came from this article.↩