In this lecture, I cover how to create a SQLite database image of the Ergast Developer API, which contains historical Formula 1 racing data.
SQL queries are sent to this file from R/RStudio using the DBI and RSQLite packages, and several F1 race records and category leaders are derived from the tables in the SQLite database. RSQLite embeds the SQLite engine in R with superb integration. SQLite exhibits reasonably fast joins and aggregations, typically among the most important features of a database system. It is simple to use, loads data robustly, and, because it operates on files, handles data larger than RAM.
Finally, I discuss how to use the dplyr package to interface with databases and to bring window functions to bear on the SQLite data. Two race records are produced using dplyr syntax and window functions that would have been awkward to express in pure SQL.
For fans of Formula 1 racing, the Ergast Developer API is an invaluable source of historical race data, race results, and constructor and driver standings for every season since 1950. This data is openly licensed and made readily available through a variety of interfaces, including an API, a webpage that can be queried from a manual interface, and a downloadable database image.
This data can provide answers to countless questions about Formula 1 and can perhaps even be used to reveal stories that went unnoticed or unappreciated in F1 history.
The intention is to write a few queries, in both SQLite's SQL dialect and dplyr syntax, that return interesting records and other superlatives from F1 racing history.
Data in the Ergast experimental Motor Racing Developer API is organized into 13 database tables: circuits, constructors, constructor_results, constructor_standings, drivers, driver_standings, lap_times, pit_stops, qualifying, races, results, seasons, and status.
Many of these tables (status, qualifying, etc.) will likely not be of interest to us; others (results, drivers, races, etc.) will be extremely useful. It is nevertheless important to understand how all of the tables in the database relate to one another.
To that end, we need a schema diagram that identifies the primary and foreign keys in the database and displays the fields in each table. The diagram below illustrates a useful one:
f1db_draw(con)
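The helper f1db_draw() comes from the same f1db repository as the functions shown below. As a rough sketch of what such a helper might wrap (an assumption on my part, not the repository's actual code), the dm package can derive and draw a data model straight from the connection:
library(dm)
f1_dm <- dm_from_con(con)          # learn the table structure from the live connection
dm_draw(f1_dm, view_type = "all")  # render every field in every table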
A complete MySQL database export file (“f1db.sql.gz”) is published on the Ergast website after each race. While one could certainly import this database dump into a MySQL management tool and manipulate it there, I would prefer to connect to the database from RStudio and leverage several R packages (such as DBI, RSQLite, and dbplyr) to interface with it.
While R can connect to a MySQL database, this approach requires that the data be uploaded to a MySQL server and that the database be configured with the appropriate credentials and permissions to allow R access. A simpler route is to first download the Ergast database as a set of CSV files and create a SQLite¹ database file from them.
downloadErgastCSV
function (destfile = paste0(getwd(), "/f1db_csv"))
{
    # Download the zipped CSV export published on the Ergast site
    zipdest <- paste0(getwd(), "/f1db_csv.zip")
    download.file("http://ergast.com/downloads/f1db_csv.zip",
        destfile = zipdest)
    # Unpack the thirteen CSVs into destfile, then remove the archive
    csv_dir <- destfile
    unzip(zipdest, exdir = csv_dir)
    unlink(zipdest)
}
<environment: 0x7f7feb2d3938>
After downloading f1db_csv.zip, we can establish a connection to the data like so. Although not strictly necessary, I have chosen to use R and RStudio to interface with the SQLite database. The function createF1db builds the f1_db.SQLite database file from the CSVs and allows R to connect to it directly.
createF1db # https://github.com/NilsDM/f1db/tree/main/R
function (csv_dir = NULL, rm_csv = FALSE, type = "sqlite")
{
if (file.exists(paste0(getwd(), "/f1_db.", type))) {
stop("Database file already exists", call. = FALSE)
}
if (is.null(csv_dir)) {
downloadErgastCSV()
csv_dir <- paste0(getwd(), "/f1db_csv")
}
con <- DBI::dbConnect(RSQLite::SQLite(), "f1_db.SQLite")
table_names <- c("constructors", "constructor_standings",
"constructor_results", "circuits", "drivers", "driver_standings",
"lap_times", "pit_stops", "seasons", "status", "races",
"results", "qualifying")
csv_names <- paste0("f1db_csv", "/", table_names, ".csv")
num_tables <- length(table_names)
tryCatch({
for (i in 1:length(table_names)) {
assign(table_names[i], utils::read.csv(file = csv_names[i],
fileEncoding = "UTF-8"))
}
f1_tables <- list(constructors, constructor_standings,
constructor_results, circuits, drivers, driver_standings,
lap_times, pit_stops, seasons, status, races, results,
qualifying)
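        # The Ergast CSVs encode missing values as the literal token "\N";
        # the blocks below replace it with the text "NULL" (or with NA for
        # the drivers, races, and results tables) before loading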
constructors <- as.data.frame(lapply(constructors, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
constructor_standings <- as.data.frame(lapply(constructor_standings,
function(x) if (is.character(x) | is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
constructor_results <- as.data.frame(lapply(constructor_results,
function(x) if (is.character(x) | is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
circuits <- as.data.frame(lapply(circuits, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
drivers <- as.data.frame(lapply(drivers, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", NA, x)
else x))
driver_standings <- as.data.frame(lapply(driver_standings,
function(x) if (is.character(x) | is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
lap_times <- as.data.frame(lapply(lap_times, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
pit_stops <- as.data.frame(lapply(pit_stops, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
seasons <- as.data.frame(lapply(seasons, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
status <- as.data.frame(lapply(status, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
races <- as.data.frame(lapply(races, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", NA, x)
else x))
results <- as.data.frame(lapply(results, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", NA, x)
else x))
qualifying <- as.data.frame(lapply(qualifying, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
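        # Assemble a dm data model from the thirteen tables and declare
        # each table's primary key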
dm <- dm::dm(constructors, constructor_standings, constructor_results,
circuits, drivers, driver_standings, lap_times, pit_stops,
seasons, status, races, results, qualifying)
dm_primary_keys <- dm %>% dm::dm_add_pk(circuits, circuitId) %>%
dm::dm_add_pk(constructors, constructorId) %>% dm::dm_add_pk(drivers,
driverId) %>% dm::dm_add_pk(results, resultId) %>%
dm::dm_add_pk(races, raceId) %>% dm::dm_add_pk(constructor_standings,
constructorStandingsId) %>% dm::dm_add_pk(constructor_results,
constructorResultsId) %>% dm::dm_add_pk(qualifying,
qualifyId) %>% dm::dm_add_pk(seasons, year) %>% dm::dm_add_pk(status,
statusId) %>% dm::dm_add_pk(driver_standings, driverStandingsId)
pk_check <- dm_primary_keys %>% dm::dm_examine_constraints()
if (all(pk_check$is_key)) {
dm_foreign_keys <- dm_primary_keys %>% dm::dm_add_fk(pit_stops,
raceId, races, raceId) %>% dm::dm_add_fk(pit_stops,
driverId, drivers, driverId) %>% dm::dm_add_fk(lap_times,
raceId, races, raceId) %>% dm::dm_add_fk(lap_times,
driverId, drivers, driverId)
}
fk_check <- dm_foreign_keys %>% dm::dm_examine_constraints()
if (all(fk_check$is_key)) {
dm_all_keys <- dm_foreign_keys %>% dm::dm_add_fk(constructor_standings,
raceId, races, raceId) %>% dm::dm_add_fk(constructor_standings,
constructorId, constructors) %>% dm::dm_add_fk(results,
constructorId, constructors, constructorId) %>%
dm::dm_add_fk(results, statusId, status, statusId) %>%
dm::dm_add_fk(results, driverId, drivers, driverId) %>%
dm::dm_add_fk(results, raceId, races, raceId) %>%
dm::dm_add_fk(races, year, seasons, year) %>%
dm::dm_add_fk(qualifying, raceId, races, raceId) %>%
dm::dm_add_fk(qualifying, constructorId, constructors,
constructorId) %>% dm::dm_add_fk(qualifying,
driverId, drivers, driverId) %>% dm::dm_add_fk(constructor_results,
constructorId, constructors, constructorId) %>%
dm::dm_add_fk(constructor_results, raceId, races,
raceId) %>% dm::dm_add_fk(driver_standings,
raceId, races, raceId) %>% dm::dm_add_fk(driver_standings,
driverId, drivers, driverId)
}
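        # With primary and foreign keys validated, materialize the model
        # as persistent tables in the SQLite file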
all_check <- dm_all_keys %>% dm::dm_examine_constraints()
if (all(all_check$is_key)) {
db_dm <- dm::copy_dm_to(con, dm_all_keys, temporary = FALSE,
set_key_constraints = TRUE)
}
}, error = function(e) {
DBI::dbDisconnect(con)
unlink("f1_db.sqlite")
stop(e)
})
if (rm_csv) {
unlink(csv_dir)
}
print(db_dm)
return(list(con, db_dm))
}
<environment: 0x7f7feb2d3938>
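A sketch of how these helpers fit together (assuming a writable working directory; per the source above, createF1db() returns the live connection together with the dm object):
f1 <- createF1db(rm_csv = TRUE)  # download the CSVs, build f1_db.SQLite, tidy up
con <- f1[[1]]                   # DBI connection used for the queries below
DBI::dbListTables(con)           # should list the thirteen tables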
We can now use the DBI and dplyr packages to examine and/or query the database:
drivers <- tbl(con, "drivers") %>% collect()  # pull the full drivers table into R
drivers %>% format.dt.f(.)
races <- tbl(con, "races") %>% collect()      # likewise the races table
races %>% format.dt.f(.)
When using the DBI package, the dbGetQuery() function sends a SQL query to the database and returns the results as a data frame. dbGetQuery() takes, at a minimum, two arguments: conn, a database connection object, and statement, a character string containing the SQL query.
A sample query, then, might look something like this:
driver_info1 <- dbGetQuery(
conn = con,
statement = "SELECT driverId, forename, surname, driverRef, dob,
nationality
FROM drivers
ORDER BY driverId;"
)
# For increased clarity, however, I will keep showing both the SQL query and the dplyr syntax:
driver_info2 <- tbl(con, "drivers") %>% # dplyr equivalent of the SQL above
collect() %>%
select(driverId, forename, surname, driverRef, dob, nationality) %>%
arrange(driverId)
all.equal(driver_info1, driver_info2, check.attributes=FALSE)
[1] TRUE
driver_info1 %>% format.dt.f(.)
Although there are many more race records that one could derive from the Ergast F1 database, some of them would be challenging to obtain or would require somewhat circuitous means (self joins, in particular).
The dplyr R package contains many window functions, such as row_number(), min_rank(), dense_rank(), ntile(), lead(), lag(), cumsum(), and cummean().
If database tables are pulled into local R objects, we can use the full suite of dplyr window functions to solve problems that might require very complicated SQL solutions in SQLite.
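As a small illustration (a sketch only; it pulls the driver_standings table into R first), we can rank drivers by points within each race using min_rank(), a window function, inside a grouped mutate():
standings_local <- tbl(con, "driver_standings") %>% collect()
standings_local %>%
  group_by(raceId) %>%                              # one window per race
  mutate(points_rank = min_rank(desc(points))) %>%  # rank drivers by points
  ungroup()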
With this bit of explanation out of the way, we can move ahead with some simple queries to return interesting F1 racing records.
For example, imagine we want to know which driver recorded the most wins in the first season of their F1 career.
An R solution might look like the following:
# Most wins in first season of career
driverStandings <- tbl(con, "driver_standings") %>% collect()
# list(driverStandings) %>% set_names("driverStandings") %>% listviewer::jsonedit()
driverStandings %>%
  left_join(drivers, by = c("driverId" = "driverId")) %>%
  left_join(races, by = c("raceId" = "raceId")) %>%
  group_by(driverId, forename, surname, nationality) %>%
  filter(year == min(year)) %>%     # keep only each driver's debut season
  summarize(Wins = max(wins)) %>%   # wins is cumulative within a season
  ungroup() %>%
  select(-driverId) %>%
  arrange(desc(Wins)) %>%
  format.dt.f(.)
A similar query would also work directly in SQLite, which has supported window functions since version 3.25; a sketch follows.
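One hedged way to express the same record in SQL (a sketch only, written against the schema created above and not tuned for performance):
first_season_wins <- dbGetQuery(
  conn = con,
  statement = "WITH first_season AS (
                 SELECT ds.driverId, ds.wins, r.year,
                        MIN(r.year) OVER (PARTITION BY ds.driverId) AS debut_year
                 FROM driver_standings AS ds
                 JOIN races AS r ON ds.raceId = r.raceId
               )
               SELECT d.forename, d.surname, d.nationality, MAX(fs.wins) AS Wins
               FROM first_season AS fs
               JOIN drivers AS d ON fs.driverId = d.driverId
               WHERE fs.year = fs.debut_year
               GROUP BY fs.driverId, d.forename, d.surname, d.nationality
               ORDER BY Wins DESC;"
)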
Similarly, if we were to ask which driver entered the most races without ever recording a win, we might use the following dplyr call:
# Most races entered without recording a win
driverStandings %>%
  left_join(drivers, by = c("driverId" = "driverId")) %>%
  select(driverId, forename, surname, nationality, wins) %>%
  group_by(driverId, forename, surname, nationality) %>%
  mutate(Total_Wins = sum(wins)) %>%  # cumulative wins summed over all entries
  filter(Total_Wins == 0) %>%         # keep drivers who never recorded a win
  count() %>%                         # rows per driver = races entered
  arrange(desc(n)) %>%
  format.dt.f(.)
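For comparison, this particular record reduces to a GROUP BY with a HAVING clause in SQL (again a sketch against the schema above; wins is never negative, so a sum of zero across all of a driver's entries means the driver never won):
no_wins <- dbGetQuery(
  conn = con,
  statement = "SELECT d.forename, d.surname, d.nationality,
                      COUNT(*) AS races_entered
               FROM driver_standings AS ds
               JOIN drivers AS d ON ds.driverId = d.driverId
               GROUP BY ds.driverId, d.forename, d.surname, d.nationality
               HAVING SUM(ds.wins) = 0
               ORDER BY races_entered DESC;"
)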
The solutions to some problems will likely be easier to obtain using pure SQL syntax, while others will likely be more intuitive using dplyr syntax.
Personally, I have mixed feelings about SQL. On the positive side lies the ability of good database engines to optimize arbitrary queries for performance. And, SQL’s declarative style and relational model can indeed make many data wrangling intentions easy to express, at least for simple data manipulation tasks.
But many problems linger with SQL implementations. Query optimizers, it turns out, can only do so much. Worse, there is substantial syntax variation, and idiosyncrasies abound across the various SQL implementations. For even slightly complex tasks, SQL often becomes difficult to write and understand. I find SQL awkward to compose and nest. It is hard to do things that should be really simple, like mixing result types (scalars and vectors, say) in a query. And there are obvious downsides to representing SQL programs as big character strings in R; for instance, errors can only be caught at run time.
Fortunately, there is dplyr! The package calls itself a grammar of data manipulation, but it is also an impedance-matching circuit that lets us write lovely, composable R functions while still gaining the query optimization and performance that the database implementation has to offer.
Finally, for problems that fit in our system’s memory, we may not find any performance advantage in using SQLite over alternative R approaches. In particular, data.table is usually (always?) faster than SQLite for in-memory problems, and humble base R also often gets close to or exceeds SQLite performance. Base R and data.table additionally offer a substantial array of specialized data manipulation operations that exceed the capabilities of SQLite. However, SQLite works without change on problems larger than the computer's memory, because the data resides in files, and that can be an important advantage in some cases.
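A rough sketch of how one might spot-check this claim on the results table (timings are machine-dependent, and data.table is assumed to be installed):
library(data.table)
results_dt <- as.data.table(tbl(con, "results") %>% collect())
system.time(results_dt[, .(pts = sum(points)), by = driverId])  # in-memory data.table
system.time(dbGetQuery(
  con, "SELECT driverId, SUM(points) AS pts FROM results GROUP BY driverId;"
))                                                              # same aggregation in SQLite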
It seems to me that SQLite is best used from R via dplyr, a really nice combination of technologies. The ability to pick and choose the best tool for the job is deeply valuable, however, and the two approaches can clearly be used to supplement and assist each other.
SQLite is a nifty C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. SQLite is the most used database engine in the world: it is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day. The SQLite file format is stable, cross-platform, and backwards compatible. SQLite also integrates very well with R and Python: RSQLite exposes SQLite tables through DBI and dplyr/dbplyr, and Python ships with the sqlite3 module, making it easy to move between database tables and R or Pandas data frames.↩︎