In this lecture, I cover how to create a SQLite database image of the Ergast Developer API, which contains historical Formula 1 racing data.
SQL queries are sent to this file from R/RStudio using the DBI and RSQLite packages, and several F1 race records and category leaders are derived from the tables in the SQLite database. RSQLite embeds the SQLite engine in R with superb integration. SQLite exhibits reasonably fast joins and aggregations, typically among the most important features of a database system. It is simple to use, loads data robustly, and, because it operates on files, handles data larger than RAM.
Finally, I discuss how to use the dplyr package to interface with databases and to bring window functions to bear on the SQLite data. Two race records are produced using dplyr syntax and window functions that would have been awkward to express in pure SQL.
For fans of Formula 1 racing, the Ergast Developer API is an invaluable source of historical race data, race results, and constructor and driver standings for every season since 1950. This data is openly licensed and made readily available through a variety of interfaces, including an API, a webpage that can be queried from a manual interface, and a downloadable database image.
This data can provide answers to countless questions about Formula 1 and can perhaps even be used to reveal stories that went unnoticed or unappreciated in F1 history.
The intention is to write a few queries, in both SQLite's SQL dialect and dplyr syntax, that return interesting records and other superlatives from F1 racing history.
Data in the Ergast experimental Motor Racing Developer API is organized into 13 database tables: circuits, constructors, constructor_results, constructor_standings, drivers, driver_standings, lap_times, pit_stops, qualifying, races, results, seasons, and status.
Many of these tables (status, qualifying, etc.) will likely not be of interest to us; others (results, drivers, races, etc.) will be extremely useful. It is nevertheless important to understand how all of the tables in the database relate to one another.
To that end, we need a schema diagram that identifies the primary and foreign keys in the database and displays the fields in each table. The diagram below illustrates a useful one:
f1db_draw(con)
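The helper f1db_draw() comes from the same f1db repository as the functions shown below. As a rough sketch of what such a helper might wrap (an assumption on my part, not the repository's actual code), the dm package can derive and draw a data model straight from the connection:
library(dm)
f1_dm <- dm_from_con(con)          # learn the table structure from the live connection
dm_draw(f1_dm, view_type = "all")  # render every field in every table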
A complete MySQL database export file (“f1db.sql.gz”) is published on the Ergast website after each race. While one could certainly import this database dump into a MySQL management tool and manipulate it there, I would prefer to connect to the database from RStudio and leverage several R packages (such as DBI, RSQLite, and dbplyr) to interface with it.
While R can connect to a MySQL database, this approach requires that the data be uploaded to a MySQL server and that the database be configured with the appropriate credentials and permissions to allow R access. A simpler route is to first download the Ergast database as a set of CSV files and create a SQLite¹ database file from them.
downloadErgastCSV
function (destfile = paste0(getwd(), "/f1db_csv"))
{
    # Download the zipped CSV export published on the Ergast site
    zipdest <- paste0(getwd(), "/f1db_csv.zip")
    download.file("http://ergast.com/downloads/f1db_csv.zip",
        destfile = zipdest)
    # Unpack the thirteen CSVs into destfile, then remove the archive
    csv_dir <- destfile
    unzip(zipdest, exdir = csv_dir)
    unlink(zipdest)
}
<environment: 0x7f7feb2d3938>
After downloading f1db_csv.zip, we can establish a connection to the data like so. Although not strictly necessary, I have chosen to use R and RStudio to interface with the SQLite database. The function createF1db builds the f1_db.SQLite database file from the CSVs and allows R to connect to it directly.
createF1db # https://github.com/NilsDM/f1db/tree/main/R
function (csv_dir = NULL, rm_csv = FALSE, type = "sqlite")
{
if (file.exists(paste0(getwd(), "/f1_db.", type))) {
stop("Database file already exists", call. = FALSE)
}
if (is.null(csv_dir)) {
downloadErgastCSV()
csv_dir <- paste0(getwd(), "/f1db_csv")
}
con <- DBI::dbConnect(RSQLite::SQLite(), "f1_db.SQLite")
table_names <- c("constructors", "constructor_standings",
"constructor_results", "circuits", "drivers", "driver_standings",
"lap_times", "pit_stops", "seasons", "status", "races",
"results", "qualifying")
csv_names <- paste0("f1db_csv", "/", table_names, ".csv")
num_tables <- length(table_names)
tryCatch({
for (i in 1:length(table_names)) {
assign(table_names[i], utils::read.csv(file = csv_names[i],
fileEncoding = "UTF-8"))
}
f1_tables <- list(constructors, constructor_standings,
constructor_results, circuits, drivers, driver_standings,
lap_times, pit_stops, seasons, status, races, results,
qualifying)
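        # The Ergast CSVs encode missing values as the literal token "\N";
        # the blocks below replace it with the text "NULL" (or with NA for
        # the drivers, races, and results tables) before loading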
constructors <- as.data.frame(lapply(constructors, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
constructor_standings <- as.data.frame(lapply(constructor_standings,
function(x) if (is.character(x) | is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
constructor_results <- as.data.frame(lapply(constructor_results,
function(x) if (is.character(x) | is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
circuits <- as.data.frame(lapply(circuits, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
drivers <- as.data.frame(lapply(drivers, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", NA, x)
else x))
driver_standings <- as.data.frame(lapply(driver_standings,
function(x) if (is.character(x) | is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
lap_times <- as.data.frame(lapply(lap_times, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
pit_stops <- as.data.frame(lapply(pit_stops, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
seasons <- as.data.frame(lapply(seasons, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
status <- as.data.frame(lapply(status, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
races <- as.data.frame(lapply(races, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", NA, x)
else x))
results <- as.data.frame(lapply(results, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", NA, x)
else x))
qualifying <- as.data.frame(lapply(qualifying, function(x) if (is.character(x) |
is.factor(x))
gsub("\\\\N", "NULL", x)
else x))
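        # Assemble a dm data model from the thirteen tables and declare
        # each table's primary key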
dm <- dm::dm(constructors, constructor_standings, constructor_results,
circuits, drivers, driver_standings, lap_times, pit_stops,
seasons, status, races, results, qualifying)
dm_primary_keys <- dm %>% dm::dm_add_pk(circuits, circuitId) %>%
dm::dm_add_pk(constructors, constructorId) %>% dm::dm_add_pk(drivers,
driverId) %>% dm::dm_add_pk(results, resultId) %>%
dm::dm_add_pk(races, raceId) %>% dm::dm_add_pk(constructor_standings,
constructorStandingsId) %>% dm::dm_add_pk(constructor_results,
constructorResultsId) %>% dm::dm_add_pk(qualifying,
qualifyId) %>% dm::dm_add_pk(seasons, year) %>% dm::dm_add_pk(status,
statusId) %>% dm::dm_add_pk(driver_standings, driverStandingsId)
pk_check <- dm_primary_keys %>% dm::dm_examine_constraints()
if (all(pk_check$is_key)) {
dm_foreign_keys <- dm_primary_keys %>% dm::dm_add_fk(pit_stops,
raceId, races, raceId) %>% dm::dm_add_fk(pit_stops,
driverId, drivers, driverId) %>% dm::dm_add_fk(lap_times,
raceId, races, raceId) %>% dm::dm_add_fk(lap_times,
driverId, drivers, driverId)
}
fk_check <- dm_foreign_keys %>% dm::dm_examine_constraints()
if (all(fk_check$is_key)) {
dm_all_keys <- dm_foreign_keys %>% dm::dm_add_fk(constructor_standings,
raceId, races, raceId) %>% dm::dm_add_fk(constructor_standings,
constructorId, constructors) %>% dm::dm_add_fk(results,
constructorId, constructors, constructorId) %>%
dm::dm_add_fk(results, statusId, status, statusId) %>%
dm::dm_add_fk(results, driverId, drivers, driverId) %>%
dm::dm_add_fk(results, raceId, races, raceId) %>%
dm::dm_add_fk(races, year, seasons, year) %>%
dm::dm_add_fk(qualifying, raceId, races, raceId) %>%
dm::dm_add_fk(qualifying, constructorId, constructors,
constructorId) %>% dm::dm_add_fk(qualifying,
driverId, drivers, driverId) %>% dm::dm_add_fk(constructor_results,
constructorId, constructors, constructorId) %>%
dm::dm_add_fk(constructor_results, raceId, races,
raceId) %>% dm::dm_add_fk(driver_standings,
raceId, races, raceId) %>% dm::dm_add_fk(driver_standings,
driverId, drivers, driverId)
}
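        # With primary and foreign keys validated, materialize the model
        # as persistent tables in the SQLite file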
all_check <- dm_all_keys %>% dm::dm_examine_constraints()
if (all(all_check$is_key)) {
db_dm <- dm::copy_dm_to(con, dm_all_keys, temporary = FALSE,
set_key_constraints = TRUE)
}
}, error = function(e) {
DBI::dbDisconnect(con)
unlink("f1_db.sqlite")
stop(e)
})
if (rm_csv) {
unlink(csv_dir)
}
print(db_dm)
return(list(con, db_dm))
}
<environment: 0x7f7feb2d3938>
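A sketch of how these helpers fit together (assuming a writable working directory; per the source above, createF1db() returns the live connection together with the dm object):
f1 <- createF1db(rm_csv = TRUE)  # download the CSVs, build f1_db.SQLite, tidy up
con <- f1[[1]]                   # DBI connection used for the queries below
DBI::dbListTables(con)           # should list the thirteen tables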
We can now use the DBI and dplyr packages to examine and/or query the database:
drivers <- tbl(con, "drivers") %>% collect()  # pull the full drivers table into R
drivers %>% format.dt.f(.)
races <- tbl(con, "races") %>% collect()      # likewise the races table
races %>% format.dt.f(.)
When using the DBI package, the dbGetQuery() function sends a SQL query to the database and returns the results as a data frame. dbGetQuery() takes, at a minimum, two arguments: conn, a database connection object, and statement, a character string containing the SQL query.
A sample query, then, might look something like this:
driver_info1 <- dbGetQuery(
conn = con,
statement = "SELECT driverId, forename, surname, driverRef, dob,
nationality
FROM drivers
ORDER BY driverId;"
)
# For increased clarity, however, I will keep showing both the SQL query and the dplyr syntax:
driver_info2 <- tbl(con, "drivers") %>% # dplyr equivalent of the SQL above
collect() %>%
select(driverId, forename, surname, driverRef, dob, nationality) %>%
arrange(driverId)
all.equal(driver_info1, driver_info2, check.attributes=FALSE)
[1] TRUE
driver_info1 %>% format.dt.f(.)
Although there are many more race records that one could derive from the Ergast F1 database, some of them would be challenging to obtain or would require somewhat circuitous means (self joins, in particular).
The dplyr R package contains many window functions, such as row_number(), min_rank(), dense_rank(), ntile(), lead(), lag(), cumsum(), and cummean().
If database tables are pulled into local R objects, we can use the full suite of dplyr window functions to solve problems that might require very complicated SQL solutions in SQLite.
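As a small illustration (a sketch only; it pulls the driver_standings table into R first), we can rank drivers by points within each race using min_rank(), a window function, inside a grouped mutate():
standings_local <- tbl(con, "driver_standings") %>% collect()
standings_local %>%
  group_by(raceId) %>%                              # one window per race
  mutate(points_rank = min_rank(desc(points))) %>%  # rank drivers by points
  ungroup()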
With this bit of explanation out of the way, we can move ahead with some simple queries to return interesting F1 racing records.
For example, imagine we want to know which driver recorded the most wins in the first season of their F1 career.
An R solution might look like the following:
# Most wins in first season of career
driverStandings <- tbl(con, "driver_standings") %>% collect()
# list(driverStandings) %>% set_names("driverStandings") %>% listviewer::jsonedit()
driverStandings %>%
  left_join(drivers, by = c("driverId" = "driverId")) %>%
  left_join(races, by = c("raceId" = "raceId")) %>%
  group_by(driverId, forename, surname, nationality) %>%
  filter(year == min(year)) %>%     # keep only each driver's debut season
  summarize(Wins = max(wins)) %>%   # wins is cumulative within a season
  ungroup() %>%
  select(-driverId) %>%
  arrange(desc(Wins)) %>%
  format.dt.f(.)
A similar query would also work directly in SQLite, which has supported window functions since version 3.25; a sketch follows.
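One hedged way to express the same record in SQL (a sketch only, written against the schema created above and not tuned for performance):
first_season_wins <- dbGetQuery(
  conn = con,
  statement = "WITH first_season AS (
                 SELECT ds.driverId, ds.wins, r.year,
                        MIN(r.year) OVER (PARTITION BY ds.driverId) AS debut_year
                 FROM driver_standings AS ds
                 JOIN races AS r ON ds.raceId = r.raceId
               )
               SELECT d.forename, d.surname, d.nationality, MAX(fs.wins) AS Wins
               FROM first_season AS fs
               JOIN drivers AS d ON fs.driverId = d.driverId
               WHERE fs.year = fs.debut_year
               GROUP BY fs.driverId, d.forename, d.surname, d.nationality
               ORDER BY Wins DESC;"
)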
Similarly, if we were to ask which driver entered the most races without ever recording a win, we might use the following dplyr call:
# Most races entered without recording a win
driverStandings %>%
  left_join(drivers, by = c("driverId" = "driverId")) %>%
  select(driverId, forename, surname, nationality, wins) %>%
  group_by(driverId, forename, surname, nationality) %>%
  mutate(Total_Wins = sum(wins)) %>%  # cumulative wins summed over all entries
  filter(Total_Wins == 0) %>%         # keep drivers who never recorded a win
  count() %>%                         # rows per driver = races entered
  arrange(desc(n)) %>%
  format.dt.f(.)
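For comparison, this particular record reduces to a GROUP BY with a HAVING clause in SQL (again a sketch against the schema above; wins is never negative, so a sum of zero across all of a driver's entries means the driver never won):
no_wins <- dbGetQuery(
  conn = con,
  statement = "SELECT d.forename, d.surname, d.nationality,
                      COUNT(*) AS races_entered
               FROM driver_standings AS ds
               JOIN drivers AS d ON ds.driverId = d.driverId
               GROUP BY ds.driverId, d.forename, d.surname, d.nationality
               HAVING SUM(ds.wins) = 0
               ORDER BY races_entered DESC;"
)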
The solutions to some problems will likely be easier to obtain using pure SQL syntax, while others will likely be more intuitive using dplyr syntax.
Personally, I have mixed feelings about SQL. On the positive side lies the ability of good database engines to optimize arbitrary queries for performance. And, SQL’s declarative style and relational model can indeed make many data wrangling intentions easy to express, at least for simple data manipulation tasks.
But many problems linger with SQL implementations. Query optimizers, it turns out, can only do so much. Worse, there is substantial syntax variation, and idiosyncrasies abound across the various SQL implementations. For even slightly complex tasks, SQL often becomes difficult to write and understand. I find SQL awkward to compose and nest. It is hard to do things that should be really simple, like mixing result types (scalars and vectors, say) in a query. And there are obvious downsides to representing SQL programs as big character strings in R; for instance, errors can only be caught at run time.
Fortunately, there is dplyr! The package calls itself a grammar of data manipulation, but it is also an impedance-matching circuit that lets us write lovely, composable R functions while still gaining the query optimization and performance that the database implementation has to offer.
Finally, for problems that fit in our system’s memory, we may not find any performance advantage in using SQLite over alternative R approaches. In particular, data.table is usually (always?) faster than SQLite for in-memory problems, and humble base R also often gets close to or exceeds SQLite performance. Base R and data.table additionally offer a substantial array of specialized data manipulation operations that exceed the capabilities of SQLite. However, SQLite works without change on problems larger than the computer's memory, because the data resides in files, and that can be an important advantage in some cases.
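A rough sketch of how one might spot-check this claim on the results table (timings are machine-dependent, and data.table is assumed to be installed):
library(data.table)
results_dt <- as.data.table(tbl(con, "results") %>% collect())
system.time(results_dt[, .(pts = sum(points)), by = driverId])  # in-memory data.table
system.time(dbGetQuery(
  con, "SELECT driverId, SUM(points) AS pts FROM results GROUP BY driverId;"
))                                                              # same aggregation in SQLite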
It seems to me that SQLite is best used from R via dplyr, a really nice combination of technologies. The ability to pick and choose the best tool for the job is deeply valuable, however, and the two approaches can clearly be used to supplement and assist each other.
SQLite is a nifty C-language library that implements a small, fast, self-contained, high-reliability, full-featured SQL database engine. SQLite is the most used database engine in the world: it is built into all mobile phones and most computers and comes bundled inside countless other applications that people use every day. The SQLite file format is stable, cross-platform, and backwards compatible. SQLite also integrates very well with R and Python: RSQLite exposes SQLite tables through DBI and dplyr/dbplyr, and Python ships with the sqlite3 module, making it easy to move between database tables and R or Pandas data frames.↩︎