Databases and R

The goal of this document is teach you the basics of connecting to a database management system (aka DBMS). I assume you’re already familiar with the basics of SQL; you want to know to run it from R. I’m going to focus on DBI-compatible packages. The DBI (database interface) package defines a set of S4 classes and generics that individual DBMS backends implement. Once you’ve mastered one backend, it’s simple process to use any other backend.

Currently, there are seven packages that provide DBI compatible backends:

Unfortunately, RODBC which allows you to talk to ODBC compatible databases is not a DBI backend. You’ll need to read it’s documentation to learn how to use it.

Most of the time the database has already been picked for you and you just need to pick the appropriate driver. If you’re also creating the database I recommend first trying SQLite. If that doesn’t meet your needs, try PostgreSQL. It’s open source, relatively easy to run the server yourself and it’s a capable database engine.

Getting started

To get started, install RSQLite:

install.packages("RSQLite")

RSQLite is the easiest DBI backend to start with because it includes the complete DBMS. All other databases use a client-server architecture which is harder to set up. In a client-server architecture, there is one database server which has all the data and runs all the queries, and many database clients, one of which will be your R session.

Important objects

There are three important DBI classes: DBIDriver, DBIConnection and DBIResultSet. These are abstract classes which means that you’ll never use them directly. Instead DBI backend subclasses them to provide (e.g.) SQLiteDriver, SQLiteConnection and SQLiteResultSet. These are S4 classes. The only reason you need to care about that is because they use a special way of accessing help: class?DBIDriver, class?SQLiteConnection etc.

The driver

The driver defines which type of database you’re connecting to (e.g. SQLite, MySQL, etc). Each package provides a function to access the driver. You should never create an instance yourself with methods::new(). The function to access driver is usually the same as the name of the DBMS (i.e. the name of the package without R):

RSQLite::SQLite()
#> <SQLiteDriver>
RMySQL::MySQL()
#> <MySQLDriver:(42328)>
RPostgreSQL::PostgreSQL()
#> <PostgreSQLDriver:(42328)>

Typically this will be the only function you call outside of the DBI package.

The connection

The driver is used to distinguish between different database systems, e.g. PostgreSQL, MySQL, Oracle. A connection connects to a specific database, e.g. StudentTestScores within a database management system.

You make a connection with dbConnect():

library(DBI)
con <- dbConnect(RSQLite::SQLite(), dbname = ":memory:")

SQLite is the easiest to connect to because it doesn’t use a client-server architecture, so you only need to give it the name of the database. In SQLite this is either the name of a file on disk, or :memory: to create a temporary in-memory database (which is again great for testing and experimentation).

For all other databases, you’ll also need to provide additional arguments. The easiest way to find out what they are is to look at the help. Again, since this is S4, you’ll need to use a slightly different syntax to usual, e.g. ?dbConnect(RSQLite::SQLite()). For most databases you’ll need to supply the following information:

  • Where the database server is located, with the host and port arguments.

  • Your access credentials with username and password. It’s not a good idea to save these in your R code, because it’s easy for others to find them (for example you don’t want to accidentally share on github). Most database drivers provide an alternative way of specifying them, typically with a global configuration file. For example, RMySQL uses ~/.my.cnf.

Once you’re finished with a connection, it’s a polite to disconnect. This tells the database server that it no longer needs to allocate resources for you.

dbDisconnect(con)
#> [1] TRUE

The result set

Now that you have a connection to a database, you’ll want to do something with it. For the purposes of illustration, we’ll use the datasetsDb() included in the RSQLite package. It contains a copy of every dataset from R’s datasets package.

To access it we first create a connection, using the appropriate driver:

path <- system.file("db", "datasets.sqlite", package = "RSQLite")
con <- dbConnect(RSQLite::SQLite(), path)

Now we can run SQL queries with dbSendQuery():

rs <- dbSendQuery(con, "SELECT * FROM iris LIMIT 5")
rs
#> <SQLiteResult>

This returns an object that inherits from DBIResultSet. To see the contents of the query, use dbFetch():

dbFetch(rs)
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> 1          5.1         3.5          1.4         0.2  setosa
#> 2          4.9         3.0          1.4         0.2  setosa
#> 3          4.7         3.2          1.3         0.2  setosa
#> 4          4.6         3.1          1.5         0.2  setosa
#> 5          5.0         3.6          1.4         0.2  setosa

Once you’re done with the result set, you need to close it. Most drivers only support only result set per connection, so you need to get in the habit of cleaning up after yourself.

dbClearResult(rs)
#> [1] TRUE

dbSendQuery() , dbFetch() and dbClearResult() are most useful when you’re executing a large query and don’t want to pull down all of the data at once. You can call dbFetch(rs, n = 100) to fetch only 100 records at a time, and use dbHasCompleted(rs) to determine if there are any rows.

Most of the time, however, you’ll use dbGetQuery() which calls dbSendQuery(), then dbFetch(), then dbClearResults() and returns a data frame containing the results:

dbSendQuery(con, "SELECT * FROM iris LIMIT 5")
#> <SQLiteResult>

Helper functions

DBI provides a number of helpful functions that will generate and execute SQL. They fall mainly into two camps: table manipulation and transaction management.

Table manipulation

# See if a table already exists
dbListTables(con)
#> Warning in sqliteSendQuery(con, statement, bind.data): Closing result set
#> with pending rows
#>  [1] "BOD"              "CO2"              "ChickWeight"     
#>  [4] "DNase"            "Formaldehyde"     "Indometh"        
#>  [7] "InsectSprays"     "LifeCycleSavings" "Loblolly"        
#> [10] "Orange"           "OrchardSprays"    "PlantGrowth"     
#> [13] "Puromycin"        "Theoph"           "ToothGrowth"     
#> [16] "USArrests"        "USJudgeRatings"   "airquality"      
#> [19] "anscombe"         "attenu"           "attitude"        
#> [22] "cars"             "chickwts"         "esoph"           
#> [25] "faithful"         "freeny"           "infert"          
#> [28] "iris"             "longley"          "morley"          
#> [31] "mtcars"           "npk"              "pressure"        
#> [34] "quakes"           "randu"            "rock"            
#> [37] "sleep"            "stackloss"        "swiss"           
#> [40] "trees"            "warpbreaks"       "women"
dbExistsTable(con, "cars")
#> [1] TRUE
dbExistsTable(con, "abcdef")
#> [1] FALSE

# Add a table from a local data frame
df <- data.frame(y = runif(5))
dbWriteTable(con, "df", df)
#> [1] TRUE
dbExistsTable(con, "df")
#> [1] TRUE

# Retrieve a data frame from a remote table
dbReadTable(con, "df")
#>           y
#> 1 0.8218436
#> 2 0.3640670
#> 3 0.2168223
#> 4 0.4555132
#> 5 0.2826618

# Remove table
dbRemoveTable(con, "df")
#> [1] TRUE
dbExistsTable(con, "df")
#> [1] FALSE

Transactions

Transactions allow you to hold off committing changes to a database. Multiple connections to the same database won’t see uncommited changes.

tmp <- tempfile()
con1 <- dbConnect(RSQLite::SQLite(), dbname = tmp)
con2 <- dbConnect(RSQLite::SQLite(), dbname = tmp)

# Begin a transation
dbBegin(con1)
#> [1] TRUE
dbWriteTable(con1, "mtcars", mtcars)
#> [1] TRUE
dbListTables(con1)
#> [1] "mtcars"

# Other connections can't see it
dbListTables(con2)
#> character(0)

# until it's commited:
dbCommit(con1)
#> [1] TRUE
dbListTables(con2)
#> [1] "mtcars"

You can use dbRollback() to rollback any uncommited changes.