This tutorial will cover how to merge data frames using dplyr. We will cover the following topics:

Mutating joins

dplyr contains joining functions that enable you to combine two data frames into one. These functions mimic database joins. If you aren’t familiar with database joins, this site has a good overview.

We’ll use two simple data frames to demonstrate how joins work.

monitors <- read.table(header=TRUE, text='
  monitorid        lat       long                 
          1  42.467573  -87.810047     
          2  42.049148  -88.273029
          3  39.110539  -90.324080
                       ')

pollutants <- read.table(header=TRUE, text='
  pollutant   duration    monitorid                 
      ozone         1h            1
        so2         1h            1
      ozone         8h            2
        no2         1h            4
                         ')

To merge two data frames, there must be one or more columns that are in both data frames. If we look at the monitors data frame and the pollutants data frame, we can see that they both have a “monitorid” column. That’s the column we would merge on, and in this tutorial we will refer to it as the “key”.

Below is a list of merging functions in dplyr that combine two data frames into one, also referred to as mutating joins.

library(dplyr)

inner_join(pollutants, monitors, by = "monitorid")
##   pollutant duration monitorid      lat      long
## 1     ozone       1h         1 42.46757 -87.81005
## 2       so2       1h         1 42.46757 -87.81005
## 3     ozone       8h         2 42.04915 -88.27303

Note that there is only one record for monitor 1 in the monitors data frame, but the information for that monitor is repeated in the output because there is more than one monitor id with the value 1 in the pollutants data frame.

left_join(pollutants, monitors, by = "monitorid")
##   pollutant duration monitorid      lat      long
## 1     ozone       1h         1 42.46757 -87.81005
## 2       so2       1h         1 42.46757 -87.81005
## 3     ozone       8h         2 42.04915 -88.27303
## 4       no2       1h         4       NA        NA
full_join(pollutants, monitors, by = "monitorid")
##   pollutant duration monitorid      lat      long
## 1     ozone       1h         1 42.46757 -87.81005
## 2       so2       1h         1 42.46757 -87.81005
## 3     ozone       8h         2 42.04915 -88.27303
## 4       no2       1h         4       NA        NA
## 5      <NA>     <NA>         3 39.11054 -90.32408

Filtering joins with dplyr

Sometimes we would like to subset, or filter, a data frame based on the contents of another data frame. In this kind of task we aren’t combining two data frames, so it’s not a mutating join. We are filtering, so these are called filtering joins.

semi_join(pollutants, monitors, by = "monitorid")
##   pollutant duration monitorid
## 1     ozone       1h         1
## 2       so2       1h         1
## 3     ozone       8h         2
anti_join(monitors, pollutants, by = "monitorid")
##   monitorid      lat      long
## 1         3 39.11054 -90.32408

A note on keys

The dplyr merging functions will merge two data frames on multiple columns with differing names. Below we create two data frames for illustration.

sites <- read.table(header=TRUE, text='
  state   site         lat       long                 
     IL      1   42.467573  -87.810047     
     IL      2   42.049148  -88.273029
     IN      1   41.68140   -87.49473
                       ')

parameters <- read.table(header=TRUE, text='
  region   monitor   parameter   duration                     
      IL         1       ozone         1h
      IN         1         so2         1h            
      IL         2       ozone         8h            
      IL         2         no2         1h            
                         ')

These two data frames can only be merged if we join on two keys, the column with state information and the column with site information. Those two columns have different names in the two data frames. We can do a full join of these two data frames if we supply the names of the keys in the by parameter.

full_join(sites, parameters, by = c("state" = "region", "site" = "monitor"))
##   state site      lat      long parameter duration
## 1    IL    1 42.46757 -87.81005     ozone       1h
## 2    IL    2 42.04915 -88.27303     ozone       8h
## 3    IL    2 42.04915 -88.27303       no2       1h
## 4    IN    1 41.68140 -87.49473       so2       1h

Further reading.

For more information about database-style joins in dplyr, see the vignette here.