Merging in R

This tutorial will cover how to merge data frames using dplyr. We will cover the following topics:

Mutating joins
Filtering joins
A note on keys

Mutating joins

dplyr contains joining functions that enable you to combine two data frames into one. These functions mimic database joins. If you aren’t familiar with database joins, this site has a good overview.

We’ll use two simple data frames to demonstrate how joins work.

monitors <- read.table(header=TRUE, text='
  monitorid        lat       long                 
          1  42.467573  -87.810047     
          2  42.049148  -88.273029
          3  39.110539  -90.324080
                       ')

pollutants <- read.table(header=TRUE, text='
  pollutant   duration    monitorid                 
      ozone         1h            1
        so2         1h            1
      ozone         8h            2
        no2         1h            4
                         ')

To merge two data frames, there must be one or more columns that are in both data frames. If we look at the monitors data frame and the pollutants data frame, we can see that they both have a “monitorid” column. That’s the column we would merge on, and in this tutorial we will refer to it as the “key”.

Below is a list of merging functions in dplyr that combine two data frames into one, also referred to as mutating joins.

inner_join() This function combines the columns from both data frames, but only keeps rows where the value in the key column matches in both data frames. Below we do an inner join on our two data frames.

library(dplyr)

inner_join(pollutants, monitors, by = "monitorid")

##   pollutant duration monitorid      lat      long
## 1     ozone       1h         1 42.46757 -87.81005
## 2       so2       1h         1 42.46757 -87.81005
## 3     ozone       8h         2 42.04915 -88.27303

Note that there is only one record for monitor 1 in the monitors data frame, but the information for that monitor is repeated in the output because there is more than one monitor id with the value 1 in the pollutants data frame.

left_join() If we wanted to keep all of the pollutant information, even if there wasn’t a corresponding record in the monitors data frame, then we could use the left_join() function.

left_join(pollutants, monitors, by = "monitorid")

##   pollutant duration monitorid      lat      long
## 1     ozone       1h         1 42.46757 -87.81005
## 2       so2       1h         1 42.46757 -87.81005
## 3     ozone       8h         2 42.04915 -88.27303
## 4       no2       1h         4       NA        NA

full_join() We can also keep all records from both data frames using the full_join() function.

full_join(pollutants, monitors, by = "monitorid")

##   pollutant duration monitorid      lat      long
## 1     ozone       1h         1 42.46757 -87.81005
## 2       so2       1h         1 42.46757 -87.81005
## 3     ozone       8h         2 42.04915 -88.27303
## 4       no2       1h         4       NA        NA
## 5      <NA>     <NA>         3 39.11054 -90.32408

Filtering joins with `dplyr`

Sometimes we would like to subset, or filter, a data frame based on the contents of another data frame. In this kind of task we aren’t combining two data frames, so it’s not a mutating join. We are filtering, so these are called filtering joins.

semi_join() Suppose we wanted to keep the pollutants data frame in the same format, but only keep the records that also have a match in the monitors data frame. We can use the semi_join() function to filter the pollutants data frame.

semi_join(pollutants, monitors, by = "monitorid")

##   pollutant duration monitorid
## 1     ozone       1h         1
## 2       so2       1h         1
## 3     ozone       8h         2

anti_join() We may also want to know what records are in a data frame that do not match the records in another data frame. Suppose we want to know which monitors from the monitors data frame that are not also in the pollutants data frame. We can use the anti_join() function.

anti_join(monitors, pollutants, by = "monitorid")

##   monitorid      lat      long
## 1         3 39.11054 -90.32408

A note on keys

The dplyr merging functions will merge two data frames on multiple columns with differing names. Below we create two data frames for illustration.

sites <- read.table(header=TRUE, text='
  state   site         lat       long                 
     IL      1   42.467573  -87.810047     
     IL      2   42.049148  -88.273029
     IN      1   41.68140   -87.49473
                       ')

parameters <- read.table(header=TRUE, text='
  region   monitor   parameter   duration                     
      IL         1       ozone         1h
      IN         1         so2         1h            
      IL         2       ozone         8h            
      IL         2         no2         1h            
                         ')

These two data frames can only be merged if we join on two keys, the column with state information and the column with site information. Those two columns have different names in the two data frames. We can do a full join of these two data frames if we supply the names of the keys in the by parameter.

full_join(sites, parameters, by = c("state" = "region", "site" = "monitor"))

##   state site      lat      long parameter duration
## 1    IL    1 42.46757 -87.81005     ozone       1h
## 2    IL    2 42.04915 -88.27303     ozone       8h
## 3    IL    2 42.04915 -88.27303       no2       1h
## 4    IN    1 41.68140 -87.49473       so2       1h

Merging in R

Mutating joins

Filtering joins with dplyr

A note on keys

Further reading.

Filtering joins with `dplyr`