This tutorial will cover how to merge data frames using dplyr. We will cover the following topics:
dplyr contains joining functions that enable you to combine two data frames into one. These functions mimic database joins. If you aren’t familiar with database joins, this site has a good overview.
We’ll use two simple data frames to demonstrate how joins work.
monitors <- read.table(header=TRUE, text='
monitorid lat long
1 42.467573 -87.810047
2 42.049148 -88.273029
3 39.110539 -90.324080
')
pollutants <- read.table(header=TRUE, text='
pollutant duration monitorid
ozone 1h 1
so2 1h 1
ozone 8h 2
no2 1h 4
')
To merge two data frames, there must be one or more columns that are in both data frames. If we look at the monitors data frame and the pollutants data frame, we can see that they both have a “monitorid” column. That’s the column we would merge on, and in this tutorial we will refer to it as the “key”.
Below is a list of merging functions in dplyr that combine two data frames into one, also referred to as mutating joins.
inner_join() This function combines the columns from both data frames, but only keeps rows where the value in the key column matches in both data frames. Below we do an inner join on our two data frames.library(dplyr)
inner_join(pollutants, monitors, by = "monitorid")
## pollutant duration monitorid lat long
## 1 ozone 1h 1 42.46757 -87.81005
## 2 so2 1h 1 42.46757 -87.81005
## 3 ozone 8h 2 42.04915 -88.27303
Note that there is only one record for monitor 1 in the monitors data frame, but the information for that monitor is repeated in the output because there is more than one monitor id with the value 1 in the pollutants data frame.
left_join() If we wanted to keep all of the pollutant information, even if there wasn’t a corresponding record in the monitors data frame, then we could use the left_join() function.left_join(pollutants, monitors, by = "monitorid")
## pollutant duration monitorid lat long
## 1 ozone 1h 1 42.46757 -87.81005
## 2 so2 1h 1 42.46757 -87.81005
## 3 ozone 8h 2 42.04915 -88.27303
## 4 no2 1h 4 NA NA
full_join() We can also keep all records from both data frames using the full_join() function.full_join(pollutants, monitors, by = "monitorid")
## pollutant duration monitorid lat long
## 1 ozone 1h 1 42.46757 -87.81005
## 2 so2 1h 1 42.46757 -87.81005
## 3 ozone 8h 2 42.04915 -88.27303
## 4 no2 1h 4 NA NA
## 5 <NA> <NA> 3 39.11054 -90.32408
dplyrSometimes we would like to subset, or filter, a data frame based on the contents of another data frame. In this kind of task we aren’t combining two data frames, so it’s not a mutating join. We are filtering, so these are called filtering joins.
semi_join() Suppose we wanted to keep the pollutants data frame in the same format, but only keep the records that also have a match in the monitors data frame. We can use the semi_join() function to filter the pollutants data frame.semi_join(pollutants, monitors, by = "monitorid")
## pollutant duration monitorid
## 1 ozone 1h 1
## 2 so2 1h 1
## 3 ozone 8h 2
anti_join() We may also want to know what records are in a data frame that do not match the records in another data frame. Suppose we want to know which monitors from the monitors data frame that are not also in the pollutants data frame. We can use the anti_join() function.anti_join(monitors, pollutants, by = "monitorid")
## monitorid lat long
## 1 3 39.11054 -90.32408
The dplyr merging functions will merge two data frames on multiple columns with differing names. Below we create two data frames for illustration.
sites <- read.table(header=TRUE, text='
state site lat long
IL 1 42.467573 -87.810047
IL 2 42.049148 -88.273029
IN 1 41.68140 -87.49473
')
parameters <- read.table(header=TRUE, text='
region monitor parameter duration
IL 1 ozone 1h
IN 1 so2 1h
IL 2 ozone 8h
IL 2 no2 1h
')
These two data frames can only be merged if we join on two keys, the column with state information and the column with site information. Those two columns have different names in the two data frames. We can do a full join of these two data frames if we supply the names of the keys in the by parameter.
full_join(sites, parameters, by = c("state" = "region", "site" = "monitor"))
## state site lat long parameter duration
## 1 IL 1 42.46757 -87.81005 ozone 1h
## 2 IL 2 42.04915 -88.27303 ozone 8h
## 3 IL 2 42.04915 -88.27303 no2 1h
## 4 IN 1 41.68140 -87.49473 so2 1h
For more information about database-style joins in dplyr, see the vignette here.