Preliminary

We will use the dplyr package, which contains small relational datasets for use in the examples. If you do not already have the dplyr package installed, you will first install the package using the install.packages() function.

install.packages("dplyr")

Once installed, we load the package for use in the R session using the library() function.

library(dplyr)

In the lesson that follows we will use two data sets in the dpylr package to demonstrate joins.


Relational Data

The first table, band_members contains the names (name) of some members of the Beatles and Rolling Stones and the band they belonged to. When integrating the data, this data will be the left (x) table.

band_members
##   name    band
## 1 Mick  Stones
## 2 John Beatles
## 3 Paul Beatles

The second table, band_instruments contains the names (name) of some members of the Beatles and Rolling Stones and the instrument (plays) that they played. When integrating the data, this data will be the right (y) table.

band_instruments
##    name  plays
## 1  John guitar
## 2  Paul   bass
## 3 Keith guitar

As shown above, the two tables both have the name variable. We will use the name variable to merge the two tables.


Merging using Joins

Data integration is a key data consolidation step to combining two or more data sets using database operations. Methods to merge data using joins include: Inner, Left Outer, Right Outer, and Full Outer.

To demonstrate joins we will use the merge() function in the base package. Note: the merge() function automatically sorts the resulting table based on the column indicated in the by argument.


Inner Join

Inner joins select records that have matching values in the left and right table. Any rows in which the name value is not in both the left and right table will be discarded from the resulting table.

merge(x = band_members, # left table
      y = band_instruments, # right table
      by = "name") # column used for merging
##   name    band  plays
## 1 John Beatles guitar
## 2 Paul Beatles   bass

The result of the inner join is a table with 2 observations, 3 variables, and 0 missing values. As shown above, the resulting table includes John and Paul, since John and Paul are the only two names present in both the left and right tables.


Left Outer Join

Left Outer joins preserve all unmatched rows from the left table while discarding unmatched rows from the right table. Any names in the left table that are unmatched in the right table will display <NA> in the plays column.

merge(x = band_members, # left table
      y = band_instruments, #right table
      by = "name", # column used for merging
      all.x = TRUE) # include all in left table and matches in right table only
##   name    band  plays
## 1 John Beatles guitar
## 2 Mick  Stones   <NA>
## 3 Paul Beatles   bass

The result of the left outer join is a table with 3 observations, 3 variables, and 1 missing value. As shown above, the resulting table contains all observations in the band_members table and only the values for the play variable in the band_instruments table where the names match. Since Mick is not in the band_instruments table, an<NA> is present in the plays column.


Right Outer Join

Right Outer joins preserves all unmatched rows from the right table while discarding unmatched rows from the left table. Any names in the right table that are unmatched in the left table will display <NA> in the band column.

merge(x = band_members, # left table
      y = band_instruments, # right table
      by = "name", # column used for merging
      all.y = TRUE) # include all in right table and matches in left table only
##    name    band  plays
## 1  John Beatles guitar
## 2 Keith    <NA> guitar
## 3  Paul Beatles   bass

The result of the right outer join is a table with 3 observations, 3 variables, and 1 missing value. As shown above, the resulting table contains all observations in the band_instruments table and only the values for the band variable in the band_members table where the names match. Since Keith is not in the band_members table, an<NA> is present in the band column.


Full Outer Join

Full Outer joins preserve all rows from the tables and each unique name in the left and right tables will be represented in the output. Any unmatched names will display <NA> in the relevant columns (plays and band).

merge(x = band_members, # left table
      y = band_instruments, # right table
      by = "name", # column used for merging
      all = TRUE) # include all from left and right table
##    name    band  plays
## 1  John Beatles guitar
## 2 Keith    <NA> guitar
## 3  Mick  Stones   <NA>
## 4  Paul Beatles   bass

The result of the full outer join is a table with 4 observations, 3 variables, and 2 missing values. As shown above, all 4 people represented in the relational data (John, Keith, Mick and Paul) are represented as observations in the resulting table. Since John and Paul are represented in both band_members and band_instruments, they have values for both band and plays. Since Mick is in the left table but not the right, he has an <NA> for band. Since Keith is in the right table but not the left, he has an <NA> for plays.