We will use the dplyr package, which contains small
relational datasets for use in the examples. If you do not already have
the dplyr package installed, you will first install the
package using the install.packages() function.
install.packages("dplyr")
Once installed, we load the package for use in the R session using
the library() function.
library(dplyr)
In the lesson that follows we will use two data sets in the dpylr package to demonstrate joins.
The first table, band_members contains the names
(name) of some members of the Beatles and Rolling Stones
and the band they belonged to. When integrating the data,
this data will be the left (x) table.
band_members
## name band
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
The second table, band_instruments contains the names
(name) of some members of the Beatles and Rolling Stones
and the instrument (plays) that they played. When
integrating the data, this data will be the right (y) table.
band_instruments
## name plays
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
As shown above, the two tables both have the name
variable. We will use the name variable to merge the two
tables.
Data integration is a key data consolidation step to combining two or more data sets using database operations. Methods to merge data using joins include: Inner, Left Outer, Right Outer, and Full Outer.
To demonstrate joins we will use the merge() function in
the base package. Note: the merge()
function automatically sorts the resulting table based on the column
indicated in the by argument.
Inner joins select records that have matching values
in the left and right table. Any rows in which the name
value is not in both the left and right table will be discarded from the
resulting table.
merge(x = band_members, # left table
y = band_instruments, # right table
by = "name") # column used for merging
## name band plays
## 1 John Beatles guitar
## 2 Paul Beatles bass
The result of the inner join is a table with 2 observations, 3 variables, and 0 missing values. As shown above, the resulting table includes John and Paul, since John and Paul are the only two names present in both the left and right tables.
Left Outer joins preserve all unmatched rows from
the left table while discarding unmatched rows from the right table. Any
names in the left table that are unmatched in the right table will
display <NA> in the plays column.
merge(x = band_members, # left table
y = band_instruments, #right table
by = "name", # column used for merging
all.x = TRUE) # include all in left table and matches in right table only
## name band plays
## 1 John Beatles guitar
## 2 Mick Stones <NA>
## 3 Paul Beatles bass
The result of the left outer join is a table with 3 observations, 3
variables, and 1 missing value. As shown above, the resulting table
contains all observations in the band_members table and
only the values for the play variable in the
band_instruments table where the names match. Since Mick is
not in the band_instruments table,
an<NA> is present in the plays
column.
Right Outer joins preserves all unmatched rows from
the right table while discarding unmatched rows from the left table. Any
names in the right table that are unmatched in the left table will
display <NA> in the band column.
merge(x = band_members, # left table
y = band_instruments, # right table
by = "name", # column used for merging
all.y = TRUE) # include all in right table and matches in left table only
## name band plays
## 1 John Beatles guitar
## 2 Keith <NA> guitar
## 3 Paul Beatles bass
The result of the right outer join is a table with 3 observations, 3
variables, and 1 missing value. As shown above, the resulting table
contains all observations in the band_instruments table and
only the values for the band variable in the
band_members table where the names match. Since Keith is
not in the band_members table, an<NA> is
present in the band column.
Full Outer joins preserve all rows from the tables
and each unique name in the left and right tables will be represented in
the output. Any unmatched names will display <NA> in
the relevant columns (plays and band).
merge(x = band_members, # left table
y = band_instruments, # right table
by = "name", # column used for merging
all = TRUE) # include all from left and right table
## name band plays
## 1 John Beatles guitar
## 2 Keith <NA> guitar
## 3 Mick Stones <NA>
## 4 Paul Beatles bass
The result of the full outer join is a table with 4 observations, 3
variables, and 2 missing values. As shown above, all 4 people
represented in the relational data (John, Keith, Mick and Paul) are
represented as observations in the resulting table. Since John and Paul
are represented in both band_members and
band_instruments, they have values for both
band and plays. Since Mick is in the left
table but not the right, he has an <NA> for
band. Since Keith is in the right table but not the left,
he has an <NA> for plays.