Tutorial on how to perform the different types

of SQL-JOINs in R using data.table

 

Synopsis

 

Summary

JOIN type DT syntax data.table::merge() syntax
INNER X[Y, nomatch=0] merge(X, Y, all=FALSE)
LEFT OUTER Y[X] merge(X, Y, all.x=TRUE)
RIGHT OUTER X[Y] merge(X, Y, all.y=TRUE)
FULL OUTER - merge(X, Y, all=TRUE)
FULL OUTER WHERE NULL (NOT INNER) - merge(X, Y, all=TRUE), subset NA
CROSS (cartesian) - - (see below)

 

Example Data

The example data consists of the following two tables:

 

table Employees
Employee EmployeeName Department Salary
1 Alice 11 800
2 Bob 11 600
3 Carla 12 900
4 Daniel 12 1000
5 Evelyn 13 800
6 Ferdinand 21 700

 

table Departments
Department DepartmentName Manager
11 Production 1
12 Sales 4
13 Marketing 5
14 Research NA

 

Inner Join

Figure linked from Pinal Dave
Figure1 from Pinal Dave

the INNER JOIN returns the rows with a match in both tables

 

# set the ON clause as keys of the tables:
setkey(Employees,Department)
setkey(Departments,Department)

# perform the join, eliminating not matched rows from Right
Result <- Employees[Departments, nomatch=0]
 
Inner Join
Employee EmployeeName Department Salary DepartmentName Manager
1 Alice 11 800 Production 1
2 Bob 11 600 Production 1
3 Carla 12 900 Sales 4
4 Daniel 12 1000 Sales 4
5 Evelyn 13 800 Marketing 5

 
note:

Employee #6, Ferdinand, has not been returned, as he’s in a yet to be created department 21.
Neither has department #14, Research, as there are no employees.

 

Left Outer Join

Figure linked from Pinal Dave
Figure2 from Pinal Dave

the LEFT OUTER JOIN returns all the rows from the left table, filling in matched columns (or NA) from the right table

 

# set the ON clause as keys of the tables:
setkey(Employees,Department)
setkey(Departments,Department)

# perform the join using the merge function
Result <- merge(Employees,Departments, all.x=TRUE)
 
Left Outer Join
Department Employee EmployeeName Salary DepartmentName Manager
11 1 Alice 800 Production 1
11 2 Bob 600 Production 1
12 3 Carla 900 Sales 4
12 4 Daniel 1000 Sales 4
13 5 Evelyn 800 Marketing 5
21 6 Ferdinand 700 NA NA

 
note:

Employee #6, Ferdinand, has been returned with department name as NA.
Department #14, Research, has not been returned.
If the column order Left → Right has to be preserved, we need to elaborate:

 

# get the columns of the tables:
leftCols <- colnames(Employees)
rightCols <- colnames(Departments)
# remove the match key of the Right table
rightCols <- setdiff(rightCols,key(Departments))
# set the column order
setcolorder(Result,c(leftCols,rightCols))
 
Left Outer Join, preserved column order
Employee EmployeeName Department Salary DepartmentName Manager
1 Alice 11 800 Production 1
2 Bob 11 600 Production 1
3 Carla 12 900 Sales 4
4 Daniel 12 1000 Sales 4
5 Evelyn 13 800 Marketing 5
6 Ferdinand 21 700 NA NA

 

A typical use case is to match in labels, e.g. in our Employees table

substitute the department number by its name:

 

# set the ON clause as keys of the tables:
setkey(Employees,Department)
setkey(Departments,Department)

# defining the Result columns, substitute Department by DepartmentName
leftCols <- colnames(Employees)
leftCols <- sub("Department","DepartmentName",leftCols)

# perform the join, inverting the tables, return defined columns
Result <- Departments[Employees][, leftCols, with=FALSE]

# -- or --
# Result <- merge(Employees, Departments, all.x=TRUE)
# Result <- Result[, setCols, with=FALSE]
 
Left Outer Join - matching in DepartmentName
Employee EmployeeName DepartmentName Salary
1 Alice Production 800
2 Bob Production 600
3 Carla Sales 900
4 Daniel Sales 1000
5 Evelyn Marketing 800
6 Ferdinand NA 700

 

Right Outer Join

Figure linked from Pinal Dave
Figure2 from Pinal Dave

the RIGHT OUTER JOIN returns all the rows from the right table, filling in matched columns (or NA) from the left table

 

# set the ON clause as keys of the tables:
setkey(Employees,Department)
setkey(Departments,Department)

# perform the join - this is the basic join for data.table
Result <- Employees[Departments]
# this corresponds to
# Result <- merge(Employees,Departments, all.y=TRUE)
 
Right Outer Join
Employee EmployeeName Department Salary DepartmentName Manager
1 Alice 11 800 Production 1
2 Bob 11 600 Production 1
3 Carla 12 900 Sales 4
4 Daniel 12 1000 Sales 4
5 Evelyn 13 800 Marketing 5
NA NA 14 NA Research NA

 

Full Outer Join

Figure linked from Pinal Dave
Figure2 from Pinal Dave

the FULL OUTER JOIN returns all the rows from both tables, filling in matched columns (or NA)

 

# set the ON clause as keys of the tables:
setkey(Employees,Department)
setkey(Departments,Department)

# perform the join
Result <- merge(Employees,Departments, all=TRUE)
 
Full Outer Join
Department Employee EmployeeName Salary DepartmentName Manager
11 1 Alice 800 Production 1
11 2 Bob 600 Production 1
12 3 Carla 900 Sales 4
12 4 Daniel 1000 Sales 4
13 5 Evelyn 800 Marketing 5
14 NA NA NA Research NA
21 6 Ferdinand 700 NA NA

 

Full Outer Join Where NULL - a.k.a “NOT INNER join”

Figure linked from Pinal Dave
Figure2 from Pinal Dave

the NOT INNER JOIN returns all the rows from both tables, where no match was obtained

 

# set the ON clause as keys of the tables:
setkey(Employees,Department)
setkey(Departments,Department)

# perform the join, retain only NA from matched cols on both side
Result <- merge(Employees,Departments, all=TRUE)
Result <- Result[is.na(EmployeeName) | is.na(DepartmentName)]
 
Full Outer Join Where Null
Department Employee EmployeeName Salary DepartmentName Manager
14 NA NA NA Research NA
21 6 Ferdinand 700 NA NA

 

Cross Join

Figure linked from Pinal Dave