Data Wrangling: Two Tables (joins)

Illya Mowerman

Introduction to Joins

Joins are operations that combine two data frames based on a common variable. In this presentation, we’ll cover:

We’ll use dplyr for all examples.

Sample Data

Let’s create two sample data frames:

# Employees data
employees <- tibble(
  emp_id = c(1, 2, 3, 4, 5),
  name = c("John", "Jane", "Alice", "Bob", "Charlie"),
  department = c("HR", "IT", "Finance", "IT", "Marketing")
)

# Salaries data
salaries <- tibble(
  emp_id = c(1, 2, 3, 6, 7),
  salary = c(50000, 60000, 55000, 65000, 70000)
)

# Display the data
employees
## # A tibble: 5 × 3
##   emp_id name    department
##    <dbl> <chr>   <chr>     
## 1      1 John    HR        
## 2      2 Jane    IT        
## 3      3 Alice   Finance   
## 4      4 Bob     IT        
## 5      5 Charlie Marketing
salaries
## # A tibble: 5 × 2
##   emp_id salary
##    <dbl>  <dbl>
## 1      1  50000
## 2      2  60000
## 3      3  55000
## 4      6  65000
## 5      7  70000

Inner Join

Inner join returns all rows from both tables where there is a match.

inner_join(employees, salaries, by = "emp_id")
## # A tibble: 3 × 4
##   emp_id name  department salary
##    <dbl> <chr> <chr>       <dbl>
## 1      1 John  HR          50000
## 2      2 Jane  IT          60000
## 3      3 Alice Finance     55000

Note: Only employees with matching emp_id in both tables are included.

Left Join

Left join returns all rows from the left table and matching rows from the right table.

left_join(employees, salaries, by = "emp_id")
## # A tibble: 5 × 4
##   emp_id name    department salary
##    <dbl> <chr>   <chr>       <dbl>
## 1      1 John    HR          50000
## 2      2 Jane    IT          60000
## 3      3 Alice   Finance     55000
## 4      4 Bob     IT             NA
## 5      5 Charlie Marketing      NA

Note: All employees are included, even those without salary information.

Right Join

Right join returns all rows from the right table and matching rows from the left table.

right_join(employees, salaries, by = "emp_id")
## # A tibble: 5 × 4
##   emp_id name  department salary
##    <dbl> <chr> <chr>       <dbl>
## 1      1 John  HR          50000
## 2      2 Jane  IT          60000
## 3      3 Alice Finance     55000
## 4      6 <NA>  <NA>        65000
## 5      7 <NA>  <NA>        70000

Note: All salary entries are included, even for employees not in the employees table.

Full (Outer) Join

Full join returns all rows when there is a match in either left or right table.

full_join(employees, salaries, by = "emp_id")
## # A tibble: 7 × 4
##   emp_id name    department salary
##    <dbl> <chr>   <chr>       <dbl>
## 1      1 John    HR          50000
## 2      2 Jane    IT          60000
## 3      3 Alice   Finance     55000
## 4      4 Bob     IT             NA
## 5      5 Charlie Marketing      NA
## 6      6 <NA>    <NA>        65000
## 7      7 <NA>    <NA>        70000

Note: All rows from both tables are included, with NA where there’s no match.

Semi Join

Semi join returns all rows from the left table where there is a match in the right table.

semi_join(employees, salaries, by = "emp_id")
## # A tibble: 3 × 3
##   emp_id name  department
##    <dbl> <chr> <chr>     
## 1      1 John  HR        
## 2      2 Jane  IT        
## 3      3 Alice Finance

Note: Only employees with salary information are returned, but salary data isn’t included.

Anti Join

Anti join returns all rows from the left table where there is no match in the right table.

anti_join(employees, salaries, by = "emp_id")
## # A tibble: 2 × 3
##   emp_id name    department
##    <dbl> <chr>   <chr>     
## 1      4 Bob     IT        
## 2      5 Charlie Marketing

Note: Only employees without salary information are returned.

Conclusion

Understanding these joins helps in efficient data manipulation and analysis in R.