R Markdown

library (dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# Dataset 1: Customers
customers <- tibble(
  customer_id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
# Dataset 2: Orders
orders <- tibble(
  order_id = c(101, 102, 103, 104, 105, 106),
  customer_id = c(1, 2, 3, 2, 6, 7),
  product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
  amount = c(1200, 800, 300, 1500, 600, 150)
)

Inner Join

q1 <- customers %>%
  inner_join(orders, by = "customer_id")
print(q1)
## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
nrow(q1)
## [1] 4
"Customers without orders like David and Eve and orders without matching customers like customer_id 6 and 7 are excluded.
"
## [1] "Customers without orders like David and Eve and orders without matching customers like customer_id 6 and 7 are excluded.\n"

Left Join

q2 <- customers %>%
left_join(orders, by = "customer_id")
print(q2)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
nrow(q2)
## [1] 6
"This differs from the inner join because the left join includes all customers, even if they don’t have orders"
## [1] "This differs from the inner join because the left join includes all customers, even if they don’t have orders"

Right Join

q3 <- customers %>%
right_join(orders, by = "customer_id")
print(q3)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           6 <NA>    <NA>             105 Camera     600
## 6           7 <NA>    <NA>             106 Printer    150
nrow(q3)
## [1] 6
"Customers with customer_id 6 and 7 have NA values for name and city because they don't exist in the customers table."
## [1] "Customers with customer_id 6 and 7 have NA values for name and city because they don't exist in the customers table."

Full Join

q4 <- customers %>%
  full_join(orders, by = "customer_id")  
print(q4)
## # A tibble: 8 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
## 7           6 <NA>    <NA>             105 Camera     600
## 8           7 <NA>    <NA>             106 Printer    150
nrow(q4)
## [1] 8
"Rows with missing data from either table will have NA values where there's no match."
## [1] "Rows with missing data from either table will have NA values where there's no match."

Semi Join

q5 <- customers %>%
  semi_join(orders, by = "customer_id")
  print (q5)
## # A tibble: 3 × 3
##   customer_id name    city       
##         <dbl> <chr>   <chr>      
## 1           1 Alice   New York   
## 2           2 Bob     Los Angeles
## 3           3 Charlie Chicago
  nrow(q5)
## [1] 3
  "Unlike the inner join, the semi join returns only rows from customers, not the actual joined data."
## [1] "Unlike the inner join, the semi join returns only rows from customers, not the actual joined data."

Anti Join

q6 <- customers %>%
  anti_join(orders, by = "customer_id")
print(q6)
## # A tibble: 2 × 3
##   customer_id name  city   
##         <dbl> <chr> <chr>  
## 1           4 David Houston
## 2           5 Eve   Phoenix
"The result contains customers who haven’t placed any orders, as the anti join returns only rows from the customers table that don’t have a match in the orders table."
## [1] "The result contains customers who haven’t placed any orders, as the anti join returns only rows from the customers table that don’t have a match in the orders table."

Practical Application

q7a <- customers %>%
  left_join(orders, by = "customer_id")
print(q7a)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
"A left join ensures that all customers are included, even if they haven’t placed any orders."
## [1] "A left join ensures that all customers are included, even if they haven’t placed any orders."
q7b <- customers %>%
  inner_join(orders, by = "customer_id")
print(q7b)
## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
"An inner join includes only customers who have placed at least one order, filtering out customers with no orders."
## [1] "An inner join includes only customers who have placed at least one order, filtering out customers with no orders."

Extra Question

q8 <- customers %>%
  left_join(orders, by = "customer_id") %>%
  group_by(customer_id, name, city) %>%
  summarize(
    total_orders = n(),
    total_amount_spent = sum(amount, na.rm = TRUE)
  ) %>%
  ungroup()
## `summarise()` has grouped output by 'customer_id', 'name'. You can override
## using the `.groups` argument.
print(q8)
## # A tibble: 5 × 5
##   customer_id name    city        total_orders total_amount_spent
##         <dbl> <chr>   <chr>              <int>              <dbl>
## 1           1 Alice   New York               1               1200
## 2           2 Bob     Los Angeles            2               2300
## 3           3 Charlie Chicago                1                300
## 4           4 David   Houston                1                  0
## 5           5 Eve     Phoenix                1                  0
"This code uses a left join to include all customers and aggregates the number of orders and the total amount spent by each customer, even if they haven’t placed any orders (NA values are handled with na.rm = TRUE)."
## [1] "This code uses a left join to include all customers and aggregates the number of orders and the total amount spent by each customer, even if they haven’t placed any orders (NA values are handled with na.rm = TRUE)."