Datasets
# Dataset 1: Customers
customers <- tibble(
customer_id = c(1, 2, 3, 4, 5),
name = c("Alice", "Bob", "Charlie", "David", "Eve"),
city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
# Dataset 2: Orders
orders <- tibble(
order_id = c(101, 102, 103, 104, 105, 106),
customer_id = c(1, 2, 3, 2, 6, 7),
product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
amount = c(1200, 800, 300, 1500, 600, 150)
)
1. Inner Join
innerjoin <- inner_join(customers, orders, by = "customer_id")
innerjoin
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
A. There are 4 rows because only matching customer_ids are
included.
B. Customers 4 and 5 are excluded because they have no orders.
Orders 6 and 7 are excluded because they are not in the customer
table.
2. Left Join
leftjoin <- left_join(customers, orders, by = "customer_id")
leftjoin
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
A. There are 6 rows in the result.
B. A left join keeps all the rows from the customers table, even if
there are no matching row in the orders table.
Customers 4 and 5 appear with NA values because they have not placed
any orders.
This differs from the inner join because it only keeps rows where
there is a match in both tables.
3. Right Join
rightjoin <- right_join(customers, orders, by = "customer_id")
rightjoin
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
A. There are 6 rows in the result.
B. customer_id 6 & 7 have NA for name and city because those
customer_ids appear in the orders table but do not exist in the
customers table.
4. Full Join
fulljoin <- full_join(customers, orders, by = "customer_id")
fulljoin
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
A. There are 8 rows in the result.
B. Customers 4 & 5 have NA for order columns because they have
no orders
Orders with customer_id 6 & 7 have NA for customer name/city
because those customers are not in the customer dataset.
5. Semi Join
semijoin <- semi_join(customers, orders, by = "customer_id")
semijoin
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
A. There are 3 rows in the result.
B. A semi join returns only the matching rows from the left table
(customers) and does not include any columns from orders.
It also does not duplicate customer rows when a customer has
multiple orders (so customer 2 will only appear once here).
6. Anti Join
antijoin <- anti_join(customers, orders, by = "customer_id")
antijoin
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
A. The customers in the result are David and Eve
B. These customers are in the customers table but have no matching
orders, meaning they haven’t placed any orders in the orders
dataset.
Practical Application
A. I would use a left join because it keeps every customer and fills
in NA for customers with no orders.
all_customers <- left_join(customers, orders, by = "customer_id")
all_customers
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
B. I would use a semi join because it returns only customers who
have at least one order and avoids duplicates and extra order
columns.
customers_with_orders <- semi_join(customers, orders, by = "customer_id")
customers_with_orders
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
Challenge Question
customer_summary <- customers %>%
left_join(orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarise(
total_orders = sum(!is.na(order_id)),
total_amount_spent = sum(amount, na.rm = TRUE),
.groups = "drop"
)
customer_summary
## # A tibble: 5 × 5
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 0 0
## 5 5 Eve Phoenix 0 0