q1 <- inner_join(customers, orders, by = "customer_id")
There are 4 rows.
Inner join returns all rows from both tables where there is a match. Therefore the rows not returned did not have a match in the other table.
kable(q1)
| customer_id | name | city | order_id | product | amount |
|---|---|---|---|---|---|
| 1 | Alice | New York | 101 | Laptop | 1200 |
| 2 | Bob | Los Angeles | 102 | Phone | 800 |
| 2 | Bob | Los Angeles | 104 | Desktop | 1500 |
| 3 | Charlie | Chicago | 103 | Tablet | 300 |
q2 <- left_join(customers, orders, by = "customer_id")
There are 6 rows.
Left join keeps ALL customers (1..5). Those without orders (4, 5) appear once with NA order fields. Customers with multiple orders (customer_id = 2) appear multiple times (once per matching order).
kable(q2)
| customer_id | name | city | order_id | product | amount |
|---|---|---|---|---|---|
| 1 | Alice | New York | 101 | Laptop | 1200 |
| 2 | Bob | Los Angeles | 102 | Phone | 800 |
| 2 | Bob | Los Angeles | 104 | Desktop | 1500 |
| 3 | Charlie | Chicago | 103 | Tablet | 300 |
| 4 | David | Houston | NA | NA | NA |
| 5 | Eve | Phoenix | NA | NA | NA |
q3 <- right_join(customers, orders, by = "customer_id")
There are 6 rows.
q3 %>% filter(is.na(name) | is.na(city)) %>% select(customer_id, order_id, product, amount)
## # A tibble: 2 × 4
## customer_id order_id product amount
## <dbl> <dbl> <chr> <dbl>
## 1 6 105 Camera 600
## 2 7 106 Printer 150
A right join keeps all rows from the right table, filling in missing values for non-matching rows from the left. Since customers 6 and 7 do not exist in customers, their name and city show up as NULL.
kable(q3)
| customer_id | name | city | order_id | product | amount |
|---|---|---|---|---|---|
| 1 | Alice | New York | 101 | Laptop | 1200 |
| 2 | Bob | Los Angeles | 102 | Phone | 800 |
| 2 | Bob | Los Angeles | 104 | Desktop | 1500 |
| 3 | Charlie | Chicago | 103 | Tablet | 300 |
| 6 | NA | NA | 105 | Camera | 600 |
| 7 | NA | NA | 106 | Printer | 150 |
q4 <- full_join(customers, orders, by = "customer_id")
There are 8 rows.
only_one <- q4 %>%
filter(is.na(order_id) | is.na(name)) %>%
select(customer_id, name, city, order_id, product, amount)
kable(only_one)
| customer_id | name | city | order_id | product | amount |
|---|---|---|---|---|---|
| 4 | David | Houston | NA | NA | NA |
| 5 | Eve | Phoenix | NA | NA | NA |
| 6 | NA | NA | 105 | Camera | 600 |
| 7 | NA | NA | 106 | Printer | 150 |
Full join combines all rows from both tables. If a row doesn’t have a match in the other table, the unmatched columns are filled with NA.
kable(q4)
| customer_id | name | city | order_id | product | amount |
|---|---|---|---|---|---|
| 1 | Alice | New York | 101 | Laptop | 1200 |
| 2 | Bob | Los Angeles | 102 | Phone | 800 |
| 2 | Bob | Los Angeles | 104 | Desktop | 1500 |
| 3 | Charlie | Chicago | 103 | Tablet | 300 |
| 4 | David | Houston | NA | NA | NA |
| 5 | Eve | Phoenix | NA | NA | NA |
| 6 | NA | NA | 105 | Camera | 600 |
| 7 | NA | NA | 106 | Printer | 150 |
q5 <- semi_join(customers, orders, by = "customer_id")
There are 3 rows.
This result differs from the inner join because the semi join only returns the customers who placed orders, without repeating them for each order.
kable(q5)
| customer_id | name | city |
|---|---|---|
| 1 | Alice | New York |
| 2 | Bob | Los Angeles |
| 3 | Charlie | Chicago |
q6 <- anti_join(customers, orders, by = "customer_id")
David and Eve.
This result tells us which customers have never placed any orders, since they exist in the customers table but not in the orders table.
kable(q6)
| customer_id | name | city |
|---|---|---|
| 4 | David | Houston |
| 5 | Eve | Phoenix |
Use a left join, because it keeps every row from customers and fills in missing order info with NA for customers who haven’t ordered.
all_customers <- left_join(customers, orders, by = "customer_id")
kable(all_customers)
| customer_id | name | city | order_id | product | amount |
|---|---|---|---|---|---|
| 1 | Alice | New York | 101 | Laptop | 1200 |
| 2 | Bob | Los Angeles | 102 | Phone | 800 |
| 2 | Bob | Los Angeles | 104 | Desktop | 1500 |
| 3 | Charlie | Chicago | 103 | Tablet | 300 |
| 4 | David | Houston | NA | NA | NA |
| 5 | Eve | Phoenix | NA | NA | NA |
Use a semi join because it returns only customers rows that have a match in orders without duplicating customers per order.
customers_with_orders <- semi_join(customers, orders, by = "customer_id")
kable(customers_with_orders)
| customer_id | name | city |
|---|---|---|
| 1 | Alice | New York |
| 2 | Bob | Los Angeles |
| 3 | Charlie | Chicago |
challenge <- customers %>%
left_join(orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarise(
total_orders = n_distinct(order_id, na.rm = TRUE),
total_spent = sum(amount, na.rm = TRUE),
.groups = "drop"
)
kable(challenge)
| customer_id | name | city | total_orders | total_spent |
|---|---|---|---|---|
| 1 | Alice | New York | 1 | 1200 |
| 2 | Bob | Los Angeles | 2 | 2300 |
| 3 | Charlie | Chicago | 1 | 300 |
| 4 | David | Houston | 0 | 0 |
| 5 | Eve | Phoenix | 0 | 0 |