q1 <- inner_join(customers, orders)
## Joining with `by = join_by(customer_id)`
There are 4 rows in the result
Because the customers and orders that are not included did not have a match in the other table
head(q1)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
q2 <- left_join(customers, orders)
## Joining with `by = join_by(customer_id)`
There are 6 rows in the result.
A left join keeps all the rows from the customers table, even if there are no matching row in the orders table. Customers 4 and 5 appear with NA values because they have not placed any orders. This differs from the inner join because it only keeps rows where there is a match in both tables.
head(q2)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q3 <- right_join(customers, orders)
## Joining with `by = join_by(customer_id)`
There are 6 rows in the results
customer_id 6 & 7 have NA for name and city because those customer_ids appear in the orders table but do not exist in the customers table.
head(q3)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
q4 <- full_join(customers, orders)
## Joining with `by = join_by(customer_id)`
There are 8 rows in the result.
Customers 4 & 5 have NA for order columns because they have no orders. Orders with customer_id 6 & 7 have NA for customer name/city because those customers are not in the customer dataset.
head(q4)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q5 <- semi_join(customers, orders)
## Joining with `by = join_by(customer_id)`
There are 3 rows in the result.
A semi join returns only the matching rows from the left table (customers) and does not include any columns from orders. It also does not duplicate customer rows when a customer has multiple orders (so customer 2 will only appear once here).
head(q5)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
q6 <- anti_join(customers, orders)
## Joining with `by = join_by(customer_id)`
The customers in the result are David and Eve
These customers are in the customers table but have no matching orders, meaning they haven’t placed any orders in the orders data set.
head(q6)
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
I would use a left join because it keeps every customer and fills in NA for customers with no orders.
q7a <- left_join(customers, orders)
## Joining with `by = join_by(customer_id)`
head(q7a)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
I would use a semi join because it returns only customers who have at least one order and avoids duplicates and extra order columns.
q7b <- semi_join(customers, orders)
## Joining with `by = join_by(customer_id)`
head(q7b)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
q8 <- customers %>%
left_join(orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarise(
total_orders = sum(!is.na(order_id)),
total_amount_spent = sum(amount, na.rm = TRUE),
.groups = "drop"
)
head(q8)
## # A tibble: 5 × 5
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 0 0
## 5 5 Eve Phoenix 0 0