q1 <- inner_join(customers, orders)
## Joining with `by = join_by(customer_id)`
There are 4.
Because the customers and orders that are not included did not have a match in the other table.
head(q1)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
q2 <- left_join(customers, orders)
## Joining with `by = join_by(customer_id)`
There are 6 rows in the result.
A left join keeps all the rows from the customers table, even if there are no matching row in the orders table. Customers 4 and 5 appear with NA values because they have not placed any orders. This differs from the inner join because it only keeps rows where there is a match in both tables.
head(q2)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q3 <- right_join(customers, orders)
## Joining with `by = join_by(customer_id)`
There are 6 rows in the results
customer_id 6 & 7 have NA for name and city because those customer_ids appear in the orders table but do not exist in the customers table.
head(q3)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
q4 <- full_join(customers, orders)
## Joining with `by = join_by(customer_id)`
There are 8 rows in the result.
Customers 4 & 5 have NA for order columns because they have no orders. Orders with customer_id 6 & 7 have NA for customer name/city because those customers are not in the customer dataset.
head(q4)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q5 <- semi_join(customers, orders)
## Joining with `by = join_by(customer_id)`
There are 3 rows in the result.
A semi join returns only the matching rows from the left table (customers) and does not include any columns from orders. It also does not duplicate customer rows when a customer has multiple orders (so customer 2 will only appear once here).
head(q5)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
q6 <- anti_join(customers, orders)
## Joining with `by = join_by(customer_id)`
# The customers in the result are David and Eve
# These customers are in the customers table but have no matching orders, meaning they haven’t placed any orders in the orders data set.
head(q6)
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
I would use a left join because it keeps every customer and fills in NA for customers with no orders.
I would use a semi join because it returns only customers who have at least one order and avoids duplicates and extra order columns.
q7b <- semi_join(customers, orders)
## Joining with `by = join_by(customer_id)`
head(q7b)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
q8 <- customers %>%
left_join(orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarise(
total_orders = sum(!is.na(order_id)),
total_amount_spent = sum(amount, na.rm = TRUE),
.groups = "drop"
)
head(q8)
## # A tibble: 5 × 5
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 0 0
## 5 5 Eve Phoenix 0 0