q1 <- inner_join(customers , orders , by = 'customer_id')
There are 4 rows
They did not have a match in the other table
q1
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
q2 <- left_join(customers , orders, by = 'customer_id')
There are 6 rows
The inner join had 4 rows because it only included customers who had matching orders. The left join includes all customers from the customers table, even if they do not have a matching order.
q2
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q3 <- right_join(customers, orders , by = 'customer_id')
There are 6 rows
Customer_id 6 and customer_id 7 results have NULL for customer name and city. These IDs exist in the orders table but do not exist in the customers table. Since a right join keeps all rows from the orders table but there are no matching customer records, the name and city fields are filled with NA.
q3
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
q4 <- full_join(customers , orders , by = 'customer_id')
There are 8 rows.
There is information from only customers in rows 5 and 6. Their order_id, product, and amount are NA. There is information from only orders in rows 7 and 8. Their name and city are NA. A full join keeps all rows from both tables but if there is no match in one table, NA appears in the missing fields.
q4
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
q5 <- semi_join(customers , orders, by = 'customer_id')
There are 3 rows
The inner join returns both matching rows from customers and corresponding columns from orders. The semi join returns only unique customers that have orders, without including order details. Duplicate customer entries are removed.
q5
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
q6 <- anti_join(customers , orders , by = 'customer_id')
David and Eve are the only customers in the result
These customers do not have any matching orders in the orders table meaning they have never placed an order but are registered in the system.
q6
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
You would use left join because it ensures all customers are included even if they have not placed any orders.
You would use inner join because it ensures we only include customers who have at least one order and customers without orders will be excluded.
all_customers <- left_join(customers, orders, by = "customer_id")
order_customers <- inner_join(customers, orders, by = "customer_id")
head(all_customers)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
head(order_customers)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
customer_orders <- left_join(customers, orders, by = "customer_id")
customer_summary <- customer_orders %>%
group_by(customer_id, name, city) %>%
summarize(
total_orders = n(), # Count number of rows per customer
total_spent = sum(amount, na.rm = TRUE), # Sum order amounts
.groups = "drop" # Remove grouping to avoid warning message
) %>%
mutate(
total_orders = ifelse(total_spent == 0, 0, total_orders), # Ensure order count is 0 if no orders
total_spent = replace_na(total_spent, 0) # Replace NA with 0 for total spent
)
head(customer_summary)
## # A tibble: 5 × 5
## customer_id name city total_orders total_spent
## <dbl> <chr> <chr> <dbl> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 0 0
## 5 5 Eve Phoenix 0 0