4 rows
An inner join only includes rows where there is a match in both datasets, which is why some customers (David & Eve) are not included.
inner_join(customers, orders, by = "customer_id")
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
6 rows
This is different than inner join because left join displays all rows, even if they do not have a match in both datasets. So, David and Eve appear with an NA result in the orders column.
left_join(customers, orders, by = "customer_id")
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
6 rows
The 5th and 6th customers have NULL. This is because they exist in the orders table but do not exist in the customers table. Since there’s no matching customer, their details are missing.
right_join(customers, orders, by = "customer_id")
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
8 rows
David and Eve do not have order information, and customers 7 and 8 do not have names or cities. This is because a full join includes all records from both tables. Some customers don’t have orders, and some orders don’t have matching customers, leading to NA values in the result.
full_join(customers, orders, by = "customer_id")
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
3 rows
This result differs from the inner join result because the semi join only returns customers who have at least one order.The semi join has fewer rows than the inner join because it keeps only unique customers who placed orders.
semi_join(customers, orders, by = "customer_id")
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
David and Eve
These customers are in the result because they have not yet made a purchase.
anti_join(customers, orders, by = "customer_id")
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
The only customers that count are the ones that appear in the customer table. To find these customers, I would use left join because this includes all customers, and those without orders will still appear, with their order details as NA.
To find only customers who have placed orders, I would use inner join because this will return only customers who have matching orders, and exclude those who haven’t placed any.
left_join(customers, orders, by = "customer_id")
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
inner_join(customers, orders, by = "customer_id")
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
The only customers that count are the ones that appear in the customer table.
customer_summary <- customers %>%
left_join(orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarize(
total_orders = n(),
total_spent = sum(amount, na.rm = TRUE),
.groups = "drop")
head(customer_summary)
## # A tibble: 5 × 5
## customer_id name city total_orders total_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 1 0
## 5 5 Eve Phoenix 1 0