q1 <- inner_join(customers , orders, by = "customer_id")
There are 4 rows in the result.
Some customers are not in the result because their customer id is not included in the orders table.
q1
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
q2 <- left_join(customers, orders, by = "customer_id")
There are six rows in the result.
This differs from the inner join result because it includes the two customers who did not have their customer id included in the orders table.
q2
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q3 <- right_join(customers, orders, by= "customer_id")
There are six rows in the result.
Customer ids 6 and 7 are NULL for name and city. This is because the order id does not match with any customer id in the customer table.
q3
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
q4 <- full_join(customers, orders, by= "customer_id")
There are 8 rows in the result.
In rows 5 and 6, there is information from only the customers table, and in rows 7 and 8 there is information from only the orders table. These results show all data from both tables, even if there is no matching data from the other table.
q4
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
q5 <- semi_join(customers, orders, by= "customer_id")
There are three rows in the result.
This does not include the information from the orders table, but instead only includes the customers who have orders and does not display the order information.
q5
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
q6 <- anti_join(customers, orders, by= "customer_id")
David and Eve are the customers in the result.
This result tells me that David and Eve are the customers who are not associated with an order.
q6
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
I would use a left join, where the customers table is the left table. This is because the result will display all of the customers in the customers table, even if they are not associated with any orders. This is assuming you are tracking customers with customer names and ids. If you are looking for all orders placed regardless if the customer has a name or not, I would use full join command as it joins both tables together regardless if there is a any name attached to the customer id.
I would use a semi join because it returns only the customers who have placed orders. Unlike an inner join, a semi join does not duplicate customer records when multiple matching orders exist; instead, it simply verifies the extistence of a match in the orders table and returns only the corresponding customers from the customer table.
q7a <- left_join(customers, orders, by= "customer_id")
q7b <- semi_join(customers, orders, by= "customer_id")
q7a
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q7b
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
q8 <- customers %>%
left_join(orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarize(
total_orders = n_distinct(order_id, na.rm = TRUE),
total_amount_spent = sum(amount, na.rm = TRUE),
.groups = "drop"
)
head(q8)
## # A tibble: 5 × 5
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 0 0
## 5 5 Eve Phoenix 0 0