## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.4 ✔ readr 2.1.5
## ✔ forcats 1.0.0 ✔ stringr 1.5.1
## ✔ ggplot2 3.5.1 ✔ tibble 3.2.1
## ✔ lubridate 1.9.4 ✔ tidyr 1.3.1
## ✔ purrr 1.0.4
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag() masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
4 rows
An inner join only includes rows where there is a match in both datasets, which is why some customers (David & Eve) are not included.
# Perform an inner join
inner_join_result <- inner_join(customers, orders, by = "customer_id")
# Display the result
inner_join_result
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
6 rows
This is different from Inner Join because a Left Join includes all rows from the customers table, even if there is no matching record in the orders table. Customers David and Eve are included in the result, even though they have not placed any orders. Their corresponding values in the orders columns appear as NA, indicating no matching record in the orders dataset.
# Perform a left join (customers as left table)
left_join_result <- left_join(customers, orders, by = "customer_id")
# Display the result
left_join_result
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
6 rows
The 5th and 6th customers have NULL values for name and city. This is because they exist in the orders table but do not exist in the customers table. Since there is no matching customer record, their details appear as missing (NA) in the result.
# Perform a right join (orders as the right table)
right_join_result <- right_join(customers, orders, by = "customer_id")
# Display the result
right_join_result
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
8 rows
David and Eve do not have order information because they have never placed an order. Customers 6 and 7 appear with NULL values for name and city because they exist in the orders table but do not exist in the customers table. This is because a full join includes all records from both tables, and where there is no match, NULL values are used in the result.
# Perform a full join between customers and orders
full_join_result <- full_join(customers, orders, by = "customer_id")
# Display the result
full_join_result
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
3 rows
This result differs from the Inner Join result because the Semi Join only returns customers who have at least one order. The Semi Join has fewer rows than the Inner Join because it keeps only unique customers who placed orders, rather than including multiple order records for the same customer.
# Perform a semi join (customers as the left table)
semi_join_result <- semi_join(customers, orders, by = "customer_id")
# Display the result
semi_join_result
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
David and Eve
These customers are in the result because they have not yet made a purchase. The Anti Join returns only the customers who do not have a matching record in the orders table, meaning they have not placed any orders.
# Perform an anti join (customers as the left table)
anti_join_result <- anti_join(customers, orders, by = "customer_id")
# Display the result
anti_join_result
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
To find all customers, including those who haven’t placed any orders, I would use a Left Join. A Left Join ensures that all customers from the customers table are included in the result, even if they have not placed any orders. If a customer has not placed an order, their corresponding order details will appear as NA.
To find only customers who have placed orders, I would use an Inner Join. An Inner Join includes only the rows where there is a match in both tables, meaning it will return only customers who have at least one order in the orders table, excluding those who haven’t placed any orders.
# Perform a left join to find all customers, including those who haven't placed any orders
left_join_result_all_customers <- left_join(customers, orders, by = "customer_id")
# Display the result
left_join_result_all_customers
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
# Perform a semi join to find only customers who have placed orders
semi_join_result_customers_with_orders <- semi_join(customers, orders, by = "customer_id")
# Display the result
semi_join_result_customers_with_orders
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
To ensure all customers are included in the summary, we will use a Left Join to merge the customers and orders tables. This allows us to keep all customer records, even if they have not placed any orders. We then use group_by() and summarize() to calculate the total number of orders and the total amount spent by each customer.
# Perform a left join to include all customers, even those without orders
summary_result <- customers %>%
left_join(orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarize(
total_orders = n(), # Count the number of orders for each customer
total_spent = sum(amount, na.rm = TRUE) # Sum the amount spent by each customer, ignoring NA
) %>%
ungroup() # Remove grouping for easier display
## `summarise()` has grouped output by 'customer_id', 'name'. You can override
## using the `.groups` argument.
# Display the result
summary_result
## # A tibble: 5 × 5
## customer_id name city total_orders total_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 1 0
## 5 5 Eve Phoenix 1 0