q1 <- inner_join(customers , orders , by = 'customer_id')
There are four rows.
Inner join returns all rows from both tables where there is a match therefore the rows not returned did not have a match in the other table.
head(q1)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
q2 <- left_join(customers , orders , by = 'customer_id')
There are six rows.
This number differs from the inner join result because it uses left join, which returns all rows from the left table and matching rows from the right table.
head(q2)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q3 <- right_join(customers , orders , by = 'customer_id')
There are six rows.
The customer_ids that have NULL for customer name and city are customer_id 6 and customer_id 7. These customer IDs appear in the orders table but do not exist in the customers table.
missing_customers <- q3 %>%
filter(is.na(name) | is.na(city)) %>%
select(customer_id, name, city)
head(q3)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
q4 <- full_join(customers , orders , by = 'customer_id')
There are eight rows.
missing_rows <- q4 %>%
filter(is.na(order_id) | is.na(name)) %>%
select(customer_id, name, city, order_id, product, amount)
missing_rows
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 4 David Houston NA <NA> NA
## 2 5 Eve Phoenix NA <NA> NA
## 3 6 <NA> <NA> 105 Camera 600
## 4 7 <NA> <NA> 106 Printer 150
The rows where there’s information from only one table are Customer Ids 4, 5, 6, and 7.
This occurs because a full join includes all rows from both tables, so when a match isn’t found, the columns from the missing table are filled with NA values.
head(q4)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q5 <- semi_join(customers , orders , by = 'customer_id')
There are three rows.
This result differs from the inner join result because it uses semi join, which returns all rows from the left table where there is a match in the right table. Therefore, semi join only included columns from the customers table.
head(q5)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
q6 <- anti_join(customers , orders , by = 'customer_id')
Customers David and Eve are in the result. They are customer ID # 4 and 5.
This result tells us that David and Eve are customers who have not placed any orders. The anti join keeps only rows from the customers table that have no matching records in the orders table. Therefore, these customers exist in the system but have no associated order data.
head(q6)
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
The join function you would use to find all customers is the left join function. A left join keeps all rows from the left table (customers) and adds matching rows from the right table (orders) where possible. If a customer has not placed any orders, their order_id, product, and amount will appear as NA.
To find only the customers who have placed orders, you would use a semi join. A semi join returns only the rows from the left table (customers) that have matching entries in the right table (orders). It shows only customers who have placed at least one order, but does not include order details.
all_customers <- left_join(customers, orders, by = "customer_id")
customers_with_orders <- semi_join(customers, orders, by = "customer_id")
print(all_customers)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
print(customers_with_orders)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
customer_orders <- left_join(customers, orders, by = "customer_id")
customer_summary <- customer_orders %>%
group_by(customer_id, name, city) %>%
summarise(
total_orders = sum(!is.na(order_id)),
total_amount_spent = sum(amount, na.rm = TRUE)
)
## `summarise()` has grouped output by 'customer_id', 'name'. You can override
## using the `.groups` argument.
print(customer_summary)
## # A tibble: 5 × 5
## # Groups: customer_id, name [5]
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 0 0
## 5 5 Eve Phoenix 0 0
The data uses a left join to include all customers, even those with no orders. Then, using group_by() and summarise(), we calculated each customer’s total number of orders and total amount spent. Customers with no orders show 0 for both totals.