4 rows
Inner join only returns when there is a match on both tables. The customer data only has 3 customers in the order table with one customer having 2 orders.
q1 <- inner_join(customers , orders, by ='customer_id')
head(q1)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
6
In a left join, all of the items from the left table(customers), are included even if they don’t match to what is in the right table(orders). Those that don’t have a match return ’N/A’. Inner join returns items where there is a matching record between both tables.
q2 <- left_join(customers, orders, by ='customer_id')
head(q2)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
6 rows
Customer_ids 6 and 7 have no values for name and city. There is no matching data for customer_ids 6 and 7 in the customers table(left table). When there is no match for a row that is in the right table, then no value will be returned in the columns from the left table.
q3 <-right_join(customers, orders, by = 'customer_id')
head(q3)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
8 rows
The first set of rows that have information from only one table are Customer ID 6 and 7 have results from one table. These customer ID are located in the orders table and there is no corresponding data in the customers table. Customer ID 4 and 5 are located in the customers table and do not have matching data in the orders table. N/A is returnee indicating there is no corresponding data.
q4 <- full_join(customers, orders, by = "customer_id")
print(q4)
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
3 rows
The inner join included both customer and order details.The semi join only returns customers who have placed at least one order without order details. Customers 4 and 5 are excluded because they have no orders.
q5 <-semi_join(customers, orders, by ='customer_id')
head(q5)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
Customer ID 4 and 5
Results show that customer ID 4 and 5 have not made any purchases since they are in the customer table but not in the orders table.
q6 <-anti_join(customers, orders, by = 'customer_id')
head(q6)
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
The full join function would be used because the full join returns all rows from both the customers and orders table when there is a match in either the left or right table. This would include all customers even ones who do not have orders placed within the orders table.
To return customers who have placed orders you would use right join . This will return all rows from the right table which is the orders and find correlating rows in the left table which is customers. Only customers who have placed orders will be displayed in the results since the orders are in the right table. Even if an order does not have a corresponding customers in the table, it will still appear in the results.
q7a <-full_join(customers, orders)
## Joining with `by = join_by(customer_id)`
q7b <-right_join(customers, orders)
## Joining with `by = join_by(customer_id)`
print(q7a)
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
print(q7b)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
challenge_question <- customers %>%
full_join(orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarize( total_orders = sum(!is.na(order_id)), total_amount_spent = sum(amount, na.rm = TRUE),
.groups = "drop"
)
head(challenge_question)
## # A tibble: 6 × 5
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 0 0
## 5 5 Eve Phoenix 0 0
## 6 6 <NA> <NA> 1 600