inner <- inner_join(customers, orders, by = 'customer_id')
There are 4 rows in the result.
The only rows showing have a customer_id that matches in BOTH the “orders” AND “customers” tables (occurs in both tables).
print(inner)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
leftjoin <- left_join(customers, orders, by = 'customer_id')
There are 6 rows in the result.
This number includes all rows from the left table, customers, and all matching rows from the right table, orders. It is different from the inner join as all rows from the left table are included, whether it matches with a row from the right table or not.
print(leftjoin)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
rightjoin <- right_join(customers, orders, by = 'customer_id')
7 rows are in the result.
Customer_ids “6” and “7” have NULL for customer name and city because the ID is not included in tables “customers”.
print(rightjoin)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
fulljoin <- full_join(customers, orders, by = 'customer_id')
There are 8 rows in the result.
There are several rows with information from only one table. Rows for customer IDs 4 and 5 only contain data from table “customers,” and rows for customer IDs 6 and 7 only have data from table “orders”. This is because full_join returns ALL rows from both tables, without there having to be a match between them.
print(fulljoin)
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
semijoin <- semi_join(customers, orders, by = 'customer_id')
There are 3 rows in the result.
The inner join result shows ALL rows where there is a match between the left and right table, while the semi join result shows ONLY rows from the left table where there is a match within the right table.
print(semijoin)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
antijoin <- anti_join(customers, orders, by = 'customer_id')
Customers with the IDs “4” and “5” are in the result.
This result tells us that these two customers are not included within the “orders” table, and have not ordered anything.
print(antijoin)
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
We would use left join, as it includes all data from table “customers” (therefore, all customer names regardless of orders placed) and corresponding data from table “orders” only if it is related to the listed customers.
leftjoin <- left_join(customers, orders, by = 'customer_id')
print(leftjoin)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
We would use right join, as it includes all data from table “orders” and corresponding data from table “customers” only if it corresponds to the listed orders.
rightjoin <- right_join(customers, orders, by = 'customer_id')
print(leftjoin)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
summary_data <- customers %>%
left_join(orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarize(
total_orders = sum(!is.na(order_id)),
total_spent = sum(amount, na.rm = TRUE)
) %>%
ungroup()
print(summary_data)
## # A tibble: 5 × 5
## customer_id name city total_orders total_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 0 0
## 5 5 Eve Phoenix 0 0