R Markdown
library (dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Dataset 1: Customers
customers <- tibble(
customer_id = c(1, 2, 3, 4, 5),
name = c("Alice", "Bob", "Charlie", "David", "Eve"),
city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
# Dataset 2: Orders
orders <- tibble(
order_id = c(101, 102, 103, 104, 105, 106),
customer_id = c(1, 2, 3, 2, 6, 7),
product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
amount = c(1200, 800, 300, 1500, 600, 150)
)
Inner Join
q1 <- customers %>%
inner_join(orders, by = "customer_id")
print(q1)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
nrow(q1)
## [1] 4
"Customers without orders like David and Eve and orders without matching customers like customer_id 6 and 7 are excluded.
"
## [1] "Customers without orders like David and Eve and orders without matching customers like customer_id 6 and 7 are excluded.\n"
Left Join
q2 <- customers %>%
left_join(orders, by = "customer_id")
print(q2)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
nrow(q2)
## [1] 6
"This differs from the inner join because the left join includes all customers, even if they don’t have orders"
## [1] "This differs from the inner join because the left join includes all customers, even if they don’t have orders"
Right Join
q3 <- customers %>%
right_join(orders, by = "customer_id")
print(q3)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
nrow(q3)
## [1] 6
"Customers with customer_id 6 and 7 have NA values for name and city because they don't exist in the customers table."
## [1] "Customers with customer_id 6 and 7 have NA values for name and city because they don't exist in the customers table."
Full Join
q4 <- customers %>%
full_join(orders, by = "customer_id")
print(q4)
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
nrow(q4)
## [1] 8
"Rows with missing data from either table will have NA values where there's no match."
## [1] "Rows with missing data from either table will have NA values where there's no match."
Semi Join
q5 <- customers %>%
semi_join(orders, by = "customer_id")
print (q5)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
nrow(q5)
## [1] 3
"Unlike the inner join, the semi join returns only rows from customers, not the actual joined data."
## [1] "Unlike the inner join, the semi join returns only rows from customers, not the actual joined data."
Anti Join
q6 <- customers %>%
anti_join(orders, by = "customer_id")
print(q6)
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
"The result contains customers who haven’t placed any orders, as the anti join returns only rows from the customers table that don’t have a match in the orders table."
## [1] "The result contains customers who haven’t placed any orders, as the anti join returns only rows from the customers table that don’t have a match in the orders table."
Practical Application
q7a <- customers %>%
left_join(orders, by = "customer_id")
print(q7a)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
"A left join ensures that all customers are included, even if they haven’t placed any orders."
## [1] "A left join ensures that all customers are included, even if they haven’t placed any orders."
q7b <- customers %>%
inner_join(orders, by = "customer_id")
print(q7b)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
"An inner join includes only customers who have placed at least one order, filtering out customers with no orders."
## [1] "An inner join includes only customers who have placed at least one order, filtering out customers with no orders."