library(dplyr)
# Dataset 1: Customers
customers <- tibble(
customer_id = c(1, 2, 3, 4, 5),
name = c("Alice", "Bob", "Charlie", "David", "Eve"),
city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
# Dataset 2: Orders
orders <- tibble(
order_id = c(101, 102, 103, 104, 105, 106),
customer_id = c(1, 2, 3, 2, 6, 7),
product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
amount = c(1200, 800, 300, 1500, 600, 150)
)
q1 <- customers %>%
inner_join(orders, by = "customer_id")
print(q1)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
nrow(q1)
## [1] 4
A. There are four rows in the table
B. Customer id number 5 (Eve) and id number 4 (David) is not represented in the orders data set (did not buy anything) so it is removed
q2 <- customers %>%
left_join(orders, by = "customer_id")
print(q2)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
nrow(q2)
## [1] 6
A. Six rows in the table
B. The number of rows is larger than the inner join because all customers are included, regardless of whether they have matching orders
q3 <- customers %>%
right_join(orders, by = "customer_id")
print(q3)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
nrow(q3)
## [1] 6
A. Six rows in the table
B. Customer ids six and seven are null in the table, this is because their numbers are not within the orders data set
q4 <- customers %>%
full_join(orders, by = "customer_id")
print(q4)
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
nrow(q4)
## [1] 8
A. Eight rows in the table
B. If a row has NULL in the orders column, that means the customer has not placed any orders. If a row has NULL in the customers column, that means there is an order that does not correspond to any existing customer
q5 <- customers %>%
semi_join(customers, orders, by = "customer_id")
print(q5)
## # A tibble: 5 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
## 4 4 David Houston
## 5 5 Eve Phoenix
nrow(q5)
## [1] 5
A. Five rows in the table
B. The inner join would return all matching rows between the two tables and includes columns from both tables. A semi join only returns rows from the left table that have matches in the right table, without including columns from the right table.
q6 <- customers %>%
anti_join(orders, by = "customer_id")
print(q6)
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
nrow(q6)
## [1] 2
A. Two rows in the table
B. The anti join would result in returning the customer ids for those who do not have an order (ID)
q7a <- customers %>%
full_join(orders, by = "customer_id")
print(q7a)
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
q7b <- customers %>%
right_join(orders, by = "customer_id")
print(q7b)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
A. Using left Join gives a full list of customers, including
those with no orders (customers without orders will show NA
for order details)
B. Using Inner Join will only include customers who have placed at least one order, as it shows only the matches between the two tables.