# Dataset 1: Customers
customers <- tibble(
customer_id = c(1, 2, 3, 4, 5),
name = c("Alice", "Bob", "Charlie", "David", "Eve"),
city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
# Dataset 2: Orders
orders <- tibble(
order_id = c(101, 102, 103, 104, 105, 106),
customer_id = c(1, 2, 3, 2, 6, 7),
product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
amount = c(1200, 800, 300, 1500, 600, 150)
)
print(customers)
## # A tibble: 5 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
## 4 4 David Houston
## 5 5 Eve Phoenix
print(orders)
## # A tibble: 6 × 4
## order_id customer_id product amount
## <dbl> <dbl> <chr> <dbl>
## 1 101 1 Laptop 1200
## 2 102 2 Phone 800
## 3 103 3 Tablet 300
## 4 104 2 Desktop 1500
## 5 105 6 Camera 600
## 6 106 7 Printer 150
innerjoin <- inner_join(customers,
orders,
by = "customer_id")
There are 4 rows.
Some customers are not included in the result because they didn’t make any orders; there are no matches in the other table
innerjoin
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
leftjoin <- left_join(customers,
orders,
by = "customer_id")
There are 6 rows.
The two extra rows are from the customers table. They are included in the result even though they don’t appear in the orders table because they are on the left table (customers).
leftjoin
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
rightjoin <- right_join(customers,
orders,
by = "customer_id")
There are 6 rows.
customer_ids 6 and 7 have null for customer name and city because there is no match on the left table (customers). They only exist in the orders table.
rightjoin
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
fulljoin <- full_join(customers,
orders,
by = "customer_id")
There are 8 rows.
The order_id, product, and amount columns for customer_id 4 and 5 are null because they only exist in the customers table; there is no match in the orders table. The name and city for customer_id 6 and 7 are null because they only exist in the orders table but not the customers table; there is no match in the customers table.
fulljoin
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
semijoin <- semi_join(customers,
orders,
by = "customer_id")
There are 3 rows.
There are only three rows, customer_id 2 only shows up once, and the information on the orders table are not shown. It only returned rows from the left table where there is a match in the right table. This only shows the names and cities of the people who made orders and are a match in the orders table.
semijoin
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
antijoin <- anti_join(customers,
orders,
by = 'customer_id')
David and Eve
These customers’ customer_ids don’t show up in the orders table - they didn’t make any orders.
antijoin
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
I would use full_join because it shows how many people made orders based on the customer_ids made. In total, there are 7 unique customer_ids so there must be 7 unique customers. It also shows David and Eve’s names even if they didn’t make any orders.
I would use right_join because it shows the number and the names of customers who placed orders. Although customer_ids 6 and 7 don’t have a name or city, it shows how many orders were made. We can also use their customer_id to identify them.
fulljoin <- full_join(customers,
orders,
by = "customer_id")
rightjoin <- right_join(customers,
orders,
by = "customer_id")
fulljoin
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
rightjoin
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
# Create a full list of customer IDs from both datasets
all_customers <- customers %>%
select(customer_id, name, city) %>%
full_join(orders %>% select(customer_id), by = "customer_id") %>%
distinct()
# Join with orders to get order details
customer_summary <- all_customers %>%
left_join(orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarize(
total_orders = n(),
total_amount_spent = sum(amount, na.rm = TRUE),
.groups = "drop"
) %>%
# Replace NA names and cities for unknown customers
mutate(
name = ifelse(is.na(name), paste0("Unknown_", customer_id), name),
city = ifelse(is.na(city), "Unknown City", city)
)
# Print the summary
print(customer_summary)
## # A tibble: 7 × 5
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 1 0
## 5 5 Eve Phoenix 1 0
## 6 6 Unknown_6 Unknown City 1 600
## 7 7 Unknown_7 Unknown City 1 150