library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Dataset 1: Customers
customers <- tibble(
customer_id = c(1, 2, 3, 4, 5),
name = c("Alice", "Bob", "Charlie", "David", "Eve"),
city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
# Dataset 2: Orders
orders <- tibble(
order_id = c(101, 102, 103, 104, 105, 106),
customer_id = c(1, 2, 3, 2, 6, 7),
product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
amount = c(1200, 800, 300, 1500, 600, 150)
)
Practical Application (4 points) Imagine you’re analyzing customer
behavior.
Which join would you use to find all customers, including those who
haven’t placed any orders? Why?
You should use the left join to find all customers, because left
join includes all customers even those who have not placed orders.
Which join would you use to find only the customers who have placed
orders? Why?
You should use the inner join to find only the customers who have
placed orders because that table will only return customers with
matching customer_id in both tables.
Write the R code for both scenarios
## All customers (even without orders)
all_customers <- left_join(customers, orders )
## Joining with `by = join_by(customer_id)`
## Only customers who have placed orders
customers_with_orders <- inner_join(customers, orders)
## Joining with `by = join_by(customer_id)`
Display the result
head(all_customers)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
head(customers_with_orders)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
Challenge Question (3 points) Create a summary that shows each
customer’s name, city, total number of orders, and total amount spent.
Include all customers, even those without orders. Hint: You’ll need to
use a combination of joins and group_by/summarize operations.
customer_summary <- customers %>%
left_join(orders) %>%
group_by(name, city) %>%
summarize(
total_orders = n_distinct(order_id, na.rm = TRUE),
total_spent = sum(amount, na.rm = TRUE),
.groups = "drop"
)
## Joining with `by = join_by(customer_id)`
customer_summary
## # A tibble: 5 × 4
## name city total_orders total_spent
## <chr> <chr> <int> <dbl>
## 1 Alice New York 1 1200
## 2 Bob Los Angeles 2 2300
## 3 Charlie Chicago 1 300
## 4 David Houston 0 0
## 5 Eve Phoenix 0 0