q1 <- inner_join(customers , orders)
## Joining with `by = join_by(customer_id)`
The resulting data has 4 rows
Because the customers and orders that are not included did not have a match in the other table
head(q1)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
inner_result <- customers %>%
inner_join(orders, by = "customer_id")
nrow(inner_result)
## [1] 4
customers_without_orders <- customers %>% anti_join(orders, by = "customer_id")
orders_without_customers <- orders %>% anti_join(customers, by = "customer_id")
nrow(customers_without_orders)
## [1] 2
nrow(orders_without_customers)
## [1] 2
inner_result
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
names(customers)
## [1] "customer_id" "name" "city"
names(orders)
## [1] "order_id" "customer_id" "product" "amount"
A right join keeps all rows from the right
table (orders) and fills in NA for any columns
from the left table (customers) where there is no match.
right_result <- right_join(customers, orders, by = "customer_id")
right_result
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
3a. How many rows are in the result?
nrow(right_result)
## [1] 6
There are 6 rows in the right join result.
3b. Which customer_ids have NULL (NA) for name and city? Why?
right_result %>% filter(is.na(name))
## # A tibble: 2 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 6 <NA> <NA> 105 Camera 600
## 2 7 <NA> <NA> 106 Printer 150
Customer_ids 6 and 7 have NA for
name and city. This is because orders 105 and
106 were placed by customer_ids 6 and 7, who do not
exist in the customers table. Since the right join preserves
all rows from the orders (right) table, these rows appear but have
missing customer information.
A full join keeps all rows from both
tables, filling in NA wherever there is no match
on either side.
full_result <- full_join(customers, orders, by = "customer_id")
full_result
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
4a. How many rows are in the result?
nrow(full_result)
## [1] 8
There are 8 rows in the full join result.
4b. Rows with information from only one table:
# Rows from customers only (no matching order)
full_result %>% filter(is.na(order_id))
## # A tibble: 2 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 4 David Houston NA <NA> NA
## 2 5 Eve Phoenix NA <NA> NA
# Rows from orders only (no matching customer)
full_result %>% filter(is.na(name))
## # A tibble: 2 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 6 <NA> <NA> 105 Camera 600
## 2 7 <NA> <NA> 106 Printer 150
NA for all order columns — they are in the customers table
but have placed no orders.NA for name and city — they
exist in the orders table but have no corresponding customer
record.A semi join returns all rows from the left table (customers) that have a match in the right table (orders), but does not add any columns from the right table.
semi_result <- semi_join(customers, orders, by = "customer_id")
semi_result
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
5a. How many rows are in the result?
nrow(semi_result)
## [1] 3
There are 3 rows in the semi join result.
5b. How does this differ from the inner join?
The inner join returned 4 rows with columns from both tables, and Bob appeared twice (because he had 2 orders). The semi join returns 3 rows — one per unique customer who has at least one order — and includes only the customer columns. It de-duplicates: even though Bob has 2 orders, he only appears once. It’s a filtering join, not a merging join.
An anti join returns all rows from the left table (customers) that have no match in the right table (orders).
anti_result <- anti_join(customers, orders, by = "customer_id")
anti_result
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
6a. Which customers are in the result?
The result contains David (customer_id = 4) and Eve (customer_id = 5).
6b. What does this tell us about these customers?
David and Eve have never placed any orders. Their customer_ids (4 and 5) do not appear anywhere in the orders table. This anti join is a useful tool for identifying inactive customers — for example, a business could use this result to target David and Eve with marketing campaigns to encourage their first purchase.
7a. Which join to find ALL customers, including those without orders?
Use a left join with customers as the
left table. This preserves every customer record regardless of whether
they’ve placed an order. Customers without orders will show
NA in the order columns, making it easy to identify
them.
7b. Which join to find ONLY customers who have placed orders?
Use a semi join (or inner join) with
customers as the left table. The semi join returns only
customers with at least one matching order, without duplicating customer
rows for multiple orders. Use an inner join if you also want the order
details.
7c & 7d. R code and results for both scenarios:
# Scenario 1: All customers, including those without orders (Left Join)
all_customers <- left_join(customers, orders, by = "customer_id")
cat("Scenario 1 - All customers (left join):\n")
## Scenario 1 - All customers (left join):
all_customers
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
# Scenario 2: Only customers who have placed orders (Semi Join)
active_customers <- semi_join(customers, orders, by = "customer_id")
cat("Scenario 2 - Customers with orders only (semi join):\n")
## Scenario 2 - Customers with orders only (semi join):
active_customers
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
Create a summary showing each customer’s name, city, total number of orders, and total amount spent — including customers with no orders.
customer_summary <- customers %>%
left_join(orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarize(
total_orders = sum(!is.na(order_id)),
total_amount_spent = sum(amount, na.rm = TRUE),
.groups = "drop"
) %>%
arrange(customer_id)
customer_summary
## # A tibble: 5 × 5
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 0 0
## 5 5 Eve Phoenix 0 0