R Markdown

Task 1: Inner Join

q1 <- inner_join(customers, orders, by = "customer_id")

How many rows are in the result?

There are 4 rows.

Why are some customers or orders not included in the result?

Inner join returns all rows from both tables where there is a match. Therefore the rows not returned did not have a match in the other table.

Display the result

kable(q1)
customer_id name city order_id product amount
1 Alice New York 101 Laptop 1200
2 Bob Los Angeles 102 Phone 800
2 Bob Los Angeles 104 Desktop 1500
3 Charlie Chicago 103 Tablet 300

Task 2: Left Join

q2 <- left_join(customers, orders, by = "customer_id")

How many rows are in the result?

There are 6 rows.

Explain why this number differs from the inner join result.

Left join keeps ALL customers (1..5). Those without orders (4, 5) appear once with NA order fields. Customers with multiple orders (customer_id = 2) appear multiple times (once per matching order).

Display Results

kable(q2)
customer_id name city order_id product amount
1 Alice New York 101 Laptop 1200
2 Bob Los Angeles 102 Phone 800
2 Bob Los Angeles 104 Desktop 1500
3 Charlie Chicago 103 Tablet 300
4 David Houston NA NA NA
5 Eve Phoenix NA NA NA

Task 3: Right Join

q3 <- right_join(customers, orders, by = "customer_id")

How many rows are in the result?

There are 6 rows.

Which customer_ids in the result have NULL for customer name and city? Explain why.

q3 %>% filter(is.na(name) | is.na(city)) %>% select(customer_id, order_id, product, amount)
## # A tibble: 2 × 4
##   customer_id order_id product amount
##         <dbl>    <dbl> <chr>    <dbl>
## 1           6      105 Camera     600
## 2           7      106 Printer    150

A right join keeps all rows from the right table, filling in missing values for non-matching rows from the left. Since customers 6 and 7 do not exist in customers, their name and city show up as NULL.

Display Results

kable(q3)
customer_id name city order_id product amount
1 Alice New York 101 Laptop 1200
2 Bob Los Angeles 102 Phone 800
2 Bob Los Angeles 104 Desktop 1500
3 Charlie Chicago 103 Tablet 300
6 NA NA 105 Camera 600
7 NA NA 106 Printer 150

Task 4: Full Join

q4 <- full_join(customers, orders, by = "customer_id")

How many rows are in the result?

There are 8 rows.

Identify any rows where there’s information from only one table. Explain these results.

only_one <- q4 %>% 
  filter(is.na(order_id) | is.na(name)) %>% 
  select(customer_id, name, city, order_id, product, amount)
kable(only_one)
customer_id name city order_id product amount
4 David Houston NA NA NA
5 Eve Phoenix NA NA NA
6 NA NA 105 Camera 600
7 NA NA 106 Printer 150

Full join combines all rows from both tables. If a row doesn’t have a match in the other table, the unmatched columns are filled with NA.

Display Results

kable(q4)
customer_id name city order_id product amount
1 Alice New York 101 Laptop 1200
2 Bob Los Angeles 102 Phone 800
2 Bob Los Angeles 104 Desktop 1500
3 Charlie Chicago 103 Tablet 300
4 David Houston NA NA NA
5 Eve Phoenix NA NA NA
6 NA NA 105 Camera 600
7 NA NA 106 Printer 150

Task 5: Semi Join

q5 <- semi_join(customers, orders, by = "customer_id")

How many rows are in the result?

There are 3 rows.

How does this result differ from the inner join result?

This result differs from the inner join because the semi join only returns the customers who placed orders, without repeating them for each order.

Display Results

kable(q5)
customer_id name city
1 Alice New York
2 Bob Los Angeles
3 Charlie Chicago

Task 6: Anti Join

q6 <- anti_join(customers, orders, by = "customer_id")

Which customers are in the result?

David and Eve.

Explain what this result tells you about these customers.

This result tells us which customers have never placed any orders, since they exist in the customers table but not in the orders table.

Display Results

kable(q6)
customer_id name city
4 David Houston
5 Eve Phoenix

Task 7: Practical Application

Which join to find all customers, including those without orders? Why?

Use a left join, because it keeps every row from customers and fills in missing order info with NA for customers who haven’t ordered.

all_customers <- left_join(customers, orders, by = "customer_id")
kable(all_customers)
customer_id name city order_id product amount
1 Alice New York 101 Laptop 1200
2 Bob Los Angeles 102 Phone 800
2 Bob Los Angeles 104 Desktop 1500
3 Charlie Chicago 103 Tablet 300
4 David Houston NA NA NA
5 Eve Phoenix NA NA NA

Which join to find only customers who have placed orders? Why?

Use a semi join because it returns only customers rows that have a match in orders without duplicating customers per order.

customers_with_orders <- semi_join(customers, orders, by = "customer_id")
kable(customers_with_orders)
customer_id name city
1 Alice New York
2 Bob Los Angeles
3 Charlie Chicago

Task 8: Challenge Question

challenge <- customers %>%
  left_join(orders, by = "customer_id") %>%
  group_by(customer_id, name, city) %>%
  summarise(
    total_orders = n_distinct(order_id, na.rm = TRUE),
    total_spent = sum(amount, na.rm = TRUE),
    .groups = "drop"
  )

kable(challenge)
customer_id name city total_orders total_spent
1 Alice New York 1 1200
2 Bob Los Angeles 2 2300
3 Charlie Chicago 1 300
4 David Houston 0 0
5 Eve Phoenix 0 0