Task 1 Inner Join:

q1 <- inner_join(customers , orders)
## Joining with `by = join_by(customer_id)`

How many rows are in the result?

The resulting data has 4 rows

Why are some customers or orders not included in the result?

Because the customers and orders that are not included did not have a match in the other table

head(q1)
## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300

Task 2: Left Join

inner_result <- customers %>%
  inner_join(orders, by = "customer_id")

How many rows are in the result?

nrow(inner_result)
## [1] 4

Why are some customers or orders not included?

customers_without_orders <- customers %>% anti_join(orders, by = "customer_id")
orders_without_customers <- orders %>% anti_join(customers, by = "customer_id")

nrow(customers_without_orders)
## [1] 2
nrow(orders_without_customers)
## [1] 2

Display the result

inner_result
## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
names(customers)
## [1] "customer_id" "name"        "city"
names(orders)
## [1] "order_id"    "customer_id" "product"     "amount"

Task 3: Right Join

A right join keeps all rows from the right table (orders) and fills in NA for any columns from the left table (customers) where there is no match.

right_result <- right_join(customers, orders, by = "customer_id")
right_result
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           6 <NA>    <NA>             105 Camera     600
## 6           7 <NA>    <NA>             106 Printer    150

3a. How many rows are in the result?

nrow(right_result)
## [1] 6

There are 6 rows in the right join result.

3b. Which customer_ids have NULL (NA) for name and city? Why?

right_result %>% filter(is.na(name))
## # A tibble: 2 × 6
##   customer_id name  city  order_id product amount
##         <dbl> <chr> <chr>    <dbl> <chr>    <dbl>
## 1           6 <NA>  <NA>       105 Camera     600
## 2           7 <NA>  <NA>       106 Printer    150

Customer_ids 6 and 7 have NA for name and city. This is because orders 105 and 106 were placed by customer_ids 6 and 7, who do not exist in the customers table. Since the right join preserves all rows from the orders (right) table, these rows appear but have missing customer information.


Task 4: Full Join

A full join keeps all rows from both tables, filling in NA wherever there is no match on either side.

full_result <- full_join(customers, orders, by = "customer_id")
full_result
## # A tibble: 8 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
## 7           6 <NA>    <NA>             105 Camera     600
## 8           7 <NA>    <NA>             106 Printer    150

4a. How many rows are in the result?

nrow(full_result)
## [1] 8

There are 8 rows in the full join result.

4b. Rows with information from only one table:

# Rows from customers only (no matching order)
full_result %>% filter(is.na(order_id))
## # A tibble: 2 × 6
##   customer_id name  city    order_id product amount
##         <dbl> <chr> <chr>      <dbl> <chr>    <dbl>
## 1           4 David Houston       NA <NA>        NA
## 2           5 Eve   Phoenix       NA <NA>        NA
# Rows from orders only (no matching customer)
full_result %>% filter(is.na(name))
## # A tibble: 2 × 6
##   customer_id name  city  order_id product amount
##         <dbl> <chr> <chr>    <dbl> <chr>    <dbl>
## 1           6 <NA>  <NA>       105 Camera     600
## 2           7 <NA>  <NA>       106 Printer    150

Task 5: Semi Join

A semi join returns all rows from the left table (customers) that have a match in the right table (orders), but does not add any columns from the right table.

semi_result <- semi_join(customers, orders, by = "customer_id")
semi_result
## # A tibble: 3 × 3
##   customer_id name    city       
##         <dbl> <chr>   <chr>      
## 1           1 Alice   New York   
## 2           2 Bob     Los Angeles
## 3           3 Charlie Chicago

5a. How many rows are in the result?

nrow(semi_result)
## [1] 3

There are 3 rows in the semi join result.

5b. How does this differ from the inner join?

The inner join returned 4 rows with columns from both tables, and Bob appeared twice (because he had 2 orders). The semi join returns 3 rows — one per unique customer who has at least one order — and includes only the customer columns. It de-duplicates: even though Bob has 2 orders, he only appears once. It’s a filtering join, not a merging join.


Task 6: Anti Join

An anti join returns all rows from the left table (customers) that have no match in the right table (orders).

anti_result <- anti_join(customers, orders, by = "customer_id")
anti_result
## # A tibble: 2 × 3
##   customer_id name  city   
##         <dbl> <chr> <chr>  
## 1           4 David Houston
## 2           5 Eve   Phoenix

6a. Which customers are in the result?

The result contains David (customer_id = 4) and Eve (customer_id = 5).

6b. What does this tell us about these customers?

David and Eve have never placed any orders. Their customer_ids (4 and 5) do not appear anywhere in the orders table. This anti join is a useful tool for identifying inactive customers — for example, a business could use this result to target David and Eve with marketing campaigns to encourage their first purchase.


Task 7: Practical Application

7a. Which join to find ALL customers, including those without orders?

Use a left join with customers as the left table. This preserves every customer record regardless of whether they’ve placed an order. Customers without orders will show NA in the order columns, making it easy to identify them.

7b. Which join to find ONLY customers who have placed orders?

Use a semi join (or inner join) with customers as the left table. The semi join returns only customers with at least one matching order, without duplicating customer rows for multiple orders. Use an inner join if you also want the order details.

7c & 7d. R code and results for both scenarios:

# Scenario 1: All customers, including those without orders (Left Join)
all_customers <- left_join(customers, orders, by = "customer_id")
cat("Scenario 1 - All customers (left join):\n")
## Scenario 1 - All customers (left join):
all_customers
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
# Scenario 2: Only customers who have placed orders (Semi Join)
active_customers <- semi_join(customers, orders, by = "customer_id")
cat("Scenario 2 - Customers with orders only (semi join):\n")
## Scenario 2 - Customers with orders only (semi join):
active_customers
## # A tibble: 3 × 3
##   customer_id name    city       
##         <dbl> <chr>   <chr>      
## 1           1 Alice   New York   
## 2           2 Bob     Los Angeles
## 3           3 Charlie Chicago

Task 8: Challenge Question

Create a summary showing each customer’s name, city, total number of orders, and total amount spent — including customers with no orders.

customer_summary <- customers %>%
  left_join(orders, by = "customer_id") %>%
  group_by(customer_id, name, city) %>%
  summarize(
    total_orders = sum(!is.na(order_id)),
    total_amount_spent = sum(amount, na.rm = TRUE),
    .groups = "drop"
  ) %>%
  arrange(customer_id)

customer_summary
## # A tibble: 5 × 5
##   customer_id name    city        total_orders total_amount_spent
##         <dbl> <chr>   <chr>              <int>              <dbl>
## 1           1 Alice   New York               1               1200
## 2           2 Bob     Los Angeles            2               2300
## 3           3 Charlie Chicago                1                300
## 4           4 David   Houston                0                  0
## 5           5 Eve     Phoenix                0                  0