1. Inner Join (3 points) Perform an inner join between the customers and orders datasets.

q1 <- inner_join(customers , orders , by = 'customer_id')

How many rows are in the result?

There are 4 rows

Why are some customers or orders not included in the result?

They did not have a match in the other table

Display the result

head(q1)
## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300

2. Left Join (3 points) Perform a left join with customers as the left table and orders as the right table.

q2 <- left_join(customers, orders, by = "customer_id")

How many rows are in the result?

There are 6 rows

Explain why this number differs from the inner join result.

Because the left join table includes all customers regardless of the order_id table. There is less in q1 because some people are NA for the order_id table.

Display the result

head(q2)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

3. Right Join (3 points) Perform a right join with customers as the left table and orders as the right table.

q3 <- right_join(customers, orders, by = "customer_id")

How many rows are in the result?

There are 6 rows

Which customer_ids in the result have NULL for customer name and city? Explain why.

The customer ids 6 and 7 have NULL for customer name and city because these id numbers do not exist in the customer table.

Display the result

head(q3)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           6 <NA>    <NA>             105 Camera     600
## 6           7 <NA>    <NA>             106 Printer    150

4. Full Join (3 points) Perform a full join between customers and orders.

q4 <- full_join(customers, orders, by = "customer_id")

How many rows are in the result?

There are 8 rows

Identify any rows where there’s information from only one table. Explain these results.

Customers 5, 6, 7, 8 all have parts of their row with NA values. This is because they have information from only one of the tables. Thus is the case because the full join makes sure no data is left out, so it uses NA to fill in some of the missing slots for incomplete rows.

Display the result

head(q4)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

5. Semi Join (3 points) Perform a semi join with customers as the left table and orders as the right table.

q5 <- semi_join(customers, orders, by = "customer_id")

How many rows are in the result?

There are 3 rows

How does this result differ from the inner join result?

Semi join is only displaying results from customers that have a match in orders. It is different than inner join because inner join shows all orders while semi join just shows the customers.

Display the result

head(q5)
## # A tibble: 3 × 3
##   customer_id name    city       
##         <dbl> <chr>   <chr>      
## 1           1 Alice   New York   
## 2           2 Bob     Los Angeles
## 3           3 Charlie Chicago

6. Anti Join (3 points) Perform an anti join with customers as the left table and orders as the right table.

q6 <- anti_join(customers, orders, by = "customer_id")

Which customers are in the result?

David and Eve. Customer 4 and 5.

Explain what this result tells you about these customers.

These customers are the ones who have never placed an order.

Display the result

head(q6)
## # A tibble: 2 × 3
##   customer_id name  city   
##         <dbl> <chr> <chr>  
## 1           4 David Houston
## 2           5 Eve   Phoenix

7. Practical Application (4 points) Imagine you’re analyzing customer behavior.

Which join would you use to find all customers, including those who haven’t placed any orders? Why?

Left Join would be used to find all customers. It shows all customers who have placed orders and those who have not.

Which join would you use to find only the customers who have placed orders? Why?

Semi Join because it shows just the customers who placed orders. It does not show what they ordered or the cost.

Write the R code for both scenarios.

all_customers <- left_join(customers, orders, by = "customer_id")

active_customers <- semi_join(customers, orders, by = "customer_id")

Display the result

head(all_customers)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
head(active_customers)
## # A tibble: 3 × 3
##   customer_id name    city       
##         <dbl> <chr>   <chr>      
## 1           1 Alice   New York   
## 2           2 Bob     Los Angeles
## 3           3 Charlie Chicago

8. Challenge Question (3 points) Create a summary that shows each customer’s name, city, total number of orders, and total amount spent. Include all customers, even those without orders. Hint: You’ll need to use a combination of joins and group_by/summarize operations.

customer_summary <- customers %>%
  left_join(orders, by = "customer_id") %>% 
  group_by(customer_id, name, city) %>% 
  # Step 2: Group by customer
  summarize(
    total_orders = n(),                     
    # Count number of orders
    total_spent = sum(amount, na.rm = TRUE),
    # Sum order amounts, treating NA as 0
    .groups = "drop"   
  ) %>%
  mutate(total_orders = ifelse(is.na(total_orders), 0, total_orders))
head(customer_summary)
## # A tibble: 5 × 5
##   customer_id name    city        total_orders total_spent
##         <dbl> <chr>   <chr>              <int>       <dbl>
## 1           1 Alice   New York               1        1200
## 2           2 Bob     Los Angeles            2        2300
## 3           3 Charlie Chicago                1         300
## 4           4 David   Houston                1           0
## 5           5 Eve     Phoenix                1           0