Assignment 4

1. Inner Join (3 points) Perform an inner join between the customers and orders datasets.

q1 <- inner_join(customers, orders, by = 'customer_id')

How many rows are in the result?

There are 4 rows

Why are some customers or orders not included in the result?

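Because an inner join keeps only rows whose customer_id appears in both tables. Customers 4 (David) and 5 (Eve) have no orders, and orders 105 and 106 belong to customer_ids 6 and 7, which are not in customers, so those rows are dropped.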
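To check which keys fail to match, the customer_id values of the two tables can be compared directly. This is a minimal sketch: the customers and orders data frames below are an assumption, reconstructed from the join results printed in this assignment, and the original definitions may differ in column order or type.

```r
library(dplyr)

# Assumed data, reconstructed from the printed join results
customers <- data.frame(
  customer_id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
orders <- data.frame(
  order_id = c(101, 102, 103, 104, 105, 106),
  customer_id = c(1, 2, 3, 2, 6, 7),
  product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
  amount = c(1200, 800, 300, 1500, 600, 150)
)

# customer_ids that appear in only one of the two tables
customers_without_orders <- setdiff(customers$customer_id, orders$customer_id)
orders_without_customers <- setdiff(orders$customer_id, customers$customer_id)

customers_without_orders  # 4 5
orders_without_customers  # 6 7
```

Every customer_id listed here is dropped by the inner join, because it has no partner in the other table.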

Display the result

q1

## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300

2. Left Join (3 points) Perform a left join with customers as the left table and orders as the right table.

q2 <- left_join(customers, orders, by = 'customer_id')

How many rows are in the result?

There are 6 rows

Explain why this number differs from the inner join result.

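Because a left join keeps every row from customers, while the inner join keeps only customers with a matching order. Customers 4 (David) and 5 (Eve) have no orders, so they appear in the left join with NA in the order columns, adding 2 rows to the inner join's 4.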

Display the result

q2

## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

3. Right Join (3 points) Perform a right join with customers as the left table and orders as the right table.

q3 <- right_join(customers, orders, by = 'customer_id')

How many rows are in the result?

There are 6 rows

Which customer_ids in the result have NULL for customer name and city? Explain why.

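customer_ids 6 and 7 have NA (R's equivalent of NULL) for name and city. A right join keeps every row from orders, and orders 105 and 106 reference customer_ids that do not exist in customers, so there is no name or city to fill in for those rows.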
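The affected customer_ids can also be pulled out of the right join programmatically by filtering on a missing name. This is a sketch under the assumption that the data frames below, reconstructed from the printed output, match the assignment's data.

```r
library(dplyr)

# Assumed data, reconstructed from the printed join results
customers <- data.frame(
  customer_id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
orders <- data.frame(
  order_id = c(101, 102, 103, 104, 105, 106),
  customer_id = c(1, 2, 3, 2, 6, 7),
  product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
  amount = c(1200, 800, 300, 1500, 600, 150)
)

q3 <- right_join(customers, orders, by = "customer_id")

# Rows where the customer side is missing: orders with no matching customer
missing_customer_ids <- q3 %>%
  filter(is.na(name)) %>%
  pull(customer_id)

missing_customer_ids  # 6 7
```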

Display the result

q3

## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           6 <NA>    <NA>             105 Camera     600
## 6           7 <NA>    <NA>             106 Printer    150

4. Full Join (3 points) Perform a full join between customers and orders.

q4 <- full_join(customers, orders, by = 'customer_id')

How many rows are in the result?

There are 8 rows

Identify any rows where there’s information from only one table. Explain these results.

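Rows 5 and 6 (customer_ids 4 and 5, David and Eve) contain only customer information, because those customers have no orders. Rows 7 and 8 (customer_ids 6 and 7, orders 105 and 106) contain only order information, because those customer_ids are not in customers. A full join keeps unmatched rows from both tables and fills the missing side with NA.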
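One way to make the one-sided rows explicit is to label each row of the full join by where its data came from. The sketch below uses a hypothetical source column and assumes that name and order_id are never NA in the input tables, which holds for the data reconstructed here from the printed output.

```r
library(dplyr)

# Assumed data, reconstructed from the printed join results
customers <- data.frame(
  customer_id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
orders <- data.frame(
  order_id = c(101, 102, 103, 104, 105, 106),
  customer_id = c(1, 2, 3, 2, 6, 7),
  product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
  amount = c(1200, 800, 300, 1500, 600, 150)
)

# 'source' is a hypothetical helper column, not part of the assignment
q4_labeled <- full_join(customers, orders, by = "customer_id") %>%
  mutate(source = case_when(
    is.na(order_id) ~ "customers only",  # customer with no order
    is.na(name)     ~ "orders only",     # order with no known customer
    TRUE            ~ "both"
  ))

# Number of rows per source
source_counts <- table(q4_labeled$source)
source_counts
```

With this data the counts are 4 rows from both tables, 2 from customers only, and 2 from orders only.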

Display the result

q4

## # A tibble: 8 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
## 7           6 <NA>    <NA>             105 Camera     600
## 8           7 <NA>    <NA>             106 Printer    150

5. Semi Join (3 points) Perform a semi join with customers as the left table and orders as the right table.

q5 <- semi_join(customers, orders, by = 'customer_id')

How many rows are in the result?

There are 3 rows

How does this result differ from the inner join result?

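The semi join result has no order_id, product, or amount columns, and it has 3 rows instead of 4 because Bob appears once rather than once per order. A semi join filters customers to those with at least one matching order but does not combine columns, while an inner join combines and returns columns from both tables.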
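The relationship between the two joins can be checked by collapsing the inner join to its distinct customer columns and comparing the result with the semi join. This is a sketch on data reconstructed from the printed output, which is an assumption about the original tables.

```r
library(dplyr)

# Assumed data, reconstructed from the printed join results
customers <- data.frame(
  customer_id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
orders <- data.frame(
  order_id = c(101, 102, 103, 104, 105, 106),
  customer_id = c(1, 2, 3, 2, 6, 7),
  product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
  amount = c(1200, 800, 300, 1500, 600, 150)
)

inner_rows <- inner_join(customers, orders, by = "customer_id")
semi_rows  <- semi_join(customers, orders, by = "customer_id")

nrow(inner_rows)  # 4: Bob appears once per order
nrow(semi_rows)   # 3: each matching customer appears once

# Dropping the order columns and duplicates from the inner join
# leaves the same customers that the semi join returns
collapsed <- inner_rows %>% distinct(customer_id, name, city)
same_customers <- identical(collapsed$customer_id, semi_rows$customer_id)
```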

Display the result

q5

## # A tibble: 3 × 3
##   customer_id name    city       
##         <dbl> <chr>   <chr>      
## 1           1 Alice   New York   
## 2           2 Bob     Los Angeles
## 3           3 Charlie Chicago

6. Anti Join (3 points) Perform an anti join with customers as the left table and orders as the right table.

q6 <- anti_join(customers, orders, by = 'customer_id')

How many rows are in the result?

There are 2 rows

Explain what this result tells you about these customers.

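It tells you that these customers (David and Eve, customer_ids 4 and 5) have not placed any orders. An anti join returns the rows of customers that have no match in orders.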
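As a cross-check, the same customers can be found by left joining and keeping the rows whose order_id is NA. This sketch assumes the reconstructed data frames below match the assignment's data and that order_id is never NA in orders itself.

```r
library(dplyr)

# Assumed data, reconstructed from the printed join results
customers <- data.frame(
  customer_id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
orders <- data.frame(
  order_id = c(101, 102, 103, 104, 105, 106),
  customer_id = c(1, 2, 3, 2, 6, 7),
  product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
  amount = c(1200, 800, 300, 1500, 600, 150)
)

# Customers with no matching order
no_orders <- anti_join(customers, orders, by = "customer_id")

# Equivalent route: left join, then keep rows where no order was attached
no_orders_via_left <- customers %>%
  left_join(orders, by = "customer_id") %>%
  filter(is.na(order_id)) %>%
  select(customer_id, name, city)

no_orders$name           # "David" "Eve"
no_orders_via_left$name  # "David" "Eve"
```

The anti join is the more direct choice, since it never creates the NA order columns in the first place.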

Display the result

q6

## # A tibble: 2 × 3
##   customer_id name  city   
##         <dbl> <chr> <chr>  
## 1           4 David Houston
## 2           5 Eve   Phoenix

7. Practical Application (4 points) Imagine you’re analyzing customer behavior.

Which join would you use to find all customers, including those who haven’t placed any orders? Why?

Use a left join. It returns all records from the left table (customers) and, where there is a matching record in the right table (orders), includes that data. If no match is found, the result still includes the customer, with NA for the missing order details.

Which join would you use to find only the customers who have placed orders? Why?

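Use an inner join. It returns only the rows where there is a match between both tables based on the join condition, so a customer with no matching order is excluded from the result. Note that the inner join returns one row per order, so a customer with several orders appears more than once; a semi join returns each such customer exactly once, without the order columns.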

Write the R code for both scenarios.

left_join(customers, orders, by = 'customer_id')

## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

inner_join(customers, orders, by = 'customer_id')

## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300

8. Challenge Question (3 points) Create a summary that shows each customer’s name, city, total number of orders, and total amount spent. Include all customers, even those without orders. Hint: You’ll need to use a combination of joins and group_by/summarize operations.

library(dplyr)

# Assuming 'customers' and 'orders' data frames are already defined as in the assignment

# Left join keeps every customer, then group by customer and summarize
q8_summary <- left_join(customers, orders, by = 'customer_id') %>%
  group_by(customer_id, name, city) %>%
  summarize(
    # Count only real orders: n() would also count the NA placeholder row
    # that the left join creates for customers without orders
    total_orders = sum(!is.na(order_id)),
    total_amount_spent = sum(amount, na.rm = TRUE)
  ) %>%
  ungroup() # remove grouping

## `summarise()` has grouped output by 'customer_id', 'name'. You can override
## using the `.groups` argument.

# Display the result
print(q8_summary)

## # A tibble: 5 × 5
##   customer_id name    city        total_orders total_amount_spent
##         <dbl> <chr>   <chr>              <int>              <dbl>
## 1           1 Alice   New York               1               1200
## 2           2 Bob     Los Angeles            2               2300
## 3           3 Charlie Chicago                1                300
## 4           4 David   Houston                0                  0
## 5           5 Eve     Phoenix                0                  0
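An alternative sketch aggregates orders first and joins afterwards. Because the counting happens before the join, n() only ever sees real orders, and customers without orders are then filled in with zeros. The data frames and the names order_totals and q8_alt are assumptions for illustration; the data is reconstructed from the printed output.

```r
library(dplyr)

# Assumed data, reconstructed from the printed join results
customers <- data.frame(
  customer_id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
orders <- data.frame(
  order_id = c(101, 102, 103, 104, 105, 106),
  customer_id = c(1, 2, 3, 2, 6, 7),
  product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
  amount = c(1200, 800, 300, 1500, 600, 150)
)

# Step 1: one row per customer_id that has at least one order
order_totals <- orders %>%
  group_by(customer_id) %>%
  summarise(
    total_orders = n(),
    total_amount_spent = sum(amount),
    .groups = "drop"
  )

# Step 2: keep all customers and replace the NA totals with zero
q8_alt <- customers %>%
  left_join(order_totals, by = "customer_id") %>%
  mutate(
    total_orders = coalesce(total_orders, 0L),
    total_amount_spent = coalesce(total_amount_spent, 0)
  )

q8_alt
```

Setting .groups = "drop" returns an ungrouped result directly, so no grouping message is printed and no ungroup() call is needed.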

Assignment 4

Patrick O’Connell and Kevin Hanson

2025-02-18
