R Markdown

1. Inner Join (3 points) Perform an inner join between the customers and orders datasets.

q1 <- inner_join(customers, orders, by = 'customer_id')

How many rows are in the result?

There are 4 rows

Why are some customers or orders not included in the result?

They did not have a match in the other table.

Display the result

q1
## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300

2. Left Join (3 points) Perform a left join with customers as the left table and orders as the right table.

q2 <- left_join(customers, orders)
## Joining with `by = join_by(customer_id)`

How many rows are in the result?

nrow (q2)
## [1] 6

Explain why this number differs from the inner join result.

It includes the customers with no match in the other table

Display the result

q2
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

3. Right Join (3 points) Perform a right join with customers as the left table and orders as the right table.

q3 <- right_join(customers, orders)
## Joining with `by = join_by(customer_id)`

How many rows are in the result?

nrow (q3)
## [1] 6

Which customer_ids in the result have NULL for customer name and city? Explain why.

customer_id 6 and customer_id 7. This is because there are entries for order_id 105 and order_id 106, but no customer_ids to match, represented by “N/A”.

Display the result

q3
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           6 <NA>    <NA>             105 Camera     600
## 6           7 <NA>    <NA>             106 Printer    150

4. Full Join (3 points) Perform a full join between customers and orders.

q4 <- full_join(customers, orders)
## Joining with `by = join_by(customer_id)`

How many rows are in the result?

nrow(q4)
## [1] 8

Identify any rows where there’s information from only one table. Explain these results.

Rows 5-8 have information from only one table. Rows 5 and 6 have information only for customer_id, meaning that they have made no orders. Rows 7 and 8 have informatipon only for order_id, meaning that no customers ordered these items.

Display the result

q4
## # A tibble: 8 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
## 7           6 <NA>    <NA>             105 Camera     600
## 8           7 <NA>    <NA>             106 Printer    150

5. Semi Join (3 points) Perform a semi join with customers as the left table and orders as the right table.

q5 <- semi_join(customers, orders)
## Joining with `by = join_by(customer_id)`

How many rows are in the result?

nrow(q5)
## [1] 3

How does this result differ from the inner join result?

the inner join result produces 4 rows. Bob is displayed twice because he has made 2 orders.

Display the result

q5
## # A tibble: 3 × 3
##   customer_id name    city       
##         <dbl> <chr>   <chr>      
## 1           1 Alice   New York   
## 2           2 Bob     Los Angeles
## 3           3 Charlie Chicago

6. Anti Join (3 points) Perform an anti join with customers as the left table and orders as the right table.

q6 <- anti_join(customers, orders)
## Joining with `by = join_by(customer_id)`

Which customers are in the result?

David and Eve.

Explain what this result tells you about these customers.

#####David and Eve have placed no orders.

Display the result

q6
## # A tibble: 2 × 3
##   customer_id name  city   
##         <dbl> <chr> <chr>  
## 1           4 David Houston
## 2           5 Eve   Phoenix

7. Practical Application (4 points) Imagine you’re analyzing customer behavior.

Which join would you use to find all customers, including those who haven’t placed any orders? Why?

I would use a left join because it would result in the customers with no orders being presented with N/A, keeping them included in the table.

Which join would you use to find only the customers who have placed orders? Why?

I would use an inner join because it would result in only customers who have a matching order entry.

Write the R code for both scenarios.

q7 <- left_join(customers, orders)
## Joining with `by = join_by(customer_id)`
q8 <- inner_join(customers, orders)
## Joining with `by = join_by(customer_id)`

Display the result

q7
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
q8
## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300

8. Challenge Question (3 points) Create a summary that shows each customer’s name, city, total number of orders, and total amount spent. Include all customers, even those without orders. Hint: You’ll need to use a combination of joins and group_by/summarize operations.

q9 <- customers %>%
  left_join(orders, by = "customer_id") %>%
  group_by(customer_id, name, city) %>%  # Group by customer details
  summarize(
    total_orders = n_distinct(order_id, na.rm = TRUE),  # Count unique orders, avoiding NA
    total_spent = sum(amount, na.rm = TRUE)  # Sum total amount spent, treating NA as 0
  )
## `summarise()` has grouped output by 'customer_id', 'name'. You can override
## using the `.groups` argument.
q9
## # A tibble: 5 × 5
## # Groups:   customer_id, name [5]
##   customer_id name    city        total_orders total_spent
##         <dbl> <chr>   <chr>              <int>       <dbl>
## 1           1 Alice   New York               1        1200
## 2           2 Bob     Los Angeles            2        2300
## 3           3 Charlie Chicago                1         300
## 4           4 David   Houston                0           0
## 5           5 Eve     Phoenix                0           0