Inner Join (3 points) Perform an inner join between the customers and orders datasets

How many rows are in the result?

4 rows

Why are some customers or orders not included in the result?

Inner join only returns when there is a match on both tables. The customer data only has 3 customers in the order table with one customer having 2 orders.

Display the result
q1 <- inner_join(customers , orders, by ='customer_id')
head(q1)
## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300

Left Join (3 points) Perform a left join with customers as the left table and orders as the right table

How many rows are in the result?

6

Explain why this number differs from the inner join result.

In a left join, all of the items from the left table(customers), are included even if they don’t match to what is in the right table(orders). Those that don’t have a match return ’N/A’. Inner join returns items where there is a matching record between both tables.

Display the result
q2 <- left_join(customers, orders, by ='customer_id')
head(q2)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

Right Join (3 points) Perform a right join with customers as the left table and orders as the right table

How many rows are in the result?

6 rows

Which customer_ids in the result have NULL for customer name and city? Explain why.

Customer_ids 6 and 7 have no values for name and city. There is no matching data for customer_ids 6 and 7 in the customers table(left table). When there is no match for a row that is in the right table, then no value will be returned in the columns from the left table.

Display the result
q3 <-right_join(customers, orders, by = 'customer_id')
head(q3)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           6 <NA>    <NA>             105 Camera     600
## 6           7 <NA>    <NA>             106 Printer    150

Full Join (3 points) Perform a full join between customers and orders.

How many rows are in the result?

8 rows

Identify any rows where there’s information from only one table. Explain these results.

The first set of rows that have information from only one table are Customer ID 6 and 7 have results from one table. These customer ID are located in the orders table and there is no corresponding data in the customers table. Customer ID 4 and 5 are located in the customers table and do not have matching data in the orders table. N/A is returnee indicating there is no corresponding data.

Display the result
q4 <- full_join(customers, orders, by = "customer_id")
print(q4)
## # A tibble: 8 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
## 7           6 <NA>    <NA>             105 Camera     600
## 8           7 <NA>    <NA>             106 Printer    150

Semi Join (3 points) Perform a semi join with customers as the left table and orders as the right table.

How many rows are in the result?

3 rows

How does this result differ from the inner join result?

The inner join included both customer and order details.The semi join only returns customers who have placed at least one order without order details. Customers 4 and 5 are excluded because they have no orders.

Display the result
q5 <-semi_join(customers, orders, by ='customer_id')
head(q5)
## # A tibble: 3 × 3
##   customer_id name    city       
##         <dbl> <chr>   <chr>      
## 1           1 Alice   New York   
## 2           2 Bob     Los Angeles
## 3           3 Charlie Chicago

Anti Join (3 points) Perform an anti join with customers as the left table and orders as the right table.

Which customers are in the result?

Customer ID 4 and 5

Explain what this result tells you about these customers.

Results show that customer ID 4 and 5 have not made any purchases since they are in the customer table but not in the orders table.

Display the result
q6 <-anti_join(customers, orders, by = 'customer_id')
head(q6)
## # A tibble: 2 × 3
##   customer_id name  city   
##         <dbl> <chr> <chr>  
## 1           4 David Houston
## 2           5 Eve   Phoenix

Practical Application (4 points) Imagine you’re analyzing customer behavior.

Which join would you use to find all customers, including those who haven’t placed any orders? Why?

The full join function would be used because the full join returns all rows from both the customers and orders table when there is a match in either the left or right table. This would include all customers even ones who do not have orders placed within the orders table.

Which join would you use to find only the customers who have placed orders? Why?

To return customers who have placed orders you would use right join . This will return all rows from the right table which is the orders and find correlating rows in the left table which is customers. Only customers who have placed orders will be displayed in the results since the orders are in the right table. Even if an order does not have a corresponding customers in the table, it will still appear in the results.

Write the R code for both scenarios.
q7a <-full_join(customers, orders)
## Joining with `by = join_by(customer_id)`
q7b <-right_join(customers, orders)
## Joining with `by = join_by(customer_id)`
Display the result
print(q7a)
## # A tibble: 8 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
## 7           6 <NA>    <NA>             105 Camera     600
## 8           7 <NA>    <NA>             106 Printer    150
print(q7b)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           6 <NA>    <NA>             105 Camera     600
## 6           7 <NA>    <NA>             106 Printer    150

Challenge Question (3 points) Create a summary that shows each customer’s name, city, total number of orders, and total amount spent. Include all customers, even those without orders. Hint: You’ll need to use a combination of joins and group_by/summarize operations.

challenge_question <- customers %>%
  full_join(orders, by = "customer_id") %>%
  group_by(customer_id, name, city) %>%
  summarize( total_orders = sum(!is.na(order_id)), total_amount_spent = sum(amount, na.rm = TRUE), 
             .groups = "drop"
  )
head(challenge_question)
## # A tibble: 6 × 5
##   customer_id name    city        total_orders total_amount_spent
##         <dbl> <chr>   <chr>              <int>              <dbl>
## 1           1 Alice   New York               1               1200
## 2           2 Bob     Los Angeles            2               2300
## 3           3 Charlie Chicago                1                300
## 4           4 David   Houston                0                  0
## 5           5 Eve     Phoenix                0                  0
## 6           6 <NA>    <NA>                   1                600