Inner join (3 points) Perform an inner join between the customers and orders datasets. How many rows are in the result?
q1 <- inner_join(customers , orders, by = 'customer_id')
nrow(q1)
## [1] 4
print(q1)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
Why are some customers or orders not included in the result?
They do not have a match on the other table
Left Join (3 points) Perform a left join with customers as the left table and orders as the right table.
q2 <- left_join(customers, orders, by = "customer_id")
# Display result
print(q2)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
How many rows are in the result? Explain why this number
differs from the inner join result.
There are 6 rows.
The result is different because the left join shows all customers even
if they had no orders.
Right Join (3 points) Perform a right join with customers as the left table and orders as the right table.How many rows are in the result? Display the result
q3 <- right_join(customers, orders, by = “customer_id”)
q3 <- right_join(customers, orders, by = "customer_id")
# Display the result
print(q3)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
Count number of rows
There are 6 rows.
Which customer_ids in the result have NULL for customer name
and city? Explain why.
IDs 6 and 7 have NULL because the order exists but it has no matching
customers
full Join (3 points) Perform a full join between customers and orders. How many rows are in the result? Display the result
q4 <- full_join(customers, orders, by = 'customer_id')
print(q4)
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
How many rows are in the result?
There are 8
Identify any rows where there’s information from only one
table. Explain these results.
Customers 4 & 5 appear with NA for order details (they have no
orders). Orders from customer_id 6 & 7 appear with NA for customer
details (no matching customer).
Semi Join (3 points) Perform a semi join with customers as the left table and orders as the right table. How many rows are in the result? Display the result
q5 <- semi_join(customers, orders, by = "customer_id")
print(q5)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
How does this result differ from the inner join
result?
Its different from an inner join because it excludes order details
Anti Join (3 points) Perform an anti join with customers as the left table and orders as the right table. Display the result
q6 <- anti_join(customers, orders, by = "customer_id")
print(q6)
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
Which customers are in the result? Explain what this result
tells you about these customers.
Customers 4 and 5 are in the result. It tells you that these customers
are inactive
Practical Application (4 points) Imagine you’re analyzing customer
behavior. q7a) Which join would you use to find all customers,
including those who haven’t placed any orders? Why?
left join because it ensures all customers are included, even if they
haven’t placed orders and is useful for analyzing the total customer
base.
q7b) Which join would you use to find only the customers who
have placed orders? Why?
Inner join because it ensures only customers who have placed orders are
included and is useful for analyzing active customers only.
q7a <- left_join(customers, orders, by = "customer_id")
# Display result
print(q7a)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q7b <- inner_join(customers, orders, by = "customer_id")
# Display result
print(q7b)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
Challenge Question (3 points) Create a summary that shows each customer’s name, city, total number of orders, and total amount spent. Include all customers, even those without orders.
q8 <- customers %>%
left_join(orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarize(total_orders = n(),
total_amount_spent = sum(amount, na.rm = TRUE))
## `summarise()` has grouped output by 'customer_id', 'name'. You can override
## using the `.groups` argument.
# Display the summary
print(q8)
## # A tibble: 5 × 5
## # Groups: customer_id, name [5]
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 1 0
## 5 5 Eve Phoenix 1 0