q1 <- inner_join(customers, orders, by = 'customer_id')
There are 4 rows
Because their CustomerID or OrderID are not included in the desired range
q1
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
q2 <- left_join(customers, orders, by = 'customer_id')
There are 6 rows
Because the inner join does not include records without order IDs.
q2
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q3 <- right_join(customers, orders, by = 'customer_id')
There are 6 rows
If ‘orders’ contained a ‘customerID’ that is not in ‘customers’, tgen that row in the right join result would have NULL for ‘name’ and ‘city’
q3
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
q4 <- full_join(customers, orders, by = 'customer_id')
There are 8 rows
Rows with information from only one table occur when there is no matching value in the other table based on the join condition.
q4
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
q5 <- semi_join(customers, orders, by = 'customer_id')
There are 3 rows
Semi Join does not have orderID, product, or amount columns. Semi join filters records but does not combine them, while an inner join combines and returns columns from both tables.
q5
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
q6 <- anti_join(customers, orders, by = 'customer_id')
There are 2 rows
It tells you their customerID and their city.
q6
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
Use left join, because a left join returns all records from the left table (Customers), and if there is a matching record in the right table (Orders), it includes that data. If no match is found, the result still includes the customer, but with NULLs for the missing order details.
Use inner join, because an inner join returns only the rows where there is a match between both tables based on the join condition. If a customer has no matching order, they will be excluded from the result.
left_join(customers, orders, by = 'customer_id')
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
inner_join(customers, orders, by = 'customer_id')
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
library(dplyr)
# Assuming 'customers' and 'orders' data frames are already defined as in the assignment
# Group by customer and summarize
q8_summary <- left_join(customers, orders, by = 'customer_id') %>%
group_by(customer_id, name, city) %>%
summarize(
total_orders = n(),
total_amount_spent = sum(amount, na.rm = TRUE)
) %>%
ungroup() #remove grouping
## `summarise()` has grouped output by 'customer_id', 'name'. You can override
## using the `.groups` argument.
# Display the result
print(q8_summary)
## # A tibble: 5 × 5
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 1 0
## 5 5 Eve Phoenix 1 0