q1 <- inner_join(customers , orders , by = 'customer_id')
How many rows are in the result?
There are 4 rows
Why are some customers or orders not included in the result? Inner join returns all rows from both tables where there is a match. Therefore the rows not returned did not have a match in the other table
Display the result
head(q1)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
q2 <- left_join(customers , orders , by = 'customer_id')
How many rows are in the result?
There are 6 rows
Explain why this number differs from the inner join result.
Left join returns all rows from the left table and matching rows from
the right table. The left table has 5 rows and the right table has
customer_id 2 twice, making 2 have 6 rows
Display the result
head(q2)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q3 <- right_join(customers , orders , by = 'customer_id')
How many rows are in the result?
There are 6 rows
Which customer_ids in the result have NULL for customer name and
city? Explain why.
The ones that show NULL show it because there are no customer_ids for 5,
6, or 7. Right join returns all rows from the right table and matching
rows from the left table.
Display the result
head(q3)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
q4 <- full_join(customers , orders , by = 'customer_id')
How many rows are in the result? There are 8 rows
Identify any rows where there’s information from only one table.
Explain these results.
There is information in only the one row because David and Eve do not
have matching customer_ids.
Display the result
head(q4)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q5 <- semi_join(customers , orders , by = 'customer_id')
How many rows are in the result? There are 3 rows
How does this result differ from the inner join result?
It does not show order_id, product, or amount
Display the result
head(q5)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
q6 <- anti_join(customers , orders , by = 'customer_id')
Which customers are in the result?
David and Eve
Explain what this result tells you about these customers.
They are in the result which means that they do not match from the left
table into the right table
Display the result
head(q6)
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
Which join would you use to find all customers, including those
who haven’t placed any orders? Why?
I would use Full Join as it takes in all the rows from both the
tables
Which join would you use to find only the customers who have
placed orders? Why?
I would use Semi Join as it shows what has matched between both
tables
Write the R code for both scenarios.
q7.1 <- full_join(customers , orders , by = 'customer_id')
q7.2 <- semi_join(customers , orders , by = 'customer_id')
head(q7.1)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
head(q7.2)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
Create a summary that shows each customer’s name, city, total number of orders, and total amount spent. Include all customers, even those without orders. Hint: You’ll need to use a combination of joins and group_by/summarize operations.
customer_orders <- left_join(customers, orders, by = "customer_id")
customer_summary <- customer_orders %>%
group_by(customer_id, name, city) %>%
summarize(
total_orders = sum(!is.na(order_id)), # Count non-NA order_ids for total count
total_amount_spent = sum(amount, na.rm = TRUE), # Sum amount, ignoring NA values
.groups = 'drop' # Ungroup the result
)
print(customer_summary)
## # A tibble: 5 × 5
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 0 0
## 5 5 Eve Phoenix 0 0