library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
Dataset 1: Customers
customers <- tibble(
customer_id = c(1, 2, 3, 4, 5),
name = c("Alice", "Bob", "Charlie", "David", "Eve"),
city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
Dataset 2: Orders
orders <- tibble(
order_id = c(101, 102, 103, 104, 105, 106),
customer_id = c(1, 2, 3, 2, 6, 7),
product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
amount = c(1200, 800, 300, 1500, 600, 150)
)
How many rows are in the result?
nrow(joined_data)
## [1] 4
Why are some customers or orders not included in the result?
Customers 4 and 5 are missing from the result because they do not
have any matching orders in the orders dataset.
Orders 105 and 106 are missing because they belong to customer IDs 6
and 7, which do not exist in the customers dataset.
Display the result
print(joined_data)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
How many rows are in the result?
nrow(left_joined_data)
## [1] 6
Explain why this number differs from the inner join result.
The left join keeps all customers, even if they don’t have a
matching order.
Display the result
print(left_joined_data)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
How many rows are in the result?
nrow(right_joined_data)
## [1] 6
Which customer_ids in the result have NULL for customer name and
city? Explain why.
Customer IDs 6 and 7 will have NA for name and city. These IDs exist
in the orders dataset but do not exist in the customers dataset.
Display the result
print(right_joined_data)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
How many rows are in the result?
nrow(full_joined_data)
## [1] 8
Display the result
print(full_joined_data)
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
How many rows are in the result?
nrow(semi_joined_data)
## [1] 3
How does this result differ from the inner join result?
The semi join only returns the unique customers who have at least
one order, without including order details, resulting in 3 rows.
Display the result
print(semi_joined_data)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
Which customers are in the result?
Customers David (ID 4) and Eve (ID 5) are in the result because they
do not appear in the orders dataset.
Explain what this result tells you about these customers.
These customers exist in the customers dataset but have never placed
an order in the orders dataset.
Display the result
print (anti_joined_data)
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
Practical Application (4 points) Imagine you’re analyzing customer
behavior.
Which join would you use to find all customers, including those who
haven’t placed any orders? Why?
I would use left join as it finds all customers who have and have
not placed orders.
Which join would you use to find only the customers who have placed
orders? Why?
I would use semi join as it filters out customers who have not
placed orders
Write the R code for both scenarios.
all_customers <- left_join(customers, orders, by = "customer_id")
active_customers <- semi_join(customers, orders, by = "customer_id")
Display the result
print(all_customers)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
print(active_customers)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
Challenge Question (3 points) Create a summary that shows each
customer’s name, city, total number of orders, and total amount spent.
Include all customers, even those without orders. Hint: You’ll need to
use a combination of joins and group_by/summarize operations.
customer_orders <- left_join(customers, orders, by = "customer_id")
customer_summary <- customer_orders %>%
group_by(customer_id, name, city) %>%
summarize(
total_orders = n_distinct(order_id, na.rm = TRUE),
total_amount_spent = sum(amount, na.rm = TRUE)
)
## `summarise()` has grouped output by 'customer_id', 'name'. You can override
## using the `.groups` argument.
print(customer_summary)
## # A tibble: 5 × 5
## # Groups: customer_id, name [5]
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 0 0
## 5 5 Eve Phoenix 0 0