library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Dataset 1: Customers
customers <- tibble(
customer_id = c(1, 2, 3, 4, 5),
name = c("Alice", "Bob", "Charlie", "David", "Eve"),
city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
# Dataset 2: Orders
orders <- tibble(
order_id = c(101, 102, 103, 104, 105, 106),
customer_id = c(1, 2, 3, 2, 6, 7),
product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
amount = c(1200, 800, 300, 1500, 600, 150)
)
Q1 <- inner_join(customers , orders)
## Joining with `by = join_by(customer_id)`
- Inner Join (3 points) Perform an inner join between the customers
and orders datasets.
- How many rows are in the result? The data has 4
rows
- Why are some customers or orders not included in the
result? The customer data only has three customers in the order
table where one customer has two orders.
- Display the result
head(Q1)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
Q2 <- left_join(customers, orders)
## Joining with `by = join_by(customer_id)`
- Left Join (3 points) Perform a left join with customers as the left
table and orders as the right table.
- How many rows are in the result? The data has 6
rows
- Explain why this number differs from the inner join
result. The left join contains all customers, including those
without order information.
- Display the result
head(Q2)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
Q3 <- right_join(customers, orders)
## Joining with `by = join_by(customer_id)`
- Right Join (3 points) Perform a right join with customers as the
left table and orders as the right table.
- How many rows are in the result? The data has 6
rows
- Which customer_ids in the result have NULL for customer name
and city? Explain why. The customers IDs for 6 and 7have values
NULL for customer name and city because they do not have matching data
in the customers table.
- Display the result
head(Q3)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
Q4 <- full_join(customers, orders)
## Joining with `by = join_by(customer_id)`
- Full Join (3 points) Perform a full join between customers and
orders.
- How many rows are in the result? The data has 4
rows
- Identify any rows where there’s information from only one
table. Explain these results. In the customers table, the
customers with IDs of 4 and 5 have information from only one table
(customers) because they have not placed any orders. The orders table
shows that customers IDs 6 and 7 have information from only one table
(orders) because they are not in the customers table.
- Display the result
head(Q4)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
Q5 <- semi_join(customers, orders)
## Joining with `by = join_by(customer_id)`
- Semi Join (3 points) Perform a semi join with customers as the left
table and orders as the right table.
- How many rows are in the result? The data has 3
rows
- How does this result differ from the inner join
result? A semi join will only return columns from the customers
table whereas the inner join returns columns from both customers and
orders.
- Display the result
head(Q5)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
Q6 <- anti_join(customers, orders)
## Joining with `by = join_by(customer_id)`
- Anti Join (3 points) Perform an anti join with customers as the left
table and orders as the right table.
- Which customers are in the result? David and
Eve
- Explain what this result tells you about these
customers. This result shows that the customers are not active
or have not made purchases, but are part of the dataset.
- Display the result
head(Q6)
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
- Practical Application (4 points) Imagine you’re analyzing customer
behavior.
- Which join would you use to find all customers, including
those who haven’t placed any orders? Why? I would use a full
join because full joins return all rows from both datasets, including
all of the customers even those who haven’t placed orders.
- Which join would you use to find only the customers who have
placed orders? Why? I would use a right join because that will
show me only customers who have placed orders. Right joins return only
the rows where there are matching data in both sets.
- Write the R code for both scenarios.
Q7A <- full_join(customers, orders, by = "customer_id")
Q7B <- right_join(customers, orders, by = "customer_id")
- Display the result
head(Q7A)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
head(Q7B)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
- Challenge Question (3 points) Create a summary that shows
each customer’s name, city, total number of orders, and total amount
spent. Include all customers, even those without orders. Hint: You’ll
need to use a combination of joins and group_by/summarize
operations.
EC <- full_join(customers, orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarize(
total_orders = sum(!is.na(order_id)),
total_amount_spent = sum(amount, na.rm = TRUE)
) %>%
arrange(name)
## `summarise()` has grouped output by 'customer_id', 'name'. You can override
## using the `.groups` argument.