library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
# Dataset 1: Customers
customers <- tibble(
customer_id = c(1, 2, 3, 4, 5),
name = c("Alice", "Bob", "Charlie", "David", "Eve"),
city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)
# Dataset 2: Orders
orders <- tibble(
order_id = c(101, 102, 103, 104, 105, 106),
customer_id = c(1, 2, 3, 2, 6, 7),
product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
amount = c(1200, 800, 300, 1500, 600, 150)
)
Inner Join (3 points) Perform an inner join between the customers and orders datasets.
q1 <- inner_join(customers, orders)
## Joining with `by = join_by(customer_id)`
rnrow(q1)
rowsthere are 4 rows b) Why are some customers or orders not
included in the result?
Inner join returns where there is a match on both tables. The customer
data has only 3 customers in the order table where the customer has 2
orders.
head(q1)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
Left Join (3 points) Perform a left join with customers as the left table and orders as the right table.
q2 <- left_join(customers,orders)
## Joining with `by = join_by(customer_id)`
How many rows are in the result? there are rnrow(q2)
rows there are 5 rows Explain why this number differs from the
inner join result. A left join returns all rows from the left
table (customers) along with matched rows from the right table (orders),
filling in for any missing order data. This results in 5 rows for the
left join, while the inner join includes only those customers with
corresponding orders, leading to fewer rows.
Display the result
head(q2)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
Right Join (3 points) Perform a right join with customers as the left table and orders as the right table.
q3 <- right_join(customers,orders)
## Joining with `by = join_by(customer_id)`
How many rows are in the result? there are `rnrow(q3) rows there are 6 rows Which customer_ids in the result have NULL for customer name and city? Explain why Customer_ids 5 &6 have N/A values for customer name and city because during the joins the the table only includes rows where there is a match between both tables and if there is no corresponding order and city then it will not appear
Display the result head(q3)
Full Join (3 points) Perform a full join between customers and orders. How many rows are in the result?
q4 <- full_join(customers,orders)
## Joining with `by = join_by(customer_id)`
Identify any rows where there’s information from only one table. there are `rnrow(q4) rows there are 8 rows Explain these results When I did the full join it combines all rows from both tables filling in the NA values where there are no matches in either table. Therefore I got every row from each customer and order data set. Display the result head(q4)
Semi Join (3 points) Perform a semi join with customers as the left table and orders as the right table.
q5 <- semi_join(customers, orders)
## Joining with `by = join_by(customer_id)`
How many rows are in the result? there are `rnrow(q5) rows there are 3 rows How does this result differ from the inner join result? The semi join differs from the inner join because it only retains rows from the left table that have matches in the right table, which filters out any unmatched records. Display the result head(q5)
Anti Join (3 points) Perform an anti join with customers as the left table and orders as the right table.
q6 <- anti_join(customers,orders)
## Joining with `by = join_by(customer_id)`
Which customers are in the result? Only David and Eve are in the results Explain what this result tells you about these customers. This indicates that these two customers do not have any corresponding entries in the order table Display the result head(q6) Practical Application (4 points) Imagine you’re analyzing customer behavior. Which join would you use to find all customers, including those who haven’t placed any orders? Why? I would use the left join because it would allow me to retrieve all the customers even the customers that did not place an order because the left join returns all the rows from the left table including the matching ones from the right table. Write the R code for both scenarios.
q7 <- left_join(customers, orders, by ="customer_id")
Which join would you use to find only the customers who have placed orders?Why? Inner join because I want to retrieve only those customers who have placed orders. Because the inner join returns only the rows that have matching values in both tables. Therefore it will return only those customers who have corresponding entries in the orders table.
q7.5 <- inner_join(customers,orders, by ="customer_id")
Display the result head(q7) head(q7.5)
Challenge Question (3 points) Create a summary that shows each customer’s name, city, total number of orders, and total amount spent. Include all customers, even those without orders. Hint: You’ll need to use a combination of joins and group_by/summarize operations. Submission
```
Note that the echo = FALSE
parameter was added to the
code chunk to prevent printing of the R code that generated the
plot.