q1 <- inner_join(customers , orders)
## Joining with `by = join_by(customer_id)`
How many rows are in the result?
The data has 4 rows
Why are some customers or orders not included in the
result?
Inner join returns where there is a match on both tables. The customer
data only has 3 customers in the order table where one customer has 2
orders.
Display the result
head(q1)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
q2 <- left_join(customers, orders)
## Joining with `by = join_by(customer_id)`
How many rows are in the result?
The data has 6 rows.
Explain why this number differs from the inner join
result.
There are 6 rows because in a left join all of the items from the left
table(customers), are included even if they don’t match to what is in
the right table(orders). The items that don’t have a match have N/A.
However with inner join, it only returns items where there is a matching
record between both tables. Notably, even though there are only 5
customer_ids in the customers table, customer_id 2 placed two orders in
the orders table, so that is why they are 6 rows in the q2 table even
though there are only 5 customers in the customers table.
Display the result
print(q2)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q3 <-right_join(customers, orders)
## Joining with `by = join_by(customer_id)`
How many rows are in the result?
The data has 6 rows.
Which customer_ids in the result have NULL for customer
name and city? Explain why.
Customer_ids 6 and 7 have NULL values for name as well as city because
there isn’t any corresponding/matching data for customer_ids 6 and
7(which were in the right table) within the customers table(left table).
Furthermore, in right joins, all the rows from the right table(orders)
are included, even if there isn’t a match found within the left
table(customers). When there isn’t a match for a row that is in the
right table, then NULL will be returned in the columns from the left
table.
Display the result
print(q3)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
q4 <-full_join(customers, orders)
## Joining with `by = join_by(customer_id)`
How many rows are in the result?
The data has 8 rows.
Identify any rows where there’s information from only one
table. Explain these results.
There are 4 rows where there is information from only one table. The
first set of rows that have information from only one table are
customer_id 6 and 7 because they are from the orders table, but there is
no match for those ids within the customers table. Additionally,
customer_id 4 and 5 have information from only one table because 4 and 5
are from the customers table, but have no matching data within the
orders table. Overall, for customer_id 4, 5, 6, and 7 you can tell that
they only have information from one table because they have NULL within
the columns that there is not a match for their specific
customer_id.
Display the result
print(q4)
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
q5 <-semi_join(customers, orders)
## Joining with `by = join_by(customer_id)`
How many rows are in the result?
The data has 3 rows.
How does this result differ from the inner join result?
The result of this semi join differs from that of the inner join because
a semi join only returns the rows from the left table(customers) and
that have a matching record in the right table(orders) and it doesn’t
rerrun columns from the right table. However, and inner join returns
rows where there is a match in both tables and includes columns from
both the left and right tables. So in an inner join you would see both
customer and order information, but with the semi join it will only show
customers that have orders, but without any order-related
information.
Display the result
print(q5)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
q6 <-anti_join(customers, orders)
## Joining with `by = join_by(customer_id)`
Which customers are in the result?
Customer_ids 4 and 5 are in the results.
Explain what this result tells you about these customers.
These results show that customer_ids 4 and 5 have not made any purchases
because they are in the customer table but are not listed within the
orders table.
Display the result
head(q6)
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
Which join would you use to find all customers, including
those who haven’t placed any orders? Why?
You would utilize the full join function because the full join returns
all rows from both the customers and orders table when there is a match
in either the left or right table. This means that it would include all
customers even those who do not have orders placed within the orders
table.
Which join would you use to find only the customers who
have placed orders? Why?
To find only the customers who have placed orders you would use
right_join(customers, orders). This will display all rows from the right
table(orders) and find matching rows in the left table(customers). This
means that only customers who have placed orders will be displayed
within the results because the orders dataset is the right table. This
also means that if there are any orders that do not have corresponding
customers in the customers table, they will still appear within the
results of the right join.
Write the R code for both scenarios.
q7a <-full_join(customers, orders)
## Joining with `by = join_by(customer_id)`
q7b <-right_join(customers, orders)
## Joining with `by = join_by(customer_id)`
print(q7a)
## # A tibble: 8 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## 7 6 <NA> <NA> 105 Camera 600
## 8 7 <NA> <NA> 106 Printer 150
print(q7b)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
challenge_question <- customers %>%
full_join(orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarize( total_orders = sum(!is.na(order_id)), total_amount_spent = sum(amount, na.rm = TRUE),
.groups = "drop"
)
print(challenge_question)
## # A tibble: 7 × 5
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 0 0
## 5 5 Eve Phoenix 0 0
## 6 6 <NA> <NA> 1 600
## 7 7 <NA> <NA> 1 150