R Markdown

Inner Join (3 points) Perform an inner join between the customers and orders datasets.

q1 <- inner_join(customers, orders)
## Joining with `by = join_by(customer_id)`

How many rows are in the result?

The resulting data has 4 rows

Why are some customers or orders not included in the result?

Because the customers and orders that are not included did not have a match in the other table.

Display the result

head(q1)
## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300

Left Join (3 points) Perform a left join with customers as the left table and orders as the right table.

q2 <- left_join(customers, orders)
## Joining with `by = join_by(customer_id)`

How many rows are in the result?

The resulting data has 6 rows

Explain why this number differs from the inner join result.

All rows from the customers table are included, even if they dont have an order.

Display the result

head(q2)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

Right Join (3 points) Perform a right join with customers as the left table and orders as the right table.

q3 <- right_join(customers,orders)
## Joining with `by = join_by(customer_id)`

How many rows are in the result?

The resulting data set has 6 rows

Which customer_ids in the result have NULL for customer name and city? Explain why.

Customer id 6 and 7 have no result for name and city with their orders because there is no matching customer id in that data table

Display the result

head(q3)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           6 <NA>    <NA>             105 Camera     600
## 6           7 <NA>    <NA>             106 Printer    150

Full Join (3 points) Perform a full join between customers and orders.

q4 <- full_join(customers, orders)
## Joining with `by = join_by(customer_id)`

How many rows are in the result?

The resulting data has 8 rows

Identify any rows where there’s information from only one table. Explain these results.

Rows 5-8 are all missing information in some areas. All rows from both tables are included, with NA where there’s no match. Meaning customers 4 and 5 haven’t placed an order, and customers 6 and 7 don’t have an id in the data set.

Display the result

head(q4)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

Semi Join (3 points) Perform a semi join with customers as the left table and orders as the right table.

q5 <- semi_join(customers, orders)
## Joining with `by = join_by(customer_id)`

How many rows are in the result?

The resulting data has 3 rows

How does this result differ from the inner join result?

Only customers with order information are returned, and order data isn’t included.

Display the result

head(q5)
## # A tibble: 3 × 3
##   customer_id name    city       
##         <dbl> <chr>   <chr>      
## 1           1 Alice   New York   
## 2           2 Bob     Los Angeles
## 3           3 Charlie Chicago

Anti Join (3 points) Perform an anti join with customers as the left table and orders as the right table.

q6 <- anti_join(customers, orders)
## Joining with `by = join_by(customer_id)`

Which customers are in the result?

Customers 4 and 5

Explain what this result tells you about these customers.

This result tells us that they have not made an order yet because only customers without order information are returned

Display the result

head(q6)
## # A tibble: 2 × 3
##   customer_id name  city   
##         <dbl> <chr> <chr>  
## 1           4 David Houston
## 2           5 Eve   Phoenix

Practical Application (4 points) Imagine you’re analyzing customer behavior.

Which join would you use to find all customers, including those who haven’t placed any orders? Why?

A left join because it keeps every row from the first (left) table. If a customer hasn’t placed an order, the order-related column will simply show N/A. That is why its the best option when you want a complete database, regardless of their activity.

Which join would you use to find only the customers who have placed orders? Why?

An inner join because it only returns rows where there is a match in both data tables. If a customer id exists in the customer table, but not the orders, that customer will be excused fro, the results.

Write the R code for both scenarios.

q7 <- left_join(customers, orders)
## Joining with `by = join_by(customer_id)`
q77 <- inner_join(customers, orders)
## Joining with `by = join_by(customer_id)`

Display the result

head(q7)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
head(q77)
## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300

Bonus Question

customer_summary <- customers %>%
  left_join(orders, by = "customer_id") %>%
  group_by(customer_id, name, city) %>%
  summarize(
    total_orders = sum(!is.na(order_id)),   # count only real orders
    total_spent  = sum(amount, na.rm = TRUE),
    .groups = "drop"
  )
head(customer_summary)
## # A tibble: 5 × 5
##   customer_id name    city        total_orders total_spent
##         <dbl> <chr>   <chr>              <int>       <dbl>
## 1           1 Alice   New York               1        1200
## 2           2 Bob     Los Angeles            2        2300
## 3           3 Charlie Chicago                1         300
## 4           4 David   Houston                0           0
## 5           5 Eve     Phoenix                0           0