Question 1

Inner join between customers and orders datasets.
Q1 <- inner_join(
  customers, 
  orders, 
  by = "customer_id"
)
Display result of inner join.
print(Q1)
## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
Number of rows.
print(
  paste(
    nrow(Q1), 
    "rows in the result"
  )
)
## [1] "4 rows in the result"
Why are some customers or orders not included in the result?

Inner join returns all rows from both the customers and orders tables where there is a match. Only customers with matching customer_id in both the customers table and the orders table are included. This means there are some customers without order information or some orders without customer information that are excluded.

Question 2

Left join with customers as left table and orders as right table.
Q2 <- left_join(
  customers, 
  orders, 
  by = "customer_id"
)
Display result of left join.
print(Q2)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
Number of rows.
print(
  paste(
    nrow(Q2), 
    "rows in the result"
  )
)
## [1] "6 rows in the result"
Why does the number of rows differ from the inner join result?

Left join returns all rows from the customers table and matching rows from the orders table. All customers are included (even those without order information). Orders without customer information are still excluded.

Question 3

Right join with customers as left table and orders as right table.
Q3 <- right_join(
  customers, 
  orders, 
  by = "customer_id"
)
Display result of right join.
print(Q3)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           6 <NA>    <NA>             105 Camera     600
## 6           7 <NA>    <NA>             106 Printer    150
Number of rows.
print(
  paste(
    nrow(Q3), 
    "rows in the result"
  )
)
## [1] "6 rows in the result"
Customer_ids with NULL for customer name and city.
print(
  filter(
    Q3, 
    is.na(name), 
    is.na(city)
  )$customer_id
)
## [1] 6 7
Why do these customer_ids in the result have NULL for customer name and city?

Right join returns all rows from the orders table and matching rows from the customers table. All orders are included (even those without customer information). Customers without order information are still excluded. customer_ids with NULL for customer name and city are orders without customer information.

Question 4

Full join between customers and orders datasets.
Q4 <- full_join(
  customers, 
  orders, 
  by = "customer_id"
)
Display result of full join.
print(Q4)
## # A tibble: 8 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
## 7           6 <NA>    <NA>             105 Camera     600
## 8           7 <NA>    <NA>             106 Printer    150
Number of rows.
print(
  paste(
    nrow(Q4), 
    "rows in the result"
  )
)
## [1] "8 rows in the result"
Rows with information from only one table.
print((
  mutate(
    Q4, 
    row_num = row_number()
  ) %>% 
  filter(
    (is.na(name) & is.na(city)) | 
    (is.na(product) & is.na(amount))
  )
)$row_num)
## [1] 5 6 7 8
Why do these rows in the result have information from only one table?

Full join returns all rows when there is a match in either the customers or orders table. All rows from both tables are included (with NA where there is no match). Rows with NA have results from only one table.

Question 5

Semi join with customers as left table and orders as right table.
Q5 <- semi_join(
  customers, 
  orders, 
  by = "customer_id"
)
Display result of semi join.
print(Q5)
## # A tibble: 3 × 3
##   customer_id name    city       
##         <dbl> <chr>   <chr>      
## 1           1 Alice   New York   
## 2           2 Bob     Los Angeles
## 3           3 Charlie Chicago
Number of rows.
print(
  paste(
    nrow(Q5), 
    "rows in the result"
  )
)
## [1] "3 rows in the result"
How does this result differ from the inner join result?

Semi join returns all rows from the customers table where there is a match in the orders table. Only customers with order information are returned. Order information is not included in the result.

Question 6

Anti join with customers as left table and orders as right table.
Q6 <- anti_join(
  customers, 
  orders, 
  by = "customer_id"
)
Display result of anti join.
print(Q6)
## # A tibble: 2 × 3
##   customer_id name  city   
##         <dbl> <chr> <chr>  
## 1           4 David Houston
## 2           5 Eve   Phoenix
Customers in the result.
print(Q6$name)
## [1] "David" "Eve"
What does the result say about these customers?

Anti join returns all rows from the customers table where there is no match in the orders table. Only customers without order information are returned. This means David and Eve have not placed any orders.

Question 7

Which join should be used to find all customers (including those who haven’t placed any orders)?

Left join. Left join returns all rows from the customers table and matching rows from the orders table. All customers are included (even those without order information).

Which join should be used to find only customers who have placed orders?

Right join. Right join returns all rows from the orders table and matching rows from the customers table. All orders are included (even those without customer information).

Left join with customers as left table and orders as right table.
Q7A <- left_join(
  customers, 
  orders, 
  by = "customer_id"
)
Display result of left join.
print(Q7A)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
Customer names.
print((
    group_by(
      Q7A,
      customer_id, 
      name
    ) %>% 
    summarise()
)$name)
## [1] "Alice"   "Bob"     "Charlie" "David"   "Eve"
Right join with customers as left table and orders as right table.
Q7B <- right_join(
  customers, 
  orders, 
  by = "customer_id"
)
Display result of right join.
print(Q7B)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           6 <NA>    <NA>             105 Camera     600
## 6           7 <NA>    <NA>             106 Printer    150
Customers that have placed orders.
print((
  filter(
    Q7B, 
    !is.na(name)
  ) %>%
  group_by(
    customer_id, 
    name
  ) %>% 
  summarise()
)$name)
## [1] "Alice"   "Bob"     "Charlie"

Question 8

Create summary that shows each customer’s name, city, total number of orders, and total amount spent. Include all customers (even those without orders).
Q8 <- 
  full_join(
    customers, 
    orders, 
    by = "customer_id"
  ) %>% 
  group_by(
    customer_id, 
    name, 
    city
  ) %>% 
  summarise(
    total_orders = sum(!is.na(order_id)), 
    total_spent = sum(amount)
  )
Display result of summary.
print(Q8)
## # A tibble: 7 × 5
## # Groups:   customer_id, name [7]
##   customer_id name    city        total_orders total_spent
##         <dbl> <chr>   <chr>              <int>       <dbl>
## 1           1 Alice   New York               1        1200
## 2           2 Bob     Los Angeles            2        2300
## 3           3 Charlie Chicago                1         300
## 4           4 David   Houston                0          NA
## 5           5 Eve     Phoenix                0          NA
## 6           6 <NA>    <NA>                   1         600
## 7           7 <NA>    <NA>                   1         150