q1 <- inner_join(customers , orders)
## Joining with `by = join_by(customer_id)`
nrow(q1)
## [1] 4
Because the customers and orders that are not included did not have a match in the other table
head(q1)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
q2 <- left_join(customers , orders)
## Joining with `by = join_by(customer_id)`
nrow(q2)
## [1] 6
The inner join function only displays data that has a match while the left join function displays all of the data from the left table, whether there is a match or not, and the matches from the right.
head(q2)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q3 <- right_join(customers , orders)
## Joining with `by = join_by(customer_id)`
nrow(q3)
## [1] 6
Customer_ids 6 & 7 result NULL because they do not exist in the customers dataset, so there are no matching results for that data. Only order ids 1, 2, 3, 4, and 5 have matching customer ids that show up in the customers table, which is why those orders have names attached to them.
head(q3)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
q4 <- full_join(customers , orders)
## Joining with `by = join_by(customer_id)`
nrow(q4)
## [1] 8
Rows 5&6 show null data in columns order_id, product, and amount. This is because the customers in those rows do not have matching data from the orders table. Rows 7&8 show null data in columns name and city. This is because the customer ids in those rows do not have matching data from the customers table.
head(q4)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q5 <- semi_join(customers , orders)
## Joining with `by = join_by(customer_id)`
nrow(q5)
## [1] 3
This differs from the inner join because a semi join displays only the customers that have matching orders and keeps each customer listed once. The inner join functions displays one row for every matching customer and order pair. This means that customers will appear multiple times if they have placed multiple orders.
head(q5)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
q6 <- anti_join(customers , orders)
## Joining with `by = join_by(customer_id)`
select(q6, name)
## # A tibble: 2 × 1
## name
## <chr>
## 1 David
## 2 Eve
This result tells me that David and Eve do not have any matching data in the customers and orders tables.
head(q6)
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
I would use a left join because this would display all data from the left table, which is customers. A left join will display data even if it doesn’t have a match in the right table.
I would use an inner join to find only the customers who have places orders. This is because the inner join will only show the customers who placed orders, not all orders like a right join.
q7 <- left_join(customers , orders)
## Joining with `by = join_by(customer_id)`
q7 <- inner_join(customers , orders)
## Joining with `by = join_by(customer_id)`
head(q7)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
bonus <- full_join(customers , orders) %>%
group_by(customer_id , name , city) %>%
summarise(total_orders = n_distinct(order_id),
total_amount_spent = sum(amount , na.rm = TRUE))
## Joining with `by = join_by(customer_id)`
## `summarise()` has regrouped the output.
head(bonus)
## # A tibble: 6 × 5
## # Groups: customer_id, name [6]
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 1 0
## 5 5 Eve Phoenix 1 0
## 6 6 <NA> <NA> 1 600