q1 <- inner_join(customers , orders)
## Joining with `by = join_by(customer_id)`
4 rows
Because the customers and orders that are not included did not have a match in the other table
head(q1)
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
q2 <- left_join(customers, orders, by = "customer_id")
nrow(q2) # Result: 6
## [1] 6
The Left Join results in more rows because it retains all customers from the left table—including those with no orders—while still listing multiple rows for customers who placed more than one order.
head(q2)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q3 <- right_join(customers, orders, by = "customer_id")
nrow(q3) # Result: 6
## [1] 6
Customer IDs 6 and 7 have NULL values because they exist in the orders dataset but do not have a matching record in the customers dataset.
head(q3)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
q4 <- full_join(customers, orders, by = "customer_id")
nrow(q4) # Result: 8
## [1] 8
Customers 4 and 5: Have names and cities but no order data (only in the customers table). Customer IDs 6 and 7: Have order data but no names or cities (only in the orders table). Explanation: A Full Join keeps all records from both tables, filling in NA whenever a match is missing on either side.
head(q4)
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
q5 <- semi_join(customers, orders, by = "customer_id")
nrow(q5) # Result: 3 rows
## [1] 3
The semi join only returns columns from the customers table and does not duplicate rows for multiple orders. While the inner join showed Bob twice (once for each order), the semi join only shows him once to confirm he is a customer who has placed at least one order.
head(q5)
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
q6 <- anti_join(customers, orders, by = "customer_id")
David (ID 4) and Eve (ID 5).
The anti join identifies records in the left table that have no match in the right table. This tells us that David and Eve are the only customers who have not placed any orders yet.
head(q6)
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
I would use a Left Join. This join ensures that every record from the customers table is retained, while matching order data is added where it exists.
# Scenario 1: All customers regardless of orders
all_customers <- left_join(customers, orders, by = "customer_id")
all_customers
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
I would use an Inner Join (or a Semi Join). An Inner Join filters the dataset to only include rows where there is a match in both the customers and orders tables.
# Scenario 1: All customers regardless of orders
all_customers <- left_join(customers, orders, by = "customer_id")
all_customers
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
Shown in R code above
challenge_summary <- customers %>%
left_join(orders, by = "customer_id") %>%
group_by(customer_id, name, city) %>%
summarize(
total_orders = sum(!is.na(order_id)),
total_spent = sum(amount, na.rm = TRUE)
)
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by customer_id, name, and city.
## ℹ Output is grouped by customer_id and name.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(customer_id, name, city))` for per-operation grouping
## (`?dplyr::dplyr_by`) instead.
challenge_summary
## # A tibble: 5 × 5
## # Groups: customer_id, name [5]
## customer_id name city total_orders total_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 0 0
## 5 5 Eve Phoenix 0 0
This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.
When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:
summary(cars)
## speed dist
## Min. : 4.0 Min. : 2.00
## 1st Qu.:12.0 1st Qu.: 26.00
## Median :15.0 Median : 36.00
## Mean :15.4 Mean : 42.98
## 3rd Qu.:19.0 3rd Qu.: 56.00
## Max. :25.0 Max. :120.00
You can also embed plots, for example:
Note that the echo = FALSE parameter was added to the
code chunk to prevent printing of the R code that generated the
plot.