Inner Join (3 points) Perform an inner join between the customers and orders datasets.

q1 <- inner_join(customers , orders)
## Joining with `by = join_by(customer_id)`

How many rows are in the result?

4 rows

Why are some customers or orders not included in the result?

Because the customers and orders that are not included did not have a match in the other table

Display the result

head(q1)
## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300

Left Join (3 points) Perform a left join with customers as the left table and orders as the right table.

q2 <- left_join(customers, orders, by = "customer_id")

How many rows are in the result?

nrow(q2) # Result: 6
## [1] 6

Explain why this number differs from the inner join result.

The Left Join results in more rows because it retains all customers from the left table—including those with no orders—while still listing multiple rows for customers who placed more than one order.

Display the result

head(q2)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

Right Join (3 points) Perform a right join with customers as the left table and orders as the right table.

q3 <- right_join(customers, orders, by = "customer_id")

How many rows are in the result?

nrow(q3) # Result: 6
## [1] 6

Which customer_ids in the result have NULL for customer name and city? Explain why.

Customer IDs 6 and 7 have NULL values because they exist in the orders dataset but do not have a matching record in the customers dataset.

Display the result

head(q3)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           6 <NA>    <NA>             105 Camera     600
## 6           7 <NA>    <NA>             106 Printer    150

Full Join (3 points) Perform a full join between customers and orders.

q4 <- full_join(customers, orders, by = "customer_id")

How many rows are in the result?

nrow(q4) # Result: 8
## [1] 8

Identify any rows where there’s information from only one table. Explain these results.

Customers 4 and 5: Have names and cities but no order data (only in the customers table). Customer IDs 6 and 7: Have order data but no names or cities (only in the orders table). Explanation: A Full Join keeps all records from both tables, filling in NA whenever a match is missing on either side.

Display the result

head(q4)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

Semi Join (3 points) Perform a semi join with customers as the left table and orders as the right table.

q5 <- semi_join(customers, orders, by = "customer_id")

How many rows are in the result?

nrow(q5) # Result: 3 rows 
## [1] 3

How does this result differ from the inner join result?

The semi join only returns columns from the customers table and does not duplicate rows for multiple orders. While the inner join showed Bob twice (once for each order), the semi join only shows him once to confirm he is a customer who has placed at least one order.

Display the result

head(q5)
## # A tibble: 3 × 3
##   customer_id name    city       
##         <dbl> <chr>   <chr>      
## 1           1 Alice   New York   
## 2           2 Bob     Los Angeles
## 3           3 Charlie Chicago

Anti Join (3 points) Perform an anti join with customers as the left table and orders as the right table.

q6 <- anti_join(customers, orders, by = "customer_id")

Which customers are in the result?

David (ID 4) and Eve (ID 5).

Explain what this result tells you about these customers.

The anti join identifies records in the left table that have no match in the right table. This tells us that David and Eve are the only customers who have not placed any orders yet.

Display the result

head(q6)
## # A tibble: 2 × 3
##   customer_id name  city   
##         <dbl> <chr> <chr>  
## 1           4 David Houston
## 2           5 Eve   Phoenix

Practical Application (4 points) Imagine you’re analyzing customer behavior.

Which join would you use to find all customers, including those who haven’t placed any orders? Why?

I would use a Left Join. This join ensures that every record from the customers table is retained, while matching order data is added where it exists.

# Scenario 1: All customers regardless of orders
all_customers <- left_join(customers, orders, by = "customer_id")
all_customers
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

Which join would you use to find only the customers who have placed orders? Why?

I would use an Inner Join (or a Semi Join). An Inner Join filters the dataset to only include rows where there is a match in both the customers and orders tables.

# Scenario 1: All customers regardless of orders
all_customers <- left_join(customers, orders, by = "customer_id")
all_customers
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

Display the result

Shown in R code above

Challenge Question (3 points) Create a summary that shows each customer’s name, city, total number of orders, and total amount spent. Include all customers, even those without orders. Hint: You’ll need to use a combination of joins and group_by/summarize operations.

challenge_summary <- customers %>%
  left_join(orders, by = "customer_id") %>%
  group_by(customer_id, name, city) %>%
  summarize(
    total_orders = sum(!is.na(order_id)),
    total_spent = sum(amount, na.rm = TRUE)
  )
## `summarise()` has regrouped the output.
## ℹ Summaries were computed grouped by customer_id, name, and city.
## ℹ Output is grouped by customer_id and name.
## ℹ Use `summarise(.groups = "drop_last")` to silence this message.
## ℹ Use `summarise(.by = c(customer_id, name, city))` for per-operation grouping
##   (`?dplyr::dplyr_by`) instead.
challenge_summary
## # A tibble: 5 × 5
## # Groups:   customer_id, name [5]
##   customer_id name    city        total_orders total_spent
##         <dbl> <chr>   <chr>              <int>       <dbl>
## 1           1 Alice   New York               1        1200
## 2           2 Bob     Los Angeles            2        2300
## 3           3 Charlie Chicago                1         300
## 4           4 David   Houston                0           0
## 5           5 Eve     Phoenix                0           0

R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

You can also embed plots, for example:

Note that the echo = FALSE parameter was added to the code chunk to prevent printing of the R code that generated the plot.