R Markdown

This is an R Markdown document. Markdown is a simple formatting syntax for authoring HTML, PDF, and MS Word documents. For more details on using R Markdown see http://rmarkdown.rstudio.com.

When you click the Knit button a document will be generated that includes both content as well as the output of any embedded R code chunks within the document. You can embed an R code chunk like this:

summary(cars)
##      speed           dist       
##  Min.   : 4.0   Min.   :  2.00  
##  1st Qu.:12.0   1st Qu.: 26.00  
##  Median :15.0   Median : 36.00  
##  Mean   :15.4   Mean   : 42.98  
##  3rd Qu.:19.0   3rd Qu.: 56.00  
##  Max.   :25.0   Max.   :120.00

Including Plots

# Dataset 1: Customers
customers <- tibble(
  customer_id = c(1, 2, 3, 4, 5),
  name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  city = c("New York", "Los Angeles", "Chicago", "Houston", "Phoenix")
)

# Dataset 2: Orders
orders <- tibble(
  order_id = c(101, 102, 103, 104, 105, 106),
  customer_id = c(1, 2, 3, 2, 6, 7),
  product = c("Laptop", "Phone", "Tablet", "Desktop", "Camera", "Printer"),
  amount = c(1200, 800, 300, 1500, 600, 150)
)

Inner Join (3 points) Perform an inner join between the customers and orders datasets.

Q1 <- inner_join(customers, orders, by = ‘customer_id’) # How many rows are in the result? nrow(Q1) # Why are some customers or orders not included in the result? # inner join returns all rows from both tables where there is a match # therefore the rows not returned did not have a match in the # other table # Display the result head(Q1)

# Left Join (3 points) Perform a left join with customers as the left table and orders as the right table.

``` r
Q2 <- left_join(customers,orders)
## Joining with `by = join_by(customer_id)`
# How many rows are in the result?
nrow(Q2)
## [1] 6
# Explain why this number differs from the inner join result.
# Left join joins all customers even if they don't have matching orders
# Display the result
head(Q2)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

Right Join (3 points) Perform a right join with customers as the left table and orders as the right table.

Q3 <- right_join(customers,orders)
## Joining with `by = join_by(customer_id)`
# How many rows are in the result?
nrow(Q3)
## [1] 6
# Which customer_ids in the result have NULL for customer name and city? Explain why.
# They join can't find a match due to customer ids not being in customers table. 
# Display the result
head(Q3)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           6 <NA>    <NA>             105 Camera     600
## 6           7 <NA>    <NA>             106 Printer    150

Full Join (3 points) Perform a full join between customers and orders.

Q4 <- full_join(customers,orders)
## Joining with `by = join_by(customer_id)`
# How many rows are in the result?
  nrow(Q4)
## [1] 8
# Identify any rows where there’s information from only one table. Explain these results.
  #Rows 5 & 6 (David, Eve) → these customers have no matching orders.
  # order_id, product, and amount are NA.
  # Rows 7 & 8 (customer_id 6 and 7) → these orders have no matching customer in the customers table.
  # customer_name and city are NA
  # Customers without orders appear because a full join keeps all rows from the left table (customers).  # Orders without customers appear because a full join also keeps all rows from the right table (orders).
# Display the result
head(Q4)
## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

Semi Join (3 points) Perform a semi join with customers as the left table and orders as the right table.

Q5<- semi_join(customers,orders)
## Joining with `by = join_by(customer_id)`
# How many rows are in the result?
nrow(Q5)
## [1] 3
# How does this result differ from the inner join result?
# A semi join returns rows from customers only if they have matching customer_ids in orders
# Display the result
head(Q5)
## # A tibble: 3 × 3
##   customer_id name    city       
##         <dbl> <chr>   <chr>      
## 1           1 Alice   New York   
## 2           2 Bob     Los Angeles
## 3           3 Charlie Chicago

Anti Join (3 points) Perform an anti join with customers as the left table and orders as the right table.

Q6<- anti_join(customers,orders)
## Joining with `by = join_by(customer_id)`
# Which customers are in the result?
# David and Eve
# Explain what this result tells you about these customers.
# These are customers who have not placed any orders — meaning their customer_id does not exist in the orders table.
# Display the result
head(Q6)
## # A tibble: 2 × 3
##   customer_id name  city   
##         <dbl> <chr> <chr>  
## 1           4 David Houston
## 2           5 Eve   Phoenix

Practical Application (4 points) Imagine you’re analyzing customer behavior.

# A). Which join would you use to find all customers, including those who haven’t placed any orders? Why?
# A left join keeps all rows from the customers table, even if there’s no matching customer_id in orders

# B). Which join would you use to find only the customers who have placed orders? Why?
# A semi join returns only customers that have at least one matching customer_id in orders, without including columns from the orders table.
# C. Write the R code for both scenarios.
# # All customers (including those with no orders)
# all_customers <- left_join(customers, orders, by = "customer_id")
  
# # Only customers who placed orders
# customers_with_orders <- semi_join(customers, orders, by = "customer_id")

Challenge Question (3 points) Create a summary that shows each customer’s name, city, total number of orders, and total amount spent. Include all customers, even those without orders. Hint: You’ll need to use a combination of joins and group_by/summarize operations.

# Summary of orders per customer
Q8 <- left_join(customers, orders, by = "customer_id") %>%
  group_by(customer_id, name, city) %>%
  summarize(
    total_orders = n_distinct(order_id, na.rm = TRUE),
    total_spent = sum(amount, na.rm = TRUE)
  )
## `summarise()` has grouped output by 'customer_id', 'name'. You can override
## using the `.groups` argument.
# Display the result
head(Q8)
## # A tibble: 5 × 5
## # Groups:   customer_id, name [5]
##   customer_id name    city        total_orders total_spent
##         <dbl> <chr>   <chr>              <int>       <dbl>
## 1           1 Alice   New York               1        1200
## 2           2 Bob     Los Angeles            2        2300
## 3           3 Charlie Chicago                1         300
## 4           4 David   Houston                0           0
## 5           5 Eve     Phoenix                0           0