1. Inner Join (3 points) Perform an inner join between the customers and orders datasets.

Q1 <-inner_join(customers, orders)
## Joining with `by = join_by(customer_id)`

a.How many rows are in the result?

## [1] 4

b. Why are some customers or orders not included in the result?.

Some customers or orders are not included in the inner join result because an inner join only returns rows with matching customer id in both the customers and orders data sets that does not include unmatched entries from either table.

c. Display the result

## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300

2.Left Join (3 points) Perform a left join with customers as the left table and orders as the right table.

Q2 <-left_join(customers, orders)
## Joining with `by = join_by(customer_id)`

a. How many rows are in the result?

## [1] 6

b. Explain why this number differs from the inner join result

A left join includes all customers from the left table and the inner join only includes rows with matching customer id values in both tables. This shows a larger number of rows for the left join.

c. Display the result

## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA

3. Right Join (3 points) Perform a right join with customers as the left table and orders as the right table.

Q3 <-right_join(customers, orders)
## Joining with `by = join_by(customer_id)`

a.How many rows are in the result?

## [1] 6

b. which customer_ids in the result have NULL for customer name and city? Explain why.

Customers 6 and 7 have Null results. In a right join any customer id in the orders table that does not have an entry in the customers table will have NULL values for name and city since the right join keeps all rows from the orders table even if theres not a match in the customers table.

c.Display the result

## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           6 <NA>    <NA>             105 Camera     600
## 6           7 <NA>    <NA>             106 Printer    150

4. Full Join (3 points) Perform a full join between customers and orders.

Q4 <-right_join(customers, orders)
## Joining with `by = join_by(customer_id)`

a. How many rows are in the result?

## [1] 6

b. Identify any rows where there’s information from only one table. Explain these results.

Row 1-4 contains data from only the customer table. In a Full join rows with information from only one table customer id exists in either the customers or orders table but not both so there will be NULL values for the missing data.

c. Display the result

## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           6 <NA>    <NA>             105 Camera     600
## 6           7 <NA>    <NA>             106 Printer    150

5. Semi Join (3 points) Perform a semi join with customers as the left table and orders as the right table.

Q5 <-semi_join(customers, orders)
## Joining with `by = join_by(customer_id)`

a. How many rows are in the result?

## [1] 3

b. How does this result differ from the inner join result?

It only has customer_Id, name, and city. Semi join returns only the rows from the customers table that have matching orders in the orders table but it does not include any columns from the orders table. Inner join returns all matching rows from both tables that include their columns.

c. Display the result

## # A tibble: 3 × 3
##   customer_id name    city       
##         <dbl> <chr>   <chr>      
## 1           1 Alice   New York   
## 2           2 Bob     Los Angeles
## 3           3 Charlie Chicago

6. Anti Join (3 points) Perform an anti join with customers as the left table and orders as the right table.**

Q6 <- anti_join(customers, orders)
## Joining with `by = join_by(customer_id)`

a. Which customers are in the result?

## # A tibble: 2 × 1
##   name 
##   <chr>
## 1 David
## 2 Eve

b.Explain what this result tells you about these customers.

David and Eve do not have order numbers that wont link to the product or price. Anti join returns the rows from the customers table that do not have any corresponding orders in the orders table. This shows these customers have never made a purchase.

c.Display the result

## # A tibble: 2 × 3
##   customer_id name  city   
##         <dbl> <chr> <chr>  
## 1           4 David Houston
## 2           5 Eve   Phoenix

7.Practical Application (4 points) Imagine you’re analyzing customer behavior

a. Which join would you use to find all customers, including those who haven’t placed any orders? Why?

I would use a full join to find all customers even customers who haven’t placed any orders since it includes every customer regardless of whether they have any corresponding order data.

b.. Which join would you use to find only the customers who have placed orders? Why?

I would use an inner join to find only the customers who have placed orders because it returns rows that have matching entries in both the customers and orders tables that shows only customers with orders that are included in the result.

c.Write the R code for both scenarios.

Q7A <- full_join(customers, orders, by = "customer_id")

Q7B <- inner_join(customers, orders, by = "customer_id")

d. Display the result

## # A tibble: 6 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300
## 5           4 David   Houston           NA <NA>        NA
## 6           5 Eve     Phoenix           NA <NA>        NA
## # A tibble: 4 × 6
##   customer_id name    city        order_id product amount
##         <dbl> <chr>   <chr>          <dbl> <chr>    <dbl>
## 1           1 Alice   New York         101 Laptop    1200
## 2           2 Bob     Los Angeles      102 Phone      800
## 3           2 Bob     Los Angeles      104 Desktop   1500
## 4           3 Charlie Chicago          103 Tablet     300

8. Challenge Question (3 points) Create a summary that shows each customer’s name, city, total number of orders, and total amount spent. Include all customers, even those without orders. Hint: You’ll need to use a combination of joins and group_by/summarize operations.**

EC <- full_join(customers, orders, by = "customer_id")
EC <- EC %>%
  group_by(customer_id, name, city) %>%
  summarize(
    total_orders = sum(!is.na(order_id)),  
    total_amount_spent = sum(amount, na.rm = TRUE)
  ) %>%
  arrange(name)
## `summarise()` has grouped output by 'customer_id', 'name'. You can override
## using the `.groups` argument.
## # A tibble: 7 × 5
## # Groups:   customer_id, name [7]
##   customer_id name    city        total_orders total_amount_spent
##         <dbl> <chr>   <chr>              <int>              <dbl>
## 1           1 Alice   New York               1               1200
## 2           2 Bob     Los Angeles            2               2300
## 3           3 Charlie Chicago                1                300
## 4           4 David   Houston                0                  0
## 5           5 Eve     Phoenix                0                  0
## 6           6 <NA>    <NA>                   1                600
## 7           7 <NA>    <NA>                   1                150