Q1 <-inner_join(customers, orders)
## Joining with `by = join_by(customer_id)`
a.How many rows are in the result?
## [1] 4
b. Why are some customers or orders not included in the result?.
Some customers or orders are not included in the inner join result because an inner join only returns rows with matching customer id in both the customers and orders data sets that does not include unmatched entries from either table.
c. Display the result
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
Q2 <-left_join(customers, orders)
## Joining with `by = join_by(customer_id)`
a. How many rows are in the result?
## [1] 6
b. Explain why this number differs from the inner join result
A left join includes all customers from the left table and the inner join only includes rows with matching customer id values in both tables. This shows a larger number of rows for the left join.
c. Display the result
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
Q3 <-right_join(customers, orders)
## Joining with `by = join_by(customer_id)`
a.How many rows are in the result?
## [1] 6
b. which customer_ids in the result have NULL for customer name and city? Explain why.
Customers 6 and 7 have Null results. In a right join any customer id in the orders table that does not have an entry in the customers table will have NULL values for name and city since the right join keeps all rows from the orders table even if theres not a match in the customers table.
c.Display the result
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
Q4 <-right_join(customers, orders)
## Joining with `by = join_by(customer_id)`
a. How many rows are in the result?
## [1] 6
b. Identify any rows where there’s information from only one table. Explain these results.
Row 1-4 contains data from only the customer table. In a Full join rows with information from only one table customer id exists in either the customers or orders table but not both so there will be NULL values for the missing data.
c. Display the result
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 6 <NA> <NA> 105 Camera 600
## 6 7 <NA> <NA> 106 Printer 150
Q5 <-semi_join(customers, orders)
## Joining with `by = join_by(customer_id)`
a. How many rows are in the result?
## [1] 3
b. How does this result differ from the inner join result?
It only has customer_Id, name, and city. Semi join returns only the rows from the customers table that have matching orders in the orders table but it does not include any columns from the orders table. Inner join returns all matching rows from both tables that include their columns.
c. Display the result
## # A tibble: 3 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 1 Alice New York
## 2 2 Bob Los Angeles
## 3 3 Charlie Chicago
Q6 <- anti_join(customers, orders)
## Joining with `by = join_by(customer_id)`
a. Which customers are in the result?
## # A tibble: 2 × 1
## name
## <chr>
## 1 David
## 2 Eve
b.Explain what this result tells you about these customers.
David and Eve do not have order numbers that wont link to the product or price. Anti join returns the rows from the customers table that do not have any corresponding orders in the orders table. This shows these customers have never made a purchase.
c.Display the result
## # A tibble: 2 × 3
## customer_id name city
## <dbl> <chr> <chr>
## 1 4 David Houston
## 2 5 Eve Phoenix
a. Which join would you use to find all customers, including those who haven’t placed any orders? Why?
I would use a full join to find all customers even customers who haven’t placed any orders since it includes every customer regardless of whether they have any corresponding order data.
b.. Which join would you use to find only the customers who have placed orders? Why?
I would use an inner join to find only the customers who have placed orders because it returns rows that have matching entries in both the customers and orders tables that shows only customers with orders that are included in the result.
c.Write the R code for both scenarios.
Q7A <- full_join(customers, orders, by = "customer_id")
Q7B <- inner_join(customers, orders, by = "customer_id")
d. Display the result
## # A tibble: 6 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
## 5 4 David Houston NA <NA> NA
## 6 5 Eve Phoenix NA <NA> NA
## # A tibble: 4 × 6
## customer_id name city order_id product amount
## <dbl> <chr> <chr> <dbl> <chr> <dbl>
## 1 1 Alice New York 101 Laptop 1200
## 2 2 Bob Los Angeles 102 Phone 800
## 3 2 Bob Los Angeles 104 Desktop 1500
## 4 3 Charlie Chicago 103 Tablet 300
EC <- full_join(customers, orders, by = "customer_id")
EC <- EC %>%
group_by(customer_id, name, city) %>%
summarize(
total_orders = sum(!is.na(order_id)),
total_amount_spent = sum(amount, na.rm = TRUE)
) %>%
arrange(name)
## `summarise()` has grouped output by 'customer_id', 'name'. You can override
## using the `.groups` argument.
## # A tibble: 7 × 5
## # Groups: customer_id, name [7]
## customer_id name city total_orders total_amount_spent
## <dbl> <chr> <chr> <int> <dbl>
## 1 1 Alice New York 1 1200
## 2 2 Bob Los Angeles 2 2300
## 3 3 Charlie Chicago 1 300
## 4 4 David Houston 0 0
## 5 5 Eve Phoenix 0 0
## 6 6 <NA> <NA> 1 600
## 7 7 <NA> <NA> 1 150