Lab 4 - Emily Turcotte

Data Setup

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.4
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.4.4     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.0
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(lubridate)
library(completejourney)

## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.

transactions <- get_transactions()
dim(transactions)

## [1] 1469307      11

promotions <- get_promotions()
dim(promotions)

## [1] 20940529        5

How many transactions do we have demographics on?

transactions %>%
  semi_join(demographics, by = "household_id") %>%
  tally()

How many transactions do we not have demographics on?

transactions %>%
  anti_join(demographics, by = "household_id") %>%
  tally()
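
As a sanity check on the two counts above, semi_join() and anti_join() should split the transactions into two non-overlapping groups whose row counts add up to the total. The sketch below illustrates this on small made-up tables; toy_trans and toy_demo are hypothetical stand-ins, not the completejourney data.

```r
library(dplyr)

# Hypothetical stand-ins for transactions and demographics
toy_trans <- tibble(
  household_id = c("1", "1", "2", "3", "3", "3"),
  sales_value  = c(2.5, 1.0, 4.0, 3.0, 0.5, 1.5)
)
toy_demo <- tibble(
  household_id = c("1", "3"),
  age          = c("25-34", "45-54")
)

# Rows of toy_trans whose household appears in toy_demo
with_demo <- semi_join(toy_trans, toy_demo, by = "household_id")

# Rows of toy_trans whose household does not appear in toy_demo
without_demo <- anti_join(toy_trans, toy_demo, by = "household_id")

# Every transaction lands in exactly one of the two groups
nrow(with_demo) + nrow(without_demo) == nrow(toy_trans)
```

The same identity should hold for the real tables, so the two tallies above can be checked against the first element of dim(transactions).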

Perform an inner join with the transactions and demographics data. Then compute the total sales_value by age category to identify which age group generates the most sales.

transactions %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(age) %>%
  summarize(total_sales = sum(sales_value)) %>%
  arrange(desc(total_sales))

Identify all households with $1000 or more in total sales

hshld_1000 <- transactions %>%
  group_by(household_id) %>%
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%
  filter(total_sales >= 1000)

hshld_1000

Join the above results with the demographics data to determine:

How many of these households do we have demographic data on?

hshld_1000 %>%
  semi_join(demographics, by = "household_id") %>%
  tally()

How many do we not have demographic data on?

hshld_1000 %>%
  anti_join(demographics, by = "household_id") %>%
  tally()

Which income range produces the most households that spent $1000 or more?

hshld_1000 %>%
  inner_join(demographics, by = "household_id") %>%
  count(income, sort = TRUE)

Join transactions and filtered promotions data

front_display_trans <- promotions %>%
  filter(display_location == 1) %>%
  inner_join(transactions, by = c('product_id', 'store_id', 'week'))
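
To illustrate what joining on three keys does, here is a minimal sketch on made-up tables (toy_promos and toy_trans are hypothetical). A transaction is kept only when product_id, store_id, and week all match a promotion row, and one promotion row can match several transactions.

```r
library(dplyr)

# Hypothetical promotions that were already filtered to front-of-store displays
toy_promos <- tibble(
  product_id       = c("A", "A", "B"),
  store_id         = c("10", "11", "10"),
  week             = c(1L, 1L, 2L),
  display_location = c("1", "1", "1")
)

# Hypothetical transactions
toy_trans <- tibble(
  product_id  = c("A", "A", "B", "B"),
  store_id    = c("10", "10", "10", "10"),
  week        = c(1L, 2L, 2L, 2L),
  sales_value = c(3, 4, 5, 6)
)

# Rows survive only when all three keys agree
joined <- inner_join(
  toy_promos, toy_trans,
  by = c("product_id", "store_id", "week")
)

joined
sum(joined$sales_value)
```

Here product A sold in week 2 is dropped because it was promoted only in week 1, while both week-2 transactions for product B are kept.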

Total sales for all products displayed in the front of the store

front_display_trans %>%
  summarize(total_sales = sum(sales_value))

Identify the product displayed in the front of the store that had the largest total sales

front_display_trans %>%
  group_by(product_id) %>%
  summarize(total_front_display_sales = sum(sales_value)) %>%
  arrange(desc(total_front_display_sales))

Identify which product_category is related to the coupon where campaign_id is equal to 18 and coupon_upc is equal to 10000089238

coupons %>%
  filter(campaign_id == "18", coupon_upc == "10000089238") %>%
  inner_join(products, by = "product_id") %>%
  distinct(product_category)

Identify all different products that contain “pizza” in their product_type description. Which of these products produces the greatest amount of total sales (compute total sales by product ID and product type)?

greatest_amount <- products %>%
  filter(str_detect(product_type, regex("pizza", ignore_case = TRUE))) %>%
  inner_join(transactions, by = "product_id")

type <- greatest_amount %>%
  group_by(product_type) %>%
  summarize(total_sales = sum(sales_value)) %>%
  arrange(desc(total_sales))

id <- greatest_amount %>%
  group_by(product_id) %>%
  summarize(total_sales = sum(sales_value)) %>%
  arrange(desc(total_sales))

type

id
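
Since the question asks for total sales by product ID and product type together, one alternative is to group by both columns at once. This is only a sketch on made-up data (toy_sales is hypothetical), not a replacement for the two summaries above.

```r
library(dplyr)

# Hypothetical pizza sales, already joined to product information
toy_sales <- tibble(
  product_id   = c("1", "1", "2", "3"),
  product_type = c("PIZZA/PREMIUM", "PIZZA/PREMIUM", "PIZZA/ECONOMY", "PIZZA/PREMIUM"),
  sales_value  = c(5, 6, 4, 3)
)

# One row per product_id / product_type pair, largest total first
by_id_and_type <- toy_sales %>%
  group_by(product_id, product_type) %>%
  summarize(total_sales = sum(sales_value), .groups = "drop") %>%
  arrange(desc(total_sales))

by_id_and_type
```

Keeping product_type next to product_id makes the top row readable without a second lookup.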

Identify all products that are categorized (product_category) as “pizza” but are considered a “snack” or “appetizer”

relevant_products <- products %>%
  filter(
    str_detect(product_category, regex("pizza", ignore_case = TRUE)), 
    str_detect(product_type, regex("(snack|appetizer)", ignore_case = TRUE))
  )

relevant_products
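
The filter above combines two conditions, and a row must satisfy both. The sketch below shows how the two str_detect() calls behave on a few made-up rows (toy_products is hypothetical, and its category and type strings are invented for illustration).

```r
library(dplyr)
library(stringr)

# Hypothetical products with mixed capitalization
toy_products <- tibble(
  product_id       = c("1", "2", "3", "4"),
  product_category = c("FROZEN PIZZA", "Frozen Pizza", "SNACKS", "PIZZA"),
  product_type     = c("SNACKS/APPETIZERS", "PIZZA/PREMIUM", "PIZZA SNACK", "Appetizer")
)

# Comma-separated conditions in filter() are combined with AND;
# ignore_case = TRUE makes "PIZZA", "Pizza", and "pizza" all match
matches <- toy_products %>%
  filter(
    str_detect(product_category, regex("pizza", ignore_case = TRUE)),
    str_detect(product_type, regex("snack|appetizer", ignore_case = TRUE))
  )

matches$product_id
```

Product 2 is dropped because its type is neither a snack nor an appetizer, and product 3 is dropped because "pizza" appears in its type rather than in its category.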

Join the above relevant pizza products with the transactions data and compute the total quantity of items sold by product ID. Which of these products (product_id) sold the largest total quantity?

relevant_products %>%
  inner_join(transactions, by = "product_id") %>%
  group_by(product_id) %>%
  summarize(total_qty = sum(quantity)) %>%
  arrange(desc(total_qty))

Identify all products that contain “peanut butter” in their product_type. How many unique products does this result in?

pb <- products %>%
  filter(str_detect(product_type, regex("peanut butter", ignore_case = TRUE)))

tally(pb)

Compute the total sales_value by month based on the transaction_timestamp. Which month produces the most sales value for these products? Which month produces the least sales value for these products?

pb %>%
  inner_join(transactions, by = "product_id") %>%
  group_by(month = month(transaction_timestamp, label = TRUE)) %>%
  summarize(total_sales = sum(sales_value)) %>%
  arrange(desc(total_sales))
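
The sorted table answers the "most" question from its first row and the "least" question from its last. As a sketch on made-up timestamps (toy is hypothetical), slice_max() and slice_min() can pull out both months directly.

```r
library(dplyr)
library(lubridate)

# Hypothetical transactions spread over two months
toy <- tibble(
  transaction_timestamp = ymd_hms(c(
    "2017-01-15 10:00:00",
    "2017-01-20 12:30:00",
    "2017-03-02 09:15:00"
  )),
  sales_value = c(2, 3, 4)
)

# month(..., label = TRUE) returns an ordered factor of month abbreviations
by_month <- toy %>%
  group_by(month = month(transaction_timestamp, label = TRUE)) %>%
  summarize(total_sales = sum(sales_value)) %>%
  arrange(desc(total_sales))

slice_max(by_month, total_sales, n = 1)  # month with the most sales value
slice_min(by_month, total_sales, n = 1)  # month with the least sales value
```

The month labels are printed in the session's locale, so the abbreviations may differ on a non-English system.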

Using the coupon_redemptions data, filter for the coupon associated with campaign_id 18 and coupon_upc “10000085475”. How many households redeemed this coupon? Identify the total sales_value for all transactions associated with the household_ids that redeemed this coupon on the same day they redeemed the coupon.

redeemed <- coupon_redemptions %>%
  filter(campaign_id == "18", coupon_upc == "10000085475") %>%
  inner_join(transactions, by =  "household_id") %>%
  filter(yday(transaction_timestamp) == yday(redemption_date)) %>%
  group_by(household_id) %>%
  summarize(total_sales = sum(sales_value))

redeemed
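
Two details of this question can be sketched on made-up data (toy_redemptions and toy_trans are hypothetical). First, the number of redeeming households can be counted with n_distinct() before any join. Second, yday() compares only the day of the year, so it would also match a transaction on the same calendar day in a different year; comparing as_date(transaction_timestamp) to the redemption date avoids that. If the real data cover a single year, the two filters should agree.

```r
library(dplyr)
library(lubridate)

# Hypothetical redemptions: one per household
toy_redemptions <- tibble(
  household_id    = c("1", "2"),
  redemption_date = as_date(c("2017-06-01", "2017-06-03"))
)

# Hypothetical transactions; the last one is a year after the redemption
toy_trans <- tibble(
  household_id = c("1", "1", "2", "2"),
  transaction_timestamp = ymd_hms(c(
    "2017-06-01 09:00:00",
    "2017-06-02 09:00:00",
    "2017-06-03 18:00:00",
    "2018-06-03 18:00:00"
  )),
  sales_value = c(10, 20, 5, 7)
)

# Number of households that redeemed the coupon
n_households <- n_distinct(toy_redemptions$household_id)

joined <- inner_join(toy_redemptions, toy_trans, by = "household_id")

# Same-day match on the full calendar date
by_date <- joined %>%
  filter(as_date(transaction_timestamp) == redemption_date) %>%
  group_by(household_id) %>%
  summarize(total_sales = sum(sales_value))

# Same-day match on day of year only; also picks up the 2018 transaction
by_yday <- joined %>%
  filter(yday(transaction_timestamp) == yday(redemption_date)) %>%
  group_by(household_id) %>%
  summarize(total_sales = sum(sales_value))

n_households
by_date
by_yday
```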

Using the same redeemed coupon (campaign_id == “18” & coupon_upc == “10000085475”), calculate the total sales_value for each product_type that this coupon was applied to, and identify which product_type resulted in the greatest sales when associated with this coupon.

redeemed_product <- coupon_redemptions %>%
  filter(campaign_id == "18", coupon_upc == "10000085475") %>% 
  inner_join(coupons, by = c("coupon_upc", "campaign_id")) %>%
  inner_join(products, by = "product_id") %>%
  filter(str_detect(product_category, regex("vegetables", ignore_case = TRUE))) %>%
  inner_join(transactions, by = c("household_id", "product_id")) %>%
  filter(yday(transaction_timestamp) == yday(redemption_date)) %>%
  group_by(product_type) %>%
  summarize(total_sales = sum(sales_value)) %>%
  arrange(desc(total_sales))

redeemed_product