Alex Martin-Cabrera, Andrew Grant, Jared Rice

Introduction:

Milk is a common household item that is consumed by the masses. Being an extremely versatile good, milk can combined with a breakfast cereal, used in baking/cooking, or even for enjoying a cold glass for the satisfaction of the consumer. Milk is also a relatively cheap good and is constantly in demand.

After running preliminary analysis on the data frames that were provided it was discovered that out of all the product transactions that occurred during the year 2017, milk was the fourth most purchased product, but it falls significantly lower from the revenue gained than the top three products by a margin of over $150,000.

With milk sales lagging, this analysis aims to provide insight to Regork on who is most likely to consume milk, and how Regork can target coupon campaigns to consumers where milk consumption is lagging based on the household characteristics of those consumers.

The household characteristics include age group, number of children, household size, and level of income.

Packages:

The required packages to replicate this analysis include:

completejourney- contains data of grocery store transactions over one year from 2,469 households.
tidyverse- compilation of packages for data transformation and visualization. Packages from tidyverse used for this analysis include ggplot2, dplyr, tidyr, and forcats.
lubridate- allows date-time variables to be manipulated and easily readable in code chunks.
knitr-allows for creating dynamic reports.

Importing Data Sets From Complete Journey

demo<-completejourney::demographics
transactions<-get_transactions()
products<-completejourney::products
campaigns<-completejourney::campaigns
campaign_descriptions<-completejourney::campaign_descriptions
coupons<-completejourney::coupons
coupon_redemptions<-completejourney::coupon_redemptions

Data Cleaning:

Joining tables by the “product_id” variable

transactions_products <- transactions %>%
  left_join(products, by = "product_id")

Joining tables by the “household_id” variable

complete_data <- transactions_products %>%
  inner_join(demographics, by = "household_id")%>%
  inner_join(campaigns, by = "household_id")

Joining tables by the “household_id” variable

complete_data<-complete_data%>%
  inner_join(campaign_descriptions, by = "campaign_id")

After joining multiple data frames to combine one large table, we have a table with 488,742 observations and 28 variables.

dim(complete_data)

## [1] 4886742      28

A snapshot of the Data

head(complete_data)

## # A tibble: 6 x 28
##   household_id store_id basket_id product_id quantity sales_value retail_disc
##   <chr>        <chr>    <chr>     <chr>         <dbl>       <dbl>       <dbl>
## 1 900          330      31198570~ 1095275           1         0.5           0
## 2 900          330      31198570~ 1095275           1         0.5           0
## 3 900          330      31198570~ 1095275           1         0.5           0
## 4 900          330      31198570~ 1095275           1         0.5           0
## 5 900          330      31198570~ 1095275           1         0.5           0
## 6 900          330      31198570~ 1095275           1         0.5           0
## # ... with 21 more variables: coupon_disc <dbl>, coupon_match_disc <dbl>,
## #   week <int>, transaction_timestamp <dttm>, manufacturer_id <chr>,
## #   department <chr>, brand <fct>, product_category <chr>, product_type <chr>,
## #   package_size <chr>, age <ord>, income <ord>, home_ownership <ord>,
## #   marital_status <ord>, household_size <ord>, household_comp <ord>,
## #   kids_count <ord>, campaign_id <chr>, campaign_type <ord>,
## #   start_date <date>, end_date <date>

complete_data%>%
  group_by(product_category)%>%
  summarize(Spent=sum(sales_value,na.rm = TRUE))%>%
  slice_max(order_by=Spent,n=5)

## # A tibble: 5 x 2
##   product_category       Spent
##   <chr>                  <dbl>
## 1 COUPON/MISC ITEMS   1510881.
## 2 SOFT DRINKS          567578.
## 3 BEEF                 544233.
## 4 FLUID MILK PRODUCTS  379666.
## 5 CHEESE               354232.

complete_data%>%
  group_by(product_category)%>%
  summarize(Quantity=sum(quantity,na.rm = TRUE))%>%
  slice_max(order_by=Quantity,n=10)

## # A tibble: 10 x 2
##    product_category           Quantity
##    <chr>                         <dbl>
##  1 COUPON/MISC ITEMS         600618898
##  2 SOFT DRINKS                  268654
##  3 FUEL                         229200
##  4 FLUID MILK PRODUCTS          215298
##  5 CHEESE                       180084
##  6 BAKED BREAD/BUNS/ROLLS       177111
##  7 YOGURT                       163223
##  8 SOUP                         159330
##  9 FRZN MEAT/MEAT DINNERS       148716
## 10 VEGETABLES - SHELF STABLE    147811

complete_data%>%
  group_by(product_category)%>%
  summarize(Sales=sum(sales_value,na.rm = TRUE))%>%
  slice_max(order_by=Sales,n=5)

## # A tibble: 5 x 2
##   product_category       Sales
##   <chr>                  <dbl>
## 1 COUPON/MISC ITEMS   1510881.
## 2 SOFT DRINKS          567578.
## 3 BEEF                 544233.
## 4 FLUID MILK PRODUCTS  379666.
## 5 CHEESE               354232.

As shown by the two tables above, milk is the 4th largest item purchased in regards to quantity, and the 4th largest item in regards to price. We can see that milk products fall behind in sales by around $164,000, a significant deficit that we are seeking to close to gap of.

Exploratory Data Analysis:

milk_data <- complete_data %>%
  filter(grepl("milk", product_type, ignore.case = TRUE))

# Summarize milk purchases by age, income, household size, and kids
milk_summary <- milk_data %>%
  group_by(transaction_timestamp, age, income, household_size, kids_count) %>%
  summarize(total_spent = sum(sales_value), .groups = 'drop')

# Convert transaction_timestamp to Date format using ymd_hms
milk_summary <- milk_summary %>%
  mutate(transaction_timestamp = ymd_hms(transaction_timestamp, quiet = TRUE))

# Remove unknown groups
milk_summary <- milk_summary %>%
  filter(!is.na(age) & age != "Unknown",
         !is.na(income) & income != "Unknown",
         !is.na(household_size) & household_size != "Unknown",
         !is.na(kids_count) & kids_count != "Unknown")

# Plot milk purchases over time by age
ggplot(milk_summary, aes(x = transaction_timestamp, y = total_spent, color = as.factor(age))) +
  geom_line() +
  labs(title = "Milk Purchases Over Time by Age",
       x = "Date", y = "Total Sales") +
  scale_color_discrete(name = "Age Group") +  
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5), 
    legend.position = "right")

We can see that throughout 2017, most milk purchases came from consumers age 25-44, with the peak of milk sales being age group 35-44 in January. Groups that are lagging in milk purchases are 19-24 and 55+. This implies that implementing a marketing campaign targeted toward households with consumers 19-24 and 55+ and would be be effective in increasing milk sales.

# Bar chart for milk purchases by kids count
ggplot(milk_summary, aes(x = as.factor(kids_count), y = total_spent, fill = as.factor(kids_count))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(title = "Total Milk Purchases by Kids Count",
       x = "Kids Count", y = "Total Sales", fill = "Kid Count")+
   theme(
    plot.title = element_text(hjust = 0.5))

From the bar chart above, it appears that has the kid count increases, the amount spent by households on milk decreases. For our coupon campaign, we could specifically market towards households with children as there is a larger customer base that we could tap into to boost milk sales.

# Bar chart for milk purchases by household size
ggplot(milk_summary, aes(x = as.factor(household_size), y = total_spent, fill = as.factor(household_size))) +
  geom_bar(stat = "identity", show.legend = FALSE) +
  labs(title = "Total Milk Purchases by Household Size",
       x = "Household Size", y = "Total Sales") +
  theme_minimal()+
  scale_fill_discrete(name = "")

In the bar chart above, the largest consumers of milk based on household size is two, and decreases significantly as household size increases. This result makes sense as we learned that with more children in the household, the less likely that household is to buy milk. The household size plot reinforces that coupons should be distributed to households with more children.

milk_summary %>%
  mutate(income_factor = fct_collapse(income,
    "Under 100K" = c("Under 15K", "15-24K", "25-34K", "35-49K", "50-74K", "75-99K"),
    "Above 100K" = c("100-124K", "125-149K", "150-174K", "175-199K", "200-249K", "250K+")
  )) %>%
  mutate(year = floor_date(transaction_timestamp, unit = "year")) %>%
  group_by(income_factor, year) %>%
  summarize(total_spent = sum(total_spent, na.rm = TRUE), .groups = "drop") %>%
  ggplot(aes(x = income_factor, y = total_spent, fill = income_factor)) +
  geom_violin(alpha = 0.7, color = "black") +
  geom_point(aes(color = income_factor), position = position_dodge(width = 0), size = 3, alpha = 0.8) +
  scale_fill_manual(values = c("Under 100K" = "#1f77b4", "Above 100K" = "black"), name = "Income") +
  scale_color_manual(values = c("Under 100K" = "#1f77b4", "Above 100K" = "black"), name = "Income") +
  scale_y_continuous(labels = scales::comma) +
  labs(title = "Distribution of Milk Purchases by Income", x = "Income Bracket", y = "Total Sales") +
  theme_minimal(base_size = 14) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

The sales of milk among income level is very interesting. We can see that people who make above 100K spend way less on milk than people who makes less than 100K. While coupons may not be utilized by people who make over 100K, if they fit other household characteristics, such as having multiple children, they could still use the milk coupons.

milk_months <- milk_data %>%
  mutate(month = floor_date(ymd_hms(transaction_timestamp, quiet = TRUE), "month")) %>%
  group_by(month) %>%
  summarize(total_volume = sum(quantity), .groups = 'drop')

ggplot(milk_months, aes(x = month, y = total_volume)) +
  geom_line(stat = "identity", color = "purple") +
  labs(title = "Volume of Milk Purchases by Month",
       x = "Month", y = "Total Volume") +
  theme_minimal()

From the plot above, we can see that sales of milk by month are all relatively high, with the smallest volumes occurring in February, April, and November. For our coupon campaign, if we market about these three months, we should expect to see the total volume to increase among these months.

Now that we have identified characteristics of households that we should target, we are going to look at which campaign type was successful in 2017.

complete_data %>%
  ggplot(aes(x = campaign_type, fill = campaign_type)) +
  geom_bar() + 
  geom_text(
    stat = "count",
    aes(label = scales::comma(..count..)), 
    vjust = -0.5,
    size = 4
  ) +
  labs(
    title = "Campaign Type Distribution",
    x = "Campaign Type",
    y = "Count"
  ) +
  scale_fill_brewer(palette = "Set2") +  
  scale_y_continuous(expand = expansion(mult = c(0, 0.1))) + 
  theme_minimal() +
  theme(
    plot.title = element_text(hjust = 0.5, face = "bold", size = 14),  # Center and bold title
    axis.text.x = element_text(angle = 45, hjust = 1),  
    legend.position = "none"
  )

The plot above illustrates the count of coupons that were redeemed by households at our stores based on the campaign type that was used to market the coupons. There is minimal difference between type A and type B, so either one of these campaign types should be run to market coupons.

After analyzing the plots above to scan household characteristics and the type of campaign to distribute coupons, the summary tables below identify the households line up with the ideal characteristics.

Findings:

Age Summary Table:

milk_data%>%
  select(household_id,age,sales_value)%>%
  group_by(household_id,age)%>%
  summarize(sales_value = sum(sales_value,na.rm = TRUE))%>%
  arrange((sales_value))

## # A tibble: 751 x 3
## # Groups:   household_id [751]
##    household_id age   sales_value
##    <chr>        <ord>       <dbl>
##  1 1219         19-24        2   
##  2 460          65+          3.6 
##  3 621          45-54        3.78
##  4 1604         35-44        3.99
##  5 1499         55-64        4.24
##  6 2390         25-34        4.42
##  7 1076         25-34        4.54
##  8 2179         45-54        4.65
##  9 823          35-44        4.96
## 10 48           19-24        5.58
## # ... with 741 more rows

Kid Count Summary Table:

milk_data%>%
  filter(kids_count != 0)%>%
  select(household_id,kids_count,sales_value)%>%
  group_by(household_id,kids_count)%>%
  summarize(sales_value = sum(sales_value,na.rm = TRUE))%>%
  arrange((sales_value))

## # A tibble: 261 x 3
## # Groups:   household_id [261]
##    household_id kids_count sales_value
##    <chr>        <ord>            <dbl>
##  1 1219         1                 2   
##  2 2390         1                 4.42
##  3 2179         1                 4.65
##  4 823          1                 4.96
##  5 2092         1                 5.97
##  6 302          1                 9.47
##  7 1120         2                15.5 
##  8 71           1                15.8 
##  9 1021         3+               15.9 
## 10 378          1                17.6 
## # ... with 251 more rows

Household Size Table:

milk_data%>%
  select(household_id,household_size,sales_value)%>%
  group_by(household_id,household_size)%>%
  summarize(sales_value = sum(sales_value,na.rm = TRUE))%>%
  arrange((sales_value))

## # A tibble: 751 x 3
## # Groups:   household_id [751]
##    household_id household_size sales_value
##    <chr>        <ord>                <dbl>
##  1 1219         3                     2   
##  2 460          1                     3.6 
##  3 621          1                     3.78
##  4 1604         2                     3.99
##  5 1499         2                     4.24
##  6 2390         2                     4.42
##  7 1076         2                     4.54
##  8 2179         3                     4.65
##  9 823          3                     4.96
## 10 48           2                     5.58
## # ... with 741 more rows

Household Income Level Table:

milk_data%>%
 mutate(income_factor =fct_collapse(income,
"Under 100K"=c("Under 15K","15-24K","25-34K","35-49K","50-74K","75-99K"),
 "Above 100K"=c("100-124K","125-149K","150-174K","175-199K","200-249K","250K+")))%>%
  group_by(household_id,income_factor)%>%
  summarize(sales_value=sum(sales_value))%>%
  arrange((sales_value))

## # A tibble: 751 x 3
## # Groups:   household_id [751]
##    household_id income_factor sales_value
##    <chr>        <ord>               <dbl>
##  1 1219         Under 100K           2   
##  2 460          Under 100K           3.6 
##  3 621          Under 100K           3.78
##  4 1604         Under 100K           3.99
##  5 1499         Under 100K           4.24
##  6 2390         Under 100K           4.42
##  7 1076         Under 100K           4.54
##  8 2179         Under 100K           4.65
##  9 823          Under 100K           4.96
## 10 48           Under 100K           5.58
## # ... with 741 more rows

The summary tables above show the households that purchased the least amount of milk based on the ideal characteristics in the plots above. Simply put, if a household id is present in the first few rows of the four summary tables above, it is an indication that milk consumption is is low there, and coupons should be distributed around February, April, and November, utilizing campaign Type A or B.

While this analysis was for milk only, with a little tweek in the first code chunk in the “Exploratory Data Analysis” and changing the name of any table from “milk_######” to the product type of interest. For the first code chunk, simply replace “milk” following the grepl function, with the product type of interest.

Milk Analysis