Final Project

Introduction

Ice cream sales make up a small, but significant portion of overall profit in Regork stores. Totaling ~$55000 in total sales per year (1.1% of the total yearly profit across all products), we would like to find opportunities to increase sales of these products. For this project, we will examine the overall ice cream sale numbers across the entirety of Regork. Then, we will examine possible areas of growth, namely in the handling of ice cream toppings. Then, we will focus on other factors potentially affecting ice cream sales and build a model for our typical customer.

Required Libraries

For this project, we will be using the following libraries and data sets.

library(tidyverse)                  #For useful statistics handling
#install.packages("DT")             #For scrolling data tables
library(DT)
#install.packages("gridExtra")      #To help with putting plots next to eachother
library(gridExtra)
library(lubridate)                  #For time related functionality
library(ggplot2)                    #For other graphs
#install.packages("paletteer")      #For custom color pallet
library(paletteer)
#install.packages("rmdformats")     #For the cool site design 
                                    #(https://github.com/juba/rmdformats)

library(completejourney)            #For data set (https://cran.r-project.org/web/packages/
                                    #completejourney/vignettes/completejourney.html)

transactions <- get_transactions()  #Gets the complete transaction data
promotions <- get_promotions()      #Gets the complete promotions data
#demographics                       #The demographics data
#products                           #The products data
#coupons_descriptions               #For descriptions
#coupon_redemptions                 #For coupon redemption data

#combines both data sets
transact_prod <- transactions %>%
  inner_join(products, by = "product_id")
#Example of transaction/product data
DT::datatable(head(transact_prod, n=5),
              caption = "Short Example of the kinds of data in data set") 

Exploratory Data Analysis

Ice Cream Sales Numbers

Ice cream sales can be broken down into 11 product types and 6 different categories of products.

ice_cream_list <- 
  transact_prod %>% 
  filter(str_detect(product_category, regex("ice cream", ignore_case = TRUE))|
           str_detect(product_type, regex("ice cream", ignore_case = TRUE))
         )

print("------------------------Ice Cream Product Types-------------------------------")
unique(ice_cream_list$product_type)
print("---------------------Ice Cream Product Categories-----------------------------")
unique(ice_cream_list$product_category)
## [1] "------------------------Ice Cream Product Types-------------------------------"
##  [1] "QUARTS"                         "TRADITIONAL"                   
##  [3] "PREMIUM"                        "PREMIUM PINTS"                 
##  [5] "SUPER PREMIUM PINTS"            "ICE CREAM TOPPINGS"            
##  [7] "ICE CREAM SANDWICHES"           "PAILS"                         
##  [9] "FROZEN ICE CREAM"               "CAKES: ICE CREAM"              
## [11] "FROZEN DESSERT (ICE CREAM CAKE"
## [1] "---------------------Ice Cream Product Categories-----------------------------"
## [1] "ICE CREAM/MILK/SHERBTS" "SYRUPS/TOPPINGS"        "FRZN NOVELTIES/WTR ICE"
## [4] "FROZEN"                 "CAKES"                  "BAKERY PARTY TRAYS"

Of these, two of the products are ice cream cakes, another includes all ice cream sandwiches, another represents toppings, and the rest represent typical ice cream products (albeit, in a variety of different sizes and price points).

Looking into each of these categories, we can examine spending habits by customers.

#---------------------------------------Data for totals---------------------------------
summary_data <- ice_cream_list %>%
  group_by(product_type) %>%
  summarize(
    sales = sum(sales_value, na.rm=TRUE),
    baskets = n_distinct(basket_id, na.rm = TRUE),
    quantity = sum(quantity, na.rm =TRUE)
    ) 

Sales Data

# ------------------------------------- bar chart --------------------------------------
#This is the output for the bar chart. Again, it is reactive. RenderPlot() lets the 
#chart get rendered in real time
s1 <- ggplot(summary_data,
       aes(x = reorder(product_type, -sales),       #puts data in dec order
           y = sales)) +
  geom_col(fill = "#9F2042") +
  labs(
    x = "Product Type",
    y = "sales",
    title = "Ice Cream Purchases by Product Type"
  ) + 
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#---------------------------------------Stacked box-------------------------------------
s2 <- ggplot(summary_data,
       aes(x = "Break Down", y = sales, fill = product_type)) +
  geom_col(width = 1) +
  labs(
    x = NULL,
    y = "sales",
    fill = "Amt",
    title = "Breakdown By Type"
  ) +
  scale_fill_manual(values = paletteer_d("RColorBrewer::RdGy")) +
  theme_minimal()

grid.arrange(s1,s2, ncol=2)

Quantity Data

# ------------------------------------- bar chart --------------------------------------
#This is the output for the bar chart. Again, it is reactive. RenderPlot() lets the 
#chart get rendered in real time
q1 <- ggplot(summary_data,
       aes(x = reorder(product_type, -quantity),       #puts data in dec order
           y = quantity)) +
  geom_col(fill = "#9F2042") +
  labs(
    x = "Product Type",
    y = "quantity",
    title = "Ice Cream Purchases by Product Type"
  ) + 
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#---------------------------------------Stacked box-------------------------------------
q2 <- ggplot(summary_data,
       aes(x = "Break Down", y = quantity, fill = product_type)) +
  geom_col(width = 1) +
  labs(
    x = NULL,
    y = "quantity",
    fill = "Amt",
    title = "Breakdown By Type"
  ) +
  scale_fill_manual(values = paletteer_d("RColorBrewer::RdGy")) +
  theme_minimal()

grid.arrange(q1,q2, ncol=2)

Baskets Data

# ------------------------------------- bar chart --------------------------------------
#This is the output for the bar chart. Again, it is reactive. RenderPlot() lets the 
#chart get rendered in real time
b1 <- ggplot(summary_data,
       aes(x = reorder(product_type, -baskets),       #puts data in dec order
           y = baskets)) +
  geom_col(fill = "#9F2042") +
  labs(
    x = "Product Type",
    y = "baskets",
    title = "Ice Cream Purchases by Product Type"
  ) + 
  theme_minimal()+
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

#---------------------------------------Stacked box-------------------------------------
b2 <- ggplot(summary_data,
       aes(x = "Break Down", y = baskets, fill = product_type)) +
  geom_col(width = 1) +
  labs(
    x = NULL,
    y = "baskets",
    fill = "Amt",
    title = "Breakdown By Type"
  ) +
  scale_fill_manual(values = paletteer_d("RColorBrewer::RdGy")) +
  theme_minimal()

grid.arrange(b1,b2, ncol=2)

We see from these numbers that the largest money makers when it comes to ice cream sales are in the Premium, Traditional, and super Premium pints categories, with Premium wracking slightly under half of all total sales. Non standard items like ice cream sandwiches and cakes rank much further down the list. We can also see that ice cream toppings sales are much further down the list, in 8th place. This suggests that initiatives aimed at increasing standard ice cream sales may increase sales more than initiatives addressed to lesser earning categories.

Ice Cream Products Correlations

An interesting question is whether there is any correlation between ice cream sales, and ice cream toppings sales. Were such a correlation to exist, increasing ice cream sales could, potentially, lead to additional income via increases in ice cream topping sales. Among those who purchase ice cream products, we can see the following correlations.

#-------------------------List of sales numbers per category---------------------------
correlation_data <- ice_cream_list %>%
  group_by(basket_id)%>%
  summarize(
    premium = any(product_type == "PREMIUM"),
    traditional = any(product_type == "TRADITIONAL"),
    sup_prem = any(product_type == "SUPER PREMIUM PINTS"),
    toppings = any(product_type == "ICE CREAM TOPPINGS"),
    ice_sand = any(product_type == "ICE CREAM SANDWICHES"),
    quart = any(product_type == "QUARTS"),
    fro_ice = any(product_type == "FROZEN ICE CREAM"),
    fro_des = any(product_type == "FROZEN DESSERT (ICE CREAM CAKE"),
    pale = any(product_type == "PAILS"),
    cake = any(product_type == "CAKES: ICE CREAM"),
    prem_pin = any(product_type == "PREMIUM PINTS")
  )

#-------------------------Turn it into a contingency table-----------------------------
tab_premium <- table(correlation_data$premium, correlation_data$toppings)
tab_trad <- table(correlation_data$traditional, correlation_data$toppings)
tab_sup <- table(correlation_data$sup_prem, correlation_data$toppings)
tab_ice_sand <- table(correlation_data$ice_sand, correlation_data$toppings)
tab_quart <- table(correlation_data$quart, correlation_data$toppings)
tab_fro_ice <- table(correlation_data$fro_ice, correlation_data$toppings)
tab_fro_des <- table(correlation_data$fro_des, correlation_data$toppings)
tab_pale <- table(correlation_data$pale, correlation_data$toppings)
tab_cake <- table(correlation_data$cake, correlation_data$toppings)
tab_prem_pin <- table(correlation_data$prem_pin, correlation_data$toppings)

#-----------------------------------Plots---------------------------------------------
#You are welcome for all these mosaic plots (10 of them in all)
par(mfrow = c(1, 3))
mosaicplot(tab_premium, 
           main = "Premium vs Toppings",
           xlab = "Customer bought Premium Ice Cream",
           ylab = "Customer bought Toppings",
           color = TRUE)

mosaicplot(tab_trad,
           main = "Traditional vs Toppings",
           xlab = "Customer bought Traditional Ice Cream",
           ylab = "Customer bought Toppings",
           color = TRUE)

mosaicplot(tab_sup,
           main = "Super Premium vs Toppings",
           xlab = "Customer bought Super Premium Ice Cream",
           ylab = "Customer bought Toppings",
           color = TRUE)

mosaicplot(tab_ice_sand,
           main = "Ice Cream Sandwiches vs Toppings",
           xlab = "Customer bought Premium Ice Cream Sandwiches",
           ylab = "Customer bought Toppings",
           color = TRUE)

mosaicplot(tab_quart,
           main = "Quarts vs Toppings",
           xlab = "Customer bought Traditional Quarts",
           ylab = "Customer bought Toppings",
           color = TRUE)

mosaicplot(tab_fro_ice,
           main = "Frozen Ice vs Toppings",
           xlab = "Customer bought Frozen Ice",
           ylab = "Customer bought Toppings",
           color = TRUE)

mosaicplot(tab_fro_des,
           main = "Frozen Dessert vs Toppings",
           xlab = "Customer bought Frozen Dessert",
           ylab = "Customer bought Toppings",
           color = TRUE)

mosaicplot(tab_pale,
           main = "Pails vs Toppings",
           xlab = "Customer bought Pails",
           ylab = "Customer bought Toppings",
           color = TRUE)

mosaicplot(tab_cake,
           main = "Ice Cream Cake m vs Toppings",
           xlab = "Customer bought Ice Cream Cake",
           ylab = "Customer bought Toppings",
           color = TRUE)

mosaicplot(tab_prem_pin,
           main = "Premium Pints vs Toppings",
           xlab = "Customer bought Premium Pints",
           ylab = "Customer bought Toppings",
           color = TRUE)

par(mfrow = c(1, 1))
#------------------ The degree to which toppings and x are correlated ------------------
correlations <- c(
cor(as.numeric(correlation_data$premium),
                   as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$traditional),
                   as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$sup_prem),
                   as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$ice_sand),
                   as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$quart),
                   as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$fro_ice),
                   as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$fro_des),
                   as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$pale),
                   as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$cake),
                   as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$prem_pin),
                   as.numeric(correlation_data$toppings))
)

categories <- c("premium","traditional","sup_prem","ice_sand","quart",
"fro_ice","fro_des","pale","cake","prem_pin")
#Shows findings
head(data.frame(types = categories, correlations = correlations), n=10)
##          types correlations
## 1      premium -0.026200377
## 2  traditional -0.017680231
## 3     sup_prem -0.045279163
## 4     ice_sand -0.046441584
## 5        quart -0.022766255
## 6      fro_ice -0.011303472
## 7      fro_des -0.006430965
## 8         pale -0.008474459
## 9         cake -0.013728845
## 10    prem_pin -0.050688323

Curiously, the data suggests that not only is there no correlation, there is a slight negative correlation between ice cream toppings and ice cream! Expanding our view to look at connections with other items reveals that we simply don’t have enough data to draw good correlations.

#Shrinks the data set a little bit
filtered <- transact_prod %>%
  filter(department == "GROCERY" | department == "BAKERY")

#Gets a list of baskets that have toppings.
has_toppings <- filtered %>%
  mutate(is_topping = str_detect(product_type, "ICE CREAM TOPPINGS")) %>%
  group_by(basket_id) %>%
  summarize(has_topping = any(is_topping))

#Gets a list of products that were bought alongside ice cream toppings
item_bought_with_ice_cream_toppings <- filtered %>%
  inner_join(has_toppings, by="basket_id")%>%
  group_by(product_category) %>%
  summarise(
    baskets_with_type = n_distinct(basket_id),
    baskets_with_both = n_distinct(basket_id[has_toppings == TRUE]),
    prob_toppings_given_type = baskets_with_both / baskets_with_type
  ) %>%
  arrange(desc(baskets_with_both))
DT::datatable(item_bought_with_ice_cream_toppings, 
              caption = "Amount of baskets with toppings and this item type")
#People who only bought ice cream toppings.
has_toppings_only <- filtered %>%
  group_by(basket_id) %>%
  filter(all(product_type == "ICE CREAM TOPPINGS")) %>%
  select(basket_id, quantity, product_type)
DT::datatable(has_toppings_only, 
              caption = "Baskets that only have toppings")

As we can see from the data, ice cream toppings are a fairly rare buy in our data set. No category shown demonstrates multiple baskets containing ice cream toppings and an item from that category.

Ice Cream Coupon Data

The next thing to check is what types of coupons exist for ice cream and whether they are used. The following code shows the number of coupon upc codes that have to do with ice cream. Then, we join some data together to get a list of used coupons.

#How many coupons are there for ice cream?
num <- ice_cream_list %>%
  inner_join(coupons, by = "product_id") %>%
  select(coupon_upc) %>%
  n_distinct()
print(paste0("Total number of coupons for ice cream: ",num))
## [1] "Total number of coupons for ice cream: 21"
#------------------------Data on which coupons were used-------------------------------
num_coupons_used <- coupon_redemptions %>% 
  tibble::rowid_to_column("ID") %>%
  inner_join(coupons, by = c("coupon_upc", "campaign_id")) %>%
  inner_join(ice_cream_list, by= c("household_id", "product_id")) %>%
  inner_join(campaign_descriptions, by="campaign_id") %>%
  filter(yday(redemption_date) == yday(transaction_timestamp)) %>%
  group_by(campaign_type) %>%
  summarize(total = n_distinct(ID))
  

#---------------------------Data on existing coupons-----------------------------------
num_coupons_by_type <- ice_cream_list %>%
  inner_join(coupons, by = "product_id") %>%
  inner_join(campaign_descriptions, by = "campaign_id") %>%
  select(c("coupon_upc","campaign_type")) %>% 
  group_by(campaign_type) %>%
  summarize(total = n_distinct(coupon_upc))

The following plots show coupon usage by type and total amount of coupons by type. We can see from the data that type A coupons are most prevalent, and also most used. Considering type A coupons are based on past buying behavior, this suggests that targeted ice cream coupon campaigns would be most likely to raise sales. We do not know whether coupons could assist in increasing ice cream toppings sales, say with a “get 1 topping 50% of with a purchase of some ice cream coupon”.

#------------------------------------Total number of coupons----------------------------
p1 <- ggplot(
  num_coupons_by_type,
  aes( x = campaign_type, y = total)
) + 
  geom_col(fill = "#9F2042") +
  labs(x = "Campaign Type",
       y = "Number of Ice Cream Coupons",
       title = "Number of Ice Cream Coupons by Campaign Type") +
  annotate("text", x = 2, y = 15, label = " \"3 coupons are both\" ")

#------------------------------------Number of coupons used-----------------------------
p2 <- ggplot(
  num_coupons_used,
  aes(x = campaign_type, y = total)
) +
  geom_col(fill = "#9F2042") +
  labs(x = "Campaign Type",
       y = "Number of Coupons used",
       title = "Number of Ice Cream Coupons Used")

grid.arrange(p1,p2, ncol=2)

Check Ice Cream Sales over time

We now transition to studying ice cream sales over time. The following graphs show total ice cream sales over time. Curiously, the numbers are largely consistent, albeit with a small increase in sales around July. We have not examined what kinds of ad campaigns occur throughout the year, nor how successful they have been. It’s possible ad campaigns may be more successful for ice cream in the summer, despite the data shown here.

#-------------------------------------Time series data----------------------------------
time_series_data <- ice_cream_list %>%
  select(c("basket_id", "quantity", "sales_value", "transaction_timestamp", "product_type")) %>%
  mutate(transaction_date = as.Date(floor_date(transaction_timestamp, "week"))) %>%
  group_by(transaction_date, product_type) %>%
  summarize(total_sales = sum(sales_value), .groups = "drop")

# ---------------------------Total amount of ice cream overall--------------------------
cum_totals <- time_series_data %>%
  group_by(product_type) %>%
  summarize(cum_total = sum(total_sales)) %>%
  arrange(desc(cum_total))

# -------------Allows the lines to be sorted from biggest area to smallest area---------
time_series_data <- time_series_data %>%
  mutate(product_type = factor(product_type,
                               levels = cum_totals$product_type))

Individual Sales Over Time

The following data shows weekly sales for individual products over time. (Change tabs to see a different graph)

#--------------------------------Sales per product type per week------------------------
ggplot(time_series_data,
       aes(x = transaction_date,
           y = total_sales,
           color = product_type)) +
  geom_point() +
  labs(x = "Date",
       y = "Total Sales Value Per Week",
       title = "Total Sales Value Per Product Over Time",
       color= "Product Type") +
  scale_color_manual(values = paletteer_d("RColorBrewer::RdGy"))

Cumulative Sales Over Time

The following data shows total weekly sales over time for all products. (Change tabs to see a different graph)

#--------------------------------------Cumulative Sales--------------------------------
ggplot() +
  geom_area(data = time_series_data,
            aes(x = transaction_date,
           y= total_sales,
           fill = product_type)) +
  labs(title = "Cumulative Ice Cream Sales Over Time",
       x = "Week",
       y = "Total Sales Value Across all Products",
       fill = "Product Type") +
  scale_fill_manual(values = paletteer_d("RColorBrewer::RdGy"))

Demographics

Finally, we examine the ways in which age, income, and children appear to affect ice cream sales.

Age

#----------------------------------Age Graph vs Toppings--------------------------------
age_list <- ice_cream_list %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(age) %>%
  summarize(total = sum(quantity))

ggplot( age_list, aes(x = age, y = total)) +
  geom_col(fill = ifelse(age_list$age == "35-44"|age_list$age == "45-54",
                         "#66142a", "#9F2042")) +
  coord_flip() +
  labs(x = "Age",
       y = "Total Quantity Bought",
       title = "Total Quantity of Ice Cream Products versus Age") +
  annotate(
    "segment",
    x = 1.5,
    y = 2500,
    xend = 4, yend = 3500,
    arrow = arrow(length = unit(0.25, "cm"))
    ) +
  annotate(
    "text",
    x = 1,
    y = 2500,
    label = "Most ice cream sales involve younger and middle-aged people",
    )

Income

#----------------------------------Income Graph vs Toppings-----------------------------
income_list <- ice_cream_list %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(income) %>%
  summarize(total = sum(quantity))

ggplot( income_list, aes(x = income, y = total)) +
  geom_col(fill = ifelse(income_list$income == "35-49K"|income_list$income == "50-74K",
                         "#66142a", "#9F2042")) +
  coord_flip() +
  labs(x = "Income Bracket",
       y = "Total Quantity Bought",
       title = "Total Quantity of Ice Cream Products versus Income")+
  annotate(
    "segment",
    x = 8.5,
    y = 1500,
    xend = 4, yend = 2000,
    arrow = arrow(length = unit(0.25, "cm"))
    ) +
  annotate(
    "text",
    x = 9,
    y = 1600,
    label = "Most ice cream sales involve those making 35k-75k income",
    )

Kids Count

#----------------------------------Kids Graph vs Toppings-------------------------------
has_kids_list <- ice_cream_list %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(kids_count) %>%
  summarize(total = sum(quantity))

ggplot( has_kids_list, aes(x = kids_count, y = total)) +
  geom_col(fill = ifelse(has_kids_list$kids_count == "0", "#66142a", "#9F2042")) +
  coord_flip() +
  labs(x = "Number of Children",
       y = "Total Quantity Bought",
       title = "Total Quantity of Ice Cream Products versus Number of Children")+
  annotate(
    "segment",
    x = 3.2,
    y = 4000,
    xend = 1, yend = 6000,
    arrow = arrow(length = unit(0.25, "cm"))
    ) +
  annotate(
    "text",
    x = 3.5,
    y = 4000,
    label = "Most ice cream sales involve those with no children",
    )

The picture we get from these plots is that most ice cream sales appear to come from those who are

  1. Middle aged
  2. Earn between 35 and 74 thousand dollars
  3. Have 0 kids

It is unclear whether these demographics are more likely to buy ice cream, or whether they simply make up a larger portion of our data set.

Summary and Limitations

Recall that our goal for this project was to come up with ways of increasing ice cream profits. Based on the data, we can make a couple of recommendations. First of all, most ice cream sales appear to be due to 3 types of ice cream sales. We do not know for what reason non-premium ice creams bring in less customers, however, its possible that focusing advertising efforts on those few money makes may be more profitable.

We also examined the degree to which improving ice cream sales may help with improving topping sales. The data seems to suggest that the answer is no, however, it’s likely that our data set is not large enough to answer this question.

We also examined coupons and found that personalized coupons seemed to be used more often than non personalized coupons when it comes to ice cream. It’s unclear whether coupons themselves raise profits, or whether these customers would have bought ice cream anyway. To answer this question, more data would be needed.

Then, we moved on to analyzing ice cream sales over time. We found that there is a slight bump during July, however, ice cream sales were more or less constant. This suggests that the market is there year round for ice cream, however, it’s unclear to what degree marketing campaigns might work in different times of year. To help answer this question, we would need more data.

Finally, we found that most of our ice cream sales came from those who are middle aged, middle-low income and who have no kids. Again, we do not know whether this group is more likely to purchase ice cream, or whether they are simply more common in our data.

Our suggestion at the present time is to focus on coupon campaigns towards proven buyers of ice cream, since those campaigns have been shown to be more effective. We should also focus marketing efforts towards those demographics listed above. Finally, it’s possible the coupon campaigns intended to increase toppings sales could be paired with ice cream coupons. However, evidence is scant right now on how effective such a campaign could be.