Final Project
Introduction
Ice cream sales make up a small, but significant portion of overall profit in Regork stores. Totaling ~$55000 in total sales per year (1.1% of the total yearly profit across all products), we would like to find opportunities to increase sales of these products. For this project, we will examine the overall ice cream sale numbers across the entirety of Regork. Then, we will examine possible areas of growth, focusing on ice cream and toppings. Finally we will look at other factors potentially affecting ice cream sales and build a model for our typical customer.
Required Libraries
For this project, we will be using the following libraries and data sets.
library(tidyverse) #For useful statistics handling
#install.packages("DT") #For scrolling data tables
library(DT)
#install.packages("gridExtra") #To help with putting plots next to each other
library(gridExtra)
library(lubridate) #For time related functionality
library(ggplot2) #For other graphs
#install.packages("paletteer") #For custom color palette
library(paletteer)
#install.packages("rmdformats") #For the cool site design
#(https://github.com/juba/rmdformats)
library(completejourney) #For data set (https://cran.r-project.org/web/packages/
#completejourney/vignettes/completejourney.html)
transactions <- get_transactions() #Gets the complete transaction data
promotions <- get_promotions() #Gets the complete promotions data
#demographics #The demographics data
#products #The products data
#coupons_descriptions #For descriptions
#coupon_redemptions #For coupon redemption data
#combines both data sets
transact_prod <- transactions %>%
inner_join(products, by = "product_id")Exploratory Data Analysis
Ice Cream Sales Numbers
Ice cream sales can be broken down into 6 different categories of products and 11 product types.
ice_cream_list <-
transact_prod %>%
filter(str_detect(product_category, regex("ice cream", ignore_case = TRUE))|
str_detect(product_type, regex("ice cream", ignore_case = TRUE))
)
print("---------------------Ice Cream Product Categories-----------------------------")
unique(ice_cream_list$product_category)
print("------------------------Ice Cream Product Types-------------------------------")
unique(ice_cream_list$product_type)## [1] "---------------------Ice Cream Product Categories-----------------------------"
## [1] "ICE CREAM/MILK/SHERBTS" "SYRUPS/TOPPINGS" "FRZN NOVELTIES/WTR ICE"
## [4] "FROZEN" "CAKES" "BAKERY PARTY TRAYS"
## [1] "------------------------Ice Cream Product Types-------------------------------"
## [1] "QUARTS" "TRADITIONAL"
## [3] "PREMIUM" "PREMIUM PINTS"
## [5] "SUPER PREMIUM PINTS" "ICE CREAM TOPPINGS"
## [7] "ICE CREAM SANDWICHES" "PAILS"
## [9] "FROZEN ICE CREAM" "CAKES: ICE CREAM"
## [11] "FROZEN DESSERT (ICE CREAM CAKE"
Of these product types, two are ice cream cakes, another includes all ice cream sandwiches, another represents toppings, and the rest represent typical ice cream products in a variety of sizes and price points.
Looking into each of these product types, we can examine spending habits by customers.
#---------------------------------------Data for totals---------------------------------
summary_data <- ice_cream_list %>%
group_by(product_type) %>%
summarize(
sales = sum(sales_value, na.rm=TRUE),
baskets = n_distinct(basket_id, na.rm = TRUE),
quantity = sum(quantity, na.rm =TRUE)
) Sales Data
# ------------------------------------- bar chart --------------------------------------
#This is the output for the bar chart.
s1 <- ggplot(summary_data,
aes(x = reorder(product_type, -sales), #puts data in dec order
y = sales)) +
geom_col(fill = "#9F2042") +
labs(
x = "Product Type",
y = "sales",
title = "Ice Cream Purchases by Product Type"
) +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#---------------------------------------Stacked box-------------------------------------
s2 <- ggplot(summary_data,
aes(x = "Break Down", y = sales, fill = product_type)) +
geom_col(width = 1) +
labs(
x = NULL,
y = "sales",
fill = "Amt",
title = "Breakdown By Type"
) +
scale_fill_manual(values = paletteer_d("RColorBrewer::RdGy")) +
theme_minimal()
grid.arrange(s1,s2, ncol=2)Quantity Data
# ------------------------------------- bar chart --------------------------------------
#This is the output for the bar chart.
q1 <- ggplot(summary_data,
aes(x = reorder(product_type, -quantity), #puts data in dec order
y = quantity)) +
geom_col(fill = "#9F2042") +
labs(
x = "Product Type",
y = "quantity",
title = "Ice Cream Purchases by Product Type"
) +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#---------------------------------------Stacked box-------------------------------------
q2 <- ggplot(summary_data,
aes(x = "Break Down", y = quantity, fill = product_type)) +
geom_col(width = 1) +
labs(
x = NULL,
y = "quantity",
fill = "Amt",
title = "Breakdown By Type"
) +
scale_fill_manual(values = paletteer_d("RColorBrewer::RdGy")) +
theme_minimal()
grid.arrange(q1,q2, ncol=2)Baskets Data
# ------------------------------------- bar chart --------------------------------------
#This is the output for the bar chart.
b1 <- ggplot(summary_data,
aes(x = reorder(product_type, -baskets), #puts data in dec order
y = baskets)) +
geom_col(fill = "#9F2042") +
labs(
x = "Product Type",
y = "baskets",
title = "Ice Cream Purchases by Product Type"
) +
theme_minimal()+
theme(axis.text.x = element_text(angle = 45, hjust = 1))
#---------------------------------------Stacked box-------------------------------------
b2 <- ggplot(summary_data,
aes(x = "Break Down", y = baskets, fill = product_type)) +
geom_col(width = 1) +
labs(
x = NULL,
y = "baskets",
fill = "Amt",
title = "Breakdown By Type"
) +
scale_fill_manual(values = paletteer_d("RColorBrewer::RdGy")) +
theme_minimal()
grid.arrange(b1,b2, ncol=2)We see from these numbers that the largest moneymakers when it comes to ice cream sales are in the Premium, Traditional, and super Premium pints categories, with Premium wracking slightly under half of all total sales. Non-standard items like ice cream sandwiches and cakes rank much further down the list. We can also see that ice cream toppings sales are much further down the list, in eighth place. This suggests that initiatives aimed at increasing standard ice cream sales may increase sales more than initiatives addressed to lesser earning categories.
Ice Cream Products Correlations
An interesting question is whether there is any correlation between ice cream sales and ice cream toppings sales. Were such a correlation to exist, increasing ice cream sales could, potentially, lead to additional income via increases in ice cream topping sales. Among those who purchased ice cream products, we find some interesting correlations. Notice that more people bought toppings without ice cream, than bought both toppings and ice cream for each of the ice cream types.
#-------------------------List of sales numbers per category---------------------------
correlation_data <- ice_cream_list %>%
group_by(basket_id)%>%
summarize(
premium = any(product_type == "PREMIUM"),
traditional = any(product_type == "TRADITIONAL"),
sup_prem = any(product_type == "SUPER PREMIUM PINTS"),
toppings = any(product_type == "ICE CREAM TOPPINGS"),
ice_sand = any(product_type == "ICE CREAM SANDWICHES"),
quart = any(product_type == "QUARTS"),
fro_ice = any(product_type == "FROZEN ICE CREAM"),
fro_des = any(product_type == "FROZEN DESSERT (ICE CREAM CAKE"),
pale = any(product_type == "PAILS"),
cake = any(product_type == "CAKES: ICE CREAM"),
prem_pin = any(product_type == "PREMIUM PINTS")
)
#-------------------------Turn it into a contingency table-----------------------------
tab_premium <- table(correlation_data$premium, correlation_data$toppings)
tab_trad <- table(correlation_data$traditional, correlation_data$toppings)
tab_sup <- table(correlation_data$sup_prem, correlation_data$toppings)
tab_ice_sand <- table(correlation_data$ice_sand, correlation_data$toppings)
tab_quart <- table(correlation_data$quart, correlation_data$toppings)
tab_fro_ice <- table(correlation_data$fro_ice, correlation_data$toppings)
tab_fro_des <- table(correlation_data$fro_des, correlation_data$toppings)
tab_pale <- table(correlation_data$pale, correlation_data$toppings)
tab_cake <- table(correlation_data$cake, correlation_data$toppings)
tab_prem_pin <- table(correlation_data$prem_pin, correlation_data$toppings)
#-----------------------------------Plots---------------------------------------------
#You are welcome for all these mosaic plots (10 of them in all)
par(mfrow = c(1, 3))
mosaicplot(tab_premium,
main = "Premium vs Toppings",
xlab = "Customer bought Premium Ice Cream",
ylab = "Customer bought Toppings",
color = TRUE)
mosaicplot(tab_trad,
main = "Traditional vs Toppings",
xlab = "Customer bought Traditional Ice Cream",
ylab = "Customer bought Toppings",
color = TRUE)
mosaicplot(tab_sup,
main = "Super Premium vs Toppings",
xlab = "Customer bought Super Premium Ice Cream",
ylab = "Customer bought Toppings",
color = TRUE)
mosaicplot(tab_ice_sand,
main = "Ice Cream Sandwiches vs Toppings",
xlab = "Customer bought Premium Ice Cream Sandwiches",
ylab = "Customer bought Toppings",
color = TRUE)
mosaicplot(tab_quart,
main = "Quarts vs Toppings",
xlab = "Customer bought Traditional Quarts",
ylab = "Customer bought Toppings",
color = TRUE)
mosaicplot(tab_fro_ice,
main = "Frozen Ice vs Toppings",
xlab = "Customer bought Frozen Ice",
ylab = "Customer bought Toppings",
color = TRUE)
mosaicplot(tab_fro_des,
main = "Frozen Dessert vs Toppings",
xlab = "Customer bought Frozen Dessert",
ylab = "Customer bought Toppings",
color = TRUE)
mosaicplot(tab_pale,
main = "Pails vs Toppings",
xlab = "Customer bought Pails",
ylab = "Customer bought Toppings",
color = TRUE)
mosaicplot(tab_cake,
main = "Ice Cream Cake m vs Toppings",
xlab = "Customer bought Ice Cream Cake",
ylab = "Customer bought Toppings",
color = TRUE)
mosaicplot(tab_prem_pin,
main = "Premium Pints vs Toppings",
xlab = "Customer bought Premium Pints",
ylab = "Customer bought Toppings",
color = TRUE)
par(mfrow = c(1, 1))#------------------ The degree to which toppings and x are correlated ------------------
correlations <- c(
cor(as.numeric(correlation_data$premium),
as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$traditional),
as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$sup_prem),
as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$ice_sand),
as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$quart),
as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$fro_ice),
as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$fro_des),
as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$pale),
as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$cake),
as.numeric(correlation_data$toppings)),
cor(as.numeric(correlation_data$prem_pin),
as.numeric(correlation_data$toppings))
)
categories <- c("premium","traditional","sup_prem","ice_sand","quart",
"fro_ice","fro_des","pale","cake","prem_pin")
#Shows findings
head(data.frame(types = categories, correlations = correlations), n=10)## types correlations
## 1 premium -0.026200377
## 2 traditional -0.017680231
## 3 sup_prem -0.045279163
## 4 ice_sand -0.046441584
## 5 quart -0.022766255
## 6 fro_ice -0.011303472
## 7 fro_des -0.006430965
## 8 pale -0.008474459
## 9 cake -0.013728845
## 10 prem_pin -0.050688323
Strangely, the data suggests that not only is there no correlation, there is a slight negative correlation between ice cream toppings and ice cream. Looking at correlations among the entire transactions data set reveals some more patterns in the data. We see from the data that toppings are bought with a number of products, though ice cream
#Gets a list of baskets that have toppings.
has_toppings <- transact_prod %>%
mutate(is_topping = str_detect(product_type, "ICE CREAM TOPPINGS")) %>%
group_by(basket_id) %>%
summarize(has_topping = any(is_topping))
#Gets a list of products that were bought alongside ice cream toppings
item_bought_with_ice_cream_toppings <- transact_prod %>%
inner_join(has_toppings, by="basket_id") %>%
group_by(product_type) %>%
summarise(
baskets_with_type = n_distinct(basket_id),
baskets_with_both = length(unique(basket_id[has_topping == TRUE])),
prob_toppings_given_type = baskets_with_both / baskets_with_type
) %>%
arrange(desc(baskets_with_both))
DT::datatable(item_bought_with_ice_cream_toppings,
caption = "Amount of baskets with toppings and this item type")#People who only bought ice cream toppings.
has_toppings_only <- transact_prod %>%
group_by(basket_id) %>%
filter(all(product_type == "ICE CREAM TOPPINGS")) %>%
select(basket_id, quantity) %>%
group_by(basket_id) %>%
summarize(quantitiy_of_toppings = sum(quantity))
DT::datatable(has_toppings_only,
caption = "Baskets that only have toppings")Ice Cream Coupon Data
The next thing to check is what types of coupons exist for ice cream and whether they are used. The following code shows the number of coupon upc codes that have to do with ice cream. Then, we join some data together to get a list of used coupons.
#How many coupons are there for ice cream?
num <- ice_cream_list %>%
inner_join(coupons, by = "product_id") %>%
select(coupon_upc) %>%
n_distinct()
print(paste0("Total number of coupons for ice cream: ",num))## [1] "Total number of coupons for ice cream: 21"
#------------------------Data on which coupons were used-------------------------------
num_coupons_used <- coupon_redemptions %>%
tibble::rowid_to_column("ID") %>%
inner_join(coupons, by = c("coupon_upc", "campaign_id")) %>%
inner_join(ice_cream_list, by= c("household_id", "product_id")) %>%
inner_join(campaign_descriptions, by="campaign_id") %>%
filter(yday(redemption_date) == yday(transaction_timestamp)) %>%
group_by(campaign_type) %>%
summarize(total = n_distinct(ID))
#---------------------------Data on existing coupons-----------------------------------
num_coupons_by_type <- ice_cream_list %>%
inner_join(coupons, by = "product_id") %>%
inner_join(campaign_descriptions, by = "campaign_id") %>%
select(c("coupon_upc","campaign_type")) %>%
group_by(campaign_type) %>%
summarize(total = n_distinct(coupon_upc))The following plots show coupon usage by type and total amount of
coupons by type. We can see from the data that type A coupons are most
prevalent, and also most used. Considering type A coupons are based on
past buying behavior, this suggests that targeted ice cream coupon
campaigns would be most likely to raise sales. We do not know whether
coupons could assist in increasing ice cream toppings sales, say with a
“get 1 topping 50% of with a purchase of some ice cream coupon”.
#------------------------------------Total number of coupons----------------------------
p1 <- ggplot(
num_coupons_by_type,
aes( x = campaign_type, y = total)
) +
geom_col(fill = "#9F2042") +
labs(x = "Campaign Type",
y = "Number of Ice Cream Coupons",
title = "Number of Ice Cream Coupons by Campaign Type") +
annotate("text", x = 2, y = 15, label = " Note: 3 coupons are \n both type A and B ")
#------------------------------------Number of coupons used-----------------------------
p2 <- ggplot(
num_coupons_used,
aes(x = campaign_type, y = total)
) +
geom_col(fill = "#9F2042") +
labs(x = "Campaign Type",
y = "Number of Coupons used",
title = "Number of Ice Cream Coupons Used")
grid.arrange(p1,p2, ncol=2)Check Ice Cream Sales over time
We now transition to studying ice cream sales over time. The following graphs show total ice cream sales over time. The numbers are largely consistent, albeit with a small increase in sales around July. We have not examined what kinds of ad campaigns occur throughout the year, nor how successful they have been. It’s possible ad campaigns may be more successful for ice cream in the summer, despite the data shown here.
#-------------------------------------Time series data----------------------------------
time_series_data <- ice_cream_list %>%
select(c("basket_id", "quantity", "sales_value", "transaction_timestamp", "product_type")) %>%
mutate(transaction_date = as.Date(floor_date(transaction_timestamp, "week"))) %>%
group_by(transaction_date, product_type) %>%
summarize(total_sales = sum(sales_value), .groups = "drop")
# ---------------------------Total amount of ice cream overall--------------------------
cum_totals <- time_series_data %>%
group_by(product_type) %>%
summarize(cum_total = sum(total_sales)) %>%
arrange(desc(cum_total))
# -------------Allows the lines to be sorted from biggest area to smallest area---------
time_series_data <- time_series_data %>%
mutate(product_type = factor(product_type,
levels = cum_totals$product_type))Individual Sales Over Time
#--------------------------------Sales per product type per week------------------------
ggplot(time_series_data,
aes(x = transaction_date,
y = total_sales,
color = product_type)) +
geom_point() +
labs(x = "Date",
y = "Total Sales Value Per Week",
title = "Total Sales Value Per Product Over Time",
color= "Product Type") +
scale_color_manual(values = paletteer_d("RColorBrewer::RdGy"))Cumulative Sales Over Time
#--------------------------------------Cumulative Sales--------------------------------
ggplot() +
geom_area(data = time_series_data,
aes(x = transaction_date,
y= total_sales,
fill = product_type)) +
labs(title = "Cumulative Ice Cream Sales Over Time",
x = "Week",
y = "Total Sales Value Across all Products",
fill = "Product Type") +
scale_fill_manual(values = paletteer_d("RColorBrewer::RdGy"))Demographics
Finally, we examine the ways in which age, income, and children appear to affect ice cream sales.
Age
#----------------------------------Age Graph vs Toppings--------------------------------
age_list <- ice_cream_list %>%
inner_join(demographics, by = "household_id") %>%
group_by(age) %>%
summarize(total = sum(quantity))
ggplot( age_list, aes(x = age, y = total)) +
geom_col(fill = ifelse(age_list$age == "35-44"|age_list$age == "45-54",
"#66142a", "#9F2042")) +
coord_flip() +
labs(x = "Age",
y = "Total Quantity Bought",
title = "Total Quantity of Ice Cream Products versus Age") +
annotate(
"segment",
x = 1.5,
y = 2500,
xend = 4, yend = 3500,
arrow = arrow(length = unit(0.25, "cm"))
) +
annotate(
"text",
x = 1,
y = 2500,
label = "Most ice cream sales involve younger and middle-aged people",
)Income
#----------------------------------Income Graph vs Toppings-----------------------------
income_list <- ice_cream_list %>%
inner_join(demographics, by = "household_id") %>%
group_by(income) %>%
summarize(total = sum(quantity))
ggplot( income_list, aes(x = income, y = total)) +
geom_col(fill = ifelse(income_list$income == "35-49K"|income_list$income == "50-74K",
"#66142a", "#9F2042")) +
coord_flip() +
labs(x = "Income Bracket",
y = "Total Quantity Bought",
title = "Total Quantity of Ice Cream Products versus Income")+
annotate(
"segment",
x = 8.5,
y = 1500,
xend = 4, yend = 2000,
arrow = arrow(length = unit(0.25, "cm"))
) +
annotate(
"text",
x = 9,
y = 1600,
label = "Most ice cream sales involve those making 35k-75k income",
)Kids Count
#----------------------------------Kids Graph vs Toppings-------------------------------
has_kids_list <- ice_cream_list %>%
inner_join(demographics, by = "household_id") %>%
group_by(kids_count) %>%
summarize(total = sum(quantity))
ggplot( has_kids_list, aes(x = kids_count, y = total)) +
geom_col(fill = ifelse(has_kids_list$kids_count == "0", "#66142a", "#9F2042")) +
coord_flip() +
labs(x = "Number of Children",
y = "Total Quantity Bought",
title = "Total Quantity of Ice Cream Products versus Number of Children")+
annotate(
"segment",
x = 3.2,
y = 4000,
xend = 1, yend = 6000,
arrow = arrow(length = unit(0.25, "cm"))
) +
annotate(
"text",
x = 3.5,
y = 4000,
label = "Most ice cream sales involve those with no children",
)The picture we get from these plots is that most ice cream sales appear to come from those who are
- Middle aged
- Earn between 35 and 74 thousand dollars
- Have 0 kids
It is unclear whether these demographics are more likely to buy ice cream, or whether they simply make up a larger portion of our data set.
Summary and Limitations
Recall that our goal for this project was to come up with ways of increasing volume of ice cream sold. Based on the data, we can make a couple of recommendations. First of all, most ice cream sales appear to be due to 3 types of ice cream sales. We do not know for what reason non-premium ice creams bring in fewer customers; however, it’s possible that focusing advertising efforts on those few moneymakers may be more profitable.
We also examined the degree to which improving ice cream sales may help with improving topping sales. The data seems to suggest that the answer is no. The correlation is either slightly negative or 0 between toppings sales and ice cream sales. It is important to note, however, that other products are also commonly bought with toppings and could represent an area of future study.
We also examined coupons and found that personalized coupons seemed to be used more often than non-personalized coupons when it comes to ice cream. It’s unclear whether coupons themselves raise sales, or whether these customers would have bought ice cream anyway. To answer this question, more data would be needed.
Then, we moved on to analyzing ice cream sales over time. We found that there is a slight bump during July; however, sales numbers are still robust year round.
Finally, we found that most of our ice cream sales came from those who are middle-aged, middle-low income, and who have no kids. Looking through the data, it appears this may possibly be due to the distribution of people in our data set; however, more study is needed.
Our suggestion at present is to focus coupon campaigns towards proven buyers of ice cream, since those campaigns have been shown to be more effective. We should also focus marketing efforts towards the demographics listed above. Finally, we suggest offering coupons which pair ice cream and toppings together.