knitr::opts_chunk$set(warning = FALSE, message = FALSE)
First I am going to read the four CSVs into my R session and take a look at a summary of our transaction data. The summary function is always a good first step for catching glaring data quality issues.
prod <-
read.csv("product.csv",
quote = "'",
stringsAsFactors = FALSE)
prod_class <-
read.csv("product_class.csv",
quote = "'",
stringsAsFactors = FALSE)
promo <-
read.csv("promotion.csv",
quote = "'",
stringsAsFactors = FALSE)
trans <-
read.csv("transactions.csv",
quote = "'")
summary(trans)
## product_id customer_id store_id promotion_id
## Min. : 1.0 Min. : 0 Min. : 2 Min. : 0.0
## 1st Qu.: 407.0 1st Qu.: 0 1st Qu.: 13 1st Qu.: 0.0
## Median : 790.0 Median : 1168 Median : 24 Median : 7.0
## Mean : 785.7 Mean : 2703 Mean : 2576 Mean : 140.6
## 3rd Qu.:1171.0 3rd Qu.: 5092 3rd Qu.: 5092 3rd Qu.: 17.0
## Max. :1559.0 Max. :10277 Max. :10277 Max. :1869.0
## month_of_year quarter the_year store_sales
## Min. : 1.000 Q1:43160 Min. :1997 Min. : 0.500
## 1st Qu.: 4.000 Q2:40714 1st Qu.:1997 1st Qu.: 3.780
## Median : 7.000 Q3:42892 Median :1997 Median : 5.920
## Mean : 6.639 Q4:46836 Mean :1997 Mean : 6.512
## 3rd Qu.:10.000 3rd Qu.:1997 3rd Qu.: 8.640
## Max. :12.000 Max. :1997 Max. :30.560
## store_cost unit_sales fact_count
## Min. : 0.153 Min. : 1.000 1 :173536
## 1st Qu.: 1.479 1st Qu.: 3.000 1);: 2
## Median : 2.368 Median : 3.000 2 : 64
## Mean : 2.599 Mean : 3.073
## 3rd Qu.: 3.472 3rd Qu.: 4.000
## Max. :12.064 Max. :10.000
I know this is an aggregated table due to the presence of the fact_count column. Each entry represents an entire month, and fact_count tells us how many times the same customer engaged in the exact same transaction within that month. I can see a couple of entries in the fact_count column were garbled somewhere along the way. Let’s clean those up.
trans$fact_count <-
  as.numeric(
    gsub(pattern = "1);",
         replacement = "1",
         x = trans$fact_count,
         fixed = TRUE))
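As a quick sanity check (just a sketch), we can confirm the column converted cleanly and no NAs were introduced:
table(trans$fact_count, useNA = "ifany")  # expect only whole-number counts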
Now I want to take a look at histograms of our numeric variables to make sure we don’t have any obvious issues there.
par(mfrow = c(3,1))
hist(trans$store_sales, breaks = 'scott')
hist(trans$store_cost, breaks = 'scott')
hist(trans$unit_sales, breaks = 10)
Everything looks good here; I don’t see any values that are wildly out of line.
To get a feel for what is going on with our product categories, I first need to merge the transactions table with the various product info tables. Then we will calculate a number of KPIs for each product category.
library(dplyr) # useful package for data munging
trans_prod <- merge(
x = merge(
x = trans,
y = prod,
by = 'product_id',
all.x = TRUE
),
y = prod_class,
by = 'product_class_id',
all.x = TRUE
)
# here we calculate some row level kpis
trans_prod <- trans_prod %>%
mutate('revenue' = unit_sales * store_sales * fact_count,
'profit' = unit_sales * fact_count * (store_sales - store_cost),
'product.cost' = unit_sales * store_cost * fact_count,
'margin' = ((profit/revenue) * 100))
# calculating total revenue for the year for later use
total_rev <- sum(trans_prod$revenue)
# grouping by category and calculating kpis
cat_kpis <- trans_prod %>%
group_by(product_category) %>%
summarize(
Sales = sum(fact_count),
Units = sum(fact_count * unit_sales),
Rev_by_1000s = sum(revenue)/1000,
RevPerSale = (sum(revenue) / Sales),
Pct_Tot_Rev = (sum(revenue) / total_rev)*100,
Margin = (sum(profit)/sum(revenue))*100)
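To make the row-level KPI arithmetic above concrete, here is a hypothetical single line (values invented purely for illustration):
# hypothetical single row, for illustration only
ex <- data.frame(unit_sales = 3, store_sales = 6.00, store_cost = 2.40, fact_count = 2)
with(ex, unit_sales * store_sales * fact_count)                  # revenue = 36.00
with(ex, unit_sales * store_cost  * fact_count)                  # product cost = 14.40
with(ex, unit_sales * fact_count * (store_sales - store_cost))   # profit = 21.60
# margin = 100 * profit / revenue = 100 * 21.6 / 36 = 60%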
I’m going to filter this down and plot out the 10 categories with the highest revenue.
top_rev <- top_n(cat_kpis, 10, Rev_by_1000s)
# The package plotly lets us quickly create interactive d3 visualizations
library(plotly)
p <- top_rev %>% plot_ly(x = product_category,
y = Rev_by_1000s,
color = Margin,
mode = "markers",
marker = list(size = 20))
p <- layout(p,
title = "Top Categories",
margin = list(b = 90),
xaxis = list(
title = ""
),
yaxis = list(
title = "$ Revenue (1000s)"
))
p
Below you will find an HTML table widget containing the product category KPI data. You can sort, search, and filter for anything of interest.
prettifyTable <- function(x) {
  # identify numeric columns, round them, then add thousands separators
  bools <- sapply(x, class) %in% c('numeric', 'integer')
  x[, bools] <- apply(x[, bools], MARGIN = 2, round, digits = 2)
  x[, bools] <- apply(x[, bools], MARGIN = 2, prettyNum, big.mark = ",")
  return(x)
} # wrapper function I made to format numbers
cat_kpis.pretty <- prettifyTable(cat_kpis)
DT::datatable(cat_kpis.pretty,
rownames = FALSE,
filter = "top",
options = list(),
class = "display")
Clearly a significant portion of sales comes from several core categories: vegetables, snack foods, dairy, meat, fruit, jams and jellies, bread, and canned soup account for about half of revenue.
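As a quick sanity check on that “about half” figure, we can sum the share-of-revenue column for those categories (a sketch; category spellings as they appear in the table above):
cat_kpis %>%
  filter(product_category %in% c('Vegetables', 'Snack Foods', 'Dairy', 'Meat',
                                 'Fruit', 'Jams and Jellies', 'Bread', 'Canned Soup')) %>%
  summarize(Pct_of_Total_Rev = sum(Pct_Tot_Rev))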
Let’s take a quick look at the top subcategories within these core categories.
core_cats <- c('Vegetables','Snack Foods','Meat','Dairy','Fruit', 'Bread', 'Jams and Jellies', 'Canned Soup')
top_subcats <- trans_prod %>%
filter(product_category %in% core_cats) %>%
group_by(product_subcategory) %>%
summarize(
Sales = sum(fact_count),
Units = sum(fact_count * unit_sales),
Rev_by_1000s = sum(revenue)/1000,
RevPerSale = (sum(revenue) / Sales),
Pct_Tot_Rev = (sum(revenue) / total_rev)*100,
Margin = (sum(profit)/sum(revenue))*100) %>%
top_n(10, Rev_by_1000s)
p <- top_subcats %>% plot_ly(x = product_subcategory,
y = Rev_by_1000s,
color = Margin,
mode = "markers",
marker = list(size = 20))
p <- layout(p,
title = "Top Core Sub-categories",
margin = list(b = 90),
xaxis = list(
title = ""
),
yaxis = list(
title = "$ Revenue (1000s)"
))
p
Fresh vegetables and fresh fruits are far and away our two largest drivers of revenue. That’s great to see: food market can consider itself a positive influence on the community, primarily by providing high-quality nutrition to its customers.
Many of these food items are household staples. Next, let’s look at the subcategories within the categories that fall outside the core food items accounting for half of revenue.
noncore_totrev <- sum(filter(trans_prod, !(product_category %in% core_cats))$revenue)
subcat_kpis <- trans_prod %>%
filter(!(product_category %in% core_cats)) %>%
group_by(product_subcategory) %>%
summarize(
Sales = sum(fact_count),
Units = sum(fact_count * unit_sales),
Rev_by_1000s = sum(revenue)/1000,
RevPerSale = (sum(revenue) / Sales),
Pct_Tot_Rev = (sum(revenue) / noncore_totrev)*100,
Margin = (sum(profit)/sum(revenue))*100)
top_rev <- top_n(subcat_kpis, 10, Rev_by_1000s)
p <- top_rev %>% plot_ly(x = product_subcategory,
y = Rev_by_1000s,
color = Margin,
mode = "markers",
marker = list(size = 20))
p <- layout(p,
title = "Top Non-Core Sub-categories",
margin = list(b = 90),
xaxis = list(
title = ""
),
yaxis = list(
title = "$ Revenue (1000s)"
))
p
Outside of food pantry staples, we can see that paper wipes, wine, batteries, and candy are some of the largest drivers of revenue. These would be great products to focus promotions on. Below is another table displaying KPIs for the non-core subcategories.
subcat_kpis.pretty <- prettifyTable(subcat_kpis)
DT::datatable(subcat_kpis.pretty,
rownames = FALSE,
filter = "top",
options = list(),
class = "display")
One interesting way to look at category-level revenues is with a heat map. It can show whether there are certain times of the year when we can capitalize on high demand for specific products. Check out the interactive heatmap below.
heatmap <- trans_prod %>%
  group_by(month_of_year, product_category) %>%
  summarise(Rev = sum(revenue))
# reshape data to feed into heatmap function
heatmapspread <- tidyr::spread(heatmap, month_of_year, Rev)
cats <- heatmapspread$product_category
# drop the category column and convert to a plain data frame so row names stick
heatmapspread <- as.data.frame(heatmapspread[, -1])
row.names(heatmapspread) <- cats
d3heatmap::d3heatmap(heatmapspread,
dendrogram = 'row',
colors = 'Blues',
anim_duration = 1000,
labCol = c('Jan',
'Feb',
'Mar',
'Apr',
'May',
'Jun',
'Jul',
'Aug',
'Sep',
'Oct',
'Nov',
'Dec'))
We can see clear signs of seasonality for all products. Sales are higher across the board in March, July and the last two months of the year.
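A simple way to back that up is to total revenue by month (a quick sketch):
trans_prod %>%
  group_by(month_of_year) %>%
  summarize(Rev_by_1000s = sum(revenue) / 1000)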
Now that we have a handle on the categories, let’s look at sales through the lens of brands. While considering categories we were mostly looking at revenue, but I think there are more relevant metrics for brand analysis. High-end brands naturally cost more than their competitors, so they would always look like the top brands if we ranked by revenue. Another option is to count the total number of transactions; this captures sales volume, but one downside is that brands selling many different types of products will always come out on top. As a compromise, I’m going to look at the average number of transactions per product offered by each brand.
top_brands <- trans_prod %>%
group_by(brand_name) %>%
summarise(TransPerProd = sum(fact_count)/length(unique(product_name)),
Prods_Offered = length(unique(product_name))) %>%
top_n(10, TransPerProd)
top_brands$brand_name <- as.factor(top_brands$brand_name)
p <- top_brands %>% plot_ly(x = brand_name,
y = TransPerProd,
color = Prods_Offered,
mode = "markers",
marker = list(size = 20))
p <- layout(p,
title = "Top 10 Brands",
margin = list(b = 90),
xaxis = list(
title = ""
),
yaxis = list(
title = "Buys per Product Offered"
))
p
We can see our top three brands by this metric are Special, Dollar and Big Time. Let’s see what they make.
trans_prod %>%
filter(brand_name %in% c('Special', 'Dollar', 'Big Time')) %>%
select(brand_name,product_name) %>%
unique %>%
arrange(desc(brand_name))
## brand_name product_name
## 1 Special Special Oatmeal
## 2 Special Special Corn Puffs
## 3 Special Special Wheat Puffs
## 4 Special Special Grits
## 5 Dollar Dollar Monthly Sports Magazine
## 6 Dollar Dollar Monthly Home Magazine
## 7 Dollar Dollar Monthly Fashion Magazine
## 8 Dollar Dollar Monthly Computer Magazine
## 9 Dollar Dollar Monthly Auto Magazine
## 10 Big Time Big Time Popsicles
## 11 Big Time Big Time Ice Cream Sandwich
## 12 Big Time Big Time Ice Cream
## 13 Big Time Big Time Chicken TV Dinner
## 14 Big Time Big Time Turkey TV Dinner
## 15 Big Time Big Time Beef TV Dinner
## 16 Big Time Big Time Pancake Mix
## 17 Big Time Big Time Frozen Pancakes
## 18 Big Time Big Time Low Fat Waffles
## 19 Big Time Big Time Blueberry Waffles
## 20 Big Time Big Time Apple Cinnamon Waffles
## 21 Big Time Big Time Waffles
## 22 Big Time Big Time Frozen Carrots
## 23 Big Time Big Time Frozen Peas
## 24 Big Time Big Time Frozen Cauliflower
## 25 Big Time Big Time Frozen Corn
## 26 Big Time Big Time Frozen Broccoli
## 27 Big Time Big Time Home Style French Fries
## 28 Big Time Big Time Low Fat French Fries
## 29 Big Time Big Time Fajita French Fries
## 30 Big Time Big Time Frozen Mushroom Pizza
## 31 Big Time Big Time Frozen Pepperoni Pizza
## 32 Big Time Big Time Frozen Cheese Pizza
## 33 Big Time Big Time Frozen Sausage Pizza
## 34 Big Time Big Time Frozen Chicken Wings
## 35 Big Time Big Time Frozen Chicken Breast
## 36 Big Time Big Time Frozen Chicken Thighs
## 37 Big Time Big Time Lime Popsicles
## 38 Big Time Big Time Orange Popsicles
## 39 Big Time Big Time Lemon Popsicles
## 40 Big Time Big Time Grape Popsicles
In a supermarket like food market there is valuable insight to be found in the interplay between a product’s margin and its popularity. If you imagine a plot with margin on the Y axis and item popularity on the X axis, then we can consider four distinct quadrants.
High-margin, high-popularity items are our bread and butter. Our number one priority with these should be never running too low on supply and looking for ways to cross-sell these items.
High-margin, low-popularity items are our big opportunities. We should direct sales and promotion efforts towards these items to increase volume.
Low-margin, high-popularity items are important for customer satisfaction. Given the low margin on these items, an efficient inventory tracking and forecasting system is invaluable for reducing waste. These items are also prime candidates for pricing optimization…we can probably charge a little more and still maintain the high levels of popularity.
Low-margin, low-popularity items: do we need to be selling these?
Let’s see which products fall into which quadrants. First, I’m going to filter for our top 100 products by revenue; otherwise this plot would just be too noisy. I’m also going to include a normalized revenue measure in the visualization as a size/color variable. This will be key in making any decisions, especially if we are thinking about cutting an item altogether. Ultimately, we wouldn’t want to cut a product that is generating good revenue and profit.
tab <- trans_prod %>%
group_by(product_name) %>%
summarize(Margin = (sum(profit)/sum(revenue))*100,
Trans = sum(fact_count),
Rev = sum(revenue)) %>%
ungroup %>% top_n(100, Rev)
tab$Rev.Scaled <- scale(tab$Rev)
# here we calculate the medians of our two dimensions in order to draw the quadrants discussed above
med.margin <- median(tab$Margin)
med.Trans <- median(tab$Trans)
ggplotly(
ggplot(tab,
aes(Trans,
Margin,
color = Rev.Scaled,
size = Rev.Scaled,
text = paste("Product:", product_name))
) +
geom_point() +
geom_hline(yintercept = med.margin) +
geom_vline(xintercept = med.Trans) +
scale_size(guide = 'none'))
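To make the quadrant assignment explicit, here is a minimal sketch that labels each of these top-100 products using the same median splits (the quadrant names are my own shorthand):
tab$quadrant <- ifelse(tab$Margin >= med.margin,
                       ifelse(tab$Trans >= med.Trans, "bread and butter", "opportunity"),
                       ifelse(tab$Trans >= med.Trans, "customer satisfaction", "candidate for elimination"))
table(tab$quadrant)  # how many products land in each quadrant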
Now we can dig into this visualization and start picking specific items to take action on per the above. Directing sales and promotion efforts towards the items in the top-left quadrant is the biggest opportunity here. These items are under-bought, and given the high margin we can still make a good profit even with promotions. On the other hand, perhaps our margin is too high on these products and that explains the low popularity. Looking at this information relative to competitors could provide insight into food market’s pricing strength against other supermarkets.
Speaking of promotions…
Promotions are a key tool in any mass-market commercial enterprise’s toolbox. They allow us to direct the attention of our customers towards specific subsets of products. Additionally, we can use promotion performance data to help make decisions about cutting products. If we have a low-popularity, low-margin, low-revenue item that we can’t even sell with a promotion, that sounds like a good candidate for elimination.
First let’s make sure there aren’t any anomalies in our promotion info table.
summary(promo)
## promotion_id promotion_district_id promotion_name
## Min. : 0.0 Min. : 0.0 Length:1864
## 1st Qu.: 473.8 1st Qu.:115.0 Class :character
## Median : 947.5 Median :121.0 Mode :character
## Mean : 947.8 Mean :121.4
## 3rd Qu.:1420.2 3rd Qu.:127.0
## Max. :1896.0 Max. :133.0
## media_type cost start_date end_date
## Length:1864 Min. : 0 Length:1864 Length:1864
## Class :character 1st Qu.: 7524 Class :character Class :character
## Mode :character Median : 9986 Mode :character Mode :character
## Mean :10027
## 3rd Qu.:12537
## Max. :14997
Everything looks pretty normal here. I see there is a start_date and an end_date for the promotions, so that is something to consider. Let’s merge the promotion info onto the transactions table and take a quick look at a histogram to see which promotions are being used. As I look at some of the start and end dates, I am noticing that they don’t seem to be accurate: there are many cases where a promotion was redeemed outside of its active window. The only assumption I can make given the data available is that these start and end dates are not firm. Bed, Bath, and Beyond is one very well known retailer whose coupons technically have an expiration date…but they are still accepted regardless. This is definitely a worthwhile strategy if margins are high enough that profits won’t be cannibalized by over-promotion.
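To roughly quantify this, here is a hedged sketch that flags redemptions whose transaction month falls outside the promotion’s stated window. It assumes start_date and end_date parse with as.Date() and ignores promotions that span a year boundary; adjust as needed.
promo_months <- promo %>%
  mutate(start_month = as.integer(format(as.Date(start_date), "%m")),
         end_month   = as.integer(format(as.Date(end_date), "%m")))
window_check <- trans %>%
  filter(promotion_id != 0) %>%
  merge(promo_months, by = "promotion_id") %>%
  mutate(in_window = month_of_year >= start_month & month_of_year <= end_month)
# share of promotion transaction lines recorded outside the stated active window
mean(!window_check$in_window)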
all <- trans_prod %>%
merge(promo,
by = 'promotion_id',
all.x = TRUE)
all_promo <- all %>% filter(promotion_id != 0)
Promo_Counts <- all_promo %>%
group_by(promotion_id, promotion_name) %>%
summarize(Redemptions = sum(fact_count),
Cost = mean(cost)) %>%
ungroup %>%
arrange(desc(Redemptions)) %>%
mutate(Rolling_Prop_Reds = cumsum(Redemptions) / sum(Redemptions),
Prop_Reds_Rank = dense_rank(-(Redemptions / sum(Redemptions))),
Rolling_Prop_Cost = cumsum(Cost) / sum(Cost),
Prop_Cost_Rank = dense_rank(-(Cost / sum(Cost))))
Wow, we can see that the top 10 most-redeemed promotions account for 75% of total promotion redemptions but only 7% of the total cost of all promotions that were redeemed in the year in question. This is looking like a red flag; we should be dedicating a higher percentage of our promotion budget to the promos that are actually being redeemed often.
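Those two figures come straight out of Promo_Counts; here is one way to read them off (a sketch):
Promo_Counts %>%
  slice(1:10) %>%   # the 10 most-redeemed promotions (table is sorted by Redemptions)
  summarize(Share_of_Redemptions = sum(Redemptions) / sum(Promo_Counts$Redemptions),
            Share_of_Cost        = sum(Cost) / sum(Promo_Counts$Cost))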
One dimension that we haven’t thought of yet is the profit generated by these promotions. Now I’m going to create a summary table with a number of financial KPIs for the most and least profitable promotions.
top_bot_promos <- all_promo %>%
group_by(promotion_id, promotion_name) %>%
summarize(
'Redemptions' = sum(fact_count),
'Rev' = sum(revenue),
'Promo.Cost' = mean(cost),
'Promo.Cost/Use' = Promo.Cost / Redemptions,
'Item.Cost' = sum(product.cost),
'Profit' = Rev - Item.Cost - Promo.Cost,
'ProfitPerRed' = Profit / Redemptions,
'Avg_margin' = mean(margin)
) %>%
ungroup %>%
top_n(5, Profit) %>%
bind_rows(
all_promo %>%
group_by(promotion_id, promotion_name) %>%
summarize(
'Redemptions' = sum(fact_count),
'Rev' = sum(revenue),
'Promo.Cost' = mean(cost),
'Promo.Cost/Use' = Promo.Cost / Redemptions,
'Item.Cost' = sum(product.cost),
'Profit' = Rev - Item.Cost - Promo.Cost,
'ProfitPerRed' = Profit / Redemptions,
'Avg_margin' = mean(margin)
) %>%
ungroup %>%
top_n(5, -Profit)
)
p <- top_bot_promos %>% plot_ly(x = promotion_name,
y = Profit,
color = Redemptions,
mode = "markers",
marker = list(size = 20),
text = paste('Redemptions:', Redemptions, 'Avg_Margin:', round(Avg_margin,2)))
p <- layout(p,
title = "Top 5 and Bottom 5 Promos by Profit",
margin = list(b = 90),
xaxis = list(
title = ""
))
p
Above we can see that the 5 least profitable promotions (of the ones that were redeemed at all) all have extremely low redemption counts. With under 100 redemptions each, it’s not surprising that so much money was lost on them.
We have really only scratched the surface of the insights available in this data set. One limitation of the analysis so far is that we are only looking at one year of data. Additionally, having access to the transaction level data rather than monthly aggregates would prove quite useful. Thoughts on next steps include:
Get a handle on what is going on with the promotion start and end dates. In order to truly get a sense of the success of promotions we would need to reliably be able to calculate product sales before, during, and after the promotional period. We could then start looking at the incremental revenue driven by promotions.
Once we are looking at transaction-level data we can get an idea of brand loyalty via repeat purchases. This is key to understanding the health of various brands. Do customers always come back to certain brands? Are some brands low volume unless there is a promotion on them? That could mean we are charging too much relative to competitors. Are there toxic brands that people never buy again?
We looked at categories, subcategories, and brands. We didn’t really get into looking at metrics at a product level. One important consideration is that the same/similar product is offered by many brands. With that in mind, we could clean up the product_name variable to aggregate similar products regardless of brand. This analysis would tell us if apples are overall more popular than oranges…we could then adjust store/web layouts accordingly. Given the large number of stores, we have lots of room for experimenting with floor layouts and product configurations on the shelves.
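As a starting point for that cleanup, here is a minimal sketch that strips the brand prefix from product_name. It assumes every product name begins with its brand name (as in the Big Time listing above), and the generic_product column name is my own.
# brand-agnostic product name: drop the leading brand prefix, then trim the space
trans_prod$generic_product <- trimws(
  substr(trans_prod$product_name,
         start = nchar(trans_prod$brand_name) + 1,
         stop  = nchar(trans_prod$product_name)))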
A number of machine learning algorithms could be applied for data mining purposes. Having access to the rest of the data in this database, particularly the customer table would be very useful. We could use a clustering algorithm to segment customers with similar transactions history and/or demographics. We could analyze the defining attributes of these clusters and create targeted promotions/advertising accordingly.
This type of information could be used to help re-organize the store. We could use the data to place strongly associated items near each other. Ideally, we would place the highest-margin option for each product most prominently.