# suppress warnings and messages in the rendered chunks
knitr::opts_chunk$set(warning = FALSE, message = FALSE)

Question 1: Categories & Brands

Data Reading and Cleaning

First I am going to read the four CSVs into my R session and take a look at a summary of our transaction data. The summary function is always a good place to start to catch any glaring data quality issues.

prod <-
     read.csv("product.csv",  
              quote = "'", 
              stringsAsFactors = FALSE)

prod_class <-
     read.csv("product_class.csv",
              quote = "'",
              stringsAsFactors = FALSE)
promo <-
     read.csv("promotion.csv",
              quote = "'",
              stringsAsFactors = FALSE)
# transactions left with the default stringsAsFactors, so quarter and fact_count
# show up as factors in the summary below
trans <-
     read.csv("transactions.csv",
              quote = "'")

summary(trans)
##    product_id      customer_id       store_id      promotion_id   
##  Min.   :   1.0   Min.   :    0   Min.   :    2   Min.   :   0.0  
##  1st Qu.: 407.0   1st Qu.:    0   1st Qu.:   13   1st Qu.:   0.0  
##  Median : 790.0   Median : 1168   Median :   24   Median :   7.0  
##  Mean   : 785.7   Mean   : 2703   Mean   : 2576   Mean   : 140.6  
##  3rd Qu.:1171.0   3rd Qu.: 5092   3rd Qu.: 5092   3rd Qu.:  17.0  
##  Max.   :1559.0   Max.   :10277   Max.   :10277   Max.   :1869.0  
##  month_of_year    quarter       the_year     store_sales    
##  Min.   : 1.000   Q1:43160   Min.   :1997   Min.   : 0.500  
##  1st Qu.: 4.000   Q2:40714   1st Qu.:1997   1st Qu.: 3.780  
##  Median : 7.000   Q3:42892   Median :1997   Median : 5.920  
##  Mean   : 6.639   Q4:46836   Mean   :1997   Mean   : 6.512  
##  3rd Qu.:10.000              3rd Qu.:1997   3rd Qu.: 8.640  
##  Max.   :12.000              Max.   :1997   Max.   :30.560  
##    store_cost       unit_sales     fact_count  
##  Min.   : 0.153   Min.   : 1.000   1  :173536  
##  1st Qu.: 1.479   1st Qu.: 3.000   1);:     2  
##  Median : 2.368   Median : 3.000   2  :    64  
##  Mean   : 2.599   Mean   : 3.073               
##  3rd Qu.: 3.472   3rd Qu.: 4.000               
##  Max.   :12.064   Max.   :10.000

I know this is an aggregated table because of the fact_count column. Each row represents an entire month, and fact_count tells us whether the same customer engaged in the exact same transaction multiple times in that month. I can also see that a couple of entries in fact_count were mangled somewhere along the way (the stray "1);" level in the summary above). Let’s clean those up.

trans$fact_count <-
     as.numeric(
          gsub(pattern = "1);",
               replacement = "1",
               x = trans$fact_count,
               fixed = TRUE)) # fixed = TRUE so ");" is matched literally, not as a regex
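
A quick check confirms the cleanup worked and didn’t introduce any NAs:

# fact_count should now be a clean numeric column
table(trans$fact_count)
stopifnot(!any(is.na(trans$fact_count)))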

Now I want to take a look at histograms of our numeric variables to make sure we don’t have any obvious issues there.

par(mfrow = c(3,1))   
hist(trans$store_sales, breaks = 'scott')
hist(trans$store_cost, breaks = 'scott')
hist(trans$unit_sales, breaks = 10)

Everything looks good here; I don’t see any values that are wildly out of line.

Product Category

To get a feel for what is going on with our product categories, I first need to merge the transactions table with the various product info tables. Then we can calculate a number of KPIs for each product category.

library(dplyr) # useful package for data munging
trans_prod <- merge(
     x = merge(
          x = trans,
          y = prod,
          by = 'product_id',
          all.x = TRUE
     ),
     y = prod_class,
     by = 'product_class_id',
     all.x = TRUE
)
# here we calculate some row level kpis
trans_prod <- trans_prod %>%
     mutate('revenue' = unit_sales * store_sales * fact_count,
            'profit' =  unit_sales * fact_count * (store_sales - store_cost),
            'product.cost' = unit_sales * store_cost * fact_count,
            'margin' = ((profit/revenue) * 100))
# calculating total revenue for the year for later use
total_rev <- sum(trans_prod$revenue)
# grouping by category and calculating kpis
cat_kpis <- trans_prod %>%
     group_by(product_category) %>%
     summarize(
          Sales = sum(fact_count),
          Units = sum(fact_count * unit_sales),
          Rev_by_1000s = sum(revenue)/1000,
          RevPerSale = (sum(revenue) / Sales),
          Pct_Tot_Rev = (sum(revenue) / total_rev)*100,
          Margin = (sum(profit)/sum(revenue))*100)

I’m going to filter this down and plot out the 10 categories with the highest revenue.

top_rev <- top_n(cat_kpis, 10, Rev_by_1000s)
# The package plotly lets us quickly create interactive d3 visualizations
library(plotly)

p <- top_rev %>% plot_ly(x = product_category, 
                    y = Rev_by_1000s,
                    color = Margin,
                    mode = "markers",
                    marker = list(size = 20))

p <- layout(p,              
    title = "Top Categories",
    margin = list(b = 90),
    xaxis = list(           
        title = ""
    ),
    yaxis = list(           
        title = "$ Revenue (1000s)"      
    ))

p

Below you will find an HTML table widget containing the product category KPI data. You can sort, search, and filter for anything of interest.

# wrapper function I made to round numeric columns and add thousands separators for display
prettifyTable <- function(x) {
     bools <- sapply(x, class) %in% c('numeric', 'integer')
     x[,bools] <- apply(x[,bools], 2, round, digits = 2)
     x[,bools] <- apply(x[,bools], 2, prettyNum, big.mark = ",")
     return(x)
}
cat_kpis.pretty <- prettifyTable(cat_kpis)

DT::datatable(cat_kpis.pretty, 
              rownames = FALSE,
              filter = "top",
              options = list(),
              class = "display")

Clearly a significant portion of sales comes from several core categories: vegetables, snack foods, dairy, meat, fruit, jams and jellies, bread, and canned soup together account for about half of revenue.
Let’s take a quick look at the top subcategories within these core categories.

core_cats <- c('Vegetables','Snack Foods','Meat','Dairy','Fruit', 'Bread', 'Jams and Jellies', 'Canned Soup')

top_subcats <- trans_prod %>%
     filter(product_category %in% core_cats) %>%
     group_by(product_subcategory) %>%
     summarize(
          Sales = sum(fact_count),
          Units = sum(fact_count * unit_sales),
          Rev_by_1000s = sum(revenue)/1000,
          RevPerSale = (sum(revenue) / Sales),
          Pct_Tot_Rev = (sum(revenue) / total_rev)*100,
          Margin = (sum(profit)/sum(revenue))*100) %>%
     top_n(10, Rev_by_1000s)

p<-top_subcats %>% plot_ly(x = product_subcategory,
                        y = Rev_by_1000s,
                        color = Margin,
                        mode = "markers",
                        marker = list(size = 20))

p <- layout(p,              
    title = "Top Core Sub-categories",
    margin = list(b = 90),
    xaxis = list(           
        title = ""
    ),
    yaxis = list(           
        title = "$ Revenue (1000s)"      
    ))
p

Fresh vegetables and fresh fruits are far and away our two largest drivers of revenue. That’s great to see: food market can consider itself to have a positive influence on the community, since it is primarily providing high-quality nutrition to its customers.

Many of these food items are staples in households. Let’s look at the subcategories for categories that aren’t the core food items accounting for half of revenue.

noncore_totrev <- sum(filter(trans_prod, !(product_category %in% core_cats))$revenue)

subcat_kpis <- trans_prod %>%
     filter(!(product_category %in% core_cats)) %>%
     group_by(product_subcategory) %>%
     summarize(
          Sales = sum(fact_count),
          Units = sum(fact_count * unit_sales),
          Rev_by_1000s = sum(revenue)/1000,
          RevPerSale = (sum(revenue) / Sales),
          Pct_Tot_Rev = (sum(revenue) / noncore_totrev)*100,
          Margin = (sum(profit)/sum(revenue))*100)

top_rev <- top_n(subcat_kpis, 10, Rev_by_1000s)

p<- top_rev %>% plot_ly(x = product_subcategory,
                    y = Rev_by_1000s,
                    color = Margin,
                    mode = "markers",
                    marker = list(size = 20))

p <- layout(p,              
    title = "Top Non-Core Sub-categories",
    margin = list(b = 90),
    xaxis = list(           
        title = ""
    ),
    yaxis = list(           
        title = "$ Revenue (1000s)"      
    ))
p

Outside of the food pantry staples, we can see that paper wipes, wine, batteries, and candy are some of the largest drivers of revenue. These would be great products to focus promotions on. Below is another table displaying KPIs for the non-core subcategories.

subcat_kpis.pretty <- prettifyTable(subcat_kpis)

DT::datatable(subcat_kpis.pretty, 
              rownames = FALSE,
              filter = "top",
              options = list(),
              class = "display")

Searching for Seasonality

One interesting way to look at category-level revenues is with a heat map. It can tell us whether there are certain times of the year when we can capitalize on high demand for specific products. Check out the interactive heatmap below.

heatmap <- trans_prod %>%
     group_by(month_of_year, product_category) %>%
     summarise(Rev = sum(revenue))

# reshape data to feed into heatmap function

heatmapspread <- tidyr::spread(heatmap, month_of_year, Rev)
cats <- heatmapspread$product_category
# drop the category column and convert from tibble to data.frame so the row names stick
heatmapspread <- as.data.frame(heatmapspread[,-1])
row.names(heatmapspread) <- cats
d3heatmap::d3heatmap(heatmapspread, 
                     dendrogram = 'row', 
                     colors = 'Blues', 
                    anim_duration = 1000, 
                    labCol = c('Jan',
                     'Feb',
                     'Mar',
                     'Apr',
                     'May',
                     'Jun',
                     'Jul',
                     'Aug',
                     'Sep',
                     'Oct',
                     'Nov',
                     'Dec'))

We can see clear signs of seasonality for all products. Sales are higher across the board in March, July and the last two months of the year.
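
One quick way to double-check that read outside of the heatmap is to total revenue by month across all categories:

# overall monthly revenue, sorted, to confirm the seasonal peaks seen in the heatmap
trans_prod %>%
     group_by(month_of_year) %>%
     summarise(Rev_by_1000s = sum(revenue) / 1000) %>%
     arrange(desc(Rev_by_1000s))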

Brands

Now that we have a handle on the categories, let’s look at sales through the lens of brands. While considering categories we were mostly looking at revenue, but I think there are more relevant metrics for brand analysis. High-end brands naturally cost more than their competitors, so they would always look like the top brands if we ranked by revenue. Another option would be to count the number of transactions overall; this captures sales volume, but one downside is that brands selling many different types of products will always come out on top. As a compromise, I’m going to look at the average number of transactions per product for each brand.

top_brands <- trans_prod %>%
     group_by(brand_name) %>%
     summarise(TransPerProd = sum(fact_count)/length(unique(product_name)),
               Prods_Offered = length(unique(product_name))) %>%
     top_n(10, TransPerProd)
top_brands$brand_name <- as.factor(top_brands$brand_name)

p <- top_brands %>% plot_ly(x = brand_name,
                       y = TransPerProd,
                       color = Prods_Offered,
                       mode = "markers",
                       marker = list(size = 20))
p <- layout(p,              
    title = "Top 10 Brands",
    margin = list(b = 90),
    xaxis = list(           
        title = ""
    ),
    yaxis = list(           
        title = "Buys per Product Offered"      
    ))
p

We can see our top three brands by this metric are Special, Dollar and Big Time. Let’s see what they make.

trans_prod %>%
     filter(brand_name %in% c('Special', 'Dollar', 'Big Time')) %>%
     select(brand_name,product_name) %>%
     unique %>% 
     arrange(desc(brand_name))
##    brand_name                     product_name
## 1     Special                  Special Oatmeal
## 2     Special               Special Corn Puffs
## 3     Special              Special Wheat Puffs
## 4     Special                    Special Grits
## 5      Dollar   Dollar Monthly Sports Magazine
## 6      Dollar     Dollar Monthly Home Magazine
## 7      Dollar  Dollar Monthly Fashion Magazine
## 8      Dollar Dollar Monthly Computer Magazine
## 9      Dollar     Dollar Monthly Auto Magazine
## 10   Big Time               Big Time Popsicles
## 11   Big Time      Big Time Ice Cream Sandwich
## 12   Big Time               Big Time Ice Cream
## 13   Big Time       Big Time Chicken TV Dinner
## 14   Big Time        Big Time Turkey TV Dinner
## 15   Big Time          Big Time Beef TV Dinner
## 16   Big Time             Big Time Pancake Mix
## 17   Big Time         Big Time Frozen Pancakes
## 18   Big Time         Big Time Low Fat Waffles
## 19   Big Time       Big Time Blueberry Waffles
## 20   Big Time  Big Time Apple Cinnamon Waffles
## 21   Big Time                 Big Time Waffles
## 22   Big Time          Big Time Frozen Carrots
## 23   Big Time             Big Time Frozen Peas
## 24   Big Time      Big Time Frozen Cauliflower
## 25   Big Time             Big Time Frozen Corn
## 26   Big Time         Big Time Frozen Broccoli
## 27   Big Time Big Time Home Style French Fries
## 28   Big Time    Big Time Low Fat French Fries
## 29   Big Time     Big Time Fajita French Fries
## 30   Big Time   Big Time Frozen Mushroom Pizza
## 31   Big Time  Big Time Frozen Pepperoni Pizza
## 32   Big Time     Big Time Frozen Cheese Pizza
## 33   Big Time    Big Time Frozen Sausage Pizza
## 34   Big Time    Big Time Frozen Chicken Wings
## 35   Big Time   Big Time Frozen Chicken Breast
## 36   Big Time   Big Time Frozen Chicken Thighs
## 37   Big Time          Big Time Lime Popsicles
## 38   Big Time        Big Time Orange Popsicles
## 39   Big Time         Big Time Lemon Popsicles
## 40   Big Time         Big Time Grape Popsicles

Question 2: Insights

Margin vs Popularity

In a supermarket like food market there is valuable insight to be found in the interplay between the margin on a particular product and its popularity. If you imagine a plot with margin on the y-axis and item popularity on the x-axis, then we can consider four distinct quadrants.

  • High-margin, high-popularity items are our bread and butter. Our number one priority with these should be never running too low on supply and looking for ways to cross-sell them.

  • High-margin, low-popularity items are our big opportunities. We should direct sales and promotion efforts towards these items to increase volume.

  • Low-margin, high-popularity items are important for customer satisfaction. Given the low margin on these items, it is invaluable to have an efficient inventory tracking and forecasting system to reduce waste. These items are also prime candidates for pricing optimization…we can probably charge a little more and still maintain the high levels of popularity.

  • Low-margin, low-popularity items: do we need to be selling these?

Let’s see what products fall into which quadrants. First, I’m going to filter for our top 100 products by revenue; otherwise this plot would just be too noisy. I’m also going to include a normalized revenue measure to use in the visualization as a size/color variable. This will be key in making any decisions, especially if we are thinking about cutting an item altogether. Ultimately, we wouldn’t want to cut a product that is generating good revenue and profit.

tab <- trans_prod %>%
     group_by(product_name) %>%
     summarize(Margin = (sum(profit)/sum(revenue))*100,
               Trans = sum(fact_count),
               Rev = sum(revenue)) %>%
     ungroup %>% top_n(100, Rev)
# scale() returns a one-column matrix; convert to a plain numeric vector for plotting
tab$Rev.Scaled <- as.numeric(scale(tab$Rev))

# here we calculate the medians of our two dimensions in order to draw the quadrants discussed above
med.margin <- median(tab$Margin)
med.Trans <- median(tab$Trans)
ggplotly(
     ggplot(tab, 
            aes(Trans, 
                Margin, 
                color = Rev.Scaled, 
                size = Rev.Scaled, 
                text = paste("Product:", product_name))
            ) + 
              geom_point() + 
              geom_hline(aes(yintercept = med.margin)) + 
              geom_vline(aes(xintercept = med.Trans)) + 
              scale_size(guide = 'none'))
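
To make the quadrant assignments explicit rather than purely visual, we could also label each product against the two medians (a quick base-R sketch, not part of the plot above):

# label each of the top-100 products with its quadrant relative to the medians
tab$Quadrant <- ifelse(tab$Margin >= med.margin,
                       ifelse(tab$Trans >= med.Trans,
                              "High margin / high popularity",
                              "High margin / low popularity"),
                       ifelse(tab$Trans >= med.Trans,
                              "Low margin / high popularity",
                              "Low margin / low popularity"))
table(tab$Quadrant)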

Now we can dig into this visualization and start picking specific items to take action on, per the framework above. Directing sales and promotion efforts towards the items in the top-left quadrant is the biggest opportunity here. These items are under-bought, and given the high margin we can still make a good profit even with promotions. On the other hand, perhaps our margin is too high on these products and that explains the low popularity. Looking at this information relative to competitors could provide insight into food market’s pricing strength against other supermarkets.

Speaking of promotions…

Which promotions are working?

Promotions are a key tool in any mass-market commercial enterprise’s toolbox. They allow us to direct our customers’ attention towards specific subsets of products. Additionally, we can use promotion performance data to help make decisions about cutting products. If we have a low-popularity, low-margin, low-revenue item that we can’t even sell with a promotion, that sounds like a good candidate for elimination.

First let’s make sure there aren’t any anomalies in our promotion info table.

summary(promo)
##   promotion_id    promotion_district_id promotion_name    
##  Min.   :   0.0   Min.   :  0.0         Length:1864       
##  1st Qu.: 473.8   1st Qu.:115.0         Class :character  
##  Median : 947.5   Median :121.0         Mode  :character  
##  Mean   : 947.8   Mean   :121.4                           
##  3rd Qu.:1420.2   3rd Qu.:127.0                           
##  Max.   :1896.0   Max.   :133.0                           
##   media_type             cost        start_date          end_date        
##  Length:1864        Min.   :    0   Length:1864        Length:1864       
##  Class :character   1st Qu.: 7524   Class :character   Class :character  
##  Mode  :character   Median : 9986   Mode  :character   Mode  :character  
##                     Mean   :10027                                        
##                     3rd Qu.:12537                                        
##                     Max.   :14997

Everything looks pretty normal here. I see there is a start_date and an end_date for each promotion, so that is something to consider. Let’s merge the promotion info onto the transactions table and take a quick look at which promotions are being used. As I look at some of the start and end dates, I notice that they don’t seem to be accurate: there are many cases where a promotion was redeemed outside of its active window. The only assumption I can make given the data available is that these start and end dates are not firm. Bed, Bath, and Beyond is one very well-known retailer whose coupons technically have an expiration date…but they are still accepted regardless. This is a worthwhile strategy as long as margins are high enough that profits won’t be cannibalized by over-promotion.

all <- trans_prod %>% 
  merge(promo,
        by = 'promotion_id',
        all.x = TRUE)
all_promo <- all %>% filter(promotion_id != 0)
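
Before moving on, here is a rough check of the out-of-window redemptions mentioned above. This is only a sketch: it assumes start_date and end_date are in a format as.Date() can parse directly, and it compares at month granularity since the transaction month is the finest grain we have.

# share of promo transaction rows whose month falls outside the stated promotion window
window_check <- all_promo %>%
     mutate(trans_month = as.Date(sprintf("%d-%02d-01", the_year, month_of_year)),
            start_month = as.Date(format(as.Date(start_date), "%Y-%m-01")),
            end_month = as.Date(format(as.Date(end_date), "%Y-%m-01")),
            outside = trans_month < start_month | trans_month > end_month)
mean(window_check$outside, na.rm = TRUE)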

# count redemptions per promotion and compute cumulative shares of redemptions and promo cost
Promo_Counts <- all_promo %>%
  group_by(promotion_id, promotion_name) %>%
  summarize(Redemptions = sum(fact_count),
            Cost = mean(cost)) %>%
  ungroup %>%
  arrange(desc(Redemptions)) %>%
  mutate(Rolling_Prop_Reds = cumsum(Redemptions) / sum(Redemptions),
         Prop_Reds_Rank = dense_rank(-(Redemptions / sum(Redemptions))),
         Rolling_Prop_Cost = cumsum(Cost) / sum(Cost),
         Prop_Cost_Rank = dense_rank(-(Cost / sum(Cost))))

Wow, we can see that the 10 most often used promotions account for 75% of total promotion redemptions, but only 7% of the total cost of all promotions redeemed in the year in question. This initially looks like a red flag: we should be dedicating a higher percentage of our promotion budget to the promos that are actually being redeemed often.
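
For reference, those two figures can be read straight off Promo_Counts, which is already sorted by redemptions:

# cumulative share of redemptions and of promo cost at the 10th most-redeemed promotion
Promo_Counts %>%
     slice(10) %>%
     select(promotion_name, Rolling_Prop_Reds, Rolling_Prop_Cost)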

One dimension that we haven’t thought about yet is the profit generated by these promotions. Now I’m going to create a summary table with a number of financial KPIs for the most and least profitable promotions.

# compute promo-level financial KPIs once, then keep the 5 most and 5 least profitable
promo_kpis <- all_promo %>%
     group_by(promotion_id, promotion_name) %>%
     summarize(
          'Redemptions' = sum(fact_count),
          'Rev' = sum(revenue),
          'Promo.Cost' = mean(cost),
          'Promo.Cost/Use' = Promo.Cost / Redemptions,
          'Item.Cost' = sum(product.cost),
          'Profit' = Rev - Item.Cost - Promo.Cost,
          'ProfitPerRed' = Profit / Redemptions,
          'Avg_margin' = mean(margin)
     ) %>%
     ungroup

top_bot_promos <- bind_rows(
     top_n(promo_kpis, 5, Profit),
     top_n(promo_kpis, 5, -Profit)
)

p <- top_bot_promos %>% plot_ly(x = promotion_name,
                       y = Profit,
                       color = Redemptions,
                       mode = "markers",
                       marker = list(size = 20),
                      text = paste('Redemptions:', Redemptions, 'Avg_Margin:', round(Avg_margin,2)))

p <- layout(p,              
    title = "Top 5 and Bottom 5 Promos by Profit",
   margin = list(b = 90),
    xaxis = list(           
        title = ""
    ))
p

Above we can see that the 5 least profitable promotions (of the ones that were redeemed at all) all have extremely low redemption counts. With under 100 redemptions each, it’s not surprising that so much money was lost on them.

Next Steps

We have really only scratched the surface of the insights available in this data set. One limitation of the analysis so far is that we are only looking at one year of data. Additionally, having access to the transaction level data rather than monthly aggregates would prove quite useful. Thoughts on next steps include:

  • Get a handle on what is going on with the promotion start and end dates. In order to truly get a sense of the success of promotions we would need to reliably be able to calculate product sales before, during, and after the promotional period. We could then start looking at the incremental revenue driven by promotions.

  • Once we are looking at the transaction-level data, we can get an idea of brand loyalty via repeat purchases. This is key to understanding the health of various brands. Do customers always come back to certain brands? Are some brands low volume unless there is a promotion on them? That could mean we are charging too much relative to competitors. Are there toxic brands which people never repeat-buy?

  • We looked at categories, subcategories, and brands. We didn’t really get into looking at metrics at a product level. One important consideration is that the same/similar product is offered by many brands. With that in mind, we could clean up the product_name variable to aggregate similar products regardless of brand. This analysis would tell us if apples are overall more popular than oranges…we could then adjust store/web layouts accordingly. Given the large number of stores, we have lots of room for experimenting with floor layouts and product configurations on the shelves.

  • A number of machine learning algorithms could be applied for data mining purposes. Having access to the rest of the data in this database, particularly the customer table would be very useful. We could use a clustering algorithm to segment customers with similar transactions history and/or demographics. We could analyze the defining attributes of these clusters and create targeted promotions/advertising accordingly.

  • Another algorithm that could be used to extract useful insights from this data is the Apriori algorithm for association rules and frequent item sets (a rough sketch follows below).

  • This would give us excellent cross-selling opportunities by elucidating popular combinations of products.

  • This type of information could also be used to help re-organize the store: we could place items that are strongly associated near each other, and ideally place the highest-margin option for each product most prominently.
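
As a rough sketch of that last idea: the arules package implements Apriori. We would really want true basket-level data for this; as a stand-in, the code below treats a customer’s purchases within a month as one "basket", so the basket definition and the support/confidence thresholds are illustrative assumptions only.

# sketch only: mine association rules from pseudo-baskets (customer x month)
library(arules)
baskets <- split(trans_prod$product_name,
                 paste(trans_prod$customer_id, trans_prod$month_of_year))
rules <- apriori(as(baskets, "transactions"),
                 parameter = list(supp = 0.001, conf = 0.5))
inspect(head(sort(rules, by = "lift"), 10))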