A Market Basket Analysis

Introduction

Problem Statement

The objective of this analysis is to study the interaction between two pairs of products - Pasta with Pasta Sauce and Syrup with Pancake Mix. We intend to ascertain the popularity index for each commodity-brand combination and further analyze the magnitude of association between pairs of brands each belonging to one of the associated pairs of commodities (E.g. Brand X of Pasta Sauce with Brand Y of Pasta).

Data & Methodology

  • The data used is the Carbo-Loading dataset which contains 2 years of Pasta, Pasta Sauce, Syrup and Pancake Mix transactions at household level which was obtained through the loyalty card program of the Kroger departmental store chain.

  • Using Association Rule Mining we wish to classify not only the brands which are highly popular overall, but pairs which have high lift or magnitude of association.

Proposed Approach

  • Firstly, we intend to rank each brand within each commodity category according to its overall popularity by examining total units sold and then total dollar value (We prefer to rank based on units sold over dollar value, as we are assuming that lower prices is a factor in popularity).

  • Then, we intend to compute certain statistics which will help in our Market Basket Analysis. These are the support for each commodity-brand entity overall (not within each category) which is the frequency of the itemset in the dataset, the confidence of item pairs which is the how often we find that association rule to be true, and the lift which is the main statistic of interest and is the degree to which those two occurrences (the purchase of the two items) are dependent on one another.

  • A supporting feature of our analysis is to find the top n popular brands within each commodity

Take Away

This analysis will give the company an understanding of the popularity distribution of the different brands they sell and the associations between product pairs will help the company increase sales by marketing these associated pairs of brands together, incorporating coupons, discounts, promotional schemes etc.

Packages Required

library(haven)
library(dplyr)
library(tidyr)
library(fBasics)
library(ggplot2)
library(gridExtra)
library(knitr)
library(kableExtra)
library(wordcloud)
  • haven : To Import SAS files

  • dplyr: Data Manipulation

  • tidyr: Data Cleaning

  • fBasics: For Summary Statistics (::basicStats)

  • ggplot2: Plotting

  • gridExtra: Arranging plots in a grid

  • knitr and kableExtra: Dynamic Report Generation - For displaying tables with scroll bars

  • wordcloud : To create wordclouds

Data Preparation

Original Data Source

Source of Data

Original Data Source from 84.51: CARBO-LOADING: A RELATIONAL DATABASE

Link to Original Dataset

Explanation of Data

  • The data was originally collected to understand the interaction between 4 categories Pasta, Pasta Sauce, Pancake Mix, Syrups.

  • The dataset has 2 years of transactions for the above categories across 387 stores with household level obtained from the loyalty program of the leading US grocer. We have 4 tables transactions, store, product and causal.

  1. Transactions: 5,197,681 rows and 11 variables (columns)

  2. Store: 387 rows and 2 variables

  3. Product: 927 rows and 5 variables

  4. Causal: 351,372 rows and 6 variables

There are missing rows in the data.
  • In the Causal table, weeks 1-42 data are missing, 6 stores are missing, 50%(448) products out of 927 are missing .

  • In the Transactions table Day 159 in week 23 is missing.

Data Importing and Cleaning

  • We have a SAS dataset which we import using read_sas. We have a relational database which is linked with the help of primary and foreign keys. With the help of these keys we combine the datasets to create a master dataset. On joining, missing values are generated due to a mismatch of values in upc variable. This occurred due to presence of zero at the start of upc in the Transactions table which wasnt the case in the other tables for the same upc ids.

  • This mismatch is remedied by converting upc variable in all tables to numeric and then joining.

  • Geography is a redundant column in the causal dataset as it is also there in transaction dataset. Hence, we remove it from the causal dataset

##################
# Importing Data #
##################

files <- list.files(path = "data",pattern = ".sas7bdat")

for (i in seq_along(files)) {

  df <- read_sas(paste0("data/",files[i]))
  assign(paste0(unlist(strsplit(files[i], "[.]"))[[1]]),df)
  rm(df)

}

########################
# Cleaning and Joining #
########################


causal_lookup$upc <- as.numeric(causal_lookup$upc)
product_lookup$upc <- as.numeric(product_lookup$upc)
transactions$upc <- as.numeric(transactions$upc)
causal_lookup$geography <- NULL

trx_prod <- merge(transactions, product_lookup, by = 'upc',
                  all.x = T)
trx_activity_product_lookup_store <- merge(trx_prod, store_lookup,
                                           by = 'store', all.x = T)
  • Dollar_sales in the transactions table have negative and zero values for 9585 rows (0.18%), we assume these as abnormal values and as the percentage of rows is low and as dollar sales is not a main variable of concern for this problem statement, we impute these values with the below formula :

\[ Imputed.dollar.sales = Average.price.per.unit.of.product * Number.of.units.of.the.observation \]

############################
# Imputing abnormal values #
############################

product_lookup_price <- trx_activity_product_lookup_store %>%
                        group_by(commodity) %>%
                        summarize(product_lookup_avg_price =
                                  sum(dollar_sales) / sum(units) )

trx_activity_product_lookup_store <- 
                        trx_activity_product_lookup_store %>%
                        left_join(product_lookup_price, by = "commodity") %>%
                        mutate(imputed_dollar_sales = case_when(
                              dollar_sales > 0 ~ dollar_sales,
                              dollar_sales < 0 ~ units * product_lookup_avg_price
  ))

trx_activity_product_lookup_store$dollar_sales <- NULL

Final Dataset

Cleaned Dataset : Top 50 rows

Cleaned Dataset
store upc units time_of_transaction geography week household basket day coupon product_description commodity brand product_size store_zip_code product_lookup_avg_price imputed_dollar_sales
1 5100011464 1 1800 1 9 33094 826468 60 0 PREGO MUSHROOM SUPREME SC pasta sauce Prego 25.75 OZ 37865 1.740049 1.50
1 3905972884 1 1308 1 88 46014 2818933 616 0 SPRING TREE MAPLE SYRUP syrups Spring Tree 8.5 OZ 37865 2.401884 4.49
1 9999981816 1 2311 1 76 448923 2373667 531 0 PRIVATE LABEL MAPLE SYRUP syrups Private Label Premium 12 OZ 37865 2.401884 4.29
1 1510000017 1 1547 1 78 370371 2433682 540 0 CRMET LONG SPAG pasta Creamette 7 OZ 37865 0.989966 0.69
1 3620000468 1 0822 1 42 66170 1298998 291 0 RAGUONION/GARLIC pasta sauce Ragu 26 OZ 37865 1.740049 1.61
1 7130000140 1 1406 1 30 168809 684743 208 0 RONZONI TOM BSL LINGUINE pasta Ronzoni 12 OZ 37865 0.989966 1.49
1 9999966070 1 1552 1 13 380733 328933 89 0 PRIVATE LABEL BUTTERED SYRUP syrups Private Label P 24 OZ 37865 2.401884 1.79
1 9999985004 1 1042 1 26 235012 536968 176 0 PRIVATE LABEL SPAGHETTI REGULAR pasta Private Label 16 OZ 37865 0.989966 0.69
1 4144900110 1 1812 1 26 357535 544249 179 0 KRUSTEAZ COMP BM PANCAKE pancake mixes Krusteaz 32 OZ 37865 1.702382 2.29
1 2066200022 1 1043 1 3 66466 426081 21 0 NEWMANSSAUCE pasta sauce Newman’s 26 OZ 37865 1.740049 2.49
1 5100002794 1 1831 1 78 78447 2447915 543 0 PREGO SPAG SAUCE MEAT pasta sauce Prego 26 OZ 37865 1.740049 1.58
1 1510000026 1 1415 1 80 55160 2548204 560 0 CRMET EGG NOODLES-WIDE pasta Creamette 1 LB 37865 0.989966 1.79
1 601011296 1 1849 1 49 243508 1511932 343 0 BARILLA BASILICO SAUCE pasta sauce Barilla 26 OZ 37865 1.740049 2.00
1 9999981175 1 1928 1 41 284378 1271568 284 0 PRIVATE LABEL X WIDE EGG NOODLES pasta Private Label Value 1 LB 37865 0.989966 0.79
1 4112927445 1 0938 1 43 244182 1331140 300 0 CLSCO CABERNET MARINARA pasta sauce Classico 26 OZ 37865 1.740049 2.50
1 1510000256 1 1341 1 76 384918 2373672 532 0 CREAMETTE ROTINI MACARONI pasta Creamette 16 OZ 37865 0.989966 1.29
1 9999981581 1 1312 1 31 68360 698006 213 0 PRIVATE LABEL SPAG SCE MUSHROOM pasta sauce Private Label 26.50 OZ 37865 1.740049 0.88
1 9999985005 1 1720 1 30 237576 680473 206 0 PRIVATE LABEL SPAGHETTI REGULAR pasta Private Label 32OZ 37865 0.989966 1.19
1 5100001212 1 1330 1 65 234461 2013783 455 0 PREGO MUSHROOM & GARLIC pasta sauce Prego 26 OZ 37865 1.740049 1.65
1 1510000007 1 1758 1 85 32681 2692564 590 0 CREAMETTE ELBOW MACARONI pasta Creamette N 1 LB 37865 0.989966 0.99
1 3620000445 1 1132 1 75 132318 2321928 519 0 RAGU GS SUPER MUSHROOM pasta sauce Ragu 26.3 OZ 37865 1.740049 1.58
1 5100002796 1 2116 1 49 75500 1489583 337 0 PREGO MUSHROOM SPAG SAUCE pasta sauce Prego 26 OZ 37865 1.740049 1.58
1 5100002548 1 1814 1 1 50579 231266 5 0 PREGO SPAG SAUCE PLAIN pasta sauce Prego 14 OZ 37865 1.740049 1.39
1 7680852138 1 1809 1 3 60929 415988 17 0 BARILLA ITALIAN CHEESE SC pasta sauce Barilla 26 OZ 37865 1.740049 2.00
1 3000005970 1 2032 1 10 32640 844933 68 0 AUNT JEMIMA ORIGINL SYRUP syrups Aunt Jemima 24 OZ 37865 2.401884 3.19
1 3620001375 1 1853 1 2 32019 1736377 11 0 RAGU ROBUSTO SIX CHEESE pasta sauce Ragu 26 OZ 37865 1.740049 1.99
1 3620001495 1 1114 1 5 293538 456247 33 0 (S)RAGU CHNKY GRDNSTYL GR pasta sauce Ragu 45 OZ 37865 1.740049 2.79
1 3620001376 1 1431 1 49 338989 1489575 337 0 RAGU ROBUSTO ITAL SAUSAGE pasta sauce Ragu 26 OZ 37865 1.740049 1.59
1 3620001364 2 1817 1 89 400504 2843382 622 0 RAGU CHUNKY G/S MAMAS SCE pasta sauce Ragu 26 OZ 37865 1.740049 3.16
1 1510002700 1 1701 1 32 241528 714709 220 0 HEALTHY HARVEST ROTINI pasta Healthy Harvest 13.25 OZ 37865 0.989966 1.59
1 3620001369 1 1233 1 13 236391 330934 90 0 RAGUCREATION ALFRDO pasta sauce Ragu 16 OZ 37865 1.740049 0.99
1 9999985050 2 1431 1 60 434162 1823508 415 0 PRIVATE LABEL FUSILLI-ROTINI-SPRL pasta Private Label Premium 16 OZ 37865 0.989966 1.78
1 1510000031 1 1044 1 11 241368 287252 73 0 CREAMETTE LASAGNA MACR pasta Creamette 16 Z 37865 0.989966 1.89
1 4112907702 1 2030 1 95 346966 3012692 659 0 CLSCO SAUCE ITALIAN SAUSG pasta sauce Classico 26 OZ 37865 1.740049 2.00
1 1510000043 1 1918 1 67 240908 2077902 469 0 CREAMETTE SPAGHETTI pasta Creamette 16 OZ 37865 0.989966 1.39
1 5100002549 1 1658 1 10 31809 839958 66 0 PREGO REG SPAGHETTI SAUCE pasta sauce Prego 26 OZ 37865 1.740049 1.50
1 9999967725 1 1152 1 98 502296 3100354 680 0 PRIVATE LABEL PANCAKE SYRUP syrups Private Label 12 OZ 37865 2.401884 1.19
1 7680828008 1 1428 1 27 374916 565789 186 0 BARILLA SPAGHETTI PASTA pasta Barilla 16 OZ 37865 0.989966 0.99
1 3000003690 1 1409 1 52 292753 1571765 358 0 ANT JEMIMA CONTRY RICH SY syrups Aunt Jemima 24 OZ 37865 2.401884 1.99
1 5100002549 1 1259 1 15 275242 362295 104 0 PREGO REG SPAGHETTI SAUCE pasta sauce Prego 26 OZ 37865 1.740049 1.50
1 3000005970 1 1812 1 18 324141 405940 122 0 AUNT JEMIMA ORIGINL SYRUP syrups Aunt Jemima 24 OZ 37865 2.401884 3.19
1 9999971306 1 1653 1 93 50278 2965902 649 0 PRIVATE LABEL HS SPAG SAUCE TOM/BAS pasta sauce Private Label 26 OZ 37865 1.740049 1.00
1 5100002796 1 1209 1 94 372618 3008213 658 0 PREGO MUSHROOM SPAG SAUCE pasta sauce Prego 26 OZ 37865 1.740049 1.58
1 9999985133 1 1349 1 8 241085 268397 51 0 PRIVATE LABEL SPAGHETTI pasta Private Label 8 OZ 37865 0.989966 0.39
1 3620001376 1 1749 1 41 37510 1267969 283 0 RAGU ROBUSTO ITAL SAUSAGE pasta sauce Ragu 26 OZ 37865 1.740049 1.61
1 3620000480 1 1334 1 36 405085 1114352 246 0 RAGU HEARTY ITAL PSTA SCE pasta sauce Ragu 26.3 OZ 37865 1.740049 1.50
1 9999985067 1 1718 1 74 242778 2296306 514 0 PRIVATE LABEL VERMICELLI pasta Private Label 16 OZ 37865 0.989966 0.52
1 3620000482 1 1716 1 86 329958 2726765 597 0 RAGU HEARTY SAUT ON/MUSH pasta sauce Ragu 26.3 OZ 37865 1.740049 1.58
1 1510000027 1 0939 1 15 338989 362293 104 0 CREAMETTE X-WIDE NOODLE pasta Creamette 16 OZ 37865 0.989966 1.79
1 3000005560 1 1605 1 86 77712 2731616 598 0 AUNT JEMIMA LITE SYRUP syrups Aunt Jemima 36 OZ 37865 2.401884 3.99

Summary Info

  • There are 17 variables in total from the 4 tables these can be majorly divided among store, customer, product, discount and transaction. We have 2 numerical variables number of units and the corresponding sales value of a product type in a basket.

  • The categorical variables are store, product, household, basket, commodity, brand and a Boolean variable coupon.

  • The main variables of concern for the project are brand, commodity, basket and household. Pasta and pasta sauce have 56 and 43 type of brands respectively, which can have association with each other, whereas pancake mixes and syrups have 35 and 17 type of brands respectively, which can have association with each other.

###############################
# Summary - numeric variables #
###############################


basic_stats <- basicStats(trx_activity_product_lookup_store[ , c("units",
                                                 "imputed_dollar_sales") ])[
                                                   c("Mean",
                                                     "Stdev",
                                                     "Median",
                                                     "Minimum",
                                                     "Maximum"),
                                                  ]
Summary - numeric variables
units imputed_dollar_sales
Mean 1.196616 1.759720
Stdev 0.568602 1.126983
Median 1.000000 1.500000
Minimum 1.000000 0.010000
Maximum 156.000000 153.140000
#################################
# Summary - character variables #
#################################
char_var <- trx_activity_product_lookup_store %>%
                  summarize(store = n_distinct(store),
                            product_lookup = n_distinct(upc),
                            household = n_distinct(household),
                            basket = n_distinct(basket),
                            coupon = n_distinct(coupon),
                            commodity = n_distinct(commodity),
                            brand = n_distinct(brand)) %>%
                  gather(variable, number_of_levels)
Summary - character variables
variable number_of_levels
store 387
product_lookup 927
household 510027
basket 3316349
coupon 2
commodity 4
brand 131
#######################################
# Number of brands per each commodity #
#######################################

num_brands <- trx_activity_product_lookup_store %>%
                  group_by(commodity) %>%
                      summarize(no_distinct_brand = n_distinct(brand))
Number of brands per each commodity
commodity no_distinct_brand
pancake mixes 17
pasta 56
pasta sauce 43
syrups 35

EDA

Exploratory Data Analysis

Feature Engineering / Data Summarization

  • We have calculated support using the whole dataset for each commodity-brand entity. We group by-ed household and commodity to find total values, and then summed for each brand. This will help us calculate support.
#######################
# Support Calculation #
#######################

trx_prod_support <-  as.data.frame(trx_prod %>%
                                     group_by(basket, commodity, brand) %>%
                                     summarize(num_units = sum(units)) %>%
                                     select( basket, commodity, brand,
                                             num_units) %>%
                                     group_by(commodity, brand) %>%
                                     summarize(sum = sum(num_units))) %>%
                                     mutate('support' = sum / sum(sum))

trx_prod1 <- left_join(trx_prod, trx_prod_support, by = c('commodity','brand') )


trx_prod1 <- trx_prod1[,c('basket','commodity','brand','support', 'units')]

Finding associated pairs

  • We then split the dataset based on commodity and joined the Pasta with Pasta sauce and Pancake Mix with Syrup datasets by basket, to create datasets where a single lineitem contains both a certain Pancake Mix with a certain Syrup, and likewise for the pasta pair. This will give us associated pairs of products that were bought together.

  • We then appended the pasta dataset to the pancake dataset and calculated the support of each of the pairs. Following this we calculated lift which is the ratio between support of associated pairs and product of the support of each of the two entities in the pair.

#############################
# Creating Associated Pairs #
#############################

j = 1
name = c('pancake' , 'syrup' , 'pasta' , 'sauce')

for (i in c('pancake mixes' , 'syrups' , 'pasta' , 'pasta sauce'))
{
  assign(paste0(name[j]), trx_prod1[trx_prod1$commodity == i, ])
  
  j = j + 1
}

pancake2 <- inner_join(pancake, syrup, by = 'basket')
pasta2 <- inner_join(pasta, sauce, by = 'basket')

full <- rbind(pancake2, pasta2)
Associated Pairs
basket commodity.x brand.x support.x units.x commodity.y brand.y support.y units.y
3060398 pancake mixes Fastshake 6.30e-06 2 syrups Aunt Jemima 0.0199269 1
2899256 pancake mixes Fastshake 6.30e-06 4 syrups Aunt Jemima 0.0199269 1
3125620 pancake mixes Fastshake 6.30e-06 1 syrups Aunt Jemima 0.0199269 1
3121056 pancake mixes Fastshake 6.30e-06 1 syrups Aunt Jemima 0.0199269 1
3153652 pancake mixes Fastshake 6.30e-06 2 syrups Mrs Butterworth 0.0084565 1
1832615 pancake mixes Fastshake 6.30e-06 1 syrups Private Label 0.0357777 1
531796 pancake mixes Fastshake 6.30e-06 1 syrups Aunt Jemima 0.0199269 1
154279 pancake mixes Fastshake 6.30e-06 1 syrups Cozy Cottage 0.0029629 1
2127566 pancake mixes Fastshake 6.30e-06 1 syrups Cozy Cottage 0.0029629 1
2311710 pancake mixes M W Flapstax 4.92e-04 2 syrups Karo 0.0083613 1
1572785 pancake mixes M W Flapstax 4.92e-04 6 syrups Log Cabin 0.0068337 1
3056835 pancake mixes M W Flapstax 4.92e-04 2 syrups Private Label Value 0.0016007 1
1517469 pancake mixes M W Flapstax 4.92e-04 1 syrups Pennant 0.0019466 1
2166265 pancake mixes M W Flapstax 4.92e-04 2 syrups Private Label 0.0357777 1
650903 pancake mixes M W Flapstax 4.92e-04 2 syrups Log Cabin 0.0068337 1
776828 pancake mixes M W Flapstax 4.92e-04 2 syrups Mrs Butterworth 0.0084565 1
1455951 pancake mixes M W Flapstax 4.92e-04 2 syrups Aunt Jemima 0.0199269 1
3067503 pancake mixes M W Flapstax 4.92e-04 1 syrups Private Label 0.0357777 1
1806408 pancake mixes M W Flapstax 4.92e-04 1 syrups Private Label 0.0357777 1
1073388 pancake mixes M W Flapstax 4.92e-04 3 syrups Pennant 0.0019466 1
2628692 pancake mixes M W Flapstax 4.92e-04 1 syrups Log Cabin 0.0068337 1
2679760 pancake mixes M W Flapstax 4.92e-04 3 syrups Private Label 0.0357777 1
2007072 pancake mixes M W Flapstax 4.92e-04 2 syrups Aunt Jemima 0.0199269 1
1498655 pancake mixes M W Flapstax 4.92e-04 1 syrups Private Label 0.0357777 1
1885958 pancake mixes M W Flapstax 4.92e-04 4 syrups Mrs Butterworth 0.0084565 1
1576442 pancake mixes M W Flapstax 4.92e-04 1 syrups Mrs Butterworth 0.0084565 1
1314426 pancake mixes M W Flapstax 4.92e-04 1 syrups Cozy Cottage 0.0029629 1
1577620 pancake mixes M W Flapstax 4.92e-04 1 syrups Private Label 0.0357777 1
2358459 pancake mixes M W Flapstax 4.92e-04 1 syrups Private Label 0.0357777 1
791808 pancake mixes M W Flapstax 4.92e-04 2 syrups Spring Tree 0.0015634 1
3217880 pancake mixes M W Flapstax 4.92e-04 2 syrups Kellogg 0.0031179 1
2536741 pancake mixes M W Flapstax 4.92e-04 2 syrups Aunt Jemima 0.0199269 1
1098956 pancake mixes M W Flapstax 4.92e-04 1 syrups Aunt Jemima 0.0199269 1
51445 pancake mixes M W Flapstax 4.92e-04 1 syrups Pennant 0.0019466 1
2507968 pancake mixes M W Flapstax 4.92e-04 2 syrups Mrs Butterworth 0.0084565 1
1107583 pancake mixes M W Flapstax 4.92e-04 2 syrups Hungry Jack 0.0056838 1
20593 pancake mixes M W Flapstax 4.92e-04 1 syrups Log Cabin 0.0068337 1
121886 pancake mixes M W Flapstax 4.92e-04 2 syrups Cozy Cottage 0.0029629 1
2780417 pancake mixes M W Flapstax 4.92e-04 2 syrups Aunt Jemima 0.0199269 1
3512 pancake mixes M W Flapstax 4.92e-04 2 syrups Mrs Butterworth 0.0084565 1
2681321 pancake mixes M W Flapstax 4.92e-04 2 syrups Private Label 0.0357777 1
2421347 pancake mixes M W Flapstax 4.92e-04 2 syrups Private Label Premium 0.0043776 1
42991 pancake mixes M W Flapstax 4.92e-04 2 syrups Private Label 0.0357777 1
1557178 pancake mixes M W Flapstax 4.92e-04 2 syrups Private Label 0.0357777 1
1691914 pancake mixes M W Flapstax 4.92e-04 3 syrups Cary’s 0.0023183 1
1691914 pancake mixes M W Flapstax 4.92e-04 3 syrups Private Label 0.0357777 2
1901897 pancake mixes M W Flapstax 4.92e-04 1 syrups Aunt Jemima 0.0199269 1
1591607 pancake mixes M W Flapstax 4.92e-04 1 syrups Mrs Butterworth 0.0084565 1
2127478 pancake mixes M W Flapstax 4.92e-04 2 syrups Aunt Jemima 0.0199269 1
207758 pancake mixes M W Flapstax 4.92e-04 2 syrups Aunt Jemima 0.0199269 1
########
# Lift #
########

association['support.xy'] <- association$xUy / (nrow(trx_prod_double))
association['lift'] <- association$support.xy / (association$support.x * 
                                                       association$support.y)

lift_ass <- unique(association[ ,c('commoditybrand.xy', 'lift')])

lift <- lift_ass[order(lift_ass$lift, decreasing = TRUE), ]

Popularity Index

  • We then calculated the overall popularity of each brand within each commodity category and ranked the brands based on popularity.
####################
# Popularity Index #
####################

popularity <- as.data.frame(trx_prod %>%
                            group_by(commodity, brand) %>%
                            summarize(total_sales = sum(dollar_sales), 
                                      num_house = 
                                              n_distinct(factor(household))) %>%
                            mutate('popularity.score' = 
                                     (num_house * 
                                        (total_sales ^ (1 / 3)) / 10 ^ 4))  %>% 
                            select(commodity, brand, num_house, 
                                            total_sales, popularity.score)) %>%
                            group_by(commodity) %>%
                            mutate(rank = 
                                     order(order(popularity.score, 
                                                 decreasing = TRUE)))

Filter out rows with low sales

We filtered for brands that have been bought by at least 1000 households and have had at least $10,000 of sales. This new information was joined with our earlier dataset, so we can summarize the brand pairs with the highest lift while having a popularity index for each brand alongside.

Associated pairs with lift and rank
commodity.x brand.x commodity.y brand.y lift rank.x rank.y total_sales.x total_sales.y num_house.x num_house.y
pancake mixes Krusteaz syrups Smuckers 15.289399 4 14 30528.77 24862.79 9241 6315
pancake mixes Mrs Butterworth syrups Mrs Butterworth 15.192832 7 3 11415.08 159165.47 5164 31809
pancake mixes Hodgson Mills syrups Cozy Cottage 15.103232 8 10 12857.64 41890.14 4287 11379
pancake mixes Private Label syrups Private Label Value 14.060408 3 15 73616.82 13190.32 32199 7227
pancake mixes Hodgson Mills syrups Spring Tree 11.739578 8 12 12857.64 32740.97 4287 6460
pancake mixes Hungry Jack syrups Hungry Jack 11.290699 2 7 111435.32 113414.69 35557 18766
pancake mixes Hodgson Mills syrups Maple Grove 10.997654 8 17 12857.64 21175.21 4287 2452
pancake mixes Hodgson Mills syrups Cary’s 10.254439 8 11 12857.64 46800.22 4287 8417
pancake mixes Hodgson Mills syrups Private Label Premium 10.182362 8 8 12857.64 103906.08 4287 14060
pancake mixes Aunt Jemima syrups Aunt Jemima 8.723005 1 2 206698.12 359668.92 56103 65654
pancake mixes Private Label syrups Private Label 8.097007 3 1 73616.82 352930.86 32199 106181
pancake mixes Hodgson Mills syrups Smuckers 7.985365 8 14 12857.64 24862.79 4287 6315
pancake mixes Krusteaz syrups Maple Grove 7.307305 4 17 30528.77 21175.21 9241 2452
pancake mixes Mrs Butterworth syrups Private Label Value 7.306707 7 15 11415.08 13190.32 5164 7227
pancake mixes Hungry Jack syrups Mrs Butterworth 6.558429 2 3 111435.32 159165.47 35557 31809
pancake mixes Aunt Jemima syrups Mrs Butterworth 6.097321 1 3 206698.12 159165.47 56103 31809
pancake mixes Hungry Jack syrups Smuckers 6.017868 2 14 111435.32 24862.79 35557 6315
pancake mixes Mrs Butterworth syrups Northwoods 5.893625 7 5 11415.08 90610.37 5164 29456
pancake mixes Mrs Butterworth syrups Aunt Jemima 5.381893 7 2 11415.08 359668.92 5164 65654
pancake mixes Krusteaz syrups Kellogg 5.215197 4 9 30528.77 55015.38 9241 12796
pancake mixes Hungry Jack syrups Kellogg 5.213002 2 9 111435.32 55015.38 35557 12796
pancake mixes Hodgson Mills syrups Log Cabin 5.090279 8 6 12857.64 130904.45 4287 23590
pancake mixes Mrs Butterworth syrups Pennant 5.084162 7 13 11415.08 24661.83 5164 6490
pancake mixes Hungry Jack syrups Aunt Jemima 5.016468 2 2 111435.32 359668.92 35557 65654
pancake mixes Krusteaz syrups Mrs Butterworth 4.896092 4 3 30528.77 159165.47 9241 31809
pancake mixes Hungry Jack syrups Log Cabin 4.739713 2 6 111435.32 130904.45 35557 23590
pancake mixes Aunt Jemima syrups Northwoods 4.623519 1 5 206698.12 90610.37 56103 29456
pancake mixes Hungry Jack syrups Northwoods 4.500026 2 5 111435.32 90610.37 35557 29456
pancake mixes Krusteaz syrups Log Cabin 4.461440 4 6 30528.77 130904.45 9241 23590
pancake mixes Aunt Jemima syrups Log Cabin 4.330303 1 6 206698.12 130904.45 56103 23590
pancake mixes Mrs Butterworth syrups Private Label 4.315227 7 1 11415.08 352930.86 5164 106181
pasta Barilla Plus pasta sauce Bertolli 4.286954 13 6 19262.89 258090.84 6849 40187
pancake mixes Krusteaz syrups Northwoods 4.280993 4 5 30528.77 90610.37 9241 29456
pasta Dececco pasta sauce Emeril’s 4.174479 15 10 14780.83 60173.35 3111 9564
pancake mixes Aunt Jemima syrups Hungry Jack 4.117391 1 7 206698.12 113414.69 56103 18766
pancake mixes Krusteaz syrups Private Label Premium 4.092764 4 8 30528.77 103906.08 9241 14060
pancake mixes Krusteaz syrups Hungry Jack 4.026349 4 7 30528.77 113414.69 9241 18766
pancake mixes Krusteaz syrups Aunt Jemima 3.985543 4 2 30528.77 359668.92 9241 65654
pancake mixes Bisquick syrups Hungry Jack 3.902702 5 7 18437.72 113414.69 7021 18766
pancake mixes Mrs Butterworth syrups Smuckers 3.887910 7 14 11415.08 24862.79 5164 6315
pasta Healthy Harvest pasta sauce Emeril’s 3.871710 11 10 28112.45 60173.35 8054 9564
pancake mixes Mrs Butterworth syrups Hungry Jack 3.862348 7 7 11415.08 113414.69 5164 18766
pasta Barilla pasta sauce Barilla 3.843450 2 7 433764.78 175055.76 125579 28574
pancake mixes Aunt Jemima syrups Smuckers 3.841292 1 14 206698.12 24862.79 56103 6315
pancake mixes Bisquick syrups Smuckers 3.812462 5 14 18437.72 24862.79 7021 6315
pancake mixes Private Label syrups Northwoods 3.755225 3 5 73616.82 90610.37 32199 29456
pancake mixes Hungry Jack syrups Private Label Value 3.746628 2 15 111435.32 13190.32 35557 7227
pancake mixes Krusteaz syrups Spring Tree 3.707536 4 12 30528.77 32740.97 9241 6460
pancake mixes Bisquick syrups Mrs Butterworth 3.703635 5 3 18437.72 159165.47 7021 31809
pancake mixes Aunt Jemima syrups Kellogg 3.626670 1 9 206698.12 55015.38 56103 12796

Plots and Tables

Bar Charts

For commodity-wise Brand Support

Word Cloud

For Top 12 Pairwise Brand per commodity by Lift
1. Pancake mix and Syrup | 2. Pasta and Pasta Sauce

Heat Matrix

For Top Pairwise Products by Lift

Summary

Problem Statement adressed

The objective of this analysis was to study the interaction between two pairs of products - Pasta with Pasta Sauce and Syrup with Pancake Mix.

We ranked brands with a popularity score for within each commodity and further calculated the magnitude of association between pairs of brands each belonging to one of the associated pairs of commodities (E.g. Brand X of Pasta Sauce with Brand Y of Pasta).

Data Used

  • The data used is the Carbo-Loading dataset which contains 2 years of Pasta, Pasta Sauce, Syrup and Pancake Mix transactions at household level which was obtained through the loyalty card program of the Kroger departmental store chain.

Support of single product

We first joined the transactions table with the product table and then grouped the transactions by household, commodity, brand and basket to calculate the number of units bought by a certain household of a commodity-brand in a single purchase.

We then calculated the total number of units bought for every unique commodity-brand and divided it by the total units bought across all stores to calculate the support.

Support is an indication of how frequently the itemset appears in the dataset

\[ Support = TotalNumProductX / TotalNumAllProducts \]

Support of product pairs

We then split the data into four datasets on the commodity and join the pairs of datasets - Pasta with Pasta Sauce and Syrup with Pancake Mix. This allowed us to have a dataset with each row containing every product pair that was bought together

Following this, we calculate xUy which is the support of each product-pair.

Note : We use minimum number of units for the product pair as this makes more sense when calculating the frequency of the pair.

Once we have the pair-frequency we proceeded to join the dataset on itself so we had each row containing the frequency and support for product X, product Y and product X & Y.

We removed the rows where the same commodity-brand appears twice in the same row.

Lift

We then calculate the lift for each product pair which measures the degree to which those two occurrences are dependent on one another

\[ Lift(X => Y) = Support(X U Y) / [Support(X) * Support(Y)] \]

Popularity Index

Here we ascertained the popularity index of each product using households and sales giving a higher weightage for number of distinct households that bought the product. We use this score to rank each induvidual product by its popularity within each commodity category

Final List

Finally we filtered for products that have sales greater than $10000 and those which have been bought by atleast 1000 distinct households

Interesting Insights

We found out the top brand pairs for pancake mixes and syrups and also pasta and pasta sauce.

Associated pairs with lift and rank
commodity.x brand.x commodity.y brand.y lift rank.x rank.y total_sales.x total_sales.y num_house.x num_house.y
pancake mixes Krusteaz syrups Smuckers 15.289399 4 14 30528.77 24862.79 9241 6315
pancake mixes Mrs Butterworth syrups Mrs Butterworth 15.192832 7 3 11415.08 159165.47 5164 31809
pancake mixes Hodgson Mills syrups Cozy Cottage 15.103232 8 10 12857.64 41890.14 4287 11379
pancake mixes Private Label syrups Private Label Value 14.060408 3 15 73616.82 13190.32 32199 7227
pancake mixes Hodgson Mills syrups Spring Tree 11.739578 8 12 12857.64 32740.97 4287 6460
pancake mixes Hungry Jack syrups Hungry Jack 11.290699 2 7 111435.32 113414.69 35557 18766
pancake mixes Hodgson Mills syrups Maple Grove 10.997654 8 17 12857.64 21175.21 4287 2452
pancake mixes Hodgson Mills syrups Cary’s 10.254439 8 11 12857.64 46800.22 4287 8417
pancake mixes Hodgson Mills syrups Private Label Premium 10.182362 8 8 12857.64 103906.08 4287 14060
pancake mixes Aunt Jemima syrups Aunt Jemima 8.723005 1 2 206698.12 359668.92 56103 65654
pancake mixes Private Label syrups Private Label 8.097007 3 1 73616.82 352930.86 32199 106181
pancake mixes Hodgson Mills syrups Smuckers 7.985365 8 14 12857.64 24862.79 4287 6315
pancake mixes Krusteaz syrups Maple Grove 7.307305 4 17 30528.77 21175.21 9241 2452
pancake mixes Mrs Butterworth syrups Private Label Value 7.306707 7 15 11415.08 13190.32 5164 7227
pancake mixes Hungry Jack syrups Mrs Butterworth 6.558429 2 3 111435.32 159165.47 35557 31809
pancake mixes Aunt Jemima syrups Mrs Butterworth 6.097321 1 3 206698.12 159165.47 56103 31809
pancake mixes Hungry Jack syrups Smuckers 6.017868 2 14 111435.32 24862.79 35557 6315
pancake mixes Mrs Butterworth syrups Northwoods 5.893625 7 5 11415.08 90610.37 5164 29456
pancake mixes Mrs Butterworth syrups Aunt Jemima 5.381893 7 2 11415.08 359668.92 5164 65654
pancake mixes Krusteaz syrups Kellogg 5.215197 4 9 30528.77 55015.38 9241 12796
pancake mixes Hungry Jack syrups Kellogg 5.213002 2 9 111435.32 55015.38 35557 12796
pancake mixes Hodgson Mills syrups Log Cabin 5.090279 8 6 12857.64 130904.45 4287 23590
pancake mixes Mrs Butterworth syrups Pennant 5.084162 7 13 11415.08 24661.83 5164 6490
pancake mixes Hungry Jack syrups Aunt Jemima 5.016468 2 2 111435.32 359668.92 35557 65654
pancake mixes Krusteaz syrups Mrs Butterworth 4.896092 4 3 30528.77 159165.47 9241 31809
pancake mixes Hungry Jack syrups Log Cabin 4.739713 2 6 111435.32 130904.45 35557 23590
pancake mixes Aunt Jemima syrups Northwoods 4.623519 1 5 206698.12 90610.37 56103 29456
pancake mixes Hungry Jack syrups Northwoods 4.500026 2 5 111435.32 90610.37 35557 29456
pancake mixes Krusteaz syrups Log Cabin 4.461440 4 6 30528.77 130904.45 9241 23590
pancake mixes Aunt Jemima syrups Log Cabin 4.330303 1 6 206698.12 130904.45 56103 23590
pancake mixes Mrs Butterworth syrups Private Label 4.315227 7 1 11415.08 352930.86 5164 106181
pasta Barilla Plus pasta sauce Bertolli 4.286954 13 6 19262.89 258090.84 6849 40187
pancake mixes Krusteaz syrups Northwoods 4.280993 4 5 30528.77 90610.37 9241 29456
pasta Dececco pasta sauce Emeril’s 4.174479 15 10 14780.83 60173.35 3111 9564
pancake mixes Aunt Jemima syrups Hungry Jack 4.117391 1 7 206698.12 113414.69 56103 18766
pancake mixes Krusteaz syrups Private Label Premium 4.092764 4 8 30528.77 103906.08 9241 14060
pancake mixes Krusteaz syrups Hungry Jack 4.026349 4 7 30528.77 113414.69 9241 18766
pancake mixes Krusteaz syrups Aunt Jemima 3.985543 4 2 30528.77 359668.92 9241 65654
pancake mixes Bisquick syrups Hungry Jack 3.902702 5 7 18437.72 113414.69 7021 18766
pancake mixes Mrs Butterworth syrups Smuckers 3.887910 7 14 11415.08 24862.79 5164 6315
pasta Healthy Harvest pasta sauce Emeril’s 3.871710 11 10 28112.45 60173.35 8054 9564
pancake mixes Mrs Butterworth syrups Hungry Jack 3.862348 7 7 11415.08 113414.69 5164 18766
pasta Barilla pasta sauce Barilla 3.843450 2 7 433764.78 175055.76 125579 28574
pancake mixes Aunt Jemima syrups Smuckers 3.841292 1 14 206698.12 24862.79 56103 6315
pancake mixes Bisquick syrups Smuckers 3.812462 5 14 18437.72 24862.79 7021 6315
pancake mixes Private Label syrups Northwoods 3.755225 3 5 73616.82 90610.37 32199 29456
pancake mixes Hungry Jack syrups Private Label Value 3.746628 2 15 111435.32 13190.32 35557 7227
pancake mixes Krusteaz syrups Spring Tree 3.707536 4 12 30528.77 32740.97 9241 6460
pancake mixes Bisquick syrups Mrs Butterworth 3.703635 5 3 18437.72 159165.47 7021 31809
pancake mixes Aunt Jemima syrups Kellogg 3.626670 1 9 206698.12 55015.38 56103 12796

Pancake mixes and syrups dominate the list with high lift meaning that they are the product pairs that are bought togetehr a lot more

We also can see that the top brands within each product category arent the ones that occur at the top of the list when bought together. Infact some of the lesser preffered brands occur in the high lift pairs.

Implications

The implications are tremendous :-

We have provided a summary which is highly counter-intuitive to what was assumed. The top brands arent bought along with its product counterpart - this could be due to finances, existing offers etc.

We suggested that the company revise the offer scheme and also changed the placement points for these products - placing the ones with higher affinity with each other.

Also, only some brands who produce for example both pancake mixes and syrups are the preffered brands for both products.

Limitations

We were unable to delve deeper and understand why it was that certain brands were preffered with certain others. With deeper domain knowledge and customer feedback one could ascertain this.

We also could not analyze how often a customer came back for a certain product/ product pair, and we wish to do this in the future.