I am looking at two years of carbo-loading household-level transactions across four categories - Pasta, Pasta Sauce, Syrup, and Pancake Mix.
Given the richness of the data, I want to know:
Are complementary products bought together?
Do product sales depend on the display?
I will look at household-week level transactions for the pasta and pasta sauce categories and see if I can identify a pattern in the items bought across weeks.
I will slice the data at the product, week, store, display_desc, and sales level, then look at the trend for each product to see whether sales vary with the location of the temporary in-store display.
This analysis will help retailers design marketing campaigns for complementary products.
The display analysis will help us understand how customers make decisions and how the in-store display influences the choices they make.
The following packages are required to run the code without errors.
library(tidyverse) #to visualize, transform, input, tidy and join data
library(haven) #to input data from SAS
library(dplyr) #data wrangling
library(stringr) #string related functions
library(kableExtra) #to create HTML Table
library(DT) #to preview the data sets
library(lubridate) #to apply the date functions
library(arules) #to represent, manipulate and analyze transactional data
library(extrafont) #to use specific font in plots
Note: You will also need to install the plyr package, but do not load it, as some of its functions mask functions in dplyr.
The carbo-loading household-level data comes from the loyalty card program of a leading US grocer. It contains four datasets - transactions, product_lookup, causal_lookup, and store_lookup - and was obtained from here. Please see the code book at the same location.
The transactions table:

| Variable | Description |
|---|---|
| upc | Standard 10 digit UPC. |
| dollar_sales | Amount of dollars spent by the consumer |
| units | Number of products purchased by the consumer |
| time_of_transaction | Time of the transaction (military time) |
| geography | Distinguishes between two large geographical regions, possible values are 1 or 2 |
| week | Week of the transaction, values are from 1 to 104 |
| household | Unique households |
| store | Unique stores |
| basket | Unique baskets/trips to store |
| day | Day of the transaction, possible values are from 1 to 728 |
| coupon | Indicates coupon usage, 1 if used, 0 otherwise |
The product_lookup table:

| Variable | Description |
|---|---|
| upc | Standard 10 digit UPC. |
| product_description | Description of product |
| commodity | Specifies which of the four product categories under consideration the item belongs to |
| brand | Specific brand of item |
| product_size | Specifies package size of product |
The causal_lookup table:

| Variable | Description |
|---|---|
| upc | Standard 10 digit UPC. |
| store | Identifies unique store |
| week | Week of transaction, possible values are 1 through 104 |
| feature_desc | Describes product location on weekly mailer |
| display_desc | Describes temporary in-store display |
| geography | Distinguishes between two large geographical regions, possible values are 1 or 2 |
The store_lookup table:

| Variable | Description |
|---|---|
| store | Identifies unique stores |
| store_zip_code | 5 digit zip code |
I load the four datasets-
files <- c("causal_lookup", "product_lookup","store_lookup","transactions")
names <- c("causal_lookup","prd_lookup","store_lookup","transactions")
for (i in seq_along(files)) {
df <- read_sas(paste0("data/",files[i],".sas7bdat"))
assign(names[i], df)
}
First, I check the length of the UPCs in the product lookup table -
unique(nchar(x = prd_lookup$upc))
## [1] 9 10
Some UPCs have only 9 characters, so I left-pad them with zeros to the standard 10 digits before joining with the other tables.
#left-pad the UPCs to 10 characters
prd_lookup_crtd <- prd_lookup
prd_lookup_crtd$upc <- str_pad(prd_lookup$upc, 10, side = "left", pad = '0')
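As a quick sanity check (not part of the original output), the padded UPCs should now all be 10 characters long -
#verify the padded UPC lengths
unique(nchar(prd_lookup_crtd$upc))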
Next, I look at the distribution of dollar_sales to check for outliers -
quantile(transactions$dollar_sales, probs = seq(0, 1, 0.01))
## 0% 1% 2% 3% 4% 5% 6% 7% 8% 9%
## -11.76 0.44 0.50 0.50 0.52 0.55 0.59 0.67 0.69 0.75
## 10% 11% 12% 13% 14% 15% 16% 17% 18% 19%
## 0.79 0.79 0.79 0.79 0.80 0.85 0.88 0.89 0.89 0.90
## 20% 21% 22% 23% 24% 25% 26% 27% 28% 29%
## 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
## 30% 31% 32% 33% 34% 35% 36% 37% 38% 39%
## 1.00 1.00 1.03 1.09 1.09 1.18 1.19 1.19 1.19 1.25
## 40% 41% 42% 43% 44% 45% 46% 47% 48% 49%
## 1.25 1.29 1.30 1.37 1.39 1.45 1.49 1.50 1.50 1.50
## 50% 51% 52% 53% 54% 55% 56% 57% 58% 59%
## 1.50 1.53 1.58 1.58 1.58 1.59 1.59 1.60 1.65 1.67
## 60% 61% 62% 63% 64% 65% 66% 67% 68% 69%
## 1.69 1.78 1.79 1.85 1.90 1.98 1.98 1.99 1.99 1.99
## 70% 71% 72% 73% 74% 75% 76% 77% 78% 79%
## 1.99 1.99 2.00 2.00 2.18 2.19 2.25 2.29 2.38 2.45
## 80% 81% 82% 83% 84% 85% 86% 87% 88% 89%
## 2.49 2.50 2.55 2.69 2.69 2.79 2.89 2.99 2.99 2.99
## 90% 91% 92% 93% 94% 95% 96% 97% 98% 99%
## 3.00 3.18 3.19 3.34 3.49 3.59 3.98 4.00 4.77 5.70
## 100%
## 153.14
A few data points look inconsistent with the rest (for example, negative dollar_sales at the low end and a maximum of 153.14). To check whether the data is genuinely skewed, I ran the quantile function on units as well.
quantile(transactions$units, probs = seq(0, 1, 0.01))
## 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 60% 61% 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73% 74%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 75% 76% 77% 78% 79% 80% 81% 82% 83% 84% 85% 86% 87% 88% 89%
## 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
## 90% 91% 92% 93% 94% 95% 96% 97% 98% 99% 100%
## 2 2 2 2 2 2 2 2 3 3 156
I observe the same skew in units.
Finally, I looked at the transactions with more than 100 units -
transactions %>%
mutate_all(as.vector) %>%
filter( units > 100) %>%
left_join(prd_lookup_crtd, by = "upc")
## # A tibble: 3 x 15
## upc dollar_sales units time_of_transac~ geography week household store
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 9999~ 118. 120 1400 1 65 352180 102
## 2 2700~ 153. 156 1541 2 73 88418 270
## 3 9999~ 132. 133 1500 1 81 398723 9
## # ... with 7 more variables: basket <dbl>, day <dbl>, coupon <dbl>,
## # product_description <chr>, commodity <chr>, brand <chr>,
## # product_size <chr>
There are exactly three entries in which a household bought more than 100 units of pasta sauce at one time (mapped to the prd_lookup_crtd table to get the commodity name), which is not expected. There might be a data-entry issue here, so I plan to drop these rows from further analysis.
#identifying the data index
which(transactions$units > 100)
## [1] 3127414 3522595 4020790
#cross-validating data at the index
transactions[c(3127414,3522595,4020790),]
## # A tibble: 3 x 11
## upc dollar_sales units time_of_transac~ geography week household store
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 9999~ 118. 120 1400 1 65 352180 102
## 2 2700~ 153. 156 1541 2 73 88418 270
## 3 9999~ 132. 133 1500 1 81 398723 9
## # ... with 3 more variables: basket <dbl>, day <dbl>, coupon <dbl>
#deleting these rows and creating another table so as not to modify the original
transactions_crtd <- transactions[-c(3127414,3522595,4020790),]
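A quick check (again, not part of the original output) that the high-unit rows are gone -
#confirm no transactions with more than 100 units remain
any(transactions_crtd$units > 100)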
The identified issues are now fixed and the tables are ready for further analysis.
Please note: instead of creating a master dataset with more than 15 columns, I slice the data and apply joins at the required levels.
A brief overview of the transactions data, which will be used for most of the analysis:
The number of unique UPCs that we will primarily use to join our tables: 927
All these UPCs are present in the product_lookup table, so we can fetch the product and brand information -
transactions_crtd %>%
mutate_all(as.vector) %>%
anti_join(prd_lookup_crtd , by = "upc")
## # A tibble: 0 x 11
## # ... with 11 variables: upc <chr>, dollar_sales <dbl>, units <dbl>,
## # time_of_transaction <chr>, geography <dbl>, week <dbl>,
## # household <dbl>, store <dbl>, basket <dbl>, day <dbl>, coupon <dbl>
The different features available in the data and of interest: Wrap Interior Feature, Wrap Back Feature, Interior Page Feature, Not on Feature, Interior Page Line Item, Wrap Front Feature, Front Page Feature, Back Page Feature
The different temporary in-store display options available: Not on Display, Rear End Cap, Store Rear, Front End Cap, In-Shelf, Store Front, Secondary Location Display, In-Aisle, Promo/Seasonal Aisle, Side-Aisle End Cap, Mid-Aisle End Cap
The total sales and units by display_desc -
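The code chunk behind this summary isn't shown; a minimal sketch of how it could be produced, assuming display_desc is taken from the causal_lookup table joined on upc, store, week, and geography (the feature_desc table below would be produced analogously), is -
#join transactions with the display information and aggregate
transactions_crtd %>%
  mutate_all(as.vector) %>%
  inner_join(causal_lookup %>% mutate_all(as.vector),
             by = c("upc", "store", "week", "geography")) %>%
  group_by(display_desc) %>%
  summarize(total_sales = sum(dollar_sales), total_units = sum(units)) %>%
  arrange(desc(total_sales))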
## # A tibble: 11 x 3
## display_desc total_sales total_units
## <chr> <dbl> <dbl>
## 1 Not on Display 324395. 281389
## 2 Rear End Cap 79471. 68814
## 3 In-Shelf 63139. 51951
## 4 Front End Cap 31889. 28086
## 5 Secondary Location Display 23581. 16422
## 6 In-Aisle 21124. 16219
## 7 Promo/Seasonal Aisle 13619. 12359
## 8 Mid-Aisle End Cap 13288. 11084
## 9 Store Rear 10016. 8349
## 10 Store Front 6147. 5256
## 11 Side-Aisle End Cap 1835. 1250
I observe that products that are not on display account for more sales overall.
The total sales and units by feature_desc -
## # A tibble: 8 x 3
## feature_desc total_sales total_units
## <chr> <dbl> <dbl>
## 1 Interior Page Feature 240347. 196841
## 2 Not on Feature 172546. 136393
## 3 Front Page Feature 105429. 106489
## 4 Wrap Interior Feature 22507. 19338
## 5 Back Page Feature 17707. 17476
## 6 Wrap Front Feature 11098. 11210
## 7 Wrap Back Feature 9492. 7467
## 8 Interior Page Line Item 9378. 5965
I observe that products featured on the interior page account for more sales overall.
In this section, I started with initial data observations, tried to identify a buying pattern for complementary products, and looked at how sales differ based on the temporary in-store display of the products.
Which brands are responsible for the most sales within each commodity? (Top 5)
#which brand is performing better
transactions_crtd %>%
mutate_all(as.vector) %>%
inner_join(prd_lookup_crtd, by = c("upc") ) %>%
select(upc, week,store, geography, commodity, brand, dollar_sales,units) %>%
group_by( commodity,brand) %>%
summarize( total_sales = sum(dollar_sales), total_units = sum(units)) %>%
arrange(desc(total_sales),desc(total_units)) %>%
top_n(5, wt = total_sales) %>%
ggplot( aes(x = reorder(brand,-total_sales), total_sales)) +
geom_bar(stat = "identity", fill = "indianred") +
facet_wrap(~commodity, scales = "free") +
scale_x_discrete("Brand") +
scale_y_continuous("Total Sales",labels = scales::comma) +
labs(title = "Total sales by Brand and Commodity",
subtitle = "Out of 131 brands, we observe top five brands from each commodity; \nPrivate Label is one of the most preferred brands among all the commodities") +
theme_minimal() +
theme(panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
legend.title = element_blank(),
legend.justification = c(0, 1),
legend.position = c(.1, 1.075),
legend.background = element_blank(),
legend.direction="horizontal",
text = element_text(family = "Georgia", size = 7),
plot.title = element_text(size = 20, margin = margin(b = 10)),
plot.subtitle = element_text(size = 10, color = "darkslategrey", margin = margin(b = 25)),
plot.caption = element_text(size = 8, margin = margin(t = 10), color = "grey70", hjust = 0))
What is the preferred shopping time?
#extracting the hour information
library(lubridate)
#format = "%H" parses the leading hour digits of the military time (e.g. "1400" -> 14)
transactions_crtd$time_of_transaction <- as.POSIXct(transactions_crtd$time_of_transaction,format="%H")
transactions_crtd$time_of_transaction <- hour(transactions_crtd$time_of_transaction)
transactions_crtd %>%
ggplot(aes(x = time_of_transaction)) +
geom_bar(fill = "indianred") +
theme(rect = element_blank()) +
scale_x_continuous("Time of Day" ) +
scale_y_continuous("Count", labels = scales::comma) +
labs(title = "Most Preferred Shopping Time",
subtitle = "people start shopping around 10 in the morning; \n15:00 to 18:00 are the rush hours") +
theme_minimal() +
theme(panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
legend.title = element_blank(),
legend.justification = c(0, 1),
legend.position = c(.1, 1.075),
legend.background = element_blank(),
legend.direction="horizontal",
text = element_text(family = "Georgia", size = 7),
plot.title = element_text(size = 20, margin = margin(b = 10)),
plot.subtitle = element_text(size = 10, color = "darkslategrey", margin = margin(b = 25)),
plot.caption = element_text(size = 8, margin = margin(t = 10), color = "grey70", hjust = 0))
Which items are sold more?
transactions_crtd %>%
mutate_all(as.vector) %>%
filter(dollar_sales > 0) %>%
inner_join(prd_lookup_crtd, by = c("upc") ) %>%
select(upc, week,store, geography,
commodity,basket, brand, dollar_sales,units) %>%
group_by(commodity) %>%
summarize(count = sum(units)) %>%
arrange(desc(count)) %>%
ggplot(aes(x=reorder(commodity,count), y=count))+
geom_bar(stat="identity",fill="indianred")+
scale_x_discrete("Commodity") +
scale_y_continuous("Total units bought",labels = scales::comma) +
labs(title = "Total Units Bought by Commodity",
subtitle = "Pasta is the most preferred product; \nPancake mixes are the least preferred") +
theme_minimal() +
theme(rect = element_blank()) +
theme(panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
legend.title = element_blank(),
legend.justification = c(0, 1),
legend.position = c(.1, 1.075),
legend.background = element_blank(),
legend.direction="horizontal",
text = element_text(family = "Georgia", size = 7),
plot.title = element_text(size = 20, margin = margin(b = 10)),
plot.subtitle = element_text(size = 10, color = "darkslategrey", margin = margin(b = 25)),
plot.caption = element_text(size = 8, margin = margin(t = 10), color = "grey70", hjust = 0))
Since Pasta and Pasta Sauce generate the most sales, I would like to know -
A. whether a relationship exists between the sales of complementary commodities (Pasta and Pasta Sauce)
B. whether sales depend on the display of the product
Part A: Checking whether a relationship exists between the sales of complementary commodities (Pasta and Pasta Sauce)
I am using market basket analysis, which applies Association Rule Mining to look for combinations of items that frequently occur together in transactions. It tells us which items customers buy together by generating a set of rules. Two statistical measures can be used to determine whether a rule is deemed "interesting":
Support: the fraction of transactions that contain the item set. If Support = 0.2, A and B appear together in 20% of the transactions.
Confidence: the fraction of transactions containing A that also contain B. If Confidence = 0.6, 60% of customers who purchase A also purchase B.
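In formula form (standard definitions, not taken from the original write-up), for a rule A => B over N transactions:

$$\text{Support}(A \Rightarrow B) = \frac{\text{count}(A \cup B)}{N}, \qquad \text{Confidence}(A \Rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)}$$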
#filtered the data for geography 1 for the analysis
market <- transactions_crtd %>%
left_join(prd_lookup_crtd, by="upc") %>%
filter(geography == 1 ) %>%
arrange(desc(commodity)) %>%
select(basket,commodity) %>%
distinct()
#using the plyr package for its ddply function (called with an explicit namespace so plyr is never loaded)
#The function ddply() accepts a data frame, splits it into pieces based on one or more factors, computes on the pieces, and then returns the results as a data frame
transactionData <- plyr::ddply(market,c("basket"),
function(market)paste(market$commodity,
collapse = ","))
#removing all columns except for the items
transactionData$basket = NULL
colnames(transactionData) <- c("items")
#writing the transactional data in another file
write.csv(x = transactionData$items,
file = "edited_transaction_required.csv",
quote = FALSE, row.names = FALSE)
groceries <- read.transactions('edited_transaction_required.csv',
format = 'basket', sep=',', skip = 1)
summary(groceries)
## transactions as itemMatrix in sparse format with
## 1776754 rows (elements/itemsets/transactions) and
## 4 columns (items) and a density of 0.3359036
##
## most frequent items:
## pasta pasta sauce syrups pancake mixes (Other)
## 1042750 819467 364649 160406 0
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4
## 1222469 506297 39743 8245
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.344 2.000 4.000
##
## includes extended item information - examples:
## labels
## 1 pancake mixes
## 2 pasta
## 3 pasta sauce
So, we have 1,776,754 transactions (baskets) covering 4 distinct items. The first block of the output shows how many transactions involved each item. The second block contains summary statistics about the size of each transaction: for example, there were 506,297 transactions in which exactly 2 items were bought and 8,245 transactions in which all 4 items were bought.
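As a side note on the design choice above, the CSV round trip can be avoided by coercing the basket-level item lists directly into an arules transactions object; a minimal sketch (the object name groceries_alt is mine) -
#split commodities by basket and coerce the resulting list into a transactions object
trans_list <- split(market$commodity, market$basket)
groceries_alt <- as(trans_list, "transactions")
summary(groceries_alt)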
Below is a view of the transactions -
inspect(groceries[1:10])
## items
## [1] {pancake mixes}
## [2] {pasta}
## [3] {pancake mixes,syrups}
## [4] {pancake mixes}
## [5] {pasta}
## [6] {pasta,pasta sauce}
## [7] {pasta}
## [8] {pasta}
## [9] {pancake mixes,syrups}
## [10] {pasta}
Let’s look at the frequency of items that are purchased-
itemFrequency(groceries[, 1:4])
## pancake mixes pasta pasta sauce syrups
## 0.09028037 0.58688485 0.46121579 0.20523325
We observe that only about 9% of transactions involve pancake mixes, whereas pasta was bought in about 59% of the transactions. Below is the plot for the same -
itemFrequencyPlot(groceries, topN = 4)
We create rules using the Apriori algorithm, passing support = 0.005 and confidence = 0.5 to return all rules that have a support of at least 0.5% and a confidence of at least 50%.
#generating the rules with the thresholds described above
rules <- apriori(groceries, parameter = list(supp = 0.005, conf = 0.5))
#sorting in decreasing order of confidence
rules <- sort(rules, by ='confidence', decreasing = TRUE)
summary(rules)
## set of 6 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3
## 1 1 4
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.25 3.00 2.50 3.00 3.00
##
## summary of quality measures:
## support confidence lift count
## Min. :0.009255 Min. :0.5339 Min. :0.9302 Min. : 16444
## 1st Qu.:0.011355 1st Qu.:0.5504 1st Qu.:1.0138 1st Qu.: 20175
## Median :0.017654 Median :0.5754 Median :1.0755 Median : 31367
## Mean :0.148749 Mean :0.5822 Mean :1.0769 Mean : 264290
## 3rd Qu.:0.193256 3rd Qu.:0.6111 3rd Qu.:1.1422 3rd Qu.: 343368
## Max. :0.586885 Max. :0.6432 Max. :1.2227 Max. :1042750
##
## mining info:
## data ntransactions support confidence
## groceries 1776754 0.005 0.5
We obtain 6 rules, of which 4 have a length of 3 items.
Let's print all the rules -
#capture the rules as a data frame (inspect() returns the printed table invisibly in recent arules versions) and round the numeric columns
inspect_rules <- inspect(rules)
inspect_rules[,c(-1,-2,-3)] <- round(inspect_rules[,c(-1,-2,-3)],3)
datatable(inspect_rules)
Thus we observe that -
More than 60% of the time, customers who bought Pasta also bought Pasta Sauce. However, we do not observe any such pattern for Pancake mixes and Syrups.
More than 50% of the time, people who bought Pasta and Pasta Sauce also bought Syrups. This is a little different from what I expected. However, I did some research and, believe it or not, the combination became a popular dish after being featured in the movie Elf.
They said- “The combination might be unorthodox, but the contrast between sweet and savory couldn’t be more classic”
(p.s. this is not based on any analysis :P). You can read more about it here
Please note: since the dataset is huge, I have filtered the data for geography = 1 in this analysis.
Part B: Checking the impact of display_desc on the sales of commodities
I would like to see, using ANOVA, whether average sales differ across the different display types.
Null hypothesis Ho : sales are independent of the display of the product
Alternative hypothesis Ha : sales depend on the display of the product
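The trans_cut dataset used below is not defined in the text above; a sketch of how such a dataset might be reconstructed, under my assumption that it holds sales aggregated at the store-week-display level, is -
#assumed reconstruction of trans_cut: sales per store, week, geography and display
trans_cut <- transactions_crtd %>%
  mutate_all(as.vector) %>%
  inner_join(causal_lookup %>% mutate_all(as.vector),
             by = c("upc", "store", "week", "geography")) %>%
  group_by(store, week, geography, display_desc) %>%
  summarize(tot_sales = sum(dollar_sales), .groups = "drop")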
#filtered the data for geography 1 for the analysis
#using the previously created dataset
trans_geo <- trans_cut %>%
filter(geography == 1 )
#applying ANOVA
anova <- aov(tot_sales ~ display_desc, data = trans_geo)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## display_desc 10 31005 3100.5 462.7 <2e-16 ***
## Residuals 93638 627400 6.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is < 0.05 (the significance level), we reject the null hypothesis. The ANOVA results tell us that sales depend on the display of the product and that the results are significant overall, but they don't tell us where those differences lie.
I use the TukeyHSD (Honestly Significant Difference) test to do pair-wise comparisons and identify the pairs that differ. It tells us which specific groups' means are different.
When p adj is less than 0.05, we reject Ho and can say that there is a significant difference in sales for that particular pair of displays.
tukey_output <- TukeyHSD(anova, ordered = TRUE)
#keeping only the display_desc comparisons as a data frame
tukey_output <- tukey_output$display_desc %>%
  as.data.frame()
tukey_output <- round(tukey_output,5)
#keeping only the pairs with a significant difference (p adj < 0.05)
tukey_output_cut <- tukey_output[tukey_output$`p adj` < 0.05,]
datatable(tukey_output_cut, filter = 'top',
          options = list(pageLength = 12, autoWidth = TRUE ))
Therefore, we can clearly note that average sales are higher for products on the Promo/Seasonal Aisle than for those Not on Display. A significant difference exists in 36 such pairs. This changes our initial observation considerably.
Please note: since the dataset is huge, I have filtered the data for geography = 1 in this analysis.
Problem Statement
This analysis is intended to identify customers' shopping patterns and preferences, which would help retailers generate more revenue and enhance the customer experience.
Methodology
To gain a clear understanding of customers' buying patterns, I used the Apriori algorithm to perform Market Basket Analysis. This technique uncovers associations between items by looking for combinations of items that occur together in transactions, providing information to understand purchase behavior.
To identify whether sales depend on the in-store display of the products, I performed an ANOVA test. ANOVA tells us whether the results are significant; to identify where exactly the differences lie, I used the TukeyHSD test.
Insights
We observed that in more than 60% of the transactions containing Pasta, Pasta Sauce was also bought. This can help redesign the store layout by placing the two together and developing cross-promotional programs.
Since customers already buy these products together, putting both items on promotion at the same time might not create a significant increase in revenue, while a promotion involving just one of the items would likely drive sales of the other.
No identifiable pattern was found between the sales of Pancake Mixes and Syrups. But again, only 9% of the transactions involved Pancake Mixes.
The store layout does affect sales. The overall numbers say that products that were Not on Display generated more sales, but after a more thorough analysis we observed an entirely different story: products placed on the Store Front or Promo/Seasonal Aisle generated more sales on average than the ones that were Not on Display.
Implications
The analysis can be used to gain an understanding of the customer's journey while shopping. Clients can use the rules generated by the algorithm for numerous marketing strategies -
Changing the store layout
Customer behavior analysis
Catalogue(feature in mailer) design
Cross marketing
Customized emails with add-on samples
Providing recommendations while shopping online
Limitations
This analysis was limited by the number of items in the data. In the future, I would like to include more items and observe patterns in customers' preferences. I would also like to look at the top-performing stores, identify the reasons for their performance, and build strategies for the stores that are comparatively underperforming.