I am looking at two years of carbo-loading household-level transactions across four categories - Pasta, Pasta Sauce, Syrup, and Pancake Mix.
Given the richness of the data, I want to know:
Are complementary products bought together?
Do product sales depend on the display?
I will look at household-week level transactions for the pasta and pasta sauce categories and see if I can identify a pattern in the items bought across weeks.
I will slice the data at the product, week, store, display_desc, and sales level, then look at the trend for each product to see whether sales vary with the location of the temporary in-store display.
This analysis will help retailers design marketing campaigns for complementary products.
The display analysis will help us understand how customers make decisions and how the in-store display influences the choices they make.
The following packages are required to run the code without errors.
library(tidyverse) #to visualize, transform, input, tidy and join data
library(haven) #to input data from SAS
library(dplyr) #data wrangling
library(stringr) #string related functions
library(kableExtra) #to create HTML Table
library(DT) #to preview the data sets
library(lubridate) #to apply the date functions
library(arules) #to represent, manipulate and analyze transactional data
library(extrafont) #to use specific font in plots
Note: You will also need to install the plyr package, but do not load it, as some of its functions mask functions in dplyr.
The carbo-loading household-level data comes from the loyalty card program of a leading US grocer. It contains four datasets - transactions, product_lookup, causal_lookup, and store_lookup - and was obtained from here. Please see the code book at the same location.
The transactions table:

| Variable | Description |
|---|---|
| upc | Standard 10 digit UPC. |
| dollar_sales | Amount of dollars spent by the consumer |
| units | Number of products purchased by the consumer |
| time_of_transaction | Time of the transaction (military time) |
| geography | Distinguishes between two large geographical regions, possible values are 1 or 2 |
| week | Week of the transaction, values are from 1 to 104 |
| household | Unique households |
| store | Unique stores |
| basket | Unique baskets/trips to store |
| day | Day of the transaction, possible values are from 1 to 728 |
| coupon | Indicates coupon usage, 1 if used, 0 otherwise |
The product_lookup table:

| Variable | Description |
|---|---|
| upc | Standard 10 digit UPC. |
| product_description | Description of product |
| commodity | Specifies which of the four product categories under consideration the item belongs to |
| brand | Specific brand of item |
| product_size | Specifies package size of product |
The causal_lookup table:

| Variable | Description |
|---|---|
| upc | Standard 10 digit UPC. |
| store | Identifies unique store |
| week | Week of transaction, possible values are 1 through 104 |
| feature_desc | Describes product location on weekly mailer |
| display_desc | Describes temporary in-store display |
| geography | Distinguishes between two large geographical regions, possible values are 1 or 2 |
The store_lookup table:

| Variable | Description |
|---|---|
| store | Identifies unique stores |
| store_zip_code | 5 digit zip code |
I load the four datasets-
files <- c("causal_lookup", "product_lookup","store_lookup","transactions")
names <- c("causal_lookup","prd_lookup","store_lookup","transactions")
for (i in seq_along(files)) {
df <- read_sas(paste0("data/",files[i],".sas7bdat"))
assign(names[i], df)
}
First, I check the length of the UPCs in the product lookup table -
unique(nchar(x = prd_lookup$upc))
## [1] 9 10
Some UPCs have only 9 characters, so I left-pad them with zeros to the standard 10 digits before joining with the other tables.
#left-pad the UPCs to 10 characters
prd_lookup_crtd <- prd_lookup
prd_lookup_crtd$upc <- str_pad(prd_lookup$upc, 10, side = "left", pad = '0')
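As a quick sanity check (not part of the original output), the padded UPCs should now all be 10 characters long -
#verify the padded UPC lengths
unique(nchar(prd_lookup_crtd$upc))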
Next, I look at the distribution of dollar_sales to check for outliers -
quantile(transactions$dollar_sales, probs = seq(0, 1, 0.01))
## 0% 1% 2% 3% 4% 5% 6% 7% 8% 9%
## -11.76 0.44 0.50 0.50 0.52 0.55 0.59 0.67 0.69 0.75
## 10% 11% 12% 13% 14% 15% 16% 17% 18% 19%
## 0.79 0.79 0.79 0.79 0.80 0.85 0.88 0.89 0.89 0.90
## 20% 21% 22% 23% 24% 25% 26% 27% 28% 29%
## 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
## 30% 31% 32% 33% 34% 35% 36% 37% 38% 39%
## 1.00 1.00 1.03 1.09 1.09 1.18 1.19 1.19 1.19 1.25
## 40% 41% 42% 43% 44% 45% 46% 47% 48% 49%
## 1.25 1.29 1.30 1.37 1.39 1.45 1.49 1.50 1.50 1.50
## 50% 51% 52% 53% 54% 55% 56% 57% 58% 59%
## 1.50 1.53 1.58 1.58 1.58 1.59 1.59 1.60 1.65 1.67
## 60% 61% 62% 63% 64% 65% 66% 67% 68% 69%
## 1.69 1.78 1.79 1.85 1.90 1.98 1.98 1.99 1.99 1.99
## 70% 71% 72% 73% 74% 75% 76% 77% 78% 79%
## 1.99 1.99 2.00 2.00 2.18 2.19 2.25 2.29 2.38 2.45
## 80% 81% 82% 83% 84% 85% 86% 87% 88% 89%
## 2.49 2.50 2.55 2.69 2.69 2.79 2.89 2.99 2.99 2.99
## 90% 91% 92% 93% 94% 95% 96% 97% 98% 99%
## 3.00 3.18 3.19 3.34 3.49 3.59 3.98 4.00 4.77 5.70
## 100%
## 153.14
A few data points look inconsistent with the rest (for example, negative dollar_sales at the low end and a maximum of 153.14). To check whether the data is genuinely skewed, I ran the quantile function on units as well.
quantile(transactions$units, probs = seq(0, 1, 0.01))
## 0% 1% 2% 3% 4% 5% 6% 7% 8% 9% 10% 11% 12% 13% 14%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 15% 16% 17% 18% 19% 20% 21% 22% 23% 24% 25% 26% 27% 28% 29%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 30% 31% 32% 33% 34% 35% 36% 37% 38% 39% 40% 41% 42% 43% 44%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 45% 46% 47% 48% 49% 50% 51% 52% 53% 54% 55% 56% 57% 58% 59%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 60% 61% 62% 63% 64% 65% 66% 67% 68% 69% 70% 71% 72% 73% 74%
## 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
## 75% 76% 77% 78% 79% 80% 81% 82% 83% 84% 85% 86% 87% 88% 89%
## 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2
## 90% 91% 92% 93% 94% 95% 96% 97% 98% 99% 100%
## 2 2 2 2 2 2 2 2 3 3 156
I observe the same skew in units.
Finally, I looked at the transactions with more than 100 units -
transactions %>%
mutate_all(as.vector) %>%
filter( units > 100) %>%
left_join(prd_lookup_crtd, by = "upc")
## # A tibble: 3 x 15
## upc dollar_sales units time_of_transac~ geography week household store
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 9999~ 118. 120 1400 1 65 352180 102
## 2 2700~ 153. 156 1541 2 73 88418 270
## 3 9999~ 132. 133 1500 1 81 398723 9
## # ... with 7 more variables: basket <dbl>, day <dbl>, coupon <dbl>,
## # product_description <chr>, commodity <chr>, brand <chr>,
## # product_size <chr>
There are exactly three entries in which a household bought more than 100 units of pasta sauce at one time (mapped to the prd_lookup_crtd table to get the commodity name), which is not expected. There might be a data-entry issue here, so I plan to drop these rows from further analysis.
#identifying the data index
which(transactions$units > 100)
## [1] 3127414 3522595 4020790
#cross-validating data at the index
transactions[c(3127414,3522595,4020790),]
## # A tibble: 3 x 11
## upc dollar_sales units time_of_transac~ geography week household store
## <chr> <dbl> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
## 1 9999~ 118. 120 1400 1 65 352180 102
## 2 2700~ 153. 156 1541 2 73 88418 270
## 3 9999~ 132. 133 1500 1 81 398723 9
## # ... with 3 more variables: basket <dbl>, day <dbl>, coupon <dbl>
#deleting these rows and creating another table so as not to modify the original
transactions_crtd <- transactions[-c(3127414,3522595,4020790),]
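A quick check (again, not part of the original output) that the high-unit rows are gone -
#confirm no transactions with more than 100 units remain
any(transactions_crtd$units > 100)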
The identified issues are now fixed and the tables are ready for further analysis.
Please note: instead of creating a master dataset with more than 15 columns, I slice the data and apply joins at the required levels.
A brief overview of the transactions data, which will be used for most of the analysis:
The number of unique UPCs that we will primarily use to join our tables: 927
All these UPCs are present in the product_lookup table, so we can fetch the product and brand information -
transactions_crtd %>%
mutate_all(as.vector) %>%
anti_join(prd_lookup_crtd , by = "upc")
## # A tibble: 0 x 11
## # ... with 11 variables: upc <chr>, dollar_sales <dbl>, units <dbl>,
## # time_of_transaction <chr>, geography <dbl>, week <dbl>,
## # household <dbl>, store <dbl>, basket <dbl>, day <dbl>, coupon <dbl>
The different features available in the data and of interest: Wrap Interior Feature, Wrap Back Feature, Interior Page Feature, Not on Feature, Interior Page Line Item, Wrap Front Feature, Front Page Feature, Back Page Feature
The different temporary in-store display options available: Not on Display, Rear End Cap, Store Rear, Front End Cap, In-Shelf, Store Front, Secondary Location Display, In-Aisle, Promo/Seasonal Aisle, Side-Aisle End Cap, Mid-Aisle End Cap
The total sales and units by display_desc -
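The code chunk behind this summary isn't shown; a minimal sketch of how it could be produced, assuming display_desc is taken from the causal_lookup table joined on upc, store, week, and geography (the feature_desc table below would be produced analogously), is -
#join transactions with the display information and aggregate
transactions_crtd %>%
  mutate_all(as.vector) %>%
  inner_join(causal_lookup %>% mutate_all(as.vector),
             by = c("upc", "store", "week", "geography")) %>%
  group_by(display_desc) %>%
  summarize(total_sales = sum(dollar_sales), total_units = sum(units)) %>%
  arrange(desc(total_sales))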
## # A tibble: 11 x 3
## display_desc total_sales total_units
## <chr> <dbl> <dbl>
## 1 Not on Display 324395. 281389
## 2 Rear End Cap 79471. 68814
## 3 In-Shelf 63139. 51951
## 4 Front End Cap 31889. 28086
## 5 Secondary Location Display 23581. 16422
## 6 In-Aisle 21124. 16219
## 7 Promo/Seasonal Aisle 13619. 12359
## 8 Mid-Aisle End Cap 13288. 11084
## 9 Store Rear 10016. 8349
## 10 Store Front 6147. 5256
## 11 Side-Aisle End Cap 1835. 1250
I observe that products that are not on display account for more sales overall.
The total sales and units by feature_desc -
## # A tibble: 8 x 3
## feature_desc total_sales total_units
## <chr> <dbl> <dbl>
## 1 Interior Page Feature 240347. 196841
## 2 Not on Feature 172546. 136393
## 3 Front Page Feature 105429. 106489
## 4 Wrap Interior Feature 22507. 19338
## 5 Back Page Feature 17707. 17476
## 6 Wrap Front Feature 11098. 11210
## 7 Wrap Back Feature 9492. 7467
## 8 Interior Page Line Item 9378. 5965
I observe that products featured on the interior page account for more sales overall.
In this section, I started with initial data observations, tried to identify a buying pattern for complementary products, and looked at how sales differ based on the temporary in-store display of the products.
Which brands are responsible for the most sales within each commodity? (Top 5)
#which brand is performing better
transactions_crtd %>%
mutate_all(as.vector) %>%
inner_join(prd_lookup_crtd, by = c("upc") ) %>%
select(upc, week,store, geography, commodity, brand, dollar_sales,units) %>%
group_by( commodity,brand) %>%
summarize( total_sales = sum(dollar_sales), total_units = sum(units)) %>%
arrange(desc(total_sales),desc(total_units)) %>%
top_n(5, wt = total_sales) %>%
ggplot( aes(x = reorder(brand,-total_sales), total_sales)) +
geom_bar(stat = "identity", fill = "indianred") +
facet_wrap(~commodity, scales = "free") +
scale_x_discrete("Brand") +
scale_y_continuous("Total Sales",labels = scales::comma) +
labs(title = "Total sales by Brand and Commodity",
subtitle = "Out of 131 brands, we observe top five brands from each commodity; \nPrivate Label is one of the most preferred brands among all the commodities") +
theme_minimal() +
theme(panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
legend.title = element_blank(),
legend.justification = c(0, 1),
legend.position = c(.1, 1.075),
legend.background = element_blank(),
legend.direction="horizontal",
text = element_text(family = "Georgia", size = 7),
plot.title = element_text(size = 20, margin = margin(b = 10)),
plot.subtitle = element_text(size = 10, color = "darkslategrey", margin = margin(b = 25)),
plot.caption = element_text(size = 8, margin = margin(t = 10), color = "grey70", hjust = 0))
What is the preferred shopping time?
#extracting the hour information
library(lubridate)
#format = "%H" parses the leading hour digits of the military time (e.g. "1400" -> 14)
transactions_crtd$time_of_transaction <- as.POSIXct(transactions_crtd$time_of_transaction,format="%H")
transactions_crtd$time_of_transaction <- hour(transactions_crtd$time_of_transaction)
transactions_crtd %>%
ggplot(aes(x = time_of_transaction)) +
geom_bar(fill = "indianred") +
theme(rect = element_blank()) +
scale_x_continuous("Time of Day" ) +
scale_y_continuous("Count", labels = scales::comma) +
labs(title = "Most Preferred Shopping Time",
subtitle = "people start shopping around 10 in the morning; \n15:00 to 18:00 are the rush hours") +
theme_minimal() +
theme(panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
legend.title = element_blank(),
legend.justification = c(0, 1),
legend.position = c(.1, 1.075),
legend.background = element_blank(),
legend.direction="horizontal",
text = element_text(family = "Georgia", size = 7),
plot.title = element_text(size = 20, margin = margin(b = 10)),
plot.subtitle = element_text(size = 10, color = "darkslategrey", margin = margin(b = 25)),
plot.caption = element_text(size = 8, margin = margin(t = 10), color = "grey70", hjust = 0))
Which items are sold more?
transactions_crtd %>%
mutate_all(as.vector) %>%
filter(dollar_sales > 0) %>%
inner_join(prd_lookup_crtd, by = c("upc") ) %>%
select(upc, week,store, geography,
commodity,basket, brand, dollar_sales,units) %>%
group_by(commodity) %>%
summarize(count = sum(units)) %>%
arrange(desc(count)) %>%
ggplot(aes(x=reorder(commodity,count), y=count))+
geom_bar(stat="identity",fill="indianred")+
scale_x_discrete("Commodity") +
scale_y_continuous("Total units bought",labels = scales::comma) +
labs(title = "Total Units Bought by Commodity",
subtitle = "Pasta is the most preferred product; \nPancake mixes are the least preferred") +
theme_minimal() +
theme(rect = element_blank()) +
theme(panel.grid.minor = element_blank(),
panel.grid.major.x = element_blank(),
legend.title = element_blank(),
legend.justification = c(0, 1),
legend.position = c(.1, 1.075),
legend.background = element_blank(),
legend.direction="horizontal",
text = element_text(family = "Georgia", size = 7),
plot.title = element_text(size = 20, margin = margin(b = 10)),
plot.subtitle = element_text(size = 10, color = "darkslategrey", margin = margin(b = 25)),
plot.caption = element_text(size = 8, margin = margin(t = 10), color = "grey70", hjust = 0))
Since Pasta and Pasta Sauce generate the most sales, I would like to know -
A. whether a relationship exists between the sales of complementary commodities (Pasta and Pasta Sauce)
B. whether sales depend on the display of the product
Part A: Checking whether a relationship exists between the sales of complementary commodities (Pasta and Pasta Sauce)
I am using market basket analysis, which applies Association Rule Mining to look for combinations of items that frequently occur together in transactions. It tells us which items customers buy together by generating a set of rules. Two statistical measures can be used to determine whether a rule is deemed "interesting":
Support: the fraction of transactions that contain the item set. If Support = 0.2, A and B appear together in 20% of the transactions.
Confidence: the fraction of transactions containing A that also contain B. If Confidence = 0.6, 60% of customers who purchase A also purchase B.
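In formula form (standard definitions, not taken from the original write-up), for a rule A => B over N transactions:

$$\text{Support}(A \Rightarrow B) = \frac{\text{count}(A \cup B)}{N}, \qquad \text{Confidence}(A \Rightarrow B) = \frac{\text{Support}(A \cup B)}{\text{Support}(A)}$$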
#filtered the data for geography 1 for the analysis
market <- transactions_crtd %>%
left_join(prd_lookup_crtd, by="upc") %>%
filter(geography == 1 ) %>%
arrange(desc(commodity)) %>%
select(basket,commodity) %>%
distinct()
#using the plyr package for its ddply function (called with an explicit namespace so plyr is never loaded)
#The function ddply() accepts a data frame, splits it into pieces based on one or more factors, computes on the pieces, and then returns the results as a data frame
transactionData <- plyr::ddply(market,c("basket"),
function(market)paste(market$commodity,
collapse = ","))
#removing all columns except for the items
transactionData$basket = NULL
colnames(transactionData) <- c("items")
#writing the transactional data in another file
write.csv(x = transactionData$items,
file = "edited_transaction_required.csv",
quote = FALSE, row.names = FALSE)
groceries <- read.transactions('edited_transaction_required.csv',
format = 'basket', sep=',', skip = 1)
summary(groceries)
## transactions as itemMatrix in sparse format with
## 1776754 rows (elements/itemsets/transactions) and
## 4 columns (items) and a density of 0.3359036
##
## most frequent items:
## pasta pasta sauce syrups pancake mixes (Other)
## 1042750 819467 364649 160406 0
##
## element (itemset/transaction) length distribution:
## sizes
## 1 2 3 4
## 1222469 506297 39743 8245
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 1.000 1.000 1.344 2.000 4.000
##
## includes extended item information - examples:
## labels
## 1 pancake mixes
## 2 pasta
## 3 pasta sauce
So, we have 1,776,754 transactions (baskets) covering 4 distinct items. The first block of the output shows how many transactions involved each item. The second block contains summary statistics about the size of each transaction: for example, there were 506,297 transactions in which exactly 2 items were bought and 8,245 transactions in which all 4 items were bought.
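As a side note on the design choice above, the CSV round trip can be avoided by coercing the basket-level item lists directly into an arules transactions object; a minimal sketch (the object name groceries_alt is mine) -
#split commodities by basket and coerce the resulting list into a transactions object
trans_list <- split(market$commodity, market$basket)
groceries_alt <- as(trans_list, "transactions")
summary(groceries_alt)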
Below is a view of the transactions -
inspect(groceries[1:10])
## items
## [1] {pancake mixes}
## [2] {pasta}
## [3] {pancake mixes,syrups}
## [4] {pancake mixes}
## [5] {pasta}
## [6] {pasta,pasta sauce}
## [7] {pasta}
## [8] {pasta}
## [9] {pancake mixes,syrups}
## [10] {pasta}
Let’s look at the frequency of items that are purchased-
itemFrequency(groceries[, 1:4])
## pancake mixes pasta pasta sauce syrups
## 0.09028037 0.58688485 0.46121579 0.20523325
We observe that only about 9% of transactions involve pancake mixes, whereas pasta was bought in about 59% of the transactions. Below is the plot for the same -
itemFrequencyPlot(groceries, topN = 4)
We create rules using the Apriori algorithm, passing support = 0.005 and confidence = 0.5 to return all rules that have a support of at least 0.5% and a confidence of at least 50%.
#generating the rules with the thresholds described above
rules <- apriori(groceries, parameter = list(supp = 0.005, conf = 0.5))
#sorting in decreasing order of confidence
rules <- sort(rules, by ='confidence', decreasing = TRUE)
summary(rules)
## set of 6 rules
##
## rule length distribution (lhs + rhs):sizes
## 1 2 3
## 1 1 4
##
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 2.25 3.00 2.50 3.00 3.00
##
## summary of quality measures:
## support confidence lift count
## Min. :0.009255 Min. :0.5339 Min. :0.9302 Min. : 16444
## 1st Qu.:0.011355 1st Qu.:0.5504 1st Qu.:1.0138 1st Qu.: 20175
## Median :0.017654 Median :0.5754 Median :1.0755 Median : 31367
## Mean :0.148749 Mean :0.5822 Mean :1.0769 Mean : 264290
## 3rd Qu.:0.193256 3rd Qu.:0.6111 3rd Qu.:1.1422 3rd Qu.: 343368
## Max. :0.586885 Max. :0.6432 Max. :1.2227 Max. :1042750
##
## mining info:
## data ntransactions support confidence
## groceries 1776754 0.005 0.5
We obtain 6 rules, of which 4 have a length of 3 items.
Let's print all the rules -
#capture the rules as a data frame (inspect() returns the printed table invisibly in recent arules versions) and round the numeric columns
inspect_rules <- inspect(rules)
inspect_rules[,c(-1,-2,-3)] <- round(inspect_rules[,c(-1,-2,-3)],3)
datatable(inspect_rules)
Thus we observe that -
More than 60% of the time, customers who bought Pasta also bought Pasta Sauce. However, we do not observe any such pattern for Pancake mixes and Syrups.
More than 50% of the time, people who bought Pasta and Pasta Sauce also bought Syrups. This is a little different from what I expected. However, I did some research and, believe it or not, the combination became a popular dish after being featured in the movie Elf.
They said- “The combination might be unorthodox, but the contrast between sweet and savory couldn’t be more classic”
(p.s. this is not based on any analysis :P). You can read more about it here
Please note: since the dataset is huge, I have filtered the data for geography = 1 in this analysis.
Part B: Checking the impact of display_desc on the sales of commodities
I would like to see, using ANOVA, whether average sales differ across the different display types.
Null hypothesis Ho : sales are independent of the display of the product
Alternative hypothesis Ha : sales depend on the display of the product
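The trans_cut dataset used below is not defined in the text above; a sketch of how such a dataset might be reconstructed, under my assumption that it holds sales aggregated at the store-week-display level, is -
#assumed reconstruction of trans_cut: sales per store, week, geography and display
trans_cut <- transactions_crtd %>%
  mutate_all(as.vector) %>%
  inner_join(causal_lookup %>% mutate_all(as.vector),
             by = c("upc", "store", "week", "geography")) %>%
  group_by(store, week, geography, display_desc) %>%
  summarize(tot_sales = sum(dollar_sales), .groups = "drop")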
#filtered the data for geography 1 for the analysis
#using the previously created dataset
trans_geo <- trans_cut %>%
filter(geography == 1 )
#applying ANOVA
anova <- aov(tot_sales ~ display_desc, data = trans_geo)
summary(anova)
## Df Sum Sq Mean Sq F value Pr(>F)
## display_desc 10 31005 3100.5 462.7 <2e-16 ***
## Residuals 93638 627400 6.7
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is < 0.05 (the significance level), we reject the null hypothesis. The ANOVA results tell us that sales depend on the display of the product and that the results are significant overall, but they don't tell us where those differences lie.
I use the TukeyHSD (Honestly Significant Difference) test to do pair-wise comparisons and identify the pairs that differ. It tells us which specific groups' means are different.
When p adj is less than 0.05, we reject Ho and can say that there is a significant difference in sales for that particular pair of displays.
tukey_output <- TukeyHSD(anova, ordered = TRUE)
#keeping only the display_desc comparisons as a data frame
tukey_output <- tukey_output$display_desc %>%
  as.data.frame()
tukey_output <- round(tukey_output,5)
#keeping only the pairs with a significant difference (p adj < 0.05)
tukey_output_cut <- tukey_output[tukey_output$`p adj` < 0.05,]
datatable(tukey_output_cut, filter = 'top',
          options = list(pageLength = 12, autoWidth = TRUE ))
Therefore, we can clearly note that average sales are higher for products on the Promo/Seasonal Aisle than for those Not on Display. A significant difference exists in 36 such pairs. This changes our initial observation considerably.
Please note: since the dataset is huge, I have filtered the data for geography = 1 in this analysis.
Problem Statement
This analysis is intended to identify customers' shopping patterns and preferences, which would help retailers generate more revenue and enhance the customer experience.
Methodology
To gain a clear understanding of customers' buying patterns, I used the Apriori algorithm to perform Market Basket Analysis. This technique uncovers associations between items by looking for combinations of items that occur together in transactions, providing information to understand purchase behavior.
To identify whether sales depend on the in-store display of the products, I performed an ANOVA test. ANOVA tells us whether the results are significant; to identify where exactly the differences lie, I used the TukeyHSD test.
Insights
We observed that in more than 60% of the transactions containing Pasta, Pasta Sauce was also bought. This can help redesign the store layout by placing the two together and developing cross-promotional programs.
Since customers already buy these products together, putting both items on promotion at the same time might not create a significant increase in revenue, while a promotion involving just one of the items would likely drive sales of the other.
No identifiable pattern was found between the sales of Pancake Mixes and Syrups. But again, only 9% of the transactions involved Pancake Mixes.
The store layout does affect sales. The overall numbers say that products that were Not on Display generated more sales, but after a more thorough analysis we observed an entirely different story: products placed on the Store Front or Promo/Seasonal Aisle generated more sales on average than the ones that were Not on Display.
Implications
The analysis can be used to gain an understanding of the customer's journey while shopping. Clients can use the rules generated by the algorithm for numerous marketing strategies -
Changing the store layout
Customer behavior analysis
Catalogue(feature in mailer) design
Cross marketing
Customized emails with add-on samples
Providing recommendations while shopping online
Limitations
This analysis was limited by the number of items in the data. In the future, I would like to include more items and observe patterns in customers' preferences. I would also like to look at the top-performing stores, identify the reasons for their performance, and build strategies for the stores that are comparatively underperforming.