1 Introduction

This report analyzes the trends of customers’ purchase for select items and explore methods to improve sales. We study the interactions between four commodities, namely Pasta, Pasta Sauce, Syrup and Pancake-Mix. The available data consists of purchase information spanning two years.

The data contains purchase information mapped to household and brand data for a duration of two years. Our goal is to find a pattern in consumer purchase and explore ways to increase sales.

The analysis is geared toward using coupons to improve sales. Coupons have become an essential tool in retail. They are often used for promotional activities, for building customer loyalty and driving sale of a product by bucketing them with existing popular items.

We assess the impact of coupons on the spending habits of the consumers in terms of the number of units purchased and the overall amount spent by the consumer. In addition to this, we also want to explore the relationship between the sale of complementary and substitute product categories.

Our focus is on the following KPIs:

Units purchased
Coupon usage
Customer Spending per Transaction/Household

We analyze these KPIs at commodity, brand and household level.

We apply generalized statistical methods, which can be easily transferred to similar problems in other commercial segments. In addition, the gained insights are readily applicable to similar product combinations. This approach can also be used to understand the effectiveness of any similar promotional activity.

Based on the data provided, we plan to develop a multiple linear regression model which will facilitate the understanding of the degree of variability in the customer spending as a function of coupon usage.

This analysis will help retailers

Assess the impact of coupons in driving sales and increasing market penetration
To segregate market segments and tailor promotional schemes specific to each market segment
Increase sales and greater customer satisfaction
Additionally, help drive sales of a particular brand e.g. store-owned brands.

2 Package requirements

Following R packages were used in this analysis:

haven: Import/export SAS data files
dplyr: For data manipulation e.g. filter, mutate
base: Base R functions
stringr: Simplify string operations
ggplot2: Plot variables of interest

3 Data Preparation

Data Source

The data set for this project “Carbo-Loading SAS” can be downloaded at this location.

3.1 Data Dictionary

The data used for this analysis is gathered over a period of 2 years and has been fully anonymized. It contains household level transactions over a period of two years from four categories: Pasta, Pasta Sauce, Syrup and Pancake Mix. These categories were chosen so that interactions between the categories can be detected and studied. The data-set consists of four tables exported from SAS. The schema design of the source tables involved is as follows:

Following are the details of the Source tables(Final Table)

3.1.1 dh_causal_lookup(CAUSAL_LOOKUP) table

Column Name	Data type	Definition
upc	numeric	standard 10-digit upc. Uniquely identifies a commodity of a brand
store	numeric	Identifier for a store
week	numeric	Ordered identifier for a week, can vary from 43 to 104
feature_desc	varchar	Product location on weekly mailer
display_desc	varchar	Product location in temporary in-store display
geography	numeric	Identifier for a geography

3.1.2 dh_transactions(TRANSACTIONS) table

Column Name	Data type	Definition
upc	numeric	standard 10-digit upc. Uniquely identifies a commodity of a brand
dollar_sales	numeric	amount of dollars spent by customer
units	numeric	number of units purchased
time_of_transaction	numeric	military time of transaction
geography	numeric	Identifier for a geography
week	numeric	Ordered identifier for a week, can vary from 1 to 104
household	numeric	identifies a unique household
store	numeric	Identifier for a store
basket	numeric	identifies a unique trip to the store
day	numeric	Day of transaction from 1 to 728
coupon	numeric	indicates a coupon was used, 1 = Yes, 0 = No

3.1.3 dh_store_lookup(STORE_LOOKUP) table

Column Name	Data type	Definition
store	numeric	Unique identifier of store
store_zip_code	numeric	Zip code of the store

3.1.4 dh_product_lookup(PRODUCT_LOOKUP) table

Column Name	Data type	Definition
upc	numeric	standard 10-digit upc. Uniquely identifies a commodity of a brand
product_description	varchar	description of the product
commodity	varchar	what type of product that is
brand	varchar	to which brand does the product belong
product_size	varchar	what are the dimensions of the product like weight

Based on this, following entities were defined for analysis:

Entity	Description
Brand	The Brand of the Product sold
Commodity	The type of Product sold, like Pasta Sauce, Pasta, etc.
Geography	Geographical group of stores
Store	Point of sale for each product
Time	The time when the transaction occurred
Day	Chronological Day of sales for a 2-year period
Week	Chronological Week of sales for a 2-year period
Household	Represents a unique customer
Basket	Represents the trip the customer made
Coupon	Whether the coupon was used or not while purchase
Causal	Details of the promotional offer made for a given product

All KPIs involve the following two measures:

KPI	Description
Dollar Sales	Amount spent during a transaction
Unit	number of units purchased during a transaction

The Count aggregation of any entity can also be counted as a KPI

3.2 Data Import and Cleaning

All the .SAS files involved were imported into R using haven package. The files to be imported are kept at the home directory for the R project and hence does not require specifying full paths. Once the files are imported, it is imperative that we clean the data and perform different checks like referential integrity and data uniqueness. If these checks are not performed than we can’t ascertain the data quality which in turn can discredit our analyses.

file_name <- c("causal_lookup","product_lookup","store_lookup","transactions")
for(i in seq_along(file_name)) {
  full_path <- paste0("data/",file_name[i],".sas7bdat")
  if(file.exists(full_path)) {
    df <- read_sas(full_path,NULL)
    assign(file_name[i], df)
    remove(df)
  } else {
    print("No such file exists")
  }
}

While importing, not all columns are imported as per their expected data type and hence we check if the data types need recoding. This becomes important in cases like if the Sales field is encoded as character, we cannot perform numerical summary and aggregation on top of it. We verify the structure of all columns in the four tables below

Causal Lookup

str(causal_lookup)

## Classes 'tbl_df', 'tbl' and 'data.frame':    351372 obs. of  6 variables:
##  $ upc         : chr  "7680850108" "5100001212" "5100002792" "3620000300" ...
##   ..- attr(*, "format.sas")= chr "$"
##  $ store       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ week        : num  68 66 72 55 68 55 66 55 55 76 ...
##  $ feature_desc: chr  "Wrap Interior Feature" "Wrap Back Feature" "Interior Page Feature" "Wrap Interior Feature" ...
##   ..- attr(*, "format.sas")= chr "$"
##  $ display_desc: chr  "Not on Display" "Not on Display" "Not on Display" "Not on Display" ...
##   ..- attr(*, "format.sas")= chr "$"
##  $ geography   : num  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "label")= chr "DH_CAUSAL_LOOKUP"

Product Lookup

str(product_lookup)

## Classes 'tbl_df', 'tbl' and 'data.frame':    927 obs. of  5 variables:
##  $ upc                : chr  "111112360" "566300023" "566300028" "566300029" ...
##  $ product_description: chr  "VINCENT S ORIG MARINARA S" "PINE MOUNTAIN SYRUP" "MILLER CANE SYRUP" "MILLER CANE SYRUP" ...
##  $ commodity          : chr  "pasta sauce" "syrups" "syrups" "syrups" ...
##  $ brand              : chr  "Vincent's" "Pine Mountain" "Miller" "Miller" ...
##  $ product_size       : chr  "25 OZ" "40 OZ" "19 OZ" "12 OZ" ...
##  - attr(*, "label")= chr "DH_PRODUCT_LOOKUP"

Store Lookup

str(store_lookup)

## Classes 'tbl_df', 'tbl' and 'data.frame':    387 obs. of  2 variables:
##  $ store         : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ store_zip_code: chr  "37865" "30084" "30039" "31210" ...
##   ..- attr(*, "format.sas")= chr "$"
##  - attr(*, "label")= chr "DH_STORE_LOOKUP"

Transactions

str(transactions)

## Classes 'tbl_df', 'tbl' and 'data.frame':    5197681 obs. of  11 variables:
##  $ upc                : chr  "7680850106" "3620000470" "1800028064" "9999985067" ...
##   ..- attr(*, "format.sas")= chr "$"
##  $ dollar_sales       : num  0.8 3.59 2.25 0.85 2.19 2.19 3.45 1.29 0.75 2.19 ...
##   ..- attr(*, "label")= chr "SHOP_SPEND"
##  $ units              : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "QUANTITY"
##  $ time_of_transaction: chr  "1100" "1100" "1137" "1148" ...
##   ..- attr(*, "label")= chr "TRANS_TIME"
##   ..- attr(*, "format.sas")= chr "$"
##  $ geography          : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ week               : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ household          : num  125434 125434 108320 162016 89437 ...
##  $ store              : num  244 244 244 244 244 244 244 244 244 244 ...
##  $ basket             : num  1 1 2 3 4 4 5 5 6 7 ...
##  $ day                : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ coupon             : num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "label")= chr "DH_TRANSACTIONS"

Based on the observations made above, following code transforms the require data types for the columns because we notice that Zipcode should be numeric and UPC is a numeric identifier.

store_lookup$store_zip_code = as.numeric(store_lookup$store_zip_code)
product_lookup$upc = as.numeric(product_lookup$upc)
causal_lookup$upc = as.numeric(causal_lookup$upc)
transactions$upc = as.numeric(transactions$upc)

Also, as discussed in previous section, since our analysis is not at Product level, we are creating a subset of product_lookup coded as commodity_lookup which stores distinct combinations of Brands and Commodities

commodity_lookup <- product_lookup %>% distinct(upc, brand, commodity)

We are also checking if all Zipcodes provided are valid. In case we have invalid zipcodes, we need to remove all transactions pertaining to them from the datasets because any inference based on invalid zipcodes can’t translate into actionable insight. If that is a possiblity, we can also get the Data engineers to look into this discrepancy. The following code shows that we did not find such discrepancies in the data

#checking if all stroe zip codes are valid
us_postal_codes <- read.csv("data/us_postal_codes.csv")
setdiff(store_lookup$store_zip_code,us_postal_codes$Zip.Code)

## numeric(0)

Since Transactions is the most important table in our data-set, all fields of this table are validated to check if there are any orphaned keys, missing information or anomalous data. For example, we verify that all transactions recorded in this table do belong to the set of Commodities we have in the Commodity_Lookup. We also validate if the time of transaction is a valid military time or not. We also check if the variables Day, Week, Geography and Coupon have valid values as per the dataset description provided earlier

#checking if transaction has occurred for valid commodities
setdiff(transactions$upc, commodity_lookup$upc)

## numeric(0)

#checking if transaction has occurred at valid store
setdiff(transactions$store, store_lookup$store)

## numeric(0)

#Transaction Time validation
distinct(as.data.frame(nchar(transactions$time_of_transaction)))

##   nchar(transactions$time_of_transaction)
## 1                                       4

#checking minutes
time.check1 <- as.numeric(substr(transactions$time_of_transaction, 3, 4))
time.check11 <- -1 < time.check1 & 60 > time.check1
sum(time.check11) == nrow(transactions)

## [1] TRUE

#checking hours
time.check2 <- as.numeric(substr(transactions$time_of_transaction, 1, 2))
time.check22 <- -1 < time.check2 & 24 > time.check2
sum(time.check22) == nrow(transactions)

## [1] TRUE

#checking geography
distinct(as.data.frame(transactions$geography))

##   transactions$geography
## 1                      2
## 2                      1

#checking week
week.check <- 0 < transactions$week & 105 > transactions$week
sum(week.check) == nrow(transactions)

## [1] TRUE

#checking day
day.check <- 0 < transactions$day & 729 > transactions$day
sum(day.check) == nrow(transactions)

## [1] TRUE

#checking coupon
distinct(as.data.frame(transactions$coupon))

##   transactions$coupon
## 1                   0
## 2                   1

Similarly, we verify the relate table causal_lookup which stores promotional campaign information for various commodities sold. We validate if there are no orphaned keys for UPC, Store, Geography and Week.

#causal  for valid commodities
setdiff(causal_lookup$upc, commodity_lookup$upc)

#causal at valid store
setdiff(causal_lookup$store, store_lookup$store)

#checking week
week.check <- -1 < causal_lookup$week & 43 > causal_lookup$week
sum(week.check) == 0

#causal  for valid geographies
distinct(as.data.frame(causal_lookup$geography))

According to the definition of UPC, it should be of 10 digits. However, for 1.12% transactions, the length of UPC is 9 digits. Nonetheless, since this does not violate the referential integrity of this variable because we do have 9 digits’ UPCs in causal_lookup and commodity_lookup, we choose to not remove these rows. This is validated by the following code:

distinct(as.data.frame(nchar(causal_lookup$upc)))

##   nchar(causal_lookup$upc)
## 1                       10
## 2                        9

distinct(as.data.frame(nchar(commodity_lookup$upc)))

##   nchar(commodity_lookup$upc)
## 1                           9
## 2                          10

distinct(as.data.frame(nchar(transactions$upc)))

##   nchar(transactions$upc)
## 1                      10
## 2                       9

3.3 Data review

Steps in subsection Data Import and Cleaning clean the data and make it ready for consumption. Below is a brief of how our concerned datasets looks like after we cleaned them up:

Causal Lookup

str(causal_lookup)

## Classes 'tbl_df', 'tbl' and 'data.frame':    351372 obs. of  6 variables:
##  $ upc         : num  7.68e+09 5.10e+09 5.10e+09 3.62e+09 4.11e+09 ...
##  $ store       : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ week        : num  68 66 72 55 68 55 66 55 55 76 ...
##  $ feature_desc: chr  "Wrap Interior Feature" "Wrap Back Feature" "Interior Page Feature" "Wrap Interior Feature" ...
##   ..- attr(*, "format.sas")= chr "$"
##  $ display_desc: chr  "Not on Display" "Not on Display" "Not on Display" "Not on Display" ...
##   ..- attr(*, "format.sas")= chr "$"
##  $ geography   : num  1 1 1 1 1 1 1 1 1 1 ...
##  - attr(*, "label")= chr "DH_CAUSAL_LOOKUP"

Commodity Lookup

str(commodity_lookup)

## Classes 'tbl_df', 'tbl' and 'data.frame':    927 obs. of  3 variables:
##  $ upc      : num  1.11e+08 5.66e+08 5.66e+08 5.66e+08 5.66e+08 ...
##  $ commodity: chr  "pasta sauce" "syrups" "syrups" "syrups" ...
##  $ brand    : chr  "Vincent's" "Pine Mountain" "Miller" "Miller" ...
##  - attr(*, "label")= chr "DH_PRODUCT_LOOKUP"

Store Lookup

str(store_lookup)

## Classes 'tbl_df', 'tbl' and 'data.frame':    387 obs. of  2 variables:
##  $ store         : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ store_zip_code: num  37865 30084 30039 31210 30044 ...
##  - attr(*, "label")= chr "DH_STORE_LOOKUP"

Transactions

str(transactions)

## Classes 'tbl_df', 'tbl' and 'data.frame':    5197681 obs. of  11 variables:
##  $ upc                : num  7.68e+09 3.62e+09 1.80e+09 1.00e+10 1.00e+10 ...
##  $ dollar_sales       : num  0.8 3.59 2.25 0.85 2.19 2.19 3.45 1.29 0.75 2.19 ...
##   ..- attr(*, "label")= chr "SHOP_SPEND"
##  $ units              : num  1 1 1 1 1 1 1 1 1 1 ...
##   ..- attr(*, "label")= chr "QUANTITY"
##  $ time_of_transaction: chr  "1100" "1100" "1137" "1148" ...
##   ..- attr(*, "label")= chr "TRANS_TIME"
##   ..- attr(*, "format.sas")= chr "$"
##  $ geography          : num  2 2 2 2 2 2 2 2 2 2 ...
##  $ week               : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ household          : num  125434 125434 108320 162016 89437 ...
##  $ store              : num  244 244 244 244 244 244 244 244 244 244 ...
##  $ basket             : num  1 1 2 3 4 4 5 5 6 7 ...
##  $ day                : num  1 1 1 1 1 1 1 1 1 1 ...
##  $ coupon             : num  0 0 0 0 0 0 0 0 0 0 ...
##  - attr(*, "label")= chr "DH_TRANSACTIONS"

3.4 Data summary

Below is the list of all the variables(dimensions) and their summaries:

Sr. No.	Variable	Table Name	Data type	Definition	Values
1	basket	TRANSACTIONS	numeric	unique identifier of a trip made by a Household to a store	numeric key
2	brand	COMMODITY_LOOKUP	varchar	unique name of a Brand to which multiple commodities can belong	character values
3	commodity	COMMODITY_LOOKUP	varchar	unique name of a commodity	pancake mixes, syrups, pasta sauce & pasta
4	coupon	TRANSACTIONS	numeric	indicates a coupon was used	1 = Yes, 0 = No
5	day	TRANSACTIONS	numeric	Chronologically ordered identifier for Day of transaction	numeric, 1 to 728
6	display_desc	CAUSAL_LOOKUP	varchar	Product location in temporary in-store display	character values
7	feature_desc	CAUSAL_LOOKUP	varchar	Product location on weekly mailer	character values
8	geography	CAUSAL_LOOKUP, TRANSACTIONS	numeric	identifier for a geography	only two groups, 1 or 2
9	household	TRANSACTIONS	numeric	unique identifier for a household(customer)	numeric key
10	store	CAUSAL_LOOKUP, TRANSACTIONS, STORE_LOOKUP	numeric	unique identifier for a store	numeric, 1 to 387
11	store_zip_code	STORE_LOOKUP	numeric	Zip code of the store	standard zip codes
12	time_of_transaction	TRANSACTIONS	numeric	military time of transaction	4-digit military time
13	upc	CAUSAL_LOOKUP, TRANSACTIONS, STORE_LOOKUP	numeric	Uniquely identifies a commodity of a brand	numeric key
14	week	CAUSAL_LOOKUP, TRANSACTIONS	numeric	Chronologically ordered identifier for a week	numeric, 1 to 104

Below is the list of all the variables(measures) and their summaries

summary(transactions$dollar_sales)
summary(transactions$units)

Sr. no.	variable	Min.	1st Qu.	Median	Mean	3rd Qu.	Max.
1	dollar_sales	-11.76	0.99	1.5	1.757	2.19	153.14
2	units	1	1	1	1.197	1	156

In order to perform EDA, we pool all our data together into a single dataset trnx_cmb with the following code:

trnx_cmb <- transactions %>%
  merge(causal_lookup,by = c("upc","week","store","geography")) %>% 
  merge(commodity_lookup, by = "upc")

4 Exploratory Data Analysis

4.1 Proposed Data discovery

To discover hidden insights in the data, we plan on doing the following:

Plot measures against categorical variables to find trends
Identify outliers in the trends and treat them in to arrive at robust models for estimation
Identify variance in measures based on chronological variables like time and day
Join tables in the datasets to mine more information. For example, joining Transactions with Causal Lookup on Store, Week & UPC can reveal effectiveness of coupons
Slicing and dicing the data is another crucial task because the effect of only those factors can be studied which are well defined. For example, Coupons’ performance cannot be assessed from week 1 to week 42 since Causal data during that period is not available
Create new variables to assess their significance in the variance of the measures. For example, Day can be converted into a set of logistic variables to identify how does the sales and unit of sales vary with Day
Summarizing data based on groups of factors will be very effective to understand the significance of influence they hold on our measures. We plan on using functions of the dplyr package to undertake this task

4.2 Proposed Data Visualizations

Following types of data visualizations shall be crucial during the course of our analysis

Scatter plot: Quick and robust way to indicate if a supposed correlation is worth pursuing
Line plot: Line plots with measures plotted against time-series variables shall reveal time-dependent patterns
Box-plot: To be used for Outlier detection so that they can be treated
Histograms: This will reveal distribution of measures against categorical variables like store and commodity
QQ plot: to assess the similarity of distribution of measures against different values of a categorical variable

4.3 EDA for Coupon Effectiveness

As discussed earlier the intent of this report is to analyze and further suggest improvements for Coupon Effectiveness. We studied the distribution of different Categorical variables and Measures with respect to Coupons in order to acheive our goal.

4.3.1 Brand, Commodity and Dollar Sales

trnx_cmb %>% filter(coupon == 1) %>% 
ggplot(aes(x = as.factor(brand), y = as.factor(commodity), col = dollar_sales)) +
geom_jitter(alpha = 0.35) + coord_flip() +
ggtitle("Dollar Sales with Coupons by Brand & Commodity") +
xlab("Brands") + ylab("Commodities")

We first studied the distribution of Sales (in $ ) with Brand and Commodities as factors when a Coupon is used. As observed, Pasta Sauce of the brands Ragu, Prego, Classico and Bertolli are amongst the most popular products purchased while redeeming coupons. Also, they are amongnst the cheapest options available.

4.3.2 Feature Description, Display Description and Dollar Sales

trnx_cmb %>% filter(coupon == 1) %>% 
ggplot(aes(x = as.factor(feature_desc), y = as.factor(display_desc),
  col = dollar_sales)) + geom_jitter(alpha = 0.35) +
  ggtitle("Dollar Sales with Coupons by Feature and Display") +
  xlab("Offer Featured") + ylab("Offer Displayed") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Secondly, we analyzed the distribution of Sales (in $ ) when the coupon was redeemed based on how the coupon was featured in the newsletter and displayed in the stores. As observed, Coupons that are featured on the Front page and Interior page do great even when that offer is not on display in the store.

4.3.3 Geography, Commodities and Unit Sales

trnx_cmb %>% filter(coupon==1) %>% 
ggplot(aes(x = as.factor(geography), y = units, col = as.factor(commodity))) +
geom_jitter( ) +
ggtitle("Geographical Distribution of Unit of Commodities sold with Coupons") +
  xlab("Geographies") + ylab("Units")

Furthermore, we studied the geographical distributions of the usage of coupons and the units purchased grouped by commodities. We observe that the most popular commodity sold is Pasta Sauce with units varying from 1 to 4 and it remains consistent amongst geography.

4.3.4 Brands, Commodities and Units

trnx_cmb %>% filter(coupon == 1) %>% 
ggplot( aes(x = brand, y = units, col= as.factor(commodity))) +
geom_jitter(alpha = 0.35) + coord_flip() +
ggtitle("Units with Coupons by Brand & Commodity") +
xlab("Brands") + ylab("Units")

We also sliced the data by Brands and Commodities to see the Quantity of products purchased when a coupon was used. As we can see, Pasta Sauce and Pasta are the top two commodities with units ranging from 1-4.

4.3.5 Time of Transaction

x <- trnx_cmb$time_of_transaction[trnx_cmb$coupon == 1] %>% as.integer()
plot(density(x), main = "Distribution of Time",
     xlab = "Time when Coupon was Used",
     ylab = "Density")

As we observe, the distribution of time when a coupon is used is almost normal and hence we created an empirical distribution function to study the cumulative distribution of time when plotted as a continuous variable

x.ecdf = ecdf(x)
plot(x.ecdf, las = 1,
main = "Empirical Cumulative Distribution of Time when Coupon was used",
xlab = "Time (HHMM)", ylab = "Cumulative Probability" , col = "darkorange",
lwd = 1, cex=0.2)
Alpha  = 0.01
n = length(x)
Eps = sqrt(log(2 / Alpha)/(2 * n))
grid <- seq(min(x),max(x), length.out = 1000)
lines(grid, pmin(x.ecdf(grid) + Eps, 1), col = 'red')
lines(grid, pmax(x.ecdf(grid) - Eps, 0), col = 'blue')

This helps us answer questions like, what are the chances of a coupon being used between 3 PM to 4 PM on any given day.

x.ecdf(1600) - x.ecdf(1500)

## [1] 0.08964143

4.4 Missing information

Following are some of the points of information that we lack in order to better analyze this data-set

Day has been anonymized so any assumption about the trend observed based on days will be purely empirical and not as robust as it could be if we would have known which day corresponds to the actual day of a week
The grain of transaction is commodity and not product which renders Product size variable as ineffective in our analysis. If Transactions recorded would also provide information about which product was purchased, better assessment of the patterns can be done
The anonymity in time does not help with Day-light Saving application which may skew our regression model
There is no operational link between Geography and Store. Ideally, a Geography variable in the Store Lookup would have greatly helped
Also, as observed using Tableau, the geographical distribution is geographically continuous. This could mean that the store-geography combination can be used for factor analysis. However, we are not aware of the attributes of this relationship

Geographic distribution of stores

5 Summary

Problem Statement

How can we increase the effectiveness of Coupons given the historical usage of coupons done so far?

Approach

We observed the distribution of coupon-usage by different categories present in the dataset like Brand, Commodity, Geography, etc. We also observed the distribution of different measures like Dollar Sales and Units purchased when a coupon was used.

5.1 Insights

Pasta sauce commodity is the most common commodity to be purchased while redeeming a coupon
For commodity Pasta Sauce, most popular brands are Ragu, Prego, Classico and Bertoli. These items are low-priced and purchased heavily while redeeming coupons
Similarly, for Pasta, Barilla, Private Label and Mueller are popular choices of brands. For Private Label and Mueller, although they are high-priced, they are purchased heavily while redeeming coupons
When promoting for coupons, the coupon-feature has a larger bearing on the coupon effectiveness as compared to the coupon-display. For example, even when the coupon is not on display in the store, it is effective if it featured in the coupon booklet’s front page or Interior page
The coupon usage remains largely consistent across geographies. However, the most common basket size(units purchased) while redeeming a coupon is between 1 to 4
The most common basket size across commodities remains 1 to 4 for any given brand
The distribution of the time of transaction when a coupon is redeemed is largely normal. By saying this we mean that we can come up with a probability function to estimate the coupon usage. For example, we can statistically show that the probability to use a coupon for purchase between 3 PM and 4 PM is 9%

5.2 Implications

We should look at launching Coupons which combine Pasta and pasta Sauce of more popular brands of the latter. This way, the sales of Pasta can be increased
Based on the likelihood of Coupon for a given time period, we can launch Coupons targeting customers which visit during the period when the usage is most likely
The Coupon campaigns for Syrup and Pancakes have been very ineffective, we can either discontinue the campaign to focus on other commodities or combine them with more prevalent commodities like Pasta sauce
How the coupon is featured on the coupon booklet is more effective than in-store displays. Strategic featuring of Pasta along with Pasta sauce on the Coupon booklet promises effective campaigning for Pasta

5.3 Limitations

All trends observed are based on visualizing the data. In order to make them more concrete and actionable, we need to perform Hypothesis testing. This will not only give us the statistical significance of the trend reported, it will also allow us to obtain confidence intervals in which the sales and units purchased can vary
Since Day and Week were anonymized, we have not based any analysis on top of them
Since the UPCs are being recycled, we had to discard the Product information attached to the UPC

Coupon Success Analysis

Shruti Arvind Jibhakate | Syed Imad Husain

December 9, 2018