Introduction

If you’ve ever read a Sunday paper, you’re well aware of the dozens upon dozens of grocery coupon inserts tucked inside. Perhaps you receive discount flyers in the mail, or you grab the list of specials at the front of the market before you do your shopping. Supermarket coupons are practically unavoidable these days; there’s even a TV show, “Extreme Couponing,” that showcases individuals’ abilities to squeeze the most savings they can out of these discounts. From the consumer’s perspective, it probably seems like your market of preference is doing you a favor, because who doesn’t like a deal?

With this in mind, I set out to examine the impact of grocery stores’ coupon programs. The main questions I hope to answer are:

  1. How many people use coupons to buy a product for the first time?
  2. Of those people, how many continued to buy the same product without a coupon, and how many additional units did they purchase?
  3. How many people only use coupons for a one-time discount?
  4. What is the effect of coupons on total spending, trip frequency, and spending per trip?

The data I use comes from Dunnhumby’s dataset, “The Complete Journey,” which is available for download on Dunnhumby’s website. The dataset tracks household spending over two years for a group of 2,500 households identified as frequent shoppers. For anonymity purposes, the grocery store chain has not been named. The dataset is broken down into eight data frames, which can be categorized into two groups: data tables and look-up tables.

Within the data tables, there are three data frames: hh_demographic contains demographic information on 801 households; transaction_data contains all products purchased by households participating in the study; and campaign_table lists the marketing campaigns received by each household.

Within the look-up tables there are five data frames: campaign_desc is a table which gives the length of time for which the campaign runs; product contains all the product information for each product ID; coupon lists all the products each coupon is good for (note that most coupons are good for more than one product); coupon_redempt identifies which households redeemed a coupon and when they did; and causal_data signifies whether a given product was featured in the weekly mailer or part of an in-store display.

To be clear, there are three distinct types of discounts: loyalty programs (like Price Chopper’s AdvantEdge Card), manufacturer coupons, and matched coupons. Loyalty programs and matched coupons both affect the store’s bottom line, but manufacturer coupons are reimbursed by the issuing manufacturer. In this paper I will be looking at the effect of aggregate coupon savings, because transactions where coupons are present usually consist of a combination of the three, but the vast majority of savings comes from manufacturer coupons and loyalty discounts.

#load packages
library(tidyverse)
library(stargazer)
library(descr)
library(arules)
library(knitr)

#load data files
camp_desc <- read_csv("dh/campaign_desc.csv")
camp_table <- read_csv("dh/campaign_table.csv")
causal <- read_csv("dh/causal_data.csv")
coupon <- read_csv("dh/coupon.csv")
coupon_red <- read_csv("dh/coupon_redempt.csv")
hh <- read_csv("dh/hh_demographic.csv")
prod <- read_csv("dh/product.csv")
trans <- read_csv("dh/transaction_data.csv")
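
Before moving on, it’s worth quantifying the earlier claim that most coupon-related savings come from manufacturer coupons and loyalty discounts. A minimal sketch, assuming the standard Complete Journey discount columns (RETAIL_DISC for the loyalty-card discount, COUPON_DISC for the manufacturer coupon, COUPON_MATCH_DISC for the store’s matched portion) and that discounts are recorded as negative values:

#total savings by discount type (column roles assumed from the dataset documentation)
trans %>%
  summarize(loyalty      = sum(RETAIL_DISC),
            manufacturer = sum(COUPON_DISC),
            store_match  = sum(COUPON_MATCH_DISC))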

Exploratory Data Analysis

Let’s first perform an exploratory data analysis to familiarize ourselves with the data. Table A shows a summary of total spending, number of products purchased, and number of trips to the store at the household level over two years.

#summarize household purchasing
buyers <- trans %>% group_by(household_key) %>%
  summarize(tot_spend=sum(SALES_VALUE),
            num_prods=n(),
            trips=n_distinct(BASKET_ID)) %>%
  arrange(desc(tot_spend)) %>%
  as.data.frame()

stargazer(select(buyers,tot_spend,num_prods,trips),
          title = "Table A: Household Purchasing", type = "text",
          median = TRUE, iqr = TRUE, digits = 0)
## 
## Table A: Household Purchasing
## ==================================================================
## Statistic   N   Mean  St. Dev. Min Pctl(25) Median Pctl(75)  Max  
## ------------------------------------------------------------------
## tot_spend 2,500 3,223  3,349    8    971    2,158   4,413   38,320
## num_prods 2,500 1,038   999     4    325     734   1,454.5  6,851 
## trips     2,500  111    116     1     39      79    142.2   1,300 
## ------------------------------------------------------------------

Table A shows us that over the two-year period, the average consumer:

  • spent a total of $3,223
  • purchased 1,038 products
  • visited the supermarket 111 times

Figure 1

#create a running average for household spending
trend <- trans %>%
  filter(SALES_VALUE != 0) %>%
  group_by(household_key,DAY) %>%
  summarize(spend=sum(SALES_VALUE)) %>%
  mutate(cum_spend=cumsum(spend),
         trips = 1,
         cum_trips=cumsum(trips))
trend$avg_trip_spend <- trend$cum_spend/trend$cum_trips

#filter out noise
trend100 <- trend %>%
  select(-trips) %>%
  mutate(num_trips=n())
trend100 <- filter(trend100, num_trips>100, cum_trips>20)

#graph the running average of household spending
trend100$household_key <- as.factor(trend100$household_key)
ggplot(trend100,aes(x=cum_trips,y=avg_trip_spend,color=household_key)) + 
  geom_point() +
  ggtitle("Running Average of Household Spending") +
  xlab("Trips") + ylab("Average Spending (dollars)") + 
  labs(caption = "Figure 1") +
  theme(legend.position="none")

#show first 5 entries of a household to demonstrate running average calculation
cols <- c("Household Key || ","Day || ","Trip Spend || ","Running Total || ",
          "Trip Number || ","Average Trip Spend || ","Total Trips")
kable(trend100[1:5,], caption = "Table B", 
      format = "html", col.names = cols)
Table B
Household Key | Day | Trip Spend | Running Total | Trip Number | Average Trip Spend | Total Trips
6             | 166 | 21.47      | 726.24        | 21          | 34.58286           | 220
6             | 173 | 63.52      | 789.76        | 22          | 35.89818           | 220
6             | 175 | 12.72      | 802.48        | 23          | 34.89043           | 220
6             | 178 | 56.97      | 859.45        | 24          | 35.81042           | 220
6             | 180 | 20.91      | 880.36        | 25          | 35.21440           | 220

Table B provides an example of the running-average calculation. Each time a customer makes a purchase at the store, their bill is added to their cumulative spending (cum_spend), which is then divided by the total number of trips they’ve made (cum_trips). The fact that the lines are mostly horizontal implies that spending within each household doesn’t vary much over time. You’ll notice that for this particular household cum_trips starts at 21; this reflects one of two simplifying assumptions I’ve made for the model. I only include households who shopped at least 100 times over the two years, or about once per week, on the assumption that for these households this store chain is their primary supermarket. For those households, I then filtered out their first 20 purchases to establish a less volatile starting average, so each household contributes at least 80 observations. Figure 1 is the visual output of this process, with each colored line representing a different household.
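
We can also hand-check the running-average arithmetic using only the numbers shown in Table B. Before trip 21 the household had accumulated 726.24 - 21.47 = 704.77 in spending; adding each trip’s bill and dividing by the running trip count reproduces the Average Trip Spend column:

#reproduce the avg_trip_spend values for trips 21-25 of the Table B household
prior <- 704.77                               #cumulative spending before trip 21
spend <- c(21.47, 63.52, 12.72, 56.97, 20.91) #bills for trips 21 through 25
(prior + cumsum(spend)) / 21:25
## 34.58286 35.89818 34.89043 35.81042 35.21440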

Figure 2

#summarize the linear trend for each household's change in spending
lin_trend <- trend100 %>% 
  group_by(household_key,num_trips) %>%
  summarize(trend=round(coef((lm(avg_trip_spend ~ cum_trips)))[2],5)*100,
            trend50=ifelse(trend<=0,0,1))

#histogram of changes in household spending
ggplot(lin_trend,aes(x=trend)) + 
  geom_density() + xlim(-25,25) +
  labs(title = "Distribution of Changes in Household Spending",
       caption = "Figure 2") +
  xlab("Change in Average Household Spending (%)") + ylab("Density")

#descriptive statistics
lin_trend <- as.data.frame(lin_trend)
stargazer(select(lin_trend,trend),type="text",median=TRUE,iqr=TRUE,digits=2,
          title = "Table C: Descriptive statistics of changes in household spending")
## 
## Table C: Descriptive statistics of changes in household spending
## =================================================================
## Statistic  N  Mean St. Dev.  Min   Pctl(25) Median Pctl(75)  Max 
## -----------------------------------------------------------------
## trend     843 0.75   7.35   -44.83  -2.61    0.66    4.26   40.22
## -----------------------------------------------------------------
#summarize descriptive statistics
hist_table <- lin_trend %>% 
  group_by(trend50) %>% 
  summarize(num=n()) %>%
  mutate(pct=num/sum(num)*100) %>%
  as.data.frame()
stargazer(hist_table, type = "text", summary = FALSE, rownames = FALSE, 
          title = "Table D: Number of customers who spend less (0) and more (1)")
## 
## Table D: Number of customers who spend less (0) and more (1)
## ==================
## trend50 num  pct  
## ------------------
## 0       375 44.484
## 1       468 55.516
## ------------------

In order to determine changes in spending, I fit a linear model to each household’s running average to find the overall trend. Figure 2 gives us a visualization of the changes in household spending over the two-year period. As we would expect, the density plot resembles a normal curve. However, tables C and D show that the distribution is in fact shifted slightly to the right, meaning that slightly more households are buying more over time: about 44.5% of households are spending the same or less, and 55.5% are spending more.

Coupon Effectiveness

Now, let’s examine some of the coupon data. First, however, we have to merge a few of the data frames in order to pair coupon redemptions with all transactions.

#merge product descriptions with coupon data
coups <- inner_join(coupon,prod,by="PRODUCT_ID")

#add coupon and product information to transaction data, keep only transaction data where a coupon could be used
coup_trans <- inner_join(coups,trans,by="PRODUCT_ID")

#merge coupon-transaction data with redeemed coupons, creating a data frame with transaction and product info for all coupon transactions
redeem <- inner_join(coupon_red,coup_trans,
                       by=c("household_key","DAY","COUPON_UPC","CAMPAIGN"))

#some observations have a $0 discount in both `COUPON_DISC` and `RETAIL_DISC` - we can assume a coupon was not actually used in this case, so I filtered these observations out
redeem <- redeem %>%
  group_by(household_key,DAY,COUPON_UPC) %>%
  filter(COUPON_DISC != 0 | RETAIL_DISC != 0)

Because the original Dunnhumby data description states that there are multiple coupons within each coupon UPC (e.g., a flyer or mailer), I make the assumption that the store honored all coupon-redeemable products that received a store or manufacturer coupon and were purchased on the given date. This explains the difference in observation counts between coupon_red (2,318) and redeem (3,271). The resulting data frame redeem contains product and transaction information for all unit transactions where a targeted campaign coupon was used. To be clear, there were many other coupons used during the period of observation, but we are only interested in the store’s direct marketing campaigns.
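
A quick sanity check makes this expansion explicit; the row counts below are the ones cited above:

#one redeemed COUPON_UPC can match several purchased products on the same day
nrow(coupon_red) #2318 redemption records
nrow(redeem)     #3271 matched unit transactions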

#merge our redeemed coupon data with all transactions
redeem$coupon <- "yes"
temp <- full_join(trans,prod)
redeem <- full_join(redeem,temp)
redeem$COUPON_UPC[is.na(redeem$COUPON_UPC)] <- 0
redeem$coupon[is.na(redeem$coupon)] <- "no"
redeem$coup_used <- ifelse(redeem$coupon == "yes",1,0)

#create variable (`num_coups_used_per_prod`) that shows the number of coupons used on individual products over time - this will help us determine if a coupon was used to buy a product for the first time
redeem <- redeem %>%
  arrange(DAY) %>%
  group_by(household_key,PRODUCT_ID,COUPON_UPC) %>%
  mutate(num_coups_used_per_prod=cumsum(coup_used))

#create variable (`num_prods_bought`) that shows the cumulative number of individual products bought over time, and another variable (`first_purchase`) that indicates if the transaction was the first time the customer bought that product
redeem$dummy <- 1
redeem <- redeem %>%
  arrange(DAY) %>%
  group_by(household_key,PRODUCT_ID) %>%
  mutate(num_prods_bought=cumsum(dummy),
         first_purchase=ifelse(num_prods_bought==1,1,0))

#create variable (`first_purch_w_coup`) that indicates if the product was first purchased with a coupon
redeem$first_purch_w_coup <- ifelse(redeem$first_purchase==1 &
                                       redeem$num_coups_used_per_prod==1, 1, 0)

#out of all redeemed coupons, summarize first purchases
redeem2 <- redeem %>% 
  filter(coupon=="yes") %>%
  group_by(first_purch_w_coup) %>% 
  summarize(total=n())
redeem2$percent <- round(redeem2$total/sum(redeem2$total),4)*100
redeem2 <- as.data.frame(redeem2)
stargazer(redeem2, type="text", summary=FALSE, digits=2, rownames=FALSE,
          title = "Table E: Coupon Used to Buy a Product for the First Time")
## 
## Table E: Coupon Used to Buy a Product for the First Time
## ================================
## first_purch_w_coup total percent
## --------------------------------
## 0                  2,011  61.48 
## 1                  1,260  38.52 
## --------------------------------

Table E breaks down the number of occurrences when a coupon was used by a household to buy a product for the first time (i.e., a purchase the consumer presumably wouldn’t have made had they not received the coupon). About 38.5% of coupons were used to buy an unfamiliar product, a very good sign for manufacturers. Let’s take a deeper look into that group and see how many additional units these households purchased after using their first coupon on a new product.

Figure 3

redeem3 <- redeem %>%
  arrange(DAY) %>%
  group_by(household_key,PRODUCT_ID) %>%
  mutate(tot_prods=max(num_prods_bought))
redeem3$num_prods_after_coup <- ifelse(redeem3$first_purch_w_coup == 1,
                                       redeem3$tot_prods-redeem3$first_purch_w_coup,NA)

#plot bar chart
ggplot(redeem3,aes(x=num_prods_after_coup)) + 
  geom_bar(stat = "count") +
  labs(title = "Coupon Effectiveness",
       caption = "Figure 3") +
  xlab("Additional products") +
  ylab("Number of Instances") +
  scale_x_continuous(breaks = c(1:13))

table(redeem3$num_prods_after_coup, dnn = "Table F: Number of observations")
## Table F: Number of observations
##    0    1    2    3    4    5    6    7    8   13 
## 1018  167   38   16    9    4    4    2    1    1

Unfortunately, the vast majority of these coupons were used to buy a new product only once, as illustrated in Figure 3. Table F breaks down the number of observations, and we see that only about 19% of households went on to purchase the product again. However, considering the cost of new-customer acquisition, this is still a fairly inexpensive way to gain new customers.
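
The 19% figure can be confirmed directly from the redeem3 data frame built above; a minimal check:

#share of first-time coupon purchases followed by at least one repeat purchase
mean(redeem3$num_prods_after_coup > 0, na.rm = TRUE) #242 of 1,260, about 0.19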

An issue that arises is that we only observe a fixed two-year window of transaction data. It is reasonable to suspect that Figure 3 overstates the lower end of the scale and understates the upper end, because we do not know the customers’ purchasing habits after the window ends. Let’s take a look at coupon redemption over time to see if this is a plausible assumption.

Figure 4

redeem4 <- redeem3 %>%
  filter(!is.na(num_prods_after_coup)) %>%
  group_by(WEEK_NO,num_prods_after_coup) %>%
  summarize(tot_coups=sum(coup_used))

ggplot(redeem4) + 
  geom_bar(stat = "identity",aes(x=WEEK_NO,y=tot_coups,
                   fill=reorder(num_prods_after_coup,-num_prods_after_coup))) +
  scale_fill_discrete(name = "Additional \nProducts \nPurchased") +
  geom_smooth(aes(x=WEEK_NO,y=tot_coups)) +
  labs(title = "Coupon Redemption Over Time",
       caption = "Figure 4") +
  xlab("Week Number") + ylab("Coupons Redeemed")

In Figure 4, each bar shows the number of occurrences when a customer used a coupon for the first time to buy a new product, and the trend line visualizes the weekly changes. The coloring represents the number of additional units the customer bought after using the coupon. If the assumption holds that Figure 3 overstates low values of additional products and understates higher ones, we would expect the average number of additional products purchased within each bar to decrease over time. It’s difficult to gauge this from Figure 4 alone, but Figure 5 (below) summarizes the number of additional products purchased at the weekly level.

Figure 5

redeem5 <- redeem4 %>%
  filter(!is.na(num_prods_after_coup)) %>%
  group_by(WEEK_NO) %>%
  summarize(avg_prods_after=(sum(num_prods_after_coup*tot_coups)/sum(tot_coups)))

ggplot(redeem5,aes(x=WEEK_NO,y=avg_prods_after)) + 
  geom_line() + 
  geom_smooth() +
  labs(caption = "Figure 5") +
  xlab("Week Number") + ylab("Average Additional Products")

Each point on the line graph is the average number of additional units purchased, starting from that week. As expected, there’s a fairly steady decline in the number of additional products purchased, so Figure 3 most likely overstates lower-end values and understates higher-end ones.

Lastly, we’ll make a frequency plot of the first half of the coupon redemption subset from Figure 4. This way, we’ll reduce some of the error that was introduced in the form of overstating low values and understating high values.

Figure 6

redeem6 <- filter(redeem3,WEEK_NO<67)
ggplot(redeem6,aes(x=num_prods_after_coup)) + 
  geom_bar(stat = "count") +
  labs(title = "Coupon Effectiveness (First Half of Observations)",
       caption = "Figure 6") +
  xlab("Additional products") +
  ylab("Number of Instances")

table(redeem6$num_prods_after_coup, dnn = "Table G: Number of observations")
## Table G: Number of observations
##   0   1   2   3   4 
## 217  69  11   6   3

Figure 6 aims to show the relative distribution of additional units purchased. As we saw in Figure 4, the first coupon was redeemed in Week 33 and the last in Week 100, so Figure 6 only includes observations from before Week 67, roughly the midpoint between 33 and 100. The relative distribution of additional products purchased is likely more accurate in this chart than in Figure 3.

Tests for Statistical Significance

Finally, we’ll compare coupon users with those who don’t use coupons at the supermarket. The intuition is that those who use coupons spend less, because after all, a coupon is a discount on your overall purchase. But are these shoppers really getting the deal they think they’re getting? Let’s first start by looking at total household spending.

Figure 7

#create data frame that only includes the household numbers of coupon users
coupon_users <- coupon_red %>% 
  group_by(household_key) %>% 
  summarize(coup_user="yes")

#summarize total household spending
test <- trans %>% 
  group_by(household_key) %>% 
  summarize(tot_spend=sum(SALES_VALUE))

#merge total spending with coupon user info
test <- full_join(test,coupon_users,by="household_key")

#define non-users
test$coup_user[is.na(test$coup_user)] <- "no"

#split data frame into subsets of users and non-users
test_user <- as.data.frame(filter(test,coup_user=="yes"))
test_nonuser <- as.data.frame(filter(test,coup_user=="no"))

#density plot
ggplot(test,aes(x=tot_spend,color=coup_user)) + 
  geom_density() +
  labs(title = "Distribution of Total Household Spending",
       caption = "Figure 7") +
  xlab("Total Household Spending") + ylab("Density") +
  scale_color_discrete(name = "Coupon User")

#t-test
t <- t.test(test_user$tot_spend,test_nonuser$tot_spend)
t$data.name <- "Coupon Users (x) and Non-Users (y)"
t
## 
##  Welch Two Sample t-test
## 
## data:  Coupon Users (x) and Non-Users (y)
## t = 20.35, df = 517.8, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  3708.671 4501.253
## sample estimates:
## mean of x mean of y 
##  6615.326  2510.364

Figure 7 shows the distribution of household spending among coupon users (blue) and non-users (red). Without crunching the numbers, it’s pretty obvious that coupon users do in fact spend much more than non-users. The t-test confirms this, showing that on average, coupon users spent $6,615 over the two years, while non-users spent about $2,510. This result is statistically significant, too, with a t-statistic of 20.35. However, there could be a number of reasons why this is so. Maybe the non-users do most of their shopping at a different store, and therefore don’t take advantage of this store’s discounts.
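
One robustness note before moving on: total spending is heavily right-skewed (visible in Figure 7), and the t-test compares means, so a rank-based test is a sensible complement to the Welch t-test. A one-line sketch:

#distribution-free comparison of the two groups' total spending
wilcox.test(test_user$tot_spend, test_nonuser$tot_spend)

With that noted, let’s look at trip frequency between the two groups.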

Figure 8

#summarize store trips by households
trips <- trans %>% 
  group_by(household_key) %>% 
  summarize(num_trips=n_distinct(BASKET_ID))
test_user <- inner_join(test_user,trips,by="household_key")
test_nonuser <- inner_join(test_nonuser,trips,by="household_key")
test <- inner_join(test,trips,by="household_key")

#density plot
ggplot(test,aes(num_trips,color=coup_user)) + 
  geom_density() +
  labs(title = "Distribution of Trips Among Coupon Users and Non-Users",
       caption = "Figure 8") +
  xlab("Number of Trips") + ylab("Density") +
  scale_color_discrete(name = "Coupon User")

#t-test
t2 <- t.test(test_user$num_trips,test_nonuser$num_trips)
t2$data.name <- "Coupon Users (x) and Non-Users (y)"
t2
## 
##  Welch Two Sample t-test
## 
## data:  Coupon Users (x) and Non-Users (y)
## t = 14.767, df = 549.71, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   86.57892 113.14571
## sample estimates:
## mean of x mean of y 
##  193.1198   93.2575

Indeed, in Figure 8 we see that coupon users made more trips to the store than non-users. On average, coupon users shopped 193 times over two years, while non-users went 93 times. This result, too, is statistically significant, with a t-statistic of 14.77. Lastly, let’s look at average spending per trip, which should be a more indicative measure of the coupons’ effectiveness. Because coupon users both spent more and visited more frequently, we would expect average spending per trip to be more similar between the two groups.

Figure 9

#create variables for household spending per trip
test$trip_spend <- test$tot_spend/test$num_trips
test_user$trip_spend <- test_user$tot_spend/test_user$num_trips
test_nonuser$trip_spend <- test_nonuser$tot_spend/test_nonuser$num_trips

#density plot
ggplot(test,aes(x=trip_spend, color=coup_user)) + 
  geom_density() +
  labs(title = "Distribution of Spending per Trip",
       caption = "Figure 9") +
  xlab("Average Spending per Trip") + ylab("Density") +
  scale_color_discrete(name = "Coupon User")

#t-test
t3 <- t.test(test_user$trip_spend,test_nonuser$trip_spend)
t3$data.name <- "Coupon Users (x) and Non-Users (y)"
t3
## 
##  Welch Two Sample t-test
## 
## data:  Coupon Users (x) and Non-Users (y)
## t = 8.9153, df = 566.7, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   7.708322 12.064558
## sample estimates:
## mean of x mean of y 
##  39.79091  29.90447

Perhaps the most important of the three density plots, Figure 9 shows us that coupon users actually spend more each time they go to the store than those who do not use coupons. On average, coupon users spent $39.79 per trip, while non-users spent $29.90. A t-statistic of 8.91 means that this result is significant as well.

Another way to gauge a coupon’s effectiveness is to look at consumer behavior when customers first start using coupons. Figure 10 (below) uses only the coupon-user subset. Do these households start spending more after they use their first coupon, or is this group just predisposed to spending more altogether?

Figure 10

#create data frame that only lists the day on which a household first used a coupon
coupon_users <- coupon_red %>% 
  group_by(household_key) %>% 
  summarize(first_coup=min(DAY))

#merge above data frame with transaction data for those households that used coupons
users <- inner_join(trans,coupon_users,by=c("household_key"))
users <- users %>% 
  mutate(coup_start=ifelse(first_coup>DAY,"no","yes")) %>%
  group_by(household_key,WEEK_NO,coup_start) %>% 
  summarize(weekly_spend=sum(SALES_VALUE),
            dummy=1)

#split data frame into two subsets, before coupon used and after coupon used
users_before <- filter(users,coup_start=="no")
users_after <- filter(users,coup_start=="yes")

#because each household first uses coupons at a different time, the `users_before` subset indexes each household's shopping weeks with negative numbers, such that the earliest week is the most negative and the last week before the first coupon is 0
users_before <- users_before %>%
  group_by(household_key) %>%
  mutate(cum_dummy=cumsum(dummy),
         trip=cum_dummy-max(cum_dummy)) %>%
  select(-dummy,-cum_dummy)

#similarly, `users_after` indexes the shopping weeks after the first coupon was used, such that the first week is 0 and the last is the largest number
users_after <- users_after %>%
  group_by(household_key) %>%
  mutate(trip=cumsum(dummy)-1) %>%
  select(-dummy)

#bind rows
users2 <- bind_rows(users_before,users_after)

#plot the graph
ggplot(users2,aes(x=trip,y=weekly_spend,color=coup_start)) + 
  geom_smooth() +
  labs(title = "Customers Spend More After They Start Using Coupons",
       caption = "Figure 10") +
  xlab("Week Number (with respect to first coupon use)") + ylab("Weekly Spending (dollars)") +
  scale_color_discrete(name = "Used First Coupon")

#t-test
t4 <- t.test(users_before$weekly_spend,users_after$weekly_spend)
t4$data.name <- "Weekly Spending Before (x) and After (y) First Coupon Was Used"
t4
## 
##  Welch Two Sample t-test
## 
## data:  Weekly Spending Before (x) and After (y) First Coupon Was Used
## t = -8.4101, df = 25485, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -9.010840 -5.604561
## sample estimates:
## mean of x mean of y 
##  85.99348  93.30118

Figure 10 shows the difference in weekly spending before (red/no) and after (blue/yes) the first coupon was used. Every coupon user begins in the “no” group, and once they use their first coupon, their weekly spending is averaged into the “yes” group. The further we move from zero on the x-axis, the fewer observations we have (as we saw in Figure 4, first-time coupon redemption among households ranged from Week 33 to Week 100), which is why there is more variation at the tail ends of the graph. Conversely, the points closest to zero have the most observations and tend to be the most accurate. The graph itself paints a clear picture, and the t-test confirms that spending among coupon users increases once they use their first coupon. Before the first coupon, households spent an average of $85.99 per week on groceries; afterward, they spent $93.30 per week. Although it’s not a huge increase, it is statistically significant, with a t-statistic of -8.41.
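
One caveat is that the t-test above pools every household-week as if independent. A hedged robustness sketch, pairing each household’s average weekly spending before and after its first coupon (this assumes tidyr’s pivot_wider, loaded with the tidyverse):

#paired comparison of mean weekly spending before vs. after first coupon use;
#households lacking either period are dropped in pairs by t.test
paired <- users2 %>%
  group_by(household_key, coup_start) %>%
  summarize(avg_weekly = mean(weekly_spend)) %>%
  pivot_wider(names_from = coup_start, values_from = avg_weekly)
t.test(paired$yes, paired$no, paired = TRUE)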

Conclusion

I think that coupons are a true marvel of marketing: customers perceive a good deal, but the numbers show that they actually end up spending more at the store than they otherwise would have. There are likely some psychological effects at play here, and this could make a good case study for the emerging field of behavioral economics. Perhaps customers simply choose to do the majority of their shopping at the place where they think they’re getting the best deal, or maybe they prefer the store for some other reason (such as fresh produce or a pleasant checkout experience). Either way, if I were a marketing executive at a supermarket, I would be sure to continue campaigns like the ones seen from this store. As we heard from Glen Bradley, Price Chopper’s Group VP of Marketing, targeted marketing plays a huge role in their operation, with three different firms handling various aspects. The data I used is a few years old, and I think it would be really interesting to delve into some of Price Chopper’s data. As Glen said, “Out of 100,000 mailers, no two are the same.” I can now see why a company like Price Chopper invests so heavily in its business analytics department.

Addendum

How would a supermarket decide which products to offer coupons for?
There are many complex analytical ways to determine which products should be targeted for discounts. As I mentioned earlier, companies like Price Chopper now tailor their mailers to specific households, and marketing has never been so precisely targeted. The purposes of coupon marketing can vary as well:

  • Maybe the store is trying to push a product that hasn’t been performing well.
  • Maybe they’re trying to improve customer engagement.
  • Perhaps they would like to get new customers in the door.

Regardless of the intention, I’ve created a simple model that performs a market basket analysis (if a customer buys Product A, what is the chance they will also buy Product B?) on the products most frequently purchased with coupons, and compares the results with the store’s top-selling products.

#create a data frame that has each product from a basket transaction in a separate row
temp1 <- trans %>% group_by(BASKET_ID) %>% summarize()
temp1$dummy <- 1
temp1 <- temp1 %>% mutate(basket_num=cumsum(dummy))

#attach the basket numbers to every transaction line, then pull in the
#product descriptions from `prod`
temp1 <- full_join(trans,select(temp1,-dummy),by="BASKET_ID")
temp1 <- inner_join(temp1,select(prod,PRODUCT_ID,SUB_COMMODITY_DESC),by="PRODUCT_ID")

#number the products within each basket
temp1$dummy <- 1
temp1 <- temp1 %>% 
  group_by(basket_num) %>% 
  mutate(prod_num=cumsum(dummy)) %>% 
  select(SUB_COMMODITY_DESC,basket_num,prod_num)

#reshape so each basket occupies one row with one column per item, then
#write to CSV in the basket format that `read.transactions` expects
data <- data.frame()
for (i in 1:nrow(temp1)) {
  temp <- temp1$SUB_COMMODITY_DESC[i]
  data[temp1$basket_num[i],temp1$prod_num[i]] <- temp
}
write_csv(data,"grocery.csv")

Let’s first look at the top-selling products in the store. (Note that “Premium” refers to ice cream products and “Primal” refers to beef cuts.)

groceries <- read.transactions("grocery.csv", sep = ",")
itemFrequencyPlot(groceries, topN = 25, type = "relative", 
                  main = "Most Frequently Purchased Items", 
                  ylab = "Relative Item Frequency", cex = 0.6)

rank <- as.data.frame(itemFrequency(groceries, type = "relative")) %>%
  rename(pct=`itemFrequency(groceries, type = "relative")`) %>%
  rownames_to_column("rhs") %>%
  arrange(-pct) %>%
  mutate(dummy=1, rank=cumsum(dummy)) %>%
  select(-dummy)
rank$rhs <- paste("{",rank$rhs,"}",sep = "")

We see that white milk is by far the biggest seller, followed by bananas, white bread, soda cans, gasoline, 2-liter sodas, and so on. Let’s now look at the product categories that had the most coupons used on them.

top_coups <- redeem %>% 
  filter(coupon=="yes") %>%
  group_by(SUB_COMMODITY_DESC) %>% 
  summarize(num=n()) %>%
  mutate(pct=num/sum(num)) %>%
  arrange(-num) %>%
  filter(pct>0.01)
ggplot(top_coups,aes(x=reorder(SUB_COMMODITY_DESC,-pct),y=pct)) + 
  geom_bar(stat = "identity", fill = "gray", color = "black") +
  theme(axis.text.x = element_text(size=7,angle=45,hjust=1,vjust=1)) +
  ggtitle("Most Frequently Redeemed Coupons") +
  xlab("") + ylab("Relative Coupon Use")

Coupons were mainly used to buy frozen entrees, ice cream, and single-serve yogurts. We can now mine association rules with the apriori function from the arules package to obtain our market basket analysis.

It’s worth noting the statistical measures included in the output (a quick numeric check follows the list):

  • LHS = “Left Hand Side”
  • RHS = “Right Hand Side”
  • Support(LHS) = Count(LHS)/N
  • Confidence(LHS -> RHS) = Support(LHS ∪ RHS)/Support(LHS)
  • Lift(LHS -> RHS) = Confidence(LHS -> RHS)/Support(RHS)
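
As a quick numeric check of these identities, we can back out the support of the RHS from the first rule in Table H (below); this is a spot check, not part of the analysis:

#lift = confidence / support(RHS), so support(RHS) = confidence / lift
conf <- 0.4414594 #confidence of {FRZN SS PREMIUM ENTREES/DNRS/N} -> {FLUID MILK WHITE ONLY}
lift <- 1.988455
conf / lift       #about 0.222: white milk appears in roughly 22% of baskets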

In addition, I defined maxlen = 2 so that I only had one product on the left hand side (LHS) and one on the right (RHS). Now we’ll make rules for the top three products that coupons were used for: entrees, ice cream, and yogurt.

rules <- apriori(groceries, parameter = list(supp = 0.001, conf = 0.1, maxlen=2))

rules_entree <- subset(rules, lhs %in% "FRZN SS PREMIUM ENTREES/DNRS/N")
rules_entree <- sort(rules_entree, by="confidence",decreasing = TRUE)
entree <- as.data.frame(inspect(rules_entree[1:10]))
entree <- entree[-2]
entree$rhs <- as.character(entree$rhs)
entree <- inner_join(entree,select(rank,rhs,rank),by="rhs")
write_csv(entree,"entree.csv")
rules_premium <- subset(rules, lhs %in% "PREMIUM")
rules_premium <- sort(rules_premium, by="confidence",decreasing = TRUE)
premium <- as.data.frame(inspect(rules_premium[1:10]))
premium <- premium[-2]
premium$rhs <- as.character(premium$rhs)
premium <- inner_join(premium,select(rank,rhs,rank),by="rhs")
write_csv(premium,"premium.csv")
rules_yogurt <- subset(rules, lhs %in% "YOGURT NOT MULTI-PACKS")
rules_yogurt <- sort(rules_yogurt, by="confidence",decreasing = TRUE)
yogurt <- as.data.frame(inspect(rules_yogurt[1:10]))
yogurt <- yogurt[-2]
yogurt$rhs <- as.character(yogurt$rhs)
yogurt <- inner_join(yogurt,select(rank,rhs,rank),by="rhs")
write_csv(yogurt,"yogurt.csv")
#note: data frames `entree`, `premium`, and `yogurt` evaluate correctly in R environment but return a null data frame when knitting, so I saved the files locally and set `eval = FALSE` in their respective chunk options
entree <- read_csv("entree.csv")
premium <- read_csv("premium.csv")
yogurt <- read_csv("yogurt.csv")

kable(entree[-1], format = "html", caption = "Table H: Frozen Entree Rules",
      col.names = c("RHS","Support","Confidence","Lift","Count","Rank"))
Table H: Frozen Entree Rules
RHS                              | Support   | Confidence | Lift     | Count | Rank
{FLUID MILK WHITE ONLY}          | 0.0087961 | 0.4414594  | 1.988455 | 2432  | 1
{BANANAS}                        | 0.0062390 | 0.3131240  | 2.854791 | 1725  | 2
{YOGURT NOT MULTI-PACKS}         | 0.0040906 | 0.2053004  | 4.396460 | 1131  | 16
{SHREDDED CHEESE}                | 0.0038302 | 0.1922309  | 2.525140 | 1059  | 7
{SOFT DRINKS 12/18&15PK CAN CAR} | 0.0036675 | 0.1840624  | 1.982728 | 1014  | 4
{MAINSTREAM WHEAT/MULTIGRAIN BR} | 0.0028790 | 0.1444908  | 2.844609 | 796   | 13
{MAINSTREAM WHITE BREAD}         | 0.0028609 | 0.1435832  | 1.478759 | 791   | 3
{DAIRY CASE 100% PURE JUICE - O} | 0.0027524 | 0.1381376  | 2.353387 | 761   | 10
{POTATO CHIPS}                   | 0.0026475 | 0.1328735  | 2.022776 | 732   | 8
{FRZN SS PREMIUM ENTREES/DNRS/T} | 0.0026403 | 0.1325104  | 5.729947 | 730   | 67
kable(premium[-1], format = "html", caption = "Table I: Ice Cream Rules",
      col.names = c("RHS","Support","Confidence","Lift","Count","Rank"))
Table I: Ice Cream Rules
RHS                              | Support   | Confidence | Lift     | Count | Rank
{FLUID MILK WHITE ONLY}          | 0.0235600 | 0.4536212  | 2.043235 | 6514  | 1
{BANANAS}                        | 0.0140694 | 0.2708914  | 2.469751 | 3890  | 2
{MAINSTREAM WHITE BREAD}         | 0.0108432 | 0.2087744  | 2.150160 | 2998  | 3
{SHREDDED CHEESE}                | 0.0106986 | 0.2059889  | 2.705865 | 2958  | 7
{SOFT DRINKS 12/18&15PK CAN CAR} | 0.0102030 | 0.1964485  | 2.116151 | 2821  | 4
{POTATO CHIPS}                   | 0.0076170 | 0.1466574  | 2.232613 | 2106  | 8
{DAIRY CASE 100% PURE JUICE - O} | 0.0074253 | 0.1429666  | 2.435656 | 2053  | 10
{EGGS - LARGE}                   | 0.0064596 | 0.1243733  | 2.598418 | 1786  | 14
{PRIMAL}                         | 0.0063258 | 0.1217967  | 2.701787 | 1749  | 19
{SFT DRNK 2 LITER BTL CARB INCL} | 0.0062896 | 0.1211003  | 1.426488 | 1739  | 6
kable(yogurt[-1], format = "html", caption = "Table J: Yogurt Rules",
      col.names = c("RHS","Support","Confidence","Lift","Count","Rank"))
Table J: Yogurt Rules
RHS                              | Support   | Confidence | Lift     | Count | Rank
{FLUID MILK WHITE ONLY}          | 0.0244171 | 0.5228875  | 2.355230 | 6751  | 1
{BANANAS}                        | 0.0175814 | 0.3765007  | 3.432604 | 4861  | 2
{SHREDDED CHEESE}                | 0.0104598 | 0.2239950  | 2.942393 | 2892  | 7
{DAIRY CASE 100% PURE JUICE - O} | 0.0089046 | 0.1906901  | 3.248700 | 2462  | 10
{SOFT DRINKS 12/18&15PK CAN CAR} | 0.0078485 | 0.1680737  | 1.810497 | 2170  | 4
{MAINSTREAM WHITE BREAD}         | 0.0077291 | 0.1655178  | 1.704662 | 2137  | 3
{MAINSTREAM WHEAT/MULTIGRAIN BR} | 0.0069551 | 0.1489428  | 2.932255 | 1923  | 13
{EGGS - LARGE}                   | 0.0068575 | 0.1468515  | 3.068036 | 1896  | 14
{PREMIUM}                        | 0.0060365 | 0.1292696  | 2.488944 | 1669  | 12
{COTTAGE CHEESE}                 | 0.0056495 | 0.1209821  | 4.213890 | 1562  | 43

The variable rank, which gives each product’s relative rank among the store’s most frequently purchased products, was merged with the rule output and matched to the RHS. From these three tables, we can now see how likely a customer is to purchase a given RHS product, along with that product’s overall sales rank. A table like this could help determine what else a customer might buy if they redeem a coupon for an LHS product. Or, if the company wants to push a couple of poorly performing products, instead of discounting both it could look at the probability of a customer buying the RHS product given that an LHS product is already in their basket.
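
If the goal were specifically to push a poorly performing product, the same rule set could be filtered the other way, with the target on the RHS. A sketch, using cottage cheese purely as a hypothetical target:

#find which basket items most strongly predict purchase of the target product
rules_target <- subset(rules, rhs %in% "COTTAGE CHEESE")
inspect(sort(rules_target, by = "confidence", decreasing = TRUE)[1:10])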