Introduction

The goal of this analysis is to find an area of demographics where coupon redemptions are lacking for the grocery chain Regork. The hypothesis is the more coupons used the more revenue for Regork. This analysis examines the purchasing habits and coupon redemption behaviors of the 55-64 age group, with a primary focus on the 35-49K income bracket. A tailored marketing approach can enhance the probability of coupon redemptions. Such targeted strategies result in elevated customer satisfaction and increased coupon redemptions, which will lead to increased revenue. The analysis highlights an underutilized demographic that can be marketed to using campaigns and advertising. The analysis also gives a great starting point for future growth: to target all underutilized coupon redemption groups.

Packages Used

```r
library(completejourney)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(lubridate)
```

Package Descriptions

completejourney: Provides datasets from a hypothetical retail scenario.

tidyverse: A collection of R packages designed for data science. Within this analysis:

  • dplyr: Used for data manipulation, enabling operations such as filtering, summarizing, and joining datasets.

  • tidyr: Assists in tidying data, making it more organized for analysis. Offers functions to change data structures, fill missing values, and separate or combine columns.

ggplot2: A tool for creating detailed visualizations.

lubridate: Simplifies working with dates and times in R, making it more straightforward to manipulate date-time data.

Data Preparation

transactions <- get_transactions()
transactions
## # A tibble: 1,469,307 × 11
##    household_id store_id basket_id   product_id quantity sales_value retail_disc
##    <chr>        <chr>    <chr>       <chr>         <dbl>       <dbl>       <dbl>
##  1 900          330      31198570044 1095275           1        0.5         0   
##  2 900          330      31198570047 9878513           1        0.99        0.1 
##  3 1228         406      31198655051 1041453           1        1.43        0.15
##  4 906          319      31198705046 1020156           1        1.5         0.29
##  5 906          319      31198705046 1053875           2        2.78        0.8 
##  6 906          319      31198705046 1060312           1        5.49        0.5 
##  7 906          319      31198705046 1075313           1        1.5         0.29
##  8 1058         381      31198676055 985893            1        1.88        0.21
##  9 1058         381      31198676055 988791            1        1.5         1.29
## 10 1058         381      31198676055 9297106           1        2.69        0   
## # ℹ 1,469,297 more rows
## # ℹ 4 more variables: coupon_disc <dbl>, coupon_match_disc <dbl>, week <int>,
## #   transaction_timestamp <dttm>
promotions <- get_promotions()
promotions
## # A tibble: 20,940,529 × 5
##    product_id store_id display_location mailer_location  week
##    <chr>      <chr>    <fct>            <fct>           <int>
##  1 1000050    316      9                0                   1
##  2 1000050    337      3                0                   1
##  3 1000050    441      5                0                   1
##  4 1000092    292      0                A                   1
##  5 1000092    293      0                A                   1
##  6 1000092    295      0                A                   1
##  7 1000092    298      0                A                   1
##  8 1000092    299      0                A                   1
##  9 1000092    304      0                A                   1
## 10 1000092    306      0                A                   1
## # ℹ 20,940,519 more rows
coupons
## # A tibble: 116,204 × 3
##    coupon_upc  product_id campaign_id
##    <chr>       <chr>      <chr>      
##  1 10000085207 9676830    26         
##  2 10000085207 9676943    26         
##  3 10000085207 9676944    26         
##  4 10000085207 9676947    26         
##  5 10000085207 9677008    26         
##  6 10000085207 9677052    26         
##  7 10000085207 9677385    26         
##  8 10000085207 9677479    26         
##  9 10000085207 9677791    26         
## 10 10000085207 9677878    26         
## # ℹ 116,194 more rows
campaigns
## # A tibble: 6,589 × 2
##    campaign_id household_id
##    <chr>       <chr>       
##  1 1           105         
##  2 1           1238        
##  3 1           1258        
##  4 1           1483        
##  5 1           2200        
##  6 1           293         
##  7 1           529         
##  8 1           536         
##  9 1           568         
## 10 1           630         
## # ℹ 6,579 more rows
demographics
## # A tibble: 801 × 8
##    household_id age   income    home_ownership marital_status household_size
##    <chr>        <ord> <ord>     <ord>          <ord>          <ord>         
##  1 1            65+   35-49K    Homeowner      Married        2             
##  2 1001         45-54 50-74K    Homeowner      Unmarried      1             
##  3 1003         35-44 25-34K    <NA>           Unmarried      1             
##  4 1004         25-34 15-24K    <NA>           Unmarried      1             
##  5 101          45-54 Under 15K Homeowner      Married        4             
##  6 1012         35-44 35-49K    <NA>           Married        5+            
##  7 1014         45-54 15-24K    <NA>           Married        4             
##  8 1015         45-54 50-74K    Homeowner      Unmarried      1             
##  9 1018         45-54 35-49K    Homeowner      Married        5+            
## 10 1020         45-54 25-34K    Homeowner      Married        2             
## # ℹ 791 more rows
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>
products
## # A tibble: 92,331 × 7
##    product_id manufacturer_id department    brand  product_category product_type
##    <chr>      <chr>           <chr>         <fct>  <chr>            <chr>       
##  1 25671      2               GROCERY       Natio… FRZN ICE         ICE - CRUSH…
##  2 26081      2               MISCELLANEOUS Natio… <NA>             <NA>        
##  3 26093      69              PASTRY        Priva… BREAD            BREAD:ITALI…
##  4 26190      69              GROCERY       Priva… FRUIT - SHELF S… APPLE SAUCE 
##  5 26355      69              GROCERY       Priva… COOKIES/CONES    SPECIALTY C…
##  6 26426      69              GROCERY       Priva… SPICES & EXTRAC… SPICES & SE…
##  7 26540      69              GROCERY       Priva… COOKIES/CONES    TRAY PACK/C…
##  8 26601      69              DRUG GM       Priva… VITAMINS         VITAMIN - M…
##  9 26636      69              PASTRY        Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691      16              GROCERY       Priva… PNT BTR/JELLY/J… HONEY       
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>
campaign_descriptions
## # A tibble: 27 × 4
##    campaign_id campaign_type start_date end_date  
##    <chr>       <ord>         <date>     <date>    
##  1 1           Type B        2017-03-03 2017-04-09
##  2 2           Type B        2017-03-08 2017-04-09
##  3 3           Type C        2017-03-13 2017-05-08
##  4 4           Type B        2017-03-29 2017-04-30
##  5 5           Type B        2017-04-03 2017-05-07
##  6 6           Type C        2017-04-19 2017-05-21
##  7 7           Type B        2017-04-24 2017-05-28
##  8 8           Type A        2017-05-08 2017-06-25
##  9 9           Type B        2017-05-31 2017-07-02
## 10 10          Type B        2017-06-28 2017-07-30
## # ℹ 17 more rows
coupon_redemptions
## # A tibble: 2,102 × 4
##    household_id coupon_upc  campaign_id redemption_date
##    <chr>        <chr>       <chr>       <date>         
##  1 1029         51380041013 26          2017-01-01     
##  2 1029         51380041313 26          2017-01-01     
##  3 165          53377610033 26          2017-01-03     
##  4 712          51380041013 26          2017-01-07     
##  5 712          54300016033 26          2017-01-07     
##  6 2488         51200092776 26          2017-01-10     
##  7 2488         51410010050 26          2017-01-10     
##  8 1923         53000012033 26          2017-01-14     
##  9 1923         54300021057 26          2017-01-14     
## 10 1923         57047091041 26          2017-01-14     
## # ℹ 2,092 more rows

Data Frames

# Relevant Data
data_merged <- transactions %>%
  left_join(coupon_redemptions, by = "household_id") %>%
  left_join(demographics, by = "household_id") %>%
  filter(!is.na(age), !is.na(income))  # Filter out NA values for age and income

#demographics for number of households per age
age_distribution_households <- demographics %>%
  group_by(age) %>%
  summarize(total_count = n())


# Age distribution
age_distribution <- data_merged %>%
  group_by(age) %>%
  summarize(total_count = n())

# Distribution of income groups
income_distribution <- demographics %>%
  group_by(income) %>%
  summarize(total_count = n())

# Analyzing demographics of coupon redeemers
coupon_demographics <- data_merged %>%
  filter(!is.na(coupon_upc)) %>%
  group_by(age, income, home_ownership, marital_status, household_size, kids_count) %>%
  summarize(total_redeemed = n_distinct(household_id))


# Compare with overall demographics
overall_demographics <- demographics %>%
  filter(!is.na(age), !is.na(income)) %>%
  group_by(age, income, home_ownership, marital_status, household_size, kids_count) %>%
  summarize(total_households = n())


# Analyzing demographics of coupon redeemers
coupon_demographics2 <- data_merged %>%
  filter(!is.na(coupon_upc)) %>%
  group_by(age, income, home_ownership, marital_status, household_size, kids_count) %>%
  summarize(total_redeemed = n())

# Correlation of Demographics and redemptions
correlation_data <- left_join(coupon_demographics, overall_demographics, by = c("age", "income", "home_ownership", "marital_status", "household_size", "kids_count"))
correlation_data <- correlation_data %>%
  mutate(redemption_rate = (total_redeemed / total_households) * 100)

# Correlation of Demographics and redemptions for plot: age and income (there are two because I was having trouble with n_distinct() for this section)
correlation_data2 <- left_join(coupon_demographics2, overall_demographics, by = c("age", "income", "home_ownership", "marital_status", "household_size", "kids_count"))
correlation_data2 <- correlation_data2 %>%
  mutate(redemption_rate = (total_redeemed / total_households) * 100)

# Finding top products with redeemed coupons
top_products <- data_merged %>%
  filter(!is.na(coupon_upc)) %>%
  group_by(product_id) %>%
  summarize(total_redeemed = n()) %>%
  arrange(-total_redeemed)

#redemption rates by campaign
campaign_redemptions <- data_merged %>%
  filter(!is.na(coupon_upc)) %>%
  group_by(campaign_id) %>%
  summarize(total_redeemed = n())

  1. Initial Data Examination

1.1 Demographics Overview

The analysis first looks at basic demographic data to later compare to coupon redemption rates.

In the plot below the key things to note are the ages of interest, 55-64 years old. More specifically, notice the difference between the 25-34, 35-44, and 45-55 age groups when compared to the 55-64 age group. This is a difference of 83, 135, and 229 number of households respectively.

# Plotting the distribution of age groups

ggplot(age_distribution_households, aes(x=age, y=total_count)) +
  geom_bar(stat="identity", fill="#ff9999") +
  theme_minimal() +
  labs(title="Number of housholds by Age Groups", 
       y="Number of Households", 
       x="Age Group") +
  geom_text(aes(label=total_count), vjust=-0.5)

Before we compare the previous plot to any data, it is important to look at basic demographic income data in order to understand what income brackets fall short in coupon redemption rates.

Notice on the below plot, the largest group of households is 50-74k,followed by 35-49k. It is expected that the coupon redemption rate would be proportional to the number of households in both age and income ranges. In other words, it would be expected that these income brackets will have the highest redemptions, when compared to other income brackets, because they have more people to use coupons. This idea applies to the age distribution as well. It will be seen in the next few plots that this is not the case and implies a variable that is affecting their coupon usage per demographic.

# Plotting the distribution of income groups
ggplot(income_distribution, aes(x=income, y=total_count)) +
  geom_bar(stat="identity", fill="#ff9999") +
  theme_minimal() +
  labs(title="Number of household by Income Groups", 
       y="Number of Households", 
       x="Income Bracket") +
  geom_text(aes(label=total_count), vjust=-0.5) +
  theme(axis.text.x = element_text(angle=45, hjust=1))

1.2 Impact of Coupons on Spending

Next, by comparing transaction values based on coupon redemption, we determine the monetary impact of coupons. This plot indicates that increased coupon redemption translates to higher transaction values. This supports the hypothesis that if an area where coupons are not being used, are then marketed to (under the assumption coupons are used more often because of campaigns/advertising) Regork can make more money.

# Aggregate sales value by whether a coupon was redeemed or not
sales_impact <- transactions %>%
  group_by(coupon_redeemed = (coupon_disc != 0)) %>%
  summarise(
    total_sales_value = sum(sales_value),
    num_transactions = n(),
    avg_transaction_value = total_sales_value / num_transactions
  )

# Plotting the average transaction value with vs without coupon redemption
ggplot(sales_impact, aes(x=factor(coupon_redeemed), y=avg_transaction_value)) +
  geom_bar(stat="identity", fill=c("#ff9999", "#69b3a2")) +
  theme_minimal() +
  labs(
    title="Average Transaction Value: With vs. Without Coupons",
    x="Coupon Redeemed",
    y="Average Transaction Value"
  ) +
  scale_x_discrete(labels=c("Without Coupon", "With Coupon"))

  1. Analysis and Findings

2.1 Age & Income vs. Coupon Redemption

Note: The redemption rate is used, as opposed to the total number of coupon redemptions, because it offers a more normalized comparison.

Now the analysis starts. First, looking at coupon redemption rate of different age groups, one can see the redemption rate is proportional to the number of households in this age group.

ggplot(correlation_data, aes(x = age, y = redemption_rate)) +
  geom_bar(stat="identity", fill="#69b3a2") +
  labs(title = "Redemption Rate by Age", x = "Age", y = "Redemption Rate (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

Once again we see a similar proportionality between the number of households and coupon redemption rate. However, when exploring the converging demographics of age and income, a new story is seen.

ggplot(correlation_data, aes(x = income, y = redemption_rate)) +
  geom_bar(stat="identity", fill="#69b3a2") +
  labs(title = "Redemption rate by Income", x = "Income Bracket", y = "Redemption Rate (%)") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

This plot is integral to the analysis. Looking at the age group of 55-64, a discrepancy between the redemption rate for income of 50-74k and the redemption rate for just the age group is seen. This was the data that jumpstarted the analysis. Why is this income group, within an age group that does not use coupons, have the highest redemption rate? Well initially, we could assume a similar methodology as the previous plots, that the 50-74k has the greatest number of people and therefore the redemption rate is higher. However, looking at the demographic data of number households per income bracket in the 55-64 age group says otherwise.

ggplot(correlation_data2, aes(x = age, y = redemption_rate, fill = income)) +
  geom_bar(stat = "identity", position = "dodge") +  
  labs(title = "Coupon Redemption Rate by Age and Income", 
       x = "Age Group", 
       y = "Coupon Redemption Rate (%)", 
       fill = "Income Bracket") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  ylim(0, 1500000)

2.2 Deep Dive: Age Group 55-64

Notice the largest group is the income range 35-49k, followed by, the expected, 50-74k income range. Well, looking back at the above plot we can see 35-49k is the 6th smallest for coupon redemption.Therefore, we do not see an increase in coupon redemptions due to number of households. Therefore, we can assume another variable is acting on this demographic data.

# Filtering the data
household_age_income <- demographics %>%
  filter(age == "55-64") %>%
  group_by(income) %>%
  summarize(total_households = n()) %>%
  arrange(-total_households)

# Visualization
ggplot(household_age_income, aes(x=reorder(income, total_households), y=total_households)) +
  geom_bar(stat="identity", fill="#69b3a2") +
  theme_minimal() +
  labs(title="Total Households by Income for Age 55-64",
       x="Income Bracket", 
       y="Number of Households") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

2.3 Evaluating Campaign Types

Now that it is known that there is some variable affecting the coupon usage of 55-64 year olds, specifically within the income range of 35-49k, the analysis moves towards looking into the possible variables in hopes of finding an area for Regork to invest.

The first variable looked at is campaign types and their effect on the number of coupon redemptions. Far and away Type A is the best for generating coupon redemptions. However, this is fairly broad, so the analysis looks into income ranges for the age range of interest: 55-64.

# Merging the datasets
campaign_redemptions_merged <- campaign_redemptions %>%
  left_join(campaign_descriptions, by = "campaign_id")

# Visualization
ggplot(campaign_redemptions_merged, aes(x=reorder(campaign_type, -total_redeemed), y=total_redeemed)) +
  geom_bar(stat="identity", fill="#ff9999") +
  theme_minimal() +
  labs(title="Impact of Campaign Types on Coupon Redemptions", 
       x="Campaign Type", 
       y="Number of Coupons Redeemed") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

This plot below gives a much more in depth look at this age range and the effect campaigns have on their coupon redemption numbers.The types of campaigns seem to mostly impact the 50-74k and 75-99k income ranges.Type A does alright for our income range of interest, 35-49k, but if the goal is to make the income range as successful for coupon redemptions as 50-74k is, then new campaigns should be created in order to affect these lower redemption, income brackets.

# Merge with campaign_descriptions for campaign_type information
merged_data <- transactions %>%
  left_join(demographics, by = "household_id") %>%
  left_join(campaigns) %>%
  left_join(campaign_descriptions, by = "campaign_id")

# Generate the age_income_campaigns data
age_income_campaigns <- merged_data %>%
  filter(age == "55-64") %>%
  group_by(income, campaign_type) %>%
  summarize(total_redeemed = sum(coupon_disc != 0)) %>%
  arrange(income, -total_redeemed)

# Visualization
ggplot(age_income_campaigns, aes(x=campaign_type, y=total_redeemed, fill=income)) +
  geom_bar(stat="identity", position="dodge") +
  theme_minimal() +
  labs(title="Effective Campaigns by Income for Age 55-64",
       x="Campaign Type", 
       y="Coupons Redeemed") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

2.4 Product Preferences & Demographics

Campaigns have now been shown to positively affect coupon redemptions. Following this it has been determined that some income brackets may not be targeted as well as others during these campaigns. In an effort to understand that more, the next stage of the analysis is to look at the top bought products for each income bracket, and then narrow the scope back on the age and income ranges of interest, 55-64 year olds within 35-49k.

The plot below shows the top bought products, by 55-64 year olds, throughout each income bracket. The data is generally hard to digest on this scale, so narrowing to the income bracket 35-49k can make it easier to understand what the campaigns are missing.

# Filtering, summarizing, and selecting the top products
top_10_products_55_64 <- merged_data %>%
  filter(age == "55-64") %>%
  group_by(product_id) %>%
  summarize(total_purchased = n()) %>%
  arrange(-total_purchased) %>%
  head(10) %>%
  left_join(products, by = "product_id")

# Getting income brackets for these top products
top_10_product_income <- merged_data %>%
  filter(product_id %in% top_10_products_55_64$product_id, income != "50-74K") %>%
  group_by(product_id, income) %>%
  summarize(total_purchased_income = n()) %>%
  left_join(products, by = "product_id")

# Visualization
ggplot(top_10_product_income, aes(x=reorder(product_type, -total_purchased_income), y=total_purchased_income, fill=income)) +
  geom_bar(stat="identity", position="dodge") +
  theme_minimal() +
  labs(title="Top Products Purchased by 55-64 Year Olds across Income Brackets",
       x="Product Type", 
       y="Number of Products Purchased") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

The visualization below shows the top products purchased by 55-64 year olds with an income of 35-49k. By using this, the analysis is able to then compare what types of products are purchased during campaigns and see if this explains the lower coupon redemption for this income and age bracket.

merged_data <- transactions %>%
  left_join(demographics, by = "household_id") %>%
  left_join(campaigns)%>%
  left_join(campaign_descriptions, by = "campaign_id") %>%
  left_join(products, by = "product_id")


# Filtering for age 55-64 and income 35-49K, summarizing, and selecting the top products
top_10_products_55_64_income_35_49K <- merged_data %>%
  filter(age == "55-64", income == "35-49K") %>%
  group_by(product_id) %>%
  summarize(total_purchased = n()) %>%
  arrange(-total_purchased) %>%
  head(10) %>%
  left_join(products, by = "product_id")

# Visualization
ggplot(top_10_products_55_64_income_35_49K, aes(x=reorder(product_type, -total_purchased), y=total_purchased)) +
  geom_bar(stat="identity", fill="#69b3a2") +
  theme_minimal() +
  labs(title="Top Products Purchased by 55-64 Year Olds with Income 35-49K",
       x="Product Type", 
       y="Number of Products Purchased") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

Below is the plot of the most bought products during campaigns. It is easy to see that none of these products match the products of the age group and income bracket of 55-65 year olds within 35-49k. This seems to explain, to some degree, the lack of coupons redeemed by this group.

# Merge transactions with coupon_redemptions
merged_data <- transactions %>%
  left_join(coupon_redemptions, by = "household_id")

# Filter for purchases made during campaigns
campaign_purchases <- merged_data %>%
  filter(!is.na(campaign_id))

# Merging campaign purchases with the products dataset to get the product type
campaign_purchases_with_type <- campaign_purchases %>%
  left_join(products, by = "product_id")


# Group by product type and summarize
product_types_campaign <- products %>%
  filter(!is.na(product_type))%>%
  group_by(product_type) %>%
  summarize(total_purchased = n()) %>%
  arrange(-total_purchased) %>%
  head(10)  # get Top product types

# Visualization
ggplot(product_types_campaign, aes(x=reorder(product_type, -total_purchased), y=total_purchased)) +
  geom_bar(stat="identity", fill="#69b3a2") +
  theme_minimal() +
  labs(title="Top Product Types Purchased During Campaigns",
       x="Product Type", 
       y="Number of Products Purchased") +
  theme(axis.text.x = element_text(angle=45, hjust=1))

Summary

The purchasing behaviors and coupon redemption trends within the 55-64 age group in the 35-49K income bracket present untapped potential for Regork. While the volume of households in this demographic is substantial, their coupon redemption rates lag behind, hinting at a missed engagement opportunity. The positive correlation between coupon usage and higher transaction values highlights the importance of tapping into this potential. By tailoring marketing strategies and focusing on products that resonate with the 55-64 age group in the specified income bracket, businesses can expect to see a marked increase in coupon redemptions, leading to increased sales and enhanced customer loyalty. The data highlights the need for targeted, relevant campaigns to capitalize on underutilized demographics. The analysis can be done for every age and income group that is underutilized. For future growth opportunities, or better analysis, one could use a similar strategy of seeking the campaign types shortcomings for every group. Furthermore, future studies could look into promotions or do a basket analysis as well in order to determine the full scope of these underutilized groups interests.