The goal of this analysis is to find an area of demographics where coupon redemptions are lacking for the grocery chain Regork. The hypothesis is the more coupons used the more revenue for Regork. This analysis examines the purchasing habits and coupon redemption behaviors of the 55-64 age group, with a primary focus on the 35-49K income bracket. A tailored marketing approach can enhance the probability of coupon redemptions. Such targeted strategies result in elevated customer satisfaction and increased coupon redemptions, which will lead to increased revenue. The analysis highlights an underutilized demographic that can be marketed to using campaigns and advertising. The analysis also gives a great starting point for future growth: to target all underutilized coupon redemption groups.
```r
library(completejourney)
library(dplyr)
library(tidyverse)
library(ggplot2)
library(lubridate)
```
completejourney: Provides datasets from a hypothetical retail scenario.
tidyverse: A collection of R packages designed for data science. Within this analysis:
dplyr: Used for data manipulation, enabling operations such as filtering, summarizing, and joining datasets.
tidyr: Assists in tidying data, making it more organized for analysis. Offers functions to change data structures, fill missing values, and separate or combine columns.
ggplot2: A tool for creating detailed visualizations.
lubridate: Simplifies working with dates and times in R, making it more straightforward to manipulate date-time data.
transactions <- get_transactions()
transactions
## # A tibble: 1,469,307 × 11
## household_id store_id basket_id product_id quantity sales_value retail_disc
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 900 330 31198570044 1095275 1 0.5 0
## 2 900 330 31198570047 9878513 1 0.99 0.1
## 3 1228 406 31198655051 1041453 1 1.43 0.15
## 4 906 319 31198705046 1020156 1 1.5 0.29
## 5 906 319 31198705046 1053875 2 2.78 0.8
## 6 906 319 31198705046 1060312 1 5.49 0.5
## 7 906 319 31198705046 1075313 1 1.5 0.29
## 8 1058 381 31198676055 985893 1 1.88 0.21
## 9 1058 381 31198676055 988791 1 1.5 1.29
## 10 1058 381 31198676055 9297106 1 2.69 0
## # ℹ 1,469,297 more rows
## # ℹ 4 more variables: coupon_disc <dbl>, coupon_match_disc <dbl>, week <int>,
## # transaction_timestamp <dttm>
promotions <- get_promotions()
promotions
## # A tibble: 20,940,529 × 5
## product_id store_id display_location mailer_location week
## <chr> <chr> <fct> <fct> <int>
## 1 1000050 316 9 0 1
## 2 1000050 337 3 0 1
## 3 1000050 441 5 0 1
## 4 1000092 292 0 A 1
## 5 1000092 293 0 A 1
## 6 1000092 295 0 A 1
## 7 1000092 298 0 A 1
## 8 1000092 299 0 A 1
## 9 1000092 304 0 A 1
## 10 1000092 306 0 A 1
## # ℹ 20,940,519 more rows
coupons
## # A tibble: 116,204 × 3
## coupon_upc product_id campaign_id
## <chr> <chr> <chr>
## 1 10000085207 9676830 26
## 2 10000085207 9676943 26
## 3 10000085207 9676944 26
## 4 10000085207 9676947 26
## 5 10000085207 9677008 26
## 6 10000085207 9677052 26
## 7 10000085207 9677385 26
## 8 10000085207 9677479 26
## 9 10000085207 9677791 26
## 10 10000085207 9677878 26
## # ℹ 116,194 more rows
campaigns
## # A tibble: 6,589 × 2
## campaign_id household_id
## <chr> <chr>
## 1 1 105
## 2 1 1238
## 3 1 1258
## 4 1 1483
## 5 1 2200
## 6 1 293
## 7 1 529
## 8 1 536
## 9 1 568
## 10 1 630
## # ℹ 6,579 more rows
demographics
## # A tibble: 801 × 8
## household_id age income home_ownership marital_status household_size
## <chr> <ord> <ord> <ord> <ord> <ord>
## 1 1 65+ 35-49K Homeowner Married 2
## 2 1001 45-54 50-74K Homeowner Unmarried 1
## 3 1003 35-44 25-34K <NA> Unmarried 1
## 4 1004 25-34 15-24K <NA> Unmarried 1
## 5 101 45-54 Under 15K Homeowner Married 4
## 6 1012 35-44 35-49K <NA> Married 5+
## 7 1014 45-54 15-24K <NA> Married 4
## 8 1015 45-54 50-74K Homeowner Unmarried 1
## 9 1018 45-54 35-49K Homeowner Married 5+
## 10 1020 45-54 25-34K Homeowner Married 2
## # ℹ 791 more rows
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>
products
## # A tibble: 92,331 × 7
## product_id manufacturer_id department brand product_category product_type
## <chr> <chr> <chr> <fct> <chr> <chr>
## 1 25671 2 GROCERY Natio… FRZN ICE ICE - CRUSH…
## 2 26081 2 MISCELLANEOUS Natio… <NA> <NA>
## 3 26093 69 PASTRY Priva… BREAD BREAD:ITALI…
## 4 26190 69 GROCERY Priva… FRUIT - SHELF S… APPLE SAUCE
## 5 26355 69 GROCERY Priva… COOKIES/CONES SPECIALTY C…
## 6 26426 69 GROCERY Priva… SPICES & EXTRAC… SPICES & SE…
## 7 26540 69 GROCERY Priva… COOKIES/CONES TRAY PACK/C…
## 8 26601 69 DRUG GM Priva… VITAMINS VITAMIN - M…
## 9 26636 69 PASTRY Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691 16 GROCERY Priva… PNT BTR/JELLY/J… HONEY
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>
campaign_descriptions
## # A tibble: 27 × 4
## campaign_id campaign_type start_date end_date
## <chr> <ord> <date> <date>
## 1 1 Type B 2017-03-03 2017-04-09
## 2 2 Type B 2017-03-08 2017-04-09
## 3 3 Type C 2017-03-13 2017-05-08
## 4 4 Type B 2017-03-29 2017-04-30
## 5 5 Type B 2017-04-03 2017-05-07
## 6 6 Type C 2017-04-19 2017-05-21
## 7 7 Type B 2017-04-24 2017-05-28
## 8 8 Type A 2017-05-08 2017-06-25
## 9 9 Type B 2017-05-31 2017-07-02
## 10 10 Type B 2017-06-28 2017-07-30
## # ℹ 17 more rows
coupon_redemptions
## # A tibble: 2,102 × 4
## household_id coupon_upc campaign_id redemption_date
## <chr> <chr> <chr> <date>
## 1 1029 51380041013 26 2017-01-01
## 2 1029 51380041313 26 2017-01-01
## 3 165 53377610033 26 2017-01-03
## 4 712 51380041013 26 2017-01-07
## 5 712 54300016033 26 2017-01-07
## 6 2488 51200092776 26 2017-01-10
## 7 2488 51410010050 26 2017-01-10
## 8 1923 53000012033 26 2017-01-14
## 9 1923 54300021057 26 2017-01-14
## 10 1923 57047091041 26 2017-01-14
## # ℹ 2,092 more rows
# Relevant Data
data_merged <- transactions %>%
left_join(coupon_redemptions, by = "household_id") %>%
left_join(demographics, by = "household_id") %>%
filter(!is.na(age), !is.na(income)) # Filter out NA values for age and income
#demographics for number of households per age
age_distribution_households <- demographics %>%
group_by(age) %>%
summarize(total_count = n())
# Age distribution
age_distribution <- data_merged %>%
group_by(age) %>%
summarize(total_count = n())
# Distribution of income groups
income_distribution <- demographics %>%
group_by(income) %>%
summarize(total_count = n())
# Analyzing demographics of coupon redeemers
coupon_demographics <- data_merged %>%
filter(!is.na(coupon_upc)) %>%
group_by(age, income, home_ownership, marital_status, household_size, kids_count) %>%
summarize(total_redeemed = n_distinct(household_id))
# Compare with overall demographics
overall_demographics <- demographics %>%
filter(!is.na(age), !is.na(income)) %>%
group_by(age, income, home_ownership, marital_status, household_size, kids_count) %>%
summarize(total_households = n())
# Analyzing demographics of coupon redeemers
coupon_demographics2 <- data_merged %>%
filter(!is.na(coupon_upc)) %>%
group_by(age, income, home_ownership, marital_status, household_size, kids_count) %>%
summarize(total_redeemed = n())
# Correlation of Demographics and redemptions
correlation_data <- left_join(coupon_demographics, overall_demographics, by = c("age", "income", "home_ownership", "marital_status", "household_size", "kids_count"))
correlation_data <- correlation_data %>%
mutate(redemption_rate = (total_redeemed / total_households) * 100)
# Correlation of Demographics and redemptions for plot: age and income (there are two because I was having trouble with n_distinct() for this section)
correlation_data2 <- left_join(coupon_demographics2, overall_demographics, by = c("age", "income", "home_ownership", "marital_status", "household_size", "kids_count"))
correlation_data2 <- correlation_data2 %>%
mutate(redemption_rate = (total_redeemed / total_households) * 100)
# Finding top products with redeemed coupons
top_products <- data_merged %>%
filter(!is.na(coupon_upc)) %>%
group_by(product_id) %>%
summarize(total_redeemed = n()) %>%
arrange(-total_redeemed)
#redemption rates by campaign
campaign_redemptions <- data_merged %>%
filter(!is.na(coupon_upc)) %>%
group_by(campaign_id) %>%
summarize(total_redeemed = n())
The analysis first looks at basic demographic data to later compare to coupon redemption rates.
In the plot below the key things to note are the ages of interest, 55-64 years old. More specifically, notice the difference between the 25-34, 35-44, and 45-55 age groups when compared to the 55-64 age group. This is a difference of 83, 135, and 229 number of households respectively.# Plotting the distribution of age groups
ggplot(age_distribution_households, aes(x=age, y=total_count)) +
geom_bar(stat="identity", fill="#ff9999") +
theme_minimal() +
labs(title="Number of housholds by Age Groups",
y="Number of Households",
x="Age Group") +
geom_text(aes(label=total_count), vjust=-0.5)
Before we compare the previous plot to any data, it is important to look at basic demographic income data in order to understand what income brackets fall short in coupon redemption rates.
Notice on the below plot, the largest group of households is 50-74k,followed by 35-49k. It is expected that the coupon redemption rate would be proportional to the number of households in both age and income ranges. In other words, it would be expected that these income brackets will have the highest redemptions, when compared to other income brackets, because they have more people to use coupons. This idea applies to the age distribution as well. It will be seen in the next few plots that this is not the case and implies a variable that is affecting their coupon usage per demographic.# Plotting the distribution of income groups
ggplot(income_distribution, aes(x=income, y=total_count)) +
geom_bar(stat="identity", fill="#ff9999") +
theme_minimal() +
labs(title="Number of household by Income Groups",
y="Number of Households",
x="Income Bracket") +
geom_text(aes(label=total_count), vjust=-0.5) +
theme(axis.text.x = element_text(angle=45, hjust=1))
Next, by comparing transaction values based on coupon redemption, we determine the monetary impact of coupons. This plot indicates that increased coupon redemption translates to higher transaction values. This supports the hypothesis that if an area where coupons are not being used, are then marketed to (under the assumption coupons are used more often because of campaigns/advertising) Regork can make more money.
# Aggregate sales value by whether a coupon was redeemed or not
sales_impact <- transactions %>%
group_by(coupon_redeemed = (coupon_disc != 0)) %>%
summarise(
total_sales_value = sum(sales_value),
num_transactions = n(),
avg_transaction_value = total_sales_value / num_transactions
)
# Plotting the average transaction value with vs without coupon redemption
ggplot(sales_impact, aes(x=factor(coupon_redeemed), y=avg_transaction_value)) +
geom_bar(stat="identity", fill=c("#ff9999", "#69b3a2")) +
theme_minimal() +
labs(
title="Average Transaction Value: With vs. Without Coupons",
x="Coupon Redeemed",
y="Average Transaction Value"
) +
scale_x_discrete(labels=c("Without Coupon", "With Coupon"))
Note: The redemption rate is used, as opposed to the total number of coupon redemptions, because it offers a more normalized comparison.
Now the analysis starts. First, looking at coupon redemption rate of different age groups, one can see the redemption rate is proportional to the number of households in this age group.ggplot(correlation_data, aes(x = age, y = redemption_rate)) +
geom_bar(stat="identity", fill="#69b3a2") +
labs(title = "Redemption Rate by Age", x = "Age", y = "Redemption Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
Once again we see a similar proportionality between the number of households and coupon redemption rate. However, when exploring the converging demographics of age and income, a new story is seen.
ggplot(correlation_data, aes(x = income, y = redemption_rate)) +
geom_bar(stat="identity", fill="#69b3a2") +
labs(title = "Redemption rate by Income", x = "Income Bracket", y = "Redemption Rate (%)") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
This plot is integral to the analysis. Looking at the age group of 55-64, a discrepancy between the redemption rate for income of 50-74k and the redemption rate for just the age group is seen. This was the data that jumpstarted the analysis. Why is this income group, within an age group that does not use coupons, have the highest redemption rate? Well initially, we could assume a similar methodology as the previous plots, that the 50-74k has the greatest number of people and therefore the redemption rate is higher. However, looking at the demographic data of number households per income bracket in the 55-64 age group says otherwise.
ggplot(correlation_data2, aes(x = age, y = redemption_rate, fill = income)) +
geom_bar(stat = "identity", position = "dodge") +
labs(title = "Coupon Redemption Rate by Age and Income",
x = "Age Group",
y = "Coupon Redemption Rate (%)",
fill = "Income Bracket") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
ylim(0, 1500000)
Notice the largest group is the income range 35-49k, followed by, the expected, 50-74k income range. Well, looking back at the above plot we can see 35-49k is the 6th smallest for coupon redemption.Therefore, we do not see an increase in coupon redemptions due to number of households. Therefore, we can assume another variable is acting on this demographic data.
# Filtering the data
household_age_income <- demographics %>%
filter(age == "55-64") %>%
group_by(income) %>%
summarize(total_households = n()) %>%
arrange(-total_households)
# Visualization
ggplot(household_age_income, aes(x=reorder(income, total_households), y=total_households)) +
geom_bar(stat="identity", fill="#69b3a2") +
theme_minimal() +
labs(title="Total Households by Income for Age 55-64",
x="Income Bracket",
y="Number of Households") +
theme(axis.text.x = element_text(angle=45, hjust=1))
Now that it is known that there is some variable affecting the coupon usage of 55-64 year olds, specifically within the income range of 35-49k, the analysis moves towards looking into the possible variables in hopes of finding an area for Regork to invest.
The first variable looked at is campaign types and their effect on the number of coupon redemptions. Far and away Type A is the best for generating coupon redemptions. However, this is fairly broad, so the analysis looks into income ranges for the age range of interest: 55-64.# Merging the datasets
campaign_redemptions_merged <- campaign_redemptions %>%
left_join(campaign_descriptions, by = "campaign_id")
# Visualization
ggplot(campaign_redemptions_merged, aes(x=reorder(campaign_type, -total_redeemed), y=total_redeemed)) +
geom_bar(stat="identity", fill="#ff9999") +
theme_minimal() +
labs(title="Impact of Campaign Types on Coupon Redemptions",
x="Campaign Type",
y="Number of Coupons Redeemed") +
theme(axis.text.x = element_text(angle=45, hjust=1))
This plot below gives a much more in depth look at this age range and the effect campaigns have on their coupon redemption numbers.The types of campaigns seem to mostly impact the 50-74k and 75-99k income ranges.Type A does alright for our income range of interest, 35-49k, but if the goal is to make the income range as successful for coupon redemptions as 50-74k is, then new campaigns should be created in order to affect these lower redemption, income brackets.
# Merge with campaign_descriptions for campaign_type information
merged_data <- transactions %>%
left_join(demographics, by = "household_id") %>%
left_join(campaigns) %>%
left_join(campaign_descriptions, by = "campaign_id")
# Generate the age_income_campaigns data
age_income_campaigns <- merged_data %>%
filter(age == "55-64") %>%
group_by(income, campaign_type) %>%
summarize(total_redeemed = sum(coupon_disc != 0)) %>%
arrange(income, -total_redeemed)
# Visualization
ggplot(age_income_campaigns, aes(x=campaign_type, y=total_redeemed, fill=income)) +
geom_bar(stat="identity", position="dodge") +
theme_minimal() +
labs(title="Effective Campaigns by Income for Age 55-64",
x="Campaign Type",
y="Coupons Redeemed") +
theme(axis.text.x = element_text(angle=45, hjust=1))
Campaigns have now been shown to positively affect coupon redemptions. Following this it has been determined that some income brackets may not be targeted as well as others during these campaigns. In an effort to understand that more, the next stage of the analysis is to look at the top bought products for each income bracket, and then narrow the scope back on the age and income ranges of interest, 55-64 year olds within 35-49k.
The plot below shows the top bought products, by 55-64 year olds, throughout each income bracket. The data is generally hard to digest on this scale, so narrowing to the income bracket 35-49k can make it easier to understand what the campaigns are missing.# Filtering, summarizing, and selecting the top products
top_10_products_55_64 <- merged_data %>%
filter(age == "55-64") %>%
group_by(product_id) %>%
summarize(total_purchased = n()) %>%
arrange(-total_purchased) %>%
head(10) %>%
left_join(products, by = "product_id")
# Getting income brackets for these top products
top_10_product_income <- merged_data %>%
filter(product_id %in% top_10_products_55_64$product_id, income != "50-74K") %>%
group_by(product_id, income) %>%
summarize(total_purchased_income = n()) %>%
left_join(products, by = "product_id")
# Visualization
ggplot(top_10_product_income, aes(x=reorder(product_type, -total_purchased_income), y=total_purchased_income, fill=income)) +
geom_bar(stat="identity", position="dodge") +
theme_minimal() +
labs(title="Top Products Purchased by 55-64 Year Olds across Income Brackets",
x="Product Type",
y="Number of Products Purchased") +
theme(axis.text.x = element_text(angle=45, hjust=1))
The visualization below shows the top products purchased by 55-64 year olds with an income of 35-49k. By using this, the analysis is able to then compare what types of products are purchased during campaigns and see if this explains the lower coupon redemption for this income and age bracket.
merged_data <- transactions %>%
left_join(demographics, by = "household_id") %>%
left_join(campaigns)%>%
left_join(campaign_descriptions, by = "campaign_id") %>%
left_join(products, by = "product_id")
# Filtering for age 55-64 and income 35-49K, summarizing, and selecting the top products
top_10_products_55_64_income_35_49K <- merged_data %>%
filter(age == "55-64", income == "35-49K") %>%
group_by(product_id) %>%
summarize(total_purchased = n()) %>%
arrange(-total_purchased) %>%
head(10) %>%
left_join(products, by = "product_id")
# Visualization
ggplot(top_10_products_55_64_income_35_49K, aes(x=reorder(product_type, -total_purchased), y=total_purchased)) +
geom_bar(stat="identity", fill="#69b3a2") +
theme_minimal() +
labs(title="Top Products Purchased by 55-64 Year Olds with Income 35-49K",
x="Product Type",
y="Number of Products Purchased") +
theme(axis.text.x = element_text(angle=45, hjust=1))
Below is the plot of the most bought products during campaigns. It is easy to see that none of these products match the products of the age group and income bracket of 55-65 year olds within 35-49k. This seems to explain, to some degree, the lack of coupons redeemed by this group.
# Merge transactions with coupon_redemptions
merged_data <- transactions %>%
left_join(coupon_redemptions, by = "household_id")
# Filter for purchases made during campaigns
campaign_purchases <- merged_data %>%
filter(!is.na(campaign_id))
# Merging campaign purchases with the products dataset to get the product type
campaign_purchases_with_type <- campaign_purchases %>%
left_join(products, by = "product_id")
# Group by product type and summarize
product_types_campaign <- products %>%
filter(!is.na(product_type))%>%
group_by(product_type) %>%
summarize(total_purchased = n()) %>%
arrange(-total_purchased) %>%
head(10) # get Top product types
# Visualization
ggplot(product_types_campaign, aes(x=reorder(product_type, -total_purchased), y=total_purchased)) +
geom_bar(stat="identity", fill="#69b3a2") +
theme_minimal() +
labs(title="Top Product Types Purchased During Campaigns",
x="Product Type",
y="Number of Products Purchased") +
theme(axis.text.x = element_text(angle=45, hjust=1))
The purchasing behaviors and coupon redemption trends within the 55-64 age group in the 35-49K income bracket present untapped potential for Regork. While the volume of households in this demographic is substantial, their coupon redemption rates lag behind, hinting at a missed engagement opportunity. The positive correlation between coupon usage and higher transaction values highlights the importance of tapping into this potential. By tailoring marketing strategies and focusing on products that resonate with the 55-64 age group in the specified income bracket, businesses can expect to see a marked increase in coupon redemptions, leading to increased sales and enhanced customer loyalty. The data highlights the need for targeted, relevant campaigns to capitalize on underutilized demographics. The analysis can be done for every age and income group that is underutilized. For future growth opportunities, or better analysis, one could use a similar strategy of seeking the campaign types shortcomings for every group. Furthermore, future studies could look into promotions or do a basket analysis as well in order to determine the full scope of these underutilized groups interests.