We have been tasked with identifying and analyzing potential growth opportunities where Regork can invest in order to increase revenue and profits. Companies like Regork are always looking to make an impact and to win the “hearts and minds” of the consumer. As a grocer, sometimes this can be difficult when you deal with commodities. How can you gain the attention of the consumer and get them into the store?
Do the targeted campaigns generate a greater amount of customer engagement?
Regork has much of the data to start with and paint a picture of the consumer. With this data, we can see that the targeted Type A campaigns have a good effect on the larger families of 5+, but not so much on the smaller families. Utilizing exploratory data analysis (EDA) techniques and visualizations will allow us to identify potential solutions for increasing customer engagement through campaigns, while penetrating larger household sizes.
The following R packages are required in order to run the code in this R project:
library(completejourney) # grocery store shopping transactions data from group of 2,469 households
library(corrplot) # visualization of a correlation matrix
library(dplyr) # manipulating and transforming data (i.e., filtering, joining, etc.)
library(forcats) # working with categorical variables or factors
library(GGally) # plotting system for R based on "Grammar of Graphics"; extension to "ggplot2"
library(ggrepel) # position and repel overlapping text labels with "ggplot2"
library(ggplot2) # data visualization plotting system using "Grammar of Graphics"
library(ggthemes) # additional plotting themes, scales, and geoms for "ggplot2"
library(grid) # grid graphics plotting functions and capabilities
library(gridExtra) # additional grid capability functions for "grid"
library(here) # quick way to locate R project files
library(jpeg) # reading and writing JPEG images
library(knitr) # dynamic report generation in R
library(lubridate) # functions used for working with dates and times
library(naniar) # find data quality issues; summarize, visualize, and manipulate missing data
library(png) # reading and writing PNG images
library(readr) # reading rectangular text data
library(stringr) # manipulation text
library(tidyr) # functions used for tidying or cleaning up messy data
library(tidyverse) # tidying data and working with other R packages
library(treemapify) # plotting treemaps in "ggplot2"
The data for this R project can be accessed from the CompleteJourney website. The CompleteJourney datasets are based on grocery shopping transactions from a group of 2,469 households. Entities such as demographics, products, coupons, campaigns, etc., was collected over a timeframe from January 2017 - December 2017.
The first task in the data preparation process is to get the full transactions and promotions datasets. Both datasets and the remaining CompleteJourney datasets were previewed in a table-like format.
# get the completejourney - transactions dataset
transactions <- get_transactions()
transactions
## # A tibble: 1,469,307 × 11
## house…¹ store…² baske…³ produ…⁴ quant…⁵ sales…⁶ retai…⁷ coupo…⁸ coupo…⁹ week
## <chr> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <int>
## 1 900 330 311985… 1095275 1 0.5 0 0 0 1
## 2 900 330 311985… 9878513 1 0.99 0.1 0 0 1
## 3 1228 406 311986… 1041453 1 1.43 0.15 0 0 1
## 4 906 319 311987… 1020156 1 1.5 0.29 0 0 1
## 5 906 319 311987… 1053875 2 2.78 0.8 0 0 1
## 6 906 319 311987… 1060312 1 5.49 0.5 0 0 1
## 7 906 319 311987… 1075313 1 1.5 0.29 0 0 1
## 8 1058 381 311986… 985893 1 1.88 0.21 0 0 1
## 9 1058 381 311986… 988791 1 1.5 1.29 0 0 1
## 10 1058 381 311986… 9297106 1 2.69 0 0 0 1
## # … with 1,469,297 more rows, 1 more variable: transaction_timestamp <dttm>,
## # and abbreviated variable names ¹household_id, ²store_id, ³basket_id,
## # ⁴product_id, ⁵quantity, ⁶sales_value, ⁷retail_disc, ⁸coupon_disc,
## # ⁹coupon_match_disc
# get the completejourney - promotions dataset
promotions <- get_promotions()
promotions
## # A tibble: 20,940,529 × 5
## product_id store_id display_location mailer_location week
## <chr> <chr> <fct> <fct> <int>
## 1 1000050 316 9 0 1
## 2 1000050 337 3 0 1
## 3 1000050 441 5 0 1
## 4 1000092 292 0 A 1
## 5 1000092 293 0 A 1
## 6 1000092 295 0 A 1
## 7 1000092 298 0 A 1
## 8 1000092 299 0 A 1
## 9 1000092 304 0 A 1
## 10 1000092 306 0 A 1
## # … with 20,940,519 more rows
campaigns
## # A tibble: 6,589 × 2
## campaign_id household_id
## <chr> <chr>
## 1 1 105
## 2 1 1238
## 3 1 1258
## 4 1 1483
## 5 1 2200
## 6 1 293
## 7 1 529
## 8 1 536
## 9 1 568
## 10 1 630
## # … with 6,579 more rows
campaign_descriptions
## # A tibble: 27 × 4
## campaign_id campaign_type start_date end_date
## <chr> <ord> <date> <date>
## 1 1 Type B 2017-03-03 2017-04-09
## 2 2 Type B 2017-03-08 2017-04-09
## 3 3 Type C 2017-03-13 2017-05-08
## 4 4 Type B 2017-03-29 2017-04-30
## 5 5 Type B 2017-04-03 2017-05-07
## 6 6 Type C 2017-04-19 2017-05-21
## 7 7 Type B 2017-04-24 2017-05-28
## 8 8 Type A 2017-05-08 2017-06-25
## 9 9 Type B 2017-05-31 2017-07-02
## 10 10 Type B 2017-06-28 2017-07-30
## # … with 17 more rows
coupons
## # A tibble: 116,204 × 3
## coupon_upc product_id campaign_id
## <chr> <chr> <chr>
## 1 10000085207 9676830 26
## 2 10000085207 9676943 26
## 3 10000085207 9676944 26
## 4 10000085207 9676947 26
## 5 10000085207 9677008 26
## 6 10000085207 9677052 26
## 7 10000085207 9677385 26
## 8 10000085207 9677479 26
## 9 10000085207 9677791 26
## 10 10000085207 9677878 26
## # … with 116,194 more rows
coupon_redemptions
## # A tibble: 2,102 × 4
## household_id coupon_upc campaign_id redemption_date
## <chr> <chr> <chr> <date>
## 1 1029 51380041013 26 2017-01-01
## 2 1029 51380041313 26 2017-01-01
## 3 165 53377610033 26 2017-01-03
## 4 712 51380041013 26 2017-01-07
## 5 712 54300016033 26 2017-01-07
## 6 2488 51200092776 26 2017-01-10
## 7 2488 51410010050 26 2017-01-10
## 8 1923 53000012033 26 2017-01-14
## 9 1923 54300021057 26 2017-01-14
## 10 1923 57047091041 26 2017-01-14
## # … with 2,092 more rows
demographics
## # A tibble: 801 × 8
## household_id age income home_ownership marital…¹ house…² house…³ kids_…⁴
## <chr> <ord> <ord> <ord> <ord> <ord> <ord> <ord>
## 1 1 65+ 35-49K Homeowner Married 2 2 Adul… 0
## 2 1001 45-54 50-74K Homeowner Unmarried 1 1 Adul… 0
## 3 1003 35-44 25-34K <NA> Unmarried 1 1 Adul… 0
## 4 1004 25-34 15-24K <NA> Unmarried 1 1 Adul… 0
## 5 101 45-54 Under 15K Homeowner Married 4 2 Adul… 2
## 6 1012 35-44 35-49K <NA> Married 5+ 2 Adul… 3+
## 7 1014 45-54 15-24K <NA> Married 4 2 Adul… 2
## 8 1015 45-54 50-74K Homeowner Unmarried 1 1 Adul… 0
## 9 1018 45-54 35-49K Homeowner Married 5+ 2 Adul… 3+
## 10 1020 45-54 25-34K Homeowner Married 2 2 Adul… 0
## # … with 791 more rows, and abbreviated variable names ¹marital_status,
## # ²household_size, ³household_comp, ⁴kids_count
products
## # A tibble: 92,331 × 7
## product_id manufacturer_id department brand product_c…¹ produ…² packa…³
## <chr> <chr> <chr> <fct> <chr> <chr> <chr>
## 1 25671 2 GROCERY National FRZN ICE ICE - … 22 LB
## 2 26081 2 MISCELLANEOUS National <NA> <NA> <NA>
## 3 26093 69 PASTRY Private BREAD BREAD:… <NA>
## 4 26190 69 GROCERY Private FRUIT - SH… APPLE … 50 OZ
## 5 26355 69 GROCERY Private COOKIES/CO… SPECIA… 14 OZ
## 6 26426 69 GROCERY Private SPICES & E… SPICES… 2.5 OZ
## 7 26540 69 GROCERY Private COOKIES/CO… TRAY P… 16 OZ
## 8 26601 69 DRUG GM Private VITAMINS VITAMI… 300 CT…
## 9 26636 69 PASTRY Private BREAKFAST … SW GDS… <NA>
## 10 26691 16 GROCERY Private PNT BTR/JE… HONEY 12 OZ
## # … with 92,321 more rows, and abbreviated variable names ¹product_category,
## # ²product_type, ³package_size
This section of the data preparation combines all the specified dataframes that are used with the visualizations in this R project. References to “na.rm = TRUE” have been incorporated to remove “NA” values from calculations.
filtered_trans <- demographics %>%
left_join(transactions) %>%
mutate(transaction_quarter = quarter(transaction_timestamp, type = "year.quarter"))
## Joining, by = "household_id"
full_view <- filtered_trans %>%
inner_join(products, key = "product_id") %>%
inner_join(coupon_redemptions) %>%
inner_join(coupons) %>%
inner_join(campaign_descriptions) %>%
inner_join(campaigns)
## Joining, by = "product_id"
## Joining, by = "household_id"
## Joining, by = c("product_id", "coupon_upc", "campaign_id")
## Joining, by = "campaign_id"
## Joining, by = c("household_id", "campaign_id")
savings_view <- full_view %>%
dplyr::mutate(coupon_savings = (coupon_disc + coupon_match_disc)) %>%
dplyr::mutate(customer_amount = (sales_value - (coupon_savings))) %>%
# dplyr::mutate(campaign_id = as.integer(campaign_id)) %>%
group_by(transaction_quarter, week, household_id, household_size, campaign_id, campaign_type, age, income, marital_status, quantity, sales_value) %>%
summarise(total_saving = sum(coupon_savings))
## `summarise()` has grouped output by 'transaction_quarter', 'week',
## 'household_id', 'household_size', 'campaign_id', 'campaign_type', 'age',
## 'income', 'marital_status', 'quantity'. You can override using the `.groups`
## argument.
spend_view <- full_view %>%
dplyr::mutate(coupon_savings = (coupon_disc + coupon_match_disc)) %>%
dplyr::mutate(customer_amount = (sales_value - (coupon_savings))) %>%
# dplyr::mutate(campaign_id = as.integer(campaign_id)) %>%
group_by(transaction_quarter, week, household_id, household_size, campaign_id, campaign_type, age, income, marital_status) %>%
summarise(total_spend = sum(customer_amount))
## `summarise()` has grouped output by 'transaction_quarter', 'week',
## 'household_id', 'household_size', 'campaign_id', 'campaign_type', 'age',
## 'income'. You can override using the `.groups` argument.
household_percent <- full_view %>%
count(campaign_type, household_size) %>%
group_by(household_size) %>%
mutate(pct = n / sum(n))
df1 <- coupon_redemptions %>%
inner_join(demographics) %>%
group_by(household_size)
## Joining, by = "household_id"
df1 %>% map_int(n_distinct)
## household_id coupon_upc campaign_id redemption_date age
## 295 440 24 237 6
## income home_ownership marital_status household_size household_comp
## 12 5 3 5 4
## kids_count
## 4
df6 <- transactions_sample %>%
inner_join(products) %>%
inner_join(demographics) %>%
group_by(household_size) %>%
summarize(total_sales = sum(sales_value), total_quantity = sum(quantity)) %>%
mutate(dollar_per_item = total_sales / total_quantity)
## Joining, by = "product_id"
## Joining, by = "household_id"
Demo_Coupon <- demographics %>%
inner_join(coupon_redemptions)
## Joining, by = "household_id"
Coupons_used <- coupons %>%
inner_join(coupon_redemptions)
## Joining, by = c("coupon_upc", "campaign_id")
Counted_Group <- Coupons_used %>%
count(coupon_upc)
Top_5_UPCs <- Counted_Group %>%
top_n(n = 5, wt = n)
Ordered_Top_5_UPCs <- arrange(Top_5_UPCs, desc(Top_5_UPCs$n))
Coupon_Joined <- Demo_Coupon %>%
semi_join(Ordered_Top_5_UPCs)
## Joining, by = "coupon_upc"
The Exploratory Data Analysis (EDA) section outlines the visualizations that are used in the R project to address the problem statement and/or business question. Formats such as bar charts, line charts, scatter plots, correlation matrix, etc., are as different styles for the visualizations.
The first step to analyzing the problem is to first assess the current state of our campaign strategy. We wanted to analyze which campaigns were being sent to which household size groups. Our first graph represents the number of coupons sent to each household size, subset by campaign type. This allowsed us to have a frameset for our problem and solution.
# campaigns received based on household size by campaign type
campaigns %>%
inner_join(demographics, by = "household_id") %>%
inner_join(campaign_descriptions, by = "campaign_id") %>%
group_by(household_id, campaign_id) %>%
ggplot(aes(x = household_size, fill = campaign_type)) +
geom_bar() +
labs(
fill = "Campaign Type",
x = "Household Size",
y = "Campaigns Received",
title = "Campaigns Received by Household Size",
subtitle = "Data represents the number of campaigns received based on campaign type by household size.\nThe top campaign type is A with its highest count being received within a household size of 2.",
caption = "Data Source R Package: completejourney"
) +
theme(
plot.subtitle = element_text(face = "italic"),
plot.caption = element_text(hjust = 1, size = 7),
legend.title = element_text(size = 9),
legend.text = element_text(size = 7)
)
The following are tables that display the counts of campaigns received based on household size and campaign type.
# count of campaigns received based on household size
campaigns %>%
inner_join(demographics, by = "household_id") %>%
inner_join(campaign_descriptions, by = "campaign_id") %>%
group_by(household_size) %>%
count(household_size)
## # A tibble: 5 × 2
## # Groups: household_size [5]
## household_size n
## <ord> <int>
## 1 1 1155
## 2 2 1482
## 3 3 576
## 4 4 300
## 5 5+ 322
# count of campaigns by campaign type
campaigns %>%
inner_join(demographics, by = "household_id") %>%
inner_join(campaign_descriptions, by = "campaign_id") %>%
group_by(campaign_type) %>%
count(campaign_type)
## # A tibble: 3 × 2
## # Groups: campaign_type [3]
## campaign_type n
## <ord> <int>
## 1 Type A 2007
## 2 Type B 1466
## 3 Type C 362
# count of campaigns received by household size based on campaign type
campaigns %>%
inner_join(demographics, by = "household_id") %>%
inner_join(campaign_descriptions, by = "campaign_id") %>%
group_by(household_size) %>%
count(campaign_type)
## # A tibble: 15 × 3
## # Groups: household_size [5]
## household_size campaign_type n
## <ord> <ord> <int>
## 1 1 Type A 647
## 2 1 Type B 403
## 3 1 Type C 105
## 4 2 Type A 802
## 5 2 Type B 558
## 6 2 Type C 122
## 7 3 Type A 280
## 8 3 Type B 228
## 9 3 Type C 68
## 10 4 Type A 126
## 11 4 Type B 139
## 12 4 Type C 35
## 13 5+ Type A 152
## 14 5+ Type B 138
## 15 5+ Type C 32
The next graph shows the average spending by household size, and we subset it by campaign type. We decided to take the average spending to make sure the data was not skewed and to help depict a more accurate graph showing the similarities and differences of spending for each campaign type. As we can see by the graph, each housing size has slightly different habits towards which campaign is redeemed the most.
# average spending by household size based on campaign type
transactions %>%
inner_join(demographics, by = "household_id") %>%
inner_join(campaigns) %>%
inner_join(campaign_descriptions) %>%
group_by(campaign_type, household_size) %>%
summarize(average_spending = mean(sales_value, na.rm = TRUE)) %>%
arrange(desc(average_spending), household_size) %>%
ggplot(aes(x = household_size, y = average_spending, color = campaign_type)) +
geom_point(size = 5) +
geom_text_repel(
aes(label = scales::dollar(round(average_spending, 2))),
color = "black"
) +
guides(color = guide_legend(title = "Campaign Type")) +
scale_y_continuous(name = "Average Spending", labels = scales::dollar) +
labs(
x = "Household Size",
y = "Average Spending",
title = "Average Spending by Household Size",
subtitle = "Data represents the average spending based on household size and by campaign type.",
caption = "Data Source R Package: completejourney"
) +
theme(
plot.subtitle = element_text(face = "italic"),
plot.caption = element_text(hjust = 1, size = 7),
legend.title = element_text(size = 9),
legend.text = element_text(size = 7)
)
## Joining, by = "household_id"
## Joining, by = "campaign_id"
## `summarise()` has grouped output by 'campaign_type'. You can override using the
## `.groups` argument.
We also wanted to confirm the seasonality of the campaigns to make sure there were no issues with timing of the coupons. Our seasonality graph shows that Type A is effective year-round, which is another positive factor showing that the targeted campaigns are effective.
# products sales per month by campaign type; top 10 coupons
products %>%
inner_join(transactions, by = "product_id") %>%
inner_join(coupons, by = "product_id") %>%
inner_join(campaigns) %>%
inner_join(campaign_descriptions) %>%
mutate(product_category = str_replace_all(product_category, pattern = "FRZN", replacement = "FROZEN")) %>%
mutate(sales_month = month(transaction_timestamp, label = TRUE)) %>%
group_by(sales_month, coupon_upc, campaign_type) %>%
summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%
arrange(desc(total_sales)) %>%
slice(1:10) %>%
ggplot(aes(x = sales_month, y = total_sales, color = campaign_type)) +
geom_jitter() +
guides(color = guide_legend(title = "Campaign Type")) +
scale_y_continuous(name = "Total Sales", labels = scales::dollar) +
labs(
x = "Month",
title = "Total Product Sales Per Month",
subtitle = "Data represents the total sales based on the top 10 coupons.\nCampaign Type A displays the highest product sales per month..",
caption = "Data Source R Package: completejourney"
) +
theme(
plot.subtitle = element_text(face = "italic"),
plot.caption = element_text(hjust = 1, size = 7),
legend.title = element_text(size = 9),
legend.text = element_text(size = 7)
)
## Joining, by = c("household_id", "campaign_id")
## Joining, by = "campaign_id"
## `summarise()` has grouped output by 'sales_month', 'coupon_upc'. You can
## override using the `.groups` argument.
# total sales by household size based on campaign type
first_join_df <- campaigns %>%
inner_join(demographics, by = "household_id") %>%
inner_join(campaign_descriptions, by = "campaign_id")
second_join_df <- first_join_df %>%
inner_join(transactions, by = "household_id") %>%
group_by(household_id, household_size, campaign_type) %>%
summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%
ggplot(aes(x = campaign_type, y = total_sales, color = household_size)) +
geom_point() +
guides(color = guide_legend(title = "Household Size")) +
scale_y_log10(name = "Total Sales", labels = scales::dollar) +
labs(
x = "Campaign Type",
y = "Total Sales",
title = "Total Sales by Campaign Type",
subtitle = "Data represents the total sales by campaign type based on the household size.",
caption = "Data Source R Package: completejourney"
) +
theme(
plot.subtitle = element_text(face = "italic"),
plot.caption = element_text(hjust = 1, size = 7),
legend.title = element_text(size = 9),
legend.text = element_text(size = 7)
)
## `summarise()` has grouped output by 'household_id', 'household_size'. You can
## override using the `.groups` argument.
second_join_df
## [1] "#7F3B08" "#B75C07" "#E68D25" "#FDC57E" "#FEEDD6" "#E9EAF3" "#BEBADA"
## [8] "#8B7FB4" "#582F8C" "#2D004B"
Our next two graphs show the spending and savings per campaign type. As
we can see by the graphs, Type A campaigns were prominently found in the
spending and savings, compared to Types B and C. We felt that these
graphs gave us a great depiction of overall usage of Type A compared to
Types B and C. (We can then tie into how we view Type A is the dominant
campaign being the targeted one and roll into solution on how to make
campaigns better?)
# update titles, subtitles, add captions, etc. by 10/5/2022
# spending by household size where quantity less than 40000 and sales value < 400
filtered_trans %>%
select(sales_value, quantity, household_size) %>%
filter(quantity < 40000, sales_value < 400) %>%
ggplot(aes(x = quantity, y = sales_value, color = household_size)) +
geom_point() +
scale_y_continuous(labels = scales::dollar) +
labs(title = "spend by household size", subtitle = "OMG")
# savings based on household size
savings_view %>%
ggplot(aes(x = quantity, y = sales_value, color = household_size)) +
geom_point() +
scale_y_continuous(labels = scales::dollar) +
labs(title = "spend by household size", subtitle = "OMG")
# update title, subtitle, add caption, etc. by 10/5/2022
# savings based on household size; quarterly; filtering outlier
savings_view %>%
filter(household_id != 1228) %>%
arrange(campaign_id) %>%
ggplot(aes(y = total_saving, x = transaction_quarter, fill = household_size)) +
geom_bar(position = 'dodge', stat = 'identity') +
scale_y_continuous(labels = scales::dollar) +
labs(title = "savings by household size", subtitle = "OMG")
# update title, subtitle, add caption, etc. by 10/5/2022
# savings based campaign type; by week
savings_view %>%
filter(household_id != 1228) %>%
arrange(campaign_id) %>%
ggplot(aes(y = total_saving, x = week, color = campaign_type)) +
geom_point(position = 'dodge', stat = 'identity') +
scale_y_continuous(labels = scales::dollar) +
labs(title = "savings by campaign", subtitle = "OMG")
## Warning: Width not defined. Set with `position_dodge(width = ?)`
# update title, subtitle, add caption, etc. by 10/5/2022
# spending based on household size; quarterly
spend_view %>%
filter(household_id != 1228) %>%
arrange(campaign_id) %>%
ggplot(aes(y = total_spend, x = transaction_quarter, fill = household_size)) +
geom_bar(position = 'dodge', stat = 'identity') +
scale_y_continuous(labels = scales::dollar) +
labs(title = "spend by household size", subtitle = "OMG")
# update title, subtitle, add caption, etc. by 10/5/2022
# spending by household size; filtering outlier
spend_view %>% #line view
filter(household_id != 1228) %>%
arrange(campaign_id) %>%
ggplot(aes(y = total_spend, x = week)) +
geom_line(data = spend_view, aes(week, total_spend, color = household_size)) +
scale_y_continuous(labels = scales::dollar) +
theme(legend.justification = 'centre',
legend.position = 'bottom',
legend.direction = "horizontal",
legend.key.height = unit(0.5, "cm"),
legend.key.width = unit(0.5,"cm")) +
labs(title = "spend by household size", subtitle = "OMG")
# update title, subtitle, add caption, etc. by 10/5/2022
# campaigns per household based on percentage
household_percent %>%
ggplot(aes(household_size, fill = campaign_type)) +
geom_col(aes(household_size, pct)) +
labs(title = "Campaigns per household (percentage)", subtitle = "subtitle") +
theme_wsj() +
theme(text = element_text(size = 7), axis.text = element_text(size = 7),
plot.title = element_text(size = 14), plot.subtitle = element_text(size = 9)) +
scale_y_continuous()
# update title, color, subtitle, add caption by 10/5/2022
# sales value per quantity by income based on household size
ggplot(df6, aes(household_size, dollar_per_item)) +
geom_point(stat = "identity", color = "red", size = 2) +
labs(title = "Total Sales Value per Quantity By Income",
subtitle = "Total sales value per quantity for all products purchased by each income group",
x = "Household Size",
y = "Total Sales Value per Quantity") +
coord_flip()
# total coupon redemption by age based on household size
ggplot(df1, aes(x = household_size, fill = household_size)) +
geom_bar(stat = "count") +
labs(title = "Total Coupon Redemption Volume By Household Size",
subtitle = "Total coupon redemption number by each household size between January 1 and December 31, 2017",
x = "Household Size",
y = "Total Coupon Redemption Volume")
# coupon usage based on household size
Coupon_Plot <- ggplot(Coupon_Joined, aes(household_size, fill = coupon_upc)) +
geom_bar() +
labs(title = "Coupon Usage Based on Household Size",
subtitle = "Based on Top 5 Coupon UPCs Redeemed",
x = "Household Size",
y = "Total Coupons Used")
plot(Coupon_Plot)
summarize problem statement, insights, limitations, etc.