Introduction / The Business Question

The task at hand is to create a data evaluation of Kroger coupon use in 2017. What follows is the findings that I had while analyzing the data and creating understandable metrics.

When looking at a series of data there are many ways to evaluate and form this data into a successful analytical discovery. The question is: Does the income range of customers affect coupon usage by customers in different aspects of data categories? I wanted to find where we should manage our target market in certain times of year and usage on types of coupon in varying household incomes.

Synopsis

breakdown of midterm

Packages used

library(cowplot)
library(completejourney)
## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.
library(ggplot2)
library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ lubridate 1.9.4     ✔ tibble    3.2.1
## ✔ purrr     1.0.2     ✔ tidyr     1.3.1
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter()    masks stats::filter()
## ✖ dplyr::lag()       masks stats::lag()
## ✖ lubridate::stamp() masks cowplot::stamp()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(lubridate)
library(dplyr)
library(scales)
## 
## Attaching package: 'scales'
## 
## The following object is masked from 'package:purrr':
## 
##     discard
## 
## The following object is masked from 'package:readr':
## 
##     col_factor

cowplot: Enchanced ggplot2 by providing tools for combining multiple plots into a single figure

completejourney: Provides consumer transaction datasets for analyzing grocery shopping behavior

ggplot2: A powerful data visualization package based on the grammar of graphics

tidyverse: A collection of R packages for data manipulation, visualization, and analysis.

lubridate: Simplifies working with date-time objects in R

dyplr: Provides efficient functions for data wrangling using a readable syntax

scales: Controls axis scaling, formatting, and transformations in ggplot2

Data usage

transactions <- get_transactions()
transactions
## # A tibble: 1,469,307 × 11
##    household_id store_id basket_id   product_id quantity sales_value retail_disc
##    <chr>        <chr>    <chr>       <chr>         <dbl>       <dbl>       <dbl>
##  1 900          330      31198570044 1095275           1        0.5         0   
##  2 900          330      31198570047 9878513           1        0.99        0.1 
##  3 1228         406      31198655051 1041453           1        1.43        0.15
##  4 906          319      31198705046 1020156           1        1.5         0.29
##  5 906          319      31198705046 1053875           2        2.78        0.8 
##  6 906          319      31198705046 1060312           1        5.49        0.5 
##  7 906          319      31198705046 1075313           1        1.5         0.29
##  8 1058         381      31198676055 985893            1        1.88        0.21
##  9 1058         381      31198676055 988791            1        1.5         1.29
## 10 1058         381      31198676055 9297106           1        2.69        0   
## # ℹ 1,469,297 more rows
## # ℹ 4 more variables: coupon_disc <dbl>, coupon_match_disc <dbl>, week <int>,
## #   transaction_timestamp <dttm>
promotions <- get_promotions()
promotions
## # A tibble: 20,940,529 × 5
##    product_id store_id display_location mailer_location  week
##    <chr>      <chr>    <fct>            <fct>           <int>
##  1 1000050    316      9                0                   1
##  2 1000050    337      3                0                   1
##  3 1000050    441      5                0                   1
##  4 1000092    292      0                A                   1
##  5 1000092    293      0                A                   1
##  6 1000092    295      0                A                   1
##  7 1000092    298      0                A                   1
##  8 1000092    299      0                A                   1
##  9 1000092    304      0                A                   1
## 10 1000092    306      0                A                   1
## # ℹ 20,940,519 more rows
campaigns
## # A tibble: 6,589 × 2
##    campaign_id household_id
##    <chr>       <chr>       
##  1 1           105         
##  2 1           1238        
##  3 1           1258        
##  4 1           1483        
##  5 1           2200        
##  6 1           293         
##  7 1           529         
##  8 1           536         
##  9 1           568         
## 10 1           630         
## # ℹ 6,579 more rows
campaign_descriptions
## # A tibble: 27 × 4
##    campaign_id campaign_type start_date end_date  
##    <chr>       <ord>         <date>     <date>    
##  1 1           Type B        2017-03-03 2017-04-09
##  2 2           Type B        2017-03-08 2017-04-09
##  3 3           Type C        2017-03-13 2017-05-08
##  4 4           Type B        2017-03-29 2017-04-30
##  5 5           Type B        2017-04-03 2017-05-07
##  6 6           Type C        2017-04-19 2017-05-21
##  7 7           Type B        2017-04-24 2017-05-28
##  8 8           Type A        2017-05-08 2017-06-25
##  9 9           Type B        2017-05-31 2017-07-02
## 10 10          Type B        2017-06-28 2017-07-30
## # ℹ 17 more rows
coupons
## # A tibble: 116,204 × 3
##    coupon_upc  product_id campaign_id
##    <chr>       <chr>      <chr>      
##  1 10000085207 9676830    26         
##  2 10000085207 9676943    26         
##  3 10000085207 9676944    26         
##  4 10000085207 9676947    26         
##  5 10000085207 9677008    26         
##  6 10000085207 9677052    26         
##  7 10000085207 9677385    26         
##  8 10000085207 9677479    26         
##  9 10000085207 9677791    26         
## 10 10000085207 9677878    26         
## # ℹ 116,194 more rows
coupon_redemptions
## # A tibble: 2,102 × 4
##    household_id coupon_upc  campaign_id redemption_date
##    <chr>        <chr>       <chr>       <date>         
##  1 1029         51380041013 26          2017-01-01     
##  2 1029         51380041313 26          2017-01-01     
##  3 165          53377610033 26          2017-01-03     
##  4 712          51380041013 26          2017-01-07     
##  5 712          54300016033 26          2017-01-07     
##  6 2488         51200092776 26          2017-01-10     
##  7 2488         51410010050 26          2017-01-10     
##  8 1923         53000012033 26          2017-01-14     
##  9 1923         54300021057 26          2017-01-14     
## 10 1923         57047091041 26          2017-01-14     
## # ℹ 2,092 more rows
demographics
## # A tibble: 801 × 8
##    household_id age   income    home_ownership marital_status household_size
##    <chr>        <ord> <ord>     <ord>          <ord>          <ord>         
##  1 1            65+   35-49K    Homeowner      Married        2             
##  2 1001         45-54 50-74K    Homeowner      Unmarried      1             
##  3 1003         35-44 25-34K    <NA>           Unmarried      1             
##  4 1004         25-34 15-24K    <NA>           Unmarried      1             
##  5 101          45-54 Under 15K Homeowner      Married        4             
##  6 1012         35-44 35-49K    <NA>           Married        5+            
##  7 1014         45-54 15-24K    <NA>           Married        4             
##  8 1015         45-54 50-74K    Homeowner      Unmarried      1             
##  9 1018         45-54 35-49K    Homeowner      Married        5+            
## 10 1020         45-54 25-34K    Homeowner      Married        2             
## # ℹ 791 more rows
## # ℹ 2 more variables: household_comp <ord>, kids_count <ord>
products
## # A tibble: 92,331 × 7
##    product_id manufacturer_id department    brand  product_category product_type
##    <chr>      <chr>           <chr>         <fct>  <chr>            <chr>       
##  1 25671      2               GROCERY       Natio… FRZN ICE         ICE - CRUSH…
##  2 26081      2               MISCELLANEOUS Natio… <NA>             <NA>        
##  3 26093      69              PASTRY        Priva… BREAD            BREAD:ITALI…
##  4 26190      69              GROCERY       Priva… FRUIT - SHELF S… APPLE SAUCE 
##  5 26355      69              GROCERY       Priva… COOKIES/CONES    SPECIALTY C…
##  6 26426      69              GROCERY       Priva… SPICES & EXTRAC… SPICES & SE…
##  7 26540      69              GROCERY       Priva… COOKIES/CONES    TRAY PACK/C…
##  8 26601      69              DRUG GM       Priva… VITAMINS         VITAMIN - M…
##  9 26636      69              PASTRY        Priva… BREAKFAST SWEETS SW GDS: SW …
## 10 26691      16              GROCERY       Priva… PNT BTR/JELLY/J… HONEY       
## # ℹ 92,321 more rows
## # ℹ 1 more variable: package_size <chr>

Graph 1

In the first graph I was able to pull the overall usage of coupons by Kroger customers in the year 2017 by the overall household income of customers. The goal of this graph is to pinpoint where we are seeing lots of coupon usage and where we see very little or moderate coupon usage. Knowing this data allows us to see where our coupon marketing strategies lie. Once we can see the main three groups we should target, which based on the data we see is the ranges of 50-74K, 75-99k, and 35-49K in that order are the household income ranges that use the most coupons. After reading this data we know that we should be marketing more to the lower income households that are not using coupons as much as they could. As well as keep pumping out marketing material and coupons to the middle class income range because this has clearly been successful.

season_redempt1 <- coupon_redemptions %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(income) %>%
  summarise(coupon_upc = n()) %>%
  arrange(income) %>%  
  ggplot(aes(x = income, y = coupon_upc)) +
  geom_line(aes(group = 1), color = "blue") +  
  geom_point(data = . %>% top_n(20, coupon_upc), aes(x = income, y = coupon_upc), color = "red", size = 3) +
  labs(x = "Income", y = "Coupon Usage") +
  ggtitle("Coupon Usage by Household Income") +  
  geom_text(data = . %>% top_n(20, coupon_upc), aes(x = income, y = coupon_upc, label = coupon_upc), 
            color = "black", vjust = -0.5, size = 3)  

season_redempt1

Graph 2

In the second graph we wanted to look at the total coupon redemptions by month with those top 3 income ranges we are focusing on, 50-74K, 75-99k, and 35-49K. The graph shows us that the months of August and November are extremely popular for coupon usage. The goal of this graph is to see where we have lots of activity in certain months. The data can help us show ratios of when to put out more or less coupons in certain months. The holiday season, by no surprise, is the most popular time for coupon usage so we should continue producing lots of coupons during this time, as well as promote more coupon use. Something we found interesting was the month of August, we can see that back to school shopping promotes a huge coupon usage in these income ranges of customers. We can conclude that in the month of August we should produce more back to school item coupons for customers.

income_time_series <- coupon_redemptions %>%
  inner_join(demographics, by = "household_id") %>%
  mutate(income_range = case_when(
    income %in% c("50-74K") ~ "50-74K",
    income %in% c("75-99K") ~ "75-99K",
    income %in% c("35-49K") ~ "35-49K",
    income %in% c("25-34K") ~ "25-34K"
  )) %>%
  filter(income_range %in% c("50-74K", "75-99K", "35-49K", "25-34K")) %>%
  mutate(month = floor_date(redemption_date, "month")) %>%
  group_by(month, income_range) %>%
  summarize(total_redemptions = n()) %>%
  ggplot(aes(x = month, y = total_redemptions, fill = income_range)) +
  geom_bar(stat = "identity", position = "dodge") +
  labs(x = "Month",
       y = "Total Redemptions",
       fill = "Income Range",
       subtitle = "This graph looks at the total coupon redemptions for different income ranges, grouped by month.",
       caption = "Source: Coupon Redemptions and Demographics data sets from the completejourney R package.") +
  ggtitle("Total Coupon Redemptions by Income Range and Month") +
  theme(axis.text.x = element_text(angle = 45, size = rel(1), margin = margin(.5, unit = "cm"), vjust = 1),
        legend.position = "top",
        axis.title.x = element_text(vjust = 4),
        plot.title = element_text(hjust = 0.3),
        legend.justification = -.03) +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y")  
## `summarise()` has grouped output by 'month'. You can override using the
## `.groups` argument.
income_time_series

Graph 3

In this third graph I am evaluating the categories of coupons used in the two highest coupon usage months, August and November. The goal of this graph is to see our most used types of coupons in the busy season. When we are able to pinpoint the certain categories we can ask ourselves the questions of why are these coupons successful, then making an educated guess we can push out more coupons like the ones that are doing well. For example if there is a coupon for lunch meat, then we should also push out more coupons for bread, because they go together. We can also evaluate the numbers and see if coupon allocation is in the proper categories in these months. This we can do for all months and better organize what type of coupons we are pushing out to customers in certain times of the year because of popularity. The graph shows us how popular beef and lunch meat coupons are in the month of August, and November shows us beauty products are more popular in November, therefore we should continue marketing these coupons to our customers.

library(tidyverse)
library(scales)

top_by_cat <- coupon_redemptions %>%
  inner_join(coupons, by = "coupon_upc", relationship = "many-to-many") %>%
  inner_join(products, by = "product_id") %>%
  mutate(period = case_when(
    month(redemption_date) %in% c(8) ~ "August",
    month(redemption_date) %in% c(11) ~ "November",
    month(redemption_date) %in% c(12) ~ "December",
    month(redemption_date) %in% c(9) ~ "September"
  )) %>%
  filter(period %in% c("August", "November", "December", "September")) %>%
  group_by(period, product_category) %>%
  summarise(coupon_redemptions = n(), .groups = "drop") %>%
  top_n(5) %>%
  arrange(desc(period), desc(coupon_redemptions)) %>%
  mutate(product_category = reorder(product_category, -coupon_redemptions)) %>%
  ggplot(aes(x = coupon_redemptions, y = product_category)) +
  geom_bar(stat = "identity", fill = "skyblue") +
  labs(x = "Coupon Redemptions", 
       y = "Product Category", 
       title = "Highest Coupon Use by Months", 
       caption = "Source: Coupon Redemptions, Coupons, and Products data sets from the completejourney R package.") +
  coord_flip() +
  geom_text(
    aes(label = paste(round(coupon_redemptions / 1e3, 2), "K")),
    position = position_dodge(width = 1),
    vjust = -.2, size = 2.5
  ) +
  facet_wrap(~ period, scales = 'free_x') + 
  scale_x_continuous(labels = label_number(suffix = "K", scale = 1e-3)) +
  theme(axis.text.x = element_text(size = 5, angle = 90, hjust = 0.95, vjust = 0.2),
        axis.text.y = element_text(size = 7))
## Selecting by coupon_redemptions
top_by_cat

Graph 4

The fourth and final graph is about being as specific as possible. In this graph we are looking at the three most popular coupon campaign types in the most popular month, November, in the income by households of 50-74k, which is the most popular as well. I wanted to see how well each of the campaign types did when coupons were the most used by the most used category. The data shows us that the type A campaign clearly worked the best in the most popular coupon month. While type C was clearly lacking and type B was far behind as well. This can tell us that we need to work more on type C and B campaigns because we are struggling with that. Then for campaign A continue what is working there because it is clearly successful. We can ask ourselves what we can learn for type C and B campaigns that we got from the type A campaign.

joined_data <- coupon_redemptions %>% inner_join(campaigns, by = c(“campaign_id”, “household_id”)) joined_data <- coupon_redemptions %>% inner_join(coupons, by = “coupon_upc”) %>% left_join(campaigns, by = “campaign_id”) %>%
inner_join(demographics, by = “household_id”) nov_coup_inc <- joined_data %>% mutate(month = month(redemption_date, label = TRUE, abbr = FALSE)) %>%
filter(month == “November” & income == “50-74K”) %>%
group_by(campaign_id, campaign_type) %>%
summarise(total_redemptions = n(), .groups = “drop”) %>%
slice_max(order_by = total_redemptions, n = 10) %>%
arrange(desc(total_redemptions)) %>%
ggplot(aes(x = total_redemptions, y = reorder(campaign_type, total_redemptions), fill = campaign_type)) +
geom_bar(stat = “identity”, show.legend = FALSE) +
geom_text(aes(label = total_redemptions), hjust = -0.2, size = 4) +
labs( x = “Total Coupon Redemptions”, y = “Campaign Type”, title = “Most Popular Campaign Types in November (Income: 50-74K)”, caption = “Source: Complete Journey Dataset” ) + scale_x_continuous(labels = scales::comma) +
theme_minimal() + theme(axis.text.x = element_text(size = 10), axis.text.y = element_text(size = 12), plot.title = element_text(face = “bold”, size = 14))

nov_coup_inc

Summary of Findings

The goal of this project was to be able to create a data evaluation that determined how different household incomes used coupons. We asked this question for the purpose of finding out where the allocation of different types of coupons should be distributed to our customers. We were able to pinpoint certain times of year where coupons and customers are more active, then furthermore evaluating which campaigns were the most successful in coupon usage based on incomes. Overall we made lots of findings in our data evaluation for Kroger. We are able to notice lots of things in the different elements of coupons and their users. With the findings we can help allocate coupons correctly based on income, time of the year, product type, and campaign success. The evaluation showed that the the very most successful coupon category is a customer who’s household income ranges from 50-75k, in the month of November under the marketing campaign type A. We even evaluated the most used coupons by products in certain months. We found that beauty product coupons are extremely popular in the month of November. With all of this information we found we can help the finance, marketing, operations, and overall sales departments because of our findings. Overall the project was a success in pinpointing certain coupons under certain incomes and time of year.