Final Project - Group 2

Introduction

We have been tasked with identifying and analyzing potential growth opportunities where Regork can invest in order to increase revenue and profits. Companies like Regork are always looking to make an impact and to win the “hearts and minds” of the consumer. As a grocer, sometimes this can be difficult when you deal with commodities. How can you gain the attention of the consumer and get them into the store?

Do the targeted campaigns generate a greater amount of customer engagement?

Regork has much of the data to start with and paint a picture of the consumer. With this data, we can see that the targeted Type A campaigns have a good effect on the larger families of 5+, but not so much on the smaller families. Utilizing exploratory data analysis (EDA) techniques and visualizations will allow us to identify potential solutions for increasing customer engagement through campaigns, while penetrating larger household sizes.

look at the rubric to complete this section; please revise as you see fit (Mike started this section; I added some small verbiage)

Packages Required

The following R packages are required in order to run the code in this R project:

library(completejourney)          # grocery store shopping transactions data from group of 2,469 households
library(corrplot)                 # visualization of a correlation matrix
library(dplyr)                    # manipulating and transforming data (i.e., filtering, joining, etc.)
library(forcats)                  # working with categorical variables or factors
library(GGally)                   # plotting system for R based on "Grammar of Graphics"; extension to "ggplot2"
library(ggrepel)                  # position and repel overlapping text labels with "ggplot2"
library(ggplot2)                  # data visualization plotting system using "Grammar of Graphics"
library(ggthemes)                 # additional plotting themes, scales, and geoms for "ggplot2"
library(grid)                     # grid graphics plotting functions and capabilities
library(gridExtra)                # additional grid capability functions for "grid"
library(here)                     # quick way to locate R project files
library(jpeg)                     # reading and writing JPEG images
library(knitr)                    # dynamic report generation in R
library(lubridate)                # functions used for working with dates and times
library(naniar)                   # find data quality issues; summarize, visualize, and manipulate missing data
library(png)                      # reading and writing PNG images
library(readr)                    # reading rectangular text data
library(stringr)                  # manipulation text
library(tidyr)                    # functions used for tidying or cleaning up messy data
library(tidyverse)                # tidying data and working with other R packages
library(treemapify)               # plotting treemaps in "ggplot2"

Data Preparation

The data for this R project can be accessed from the CompleteJourney website. The CompleteJourney datasets are based on grocery shopping transactions from a group of 2,469 households. Entities such as demographics, products, coupons, campaigns, etc., was collected over a timeframe from January 2017 - December 2017.

The first task in the data preparation process is to get the full transactions and promotions datasets. Both datasets and the remaining CompleteJourney datasets were previewed in a table-like format.

# get the completejourney - transactions dataset
transactions <- get_transactions()
transactions

## # A tibble: 1,469,307 × 11
##    house…¹ store…² baske…³ produ…⁴ quant…⁵ sales…⁶ retai…⁷ coupo…⁸ coupo…⁹  week
##    <chr>   <chr>   <chr>   <chr>     <dbl>   <dbl>   <dbl>   <dbl>   <dbl> <int>
##  1 900     330     311985… 1095275       1    0.5     0          0       0     1
##  2 900     330     311985… 9878513       1    0.99    0.1        0       0     1
##  3 1228    406     311986… 1041453       1    1.43    0.15       0       0     1
##  4 906     319     311987… 1020156       1    1.5     0.29       0       0     1
##  5 906     319     311987… 1053875       2    2.78    0.8        0       0     1
##  6 906     319     311987… 1060312       1    5.49    0.5        0       0     1
##  7 906     319     311987… 1075313       1    1.5     0.29       0       0     1
##  8 1058    381     311986… 985893        1    1.88    0.21       0       0     1
##  9 1058    381     311986… 988791        1    1.5     1.29       0       0     1
## 10 1058    381     311986… 9297106       1    2.69    0          0       0     1
## # … with 1,469,297 more rows, 1 more variable: transaction_timestamp <dttm>,
## #   and abbreviated variable names ¹household_id, ²store_id, ³basket_id,
## #   ⁴product_id, ⁵quantity, ⁶sales_value, ⁷retail_disc, ⁸coupon_disc,
## #   ⁹coupon_match_disc

# get the completejourney - promotions dataset
promotions <- get_promotions()
promotions

## # A tibble: 20,940,529 × 5
##    product_id store_id display_location mailer_location  week
##    <chr>      <chr>    <fct>            <fct>           <int>
##  1 1000050    316      9                0                   1
##  2 1000050    337      3                0                   1
##  3 1000050    441      5                0                   1
##  4 1000092    292      0                A                   1
##  5 1000092    293      0                A                   1
##  6 1000092    295      0                A                   1
##  7 1000092    298      0                A                   1
##  8 1000092    299      0                A                   1
##  9 1000092    304      0                A                   1
## 10 1000092    306      0                A                   1
## # … with 20,940,519 more rows

campaigns

## # A tibble: 6,589 × 2
##    campaign_id household_id
##    <chr>       <chr>       
##  1 1           105         
##  2 1           1238        
##  3 1           1258        
##  4 1           1483        
##  5 1           2200        
##  6 1           293         
##  7 1           529         
##  8 1           536         
##  9 1           568         
## 10 1           630         
## # … with 6,579 more rows

campaign_descriptions

## # A tibble: 27 × 4
##    campaign_id campaign_type start_date end_date  
##    <chr>       <ord>         <date>     <date>    
##  1 1           Type B        2017-03-03 2017-04-09
##  2 2           Type B        2017-03-08 2017-04-09
##  3 3           Type C        2017-03-13 2017-05-08
##  4 4           Type B        2017-03-29 2017-04-30
##  5 5           Type B        2017-04-03 2017-05-07
##  6 6           Type C        2017-04-19 2017-05-21
##  7 7           Type B        2017-04-24 2017-05-28
##  8 8           Type A        2017-05-08 2017-06-25
##  9 9           Type B        2017-05-31 2017-07-02
## 10 10          Type B        2017-06-28 2017-07-30
## # … with 17 more rows

coupons

## # A tibble: 116,204 × 3
##    coupon_upc  product_id campaign_id
##    <chr>       <chr>      <chr>      
##  1 10000085207 9676830    26         
##  2 10000085207 9676943    26         
##  3 10000085207 9676944    26         
##  4 10000085207 9676947    26         
##  5 10000085207 9677008    26         
##  6 10000085207 9677052    26         
##  7 10000085207 9677385    26         
##  8 10000085207 9677479    26         
##  9 10000085207 9677791    26         
## 10 10000085207 9677878    26         
## # … with 116,194 more rows

coupon_redemptions

## # A tibble: 2,102 × 4
##    household_id coupon_upc  campaign_id redemption_date
##    <chr>        <chr>       <chr>       <date>         
##  1 1029         51380041013 26          2017-01-01     
##  2 1029         51380041313 26          2017-01-01     
##  3 165          53377610033 26          2017-01-03     
##  4 712          51380041013 26          2017-01-07     
##  5 712          54300016033 26          2017-01-07     
##  6 2488         51200092776 26          2017-01-10     
##  7 2488         51410010050 26          2017-01-10     
##  8 1923         53000012033 26          2017-01-14     
##  9 1923         54300021057 26          2017-01-14     
## 10 1923         57047091041 26          2017-01-14     
## # … with 2,092 more rows

demographics

## # A tibble: 801 × 8
##    household_id age   income    home_ownership marital…¹ house…² house…³ kids_…⁴
##    <chr>        <ord> <ord>     <ord>          <ord>     <ord>   <ord>   <ord>  
##  1 1            65+   35-49K    Homeowner      Married   2       2 Adul… 0      
##  2 1001         45-54 50-74K    Homeowner      Unmarried 1       1 Adul… 0      
##  3 1003         35-44 25-34K    <NA>           Unmarried 1       1 Adul… 0      
##  4 1004         25-34 15-24K    <NA>           Unmarried 1       1 Adul… 0      
##  5 101          45-54 Under 15K Homeowner      Married   4       2 Adul… 2      
##  6 1012         35-44 35-49K    <NA>           Married   5+      2 Adul… 3+     
##  7 1014         45-54 15-24K    <NA>           Married   4       2 Adul… 2      
##  8 1015         45-54 50-74K    Homeowner      Unmarried 1       1 Adul… 0      
##  9 1018         45-54 35-49K    Homeowner      Married   5+      2 Adul… 3+     
## 10 1020         45-54 25-34K    Homeowner      Married   2       2 Adul… 0      
## # … with 791 more rows, and abbreviated variable names ¹marital_status,
## #   ²household_size, ³household_comp, ⁴kids_count

products

## # A tibble: 92,331 × 7
##    product_id manufacturer_id department    brand    product_c…¹ produ…² packa…³
##    <chr>      <chr>           <chr>         <fct>    <chr>       <chr>   <chr>  
##  1 25671      2               GROCERY       National FRZN ICE    ICE - … 22 LB  
##  2 26081      2               MISCELLANEOUS National <NA>        <NA>    <NA>   
##  3 26093      69              PASTRY        Private  BREAD       BREAD:… <NA>   
##  4 26190      69              GROCERY       Private  FRUIT - SH… APPLE … 50 OZ  
##  5 26355      69              GROCERY       Private  COOKIES/CO… SPECIA… 14 OZ  
##  6 26426      69              GROCERY       Private  SPICES & E… SPICES… 2.5 OZ 
##  7 26540      69              GROCERY       Private  COOKIES/CO… TRAY P… 16 OZ  
##  8 26601      69              DRUG GM       Private  VITAMINS    VITAMI… 300 CT…
##  9 26636      69              PASTRY        Private  BREAKFAST … SW GDS… <NA>   
## 10 26691      16              GROCERY       Private  PNT BTR/JE… HONEY   12 OZ  
## # … with 92,321 more rows, and abbreviated variable names ¹product_category,
## #   ²product_type, ³package_size

This section of the data preparation combines all the specified dataframes that are used with the visualizations in this R project. References to “na.rm = TRUE” have been incorporated to remove “NA” values from calculations.

filtered_trans <- demographics %>%
  left_join(transactions) %>%
  mutate(transaction_quarter = quarter(transaction_timestamp, type = "year.quarter"))

## Joining, by = "household_id"

full_view <- filtered_trans %>%
  inner_join(products, key = "product_id") %>%
  inner_join(coupon_redemptions) %>%
  inner_join(coupons) %>%
  inner_join(campaign_descriptions) %>%
  inner_join(campaigns)

## Joining, by = "product_id"
## Joining, by = "household_id"
## Joining, by = c("product_id", "coupon_upc", "campaign_id")
## Joining, by = "campaign_id"
## Joining, by = c("household_id", "campaign_id")

savings_view <- full_view %>%
  dplyr::mutate(coupon_savings = (coupon_disc + coupon_match_disc)) %>%
  dplyr::mutate(customer_amount = (sales_value - (coupon_savings))) %>%
#  dplyr::mutate(campaign_id = as.integer(campaign_id)) %>%
  group_by(transaction_quarter, week, household_id, household_size, campaign_id, campaign_type, age, income, marital_status, quantity, sales_value) %>%
  summarise(total_saving = sum(coupon_savings))

## `summarise()` has grouped output by 'transaction_quarter', 'week',
## 'household_id', 'household_size', 'campaign_id', 'campaign_type', 'age',
## 'income', 'marital_status', 'quantity'. You can override using the `.groups`
## argument.

spend_view <- full_view %>%
  dplyr::mutate(coupon_savings = (coupon_disc + coupon_match_disc)) %>%
  dplyr::mutate(customer_amount = (sales_value - (coupon_savings))) %>%
# dplyr::mutate(campaign_id = as.integer(campaign_id)) %>%
  group_by(transaction_quarter, week, household_id, household_size, campaign_id, campaign_type, age, income, marital_status) %>%
  summarise(total_spend = sum(customer_amount))

## `summarise()` has grouped output by 'transaction_quarter', 'week',
## 'household_id', 'household_size', 'campaign_id', 'campaign_type', 'age',
## 'income'. You can override using the `.groups` argument.

household_percent <- full_view %>% 
  count(campaign_type, household_size) %>%
  group_by(household_size) %>%
  mutate(pct = n / sum(n))

df1 <- coupon_redemptions %>%
  inner_join(demographics) %>%
  group_by(household_size)

## Joining, by = "household_id"

df1 %>% map_int(n_distinct)

##    household_id      coupon_upc     campaign_id redemption_date             age 
##             295             440              24             237               6 
##          income  home_ownership  marital_status  household_size  household_comp 
##              12               5               3               5               4 
##      kids_count 
##               4

df6 <- transactions_sample %>%
  inner_join(products) %>%
  inner_join(demographics) %>%
  group_by(household_size) %>%
  summarize(total_sales = sum(sales_value), total_quantity = sum(quantity)) %>%
  mutate(dollar_per_item = total_sales / total_quantity)

## Joining, by = "product_id"
## Joining, by = "household_id"

Demo_Coupon <- demographics %>%
  inner_join(coupon_redemptions)

## Joining, by = "household_id"

Coupons_used <- coupons %>%
  inner_join(coupon_redemptions)

## Joining, by = c("coupon_upc", "campaign_id")

Counted_Group <- Coupons_used %>%
  count(coupon_upc)

Top_5_UPCs <- Counted_Group %>%
  top_n(n = 5, wt = n)

Ordered_Top_5_UPCs <- arrange(Top_5_UPCs, desc(Top_5_UPCs$n))

Coupon_Joined <- Demo_Coupon %>%
  semi_join(Ordered_Top_5_UPCs)

## Joining, by = "coupon_upc"

Exploratory Data Analysis

The Exploratory Data Analysis (EDA) section outlines the visualizations that are used in the R project to address the problem statement and/or business question. Formats such as bar charts, line charts, scatter plots, correlation matrix, etc., are as different styles for the visualizations.

The first step to analyzing the problem is to first assess the current state of our campaign strategy. We wanted to analyze which campaigns were being sent to which household size groups. Our first graph represents the number of coupons sent to each household size, subset by campaign type. This allowsed us to have a frameset for our problem and solution.

# campaigns received based on household size by campaign type

campaigns %>%
  inner_join(demographics, by = "household_id") %>%
  inner_join(campaign_descriptions, by = "campaign_id") %>%
  group_by(household_id, campaign_id) %>%
  ggplot(aes(x = household_size, fill = campaign_type)) +
  geom_bar() +
  labs(
    fill = "Campaign Type",
    x = "Household Size",
    y = "Campaigns Received",
    title = "Campaigns Received by Household Size",
    subtitle = "Data represents the number of campaigns received based on campaign type by household size.\nThe top campaign type is A with its highest count being received within a household size of 2.",
    caption = "Data Source R Package: completejourney" 
    ) +
  theme(
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(hjust = 1, size = 7),
    legend.title = element_text(size = 9),
    legend.text = element_text(size = 7)
  )

The following are tables that display the counts of campaigns received based on household size and campaign type.

# count of campaigns received based on household size
campaigns %>%
  inner_join(demographics, by = "household_id") %>%
  inner_join(campaign_descriptions, by = "campaign_id") %>%
  group_by(household_size) %>%
  count(household_size)

## # A tibble: 5 × 2
## # Groups:   household_size [5]
##   household_size     n
##   <ord>          <int>
## 1 1               1155
## 2 2               1482
## 3 3                576
## 4 4                300
## 5 5+               322

# count of campaigns by campaign type
campaigns %>%
  inner_join(demographics, by = "household_id") %>%
  inner_join(campaign_descriptions, by = "campaign_id") %>%
  group_by(campaign_type) %>%
  count(campaign_type)

## # A tibble: 3 × 2
## # Groups:   campaign_type [3]
##   campaign_type     n
##   <ord>         <int>
## 1 Type A         2007
## 2 Type B         1466
## 3 Type C          362

# count of campaigns received by household size based on campaign type
campaigns %>%
  inner_join(demographics, by = "household_id") %>%
  inner_join(campaign_descriptions, by = "campaign_id") %>%
  group_by(household_size) %>%
  count(campaign_type)

## # A tibble: 15 × 3
## # Groups:   household_size [5]
##    household_size campaign_type     n
##    <ord>          <ord>         <int>
##  1 1              Type A          647
##  2 1              Type B          403
##  3 1              Type C          105
##  4 2              Type A          802
##  5 2              Type B          558
##  6 2              Type C          122
##  7 3              Type A          280
##  8 3              Type B          228
##  9 3              Type C           68
## 10 4              Type A          126
## 11 4              Type B          139
## 12 4              Type C           35
## 13 5+             Type A          152
## 14 5+             Type B          138
## 15 5+             Type C           32

The next graph shows the average spending by household size, and we subset it by campaign type. We decided to take the average spending to make sure the data was not skewed and to help depict a more accurate graph showing the similarities and differences of spending for each campaign type. As we can see by the graph, each housing size has slightly different habits towards which campaign is redeemed the most.

# average spending by household size based on campaign type

transactions %>%
  inner_join(demographics, by = "household_id") %>%
  inner_join(campaigns) %>%
  inner_join(campaign_descriptions) %>%
  group_by(campaign_type, household_size) %>%
  summarize(average_spending = mean(sales_value, na.rm = TRUE)) %>%
  arrange(desc(average_spending), household_size) %>%
  ggplot(aes(x = household_size, y = average_spending, color = campaign_type)) +
  geom_point(size = 5) +
  geom_text_repel(
    aes(label = scales::dollar(round(average_spending, 2))),
    color = "black"
  ) +
  guides(color = guide_legend(title = "Campaign Type")) +
  scale_y_continuous(name = "Average Spending", labels = scales::dollar) +
  labs(
    x = "Household Size",
    y = "Average Spending",
    title = "Average Spending by Household Size",
    subtitle = "Data represents the average spending based on household size and by campaign type.",
    caption = "Data Source R Package: completejourney"
    ) +
  theme(
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(hjust = 1, size = 7),
    legend.title = element_text(size = 9),
    legend.text = element_text(size = 7)
  )

## Joining, by = "household_id"
## Joining, by = "campaign_id"
## `summarise()` has grouped output by 'campaign_type'. You can override using the
## `.groups` argument.

We also wanted to confirm the seasonality of the campaigns to make sure there were no issues with timing of the coupons. Our seasonality graph shows that Type A is effective year-round, which is another positive factor showing that the targeted campaigns are effective.

# products sales per month by campaign type; top 10 coupons

products %>%
  inner_join(transactions, by = "product_id") %>%
  inner_join(coupons, by = "product_id") %>%
  inner_join(campaigns) %>%
  inner_join(campaign_descriptions) %>%
  mutate(product_category = str_replace_all(product_category, pattern = "FRZN", replacement = "FROZEN")) %>%
  mutate(sales_month = month(transaction_timestamp, label = TRUE)) %>%
  group_by(sales_month, coupon_upc, campaign_type) %>%
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%
  arrange(desc(total_sales)) %>%
  slice(1:10) %>%
  ggplot(aes(x = sales_month, y = total_sales, color = campaign_type)) +
  geom_jitter() +
  guides(color = guide_legend(title = "Campaign Type")) +
  scale_y_continuous(name = "Total Sales", labels = scales::dollar) +
  labs(
    x = "Month",
    title = "Total Product Sales Per Month",
    subtitle = "Data represents the total sales based on the top 10 coupons.\nCampaign Type A displays the highest product sales per month..",
    caption = "Data Source R Package: completejourney"
    ) +
  theme(
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(hjust = 1, size = 7),
    legend.title = element_text(size = 9),
    legend.text = element_text(size = 7)
  )

## Joining, by = c("household_id", "campaign_id")
## Joining, by = "campaign_id"
## `summarise()` has grouped output by 'sales_month', 'coupon_upc'. You can
## override using the `.groups` argument.

this needs a write-up…

# total sales by household size based on campaign type

first_join_df <- campaigns %>%
  inner_join(demographics, by = "household_id") %>%
  inner_join(campaign_descriptions, by = "campaign_id")

second_join_df <- first_join_df %>%
  inner_join(transactions, by = "household_id") %>%
  group_by(household_id, household_size, campaign_type) %>%
  summarize(total_sales = sum(sales_value, na.rm = TRUE)) %>%
  ggplot(aes(x = campaign_type, y = total_sales, color = household_size)) +
  geom_point() +
  guides(color = guide_legend(title = "Household Size")) +
  scale_y_log10(name = "Total Sales", labels = scales::dollar) +
  labs(
    x = "Campaign Type",
    y = "Total Sales",
    title = "Total Sales by Campaign Type",
    subtitle = "Data represents the total sales by campaign type based on the household size.",
    caption = "Data Source R Package: completejourney"
    ) +
  theme(
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(hjust = 1, size = 7),
    legend.title = element_text(size = 9),
    legend.text = element_text(size = 7)
  )

## `summarise()` has grouped output by 'household_id', 'household_size'. You can
## override using the `.groups` argument.

second_join_df

this needs a write-up…(once decision is made to keep???; select one)

##  [1] "#7F3B08" "#B75C07" "#E68D25" "#FDC57E" "#FEEDD6" "#E9EAF3" "#BEBADA"
##  [8] "#8B7FB4" "#582F8C" "#2D004B"

Our next two graphs show the spending and savings per campaign type. As we can see by the graphs, Type A campaigns were prominently found in the spending and savings, compared to Types B and C. We felt that these graphs gave us a great depiction of overall usage of Type A compared to Types B and C. (We can then tie into how we view Type A is the dominant campaign being the targeted one and roll into solution on how to make campaigns better?)

# update titles, subtitles, add captions, etc. by 10/5/2022

# spending by household size where quantity less than 40000 and sales value < 400

filtered_trans %>%
  select(sales_value, quantity, household_size) %>%
  filter(quantity < 40000, sales_value < 400) %>%
  ggplot(aes(x = quantity, y = sales_value, color = household_size)) +
  geom_point() + 
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "spend by household size", subtitle = "OMG")

# savings based on household size

savings_view %>%
  ggplot(aes(x = quantity, y = sales_value, color = household_size)) +
  geom_point() + 
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "spend by household size", subtitle = "OMG")

this needs a write-up…

# update title, subtitle, add caption, etc. by 10/5/2022

# savings based on household size; quarterly; filtering outlier

savings_view %>% 
  filter(household_id != 1228) %>%
  arrange(campaign_id) %>%
  ggplot(aes(y = total_saving, x = transaction_quarter, fill = household_size)) +
  geom_bar(position = 'dodge', stat = 'identity') +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "savings by household size", subtitle = "OMG")

this needs a write-up…

# update title, subtitle, add caption, etc. by 10/5/2022

# savings based campaign type; by week

savings_view %>% 
  filter(household_id != 1228) %>%
  arrange(campaign_id) %>%
  ggplot(aes(y = total_saving, x = week, color = campaign_type)) +
  geom_point(position = 'dodge', stat = 'identity') +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "savings by campaign", subtitle = "OMG")

## Warning: Width not defined. Set with `position_dodge(width = ?)`

this needs a write-up…

# update title, subtitle, add caption, etc. by 10/5/2022

# spending based on household size; quarterly 

spend_view %>% 
  filter(household_id != 1228) %>%
  arrange(campaign_id) %>%
  ggplot(aes(y = total_spend, x = transaction_quarter, fill = household_size)) +
  geom_bar(position = 'dodge', stat = 'identity') +
  scale_y_continuous(labels = scales::dollar) +
  labs(title = "spend by household size", subtitle = "OMG")

this needs a write-up…

# update title, subtitle, add caption, etc. by 10/5/2022

# spending by household size; filtering outlier

spend_view %>%  #line view
  filter(household_id != 1228) %>%
  arrange(campaign_id) %>%
  ggplot(aes(y = total_spend, x = week)) +
  geom_line(data = spend_view, aes(week, total_spend, color = household_size)) +
  scale_y_continuous(labels = scales::dollar) +
  theme(legend.justification = 'centre',
                legend.position = 'bottom',
                legend.direction = "horizontal",
                legend.key.height = unit(0.5, "cm"),
                legend.key.width = unit(0.5,"cm")) +
  labs(title = "spend by household size", subtitle = "OMG")

this needs a write-up…

# update title, subtitle, add caption, etc. by 10/5/2022

# campaigns per household based on percentage

household_percent %>% 
  ggplot(aes(household_size, fill = campaign_type)) +
  geom_col(aes(household_size, pct)) +
  labs(title = "Campaigns per household (percentage)", subtitle = "subtitle") +
  theme_wsj() +
  theme(text = element_text(size = 7), axis.text = element_text(size = 7), 
        plot.title = element_text(size = 14), plot.subtitle = element_text(size = 9)) +
  scale_y_continuous()

this needs a write-up…

# update title, color, subtitle, add caption by 10/5/2022

# sales value per quantity by income based on household size

ggplot(df6, aes(household_size, dollar_per_item)) +
  geom_point(stat = "identity", color = "red", size = 2) +
  labs(title = "Total Sales Value per Quantity By Income",
       subtitle = "Total sales value per quantity for all products purchased by each income group",
       x = "Household Size",
       y = "Total Sales Value per Quantity") +
  coord_flip()

this needs a write-up…

# total coupon redemption by age based on household size

ggplot(df1, aes(x = household_size, fill = household_size)) +
  geom_bar(stat = "count") +
  labs(title = "Total Coupon Redemption Volume By Household Size",
       subtitle = "Total coupon redemption number by each household size between January 1 and December 31, 2017",
       x = "Household Size",
       y = "Total Coupon Redemption Volume")

this needs a write-up…

# coupon usage based on household size

Coupon_Plot <- ggplot(Coupon_Joined, aes(household_size, fill = coupon_upc)) +
  geom_bar() +
  labs(title = "Coupon Usage Based on Household Size",
       subtitle = "Based on Top 5 Coupon UPCs Redeemed",
       x = "Household Size",
       y = "Total Coupons Used")

plot(Coupon_Plot)

Summary

summarize problem statement, insights, limitations, etc.