Midterm Project

Introduction

Provide an introduction that explains the business problem you are addressing. Why should the Regork CEO be interested in this?

At Regork, our goal is to solve business problems that help the company grow. With the holiday season approaching, there are many sales opportunities that we as a company don’t want to miss. Because sales noticeably slow down between holidays, we want to make the best of holiday sales. The CEO should be interested because this is the season when we can make the most money, and we should make the most of it.

Provide a short explanation of how you addressed this problem statement (the data used and the analytic methodology employed).

Packages required: completejourney, tidyverse, lubridate

Data sets used: transactions, demographics, products, coupon_redemptions

We used the packages above to take a deep dive into Regork’s past sales, looking at demographics such as income, age, and number of kids in the household, as well as items bought with candy. ggplot2 helped us visualize this data so we can easily see which demographics are best to advertise to.

Explain how your analysis will help the Regork CEO (or your manager). What is your proposed solution?

Our goal is for our plots to help the CEO easily visualize candy sales across different demographics. With major candy holidays coming up, these graphs may offer useful insight. With this data, we can better judge where to place candy in a store and which demographics to advertise to so that it sells better.

The following R packages are required in order to run the code in this R project:

library(tidyverse)

## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.3     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

library(completejourney)

## Welcome to the completejourney package! Learn more about these data
## sets at http://bit.ly/completejourney.

library(lubridate)

Data Sets Used

The four data sets which were used for this analysis:

Transactions: household_id, store_id, basket_id, product_id, quantity, sales_value, retail_disc, coupon_disc, coupon_match_disc, week, transaction_timestamp

Products: product_id, manufacturer_id, department, brand, product_category, product_type, package_size

Demographics: household_id, age, income, home_ownership, marital_status, household_size, household_comp, kids_count

Coupon_Redemptions: household_id, coupon_upc, campaign_id, redemption_date

The following R code uses get_transactions() to fetch the full transaction data and stores it in a data frame named ‘transactions’, which is used for the rest of the analysis.

transactions <- get_transactions()

Candy Transactions: This combines transaction data with product details and demographic information to specifically focus on transactions related to packaged candies. This is achieved through a series of data manipulations, including joining the transactions and products datasets based on product IDs, filtering for entries with a product category indicative of packaged candy using regular expressions, and finally, joining the resulting dataset with demographic information based on household IDs.

candyData <- transactions %>%
  inner_join(products, by = "product_id") %>%
  filter(str_detect(product_category, regex("candy - packaged", ignore_case = TRUE))) %>%
  inner_join(demographics, by = "household_id")
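One consequence of the two inner joins above is that candy purchases from households with no demographic record are dropped. The sketch below illustrates this on made-up toy tables; the names toy_transactions, toy_products, toy_demographics, and all values are hypothetical and are not part of the completejourney data.

```r
library(dplyr)
library(stringr)

# Toy stand-ins for the real tables (all values are made up for illustration)
toy_transactions <- tibble(
  household_id = c("1", "1", "2", "3"),
  product_id   = c("a", "b", "a", "c"),
  sales_value  = c(2.50, 1.00, 3.00, 4.00)
)
toy_products <- tibble(
  product_id       = c("a", "b", "c"),
  product_category = c("CANDY - PACKAGED", "SOFT DRINKS", "Candy - Packaged")
)
toy_demographics <- tibble(
  household_id = c("1", "2"),  # household 3 has no demographic record
  kids_count   = c("0", "1")
)

# Case-insensitive match on the category, as in the pipeline above
all_candy <- toy_transactions %>%
  inner_join(toy_products, by = "product_id") %>%
  filter(str_detect(product_category, regex("candy - packaged", ignore_case = TRUE)))

# Rows kept by the second inner join
toy_candy <- all_candy %>%
  inner_join(toy_demographics, by = "household_id")

# Candy rows silently dropped because the household has no demographics
dropped <- all_candy %>%
  anti_join(toy_demographics, by = "household_id")

print(toy_candy)
print(dropped)
```

Running the same anti_join() against the real tables would show how many candy purchases are excluded from the demographic plots.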

Coupon Data: Combines coupon redemption data with demographic information to gain insights into the utilization of coupons among various demographic groups. This is achieved through an inner join operation, connecting coupon_redemptions and demographics datasets based on household IDs.

couponData <- coupon_redemptions %>%
  inner_join(demographics, by = "household_id")

We joined transactions, products, and demographics to count how often each product category is bought by households with different numbers of kids. The result is summarized by product category and kids count.

productCatergories <- transactions %>%
  inner_join(products, by = "product_id") %>%
  inner_join(demographics, by = "household_id") %>%
  group_by(product_category, kids_count) %>%
  summarize(timesBought = n())

## `summarise()` has grouped output by 'product_category'. You can override using
## the `.groups` argument.
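The message above only reports that the result is still grouped by product_category. As a minimal sketch on made-up toy data (the tibble toy and its values are hypothetical), passing .groups = "drop" returns an ungrouped result and silences the message, and count() gives the same table in one step.

```r
library(dplyr)

# Toy data: one row per purchased item (values are made up)
toy <- tibble(
  product_category = c("SOFT DRINKS", "SOFT DRINKS", "MILK", "SOFT DRINKS"),
  kids_count       = c("0", "0", "0", "1")
)

# Same summary as above, but explicitly ungrouped afterwards
by_summarize <- toy %>%
  group_by(product_category, kids_count) %>%
  summarize(timesBought = n(), .groups = "drop")

# Equivalent shortcut: count() groups, tallies, and ungroups
by_count <- toy %>%
  count(product_category, kids_count, name = "timesBought")

print(by_summarize)
print(by_count)
```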

For this plot we looked at the average candy sales by number of kids in the household on each day of the year. Households with one child, on average, had the most candy sales. Across all the panels, candy sales peak around April and October, most likely due to candy purchases for Easter and Halloween.

candyData %>%
  # Collapse each timestamp to its calendar date
  mutate(date = date(transaction_timestamp)) %>%
  group_by(date, kids_count) %>%
  # Average value of a candy purchase per day and household size
  summarize(sales = mean(sales_value)) %>%
  ggplot(aes(x = date, y = sales)) +
    geom_point(color = 'darkblue') +
    facet_wrap(~kids_count) +
    ggtitle("Sales of candy throughout the year by number of kids in the household") +
    scale_x_date("Day of year (2017)", date_breaks = "1 month", date_labels = "%b") +
    scale_y_continuous("Average sales of candy each day", labels = scales::dollar) +
    theme(panel.spacing.x = unit(0.5, "cm"))

## `summarise()` has grouped output by 'date'. You can override using the
## `.groups` argument.
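The April and October peaks described above are easier to check when sales are totaled by month instead of plotted by day. The following is a minimal sketch on made-up toy data (toy_candy and its values are hypothetical), using lubridate's floor_date() to assign each purchase to the first day of its month.

```r
library(dplyr)
library(lubridate)

# Toy candy purchases (timestamps and values are made up)
toy_candy <- tibble(
  transaction_timestamp = ymd_hms(c("2017-04-10 10:00:00", "2017-04-15 12:30:00",
                                    "2017-10-30 18:00:00", "2017-07-04 09:00:00")),
  kids_count  = c("1", "1", "1", "1"),
  sales_value = c(3, 5, 6, 1)
)

# Round each purchase date down to the first of its month, then total sales
monthly <- toy_candy %>%
  mutate(month = floor_date(as_date(transaction_timestamp), unit = "month")) %>%
  group_by(month, kids_count) %>%
  summarize(total_sales = sum(sales_value), .groups = "drop")

print(monthly)
```

Applied to candyData, the same grouping would give one value per month and household size, which could be drawn with geom_col() or geom_line().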

The bar plot of total packaged candy sales by number of kids in the household reveals a clear trend. Surprisingly, households with zero children have the highest total sales. This is because the data contains more households without children, which is why their total sales are higher. Conversely, as the number of kids increases, total sales gradually decline, because the data contains fewer households with that many kids.

candyData %>%
  group_by(kids_count) %>%
  summarize(sales = sum(sales_value)) %>%
  ggplot(aes(x = kids_count, y = sales)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    ggtitle("Total sales in household by the number of kids") +
    scale_x_discrete("Number of Kids") +
    scale_y_continuous("Total sales", labels = scales::dollar)
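Because group totals depend on how many households are in each group, one way to compare groups more fairly is to divide total sales by the number of distinct households. The sketch below shows the idea on made-up toy data (toy_candy and its values are hypothetical); it is an illustration, not a result from the Regork data.

```r
library(dplyr)

# Toy candy purchases: three households with no kids, one household with two
toy_candy <- tibble(
  household_id = c("1", "2", "3", "4", "4"),
  kids_count   = c("0", "0", "0", "2", "2"),
  sales_value  = c(4, 5, 3, 4, 5)
)

per_household <- toy_candy %>%
  group_by(kids_count) %>%
  summarize(
    total_sales         = sum(sales_value),          # what the bar plot shows
    households          = n_distinct(household_id),  # size of each group
    sales_per_household = total_sales / households   # size-adjusted comparison
  )

print(per_household)
```

In this toy example the zero-kid group has the larger total but the smaller per-household value, which is the pattern the paragraph above warns about.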

Here you can see that the majority of candy sales come from middle-class households (35K-74K) with more kids. Upper-class households have next to no candy sales.

candyData %>%
  group_by(kids_count, income) %>%
  summarize(sales = sum(sales_value)) %>%
  ggplot(aes(x = kids_count, y = sales)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    facet_wrap(~income) + 
    ggtitle("Total sales in household by the number of kids and income") +
    scale_x_discrete("Number of Kids") +
    scale_y_continuous("Total sales", labels = scales::dollar)

## `summarise()` has grouped output by 'kids_count'. You can override using the
## `.groups` argument.

The total sales by number of kids and age show that, among households with zero kids, the middle-aged group (primarily ages 45-54) buys the most candy, while among households with kids the 35-44 age range buys the most.

candyData %>%
  group_by(kids_count, age) %>%
  summarize(sales = sum(sales_value)) %>%
  ggplot(aes(x = kids_count, y = sales)) +
    geom_bar(stat = 'identity', fill = 'darkblue') +
    facet_wrap(~age) + 
    ggtitle("Total sales in household by the number of kids and age") +
    scale_x_discrete("Number of Kids") +
    scale_y_continuous("Total sales", labels = scales::dollar)

## `summarise()` has grouped output by 'kids_count'. You can override using the
## `.groups` argument.

Coupon redemptions by income and number of kids follow trends that are, unsurprisingly, similar to the trends in candy bought by income and number of kids in the household.

couponData %>%
  group_by(kids_count, income) %>%
  summarize(totalRedemptions = n()) %>%
  ggplot(aes(x = kids_count, y = totalRedemptions)) +
    geom_bar(stat = 'identity', fill = 'darkgreen') +
    facet_wrap(~income) + 
    ggtitle("Total number of coupons used in household by the number of kids and income") +
    scale_x_discrete("Number of Kids") +
    scale_y_continuous("Total redemptions")

## `summarise()` has grouped output by 'kids_count'. You can override using the
## `.groups` argument.

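The plot below shows the three most frequently purchased product categories for each household size. Across all households, soft drinks are bought more often than milk and bread. Note that these counts cover all transactions, not only baskets that contain candy.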

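# Keep the three most frequently bought categories for each kids_count
# (slice_max() is the current replacement for the superseded top_n())
top_3_types <- productCatergories %>%
  group_by(kids_count) %>%
  slice_max(timesBought, n = 3) %>%
  arrange(kids_count)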

top_3_types %>%
  ggplot(aes(x = kids_count, y = timesBought, fill = product_category)) +
    geom_bar(position = 'dodge', stat = 'identity') +
  ggtitle("Top three product categories households are buying by number of kids") +
  scale_x_discrete("Number of Kids") +
  scale_y_continuous("Number of transactions")
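The counts above cover every transaction. To look specifically at what is bought in the same basket as candy, one option is to keep only baskets that contain candy and count the other categories in them. The sketch below uses made-up toy data (toy_baskets and its values are hypothetical) and matches the category by exact name, which is an assumption; on the real data the same case-insensitive filter as in candyData could be used.

```r
library(dplyr)

# Toy basket contents (values are made up)
toy_baskets <- tibble(
  basket_id        = c("b1", "b1", "b2", "b2", "b2", "b3", "b4"),
  product_category = c("CANDY - PACKAGED", "SOFT DRINKS",
                       "CANDY - PACKAGED", "SOFT DRINKS", "MILK",
                       "MILK", "CANDY - PACKAGED")
)

# Baskets that contain at least one packaged candy item
candy_baskets <- toy_baskets %>%
  filter(product_category == "CANDY - PACKAGED") %>%
  distinct(basket_id)

# Other categories in those baskets, counted once per basket
bought_with_candy <- toy_baskets %>%
  semi_join(candy_baskets, by = "basket_id") %>%
  filter(product_category != "CANDY - PACKAGED") %>%
  distinct(basket_id, product_category) %>%
  count(product_category, name = "baskets", sort = TRUE)

print(bought_with_candy)
```

With the real tables, toy_baskets would be replaced by transactions joined to products, since basket_id and product_category live in those two data sets.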

Summary

Summarize the problem statement you addressed.

The problem we addressed was how to increase candy sales for Regork during major candy holidays such as Halloween, Christmas, Easter, and Valentine’s Day, both from customers who already buy candy and from those who do not.

Summarize how you addressed this problem statement (the data used and the methodology employed).

By analyzing the data in the completejourney package, we were able to see how each consumer group purchased candy during holiday seasons, as well as what else they like to purchase. By analyzing coupon usage, we can identify which consumers may be more likely to buy candy if given coupons.

Summarize the interesting insights that your analysis provided.

The first insight that stood out was that the 50K-74K income range was both the group that used coupons most frequently and the group that bought the most candy.

Another insight that stood out is that, despite what we might expect, in every demographic we analyzed it is the households with no kids that buy the most candy in total.

Summarize the implications to the consumer of your analysis. What would you propose to the Regork CEO?

The main purpose of our analysis was to show which groups purchase candy during the holidays and how we can improve sales across all demographics. We suggest that Regork bundle baked breads, buns/rolls, or milk products with candy for middle-class households making between 50K and 74K, because they purchase the most candy and bundling would boost those sales. We also recommend that consumers who don’t buy as much candy, such as the upper class, get a coupon pairing candy with soft drinks, which are the most purchased item in households with any number of kids.

Discuss the limitations of your analysis and how you, or someone else, could improve or build on it.

A limitation of our project is the lack of yearly trends. If more data were available, we could get a better picture of sales over time, which could provide more insight. Although our project only provides a monthly analysis, we feel it is still accurate, but more data could strengthen our claims.

Caroline Bowling, Stephanie Martin, and Wade Burke

2024-10-10